OpenClaw’s Process Supervisor Fix Is Boring in Exactly the Way Production Agents Need

Agent platforms love to talk about autonomy. Then a timeout arrives, and you find out whether the runtime behaves like an operator or like someone yanked the power cord.

OpenClaw PR #85865 is squarely in the unglamorous category of fixes that production systems need. It changes process-supervisor cancellation from a direct-child SIGTERM path into a tree-aware, two-phase shutdown: ask the process tree to exit, wait up to five seconds, then preserve the hard-kill fallback for anything that ignores the signal. If you run shell-heavy skills, cron jobs, local coding agents, PTY-backed tools, or subprocess wrappers, this is the kind of plumbing that determines whether cancellation means “clean stop” or “mysterious state loss.”

The linked issue #66399 has been open since April and describes the user-visible damage: exec-tool timeouts and isolated cron jobs receive immediate SIGKILL, losing temp files, partial writes, locks, and in-progress agent state. One production operator reported more than 30 isolated cron jobs across six specialist agents where daily timeouts caused SIGKILL-driven state loss. That is not a theoretical race condition. That is a scheduler quietly turning work into confetti.

The process tree is the real unit of execution

The important detail in PR #85865 is not just “send SIGTERM before SIGKILL.” Plenty of systems do that badly. The important detail is that cancellation is routed through killProcessTree, the same family-aware path used for hard cancellation. Child and PTY adapters call adapter.kill("SIGTERM", { graceMs: 5000 }), giving descendants the same graceful-exit chance as the direct child.

That distinction matters because agent tools rarely remain a single process. A shell command spawns a worker. A coding agent launches a helper. A wrapper forks a language server. A build process starts test runners. If the supervisor signals only the immediate child, the parent can exit cleanly while a descendant keeps running, writing, holding locks, or burning CPU outside the runtime’s mental model.
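To make the "tree, not process" point concrete, here is a minimal POSIX-only sketch in Python. It is not OpenClaw code: the `group_ids` helper and the two inline scripts are hypothetical. It assumes the supervisor starts each tool in its own session, so that every descendant inherits one process-group ID that a supervisor could later signal as a unit.

```python
import subprocess
import sys

# Descendant: report the process group it ended up in.
CHILD = "import os; print(os.getpgid(0))"

# Direct child: report its own group, then spawn the descendant.
PARENT = (
    "import os, subprocess, sys\n"
    "print(os.getpgid(0), flush=True)\n"
    "subprocess.run([sys.executable, '-c', sys.argv[1]])\n"
)

def group_ids():
    """Return (direct_child_pid, direct_child_pgid, descendant_pgid)."""
    proc = subprocess.Popen(
        [sys.executable, "-c", PARENT, CHILD],
        start_new_session=True,  # new session => new process group led by the child
        stdout=subprocess.PIPE,
        text=True,
    )
    out, _ = proc.communicate(timeout=30)
    parent_pgid, child_pgid = (int(x) for x in out.split())
    return proc.pid, parent_pgid, child_pgid

if __name__ == "__main__":
    pid, parent_pgid, child_pgid = group_ids()
    print("direct child pid:", pid)
    print("direct child pgid:", parent_pgid)
    print("descendant pgid:", child_pgid)
```

Because the descendant shares the direct child's group, a signal addressed to that group reaches both, while a signal addressed only to the child's PID reaches one of them.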

The PR’s proof captures exactly that failure. Before the fix, the parent marker was written, the descendant marker was missing, and descendantAliveAfterWait was true. After the fix, both parent and descendant markers were written and descendantAliveAfterWait was false. There is also a fallback proof: a SIGTERM-ignoring child observed the signal, ignored it, and was later force-killed with SIGKILL. That is the contract you want. Cooperative processes get a chance to clean up. Broken or hostile processes do not get to linger indefinitely.
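The contract described above can be sketched in a few lines of POSIX-only Python. This is an illustration under stated assumptions, not the PR's implementation: `terminate_tree` and `run_case` are hypothetical names, the tool is assumed to have been started in its own session so its PID doubles as the process-group ID, and the five-second default simply mirrors the grace window the article describes.

```python
import os
import signal
import subprocess
import sys
import time

def _group_alive(pgid):
    """True while at least one process in the group can still be signaled."""
    try:
        os.killpg(pgid, 0)  # signal 0: existence check only, nothing is delivered
        return True
    except (ProcessLookupError, PermissionError):
        return False

def terminate_tree(proc, grace_s=5.0):
    """Hypothetical helper: SIGTERM the group, wait up to grace_s, then SIGKILL.

    Assumes proc was started with start_new_session=True (pid == pgid).
    Returns "graceful" or "forced".
    """
    pgid = proc.pid
    try:
        os.killpg(pgid, signal.SIGTERM)  # phase 1: ask the whole tree to exit
    except (ProcessLookupError, PermissionError):
        pass
    deadline = time.monotonic() + grace_s
    graceful = False
    while time.monotonic() < deadline:
        proc.poll()  # reap the direct child if it has exited
        if not _group_alive(pgid):
            graceful = True
            break
        time.sleep(0.05)
    if not graceful:
        try:
            os.killpg(pgid, signal.SIGKILL)  # phase 2: hard-kill fallback
        except (ProcessLookupError, PermissionError):
            pass
    proc.wait()
    return "graceful" if graceful else "forced"

COOPERATIVE = (
    "import signal, sys, time\n"
    "signal.signal(signal.SIGTERM, lambda s, f: sys.exit(0))\n"
    "print('ready', flush=True)\n"
    "time.sleep(30)\n"
)
STUBBORN = (
    "import signal, time\n"
    "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
    "print('ready', flush=True)\n"
    "time.sleep(30)\n"
)

def run_case(source, grace_s):
    proc = subprocess.Popen(
        [sys.executable, "-c", source],
        start_new_session=True,
        stdout=subprocess.PIPE,
        text=True,
    )
    proc.stdout.readline()  # wait until the child has installed its handler
    outcome = terminate_tree(proc, grace_s)
    proc.stdout.close()
    return outcome, proc.returncode

if __name__ == "__main__":
    print(run_case(COOPERATIVE, 5.0))
    print(run_case(STUBBORN, 0.5))
```

The cooperative child exits on its own terms inside the window; the SIGTERM-ignoring child is removed once the window closes. A short grace period is used for the stubborn case only to keep the demonstration quick.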

The prior art is telling. Go’s exec.Cmd.WaitDelay exists because cancellation is not a boolean. It is an escalation path. The same pattern appears in production supervisors, container runtimes, CI systems, and job queues: terminate, drain, then kill. Agent frameworks do not get an exemption because the thing driving the process is a model.

Graceful shutdown is also a security boundary

There is a subtle security tradeoff here. A runtime that always SIGKILLs first destroys state and makes tools unreliable. A runtime that waits indefinitely lets untrusted code outlive policy. PR #85865 lands in the sensible middle: a bounded five-second window followed by hard cancellation. The labels on the PR — merge-risk: security-boundary and merge-risk: availability — are not decorative. Cancellation semantics touch both.

For agent operators, that duality is worth internalizing. A timeout is not just an inconvenience. It is a point where the platform reasserts authority over code it allowed to run. If that assertion is too brutal, you lose data and corrupt workflows. If it is too soft, the agent can leave behind orphaned subprocesses that no longer match the UI, transcript, or policy layer. The runtime has to be firm and predictable.

This is also why process-supervisor behavior belongs in AI agent security checklists. People tend to focus on prompt injection, secret handling, MCP permissions, browser isolation, and network egress. Those are critical, but agents also execute ordinary programs. Ordinary programs have children. Children survive bad supervisors. A sandbox that cannot reliably terminate the process tree is not a sandbox; it is a suggestion with a PID.

Practitioners should map this to their own workloads. If your OpenClaw setup runs periodic jobs, local build tools, browser automation, ffmpeg, language servers, package managers, or nested coding-agent harnesses, cancellation behavior is part of your reliability budget. Audit what happens when a job times out. Does it flush state? Does it leave temp directories? Do descendants survive? Are partial artifacts marked partial, or do they look complete? The boring test is the one that will save you later: run a tool that traps SIGTERM, writes a marker, sleeps, and spawns a child that does the same. Then cancel it and see what is left.
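The boring test can be sketched as a small POSIX-only Python harness. Everything here is hypothetical scaffolding rather than OpenClaw tooling: the `WORKER` script, the `.ready` and `.marker` file names, and the `run_case` helper are assumptions made for illustration. The worker traps SIGTERM, writes a marker, and spawns a descendant that does the same; the harness then cancels it either by signaling only the direct child or by signaling the whole process group, and reports which markers were left behind.

```python
import os
import signal
import subprocess
import sys
import tempfile
import time

WORKER = '''\
import os, signal, subprocess, sys, time
role, outdir = sys.argv[1], sys.argv[2]

def on_term(signum, frame):
    # Stand-in for real cleanup: flush state, release locks, and so on.
    with open(os.path.join(outdir, role + ".marker"), "w") as f:
        f.write("cleaned up")
    sys.exit(0)

signal.signal(signal.SIGTERM, on_term)
if role == "parent":
    subprocess.Popen([sys.executable, __file__, "descendant", outdir])
open(os.path.join(outdir, role + ".ready"), "w").close()
time.sleep(30)
'''

ROLES = ("parent", "descendant")

def wait_for(paths, timeout_s):
    """Poll until every path exists or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if all(os.path.exists(p) for p in paths):
            return True
        time.sleep(0.05)
    return all(os.path.exists(p) for p in paths)

def run_case(tree_aware):
    """Cancel the worker tree and report which roles wrote a cleanup marker."""
    with tempfile.TemporaryDirectory() as d:
        script = os.path.join(d, "worker.py")
        with open(script, "w") as f:
            f.write(WORKER)
        proc = subprocess.Popen(
            [sys.executable, script, "parent", d],
            start_new_session=True,  # the worker leads its own process group
        )
        try:
            ready = [os.path.join(d, r + ".ready") for r in ROLES]
            if not wait_for(ready, 15.0):
                raise RuntimeError("workers did not start")
            if tree_aware:
                os.killpg(proc.pid, signal.SIGTERM)  # whole group
            else:
                proc.send_signal(signal.SIGTERM)     # direct child only
            proc.wait(timeout=10)
            markers = {r: os.path.join(d, r + ".marker") for r in ROLES}
            wait_for(list(markers.values()), 1.5)
            return {r: os.path.exists(p) for r, p in markers.items()}
        finally:
            # Never leave stragglers behind, whatever the case under test did.
            try:
                os.killpg(proc.pid, signal.SIGKILL)
            except (ProcessLookupError, PermissionError):
                pass
            proc.wait()

if __name__ == "__main__":
    print("direct child only:", run_case(tree_aware=False))
    print("process group:    ", run_case(tree_aware=True))
```

A missing descendant marker in the direct-child case is the signature to look for: the descendant was never asked to stop. To audit a real setup, replace the worker with the actual tool and the markers with whatever artifacts that tool is supposed to leave in a consistent state.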

The broader industry lesson is that agent runtimes are rediscovering decades of operating-systems hygiene under a layer of chat UX. That is not an insult. It is the job. If an agent can run commands on your behalf, it inherits the responsibilities of a job runner, process supervisor, audit system, and policy engine. The magic only works if the boring pieces are correct.

PR #85865 is boring in exactly the way production agents need. It does not make OpenClaw smarter. It makes failure less destructive, which is often more valuable. The healthiest agent platforms will be the ones that treat cancellation, draining, and process-tree cleanup as first-class runtime contracts — not cleanup code someone hopes never runs.

Sources: OpenClaw PR #85865, OpenClaw issue #66399, Go os/exec Cmd.WaitDelay documentation, AugmentCode multica PR #947