
A Stuck Childless Codex Subagent Shows Why Capacity Is an Observability Problem

Anatoliy Kolodkin

01 Jun 2026 • 4 min read

OpenClaw issue #89069 looks, at first glance, like a capacity bug: a Codex-native subagent got stuck initializing and exhausted unified exec capacity. The more useful reading is that capacity is an observability problem. A limit is only helpful if the platform can explain what is holding the slot, why it is still alive, and whether it is doing work or merely occupying the accounting system.

The reported task shape is telling: runtime=subagent, task_kind=codex-native, child_session_key=null, a run_id beginning with codex-thread:, status running, and progress summary “Codex native subagent is initializing.” It stayed in that state for roughly 707.7 seconds (just under 12 minutes) before manual cancellation. That is long enough to hurt availability, yet well short of the 30 minutes OpenClaw’s installed stale reconciliation waits before marking such a task lost, so the automatic cleanup never had a chance to fire.

The scarce resource was not an OS process

The environment details matter. The reporter was running OpenClaw 2026.5.26 (10ad3aa) on macOS/Darwin arm64 through the Gateway LaunchAgent ai.openclaw.gateway. The embedded @openai/codex version was 0.130.0, while the standalone Codex CLI was 0.135.0 and the Codex Desktop app was 0.131.0-alpha.9. The reporter explicitly called out that upgrading the standalone Codex CLI did not upgrade the Codex app-server used by OpenClaw Gateway.

That is an important operational trap. Many users will check the CLI version, conclude Codex is current, and miss that OpenClaw may be speaking to an embedded app-server with a different protocol, lifecycle behavior, or bug set. In an app-connected engineering agent, “Codex version” is not one value. There is the CLI, the Desktop app, the embedded app-server, and the OpenClaw adapter. If those drift apart, the failure can look like a generic runtime hang even when the root cause is version mismatch or protocol expectation drift.
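To make the drift concrete, here is a minimal sketch that groups components by reported version and flags a mismatch. The version strings are the ones quoted in the issue; the `ComponentVersion` type and `find_version_drift` helper are hypothetical names invented for illustration, not part of OpenClaw or Codex.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComponentVersion:
    name: str     # hypothetical label for the component
    version: str  # version string as reported by that component


def find_version_drift(components: list[ComponentVersion]) -> dict[str, list[str]]:
    """Group component names by version string (hypothetical helper)."""
    by_version: dict[str, list[str]] = {}
    for component in components:
        by_version.setdefault(component.version, []).append(component.name)
    return by_version


# Versions quoted in the issue report.
reported = [
    ComponentVersion("embedded @openai/codex", "0.130.0"),
    ComponentVersion("standalone Codex CLI", "0.135.0"),
    ComponentVersion("Codex Desktop app", "0.131.0-alpha.9"),
]

drift = find_version_drift(reported)
has_drift = len(drift) > 1  # more than one distinct version means drift

if has_drift:
    for version, names in sorted(drift.items()):
        print(f"{version}: {', '.join(names)}")
```

Differing versions do not prove incompatibility on their own; the point is only that a diagnostic dump listing every component side by side makes the mismatch visible before anyone starts debugging a hang.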

The stuck task was manually canceled with openclaw tasks cancel. After cancellation, openclaw tasks --status running --json returned count 0, and ps did not show remaining process rows. The reporter also ruled out the obvious session-store explanation: openclaw status showed sessions=94, comfortably below the default 500 cap. In other words, the user-facing “capacity exhausted” symptom was not explained by a pile of shell children or a full session store. The scarce thing was a logical runtime slot held by an initializing task with no child session.

Thirty minutes is too long for initialization limbo

The source-level detail is the real editorial point. The installed code includes CHILDLESS_CODEX_NATIVE_RECONCILE_GRACE_MS = 30 * 60_000. Tests reportedly keep this task shape running at 10 minutes and mark it lost only after 31 minutes. A 30-minute grace window might be defensible for active work that has a child session, recent progress, or tool output. It is much harder to justify for a childless Codex-native task still in initialization.

Initialization has a different expected latency profile from execution. A task actively applying a large patch can run for a long time. A task that never obtains a child session, never transitions phases, and only says it is initializing should be treated as a different lifecycle state. At five or ten minutes, the platform should at least warn. At some configurable threshold, it should reconcile earlier, release the slot, and preserve diagnostics about the failure.
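A phase-aware policy along those lines might look like the sketch below. It is an illustration under stated assumptions, not OpenClaw's reconciler: the 30-minute value mirrors the constant quoted from the installed source, while the 5-minute warn and 10-minute reconcile thresholds, the `Action` enum, and `reconcile_action` are hypothetical.

```python
from enum import Enum

# Mirrors the 30-minute grace quoted from the installed source.
LONG_GRACE_MS = 30 * 60_000

# Hypothetical thresholds for childless initialization (assumptions).
INIT_WARN_MS = 5 * 60_000
INIT_RECONCILE_MS = 10 * 60_000


class Action(Enum):
    KEEP = "keep"
    WARN = "warn"
    MARK_LOST = "mark_lost"


def reconcile_action(age_ms: int, has_child_session: bool, initializing: bool) -> Action:
    """Pick a reconcile action from task age and lifecycle shape."""
    if initializing and not has_child_session:
        # Initialization limbo gets the short, separate budget.
        if age_ms >= INIT_RECONCILE_MS:
            return Action.MARK_LOST
        if age_ms >= INIT_WARN_MS:
            return Action.WARN
        return Action.KEEP
    # Everything else falls back to the long grace window in this sketch.
    if age_ms >= LONG_GRACE_MS:
        return Action.MARK_LOST
    return Action.KEEP


# The reported task: childless, initializing, about 707.7 seconds old.
reported_age_ms = 707_700
survives_flat_30_min_rule = reported_age_ms < LONG_GRACE_MS
action_for_reported_task = reconcile_action(
    reported_age_ms, has_child_session=False, initializing=True
)
```

With these assumed thresholds, the reported task survives a flat 30-minute rule but would be marked lost by the phase-aware one, which is the gap the issue describes. In a real system the thresholds would need to be configurable, since cold starts on slow machines can legitimately take a while.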

The community feedback in the issue points in the same direction. ClawSweeper kept the issue open and confirmed the source-level mismatch: current main and the latest release keep childless codex-native task records alive for the 30-minute reconcile window, while the reported case exhausted capacity earlier. Another practitioner running multi-agent OpenClaw on a VPS agreed that version mismatch is often the root cause and suggested a watchdog that cancels tasks stuck initializing for more than 10 minutes. They also noted that Gateway restart races can show up under Linux/systemd, not only macOS LaunchAgent.

Monitor phase age, not just counts

The practitioner fix is straightforward: monitor logical task age and phase, not only process count. Alert on task_kind=codex-native, child_session_key=null, an initialization progress summary, and age beyond a conservative threshold. Also record the embedded Codex app-server version next to standalone Codex CLI and Desktop versions in diagnostics. If the platform can show all three, operators can stop debugging ghosts.
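As a rough sketch of that alert, the filter below scans task records for the stuck shape. The JSON layout is an assumption: `task_kind`, `child_session_key`, and `status` follow the names used in the issue, but `progress_summary` and `age_seconds` are hypothetical field names, and the real output of `openclaw tasks --status running --json` may be structured differently.

```python
import json

# Assumption: a conservative 10-minute threshold, as suggested in the issue thread.
STUCK_INIT_THRESHOLD_S = 600.0


def is_stuck_initializing(task: dict, threshold_s: float = STUCK_INIT_THRESHOLD_S) -> bool:
    """Return True for a childless codex-native task stuck in initialization."""
    summary = (task.get("progress_summary") or "").lower()
    age_s = float(task.get("age_seconds") or 0.0)
    return (
        task.get("task_kind") == "codex-native"
        and task.get("child_session_key") is None
        and task.get("status") == "running"
        and "initializing" in summary
        and age_s > threshold_s
    )


# Hypothetical records; only the first matches the shape from the issue.
sample_tasks = json.loads("""
[
  {"id": "t1", "task_kind": "codex-native", "child_session_key": null,
   "status": "running", "age_seconds": 707.7,
   "progress_summary": "Codex native subagent is initializing."},
  {"id": "t2", "task_kind": "codex-native", "child_session_key": "child-1",
   "status": "running", "age_seconds": 1200.0,
   "progress_summary": "Applying patch."},
  {"id": "t3", "task_kind": "codex-native", "child_session_key": null,
   "status": "running", "age_seconds": 30.0,
   "progress_summary": "Codex native subagent is initializing."}
]
""")

stuck = [task["id"] for task in sample_tasks if is_stuck_initializing(task)]
print(stuck)
```

Whether a watchdog should only alert or also cancel is a policy decision; alerting first is the safer default, because cancellation discards whatever state the task might still recover.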

This is also where “cost controls” should be understood more broadly than token spend. Agent platforms need budgets for tokens, tool calls, wall-clock time, concurrent runs, runtime slots, memory, and user attention. A stuck initialization consumes a scarce concurrency slot without producing useful work. It can block subsequent runs, trigger misleading capacity errors, and encourage users to restart gateways or kill processes blindly. That is not just inefficient. It is how state corruption and duplicate work happen.

OpenClaw’s broader June 1 release direction (bounding timers, retries, local probes, generated-content polling, and stale session behavior) fits this issue neatly. The next step is to make lifecycle categories more explicit. “Running” is too broad if it covers both active execution and never-started initialization. A useful task system should distinguish queued, initializing, connected, executing, waiting-on-tool, waiting-on-provider, waiting-on-user, lost, and canceled. Those labels are not UI polish. They are the difference between an operator fixing the right thing and rebooting the box because the platform shrugged.
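One way to picture those categories is as an explicit enum with a per-phase breakdown of who holds a slot. This is a sketch, not OpenClaw's task model: the `TaskPhase` enum and `slot_holders_by_phase` helper are hypothetical, and the rule that every phase except queued, lost, and canceled occupies a slot is an assumption made for illustration.

```python
from collections import Counter
from enum import Enum


class TaskPhase(Enum):
    QUEUED = "queued"
    INITIALIZING = "initializing"
    CONNECTED = "connected"
    EXECUTING = "executing"
    WAITING_ON_TOOL = "waiting-on-tool"
    WAITING_ON_PROVIDER = "waiting-on-provider"
    WAITING_ON_USER = "waiting-on-user"
    LOST = "lost"
    CANCELED = "canceled"


# Assumption: these phases do not occupy a runtime slot.
NOT_HOLDING_SLOT = frozenset({TaskPhase.QUEUED, TaskPhase.LOST, TaskPhase.CANCELED})


def slot_holders_by_phase(phases: list[TaskPhase]) -> Counter:
    """Count slot-holding tasks per phase instead of one 'running' total."""
    return Counter(phase for phase in phases if phase not in NOT_HOLDING_SLOT)


# Hypothetical snapshot of a small task table.
snapshot = [
    TaskPhase.INITIALIZING,
    TaskPhase.INITIALIZING,
    TaskPhase.EXECUTING,
    TaskPhase.QUEUED,
    TaskPhase.CANCELED,
]

holders = slot_holders_by_phase(snapshot)
for phase, count in holders.most_common():
    print(f"{phase.value}: {count}")
```

A breakdown like this answers the operator's first question directly: of the slots in use, how many are doing work and how many are still trying to start.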

The take: capacity limits are only useful when the platform can name who is holding the slot. OpenClaw should treat stuck initialization as a first-class lifecycle failure, not a generic stale task waiting 30 minutes for cleanup. If an agent cannot start, say that quickly, release the resource, and leave enough evidence for the next fix.

Sources: OpenClaw issue #89069, OpenClaw v2026.6.1-beta.1 release, OpenClaw v2026.5.31-beta.4 release, OpenClaw PR #89082

