Codex Startup Bugs Are Capacity Bugs Once Agents Run on Cron
Codex integrations do not fail where product demos tell you they fail. They fail before the interesting part starts: during worker spawn, during protocol handshakes, during OAuth profile routing, during the quiet little interval where the platform is supposed to turn a scheduled job into an actual agent turn. PR #89442 is small, but it points at the right layer of the stack. OpenClaw is no longer debugging “can Codex answer?” It is debugging whether Codex-backed work can be started, bounded, cleaned up, and accounted for when nobody is sitting in the terminal watching it.
The bug behind the patch was filed as issue #84567 against OpenClaw 2026.5.18. An isolated cron job using payload.kind: agentTurn, sessionTarget: isolated, lightContext: true, and openai/gpt-5.5 through the bundled openai-codex plugin would fail after roughly 60 seconds with the vague diagnostic: cron: isolated agent setup timed out before runner start. The same job completed in about 26 seconds when routed through claude-cli/claude-opus-4-7. That comparison is the smoking gun. Cron was not generically broken. The Codex harness startup path had a boundary that was not being timed out at the right level.
The timeout was one layer too far out
PR #89442, created June 2 at 12:37 UTC, identifies the specific pre-runner gap: the bundled Codex app-server client must complete an initialize / initialized handshake before OpenClaw can safely send later app-server methods such as thread/start. OpenClaw already wrapped the thread startup path, but not the shared app-server initialization boundary. If that handshake stalled, the outer isolated-run watchdog eventually reported that setup timed out before the runner started. Accurate in the broadest possible sense; useless in the way production diagnostics need to be useful.
The proposed fix threads the existing startup timeout budget into the shared app-server client factory, closes and clears stalled stdio workers, and reports the failure as codex app-server initialize timed out. It also adds regression coverage for a never-responding initialize handshake. GitHub API metadata shows the PR at +34/-2 across three files, labeled extensions: codex, P1, proof: supplied, and status: needs proof. ClawSweeper source-reproduced the path but asked for stronger real-behavior proof, which is exactly the right kind of skepticism for timeout code around cached shared clients.
The distinction matters because shared-client bugs are contagious. A worker that never finishes initialization is not merely a failed run. It can occupy a slot, poison a cache entry, leak process state, and make the next scheduled run inherit a stale failure. In a direct chat, that looks annoying. In cron, it becomes capacity drift: scheduled work keeps arriving, setup keeps stalling, and the operator sees “the agent timed out” instead of “the Codex app-server never completed initialize.” One diagnosis leads to model fiddling. The other leads to protocol and lifecycle debugging.
Startup latency is part of the capacity budget
The original reporter’s configuration is also telling. This was not a toy one-off prompt. It used OAuth profile routing, no API-key fallback, an 1,800-second job timeout, and isolated session targeting. In other words: the sort of setup people use when they want Codex to behave like an engineering worker, not a chat autocomplete. A 30-minute turn budget does not mean a 60-second startup stall is acceptable. Those are different budgets. The model can spend time thinking; the control plane should not spend a minute discovering that an app-server protocol handshake is wedged.
That is the operational lesson for teams running Codex-backed OpenClaw jobs. Instrument startup by phase. Track schedule dispatch, isolated session creation, auth/profile resolution, worker spawn, app-server initialize, initialized, thread/start, first model event, and runner start. If all of that is collapsed into “setup timed out,” you do not have observability. You have a stopwatch taped to a distributed system.
The follow-up detail from #84567 sharpens the point. A commenter reported that upgrading bundled @openai/codex from 0.130.0 to 0.132.0 changed the failure mode from a roughly 60-second setup hang to a 62ms harness failure. That suggests protocol or version compatibility was part of the risk. Again, the important thing is not whether 0.132.0 magically fixed the workflow. It is that different Codex harness versions can fail at different startup boundaries, and the platform needs diagnostics granular enough to tell operators which boundary moved.
There is also a FinOps angle hiding in the plumbing. Coding-agent cost controls are usually framed as token budgets, model routing, or kill switches. Those matter, but capacity waste is not always token spend. A stuck shared Codex worker can burn scheduler slots, retries, logs, operator attention, and wall-clock time without producing a single billable completion. The budget you need is not just “how many tokens may this agent use?” It is also “how long may each startup phase consume resources before it is forcefully retired?”
PR #89442 is not a glamorous patch. It will not make Codex smarter, and it will not show up in a benchmark chart. But it is the sort of fix that makes app-connected coding agents less fragile: push the timeout to the first blocking protocol step, close the worker when it stalls, clear the shared cache, and emit a message that names the failed boundary. That is how integrations graduate from “works when fresh” to “survives being scheduled.”
The editorial take: Codex as an engineering agent is not just a model-routing story. It is an app-server lifecycle story. If OpenClaw wants cron-driven Codex work to be trusted, every startup phase has to be bounded and named. “Before runner start” was a bucket. Production systems need a stack trace.
Sources: OpenClaw PR #89442, OpenClaw issue #84567, OpenAI Codex