
Codex Timeouts Should Not Poison OpenClaw’s Auth and Failover State

Anatoliy Kolodkin

25 May 2026 • 4 min read

A timeout is evidence. The hard part is knowing what it is evidence of.

OpenClaw pull request #86476 fixes a subtle but consequential boundary leak in the project’s Codex integration. When the app-server turn-completion watchdog fires, OpenClaw has been letting that timeout escape into generic provider, credential, and failover logic. One stalled Codex turn can therefore look like broader evidence that the shared Codex app-server should be retired, the auth profile should be marked failed, or the runtime should drift toward public OpenAI/provider fallback behavior.

The PR’s policy is the right one: a Codex harness-owned timeout should stay inside the Codex harness boundary.

One slow turn should not become global routing evidence

The patch was created at 2026-05-25T12:22:58Z and is labeled with the right kind of caution: agents, extensions: codex, P1, merge-risk: compatibility, merge-risk: auth-provider, merge-risk: availability, and status: needs proof. Those labels are not bureaucracy. They describe the exact blast radius. This is not a UI polish fix; it changes how OpenClaw interprets failure and decides what to do next.

The PR follows #85958, which moved Codex compaction back to the Codex boundary. #86476 addresses the remaining leak: app-server turn-completion watchdog timeouts still escaped into OpenClaw’s generic provider and auth handling. After the fix, a timed-out Codex turn is interrupted for that turn, but OpenClaw no longer kills the shared app-server as a retry strategy, no longer rotates auth profiles because of that harness-owned timeout, and no longer falls through to generic OpenAI/provider fallback paths.
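The post-fix behavior can be pictured as a small decision function. What follows is a minimal sketch, not OpenClaw's actual code: every type and function name is hypothetical, and the handling of provider-owned failures is an assumption added only for contrast.

```typescript
// Hypothetical types: none of these names come from the OpenClaw codebase.
type FailureOwner = "codex-harness" | "provider";

interface TurnFailure {
  owner: FailureOwner;
  kind: "timeout" | "auth" | "rate-limit";
  turnId: string;
}

interface RecoveryActions {
  interruptTurn: boolean;
  retireSharedAppServer: boolean;
  markAuthProfileFailed: boolean;
  allowProviderFallback: boolean;
}

function planRecovery(failure: TurnFailure): RecoveryActions {
  // A harness-owned timeout stays inside the harness boundary:
  // interrupt the turn that timed out and take no broader action.
  if (failure.owner === "codex-harness" && failure.kind === "timeout") {
    return {
      interruptTurn: true,
      retireSharedAppServer: false,
      markAuthProfileFailed: false,
      allowProviderFallback: false,
    };
  }
  // Assumed for contrast: provider-owned failures stay eligible for
  // generic handling, and only auth failures touch the auth profile.
  return {
    interruptTurn: true,
    retireSharedAppServer: false,
    markAuthProfileFailed: failure.kind === "auth",
    allowProviderFallback: true,
  };
}
```

The design point is that the owner of the failure travels with the failure itself, so generic policy never has to guess where a timeout came from.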

That is the difference between typed recovery and superstition. If a browser navigation times out, you do not mark the user’s GitHub token invalid. If an MCP tool stalls, you do not conclude that the model provider is down. If a Codex app-server watchdog fires, you should not automatically poison the global auth and failover state. A layered agent runtime has to preserve the meaning of failure as it crosses boundaries.

Coding-agent comparisons increasingly depend on operational behavior

Developers still compare coding agents as if the main axis is model quality: Codex versus Claude Code versus Copilot versus Gemini CLI versus Qwen. That still matters. But daily-driver trust is now just as much about operating surface. Does the tool recover cleanly after a stalled turn? Does it keep credentials scoped? Does it avoid surprise fallback? Does it preserve the session and the app-server when one request goes sideways? Does it tell the user what actually failed?

#86476 sits exactly in that layer. A Codex harness timeout could mean several things: the model is slow, the harness is wedged, the app-server is overloaded, a watchdog threshold is too low, the prompt stage is blocked, or a local process is unhealthy. It is not automatically evidence that an OpenAI auth profile is bad. It is not automatically evidence that a public provider fallback is appropriate. It is a harness-local event until health signals prove otherwise.

The PR’s verification claims match that framing. The author lists targeted tests for failover-policy, run.codex-app-server-recovery, and Codex app-server run-attempt, plus formatting checks, git diff --check, pnpm check:changed, and pnpm build. The app-server timeout test asserts the timed-out turn is interrupted while the app-server client stays open. The embedded-runner recovery test asserts the timeout no longer marks the auth profile failed. Failover-policy tests assert harness-owned prompt and assistant timeouts do not rotate or fall back.

ClawSweeper’s caution is also warranted. Its review says current main clearly retires the Codex app-server client on timed-out turns and routes timeout evidence into auth/failover policy. It requested live behavior proof before merge because the diff touches three failure-routing surfaces: app-server timeout recovery, auth-profile cooldown marking, and model/profile failover policy. That is the right review posture. Boundary fixes are good, but they can create the opposite bug if they retain a truly wedged app-server too aggressively.

The right policy is scoped recovery, not never restart

The nuance matters. “Do not let harness-local watchdogs poison global auth and routing state” is not the same as “never restart the app-server.” If the app-server is actually unhealthy, OpenClaw should retire or restart it. But that decision should come from app-server health signals, repeated scoped failures, or explicit recovery policy, not from a generic provider-failover reflex triggered by a single timed-out turn.
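One way to express that distinction is to make retirement depend on accumulated, app-server-scoped evidence. The sketch below assumes a hypothetical health probe result and a sliding window of recent turn timeouts; the field names and the policy shape are illustrative and do not come from the PR.

```typescript
// Hypothetical health snapshot for a shared app-server.
interface AppServerHealth {
  probeOk: boolean;           // result of an assumed health probe
  recentTimeoutsMs: number[]; // timestamps (ms) of recent turn timeouts
}

// Hypothetical policy: how many timeouts within what window justify retirement.
interface RetirePolicy {
  maxTimeouts: number;
  windowMs: number;
}

function shouldRetireAppServer(
  health: AppServerHealth,
  policy: RetirePolicy,
  nowMs: number,
): boolean {
  // A failed health probe is direct evidence about the app-server itself.
  if (!health.probeOk) {
    return true;
  }
  // Otherwise require repeated timeouts inside the window;
  // a single slow turn never reaches the threshold on its own.
  const recent = health.recentTimeoutsMs.filter(
    (stamp) => nowMs - stamp <= policy.windowMs,
  );
  return recent.length >= policy.maxTimeouts;
}
```

With a policy of three timeouts per minute, one timed-out turn on a healthy app-server returns false, while three in quick succession or a failed probe returns true.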

This is the same principle mature infrastructure applies everywhere else. A request timeout does not necessarily mean the database credentials are invalid. A worker crash does not necessarily mean the message broker is down. A slow RPC does not necessarily mean the account should be cooled down. You classify failures before you take broad corrective action. Agent runtimes need that discipline because they sit across more boundaries than normal applications: local harnesses, remote providers, browser sandboxes, MCP tools, shell commands, auth stores, and chat channels.

For operators, the practical takeaway is to inspect failover logs after Codex timeouts. If one timed-out turn causes auth-profile cooldowns, model fallback, or shared-client retirement, you are not merely debugging Codex. You are debugging OpenClaw’s interpretation layer. That distinction will save time. It also affects governance: if an organization requires Codex traffic to stay inside a specific harness or app-server boundary, surprise fallback to a public provider is not just inconvenient. It may violate the deployment model.
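That inspection can be partly automated. The sketch below assumes failover logs have already been parsed into ordered events that carry a turn ID. The event names are invented for illustration and do not reflect OpenClaw's real log format.

```typescript
// Hypothetical parsed log event; real OpenClaw logs may look different.
interface LogEvent {
  turnId: string;
  event: string;
}

// Assumed names for the broad actions the article warns about.
const BROAD_ACTIONS = new Set<string>([
  "auth-profile-cooldown",
  "model-fallback",
  "shared-client-retired",
]);

// Returns, per turn, the broad actions logged after that turn's timeout.
function findLeakedTimeouts(events: LogEvent[]): Map<string, string[]> {
  const timedOutTurns = new Set<string>();
  const leaks = new Map<string, string[]>();
  for (const entry of events) {
    if (entry.event === "codex-turn-timeout") {
      timedOutTurns.add(entry.turnId);
      continue;
    }
    // Only flag broad actions on turns that have already timed out.
    if (timedOutTurns.has(entry.turnId) && BROAD_ACTIONS.has(entry.event)) {
      const actions = leaks.get(entry.turnId) ?? [];
      actions.push(entry.event);
      leaks.set(entry.turnId, actions);
    }
  }
  return leaks;
}
```

A non-empty result means a timed-out turn was followed by a broad corrective action on the same turn, which points at the interpretation layer rather than at Codex.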

For OpenClaw, the ideal end state is a typed failure taxonomy visible to both runtime policy and operators: harness timeout, provider 401, provider 429, app-server unhealthy, prompt watchdog, assistant watchdog, tool timeout, cancellation, and user abort should not collapse into one bucket. Each class should have its own recovery path, audit trail, and fallback eligibility. The point is not to make failures prettier. The point is to stop one boundary’s problem from mutating into another boundary’s decision.
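A typed taxonomy of this kind could be sketched as a closed union plus a policy table. The class names below mirror the list above; the scope and fallback values assigned to each class are assumptions for illustration, not decisions taken from the PR.

```typescript
// Failure classes named in the article, as a closed union.
type FailureClass =
  | "harness-timeout"
  | "provider-401"
  | "provider-429"
  | "app-server-unhealthy"
  | "prompt-watchdog"
  | "assistant-watchdog"
  | "tool-timeout"
  | "cancellation"
  | "user-abort";

// Hypothetical policy shape: what a failure may affect, and whether
// it may trigger provider fallback.
interface ClassPolicy {
  scope: "turn" | "app-server" | "auth-profile" | "none";
  fallbackEligible: boolean;
}

// Record<FailureClass, ...> makes the compiler reject a missing class,
// so a new failure kind cannot silently fall into a default bucket.
// The values are illustrative assumptions.
const FAILURE_POLICY: Record<FailureClass, ClassPolicy> = {
  "harness-timeout": { scope: "turn", fallbackEligible: false },
  "provider-401": { scope: "auth-profile", fallbackEligible: true },
  "provider-429": { scope: "auth-profile", fallbackEligible: true },
  "app-server-unhealthy": { scope: "app-server", fallbackEligible: false },
  "prompt-watchdog": { scope: "turn", fallbackEligible: false },
  "assistant-watchdog": { scope: "turn", fallbackEligible: false },
  "tool-timeout": { scope: "turn", fallbackEligible: false },
  "cancellation": { scope: "none", fallbackEligible: false },
  "user-abort": { scope: "none", fallbackEligible: false },
};

// Produces a one-line audit record so operators can see which class fired.
function describeFailure(failureClass: FailureClass): string {
  const policy = FAILURE_POLICY[failureClass];
  return `${failureClass}: scope=${policy.scope}, fallback=${policy.fallbackEligible}`;
}
```

Keeping the table exhaustive at compile time is the main design choice here: adding a class forces an explicit decision about its scope and fallback eligibility.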

#86476 is small enough to look like plumbing. It is actually part of the broader governance story every coding-agent platform is going to need. As these systems become more layered, the hard question is no longer only “did the model answer?” It is “did the runtime understand what failed?”

On that axis, keeping Codex timeouts inside the Codex harness is the right move. Not every cough is a credential failure. A serious runtime should know the difference.

Sources: OpenClaw PR #86476, OpenClaw PR #85958, OpenClaw v2026.5.24-beta.2 release, OpenClaw issue #84880
