
Codex Replay Recovery Shows Why Agent Sessions Need Poisoned-State Escape Hatches

Anatoliy Kolodkin

22 May 2026 • 4 min read

The most interesting phrase in OpenClaw PR #85362 is not invalid_encrypted_content. That is just the error string. The interesting phrase is “replay-invalid session state,” because it names the actual disease instead of the symptom.

Long-running coding agents accumulate state the way old services accumulate cron jobs: transcripts, provider session IDs, reasoning metadata, encrypted content handles, queued follow-ups, replay hints, summaries, tool results, and enough glue to make a normal retry look deceptively simple. That state is valuable because it gives continuity. It is dangerous because one corrupted or provider-incompatible slice can poison the whole lane. When every follow-up reuses the same bad session file, the user does not have a model problem. They have a state-management problem wearing a model-error mask.

PR #85362 was opened on May 22 at 13:06 UTC, shortly after the related PR #85294. It changes six files with 145 additions and one deletion. The specific failure class is OpenAI/Codex returning a 400 error with code=invalid_encrypted_content and the message “Encrypted content could not be decrypted or parsed.” After the fix, OpenClaw classifies that as replay-invalid session state, returns a stale-session recovery message instead of generic runner-failure copy, and rotates the poisoned reply session through the existing reset path.

Retries are not recovery when the lane is poisoned

The easy bad fix for this class of failure is to catch the exception and tell the user to retry. That feels reasonable because many provider errors are transient: rate limits, network failures, temporary upstream weirdness, billing blips. But replay-invalid state is different. If the next run uses the same poisoned session ID or session file, the next run is not a fresh attempt. It is a deterministic walk back into the same wall.

The right recovery pattern has three pieces, and #85362 is interesting because it hits all three. First, classify the error precisely enough to distinguish session poison from a generic provider outage. Second, rotate the affected state using a code path the runtime already understands, rather than inventing a one-off cleanup branch. Third, tell the user what happened in recoverable terms. “Something went wrong” invites useless retry loops. “This session had stale encrypted replay state; I reset the lane” gives the operator a mental model they can act on.
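The three pieces above can be sketched in TypeScript. This is a minimal illustration under stated assumptions, not OpenClaw's implementation: the ProviderError, SessionRef, classifyProviderError, and recover names are hypothetical, as are the transient status checks and the message wording. Only the 400 status and the invalid_encrypted_content code come from the PR description.

```typescript
// Hypothetical shapes for illustration; these are not OpenClaw's actual types.
interface ProviderError {
  status: number;
  code?: string;
}

interface SessionRef {
  sessionId: string;
  sessionFile: string;
}

type FailureKind = "replay_invalid_session" | "transient" | "unknown";

// Piece 1: classify precisely enough to separate session poison from an outage.
function classifyProviderError(err: ProviderError): FailureKind {
  if (err.status === 400 && err.code === "invalid_encrypted_content") {
    return "replay_invalid_session";
  }
  // Assumption: rate limits and 5xx responses are treated as transient.
  if (err.status === 429 || err.status >= 500) {
    return "transient";
  }
  return "unknown";
}

interface RecoveryResult {
  session: SessionRef;
  rotated: boolean;
  userMessage: string;
}

// Pieces 2 and 3: rotate through a reset path supplied by the caller
// (standing in for the runtime's existing reset machinery), then explain.
function recover(
  err: ProviderError,
  current: SessionRef,
  resetSession: (old: SessionRef) => SessionRef,
): RecoveryResult {
  const kind = classifyProviderError(err);
  if (kind === "replay_invalid_session") {
    return {
      session: resetSession(current),
      rotated: true,
      userMessage:
        "This session had stale encrypted replay state; I reset the lane. Please resend your last message.",
    };
  }
  if (kind === "transient") {
    return {
      session: current,
      rotated: false,
      userMessage: "The provider looks temporarily unavailable. Retrying on the same session is safe.",
    };
  }
  return {
    session: current,
    rotated: false,
    userMessage: "The run failed with an unrecognized provider error.",
  };
}
```

Passing the reset function in as a parameter keeps the sketch aligned with the second piece: recovery reuses a reset path the runtime already owns instead of growing its own cleanup branch.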

The focused e2e proof injects the invalid_encrypted_content failure from runEmbeddedPiAgent, asserts that the stored sessionId and sessionFile move away from the poisoned values and that the follow-up run is retargeted to the new session, and verifies that refreshQueuedFollowupSession receives the old and new session IDs and files. Unit coverage includes two files and 255 passing tests; the focused e2e slice passes one test with 50 skipped; TypeScript core test compilation also passes.
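The retargeting contract that proof exercises can be illustrated with a small sketch. Everything here is an assumption made for illustration: the QueuedFollowup shape and the retargetQueuedFollowups helper are hypothetical and are not the signature of refreshQueuedFollowupSession, which the PR description does not spell out.

```typescript
// Hypothetical shapes; the real queue entries likely carry more fields.
interface SessionRef {
  sessionId: string;
  sessionFile: string;
}

interface QueuedFollowup {
  prompt: string;
  sessionId: string;
  sessionFile: string;
}

// Return a new queue in which every follow-up that pointed at the poisoned
// session now points at the fresh one. Entries for other sessions are untouched.
function retargetQueuedFollowups(
  queue: readonly QueuedFollowup[],
  previous: SessionRef,
  next: SessionRef,
): QueuedFollowup[] {
  return queue.map((item) =>
    item.sessionId === previous.sessionId
      ? { ...item, sessionId: next.sessionId, sessionFile: next.sessionFile }
      : item,
  );
}

// Contract-boundary check in the spirit of the e2e proof: after rotation,
// nothing in the queue may still reference the poisoned session.
function referencesSession(queue: readonly QueuedFollowup[], ref: SessionRef): boolean {
  return queue.some(
    (item) => item.sessionId === ref.sessionId || item.sessionFile === ref.sessionFile,
  );
}
```

A test at this boundary needs no live provider: it only has to show that the old identifiers disappear from the queue and the new ones take their place.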

That proof is narrow, but narrow is not a flaw here. Session recovery code should be tested at the contract boundary: does the runtime identify the poisoned state, rotate it, and retarget follow-ups consistently? Live provider tests still matter, but they are expensive and credential-bound. The PR is explicit that it does not test a live Telegram/OpenClaw VM turn or live OpenAI Codex backend request. Good. Publish the caveat instead of pretending a mock is production.

The Codex comparison nobody benchmarks

This matters for the endless Codex-versus-Claude-versus-Cursor-versus-local-agent comparison because daily-driver reliability is not just model quality or token price. A coding agent can be smart, cheap, and still miserable if its state semantics fail under real workflows. Developers do not experience “provider replay contract mismatch.” They experience a follow-up that suddenly cannot continue, a task lane that keeps failing, and a recovery message that does not tell them whether they should reset, retry, or throw the machine out the window.

PR #85294 provides the immediate context. Earlier the same day, it stopped native openai-codex Responses requests from replaying prior reasoning items while preserving reasoning replay for non-native Responses backends. The reported shape was nativeCodexReplayIncludesReasoning: false while standardResponsesReplayIncludesReasoning: true. In plain English: different backends have different replay contracts, and the runtime has to respect those differences instead of assuming “Responses API” means one universal state model.
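One way to picture a per-backend replay contract is a lookup table that the replay builder consults before sending history back to the provider. The sketch below is a hedged illustration only: the backend labels, the HistoryItem shape, and the buildReplay function are hypothetical, and the table simply mirrors the two reported flags.

```typescript
// Hypothetical backend labels; the real runtime may identify backends differently.
type Backend = "native-openai-codex" | "standard-responses";

// Simplified history item; real replay payloads carry more fields.
interface HistoryItem {
  kind: "message" | "reasoning" | "tool_result";
  content: string;
}

// Per-backend replay contract mirroring the reported shape:
// native Codex does not replay reasoning, other Responses backends do.
const REPLAY_INCLUDES_REASONING: Record<Backend, boolean> = {
  "native-openai-codex": false,
  "standard-responses": true,
};

// Build the replay list for a backend, dropping reasoning items
// whenever that backend's contract excludes them.
function buildReplay(backend: Backend, history: readonly HistoryItem[]): HistoryItem[] {
  const keepReasoning = REPLAY_INCLUDES_REASONING[backend];
  return history.filter((item) => item.kind !== "reasoning" || keepReasoning);
}
```

Keeping the contract in a typed table means adding a backend forces an explicit decision about reasoning replay, because Record<Backend, boolean> will not compile with a missing key.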

That is the bigger framework lesson. Agent platforms increasingly sit between multiple providers, each with its own continuity model: encrypted handles, server-side sessions, reasoning items, local transcripts, tool-call replay, and transport-specific sanitizers. The orchestration layer has to normalize enough for users to get a coherent product while preserving enough provider-specific truth to avoid corrupting state. That is a hard interface. It deserves typed recovery paths, not generic catch blocks.

For operators, the practical guidance is to verify this after it ships. Run a Codex-backed OpenClaw lane through a multi-turn follow-up sequence, especially one that uses reasoning and queued continuation. If an invalid_encrypted_content-class failure appears, the platform should rotate the lane and make the next follow-up target the fresh session. Also watch the user-facing copy. A technically correct reset that tells the user nothing useful is still an operational failure.

For platform builders, the lesson is to treat session state like a mutable database, not a pile of strings. It needs schema, ownership, migration, invalidation, and recovery. It also needs failure classes. “Provider error” is too broad. Billing failure, refusal, unsupported parameter, context overflow, stale encrypted replay state, and transport serialization bugs deserve different operator paths.
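The failure classes listed above map naturally onto a closed union with an exhaustive switch. This is a sketch under loud assumptions: the FailureClass and OperatorPath names are hypothetical, and the specific pairing of each class with an operator path is an illustrative guess, not something the PR or OpenClaw prescribes.

```typescript
// The six failure classes named in the text, as a closed union.
type FailureClass =
  | "billing_failure"
  | "refusal"
  | "unsupported_parameter"
  | "context_overflow"
  | "stale_encrypted_replay"
  | "transport_serialization";

// Hypothetical operator paths; the pairing below is an illustrative assumption.
type OperatorPath =
  | "fix_account"
  | "rephrase_request"
  | "fix_configuration"
  | "compact_context"
  | "rotate_session"
  | "report_bug";

function operatorPathFor(failure: FailureClass): OperatorPath {
  switch (failure) {
    case "billing_failure":
      return "fix_account";
    case "refusal":
      return "rephrase_request";
    case "unsupported_parameter":
      return "fix_configuration";
    case "context_overflow":
      return "compact_context";
    case "stale_encrypted_replay":
      return "rotate_session";
    case "transport_serialization":
      return "report_bug";
    default: {
      // Exhaustiveness guard: adding a class without a path fails to compile.
      const unreachable: never = failure;
      throw new Error(`Unhandled failure class: ${String(unreachable)}`);
    }
  }
}
```

The never-typed default branch is the point of the design: a new failure class cannot be added to the union without someone deciding what the operator should do about it.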

The PR’s current labels include triage: mock-only-proof, which is exactly the note a maintainer should keep attached until live evidence arrives. But the shape is right. It turns an opaque upstream 400 into a typed runtime event and routes recovery through existing reset machinery. That is what mature agent systems do: they stop asking users to interpret provider entrails.

The story is not another OpenAI error code. It is that agent sessions can become poisoned state machines, and serious runtimes need escape hatches that reset the right thing without burning the whole conversation to the ground.

Sources: OpenClaw PR #85362, OpenClaw PR #85294, OpenClaw v2026.5.20 release, OpenClaw issue #84880
