openclaw

Anthropic Thinking Signatures Are Expiring, and OpenClaw's Replay Recovery Needs to Catch Up

Anatoliy Kolodkin

29 May 2026 • 3 min read

Extended thinking started as a model feature. In agent platforms, it becomes a storage problem, a replay problem, and eventually an operations problem. OpenClaw issue #88020 is the cleanest version of that lesson: after 45 to 60 minutes of Claude agent work, older thinking-block signatures can become invalid, Anthropic rejects the next request, and OpenClaw hard-fails the session instead of stripping stale thinking artifacts and retrying.

The reported error is precise enough to be useful: invalid_request_error: messages.1.content.440: Invalid signature in thinking block. The failure then returns stopReason: error, totalTokens: 0, and a runtime around 300 milliseconds. That last number matters. A fast zero-token failure after a long session is not the model suddenly becoming lazy. It is usually replay state poisoning the request before inference even begins.

Reasoning traces are now part of session state

Anthropic’s extended-thinking documentation describes thinking content blocks with signatures and model-specific behavior. Those signatures are not decorative. They are provider-owned artifacts that can affect whether a replayed conversation remains valid. Once OpenClaw persists them into a trajectory, every future turn inherits a dependency on whether those blocks are still valid, whether the provider still accepts them, and whether the runtime knows how to repair the request when they are not.

That is the key shift. Reasoning traces used to feel like model-internal ephemera. In an agent runtime, they become durable state. Durable state needs lifecycle rules: when to preserve, when to strip, when to summarize, when to hide from the model, when to expose to audit logs, and how to recover when a provider rejects it. If the runtime stores provider-specific reasoning artifacts but treats them like ordinary text, it inherits provider-specific failure modes without provider-specific recovery.

OpenClaw already has a function named stripInvalidThinkingSignatures, so the architecture clearly anticipated some signature churn. The immediate proposed fix in the issue is to add patterns matching Invalid ... signature ... thinking and signature ... thinking block to REPLAY_INVALID_RE, routing the error into the existing replay_invalid repair path. Bryan Baer reported that this one-line classifier-style change worked locally on 2026.5.27. That is useful, but it is probably not the whole fix.

The transport path matters as much as the regex

ClawSweeper’s review narrowed the remaining risk: provider transports can surface this rejection as a terminal stream error event that the recovery wrapper forwards instead of retrying. In plain English, the runtime may have the right repair function and the right regex, but the error can arrive through a path that bypasses the repair function entirely. That is a familiar agent-platform pattern. Recovery code exists; the hard part is ensuring every failure path reaches it.

A commenter confirmed a related failure on amazon-bedrock/global.anthropic.claude-sonnet-4-6 using OpenClaw v2026.5.22 on macOS arm64 in Telegram direct chat. Their trajectory had 64 thinking blocks at failure, and they said thinking blocks appeared even without explicit thinking configuration. The manual workaround was trajectory cleanup and a Gateway restart. That detail matters because it shows the bug is not limited to one direct Anthropic integration knob. Provider wrappers and defaults can introduce thinking artifacts even when a user does not believe they opted into them.

Adjacent OpenClaw issue #88019 shows the same class of replay invariant problem on Azure Responses: replaying a msg_* item without its paired reasoning item can poison a persistent session. The provider is different, the artifact name is different, but the platform lesson is the same. Modern reasoning models increasingly expose structured intermediate state. Agent runtimes that persist and replay that state need compatibility contracts, not hopeful concatenation.

What practitioners should do before blaming the model

If you run Claude agents through OpenClaw, watch for the signature of replay invalidation: long session, sudden fast failure, zero tokens, provider error mentioning thinking, signature, reasoning, previous response, or invalid replay. Capture the provider, model ID, transport path, message index, content block index, thinking-block count, and whether the retry path stripped signatures. Those details are not debugging trivia. They are the difference between “Claude failed” and “the runtime replayed stale provider state.”

For platform authors, build fixtures that simulate old signed thinking blocks and verify both direct request rejection and streamed terminal error shapes. Test direct Anthropic, Bedrock/global wrappers, and any compatibility providers that claim to emulate Claude behavior. A thrown error and a stream terminal event should converge on the same recovery policy if the provider message is semantically identical. Otherwise, the platform will recover in one integration and brick sessions in another.

There is also a product-design question here: how much provider-specific reasoning state should be retained in user-visible trajectories at all? The model may need some continuity. Operators may need auditability. But replaying every signed block forever is a liability if signatures expire or schemas drift. A better long-term shape may distinguish raw provider artifacts from normalized agent memory: preserve raw traces for audit where permitted, but replay only the minimal provider-valid context needed for the next turn.

This is where the new Claude Opus 4.8 support surface becomes relevant. Model upgrades are not just better benchmarks. They change thinking semantics, tool expectations, and replay contracts. If an agent platform wants to route across Claude, Codex, Copilot, Azure Responses, Bedrock, and local models, it needs provider-state hygiene as a first-class part of model routing. Otherwise the comparison table lies by omission.

LGTM take: this is the new replay tax of reasoning models. If an agent runtime stores provider-specific thinking artifacts, it also owns their expiry, sanitization, and retry semantics.

Sources: OpenClaw issue #88020, Anthropic extended thinking documentation, OpenClaw issue #88019, OpenClaw issue #84484

Reasoning traces are now part of session state

The transport path matters as much as the regex

What practitioners should do before blaming the model

Sign up for more like this.