OpenClaw’s Token Accounting Bug Turned Compaction Into a Per-Message Tax

OpenClaw issue #82576 is a reminder that token accounting is not analytics garnish. In an agent runtime, token state drives compaction, compaction drives context continuity, and context continuity decides whether a long-running assistant feels reliable or starts shredding its own transcript every time someone sends a message.

The bug is sharp because it is small. After compaction correctly sets totalTokens and marks totalTokensFresh=true, the next persistSessionUsageUpdate call can overwrite totalTokens with undefined when no fresh context snapshot exists. With the count erased, the preflight compaction guard has to estimate the whole transcript again on the next dispatch. On smaller-context routes, that estimate can exceed the threshold and trigger auto-compaction on every single message.

That is not a dashboard discrepancy. That is a runtime loop.

The report came from a real OpenClaw v2026.5.16-beta.1 install running a long Feishu direct-message conversation with image analysis on MiniMax M2.7-highspeed. The model’s context window is 204,800 tokens. The reporter’s state store showed totalTokens: None, totalTokensFresh: False, and compactionCount: 6. Logs showed five compaction rotations on five consecutive Feishu messages within six minutes: 19:30:07, 19:32:31, 19:34:02, 19:35:07, and 19:35:56.

The bug is really two facts collapsed into one

The affected path is src/auto-reply/reply/session-usage.ts. When hasFreshContextSnapshot is false, totalTokens is undefined, but the old write path still persisted that value into the session store. PR #82578 fixes the behavior by writing patch.totalTokens only when there is a fresh context snapshot. Otherwise, OpenClaw preserves the last known count while setting totalTokensFresh=false.
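A minimal sketch of the difference, assuming the patch is merged into the session store with a spread-style merge. Only totalTokens, totalTokensFresh, and hasFreshContextSnapshot come from the report; the interfaces and the functions buildPatchBuggy, buildPatchFixed, and applyPatch are hypothetical illustrations, not the actual code in session-usage.ts.

```typescript
// Hypothetical shapes; the real types in session-usage.ts may differ.
interface SessionTokenState {
  totalTokens?: number;
  totalTokensFresh: boolean;
}

interface UsageUpdate {
  hasFreshContextSnapshot: boolean;
  // Only meaningful when hasFreshContextSnapshot is true.
  totalTokens?: number;
}

// Old behavior as described in the issue: the totalTokens key is always
// written, so an undefined value erases the last known total.
function buildPatchBuggy(update: UsageUpdate): Partial<SessionTokenState> {
  return {
    totalTokens: update.hasFreshContextSnapshot ? update.totalTokens : undefined,
    totalTokensFresh: update.hasFreshContextSnapshot,
  };
}

// Behavior described for PR #82578: include totalTokens only when a fresh
// snapshot exists; otherwise just mark the stored count as stale.
function buildPatchFixed(update: UsageUpdate): Partial<SessionTokenState> {
  const patch: Partial<SessionTokenState> = {
    totalTokensFresh: update.hasFreshContextSnapshot,
  };
  if (update.hasFreshContextSnapshot && update.totalTokens !== undefined) {
    patch.totalTokens = update.totalTokens;
  }
  return patch;
}

// Assumed merge semantics: a key that is present with the value undefined
// still overwrites the stored value.
function applyPatch(
  state: SessionTokenState,
  patch: Partial<SessionTokenState>,
): SessionTokenState {
  return { ...state, ...patch };
}
```

Under these assumptions, applying the buggy patch to a freshly compacted state leaves totalTokens undefined, while the fixed patch keeps the previous number and only flips totalTokensFresh to false.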

That distinction is the whole design lesson. “This count is stale” and “there is no count” are different facts. The runtime needs to know both. If freshness is uncertain, the platform may choose to re-estimate later. If the count is erased, every downstream guardrail sees an unknown state and may take the most conservative path. In this case, conservative meant expensive and repetitive: re-estimate, cross threshold, compact, persist, corrupt the count, repeat.
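The loop can be reproduced in a toy simulation. This is a sketch under stated assumptions, not OpenClaw's guard: the threshold, the token numbers, and the rule "trust a known count, estimate the full transcript only when the count is missing" are all hypothetical, and the model ignores the tokens each new message would add.

```typescript
interface TokenState {
  totalTokens?: number;
  totalTokensFresh: boolean;
}

// All numbers below are illustrative assumptions, not values from the issue.
const THRESHOLD = 150_000;           // hypothetical compaction threshold
const COMPACTED_TOKENS = 60_000;     // hypothetical count right after compaction
const TRANSCRIPT_ESTIMATE = 180_000; // hypothetical full-transcript estimate

// Returns how many of `turns` messages triggered a compaction.
function simulate(turns: number, eraseOnStaleUpdate: boolean): number {
  let state: TokenState = { totalTokens: COMPACTED_TOKENS, totalTokensFresh: true };
  let compactions = 0;
  for (let i = 0; i < turns; i++) {
    // Preflight guard: use the stored count if there is one, even a stale
    // one; fall back to estimating the whole transcript otherwise.
    const effective = state.totalTokens ?? TRANSCRIPT_ESTIMATE;
    if (effective > THRESHOLD) {
      compactions++;
      state = { totalTokens: COMPACTED_TOKENS, totalTokensFresh: true };
    }
    // Post-reply usage update with no fresh context snapshot available.
    state = eraseOnStaleUpdate
      ? { totalTokens: undefined, totalTokensFresh: false }
      : { totalTokens: state.totalTokens, totalTokensFresh: false };
  }
  return compactions;
}
```

With these numbers, simulate(5, true) returns 4: every turn after the first erased write compacts. simulate(5, false) returns 0, because the stale but preserved count stays below the threshold.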

For a human operator, the visible symptom is maddening. The agent may still respond. Nothing obviously crashes. But every turn pays a compaction tax, and the conversation keeps being rotated under pressure. Latency goes up, state churn increases, and the assistant’s continuity becomes less predictable. The model did not get worse. The runtime’s accounting did.

Context window size is not a product guarantee

The MiniMax detail is important because the same bug can be invisible on a larger-context model. The report notes that deepseek-v4-flash with a 1M-token context window did not trip the same behavior. That does not mean the bug is harmless there; it means the threshold math did not make the failure obvious.

This is a useful warning for teams comparing agent model routes by context window and price. A 204K-token route might be perfectly reasonable if the runtime has accurate token accounting and sane compaction behavior. It can become unusable if stale usage updates erase durable counts and force repeated compaction. Conversely, a 1M-token route may hide platform bugs long enough for them to reach production. Big context windows are useful. They are not a substitute for correct state transitions.
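The threshold effect can be made concrete with a small calculation. The two context window sizes come from the report; the 0.8 threshold ratio and the 180,000-token estimate are hypothetical numbers chosen only for illustration, and wouldCompact is not an OpenClaw function.

```typescript
// Context windows mentioned in the report.
const MINIMAX_WINDOW = 204_800;
const DEEPSEEK_WINDOW = 1_000_000;

// Hypothetical values: the real threshold ratio and estimate are not
// stated in the issue.
const ASSUMED_THRESHOLD_RATIO = 0.8;
const ASSUMED_TRANSCRIPT_ESTIMATE = 180_000;

// True when an estimated transcript size crosses the compaction threshold
// for a given context window.
function wouldCompact(
  estimatedTokens: number,
  contextWindow: number,
  thresholdRatio: number,
): boolean {
  return estimatedTokens > contextWindow * thresholdRatio;
}

const onMiniMax = wouldCompact(
  ASSUMED_TRANSCRIPT_ESTIMATE, MINIMAX_WINDOW, ASSUMED_THRESHOLD_RATIO);
const onDeepSeek = wouldCompact(
  ASSUMED_TRANSCRIPT_ESTIMATE, DEEPSEEK_WINDOW, ASSUMED_THRESHOLD_RATIO);
```

Under these assumptions the same re-estimated transcript crosses the threshold on the 204,800-token route (onMiniMax is true) and stays far below it on the 1M-token route (onDeepSeek is false), even though the erased count is equally wrong in both cases.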

Agent platforms now maintain several overlapping token concepts: provider-reported input and output usage, cached tokens, cache writes when providers expose them, estimated transcript tokens, compacted context size, fresh snapshot state, stale snapshot state, and durable session totals. Operators want one chart. The runtime has to preserve the messy truth underneath it. Simplifying those states too aggressively is how observability becomes a control-plane bug.

What engineers should actually check

If you run OpenClaw or a similar long-lived agent stack, repeated compaction logs should be treated as a reliability signal, not background noise. Search for conversations where every user message triggers preflight compaction. Inspect the session store for missing or stale token totals. Compare behavior across model routes with different context windows. If a smaller-context model suddenly looks expensive or unstable, do not assume the model is the problem until the runtime’s token state has been audited.
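One way to surface the pattern is to look for long runs of messages that each trigger a compaction. The sketch below assumes you have already extracted a chronological list of event kinds for one session from your own logs; the EventKind type, the function name, and the alert cutoff of 3 are hypothetical and do not reflect OpenClaw's log format.

```typescript
// Hypothetical event kinds, pre-extracted from logs in chronological order.
type EventKind = "message" | "compaction";

// Longest run of consecutive messages that were each followed by a
// compaction before the next message arrived.
function longestCompactionStreak(events: EventKind[]): number {
  let longest = 0;
  let current = 0;
  let pendingMessage = false; // a message seen with no compaction yet
  for (const kind of events) {
    if (kind === "message") {
      // The previous message passed without compaction: the run is broken.
      if (pendingMessage) current = 0;
      pendingMessage = true;
    } else if (pendingMessage) {
      current++;
      longest = Math.max(longest, current);
      pendingMessage = false;
    }
    // Compactions with no pending message are ignored.
  }
  return longest;
}

// Hypothetical alerting rule: flag sessions with 3 or more in a row.
function looksLikeCompactionLoop(events: EventKind[], minStreak = 3): boolean {
  return longestCompactionStreak(events) >= minStreak;
}
```

A session like the one in the report, where five consecutive messages each rotate the transcript, would produce a streak of 5 and be flagged. A healthy session that compacts occasionally would not.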

For maintainers, PR #82578 is the right shape because it preserves the last known durable value while marking the value stale. That gives the system room to be cautious without destroying useful state. It also creates a cleaner contract for future context-engine work: external or pluggable context managers need accurate host-side accounting, or they will be blamed for bad decisions made by persistence semantics.

There is also a product lesson here. Usage telemetry should not only answer “how many tokens did we spend?” It should help explain why the runtime made a lifecycle decision. Why did this session compact? Which token count was used? Was it fresh, estimated, provider-reported, or carried forward from the last known compaction? If the platform cannot explain that, the operator cannot distinguish healthy maintenance from a loop.
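A decision record along these lines could carry that explanation. This is a hypothetical shape, not an existing OpenClaw telemetry schema: the field names, the TokenCountSource labels, and the decide and explain helpers are all assumptions made for illustration.

```typescript
// Where the token count used for the decision came from (assumed labels).
type TokenCountSource =
  | "fresh-snapshot"
  | "estimated"
  | "provider-reported"
  | "carried-forward";

// Hypothetical record emitted every time the preflight guard runs.
interface CompactionDecision {
  sessionId: string;
  compacted: boolean;
  tokenCount: number;
  source: TokenCountSource;
  threshold: number;
}

function decide(
  sessionId: string,
  tokenCount: number,
  source: TokenCountSource,
  threshold: number,
): CompactionDecision {
  return { sessionId, compacted: tokenCount > threshold, tokenCount, source, threshold };
}

// One-line, operator-readable explanation of the decision.
function explain(d: CompactionDecision): string {
  const verb = d.compacted ? "compacted" : "did not compact";
  const cmp = d.compacted ? ">" : "<=";
  return `session ${d.sessionId} ${verb}: ${d.tokenCount} tokens (${d.source}) ${cmp} threshold ${d.threshold}`;
}
```

A run of records whose source is "estimated" on every turn would make a loop like the one in issue #82576 visible at a glance, where a bare compaction count would not.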

The wider agent industry should take note because this class of bug will not stay OpenClaw-specific. Any runtime that supports long conversations, channel agents, multimodal context, memory, and compaction will need durable accounting semantics. A chat app can tolerate sloppy token charts. An agent platform uses the numbers behind those charts to decide when to rewrite its own working memory.

That is why issue #82576 matters. Token accounting is runtime infrastructure. When it lies, the agent does not merely misreport usage. It starts doing expensive maintenance work on every turn and calls it safety.

Sources: OpenClaw issue #82576, PR #82578, OpenClaw v2026.5.16-beta.2 release, PR #82351