Tool Results Can Poison a Session One Acceptable Chunk at a Time

The expensive agent failure is rarely one giant mistake. It is usually twenty individually acceptable decisions replayed forever. OpenClaw PR #87639 is a good example: the runtime already caps individual persisted toolResult messages, but a long-running tool can append many results that each fit under the per-message limit. Later, the session history replays the whole pile and blows the model’s context window even when the user replies with a single character.

That is not a prompt problem. It is a persistence and replay policy problem. If a user sends “1” and the model says the input exceeds the context window, the input is not really “1.” The input is “1” plus whatever the runtime quietly dragged forward from prior tools, summaries, system instructions, schema materialization, memory, and session state. Token budgets live in the runtime, not in the user’s text box.

PR #87639 was created on 2026-05-28 at 12:50:30Z and updated at 13:01:24Z. It is still open, and ClawSweeper currently wants stronger real-behavior proof before merge. That caution is healthy. Context-window bugs are exactly the kind of issue that can pass a focused unit test and still fail in the messy, long-lived sessions where users actually hit them. But the failure mode is real enough and common enough that the proposed design deserves attention.

Per-message caps are table stakes, not token governance

The patch touches three files with a reported +115/-4 diff: src/agents/embedded-agent-runner/tool-result-truncation.ts, src/agents/session-tool-result-guard.ts, and src/agents/session-tool-result-guard.test.ts. The core idea is to preserve the existing per-result cap while adding an aggregate cap for persisted tool-result text on the active session branch. Once the aggregate budget is exceeded, older entries are rewritten with truncation notices.

The reproduction shape in the brief is intentionally small: eight process toolResult messages, each around 850 characters, with maxToolResultChars=1000. Before the patch, the aggregate persisted output would be roughly 6,824 characters. After the patch, aggregate persisted output is bounded at 3,998 characters against a 4,000-character aggregate budget, with a truncation notice present. The individual chunks were never “too large.” The session was.
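To make the aggregate-cap idea concrete, here is a minimal sketch of one way such a guard could work. It is not the code from PR #87639: the ToolResultEntry shape, the enforceAggregateBudget function, the maxAggregateChars parameter, and the notice text are all hypothetical names chosen for illustration. The sketch walks the history oldest-first and shortens entries until the total fits.

```typescript
// Hypothetical shape; the real OpenClaw message types are not shown in the source.
interface ToolResultEntry {
  toolName: string;
  text: string;
}

// Hypothetical notice text, not the wording used by the PR.
const TRUNCATION_NOTICE =
  "[tool result truncated: aggregate session budget exceeded]";

function totalChars(entries: ToolResultEntry[]): number {
  return entries.reduce((sum, e) => sum + e.text.length, 0);
}

// Shorten the oldest entries first until the aggregate fits the budget.
// Returns new objects; the input array is not mutated.
// If every entry is already reduced to the notice, the total may still
// exceed the budget; this sketch does not drop entries entirely.
function enforceAggregateBudget(
  entries: ToolResultEntry[],
  maxAggregateChars: number,
): ToolResultEntry[] {
  const result = entries.map((e) => ({ ...e }));
  let excess = totalChars(result) - maxAggregateChars;
  for (const entry of result) {
    if (excess <= 0) break;
    // An entry no longer than the notice cannot be made smaller by truncating.
    if (entry.text.length <= TRUNCATION_NOTICE.length) continue;
    const maxSavings = entry.text.length - TRUNCATION_NOTICE.length;
    const savings = Math.min(maxSavings, excess);
    const keep = entry.text.length - savings - TRUNCATION_NOTICE.length;
    entry.text = entry.text.slice(0, keep) + TRUNCATION_NOTICE;
    excess -= savings;
  }
  return result;
}

// Reproduction-shaped input: eight chunks of 850 characters, each under a
// 1,000-character per-message cap, checked against a 4,000-character budget.
const chunks: ToolResultEntry[] = Array.from({ length: 8 }, () => ({
  toolName: "process",
  text: "x".repeat(850),
}));
const bounded = enforceAggregateBudget(chunks, 4000);
console.log(totalChars(chunks), totalChars(bounded));
```

The exact post-truncation total depends on the notice length and the truncation strategy, so this sketch should not be expected to reproduce the 3,998-character figure reported for the patch. The property that matters is that the newest entries survive intact and the older ones carry a visible notice.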

The real-world case was uglier: a Windows/Telegram OpenClaw 2026.5.26 direct bot session where repeated process tool result chunks of roughly 30,000 characters each poisoned session history. Later Telegram replies as short as “1” failed with “Your input exceeds the context window of this model.” That is the token-budget version of a clogged drain. Each chunk fits through the pipe. The system still floods because nobody limits accumulation.

This is why per-call or per-message limits are only the first control. Agent systems replay history. They summarize, compact, branch, restore, and rehydrate. The cost and context impact of a tool result is not its size when produced; it is its size multiplied by how often and where it is replayed. A 30K-character process log might be tolerable once as immediate evidence. It is not tolerable as permanent prompt furniture.
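The multiplication is easy to see with a little arithmetic. The sketch below assumes the simplest possible replay policy, where every persisted tool result is carried into every later request with no compaction; the function name and the per-turn input format are illustrative only.

```typescript
// Each element is the number of tool-result characters persisted on that
// turn (0 if the turn produced none). Assumes no compaction or truncation.
function replayedCharsPerTurn(persistedPerTurn: number[]): number[] {
  const replayed: number[] = [];
  let carried = 0;
  for (const added of persistedPerTurn) {
    replayed.push(carried); // history carried into this turn's request
    carried += added;
  }
  return replayed;
}

// Three turns each persist a 30,000-character chunk, then two short replies.
const perTurn = replayedCharsPerTurn([30000, 30000, 30000, 0, 0]);
const cumulative = perTurn.reduce((sum, n) => sum + n, 0);
console.log(perTurn, cumulative);
```

Under those assumptions the two short follow-up turns each carry 90,000 characters of old tool output, and the five turns together replay 270,000 characters for 90,000 characters of actual tool work.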

Cost controls belong where waste is introduced

The industry keeps trying to solve agent cost with dashboards. Dashboards are useful, but they are accounting. By the time the graph tells you a session burned through a context window, the runtime has already made poor decisions for thousands of tokens. The better control is closer to the source: what gets persisted, what gets replayed, what gets summarized, what gets projected, and what remains available only as an out-of-band artifact.

PR #87639 operates at the right layer because persisted tool results are where this particular waste enters the system. It does not ask the user to reset more often. It does not blame the model for having a finite context window. It changes the replay payload. That is the difference between governance and vibes.

The truncation notice is not cosmetic. Silent deletion would create a debugging nightmare. If a future answer depends on a tool result that was shortened, the operator needs to know the model saw a truncated history. The right long-term design likely includes configurable aggregate budgets by tool class, visible “history truncated” diagnostics, and a raw artifact store that can preserve full logs outside the model prompt. The model does not need every byte of a long process log forever. The operator might. Those are different storage tiers.
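A rough sketch of the two storage tiers might look like the following. Everything here is hypothetical: ArtifactStore is an in-memory stand-in for whatever durable store a runtime would use, and persistToolResult, PromptEntry, and the notice wording are invented for illustration rather than taken from OpenClaw.

```typescript
// What the model sees on replay, plus diagnostics for the operator.
interface PromptEntry {
  text: string;
  truncated: boolean;
  originalChars: number;
  artifactId?: string; // handle to the full body when the text was shortened
}

// In-memory stand-in for an out-of-band raw artifact store.
class ArtifactStore {
  private bodies = new Map<string, string>();
  private nextId = 1;

  put(body: string): string {
    const id = `artifact-${this.nextId++}`;
    this.bodies.set(id, body);
    return id;
  }

  get(id: string): string | undefined {
    return this.bodies.get(id);
  }
}

// Keep the full output in the store; persist only a bounded view with a
// visible notice. If maxPromptChars is smaller than the notice itself, the
// persisted text is just the notice and will exceed the cap.
function persistToolResult(
  raw: string,
  maxPromptChars: number,
  store: ArtifactStore,
): PromptEntry {
  if (raw.length <= maxPromptChars) {
    return { text: raw, truncated: false, originalChars: raw.length };
  }
  const artifactId = store.put(raw);
  const notice =
    `\n[history truncated: ${raw.length} chars total; ` +
    `full output stored as ${artifactId}]`;
  const keep = Math.max(0, maxPromptChars - notice.length);
  return {
    text: raw.slice(0, keep) + notice,
    truncated: true,
    originalChars: raw.length,
    artifactId,
  };
}
```

The truncated flag and originalChars field are the receipt: an operator reading the persisted entry can tell that the model saw a shortened view and where the full body lives.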

This same pattern shows up across OpenClaw’s recent runtime work. The May 26 compaction circuit-breaker PR stopped repeated summarizer failures from burning tokens chunk after chunk. The OpenRouter context-window issue showed what happens when provider routing reserves almost the entire model context for output and leaves no room for prompt, tools, or history. The common lesson is that token governance is not a billing feature. It is execution-path policy.

Long-lived sessions need garbage collection with receipts

Engineers should think of tool results the way they think about logs and build artifacts. Fresh logs are useful. Infinite logs in every request are a denial-of-service against your own system. A long build log should be summarized in the model context, linked as an artifact, and retrievable on demand. The assistant needs enough evidence to reason. It does not need to relitigate every line of stdout every time someone says “continue.”

The hard part is deciding what to keep. Truncating older entries is a practical default, but some tool outputs contain handles, file IDs, page tokens, error messages, or final results that future turns need. A smarter runtime will eventually do schema-aware projection: keep continuation IDs and final status, summarize bulk text, drop repetitive progress noise, and store the full raw body elsewhere. That is more work than a byte cap, but the cap is the necessary guardrail that prevents the current session from becoming unrecoverable.
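As a sketch of what schema-aware projection could mean in practice, assume tool results arrive in a structured form with a status, an optional continuation handle, and a bulk body. That StructuredToolResult shape and the projectForReplay function are hypothetical; real tool schemas will differ, and a real runtime would summarize the body rather than simply cut it.

```typescript
// Hypothetical structured result; field names are illustrative only.
interface StructuredToolResult {
  status: "running" | "succeeded" | "failed";
  continuationId?: string; // page token, file ID, or similar handle
  errorMessage?: string;
  body: string; // bulk stdout, page text, and so on
}

// Build the replay view: keep handles and terminal status, drop progress
// chunks that carry neither, and shorten bulk bodies. A shortened body is
// slightly longer than maxBodyChars because of the appended marker.
function projectForReplay(
  results: StructuredToolResult[],
  maxBodyChars: number,
): StructuredToolResult[] {
  return results
    .filter(
      (r, i) =>
        r.status !== "running" || // terminal results always survive
        r.continuationId !== undefined || // so do results carrying a handle
        i === results.length - 1, // and the most recent progress chunk
    )
    .map((r) => ({
      ...r,
      body:
        r.body.length > maxBodyChars
          ? r.body.slice(0, maxBodyChars) +
            ` [body shortened from ${r.body.length} chars]`
          : r.body,
    }));
}
```

The ordering of concerns is the point: identifiers and final status are cheap and load-bearing, so they are kept verbatim, while the bulk text is the part that gets reduced.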

Practitioners can apply the lesson now. Audit sessions that use verbose tools: process, web_fetch, browser automation, MCP servers, image/video generators, CI log readers, and anything that streams progress. If tiny follow-up messages trigger context-window errors, do not inspect the latest user prompt first. Inspect accumulated persisted tool results. Check whether compaction is summarizing the right material or repeatedly replaying raw logs. Look for tools that emit periodic chunks where only the final chunk matters.
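A first-pass audit can be as simple as totaling persisted tool-result text per tool. The sketch below assumes you can export session history as an array of messages with a role, an optional tool name, and text; that PersistedMessage shape is an assumption, not OpenClaw's actual session format.

```typescript
// Assumed export shape for persisted session history.
interface PersistedMessage {
  role: "user" | "assistant" | "toolResult";
  toolName?: string;
  text: string;
}

interface ToolUsage {
  toolName: string;
  count: number;
  chars: number;
}

// Total persisted tool-result characters per tool, largest first.
function auditToolResultChars(history: PersistedMessage[]): ToolUsage[] {
  const byTool = new Map<string, { count: number; chars: number }>();
  for (const msg of history) {
    if (msg.role !== "toolResult") continue;
    const key = msg.toolName ?? "(unknown)";
    const entry = byTool.get(key) ?? { count: 0, chars: 0 };
    entry.count += 1;
    entry.chars += msg.text.length;
    byTool.set(key, entry);
  }
  return [...byTool.entries()]
    .map(([toolName, usage]) => ({ toolName, ...usage }))
    .sort((a, b) => b.chars - a.chars);
}
```

A tool with a high count and a large total is the usual suspect: many chunks that each passed the per-message cap and now dominate every replayed request.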

For tool authors, this is also a design warning. Return structured status and artifact references, not only large text blobs. If your tool can generate 100K characters of output, give the model a concise result plus a handle to fetch more. If the only interface is “dump everything into content,” the orchestrator has to choose between losing detail and poisoning the session. Better tool contracts reduce that tradeoff.
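On the tool side, a contract along these lines would give the orchestrator a third option. The ToolOutput shape, buildToolOutput, and readArtifactRange are hypothetical names for illustration; the idea is only that the tool returns a short summary plus a handle, and exposes a ranged read for the rare turn that needs more.

```typescript
// Hypothetical tool result contract: concise text plus a handle.
interface ToolOutput {
  status: "ok" | "error";
  summary: string; // short text intended for the model prompt
  artifactRef: string; // handle the model or operator can use to fetch more
  totalChars: number; // size of the full output, for budgeting decisions
}

// Summarize a long log as its line count plus the last few lines.
function buildToolOutput(
  fullLog: string,
  artifactRef: string,
  tailLines = 5,
): ToolOutput {
  const lines = fullLog.split("\n");
  const shown = Math.min(tailLines, lines.length);
  const tail = lines.slice(-shown).join("\n");
  return {
    status: "ok",
    summary: `${lines.length} lines; last ${shown}:\n${tail}`,
    artifactRef,
    totalChars: fullLog.length,
  };
}

// Ranged read over the full output, so callers page instead of dumping.
function readArtifactRange(
  fullLog: string,
  offset: number,
  limit: number,
): string {
  return fullLog.slice(offset, offset + limit);
}
```

Keeping the tail is a judgment call that suits build and process logs, where the final lines usually carry the outcome; other tools may want the head or a structured digest instead.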

The ClawSweeper caveat matters: the PR still needs stronger real-behavior proof before merge. Good. The right proof would run through an actual session replay path where repeated tool chunks previously caused overflow, then show that a tiny follow-up remains within budget while full raw artifacts are still recoverable somewhere. Agent runtime fixes should be judged by behavior, not just source plausibility.

The editorial take is simple: per-call token caps are table stakes. Session-level replay budgets are where agents stay usable. OpenClaw’s bug is not exotic; it is what happens when a chat transcript becomes an append-only warehouse for tool output and the model is asked to carry the warehouse on every turn. That is not intelligence. That is bad garbage collection with an LLM bill attached.

Sources: OpenClaw PR #87639, Issue #86880, PR #86900, OpenClaw v2026.5.27 release