
Auto-Compaction Is Crashing Active OpenClaw Replies Because the Lock Contract Changed Under It

Anatoliy Kolodkin

20 May 2026 • 4 min read

Compaction is supposed to be the boring maintenance job that keeps long-lived agents useful. It trims, rewrites, summarizes, and preserves enough context that the assistant can keep going without dragging the entire transcript through every turn. In OpenClaw issue #84746, that maintenance job became the thing killing active replies.

The reported regression is blunt: after OpenClaw’s 5.18 transcript-lock scope change, auto-compaction can crash embedded responses with EmbeddedAttemptSessionTakeoverError: session file changed while embedded prompt lock was released. The reporter says the pattern moved from zero crashes on 2026.5.7 to 38 on the first day of 2026.5.18. That is not a cosmetic regression. It is context maintenance turning into a concurrency bug users can feel.

The lock optimization had a hidden writer

The issue was filed at 2026-05-21T00:38:49Z against OpenClaw 2026.5.18 through 2026.5.19-beta.2. The environment is a macOS Apple Silicon MacBook Air using a global npm install, with 15 agents and multi-channel usage across iMessage group chats, Telegram, and Slack. The reporter’s comparison point matters: 2026.5.7 reportedly had zero crashes, while 2026.5.18 produced 38 EmbeddedAttemptSessionTakeoverError incidents in one day.

The crash pattern correlates one-to-one with “embedded run auto-compaction start” log entries. The reporter explicitly rules out crons, concurrent messages, and provider failures as the primary trigger. A representative log sequence shows auto-compaction beginning at 17:56:06, followed 17 seconds later, at 17:56:23, by a lane task error on an iMessage group lane: EmbeddedAttemptSessionTakeoverError: session file changed while embedded prompt lock was released. The issue says 24 auto-compaction events fired in one day, roughly every 55 minutes, and every compaction that overlapped an active model call crashed that response.

The reported root cause is the #13744 behavior introduced in 5.18: the embedded run releases its coarse transcript lock before model I/O, while persistence and cleanup have separate locks. The goal is reasonable. Holding a coarse transcript lock throughout model I/O can block other work and create timeouts. But once the prompt lock is released, every component that can mutate the session file becomes part of the concurrency contract. Auto-compaction is one of those components.
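A minimal sketch of that failure mode, written in Python purely for illustration. It is not OpenClaw's code: it assumes the embedded run fingerprints the session before releasing its lock and compares the fingerprint after model I/O, and every name here (Session, embedded_run, compact, SessionTakeoverError) is hypothetical.

```python
import hashlib
import threading


class SessionTakeoverError(RuntimeError):
    """Hypothetical stand-in for EmbeddedAttemptSessionTakeoverError."""


class Session:
    def __init__(self, transcript: str) -> None:
        self.transcript = transcript
        self.lock = threading.Lock()  # the coarse transcript lock

    def fingerprint(self) -> str:
        return hashlib.sha256(self.transcript.encode()).hexdigest()


def embedded_run(session: Session, model_call) -> str:
    # Snapshot the transcript under the lock, then release it before model I/O.
    with session.lock:
        before = session.fingerprint()
    reply = model_call()  # the lock is NOT held here; other writers may run
    with session.lock:
        if session.fingerprint() != before:
            raise SessionTakeoverError(
                "session file changed while embedded prompt lock was released"
            )
        session.transcript += reply
    return reply


def compact(session: Session) -> None:
    # The compactor locks correctly, yet still invalidates the in-flight run.
    with session.lock:
        session.transcript = "[summary]"
```

If compact(session) runs while model_call is in flight, the second fingerprint no longer matches and the run raises instead of appending its reply. Each writer takes the lock correctly on its own; the failure only appears when the two are interleaved.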

That is the whole bug in one sentence: OpenClaw optimized lock scope, but compaction remained a writer to the same ground the active run was standing on.

Agent transcripts are not log files

The instinct to compact automatically is correct. Long-running agents need it. But agent transcripts are not passive logs. They are prompt material, replay state, memory input, user-visible continuity, and sometimes provider-side resume context. Rewriting them while a model response is in flight is not the same as rotating an access log under a web server.

This is the operational difference many agent systems are still learning. In a normal application, a background maintenance job can often run at fixed intervals and take a lock around the data it changes. In an agent runtime, “the data” may be actively participating in a model call that has already constructed context, may be about to write tool results, may be holding partial progress, or may be expected to append a final response into the same durable lane. The compactor is not outside the execution path. It is a concurrent editor.

The human symptom is worse than the stack trace. A user sends a message in Slack or iMessage. The model starts working. A maintenance timer fires. The active response dies. The user gets no reply. Maybe the lane wedges until a gateway restart. From the user’s perspective, the agent is flaky. From the operator’s perspective, the provider looks suspicious. From the logs, the real culprit is a housekeeping process that did not know how to negotiate with an active embedded run.

The previous fix reduced noise, not the failure

Related issue #83510 documented an earlier shape of the same class: session-file mutation during a released prompt lock was counted as a model failure and retried across fallback providers, eventually exhausting the chain and producing misleading “ALL PROVIDERS DOWN” alerts. That issue cited 61 log occurrences of “session file changed while embedded prompt lock was released”. The current report says 5.19’s fix stops the takeover error from consuming model fallbacks and generating false provider-down alerts, but the active response still dies.

That is progress, but it is not enough. Removing a misleading alert is good. Preserving the reply is better. The underlying product contract is simple: compaction should make long-lived sessions more reliable, not introduce a periodic chance of response loss.

PR #84153, included in the same release context, adds a 30-second fail-open timeout for compaction hooks so never-settling plugin lifecycle code does not freeze compaction forever. That patch and #84746 point at the same conclusion from different angles. Compaction is now part of runtime scheduling. Hooks can hang it. Transcript writes can race it. Active responses can be killed by it. This subsystem is no longer background plumbing.
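The fail-open idea can be sketched in a few lines. This is an illustration in Python under stated assumptions, not the code from PR #84153: the function name and the shape of a hook (a zero-argument coroutine function) are hypothetical, and only the 30-second figure comes from the PR as described above.

```python
import asyncio
from typing import Awaitable, Callable

HOOK_TIMEOUT_SECONDS = 30.0  # figure cited for the PR; the constant name is hypothetical


async def run_hook_fail_open(
    hook: Callable[[], Awaitable[None]],
    timeout: float = HOOK_TIMEOUT_SECONDS,
) -> bool:
    """Run one hook; return True if it settled, False if it timed out."""
    try:
        await asyncio.wait_for(hook(), timeout=timeout)
        return True
    except asyncio.TimeoutError:
        # Fail open: abandon the hook and let compaction continue.
        return False
```

In this sketch a hook that raises still propagates its exception; only a hook that never settles is abandoned. Whether the real patch treats hook errors the same way is not stated in the material cited here.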

What operators should do while the contract is unsettled

The reporter’s workaround is pragmatic: set compaction.mode: "manual" and run compaction during quiet hours when no conversations are active. That is not a satisfying long-term product posture, but it is a reasonable mitigation for high-traffic channel agents on affected versions. If you upgraded from 5.7-era builds to 5.18 or 5.19 and run active embedded/channel agents, inspect logs for EmbeddedAttemptSessionTakeoverError and for “embedded run auto-compaction start” entries. If the errors consistently follow compaction starts within seconds, disable auto-compaction or move it to controlled windows.
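That log check is easy to script. The sketch below is a rough helper, not an OpenClaw tool. It assumes each log line begins with an ISO-8601 timestamp followed by a space and the message, which may not match your actual log format; the two marker strings are the ones quoted in the issue, and the function name and 60-second window are arbitrary choices.

```python
from datetime import datetime, timedelta

COMPACTION_MARKER = "embedded run auto-compaction start"
ERROR_MARKER = "EmbeddedAttemptSessionTakeoverError"


def correlate(lines, window=timedelta(seconds=60)):
    """Pair each takeover error with the latest compaction start that
    preceded it by at most `window`. Returns (compaction, error) tuples."""
    last_compaction = None
    pairs = []
    for line in lines:
        stamp, _, message = line.partition(" ")
        try:
            when = datetime.fromisoformat(stamp)
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        if COMPACTION_MARKER in message:
            last_compaction = when
        elif ERROR_MARKER in message and last_compaction is not None:
            if when - last_compaction <= window:
                pairs.append((last_compaction, when))
    return pairs
```

Feed it the lines of a gateway log, for example correlate(open(path)) with your own path. If nearly every takeover error pairs with a compaction start, as in the 17:56:06 and 17:56:23 example from the issue, you are likely looking at this regression rather than a provider problem.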

Maintainers have a few design options. The simplest is to defer auto-compaction whenever an embedded model call is active for that session. A stricter option is to cover compaction writes with the same lock that protects the model I/O critical section, though that risks reintroducing blocking behavior the 5.18 change was meant to avoid. A more ambitious option is a safe rebase protocol where an active run can survive a compaction by validating and reconciling the transcript mutation. That path is harder because tool calls, partial outputs, and provider resume state make “just retry” dangerous.
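The first option, deferral, might look something like the following. This is a sketch under assumptions, not a proposed patch: the ActiveRuns class and its methods are hypothetical, and for brevity a single process-wide guard is held during compaction, where a real runtime would want per-session coordination.

```python
import threading
from collections import Counter
from contextlib import contextmanager


class ActiveRuns:
    """Tracks in-flight embedded model calls per session (hypothetical)."""

    def __init__(self) -> None:
        self._guard = threading.Lock()
        self._counts = Counter()

    @contextmanager
    def model_call(self, session_id: str):
        # Register the call, run it without holding the guard, then unregister.
        with self._guard:
            self._counts[session_id] += 1
        try:
            yield
        finally:
            with self._guard:
                self._counts[session_id] -= 1

    def try_compact(self, session_id: str, compact) -> bool:
        """Run `compact` only if the session is idle; return False to defer."""
        with self._guard:
            if self._counts[session_id] > 0:
                return False  # a model call is active; try again later
            compact()  # the guard is held, so no new call can register mid-compaction
            return True
```

An embedded run would wrap its model I/O in `with runs.model_call(session_id):`, and the compaction timer would call try_compact and reschedule whenever it returns False. The trade-off is that a session which is never idle is never compacted, so a real scheduler would also need a fallback for that case.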

The best near-term rule is boring: if a compactor can mutate the transcript, it must coordinate with the execution scheduler. Fixed timers are not enough. Idle detection, active-run awareness, and explicit recovery semantics are table stakes for long-lived agents.

The editorial take: agent memory maintenance is no longer housekeeping. It is part of the runtime. If compaction can edit the transcript while the model is mid-flight, it is a concurrent writer with product impact. Treat it that way, or users will keep discovering your maintenance window one dropped reply at a time.

Sources: OpenClaw issue #84746, issue #83510, compaction hook timeout PR #84153, OpenClaw v2026.5.19 release, OpenClaw v2026.5.20-beta.1 release
