The Session Lock Bug Is the Kind of Failure That Separates Agent Demos From Agent Operations

The most important OpenClaw reliability story today is not a model getting confused. It is a lock file that can outlive the failed run that created it. That sounds mundane until you realize what it means for an agent platform: one timed-out thought can keep holding the pen while every later thought queues up, stalls, fails, and leaks more state.

Issue #86014 reports that after an embedded OpenClaw agent run times out, the Gateway can retain the session .jsonl write lock indefinitely. The visible symptom looks like ordinary agent flakiness: a run timed out, a tool stalled, a response failed. Underneath, the session remains locked by the Gateway process itself, making future turns wait and fail. This is exactly the class of bug that separates agent demos from agent operations.

A stuck lock is not a small failure in an agent runtime

The issue was created at 2026-05-24T10:37:52Z and reproduced on OpenClaw 2026.5.22 and 2026.5.20. The reported environment is not exotic: Docker, a Debian-based image, init: true with tini as PID 1, Node.js v24.14.0, and OpenClaw running embedded agent work through cron agentTurn or subagent spawn.

The reproduction path is operationally realistic. Configure a finite timeoutSeconds, trigger an embedded agent run, let it exceed the timeout or hit a stalled/erroring tool call, then inspect the session state. The reporter captured an embedded run timeout at 2026-05-24T10:10:19.763Z, followed by SessionWriteLockTimeoutError at 2026-05-24T10:11:51.382Z and repeated embedded-agent failures against the same locked session.

The lock file was still present 19 minutes after the timeout. Its contents showed pid: 7, the Gateway process itself, a maxHoldMs value of 1020000 (17 minutes), and a lock path under /root/.openclaw/agents/main/sessions/...jsonl.lock. In other words, this was not a dead child process leaving trash behind. The main runtime still believed it owned the write lock, or at least had not released it through the timeout path.

The impact is ugly in exactly the way production incidents are ugly: stuck sessions block for 60 seconds and fail, retained session context leaks memory, RSS reportedly grows 40–160 MB per minute with dozens of cron-spawned agentTurn sessions, and the default runRetries.max: 160 can amplify churn. The suggested workarounds are pure operator survival: delete .lock files older than two minutes, reduce retries from 160 to 32, use finite timeouts, and restart the container.
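The first workaround, deleting .lock files older than two minutes, can be sketched as a small sweep script. This is a minimal illustration, not OpenClaw tooling: the function names are hypothetical, and it assumes a lock file's modification time reflects when the lock was taken, which the issue does not confirm. It defaults to a dry run because removing a lock that a live writer still depends on could corrupt a transcript.

```python
import os
import time
from pathlib import Path


def find_stale_locks(sessions_dir, max_age_seconds=120.0, now=None):
    """Return *.jsonl.lock files whose mtime is older than max_age_seconds."""
    now = time.time() if now is None else now
    stale = []
    for lock_path in Path(sessions_dir).rglob("*.jsonl.lock"):
        try:
            age = now - lock_path.stat().st_mtime
        except FileNotFoundError:
            continue  # lock was released between listing and stat
        if age > max_age_seconds:
            stale.append(lock_path)
    return sorted(stale)


def remove_stale_locks(sessions_dir, max_age_seconds=120.0, dry_run=True):
    """Delete stale locks; with dry_run=True only report what would be removed."""
    handled = []
    for lock_path in find_stale_locks(sessions_dir, max_age_seconds):
        if not dry_run:
            try:
                lock_path.unlink()
            except FileNotFoundError:
                continue  # someone else cleaned it up first
        handled.append(lock_path)
    return handled
```

Pointing find_stale_locks at the sessions directory from a cron job or sidecar also doubles as a crude alert source: any non-empty result is a session that is probably wedged.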

Timeouts are normal, so cleanup cannot be optional

The architectural lesson is blunt: in agent systems, timeouts are not edge cases. They are normal operating conditions. Tools hang. Browser automation stalls. Telegram long-polling can block. Model streams slow down. Subprocesses ignore SIGTERM. Network APIs half-fail. Users send overlapping turns. A runtime that treats cleanup as the happy-path epilogue rather than a guaranteed transaction boundary will eventually wedge itself.

Traditional web applications have their own cleanup problems, but agent platforms add more moving parts per request. A single turn may write conversation state, launch child runs, stream tool calls, invoke subprocesses, update memory, send channel replies, schedule retries, and append logs. If any cancellation path skips a finally block, leaves a lock open, or misclassifies an aborted child as successful, the next turn inherits the damage.

That is why the related cleanup PRs matter. PR #85860 treats aborted subagent runs as terminal so parent sessions do not misreport aborted child runs as success. PR #85865 gives subprocess cancellations a five-second graceful shutdown window and routes SIGTERM through process-tree/group cleanup. These are not glamorous patches. They are the shape of a platform learning that cancellation is a first-class execution mode, not an interruption to the real work.
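The pattern PR #85865 is described as adopting, a bounded graceful window followed by a forced kill of the whole process group, looks roughly like this in Python. It is a sketch of the general technique, not the PR's code: the helper names are hypothetical, the five-second default mirrors the window mentioned above, and os.killpg is POSIX-only, which matches the Docker and Debian environment in the lock report but not the Windows ones.

```python
import os
import signal
import subprocess


def spawn_in_own_group(argv):
    """Start a child as leader of a new session, so its pid is also its group id."""
    return subprocess.Popen(argv, start_new_session=True)


def terminate_process_group(proc, grace_seconds=5.0):
    """SIGTERM the child's process group, then SIGKILL it if the child outlives the grace window."""
    if proc.poll() is not None:
        return proc.returncode  # already exited
    try:
        os.killpg(proc.pid, signal.SIGTERM)
    except ProcessLookupError:
        return proc.wait()  # group vanished between poll() and killpg()
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        try:
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass
        return proc.wait()
```

One limitation worth knowing: this only waits on the group leader. If the leader exits promptly but a grandchild ignores SIGTERM, the grandchild survives, so a stricter version would sweep the group again after the grace window.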

The session lock bug is the same lesson at the transcript layer. Conversation state is not just a log. It is a coordination primitive. If the platform serializes writes through .jsonl.lock, then releasing that lock is part of the turn contract. A timed-out turn that keeps the lock has not failed cleanly. It has converted one failure into a persistent degraded state.
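To make "releasing the lock is part of the turn contract" concrete, here is a minimal sketch of a turn runner that cannot exit without releasing its lock. Everything in it is illustrative: run_turn, acquire_lock, release_lock, and SessionLockHeld are hypothetical names, and the lock is a plain exclusive-create file rather than whatever OpenClaw actually does internally.

```python
import asyncio
import os
from pathlib import Path


class SessionLockHeld(Exception):
    """Raised when another writer already holds the lock file."""


def acquire_lock(lock_path):
    # O_EXCL makes creation atomic: exactly one writer can create the file.
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError as exc:
        raise SessionLockHeld(str(lock_path)) from exc
    with os.fdopen(fd, "w") as handle:
        handle.write(str(os.getpid()))


def release_lock(lock_path):
    Path(lock_path).unlink(missing_ok=True)


async def run_turn(lock_path, work, timeout_seconds):
    """Run one turn under the session lock and release it on every exit path."""
    acquire_lock(lock_path)
    try:
        return await asyncio.wait_for(work(), timeout=timeout_seconds)
    finally:
        # Runs on success, exception, timeout, and cancellation alike.
        release_lock(lock_path)
```

The design point is that the release lives in the finally clause of the same function that acquired the lock, so no timeout handler elsewhere has to remember it. A regression test for the bug class is then simple: run a turn that hangs, let it time out, and assert the lock file is gone.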

The Windows reports rhyme with the same failure class

Two same-day issues point at adjacent availability pain. Issue #86044 reports that after upgrading from 2026.5.20 to 2026.5.22, Windows CLI commands such as openclaw --version regressed from under one second to 30–60 seconds, while doctor --non-interactive hung with zero output. Downgrading restored normal behavior. The reporter suspects provider auth-state pre-warm.

Issue #86031 describes a Gateway on Windows that remains bound to 127.0.0.1:18789 while local health and status probes time out. Logs show eventLoopDelayP99Ms=136902.1, eventLoopUtilization=0.999, Telegram getUpdates stuck for 139.83s, and transport rebuilds while the process still appears “running.”

These are not identical to the lock bug, but they rhyme. In all three cases, superficial liveness is misleading. The Gateway process exists. A socket may be bound. A command may technically be executing. But the system is not responsive in the way an operator needs. A bound port is not health. A process table entry is not readiness. A transcript entry is not delivery. Agent runtimes need health checks that measure actual responsiveness under degraded provider, channel, and session-state conditions.

What operators should do before this bites them

First, monitor Gateway health with request latency, not just process presence. If health or status endpoints stop responding promptly while the process remains alive, page the runtime, not the model. Second, alert on stale .jsonl.lock files and SessionWriteLockTimeoutError logs. A stale lock is not housekeeping noise; it is a session availability incident.
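A latency-based probe, as opposed to a "is the process there" check, might look like the sketch below. The probe_latency helper and its status labels are hypothetical, and check stands in for whatever call your deployment uses to exercise the Gateway, for example an HTTP request to its health endpoint.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def probe_latency(check, timeout_seconds=2.0):
    """Run check() and classify the result by whether it answered in time."""
    executor = ThreadPoolExecutor(max_workers=1)
    started = time.monotonic()
    future = executor.submit(check)
    try:
        future.result(timeout=timeout_seconds)
        status = "healthy"
    except FutureTimeout:
        status = "unresponsive"  # the target may be alive, but it did not answer in time
    except Exception:
        status = "failing"  # the check itself raised
    finally:
        executor.shutdown(wait=False)  # do not block the prober on a hung check
    return status, time.monotonic() - started
```

The distinction between "unresponsive" and "failing" is the one the Windows reports turn on: a Gateway that holds its port but never answers should page someone just as loudly as one that refuses connections. Note that a truly hung check leaves its worker thread behind, so a long-running prober would want checks that carry their own timeouts as well.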

Third, put sane ceilings on retries for cron-driven agentTurn workloads. A retry value of 160 may be survivable when failures are independent. It is dangerous when the failure mode is a stuck shared resource. Retrying into a locked session does not create resilience. It creates load and memory pressure.
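The arithmetic is worth spelling out. If every retry waits out the full 60-second lock timeout, which is an assumption about the worst case rather than something the issue states, then 160 retries means up to 160 × 60 s = 9,600 s, or 160 minutes, of blocked work per stuck session, against 32 minutes at a ceiling of 32. A retry loop that also stops immediately on shared-resource failures avoids even that. The sketch below uses hypothetical names throughout; SharedResourceStuck stands in for an error like SessionWriteLockTimeoutError.

```python
class SharedResourceStuck(Exception):
    """Hypothetical marker for failures caused by a stuck shared resource."""


def _is_shared_failure(exc):
    return isinstance(exc, SharedResourceStuck)


def run_with_retries(operation, max_retries=32, is_shared_failure=_is_shared_failure):
    """Call operation() until it succeeds; return (result, attempts).

    Re-raises at once on shared-resource failures, since retrying cannot
    unstick them, and re-raises after max_retries retries otherwise.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            return operation(), attempts
        except Exception as exc:
            if is_shared_failure(exc) or attempts > max_retries:
                raise
```

A production version would add backoff between attempts; it is omitted here to keep the fail-fast branch visible.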

Fourth, verify cancellation behavior for tools and subprocesses. Child processes should run in process groups where possible, receive graceful shutdown signals, and be killed after a bounded window. The platform should treat timeout cleanup as a testable behavior: lock released, child processes gone, parent state marked terminal, user-visible status accurate.

Fifth, stage 2026.5.22 carefully on Windows and provider-heavy configurations. The release includes valuable Gateway startup work, but the same area can also introduce readiness regressions if auth pre-warm or channel polling runs too early on CLI paths. Measure command startup time, doctor latency, health latency, and event-loop delay before and after.

The broader lesson is that agent orchestration does not fail like a normal web app. It often fails by half-completing an operation while leaving behind state that future operations must respect. A web request can die and be retried. An agent turn can die while holding the transcript lock, while a child process keeps running, while the channel adapter thinks a reply is pending, while retries keep feeding the same broken lane.

My take: the important OpenClaw story is not that one lock file got stuck. It is that production agent platforms live or die by cleanup paths. If your agent can time out, abort, spawn children, call tools, and write durable conversation state, then cancellation is part of the runtime contract. Anything less is a demo with a pager attached.

Sources: OpenClaw issue #86014, issue #86044, issue #86031, PR #85860, PR #85865