OpenClaw’s Session-Lock Fix Shows Why Provider Hangs Become Platform Outages
The easiest agent reliability bug to underestimate is the one that looks like a normal timeout. A model provider hangs, the user sees an error, everybody shrugs, and the next turn should be fine. In OpenClaw PR #90065, the next turn was not fine. The provider hang left the session file lock wedged long enough that every later attempt failed instantly. That is not a timeout. That is a small platform outage wearing a chat error.
The PR was opened at 2026-06-03T23:59:13Z and updated minutes later at 2026-06-04T00:07:06Z. The failure mode is specific and ugly: OpenClaw’s embedded agent runner can hit a turn timeout, start its abort path, and then wait indefinitely for retained transcript writes to become idle. If the underlying provider stream accepts a connection and then ignores cancellation, the abort path never fully settles and the session write lock stays held. Subsequent user turns fail with SessionWriteLockTimeoutError until the watchdog eventually reclaims the .jsonl.lock, creating a dead zone of 5–30 minutes.
The representative trace in the PR tells the story in numbers: embedded run timeout at timeoutMs=120000, abort settle timed out at timeoutMs=2000, and then a lane task error after durationMs=61545 with SessionWriteLockTimeoutError: session file locked (timeout 60000ms). To the user, all of that collapses into “Something went wrong while processing your request.” To the runtime, it is a retained lock, a still-live controller, and a transcript write path that has not admitted the old owner is gone.
AbortController is not a reliability strategy
The trigger is a flaky provider endpoint that hangs streamGenerateContent even after runAbortController.abort(). That detail matters because it is the gap between JavaScript cancellation as an API and cancellation as an operational guarantee. Sending abort is easy. Proving every downstream stream, socket, parser, callback, transcript writer, and cleanup path will promptly honor it is much harder.
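The gap between sending an abort and getting a guarantee can be sketched in a few lines. This is a minimal illustration, not OpenClaw's code: settleWithin, hungStream, and demo are hypothetical names, and the 50 ms bound is arbitrary. It shows a stream that ignores its AbortSignal, and a wait that still returns because it is raced against a timer.

```typescript
// Hypothetical helper: wait for cleanup to settle, but never longer than timeoutMs.
type SettleOutcome = "settled" | "timed_out";

async function settleWithin(
  cleanup: Promise<unknown>,
  timeoutMs: number,
): Promise<SettleOutcome> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<SettleOutcome>((resolve) => {
    timer = setTimeout(() => resolve("timed_out"), timeoutMs);
  });
  // A rejected cleanup still counts as settled: the operation is over either way.
  const settled = cleanup.then(
    () => "settled" as const,
    () => "settled" as const,
  );
  try {
    return await Promise.race([settled, timeout]);
  } finally {
    // Clear the timer so a fast settle does not keep the process alive.
    if (timer !== undefined) clearTimeout(timer);
  }
}

// Simulated provider stream that ignores the abort signal entirely.
function hungStream(_signal: AbortSignal): Promise<void> {
  return new Promise<void>(() => {
    /* never resolves, never rejects */
  });
}

async function demo(): Promise<SettleOutcome> {
  const controller = new AbortController();
  const stream = hungStream(controller.signal);
  controller.abort(); // the request to cancel is sent...
  return settleWithin(stream, 50); // ...but only the bound guarantees progress
}
```

The abort call here is a request, and the stream is free to ignore it. The only thing the caller controls is how long it is willing to wait before treating the operation as abandoned.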
Prior fixes had narrowed related lock problems. The PR explicitly references #87278, #88623, and #89811 as handling abort paths that settle. This patch exists because the retained transcript write path did not settle when the provider never really let go. That is the category of bug agent platforms should fear: not the happy-path exception, but the half-dead async operation that continues owning state after the user-visible turn is already over.
The fix has three useful pieces. First, drain acquisition and the retained-idle wait are each bounded by OPENCLAW_EMBEDDED_ABORT_SETTLE_TIMEOUT_MS, which defaults to two seconds, or to 250 milliseconds under OPENCLAW_TEST_FAST. A per-controller override is available via abortReleaseTimeoutMs. Second, when the bound expires, the underlying file lock is force-released. Third, and most importantly, the old controller is poisoned through takeoverDetected. Later withSessionWriteLock calls throw EmbeddedAttemptSessionTakeoverError, and cleanup degrades to a no-op lock.
That poison step is the difference between a risky recovery and a reckless one. Force-releasing a lock is dangerous if the old owner can still write. Without poisoning, the platform trades downtime for possible transcript corruption. By marking the old controller as having lost ownership, OpenClaw makes the recovery explicit: the session must move forward, but the orphaned writer no longer gets to pretend it still owns the file.
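A small in-memory model makes the poison step concrete. Everything below is an assumption made for illustration: FileLock, LockController, TakeoverError, and demoTakeover are hypothetical stand-ins, not the PR's classes, and the lock is a boolean and not a file on disk. The sketch only shows the ordering that matters: mark the old owner as having lost ownership, release the lock, and make both its later writes and its later cleanup harmless.

```typescript
// All names here are hypothetical; this is not OpenClaw's implementation.
class TakeoverError extends Error {
  constructor() {
    super("lock was force-released; this controller no longer owns the session");
    this.name = "TakeoverError";
  }
}

// Stand-in for the on-disk .jsonl.lock: at most one holder at a time.
class FileLock {
  held = false;
  acquire(): void {
    if (this.held) throw new Error("session file locked");
    this.held = true;
  }
  release(): void {
    this.held = false;
  }
}

class LockController {
  private poisoned = false;
  private readonly lock: FileLock;
  private readonly transcript: string[];

  constructor(lock: FileLock, transcript: string[]) {
    lock.acquire();
    this.lock = lock;
    this.transcript = transcript;
  }

  write(line: string): void {
    // A poisoned controller must never touch the transcript again.
    if (this.poisoned) throw new TakeoverError();
    this.transcript.push(line);
  }

  forceRelease(): void {
    this.poisoned = true; // poison first, so no write can slip in after release
    this.lock.release();
  }

  cleanup(): void {
    // After a takeover the lock may belong to someone else: do nothing.
    if (this.poisoned) return;
    this.lock.release();
  }
}

function demoTakeover(): {
  transcript: string[];
  lateWriteError: string;
  lockStillHeld: boolean;
} {
  const lock = new FileLock();
  const transcript: string[] = [];

  const oldOwner = new LockController(lock, transcript);
  oldOwner.write("turn 1");
  oldOwner.forceRelease(); // the settle bound expired

  const newOwner = new LockController(lock, transcript);
  newOwner.write("turn 2");

  let lateWriteError = "";
  try {
    oldOwner.write("late write from the hung turn");
  } catch (err) {
    lateWriteError = err instanceof Error ? err.name : String(err);
  }
  oldOwner.cleanup(); // must not release the new owner's lock

  return { transcript, lateWriteError, lockStillHeld: lock.held };
}
```

In this model the late write from the hung turn is rejected, the transcript contains only the two legitimate turns, and the new owner still holds the lock after the old owner's cleanup runs. Without the poisoned flag, the same sequence would append the stale line and release a lock the old controller no longer owned.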
The provider is part of your failure domain
This PR is also a useful corrective to how teams evaluate agent stacks. Model providers are usually discussed in terms of quality, price, latency, and context window. Those are table stakes. For production agents, provider behavior under failure is just as important. What happens when the stream hangs? When the local llama.cpp server accepts a request but stops yielding tokens? When a reverse proxy buffers forever? When a hosted endpoint keeps the TCP connection open after the client aborts? When OAuth refresh stalls while the runtime is holding a queue lock?
If the answer is “the agent waits,” then the provider has become a platform dependency with unbounded blast radius. If the answer is “the agent aborts, but cleanup waits for the provider to cooperate,” the dependency is still too powerful. The right invariant is sharper: no single provider stream should be able to hold a user’s session hostage after the runtime has decided the turn is over.
OpenClaw’s patch points toward that invariant, but it also shows how hard it is to retrofit. Agent sessions are not stateless completions. They accumulate transcript events, tool results, memory references, pending writes, channel delivery state, and sometimes child-agent coordination. A file lock is not just a file lock; it is the boundary that protects the continuity of the run. When that boundary is held across untrusted async work, the platform is betting session availability on someone else’s cancellation semantics.
ClawSweeper’s repo-native review reportedly called the patch silver shellfish quality while blocking merge pending real behavior proof from a real setup. That skepticism is correct. A forced release path around session state deserves evidence, not applause. The test evidence is encouraging: OPENCLAW_TEST_FAST=1 pnpm vitest run src/agents/embedded-agent-runner/run/attempt.session-lock.test.ts passes 90/90 in the full file, with focused cases for force release when retained writes never settle and for graceful release when they settle within the bound. But the critical proof is whether the runtime behaves under an actually wedged provider, not just a cleanly simulated one.
What engineers should audit
If you build or operate agent systems, use this PR as a checklist. Find every lock held across a model call or tool call: session files, workspace leases, browser profiles, approval state, queue ownership, transcript appenders, provider fanout, and child-agent result paths. For each one, ask four questions. Can an external dependency hang while this lock is held? Is cleanup bounded? If the lock is force-released, is the old owner poisoned? Will the user or operator see a recovery signal that distinguishes “provider hung” from “session corrupted”?
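The fourth question, about recovery signals, is the easiest to make concrete. One possible shape, assuming nothing about OpenClaw's actual error model: give each failure its own tagged variant so that logs and user messages cannot collapse into a single generic error. TurnFailure, describeFailure, and every field and message below are hypothetical.

```typescript
// Hypothetical failure taxonomy; variant and field names are illustrative only.
type TurnFailure =
  | { kind: "provider_hung"; timeoutMs: number }
  | { kind: "lock_force_released"; settleTimeoutMs: number }
  | { kind: "session_locked"; waitedMs: number }
  | { kind: "transcript_corrupt"; detail: string };

interface RecoverySignal {
  retryable: boolean; // can the user simply send the next turn?
  message: string; // what the user or operator should see
}

function describeFailure(failure: TurnFailure): RecoverySignal {
  switch (failure.kind) {
    case "provider_hung":
      return {
        retryable: true,
        message: `Provider did not respond within ${failure.timeoutMs} ms; the turn was cancelled.`,
      };
    case "lock_force_released":
      return {
        retryable: true,
        message: `Cleanup did not settle within ${failure.settleTimeoutMs} ms; the session lock was reclaimed.`,
      };
    case "session_locked":
      return {
        retryable: true,
        message: `Session was still locked after ${failure.waitedMs} ms; an earlier turn may be stuck.`,
      };
    case "transcript_corrupt":
      return {
        retryable: false,
        message: `Session transcript needs operator attention: ${failure.detail}`,
      };
  }
}
```

The point of the tagged union is that the compiler forces every new failure kind to get its own recovery decision, so a hung provider and a damaged transcript never reach the user as the same sentence.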
Cost governance belongs in this conversation too. A hung provider is not only a reliability issue; it can waste tokens, pin concurrency, and trigger retries that make the platform look busier than it is. The same controls that prevent a user session from wedging also help keep runaway agent work from becoming an invisible bill. Bounded aborts, explicit ownership, poisoned stale controllers, and observable recovery are budget controls as much as correctness controls.
The take is not that OpenClaw had a lock bug. Every serious agent runtime will have some version of this bug. The useful story is the invariant OpenClaw is trying to encode: graceful when possible, bounded when necessary, poisoned after forced release. That is the shape of production-grade agent cleanup. Anything less lets one bad provider stream turn a two-minute timeout into half an hour of dead air.
Sources: OpenClaw PR #90065, OpenClaw PR #89811, OpenClaw PR #89673