Manual Abort Should Not Leave an OpenClaw Session Locked for Five Minutes
A stop button is not a UX flourish in an agent runtime. It is the operator’s escape hatch when a model is looping, a provider is hanging, a tool is too slow, or the assistant has wandered into the weeds. That is why PR #88623 matters: OpenClaw’s manual abort path could leave a retained session lock behind, turning the user’s recovery action into the thing that prevented recovery.
The bug is wonderfully small in the way production reliability bugs often are. Timeout aborts released retained embedded-attempt locks. Manual aborts called the same abort path with isTimeout=false, skipped releaseHeldLockForAbort(), and left later turns waiting until a SessionWriteLockTimeoutError, watchdog expiry, or gateway restart. The user clicks stop, sends another message, and the agent lane feels dead.
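The gating described above can be pictured with a minimal sketch. Only the names isTimeout and releaseHeldLockForAbort come from the PR description; the HeldLock type and both abort functions are hypothetical stand-ins, not OpenClaw's actual code.

```typescript
// Hypothetical stand-in for a retained embedded-attempt lock.
interface HeldLock {
  held: boolean;
}

// Name taken from the PR description; this body is illustrative only.
function releaseHeldLockForAbort(lock: HeldLock): void {
  lock.held = false;
}

// Pre-fix behavior as described: the release is gated on isTimeout.
function abortBeforeFix(lock: HeldLock, isTimeout: boolean): void {
  if (isTimeout) {
    releaseHeldLockForAbort(lock);
  }
  // A manual abort (isTimeout = false) falls through with the lock still held.
}

// Post-fix shape: every abort releases the lock it owns.
function abortAfterFix(lock: HeldLock, isTimeout: boolean): void {
  releaseHeldLockForAbort(lock);
  if (isTimeout) {
    // Timeout-specific abandonment handling would remain separate here.
  }
}
```

In this sketch the whole bug is a single branch: the release call sits inside a condition that only the machine-triggered path satisfies.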
The linked issue, #88600, was filed on May 31 against OpenClaw 2026.5.27, Node v22.22.1, Ubuntu, and DeepSeek deepseek-v4-pro. The reproduction is exactly what any operator might do: start an agent conversation, manually stop it mid-response, then immediately send another message to the same agent. The failure chain documented by the reporter is blunt: manual abort skips the release; cleanup then tries acquireForCleanup(); the fallback lock acquisition waits 60 seconds; cleanupEmbeddedAttemptResources() never runs; and the lock is freed only by the watchdog, whose maxHoldMs defaults to 300 seconds, or by a gateway restart.
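To make the reported numbers concrete, here is a rough timeline sketch. The 60-second wait and the 300-second watchdog default come from the issue. Everything else is an assumption made for illustration: that the retained lock was taken at t = 0, that the watchdog frees it at exactly 300 seconds, and that a follow-up turn gives up after the same 60-second wait. The function and type names are hypothetical.

```typescript
// Values quoted from the issue report.
const LOCK_WAIT_SEC = 60; // fallback lock acquisition wait
const WATCHDOG_MAX_HOLD_SEC = 300; // reported maxHoldMs default, expressed in seconds

type FollowUp =
  | { kind: "lock-timeout"; failsAtSec: number }
  | { kind: "acquired"; startsAtSec: number };

// ASSUMPTIONS (not confirmed by the source): the retained lock was acquired at
// t = 0, the watchdog frees it at exactly WATCHDOG_MAX_HOLD_SEC, and a
// follow-up turn stops waiting after LOCK_WAIT_SEC.
function followUpOutcome(sentAtSec: number): FollowUp {
  if (sentAtSec >= WATCHDOG_MAX_HOLD_SEC) {
    // The watchdog has already freed the lock.
    return { kind: "acquired", startsAtSec: sentAtSec };
  }
  const deadline = sentAtSec + LOCK_WAIT_SEC;
  if (deadline < WATCHDOG_MAX_HOLD_SEC) {
    // The wait expires before the watchdog fires, so the turn fails.
    return { kind: "lock-timeout", failsAtSec: deadline };
  }
  // The watchdog fires while the turn is still waiting.
  return { kind: "acquired", startsAtSec: WATCHDOG_MAX_HOLD_SEC };
}
```

Under these assumptions, a follow-up sent 10 seconds in fails at the 70-second mark, and only a message sent in the last minute before the watchdog fires gets through without an error.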
The escape hatch has to be more reliable than the happy path
Abort paths deserve more respect than they usually get. Engineers often test normal completion, timeout, and maybe shutdown. Manual abort gets treated like a UI edge case. In an agent platform, that is backwards. Manual abort is a control-plane operation invoked precisely when the rest of the system is already suspect.
If a normal completion path releases locks but abort does not, the runtime has a broken invariant. If timeout abort releases retained locks but manual abort does not, the runtime has split semantics where the human-triggered recovery path is weaker than the machine-triggered one. That is not just annoying. It erodes trust in the whole orchestration layer, because the user learns that “stop” might mean “wait five minutes or restart the service.”
The timing numbers are not academic. A 60-second lock wait is long enough for a chat interface to feel broken. A 300-second watchdog is long enough for a human to assume the gateway is wedged, file a bug, restart the process, or abandon the workflow. Agent products live inside conversational expectations; a minute of unexplained silence is not a minor delay. It is a failed turn.
PR #88623 routes manual abort through the same lock-release and warning path as timeout abort while keeping timeout abandonment handling separate. That is the right shape: release what the abort owns, surface release failures without throwing the runtime into a worse state, and preserve distinct handling for timeout-specific abandonment behavior. The reported verification includes five targeted test files across three Vitest shards, pnpm tsgo:core, formatting, and autoreview. Added coverage includes manual abort lock release, non-throwing release failures, controller/idempotency behavior, drain coordination, and cleanup release behavior.
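Two of the covered properties, non-throwing release failures and idempotency, can be sketched in a few lines. This is an illustration of the general pattern, not the PR's code: AbortLockHandle, makeAbortRelease, and the warn callback are hypothetical names.

```typescript
// Hypothetical handle for whatever lock the aborted attempt owns.
interface AbortLockHandle {
  release: () => void;
}

// Returns a release function that is safe to call from any abort path:
// it attempts the release at most once and never throws.
function makeAbortRelease(
  handle: AbortLockHandle,
  warn: (message: string) => void,
): () => boolean {
  let attempted = false;
  return () => {
    if (attempted) {
      return false; // idempotent: later calls are no-ops
    }
    attempted = true;
    try {
      handle.release();
    } catch (err) {
      // Surface the failure as a warning instead of making the abort worse.
      const detail = err instanceof Error ? err.message : String(err);
      warn(`lock release failed during abort: ${detail}`);
    }
    return true;
  };
}
```

Whether a failed release should be retried later is a separate design decision; this sketch attempts it once and leaves any further recovery to other mechanisms such as a watchdog.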
Session locks are a cost-control problem too
This bug also belongs in the cost-governance conversation, even though it does not look like a billing bug at first glance. A retained lock after abort can cause retries, duplicate attempts, human restarts, abandoned context, and unnecessary provider calls as users try to unstick the lane. Runaway cost is not only “the model generated too many tokens.” It is also operational ambiguity that causes humans and schedulers to repeat work.
Good agent cost controls need more than token budgets. They need cancellation semantics, lock visibility, retry boundaries, idempotent cleanup, and clear state transitions. If the user aborts, the next turn should know whether the previous attempt was canceled, whether tools were still running, whether locks were released, whether partial output was committed, and whether cleanup is complete. Otherwise, the platform is asking the user to reason about hidden runtime state from a chat box.
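One way to make that state explicit is a small record the runtime fills in after every abort, which the next turn checks before starting. The sketch below is illustrative only; AbortOutcome and canStartNextTurn are hypothetical names, and the fields simply mirror the questions listed above.

```typescript
// Hypothetical record of what an abort actually achieved.
interface AbortOutcome {
  canceled: boolean;
  toolsStillRunning: boolean;
  locksReleased: boolean;
  partialOutputCommitted: boolean; // informational; does not block the next turn here
  cleanupComplete: boolean;
}

// Reports whether the next turn may start, and names each blocker if not.
function canStartNextTurn(outcome: AbortOutcome): { ok: boolean; blockers: string[] } {
  const blockers: string[] = [];
  if (!outcome.canceled) blockers.push("previous attempt not canceled");
  if (outcome.toolsStillRunning) blockers.push("tools still running");
  if (!outcome.locksReleased) blockers.push("locks still held");
  if (!outcome.cleanupComplete) blockers.push("cleanup pending");
  return { ok: blockers.length === 0, blockers };
}
```

A record like this turns a silent wait into a reportable reason, which is what a chat interface needs in order to tell the user anything useful.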
For operators, the practical checklist is straightforward. After upgrading, test manual abort in the channels you actually use: Slack, Telegram, terminal, cron-triggered agents, and any embedded or subagent flows. Stop a run mid-response and immediately send a follow-up. Watch lock acquisition logs, session status, active-run counts, and whether cleanup emits warnings. If a channel still goes quiet for 60 seconds, treat that as a runtime bug, not user impatience.
For platform authors, abort-path coverage should be first-class. Test user abort, timeout abort, provider disconnect, tool cancellation, compaction cancellation, channel retry during abort, and gateway shutdown while cleanup is pending. Every path should answer the same questions: which locks are held, which controllers have fired, which subscriptions are drained, which session files are writable, and what can the next turn safely assume?
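A table-driven audit is one way to hold every abort path to the same questions. The sketch below assumes a hypothetical runAbort hook that triggers a given abort cause against a test runtime and returns a snapshot; none of these names are OpenClaw APIs.

```typescript
// Abort causes from the list above, as hypothetical identifiers.
const ABORT_CAUSES = [
  "user-abort",
  "timeout-abort",
  "provider-disconnect",
  "tool-cancellation",
  "compaction-cancellation",
  "channel-retry-during-abort",
  "gateway-shutdown",
] as const;
type AbortCause = (typeof ABORT_CAUSES)[number];

// Hypothetical post-abort snapshot of the runtime state.
interface RuntimeSnapshot {
  heldLocks: string[];
  controllerFired: boolean;
  subscriptionsDrained: boolean;
  sessionWritable: boolean;
}

// Lists every invariant the snapshot violates.
function invariantViolations(s: RuntimeSnapshot): string[] {
  const violations: string[] = [];
  if (s.heldLocks.length > 0) violations.push(`locks still held: ${s.heldLocks.join(", ")}`);
  if (!s.controllerFired) violations.push("abort controller never fired");
  if (!s.subscriptionsDrained) violations.push("subscriptions not drained");
  if (!s.sessionWritable) violations.push("session files not writable");
  return violations;
}

// Runs every abort cause through the same checks and collects the failures.
function auditAbortPaths(
  runAbort: (cause: AbortCause) => RuntimeSnapshot,
): Map<AbortCause, string[]> {
  const failures = new Map<AbortCause, string[]>();
  for (const cause of ABORT_CAUSES) {
    const violations = invariantViolations(runAbort(cause));
    if (violations.length > 0) failures.set(cause, violations);
  }
  return failures;
}
```

Fed a fake runtime that leaks a lock only on user abort, the audit flags exactly that one path, which is the shape of the bug described in this article.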
The larger industry lesson is that agent reliability is not just about making agents finish tasks. It is about making agents stop tasks safely. Autonomous systems need brakes that work under stress. If the brake locks the steering wheel for five minutes, the spec is wrong even if the engine is impressive.
LGTM take: stop buttons are production infrastructure. If abort does not release the lock, the runtime has no safe way to recover from the exact failures abort exists to handle.
Sources: OpenClaw PR #88623, issue #88600, OpenClaw v2026.5.30-beta.1 release, PR #88625