OpenClaw’s Stuck-Tool Recovery Patch Is the Kind of Runtime Observability Agents Actually Need
Agent observability has a bad habit of stopping one step too early. It tells you a session is “processing.” Then it tells you the queue depth. Then, if the platform is unusually honest, it classifies the problem as a blocked tool call. Useful. Also not enough. The operator does not need a prettier label for a wedged runtime; they need the system to recover without turning a service restart into a hostage negotiation.
OpenClaw PR #82369, opened and merged in the first hour of May 16, is small in scope but important in category. It fixes a deadlock class where active embedded runs with stale native tool calls could be detected by diagnostics but not recovered. The triggering issue, #81976, came from a live Telegram agent turn where a bash tool ran openclaw sessions --agent dex --limit 10 --json. That command sounds harmless: read session state, produce JSON, move on. In practice, it apparently queried the same gateway/session system already executing the tool, and the active run got stuck.
The diagnostic loop saw the problem. The issue captured repeated reports like state=processing, age=142s, queueDepth=1, reason=blocked_tool_call, classification=blocked_tool_call, activeWorkKind=tool_call, lastProgress=codex_app_server:notification:rawResponseItem/completed, and activeTool=bash. That is good instrumentation. The failure was the next field: recovery=none.
Seeing the wedge is not the same as owning it
The wedge survived normal restart drain attempts. According to the issue, gateway restart could keep draining active embedded runs until the outer service manager eventually sent SIGKILL. That is exactly the sort of failure mode that separates a clever agent demo from a platform operators trust. If a runtime can detect a stale tool call but refuses to abort it, restart becomes theatre. The service is “draining” work that is no longer making progress, while the human waits for the inevitable kill signal.
PR #82369 changes six files with 186 additions and 16 deletions. The patch adds abort recovery for active embedded runs when diagnostics detect a stale native tool call beyond the stuck-session abort threshold. It also caps gateway restart recovery for active embedded runs, while preserving the existing drain budget for normal active tasks. That distinction matters. You do not want a platform casually killing legitimate long-running work. You also do not want it treating a stale native tool call as sacred just because it is technically active.
The verification list is the right kind of boring: focused Vitest suites for diagnostic session attention, diagnostics, stuck-session recovery runtime, and gateway run-loop restart behavior; codex review --uncommitted; git diff --check; and a live Codex harness test using OPENCLAW_LIVE_CODEX_HARNESS=1. The maintainer comment after merge says the important part plainly: diagnostics already identified the stuck turn correctly, but active embedded runs with a native tool call had no abort recovery path. The fix lets diagnostics abort once native tool-call age and session-progress age exceed threshold, and gives restart recovery a bounded grace window.
Self-inspection tools are control-plane calls, not ordinary shell commands
The specific repro has a larger lesson. Letting an agent call the platform that is currently running it is useful, and risky. openclaw sessions, openclaw status, and similar commands look like read-only introspection. But if they route through the same gateway, queue, auth, or session machinery as the active tool call, they can create reentrancy pressure. The agent is asking the control plane about itself from inside a tool execution managed by that same control plane. That can work. It also deserves circuit breakers.
Teams running OpenClaw should audit skills and tools that shell out to OpenClaw CLI commands from inside active agent turns. The goal is not to ban introspection. The goal is to classify it correctly. Platform self-inspection should use reentrancy-safe read paths where available, short timeouts where not, and recovery policies that assume nested control-plane calls can wedge. If an agent needs session state as part of ordinary reasoning, that should eventually be a designed API with backpressure and timeouts, not an accidental shell command that depends on the gateway being in a friendly mood.
Related work reinforces the same theme. PR #82378, opened minutes later, addresses a Codex completion-edge case where raw custom_tool_call_output notifications without a matching app-server item/completed should keep the short turn-completion watchdog armed instead of waiting for terminal idle timeout. That is not the same bug, but it is the same maturity curve: agent runtimes have to reason about partial protocol progress, missing terminal signals, and tool-call age. “The model emitted something” is not always progress. “The tool is active” is not always liveness.
For operators, the immediate action is straightforward: move to a build containing #82369 if you run Codex-heavy or tool-heavy OpenClaw agents, especially in always-on Telegram or channel contexts. Then test recovery, not just observability. Create a safe stuck-tool scenario in staging. Confirm diagnostics classify it. Confirm the abort path fires. Confirm restart has a bounded drain window. If your runbook still ends with “restart and hope systemd kills it eventually,” the platform has not actually absorbed the lesson.
The industry point is broader than OpenClaw. Agent observability should close the loop. Runtime audit logs, stuck-turn detection, idle watchdogs, and restart drain policy are one design surface, not four dashboards. Classification without recovery is just a nicely formatted incident report. #82369 is the right direction because it turns “we can see the agent is wedged” into “we can abort the wedged embedded run without killing the whole gateway.” That is what production reliability looks like in this category: fewer heroic humans, more bounded failure modes.
The editorial take: the agent-runtime maturity test is not whether you can observe a stuck turn. It is whether the system can safely unwind it before the operator reaches for SIGKILL. OpenClaw just moved one failure mode from pager lore into runtime policy. More of that, please.
Sources: OpenClaw PR #82369, OpenClaw issue #81976, OpenClaw PR #82378