OpenClaw's MCP Cron Leak Fix Shows Why Agent Observability Has to Include Tool Runtime Lifecycles

MCP makes agent systems more useful by standardizing tool access. It also gives every scheduled agent run more runtime lifecycle to clean up. OpenClaw PR #87981 is a useful reminder that “the cron timed out” is not the end of the story if the cron launched MCP servers, opened transports, created client sessions, and left child processes behind.

The bug fixed by the PR is an availability failure with a slow-burn shape. Isolated cron runs with MCP servers could time out, disappear from OpenClaw’s active-run registry, and leave their MCP runtimes alive. The scheduler then saw empty concurrency slots and launched more runs. The machine, meanwhile, saw accumulating orphaned runtimes and transports. Eventually Gateway could become unreachable even while systemd still reported the service as active. That is the exact operational smell that turns “agents are autonomous” into “why is this box wedged again?”

The registry said empty; the process table disagreed

The linked issue, #87821, was created at 2026-05-29T00:00:40Z and described OpenClaw 2026.5.27 Gateway becoming unreachable after several hours with isolated cron jobs. PR #87981 followed at 2026-05-29T09:33:22Z. The summary is direct: isolated cron runs create a fresh session and MCP runtime per execution; timeout cleanup removed the run from the active registry but did not retire the associated MCP runtime. With default concurrency at 8, the scheduler could keep creating new work while old resources remained alive.

That mismatch is the important failure amplifier. Many systems track logical work and physical resources separately. That is fine until cleanup updates one ledger and not the other. In this case, the logical ledger said “slot free” while the physical runtime still held MCP state. After enough repetitions, the system is overloaded not because one run went bad but because every “handled” timeout left another live runtime behind.
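The two-ledger mismatch is easy to reproduce in a toy model. The sketch below is not OpenClaw code: the `Scheduler` class, its fields, and the `retire_on_timeout` switch are hypothetical names invented for illustration. It assumes only what the PR summary states, namely that each isolated run gets a fresh runtime and that timeout cleanup frees the slot.

```python
from dataclasses import dataclass, field


@dataclass
class Scheduler:
    """Toy model: a logical ledger (active runs) and a physical ledger (runtimes)."""
    max_concurrency: int = 8
    retire_on_timeout: bool = False  # False models the pre-fix behavior
    active_runs: set[int] = field(default_factory=set)
    live_runtimes: set[int] = field(default_factory=set)
    next_id: int = 0

    def start_run(self) -> bool:
        # Admission control consults only the logical ledger.
        if len(self.active_runs) >= self.max_concurrency:
            return False
        run_id = self.next_id
        self.next_id += 1
        self.active_runs.add(run_id)
        self.live_runtimes.add(run_id)  # each isolated run gets a fresh runtime
        return True

    def on_timeout(self, run_id: int) -> None:
        self.active_runs.discard(run_id)  # the slot is freed either way
        if self.retire_on_timeout:
            self.live_runtimes.discard(run_id)


def simulate(ticks: int, retire_on_timeout: bool) -> tuple[int, int]:
    """Start one run per tick, time it out, return (active runs, live runtimes)."""
    sched = Scheduler(retire_on_timeout=retire_on_timeout)
    for _ in range(ticks):
        if sched.start_run():
            sched.on_timeout(sched.next_id - 1)
    return len(sched.active_runs), len(sched.live_runtimes)
```

With retirement disabled, `simulate(360, False)` ends with zero active runs and 360 live runtimes. The concurrency limit never engages because the only ledger it reads is always empty. With retirement enabled, both ledgers end at zero.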

The patch adds MCP runtime retirement on both timeout cleanup and normal isolated-run disposal, then bounds disposeSession with a 5,000 ms deadline. Later updates added a test for the nasty case: an MCP server that ignores shutdown signals. After commit 689c879cfe, the timeout path actively closes the transport and client instead of merely resolving a wrapper while leaving cleanup hanging in the background. That distinction matters. Polite shutdown is an attempt. Force-close is the reliability strategy when the other side stops cooperating.
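The polite-then-forceful pattern can be sketched in a few lines. This is an illustration of the general technique, not the PR's implementation: `FakeRuntime`, `shutdown`, `force_close`, and `retire_runtime` are hypothetical names, and the only detail borrowed from the article is the 5-second default deadline.

```python
import asyncio


class FakeRuntime:
    """Hypothetical stand-in for an MCP runtime; not OpenClaw's actual class."""

    def __init__(self, ignores_shutdown: bool) -> None:
        self.ignores_shutdown = ignores_shutdown
        self.closed = False
        self.forced = False

    async def shutdown(self) -> None:
        # Polite path: a cooperative server exits, a hung one never answers.
        if self.ignores_shutdown:
            await asyncio.Event().wait()  # blocks until cancelled
        self.closed = True

    def force_close(self) -> None:
        # Forceful path: close transport and client without waiting for the peer.
        self.closed = True
        self.forced = True


async def retire_runtime(runtime: FakeRuntime, deadline_s: float = 5.0) -> None:
    """Try graceful shutdown, then force-close if the deadline expires."""
    try:
        # wait_for cancels the pending shutdown on timeout, so nothing is
        # left running in the background after this function returns.
        await asyncio.wait_for(runtime.shutdown(), timeout=deadline_s)
    except asyncio.TimeoutError:
        runtime.force_close()
```

The design point is that the deadline does more than unblock the caller. When it expires, the pending graceful attempt is cancelled and the resource is closed by force, so the function returns only once the runtime is actually gone.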

MCP observability has to count lifecycles, not just calls

The official MCP specification describes stateful host/client/server connections over JSON-RPC, with tools, resources, prompts, sampling, roots, elicitation, progress, cancellation, error reporting, and logging. That is a lot of useful surface area. It is also a lot of state to account for when a scheduled run times out. If an observability system stops at “cron run failed” or “tool call timed out,” it misses the runtime residue that causes the next incident.

The PR’s proof environment is refreshingly operational: Docker image openclaw-patched:test, Linux arm64, Node.js v24.14.0, Anthropic-compatible gateway, isolated cron every 60 seconds, timeoutSeconds=20, delivery disabled, and a stdio MCP server named proof-echo. The patch surface is also not hand-wavy: ClawSweeper summarized +43 source lines and +427 test lines, with 97 tests listed in the PR body across five test suites. That ratio is the right shape for lifecycle bugs. The code change can be small; proving all exits clean up is where the work lives.

For operators, the metrics to collect are straightforward: active-run registry size, MCP runtime count, child process count, transport count, disposal latency, forced-close count, cron timeout count, and Gateway health latency. Alert on contradictions, not just thresholds. If active runs are zero but MCP runtime or child process counts remain nonzero, something is leaking. If forced-close count spikes, a server may be ignoring shutdown or a tool path may be timing out too aggressively. If Gateway health checks are slow while systemd says active, look for resource leaks before assuming the service manager is lying.
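A contradiction check of this kind might look like the sketch below. Everything in it is assumed for illustration: the `Snapshot` field names, the alert strings, and the threshold defaults are placeholders, not metrics OpenClaw is known to export.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """One scrape of the counters listed above; field names are illustrative."""
    active_runs: int
    mcp_runtimes: int
    child_processes: int
    forced_closes_delta: int   # forced closes since the previous scrape
    gateway_health_ms: float
    service_active: bool       # what the service manager reports


def contradictions(s: Snapshot, forced_close_limit: int = 3,
                   health_limit_ms: float = 2000.0) -> list[str]:
    """Return alerts for ledgers that disagree; thresholds are placeholders."""
    alerts = []
    if s.active_runs == 0 and (s.mcp_runtimes > 0 or s.child_processes > 0):
        alerts.append("leak: no active runs but runtimes or child processes remain")
    if s.forced_closes_delta > forced_close_limit:
        alerts.append("forced-close spike: server ignoring shutdown or timeout too tight")
    if s.service_active and s.gateway_health_ms > health_limit_ms:
        alerts.append("gateway slow while service reports active: check for leaks")
    return alerts
```

A snapshot such as `Snapshot(0, 12, 12, 0, 150.0, True)` trips only the leak alert: every individual number looks modest, but the registry and the process table disagree.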

The security angle is resource ownership

This is an availability bug, but it belongs in the same conversation as MCP security checklists. Tool runtimes are not just APIs; they are authority-bearing resources. An orphaned MCP server may hold file handles, credentials, sockets, local process state, or provider sessions. Even if it does nothing malicious, it is still runtime authority outside the scheduler’s accounting. That is a governance failure.

The practitioner takeaway is to design cleanup paths as a matrix, not a callback. Success, model error, provider timeout, user cancellation, cron timeout, Gateway restart, crash recovery, hung transport, and manual dispose should all converge on explicit resource retirement. If the only tested path is “happy run finishes and then dispose,” the platform is testing the path that least resembles production.
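One way to make the matrix concrete is to route every in-process exit through a single retirement point and then test each row. The sketch below assumes a simplified model: `isolated_run`, `execute`, `leaked_paths`, the `Cancelled` exception, and the exit-path names are all hypothetical, and the `live_runtimes` set stands in for real processes and transports.

```python
from contextlib import contextmanager

EXIT_PATHS = ["success", "model_error", "provider_timeout", "user_cancel", "cron_timeout"]

live_runtimes: set[str] = set()  # stand-in for the physical ledger


class Cancelled(Exception):
    """Hypothetical signal for user cancellation."""


@contextmanager
def isolated_run(run_id: str):
    """Own the runtime for exactly the lifetime of the run, whatever the exit."""
    live_runtimes.add(run_id)
    try:
        yield run_id
    finally:
        live_runtimes.discard(run_id)  # single retirement point for every exit


def execute(run_id: str, exit_path: str) -> str:
    """Simulate one run that ends via the given exit path."""
    try:
        with isolated_run(run_id):
            if exit_path == "success":
                return "ok"
            if exit_path in ("provider_timeout", "cron_timeout"):
                raise TimeoutError(exit_path)
            if exit_path == "user_cancel":
                raise Cancelled(exit_path)
            raise RuntimeError(exit_path)
    except (TimeoutError, Cancelled, RuntimeError):
        return "failed"


def leaked_paths() -> list[str]:
    """Run every row of the matrix and report paths that left a runtime behind."""
    leaked = []
    for path in EXIT_PATHS:
        live_runtimes.clear()
        execute(f"run-{path}", path)
        if live_runtimes:
            leaked.append(path)
    return leaked
```

A `finally` block covers only exits the process survives. Gateway restart and crash recovery need a separate mechanism, such as a startup sweep for runtimes no registry entry claims, which this sketch does not attempt.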

This also changes how teams should evaluate MCP-heavy agent platforms. Ask whether the runtime exposes MCP lifecycle state, not just whether it supports MCP. Ask whether cleanup has deadlines. Ask whether hung stdio servers can be force-closed. Ask whether timeout cleanup and normal disposal share the same resource retirement logic. Ask whether cron isolation creates truly isolated lifecycles or just fresh logical sessions attached to forgotten physical resources.

PR #87981 is not a flashy feature. Good. Flashy features are not what keep a Gateway alive after six hours of scheduled agent work. The useful work is making sure every tool runtime the agent creates is owned, observed, or retired. Anything else is an incident with a nicer protocol name.

LGTM take: the story is MCP lifecycle observability. Tool protocols are not just API surfaces; they are runtime resources. If an agent platform cannot account for them after timeout, “agent observability” is mostly vibes.

Sources: OpenClaw PR #87981, OpenClaw issue #87821, Model Context Protocol specification, OpenClaw v2026.5.28-beta.2