OpenClaw’s Harness Error Is Lying Under Load, and That Is Worse Than the Failure

Anatoliy Kolodkin

24 May 2026 • 4 min read

The bug in OpenClaw issue #86239 is not that a Telegram message failed under load. Systems fail under load. The bug is that the runtime appeared to tell the operator the wrong story about why it failed: MissingAgentHarnessError: Requested agent harness "claude-cli" is not registered, while other claude-cli work was succeeding in the same window.

That distinction matters. A missing harness is a configuration or startup problem. A harness that is present but unreachable because the Gateway event loop is saturated is an operations problem. They lead to different remediations, different alerts, and different human instincts at 2 a.m. OpenClaw currently seems capable of making those two situations look the same, which is exactly how an agent platform turns a recoverable overload event into a debugging séance.

The error string contradicts the runtime evidence

The report was filed against OpenClaw 2026.5.22, commit a374c3a, running a Telegram polling direct-message lane on a Linux VPS. The configured backend was cliBackends.claude-cli, with a single agents.list[main] entry using claude-cli/claude-opus-4-7 and Sonnet/Haiku fallbacks. Three inbound Telegram messages failed consecutively after roughly 19.6s, 20.7s, and 27.9s, each returning the same missing-harness error. About a minute later, another inbound message succeeded without operator intervention.

The awkward part is the concurrent evidence. During the same failure window, cron-triggered cli exec calls successfully used provider=claude-cli. One Haiku turn completed in 16553ms; one Sonnet turn completed in 6843ms. That does not look like a harness that vanished from the registry. It looks like a hot path under pressure failing to resolve or reach a registered harness quickly enough, then collapsing to the wrong error class.

The liveness metrics make the theory stronger. The load trigger was a long Claude Opus resume-session turn with durationMs=205933 and rawLines=2079, while simultaneous cron-driven CLI execs were also active. At the same second as the third failure, Telegram getMe hit a fetch-timeout after 10000ms, but the observed elapsed time was 16117ms. The log annotated the timer as delayed by 6117ms and called out likely event-loop starvation. The issue also records eventLoopUtilization=0.999 and eventLoopDelayP99Ms=7159.7. That is not a mildly busy process. That is a control plane trying to breathe through a coffee stirrer.
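The timer arithmetic in that log line is worth making explicit. The following is a minimal Python sketch, not OpenClaw code: the `TimerObservation` type, the helper names, and the 1000 ms tolerance are assumptions made for illustration. Only the 10000 ms timeout and 16117 ms elapsed figures come from the issue.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimerObservation:
    """One timeout as it was scheduled versus as it actually fired (hypothetical type)."""
    scheduled_ms: float  # timeout the caller asked for
    elapsed_ms: float    # wall-clock time until the timeout was actually observed


def timer_overshoot_ms(obs: TimerObservation) -> float:
    """How late the timer fired; 0.0 if it fired on time or early."""
    return max(0.0, obs.elapsed_ms - obs.scheduled_ms)


def looks_starved(obs: TimerObservation, tolerance_ms: float = 1000.0) -> bool:
    """Assumed heuristic: overshoot beyond the tolerance suggests a blocked loop."""
    return timer_overshoot_ms(obs) > tolerance_ms


# Figures reported in the issue for the Telegram getMe fetch-timeout.
get_me = TimerObservation(scheduled_ms=10000.0, elapsed_ms=16117.0)
print(timer_overshoot_ms(get_me))  # 6117.0
print(looks_starved(get_me))       # True
```

A timeout that fires more than six seconds after its own deadline says little about the remote API and a lot about the process that was supposed to run the timer.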

Control planes should not share fate with heavy turns

This is where agent platforms need to stop borrowing mental models from chat apps. In a chat app, a slow response is annoying. In an agent platform, the same Gateway may own inbound channel polling, harness dispatch, cron, resume sessions, session state, tool routing, Web UI sockets, and whatever background bookkeeping keeps the thing alive. A single long model turn that emits 2,079 raw lines is not just expensive text generation; it is runtime pressure on the same process that still needs to answer Telegram and classify the next dispatch failure honestly.

A related issue, #86242, shows the same pattern from another direction. On Windows 11 with Ollama and qwen3.5:9b using a 40k context window, OpenClaw produced liveness warnings with eventLoopDelayP99Ms=21156.1, eventLoopUtilization=1, and cpuCoreRatio=1.001. Telegram timed out and the Web UI/WebSocket path could disconnect. That issue was closed as a satellite of the canonical event-loop starvation tracker #83366, but it is useful evidence because it removes Claude CLI from the center of the story. Local inference can starve the same Gateway surface too.

The product lesson is not “never run local models” or “do not use Claude Opus.” The lesson is that heavy work and control-plane responsiveness need a contract. If embedded inference, CLI harness streaming, or giant resume-session replays can monopolize the event loop, then operator-facing dispatch errors must carry liveness context. A missing harness under normal load and a missing harness under eventLoopUtilization=0.999 are different incidents.

There is also a comparison point for teams evaluating coding agents. “Supports Claude Code,” “supports Ollama,” and “supports Telegram” are feature-table claims. The real question is whether those features share a failure domain. If a local 9B model can delay channel timers by twenty seconds, then the system is not merely local-first; it is local-inference-coupled-to-your-control-plane. That may be acceptable for a hobbyist box. It is a weak default for enterprise automation, where a Slack or Telegram ingress lane should not become collateral damage because a background coding turn got ambitious.

What operators should do now

First, if you run OpenClaw and see MissingAgentHarnessError, do not stop at checking whether the harness is configured. Correlate the error with eventLoopDelayP99Ms, eventLoopUtilization, delayed fetch timers, active model lanes, and cron concurrency. If the error appears during high event-loop pressure, treat it as a possible starvation symptom until proven otherwise. Restarting or reinstalling harness packages may only mask the real issue.
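One way to do that correlation offline is sketched below in Python. Everything here is an assumption for illustration: the `LivenessSample` shape, the 30-second window, and both thresholds are hypothetical and would need tuning against real logs. Only the utilization and P99 delay values are taken from the issue, and the timestamps are placeholders.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class LivenessSample:
    """One liveness reading parsed from logs (hypothetical shape)."""
    at_ms: int            # timestamp of the sample
    utilization: float    # eventLoopUtilization, 0.0 to 1.0
    delay_p99_ms: float   # eventLoopDelayP99Ms


def starvation_suspected(
    error_at_ms: int,
    samples: list[LivenessSample],
    window_ms: int = 30_000,           # assumed correlation window
    max_utilization: float = 0.9,      # assumed threshold
    max_delay_p99_ms: float = 1_000.0  # assumed threshold
) -> bool:
    """True if any sample near the error crosses either threshold."""
    for s in samples:
        if abs(s.at_ms - error_at_ms) > window_ms:
            continue  # too far from the error to be relevant
        if s.utilization >= max_utilization or s.delay_p99_ms >= max_delay_p99_ms:
            return True
    return False


# Metric values from the issue; timestamps are placeholders.
samples = [LivenessSample(at_ms=5_000, utilization=0.999, delay_p99_ms=7159.7)]
print(starvation_suspected(error_at_ms=6_000, samples=samples))    # True
print(starvation_suspected(error_at_ms=120_000, samples=samples))  # False
```

A `True` result does not prove starvation caused the error; it only says the missing-harness explanation should not be accepted at face value.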

Second, isolate heavy work where possible. Long-running resume sessions, local inference with huge context windows, and concurrent cron bursts should not all compete with channel I/O in the same timing-sensitive path. If isolation is not available, at least stagger cron jobs and cap parallel CLI work. The goal is not perfect utilization; it is preserving enough headroom for the Gateway to tell the truth while under load.
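Staggering and capping can be sketched generically. The Python example below uses asyncio purely as an illustration of the two ideas; it says nothing about how OpenClaw schedules cron or CLI work, and `staggered_offsets`, `run_capped`, and `fake_cli_exec` are hypothetical names.

```python
import asyncio


def staggered_offsets(job_count: int, period_s: float) -> list:
    """Spread job start times evenly across one period instead of firing together."""
    if job_count <= 0:
        return []
    step = period_s / job_count
    return [i * step for i in range(job_count)]


async def run_capped(jobs, max_parallel: int) -> int:
    """Run coroutine factories with at most max_parallel in flight.

    Returns the peak concurrency observed, so the cap can be checked.
    """
    gate = asyncio.Semaphore(max_parallel)
    in_flight = 0
    peak = 0

    async def guarded(job):
        nonlocal in_flight, peak
        async with gate:  # blocks while max_parallel jobs are already running
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                await job()
            finally:
                in_flight -= 1

    await asyncio.gather(*(guarded(j) for j in jobs))
    return peak


async def fake_cli_exec():
    await asyncio.sleep(0.01)  # stand-in for a CLI harness call


print(staggered_offsets(4, 60.0))  # [0.0, 15.0, 30.0, 45.0]
peak = asyncio.run(run_capped([fake_cli_exec] * 6, max_parallel=2))
print(peak)  # 2
```

The cap trades throughput for headroom, which is the point: six jobs finish later, but the loop is never asked to service more than two of them at once.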

Third, improve your alert labels. “Harness missing” is not actionable enough if the runtime was melting. Alerts should include the active lane, model/harness name, recent event-loop delay, concurrent job count, and whether external API timers were delayed beyond their own timeout. That turns a misleading exception into an incident with a shape.
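As a rough illustration of what such an alert payload could carry, here is a Python sketch. The `DispatchAlert` type and its field names are hypothetical, the lane name and concurrent job count are placeholders, and the remaining values are the ones reported in the issue.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class DispatchAlert:
    """Alert labels for a failed dispatch (hypothetical schema)."""
    error: str
    lane: str
    harness: str
    model: str
    event_loop_delay_p99_ms: float
    event_loop_utilization: float
    concurrent_jobs: int
    timer_timeout_ms: float
    timer_elapsed_ms: float

    @property
    def timer_overran(self) -> bool:
        """Did an external API timer fire later than its own timeout?"""
        return self.timer_elapsed_ms > self.timer_timeout_ms

    def labels(self) -> dict:
        """Flatten the fields plus the derived flag into one label dict."""
        out = asdict(self)
        out["timer_overran"] = self.timer_overran
        return out


alert = DispatchAlert(
    error="MissingAgentHarnessError",
    lane="telegram-dm",            # placeholder lane name
    harness="claude-cli",
    model="claude-opus-4-7",
    event_loop_delay_p99_ms=7159.7,
    event_loop_utilization=0.999,
    concurrent_jobs=3,             # placeholder; the issue gives no exact count
    timer_timeout_ms=10000.0,
    timer_elapsed_ms=16117.0,
)
print(alert.labels()["timer_overran"])  # True
```

With those labels attached, the on-call reader sees a saturated loop and an overrun timer alongside the error name instead of the error name alone.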

ClawSweeper kept #86239 open for maintainer review and noted that the exact starvation reproduction was not yet proven on current main. That caution is fair. But the evidence already points to the deeper product requirement: OpenClaw needs a failure taxonomy that separates absent, late, unreachable, and starved. Agent operators can live with failures. What they cannot live with is a runtime that lies, even accidentally, about which failure happened.
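To make the four-way taxonomy concrete, here is one possible Python sketch. The definitions of each class, the `classify` signature, and the 0.95 threshold are assumptions for illustration; the issue does not specify them and this is not OpenClaw's implementation.

```python
from enum import Enum
from typing import Optional


class HarnessFailure(Enum):
    ABSENT = "absent"            # not in the registry at all
    LATE = "late"                # reached, but past the deadline
    UNREACHABLE = "unreachable"  # registered, but could not be reached
    STARVED = "starved"          # failure observed while the loop was saturated


def classify(
    registered: bool,
    reached: bool,
    within_deadline: bool,
    loop_utilization: float,
    starvation_threshold: float = 0.95,  # assumed threshold
) -> Optional[HarnessFailure]:
    """Return a failure class, or None if dispatch succeeded."""
    ok = registered and reached and within_deadline
    if ok:
        return None
    # Under saturation, a failed lookup is not trustworthy evidence of absence,
    # so report starvation first and let the operator re-check under normal load.
    if loop_utilization >= starvation_threshold:
        return HarnessFailure.STARVED
    if not registered:
        return HarnessFailure.ABSENT
    if not reached:
        return HarnessFailure.UNREACHABLE
    return HarnessFailure.LATE


# Same lookup failure, two different incidents.
print(classify(False, False, False, loop_utilization=0.999).value)  # starved
print(classify(False, False, False, loop_utilization=0.2).value)    # absent
```

The ordering is the design choice that matters: starvation is checked before absence, so a saturated loop never gets to report a confident "not registered".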

Sources: OpenClaw issue #86239, OpenClaw issue #86242, OpenClaw issue #83366, OpenClaw issue #86227
