OpenClaw’s WhatsApp Stall Shows Why Agent Platforms Need Delivery Semantics, Not Just Model Timeouts

Agent reliability work keeps making the same point in different costumes: the runner state is not the product state. Issue #84569 reports a WhatsApp direct-session failure where a long model call stalls, a follow-up message queues behind it, the run terminates as an incomplete turn with payloads=0, and the human never receives the fallback error. Internally, OpenClaw appears to know something went wrong. Externally, WhatsApp shows a black hole. Users do not read your run registry. They read the chat.

This is why “model timeout” is too small a frame. The product contract is not “did the model eventually stop?” or “did the embedded runner create an error payload?” The contract is: did the human receive a response, a retry notice, a visible failure, or a still-working signal in the same channel where they asked? If not, the system failed, even if every internal subsystem has a defensible explanation.

The logs describe a channel contract breaking

The issue was opened May 20 at 12:22 UTC against OpenClaw 2026.5.18 (50a2481) and carries the labels that matter: P1, impact:session-state, impact:message-loss, and source-repro. The reported path is a WhatsApp direct session. A long-running model call is active. Subsequent inbound messages queue behind that active work. Eventually the turn is detected as incomplete, but no outbound WhatsApp send logs appear after the incomplete-turn detection.

The timeline is concrete. At 11:54:00, an inbound WhatsApp message arrives. At 11:56:25, liveness logs show the long-running session at age=142s, queueDepth=1, and activeWorkKind=model_call. At 11:57:55, the session is classified as stalled_agent_run at age=232s. At 11:58:06, OpenClaw detects an incomplete turn with stopReason=stop and payloads=0. At 11:58:24, the WhatsApp web connection closes with status 428. From the user’s point of view, nothing useful happens.
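
As a quick consistency check on that timeline, subtracting each logged age from its timestamp should give the same start time for the active work. The sketch below uses only the values quoted above; `implied_start` is a helper invented here, and reading `age` as seconds since the active work began is an assumption.

```python
from datetime import datetime, timedelta

def implied_start(timestamp: str, age_seconds: int) -> str:
    """Return the HH:MM:SS at which work began, given a log time and its age."""
    logged_at = datetime.strptime(timestamp, "%H:%M:%S")
    return (logged_at - timedelta(seconds=age_seconds)).strftime("%H:%M:%S")

# Values quoted from the issue timeline.
first = implied_start("11:56:25", 142)
second = implied_start("11:57:55", 232)
print(first, second)  # 11:54:03 11:54:03
```

Under that assumption, both liveness readings describe the same piece of work, started about three seconds after the inbound message, rather than a run that restarted in between.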

ClawSweeper’s source inspection adds the important bridge: current main still filters WhatsApp isError reply payloads before delivery, even though the embedded runner returns a user-facing error payload for incomplete turns. That means the system can do the right thing at the runner boundary and still fail at the channel boundary. The fallback exists in the wrong layer. Or, more precisely, the fallback is not receipt-backed all the way to the channel.
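
The failure shape can be sketched in a few lines. This is an illustration, not OpenClaw's code: `ReplyPayload`, `deliverable_dropping_errors`, and `deliverable_keeping_errors` are hypothetical names, the fallback text is a placeholder, and the only detail taken from the issue is that payloads flagged isError are filtered out before the WhatsApp send.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ReplyPayload:
    text: str
    is_error: bool = False

def deliverable_dropping_errors(payloads):
    """Mimics the reported behavior: error payloads never reach the channel."""
    return [p for p in payloads if not p.is_error]

def deliverable_keeping_errors(payloads):
    """Alternative: every non-empty payload is eligible for delivery, errors included."""
    return [p for p in payloads if p.text.strip()]

# The runner produces a user-facing fallback for the incomplete turn (placeholder text).
runner_output = [ReplyPayload("Something went wrong. Please try again.", is_error=True)]

dropped = deliverable_dropping_errors(runner_output)
kept = deliverable_keeping_errors(runner_output)
print(len(dropped), len(kept))  # 0 1
```

With the first filter, the runner did its job and the channel still has nothing to send, which matches the absence of outbound send logs after incomplete-turn detection.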

Silence is the most expensive failure mode

WhatsApp is a particularly unforgiving place for this bug because messaging channels are social interfaces, not logs. They have mobile expectations, connection state, typing indicators, rate limits, and humans who interpret silence as either indifference or failure. A long model call at 142 seconds with a queued follow-up is already beyond the comfort zone for most assistants. By 232 seconds, the runtime should either be confidently progressing, visibly waiting, or sending a clear fallback. Dropping the error payload is the worst compromise: the backend has enough information to recover, but the user gets the UX of abandonment.

The nearby PR #84371 is useful context because it addresses the same class of problem for generated media. In that case, image, video, or music generation can complete successfully while the normal requester-agent/message-tool path fails to attach or deliver the artifact. The proposed fix keeps the task active until completion delivery is confirmed, adds duplicate guards, and falls back to direct channel delivery with receipts. The WhatsApp stall needs the same mental model. A run is not terminal until the originating channel has a durable, observable outcome.
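
One way to encode that mental model is to make the terminal state depend on a recorded delivery outcome rather than on the runner finishing. The sketch below is hypothetical throughout (`Run`, `DeliveryOutcome`, and `record_outcome` are not OpenClaw names); it also includes a simple duplicate guard keyed on a payload ID, in the spirit of the guards the PR describes.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional

class DeliveryOutcome(Enum):
    DELIVERED = "delivered"
    FAILED = "failed"  # a durable, visible failure still counts as an outcome

@dataclass
class Run:
    runner_finished: bool = False
    outcome: Optional[DeliveryOutcome] = None
    seen_payload_ids: set = field(default_factory=set)

    def record_outcome(self, payload_id: str, outcome: DeliveryOutcome) -> bool:
        """Record a channel outcome once per payload; return False for duplicates."""
        if payload_id in self.seen_payload_ids:
            return False
        self.seen_payload_ids.add(payload_id)
        self.outcome = outcome
        return True

    @property
    def is_terminal(self) -> bool:
        # The runner stopping is necessary but not sufficient.
        return self.runner_finished and self.outcome is not None

run = Run(runner_finished=True)
before = run.is_terminal                                         # False
first = run.record_outcome("p1", DeliveryOutcome.DELIVERED)      # True
duplicate = run.record_outcome("p1", DeliveryOutcome.DELIVERED)  # False
after = run.is_terminal                                          # True
```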

For operators, this suggests three staging tests that should become standard. First, force a long model call and send a second WhatsApp message mid-turn; verify the queued message does not trigger context loss, queue starvation, or silent failure. Second, force an incomplete turn and confirm the originating channel receives exactly one fallback error, not zero and not three. Third, interrupt the WhatsApp sidecar during recovery and verify retry or durable failure accounting. If your logs show an error payload but no outbound send attempt, you do not have a reliability feature. You have an internal confession.
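
The second test is the easiest to automate. A minimal sketch, assuming a fake channel adapter that records sends: `FakeChannel` and `handle_incomplete_turn` are stand-ins invented here for whatever your harness drives, and the fallback text is a placeholder.

```python
class FakeChannel:
    """Records every outbound send so a test can count them."""
    def __init__(self):
        self.sent = []

    def send(self, chat_id: str, text: str) -> None:
        self.sent.append((chat_id, text))

def handle_incomplete_turn(channel, chat_id, turn_id, already_notified):
    """Hypothetical handler: send one fallback per turn, even if invoked repeatedly."""
    if turn_id in already_notified:
        return
    already_notified.add(turn_id)
    channel.send(chat_id, "I could not finish that request. Please try again.")

def count_fallbacks(invocations: int) -> int:
    """Invoke the handler N times for the same turn and count what the user would see."""
    channel, notified = FakeChannel(), set()
    for _ in range(invocations):
        handle_incomplete_turn(channel, "chat-1", "turn-1", notified)
    return len(channel.sent)

# Recovery paths often fire more than once; the user should still see one message.
assert count_fallbacks(1) == 1
assert count_fallbacks(3) == 1
```

Against a real deployment the assertion is the same, but the count comes from outbound send logs or platform receipts for the originating chat rather than from an in-memory list.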

For platform builders, the architecture target is end-to-end delivery semantics. Create payload. Route payload. Attempt channel send. Capture platform receipt or failure. Mirror the result into the transcript/audit log. Retry only where idempotency is safe. Surface the terminal state to the parent session or user. That chain should apply to normal replies, error replies, async media artifacts, subagent completions, and queued follow-up notices. The fact that a payload is marked isError should make delivery more important, not less. Error messages are often the only recovery affordance the user gets.
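
The chain above can be sketched as a single function. Everything here is an assumption for illustration: `send_fn` stands in for a channel adapter that returns a receipt ID or raises, `audit` is a plain list standing in for the transcript or audit log, and the retry count is arbitrary. Retrying with the same key is only safe if the adapter actually deduplicates on it.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class DeliveryResult:
    payload_id: str
    delivered: bool
    receipt: Optional[str]
    attempts: int

def deliver(payload_id: str, text: str, is_error: bool,
            send_fn: Callable[[str, str], str],
            audit: list, max_attempts: int = 3) -> DeliveryResult:
    """Attempt a channel send, capture receipt or failure, and mirror it to the audit log."""
    receipt = None
    attempts = 0
    for attempts in range(1, max_attempts + 1):
        try:
            # payload_id doubles as the idempotency key across retries.
            receipt = send_fn(payload_id, text)
            break
        except ConnectionError as exc:
            audit.append({"payload": payload_id, "attempt": attempts, "error": str(exc)})
    result = DeliveryResult(payload_id, receipt is not None, receipt, attempts)
    # is_error is recorded in the log but never used to skip the send.
    audit.append({"payload": payload_id, "is_error": is_error,
                  "delivered": result.delivered, "receipt": receipt})
    return result

# Usage: a hypothetical adapter that fails once, then succeeds.
calls = {"n": 0}
def flaky_send(key: str, text: str) -> str:
    calls["n"] += 1
    if calls["n"] == 1:
        raise ConnectionError("connection closed")
    return f"receipt-{key}"

log: list = []
outcome = deliver("p1", "I could not finish that request.", True, flaky_send, log)
print(outcome.delivered, outcome.attempts, len(log))  # True 2 2
```

The caller gets a result whether or not the send succeeded, so surfacing the terminal state to the parent session or user becomes a matter of inspecting `delivered` instead of inferring it from the absence of logs.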

There is also a design question around “still working” messages. Agents should not spam users with progress theater, but channel silence during multi-minute model calls creates exactly the behavior operators hate: users send another message, the queue deepens, and the runtime now has to handle concurrency under degraded conditions. A lightweight, channel-specific “still working; your next message is queued” state can be less annoying than a black box. The key is honesty: do not fake progress, do not claim work is done, and do not hide stalled runs behind typing indicators that expire.
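
A sketch of how such a notice might be gated. The thresholds and names (`should_send_still_working`, `notice_after_s`, `min_gap_s`) are invented for illustration and are not taken from OpenClaw; the inputs mirror the liveness fields quoted in the issue (age, queueDepth) plus a hypothetical stalled flag.

```python
from typing import Optional

def should_send_still_working(age_s: float, queue_depth: int, stalled: bool,
                              last_notice_age_s: Optional[float],
                              notice_after_s: float = 60.0,
                              min_gap_s: float = 120.0) -> bool:
    """Decide whether to tell the user the run is still active.

    A stalled run should get a visible failure instead, never a progress notice.
    """
    if stalled:
        return False
    if age_s < notice_after_s:
        return False
    # Rate-limit: at most one notice per min_gap_s of run age.
    if last_notice_age_s is not None and age_s - last_notice_age_s < min_gap_s:
        return False
    # Only speak up when the user is demonstrably waiting on a queued message.
    return queue_depth > 0
```

With these made-up defaults, the 142-second reading with one queued message would trigger a single notice, while the 232-second stalled reading would not, leaving that case to the fallback error path.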

The editorial take is simple: “completed” is not a runner state. It is a delivery state. OpenClaw is rapidly becoming a multi-channel agent runtime, which means every channel adapter is part of the reliability boundary. WhatsApp failures are not less important because they sit outside the model. They are more important because they are where the human learns whether the agent exists.

Sources: OpenClaw issue #84569, PR #84371, issue #84053, issue #84489