Subagent Completion Results Need a Durable Outbox, Not Three Fast Retries and Hope

Anatoliy Kolodkin

25 May 2026 • 4 min read

OpenClaw’s latest subagent bug is not interesting because a retry loop failed. Retry loops fail all the time. It is interesting because the platform treated a completed subagent result like an optional notification instead of a piece of workflow state that was already owed to the user.

That distinction is the line between a chat interface with helpers and a real multi-agent runtime. In issue #86488, the parent agent tells a Feishu group it is researching, suspends with sessions_yield, and waits while a child agent does the work. The subagent finishes roughly 85 seconds later. Then the announcement path back into the parent fails three times in rapid succession, each time with the error “completion agent did not deliver through the message tool.” OpenClaw reaches maxAnnounceRetryCount: 3, gives up, and later discards the suspended delivery as expired.

The bad part is not the first failure. The bad part is that the system had a completed result and no durable obligation to deliver it.

The result was produced; the platform just lost the handoff

The report is unusually useful because it includes operational detail rather than vibes. The environment was OpenClaw 2026.5.19, Feishu group chat, claude-opus-4-7, macOS Darwin 25.4.0. The logs show a clean sequence: Feishu dispatch completes with queuedFinal=true, the subagent completion direct announce fails, retries happen at 04:51:21 and 04:51:23, and the registry gives up and logs “announce give up (retry-limit) retries=3 endedAgo=85s” before the suspended delivery is discarded as expired at 04:57:20.

Labels on the issue tell the same story in maintainer language: P1, impact:session-state, impact:message-loss, clawsweeper:source-repro, and issue-rating: 🦞 diamond lobster. ClawSweeper did not wave it away as a Feishu adapter problem. Its source review says current main has partial durable delivery state for successful keep-mode completions, but the retry-limit path can still move results into suspended delivery that normal resume skips and the sweeper can later discard.

That is exactly the wrong default for an orchestration primitive. Once a parent delegates work to a subagent, the completion is not a best-effort chat message. It is a workflow output. If the notification path is unavailable, the result should become pending, inspectable, and redeliverable. It should not vanish because three attempts happened inside a tiny window while the parent re-entry path was unhealthy.

Three fast retries are not durability

The obvious workaround is to raise the retry count. That would be the wrong lesson. More immediate retries mostly buy more model calls, more log noise, and the same loss mode delayed by a few seconds. A durable delivery contract needs a different shape: persist the completed result before attempting the announcement, track delivery state separately from task state, use backoff, survive Gateway restarts, and expose stuck completions to the operator.
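To make the backoff point concrete, here is a minimal sketch of a retry schedule that spreads attempts over minutes instead of seconds. The function name, parameters, and default values are illustrative assumptions, not OpenClaw configuration.

```python
import random


def backoff_delays(attempts, base=2.0, factor=2.0, cap=300.0, jitter=0.0, rng=None):
    """Return the wait in seconds before each retry (hypothetical helper).

    Each delay grows exponentially and is capped. `jitter` is the fraction of
    random extra delay added so many stuck deliveries do not retry in lockstep.
    """
    rng = rng or random.Random()
    delays = []
    for n in range(attempts):
        delay = min(cap, base * factor ** n)
        delay += delay * jitter * rng.random()
        delays.append(delay)
    return delays


schedule = backoff_delays(8)
print(schedule)            # [2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0, 256.0]
print(sum(schedule) / 60)  # 8.5 (minutes covered by eight attempts)
```

Under these assumed defaults, eight attempts cover about eight and a half minutes, which gives a briefly unhealthy re-entry path time to recover. The schedule only helps if the result being retried is already persisted.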

This is where the reporter’s comparison to LangGraph persistence and the transactional outbox pattern lands. In ordinary distributed systems, side effects fail. You do not write “try HTTP three times and hope” as the only record that an order confirmation, email, webhook, or downstream event must be delivered. You write the obligation to a durable outbox and let a delivery worker process it until success, expiry by policy, or manual intervention. Agent runtimes are now sophisticated enough to need the same boring machinery.
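The outbox pattern described above can be sketched with SQLite from the Python standard library. This is an illustration under stated assumptions, not OpenClaw's storage layer: the table names, column names, and the announce callback are all hypothetical.

```python
import json
import sqlite3

# Hypothetical schema: the result and its delivery obligation live side by side.
SCHEMA = """
CREATE TABLE IF NOT EXISTS completions (
    child_run_id      TEXT PRIMARY KEY,
    parent_session_id TEXT NOT NULL,
    result_json       TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS outbox (
    child_run_id TEXT PRIMARY KEY REFERENCES completions(child_run_id),
    status       TEXT NOT NULL DEFAULT 'pending',  -- pending | delivered
    attempts     INTEGER NOT NULL DEFAULT 0,
    last_error   TEXT
);
"""


def open_store(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def record_completion(conn, child_run_id, parent_session_id, result):
    # One transaction: the result and the obligation commit together or not at all.
    with conn:
        conn.execute(
            "INSERT INTO completions VALUES (?, ?, ?)",
            (child_run_id, parent_session_id, json.dumps(result)),
        )
        conn.execute("INSERT INTO outbox (child_run_id) VALUES (?)", (child_run_id,))


def deliver_pending(conn, announce):
    """Try every pending row once; failures stay pending for the next pass."""
    rows = conn.execute(
        "SELECT o.child_run_id, c.parent_session_id, c.result_json "
        "FROM outbox o JOIN completions c ON o.child_run_id = c.child_run_id "
        "WHERE o.status = 'pending'"
    ).fetchall()
    for child_run_id, parent_session_id, result_json in rows:
        try:
            announce(parent_session_id, json.loads(result_json))
        except Exception as exc:
            with conn:
                conn.execute(
                    "UPDATE outbox SET attempts = attempts + 1, last_error = ? "
                    "WHERE child_run_id = ?",
                    (str(exc), child_run_id),
                )
        else:
            with conn:
                conn.execute(
                    "UPDATE outbox SET status = 'delivered', attempts = attempts + 1 "
                    "WHERE child_run_id = ?",
                    (child_run_id,),
                )


def failing_announce(parent_session_id, result):
    raise RuntimeError("completion agent did not deliver through the message tool")


conn = open_store()
record_completion(conn, "run-1", "parent-1", {"summary": "research done"})
deliver_pending(conn, failing_announce)  # fails; the row stays pending
delivered = []
deliver_pending(conn, lambda parent, result: delivered.append((parent, result)))
print(delivered)
```

A failed announcement only updates bookkeeping, so the result survives until a later pass succeeds. Pointing `open_store` at a file path instead of the in-memory default is what would let pending rows outlive a process restart. Delivery in this sketch is at-least-once, and pacing between passes is left out.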

The parallel is not academic. A parent agent suspension is effectively a workflow checkpoint. A subagent completion is an event. The delivery back to the parent is a side effect. If OpenClaw wants sessions_yield to be more than a convenience wrapper, it needs to make those states explicit. “Child completed but parent announcement failed” is a first-class state, not a transient log line.
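One way to make those states explicit is to track delivery separately from task status, so a failed announcement can never rewrite what the child actually did. The sketch below is hypothetical: the enum values, class name, and allowed transitions are assumptions chosen for illustration, not OpenClaw's registry model.

```python
from dataclasses import dataclass
from enum import Enum


class TaskState(Enum):
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"


class DeliveryState(Enum):
    NOT_READY = "not_ready"   # child has not produced a result yet
    PENDING = "pending"       # result exists and is owed to the parent
    DELIVERED = "delivered"
    EXPIRED = "expired"       # reached only through an explicit policy


# Assumed transition table; EXPIRED -> PENDING models a manual redeliver.
ALLOWED = {
    DeliveryState.NOT_READY: {DeliveryState.PENDING},
    DeliveryState.PENDING: {
        DeliveryState.PENDING,
        DeliveryState.DELIVERED,
        DeliveryState.EXPIRED,
    },
    DeliveryState.DELIVERED: set(),
    DeliveryState.EXPIRED: {DeliveryState.PENDING},
}


@dataclass
class ChildRun:
    run_id: str
    task: TaskState = TaskState.RUNNING
    delivery: DeliveryState = DeliveryState.NOT_READY
    attempts: int = 0

    def complete(self):
        self.task = TaskState.COMPLETED
        self._move(DeliveryState.PENDING)

    def announce_failed(self):
        # Only delivery bookkeeping changes; the task stays COMPLETED.
        self._move(DeliveryState.PENDING)
        self.attempts += 1

    def announce_succeeded(self):
        self._move(DeliveryState.DELIVERED)
        self.attempts += 1

    def _move(self, target):
        if target not in ALLOWED[self.delivery]:
            raise ValueError(
                f"illegal delivery transition {self.delivery.name} -> {target.name}"
            )
        self.delivery = target


run = ChildRun("run-1")
run.complete()
for _ in range(3):
    run.announce_failed()
print(run.task.name, run.delivery.name, run.attempts)  # COMPLETED PENDING 3
```

After three failed announcements the run is still completed and still pending, which is the state an operator would need to see and act on.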

There is also an idempotency problem hiding here. A correct fix cannot simply replay messages blindly. If an announcement succeeds but the acknowledgment path fails, redelivery could duplicate the completion. The runtime needs delivery receipts or deterministic message ids so a parent can safely accept a completion once. Again: this is workflow-engine territory, not prompt-engineering territory.
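A minimal sketch of the deterministic-id approach follows, assuming the runtime can identify a completion by its parent session and child run. The function, the class, and the id format are hypothetical, and the seen-id set would need to be durable in a real runtime.

```python
import hashlib


def completion_message_id(parent_session_id, child_run_id):
    """Derive a stable id: the same completion always maps to the same id."""
    raw = f"{parent_session_id}\x1f{child_run_id}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()[:32]


class ParentInbox:
    """Hypothetical receiving side that accepts each completion at most once."""

    def __init__(self):
        self._seen = set()  # in-memory here; a real runtime would persist this
        self.accepted = []

    def accept(self, message_id, result):
        """Return True if the result was newly accepted, False for a duplicate."""
        if message_id in self._seen:
            return False
        self._seen.add(message_id)
        self.accepted.append(result)
        return True


inbox = ParentInbox()
msg_id = completion_message_id("parent-1", "run-1")
print(inbox.accept(msg_id, {"summary": "research done"}))  # True
print(inbox.accept(msg_id, {"summary": "research done"}))  # False (redelivery ignored)
print(len(inbox.accepted))                                 # 1
```

Because the id is derived from the completion rather than from the delivery attempt, a sender that never saw an acknowledgment can retry safely and the parent records the result once.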

What operators should demand from multi-agent platforms

For teams building on OpenClaw today, the lesson is to audit any workflow where subagents do meaningful work while the parent is suspended. The test is not “does it work in the happy path?” The test is: what happens if the child completes while the channel adapter is flaky, the parent re-entry model turn fails, memory pressure is high, or the Gateway restarts? Can you see the completed child result? Can you redeliver it? Does it have a trace id? Is it attached to the original parent task, or only implied by logs?

Those questions matter because silent completion loss corrupts the human operator’s mental model. A visible subagent failure is annoying but honest. A completed subagent whose output never reaches the user is worse: the platform did the expensive part, threw away the value, and left everyone believing the work might still be in progress. That is how trust erodes in systems people otherwise want to automate.

OpenClaw should treat this as a product problem, not only a bug. The eventual fix should include a pending-completions view, traceable delivery attempts, configurable retry policy with backoff, startup recovery for undelivered outputs, and a manual redeliver action. Operators should be able to answer “what work completed but has not been delivered?” without grepping logs at 4 a.m.

The broader industry should pay attention because every multi-agent framework is walking toward this same cliff. Delegation demos are easy: agent A asks agent B, agent B replies, everyone claps. Production delegation is harder: agent B finishes after the user leaves, the Slack token refreshes, the parent session compacts, the process restarts, the channel rejects a message, and the system still needs to know that a result exists. The difference between those two worlds is not model intelligence. It is state management.

OpenClaw’s bug is fixable. The important part is naming the contract correctly. A completed subagent result is not a notification. It is durable work product awaiting delivery.

Sources: OpenClaw issue #86488, OpenClaw PR #86491, LangGraph persistence docs, Transactional Outbox pattern