Qwen Code’s Nightly Fixes the Tool-Call Wedge That Breaks Local Agents
Local coding agents do not fail like web apps. They fail like terminals: halfway through a tool call, after the network hiccups, while a background process is still running, with a transcript that was valid five seconds ago and now gets rejected by the provider because one message is missing. Qwen Code’s May 22 nightly is worth paying attention to because it fixes exactly that class of failure.
The release note looks modest: four bullets, one of them a version-string sync. But the important change is PR #4176, “close tool_use↔tool_result invariant across all failure paths.” That sounds like protocol bookkeeping because it is. It is also the difference between an agent that recovers from real-world turbulence and an agent that wedges until a developer hand-edits JSONL history like a mechanic crawling under a running car.
Qwen Code v0.16.0-nightly.20260522.48b0a8bfc shipped on May 22 at 14:18 UTC. The repository has real adoption pressure behind it (roughly 24,600 GitHub stars, 2,400 forks, and more than 800 open issues at the time of writing), and it sits in the increasingly important open/local agent lane. That lane has different economics and different failure modes from hosted tools like Claude Code, Codex, Gemini CLI, or Copilot. Users are more likely to run mixed providers, Anthropic-compatible gateways, local model adapters, flaky network paths, and custom shells. Transcript repair is not a nice-to-have in that world. It is part of the runtime.
The bug is boring until it eats the session
The failure mode documented in PR #4176 is precise. An Anthropic-compatible server-sent-events stream (the PR names DeepSeek and api.anthropic.com as examples) can drop after a tool_use content block has closed but before the terminal message_stop arrives. By then, the function call has already been yielded to the runtime. The CLI records that a tool was requested, schedules the tool, and eventually tries to submit the tool result back into the conversation.
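To make the timing concrete, here is a minimal sketch of how a client could recognize that window. The event names follow the Anthropic streaming format; the trimmed event shapes and the helper names summarizeStream and isWedgeProne are assumptions for illustration, not Qwen Code's actual code.

```typescript
// Simplified subset of Anthropic-style streaming events; shapes are trimmed for illustration.
type StreamEvent =
  | { type: "message_start" }
  | { type: "content_block_start"; index: number; content_block: { type: "text" | "tool_use"; id?: string } }
  | { type: "content_block_delta"; index: number }
  | { type: "content_block_stop"; index: number }
  | { type: "message_delta" }
  | { type: "message_stop" };

interface StreamSummary {
  closedToolUseIds: string[]; // tool_use blocks that fully closed before the stream ended
  sawMessageStop: boolean;    // whether the terminal event ever arrived
}

// Hypothetical helper: replay the events that were received and record what completed.
function summarizeStream(events: StreamEvent[]): StreamSummary {
  const openToolUse = new Map<number, string>(); // block index -> tool_use id
  const closedToolUseIds: string[] = [];
  let sawMessageStop = false;
  for (const ev of events) {
    if (ev.type === "content_block_start") {
      if (ev.content_block.type === "tool_use" && ev.content_block.id !== undefined) {
        openToolUse.set(ev.index, ev.content_block.id);
      }
    } else if (ev.type === "content_block_stop") {
      const id = openToolUse.get(ev.index);
      if (id !== undefined) {
        closedToolUseIds.push(id);
        openToolUse.delete(ev.index);
      }
    } else if (ev.type === "message_stop") {
      sawMessageStop = true;
    }
  }
  return { closedToolUseIds, sawMessageStop };
}

// The risky window: a tool call was handed to the runtime, but the message never terminated.
function isWedgeProne(summary: StreamSummary): boolean {
  return !summary.sawMessageStop && summary.closedToolUseIds.length > 0;
}

const dropped: StreamEvent[] = [
  { type: "message_start" },
  { type: "content_block_start", index: 0, content_block: { type: "tool_use", id: "toolu_1" } },
  { type: "content_block_delta", index: 0 },
  { type: "content_block_stop", index: 0 },
  // connection drops here: no message_delta, no message_stop
];
console.log(isWedgeProne(summarizeStream(dropped))); // true
```

A stream that drops while only text has arrived would not trip this check, which matches the distinction the PR draws between tool calls and partial prose.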
The problem: the assistant turn that contained the original tool_use never made it into persisted history, because the stream threw before the normal history push. The next request now contains a user-side tool_result with no matching assistant-side tool_use immediately before it. Backends quite reasonably reject that with HTTP 400: each tool_result must correspond to a prior tool_use. Retry cannot reliably save the user because the missing assistant message is gone. The session is no longer merely confused; it is structurally invalid.
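The invariant the backend enforces can be checked locally before a request is ever sent. The sketch below assumes a simplified Anthropic-style message shape; findPairingViolations is a hypothetical helper, and it checks only the direction described above (a tool_result with no tool_use in the immediately preceding assistant message).

```typescript
// Simplified Anthropic-style content blocks; real payloads carry more fields.
type Block =
  | { type: "text"; text: string }
  | { type: "tool_use"; id: string; name: string }
  | { type: "tool_result"; tool_use_id: string; is_error?: boolean };

interface Message {
  role: "user" | "assistant";
  content: Block[];
}

// Hypothetical helper: report every tool_result that lacks a matching tool_use
// in the assistant message directly before it.
function findPairingViolations(messages: Message[]): string[] {
  const problems: string[] = [];
  messages.forEach((msg, i) => {
    if (msg.role !== "user") return;
    const prev: Message | undefined = messages[i - 1];
    const offered = new Set<string>();
    if (prev !== undefined && prev.role === "assistant") {
      for (const block of prev.content) {
        if (block.type === "tool_use") offered.add(block.id);
      }
    }
    for (const block of msg.content) {
      if (block.type === "tool_result" && !offered.has(block.tool_use_id)) {
        problems.push(
          `message ${i}: tool_result ${block.tool_use_id} has no tool_use in the preceding assistant message`,
        );
      }
    }
  });
  return problems;
}

// The wedged transcript: the assistant turn holding tool_use "toolu_1" was never persisted.
const wedged: Message[] = [
  { role: "user", content: [{ type: "text", text: "list the files" }] },
  { role: "user", content: [{ type: "tool_result", tool_use_id: "toolu_1" }] },
];
console.log(findPairingViolations(wedged).length); // 1
```

Running a check like this before each request turns an opaque HTTP 400 into a local, explainable error.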
This is one of the under-discussed differences between chatbots and agents. A chatbot can lose a few streamed tokens and continue. A tool-using agent has a wire protocol with invariants. If the transcript violates those invariants, the model does not get a chance to be clever. The backend refuses the request before the intelligence layer enters the room.
Qwen’s fix treats history as a contract, not a log
The fix has two layers. First, if a stream errors after a function call has already been yielded, Qwen now persists the partial assistant turn before rethrowing. Plain text-only partial turns are intentionally not persisted, because keeping partial prose can poison a retry or duplicate output. Tool calls are different. Once the runtime has acted on a tool call, history must remember that the call happened.
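The first layer can be sketched as a stream wrapper that remembers whether a function call has already reached the caller. This is an illustration under assumptions, not the PR's code: the Part and Turn shapes are simplified, streamWithPersistence is a hypothetical name, and persisting the whole partial turn (text included) once a call was yielded is one reading of the PR description.

```typescript
// Simplified Gemini-style parts; real parts carry arguments and more metadata.
type Part = { text: string } | { functionCall: { id: string; name: string } };

interface Turn {
  role: "user" | "model";
  parts: Part[];
}

// Hypothetical wrapper: forwards parts to the caller and decides what to persist on failure.
async function* streamWithPersistence(
  source: AsyncIterable<Part>,
  history: Turn[],
): AsyncGenerator<Part> {
  const parts: Part[] = [];
  let yieldedFunctionCall = false;
  try {
    for await (const part of source) {
      parts.push(part);
      if ("functionCall" in part) yieldedFunctionCall = true;
      yield part; // from here on, the runtime may already be acting on the call
    }
  } catch (err) {
    // Text-only partial turns are dropped so a retry starts clean.
    // Once a function call was yielded, history must record that it happened.
    if (yieldedFunctionCall) history.push({ role: "model", parts });
    throw err;
  }
  history.push({ role: "model", parts }); // normal completion path
}
```

The rethrow keeps the existing error handling and retry logic in charge; the only change is that history is no longer silently missing a turn the runtime has already acted on.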
Second, Qwen adds repair logic for cases where the transcript is already damaged: process crashes, Ctrl+Y retries while tools are in flight, out-of-memory exits, resumed sessions, and even manual or external JSONL edits. repairOrphanedToolUseTurns(history) walks history and synthesizes an error-typed functionResponse for every unpaired function call. That is an important design choice. It does not pretend the tool succeeded. It preserves the shape of the protocol so the model can see an honest failure and decide whether to retry.
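A repair pass of this kind might look like the sketch below. The function name echoes the PR, but the body is an assumption written for illustration: the Content and Part shapes are simplified, the error payload format is invented, and merging synthetic responses into an existing user turn is a design guess rather than documented behavior.

```typescript
interface FunctionCall { id: string; name: string }
interface FunctionResponse { id: string; name: string; response: { error?: string; output?: string } }
type Part = { text: string } | { functionCall: FunctionCall } | { functionResponse: FunctionResponse };
interface Content { role: "user" | "model"; parts: Part[] }

// Assumed error payload; the real wording and shape may differ.
function syntheticError(call: FunctionCall): Part {
  return {
    functionResponse: {
      id: call.id,
      name: call.name,
      response: { error: "Tool call did not complete (interrupted before a result was recorded)." },
    },
  };
}

// Sketch of a repair pass: returns a new history in which every function call
// is followed by a response, synthesizing error responses where none exist.
function repairOrphanedToolUseTurnsSketch(history: Content[]): Content[] {
  const repaired: Content[] = [];
  for (let i = 0; i < history.length; i++) {
    const turn = history[i];
    if (turn === undefined) continue;
    repaired.push(turn);
    if (turn.role !== "model") continue;

    const calls: FunctionCall[] = [];
    for (const p of turn.parts) if ("functionCall" in p) calls.push(p.functionCall);
    if (calls.length === 0) continue;

    const next: Content | undefined = history[i + 1];
    const nextIsUser = next !== undefined && next.role === "user";
    const answered = new Set<string>();
    if (next !== undefined && nextIsUser) {
      for (const p of next.parts) if ("functionResponse" in p) answered.add(p.functionResponse.id);
    }

    const missing = calls.filter((c) => !answered.has(c.id)).map(syntheticError);
    if (missing.length === 0) continue; // already paired: nothing to do

    if (next !== undefined && nextIsUser) {
      // Partial coverage or a plain user message: merge so roles keep alternating.
      repaired.push({ role: "user", parts: [...missing, ...next.parts] });
      i++; // the following turn has been consumed
    } else {
      repaired.push({ role: "user", parts: missing });
    }
  }
  return repaired;
}
```

Because already-paired calls are skipped, running the pass twice produces the same history as running it once, which is the idempotence property the PR's tests are said to cover.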
The wiring is also more thoughtful than the average “cleanup pass.” Repair runs when chat starts, which covers crash and resume paths. It runs immediately after user content is pushed, so a real retry payload can close its own tool pair before the synthesizer inserts an error. And handleCompletedTools deduplicates against chat history before submitting late tool results, which prevents an in-flight scheduler from double-submitting a call that repair has already marked as failed.
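The deduplication step can be illustrated with a small filter. Only the name handleCompletedTools comes from the PR; dropAlreadyAnswered, the CompletedTool shape, and the simplified history types below are hypothetical, intended only to show the idea of checking chat history before submitting late results.

```typescript
type Part =
  | { text: string }
  | { functionResponse: { id: string; name: string; response: Record<string, unknown> } };

interface Content {
  role: "user" | "model";
  parts: Part[];
}

// Hypothetical shape for a tool that the scheduler has finished running.
interface CompletedTool {
  callId: string;
  name: string;
  output: string;
}

// Keep only results whose call id has no response in history yet, so a late
// scheduler result cannot duplicate a response that repair already inserted.
function dropAlreadyAnswered(completed: CompletedTool[], history: Content[]): CompletedTool[] {
  const answered = new Set<string>();
  for (const turn of history) {
    for (const part of turn.parts) {
      if ("functionResponse" in part) answered.add(part.functionResponse.id);
    }
  }
  return completed.filter((tool) => !answered.has(tool.callId));
}

const history: Content[] = [
  {
    role: "user",
    parts: [{ functionResponse: { id: "call_1", name: "ls", response: { error: "interrupted" } } }],
  },
];
const late: CompletedTool[] = [
  { callId: "call_1", name: "ls", output: "a.txt" }, // repair already closed this one
  { callId: "call_2", name: "cat", output: "hello" },
];
console.log(dropAlreadyAnswered(late, history).map((t) => t.callId)); // [ 'call_2' ]
```

The trade-off in this sketch is that a real result arriving after a synthetic error is discarded, which keeps the transcript valid at the cost of losing that output; the model can still retry the call.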
The PR explicitly compares this to Claude Code’s yieldMissingToolResultBlocks, while noting that Qwen’s React scheduler runs out of band from the stream loop. That difference matters. Hosted and local agents are converging on the same invariant, but their internal architectures force different repairs. The ecosystem is learning, painfully, that “agent runtime” is a distributed system in miniature: stream parser, scheduler, transcript store, provider adapter, shell, UI, and retry loop all have to agree on what happened.
Open agents need stronger recovery, not weaker polish
There is a temptation to grade local/open agents mostly on model quality and hardware economics: how well does Qwen compare to Claude, how much VRAM does it need, can it run cheaply enough for daily use? Those questions matter. But they are not the whole product. A coding agent that uses tools must survive interruptions that ordinary editors barely notice.
Network drops are not edge cases. Developers work on trains, corporate VPNs, conference Wi-Fi, overloaded provider endpoints, self-hosted gateways, and laptops that sleep at the worst possible time. Ctrl+Y or retry while a tool is still running is not exotic either; it is exactly what impatient humans do when a terminal looks stuck. If those paths corrupt the transcript, the agent becomes a glass cannon: impressive when the demo path holds, expensive when it doesn’t.
Qwen’s test evidence is the right kind of boring. The PR cites passing suites across geminiChat, client, and useGeminiStream tests; later updates mention all 8,186 core tests, clean TypeScript, and CI coverage including lint, platform tests, and CodeQL. More interesting than the numbers are the cases: dangling tool calls on resume, partial parallel tool coverage, idempotence when history is already paired, retry races where real tool results should beat synthetic errors, and scheduler dedup before early return. That is what agent reliability work looks like. Not a benchmark chart. A pile of race conditions killed one by one.
The nightly also includes a smaller but telling notebook fix: preserving tab and mixed-whitespace .ipynb formatting. That belongs in the same story. Agents do not only touch clean source files. They edit notebooks, generated files, lockfiles, YAML, hidden metadata, and all the annoying artifacts where a technically valid rewrite can still ruin a review. Formatting preservation is diff hygiene, and diff hygiene is trust.
What engineers should do now
If you run Qwen Code in serious workflows, upgrade a disposable environment first and test the failure paths on purpose. Start a tool call and interrupt the network. Kill the process after the model emits a tool call but before the tool result is submitted. Resume the session. Retry while a tool is still running. Run the same workflow through Anthropic-compatible and OpenAI-compatible providers if you use adapters. Then inspect the transcript, not just the final answer.
The acceptance criterion should be simple: every tool call that reaches the runtime eventually has a matching result on the wire, even if that result is an honest synthetic error. The UI should not stay stuck in completed-but-not-submitted limbo. The provider should not reject the next turn with a 400 because the previous message shape is invalid. The model should be able to recover from the failure in normal language, not force the developer to understand provider message schemas.
Teams evaluating local coding agents should add this to their checklist alongside sandboxing, MCP permissions, context limits, and cost. Ask vendors and open-source projects how they handle partial streams, missing tool results, crash recovery, resumed sessions, and manual transcript corruption. If the answer is “retry,” keep digging. Retry without transcript repair is how you make a wedge deterministic.
The editorial take: Qwen Code’s nightly is not flashy, which is precisely why it matters. Local agents become serious when they stop assuming the happy path. The future of open coding agents will not be decided only by who has the best model weights; it will be decided by whose runtime can take a bad network, a crashed process, a late tool result, and a messy notebook diff without handing the developer a broken transcript and calling it autonomy.
Sources: Qwen Code v0.16.0-nightly.20260522.48b0a8bfc, Qwen Code PR #4176, Qwen Code v0.16.0, Qwen Code README