
Qwen Code 0.15.10 Turns Long Coding Sessions Into a Tool-Budget Problem

Anatoliy Kolodkin

10 May 2026 • 4 min read

Qwen Code 0.15.10 is a coding-agent release about budgets. Not the finance kind, although your token bill may have opinions. The important budgets are tool schemas, context windows, and reusable instructions. Those are the constraints that decide whether an agent can keep working after the demo, after the tool list grows, and after the session becomes messy enough to resemble actual software engineering.

The release has a long changelog, but three changes define the story: ToolSearch for deferred tool-schema loading, reactive compression when provider context windows overflow, and automatic project-skill extraction behind an explicit opt-in flag. Together, they show a coding agent moving from “load everything and hope the model copes” toward a more disciplined runtime: discover tools when needed, recover when context breaks, and turn repeated workflows into reviewable project artifacts.

The hidden tax is the tool list

PR #3589 adds ToolSearch and on-demand deferred tool schemas. The pull request says a typical 39-tool setup previously spent roughly 15,000 tokens per request on declarations. That number should make every agent framework maintainer uncomfortable. Fifteen thousand tokens before the model has read the user’s request, inspected a file, or reasoned about a plan is not capability. It is overhead wearing a tool belt.

MCP makes this problem sharper. The easiest way to make an agent look powerful is to register more servers and more tools: GitHub, Jira, filesystem, browser, cloud APIs, observability, databases, deployment systems, internal docs. Each tool needs a name, description, parameters, and schema. Multiply that by every request and the model spends a material chunk of its context budget reading tools it will not use. That increases latency, cost, and failure probability. It also crowds out the actual working context: source code, logs, diffs, tests, and user intent.

Qwen Code’s deferred-loading design adds flags to DeclarativeTool: shouldDefer, alwaysLoad, and searchHint. ToolRegistry.getFunctionDeclarations() filters deferred tools by default, while revealDeferredTool(name) can expose a hidden tool when needed. That is the right shape. Low-frequency or specialized tools should not be shoved into every prompt just in case. They should be discoverable.
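The filtering behavior is easy to picture in a few lines of TypeScript. This is a minimal sketch, not Qwen Code's implementation: the three flag names and the two method names come from the pull request as described above, while the class name, the `register` method, the `revealed` set, and the rule that `alwaysLoad` overrides `shouldDefer` are assumptions made for illustration.

```typescript
// Hypothetical sketch. Only shouldDefer, alwaysLoad, searchHint,
// getFunctionDeclarations, and revealDeferredTool are named in the article.
interface ToolSchema {
  name: string;
  description: string;
  parameters: Record<string, unknown>; // JSON Schema for the arguments
  shouldDefer?: boolean; // hide from the default declaration list
  alwaysLoad?: boolean; // assumed here to override shouldDefer
  searchHint?: string; // extra keywords for discovery
}

class SketchToolRegistry {
  private readonly tools = new Map<string, ToolSchema>();
  private readonly revealed = new Set<string>();

  register(tool: ToolSchema): void {
    this.tools.set(tool.name, tool);
  }

  // Declarations sent with a request: every tool that is not deferred,
  // plus deferred tools that were revealed earlier in the session.
  getFunctionDeclarations(): ToolSchema[] {
    return Array.from(this.tools.values()).filter(
      (t) =>
        t.alwaysLoad === true ||
        t.shouldDefer !== true ||
        this.revealed.has(t.name),
    );
  }

  // Returns true if the tool exists and is now visible to the model.
  revealDeferredTool(name: string): boolean {
    if (!this.tools.has(name)) return false;
    this.revealed.add(name);
    return true;
  }
}

const registry = new SketchToolRegistry();
registry.register({
  name: "read_file",
  description: "Read a file from the workspace",
  parameters: {},
  alwaysLoad: true,
});
registry.register({
  name: "jira_create_issue",
  description: "Create an issue in the tracker",
  parameters: {},
  shouldDefer: true,
  searchHint: "ticket bug jira",
});
```

In this sketch, the first request carries only `read_file`. The Jira schema costs nothing until something calls `revealDeferredTool("jira_create_issue")`.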

There is a tradeoff, because deferred tools change the model’s planning problem. A model cannot call a tool it cannot see unless it learns to search for it. That makes naming, search hints, and descriptions part of the agent UX. A badly named deferred tool is effectively invisible. A well-described tool catalog lets teams add capability without turning every request into a 15K-token table of contents. This is where agent platform work starts to look less like prompt engineering and more like information architecture.
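To see why naming becomes UX, consider a toy keyword search over a deferred catalog. The sketch below is an illustration under stated assumptions, not how ToolSearch ranks results: the `CatalogEntry` shape, the `searchTools` function, and the scoring weights are all hypothetical.

```typescript
// Hypothetical catalog entry; searchHint mirrors the flag named in the article.
interface CatalogEntry {
  name: string;
  description: string;
  searchHint?: string;
}

// Lowercase, split on non-alphanumerics, and drop very short words.
function tokenize(text: string): string[] {
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((word) => word.length > 2);
}

// Rank entries by keyword overlap. The weights (name 3, hint 2,
// description 1) are arbitrary choices for this sketch.
function searchTools(catalog: CatalogEntry[], query: string, limit = 3): string[] {
  const terms = tokenize(query);
  return catalog
    .map((entry) => {
      const nameWords = new Set(tokenize(entry.name));
      const hintWords = new Set(tokenize(entry.searchHint ?? ""));
      const descWords = new Set(tokenize(entry.description));
      let score = 0;
      for (const term of terms) {
        if (nameWords.has(term)) score += 3;
        if (hintWords.has(term)) score += 2;
        if (descWords.has(term)) score += 1;
      }
      return { name: entry.name, score };
    })
    .filter((scored) => scored.score > 0)
    .sort((a, b) => b.score - a.score)
    .slice(0, limit)
    .map((scored) => scored.name);
}

const catalog: CatalogEntry[] = [
  { name: "jira_create_issue", description: "Create an issue in the tracker", searchHint: "ticket bug jira" },
  { name: "deploy_service", description: "Deploy a service to staging", searchHint: "release rollout" },
  { name: "tool_17", description: "Runs an operation" }, // badly named on purpose
];
```

Here `searchTools(catalog, "file a bug ticket")` finds the Jira tool only because its hint carries the words a user would actually type. No plausible query surfaces `tool_17`, which is the "effectively invisible" failure mode in miniature.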

Context overflow needs recovery, not drama

PR #3879 adds reactive compression when the provider reports a context-window overflow. Qwen Code classifies context-length errors, compresses the current conversation, and retries the failed turn once with compressed context. That “once” is important. Infinite compress-and-retry loops are how agents turn failure into expensive theater. A single guarded retry is a pragmatic service behavior: recover when possible, fail when recovery is not credible.
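The control flow is worth spelling out because the "once" lives in the structure, not in a counter. This is a minimal sketch of the pattern rather than Qwen Code's code: the `Message` type, the `send` and `compress` callbacks, and the `code` field used to classify overflow errors are assumptions, and real providers report overflow in different shapes.

```typescript
type Message = { role: "system" | "user" | "assistant" | "tool"; content: string };

// Hypothetical classifier: assumes the provider error carries a `code` field.
function isContextOverflow(err: unknown): boolean {
  return (
    typeof err === "object" &&
    err !== null &&
    (err as { code?: unknown }).code === "context_length_exceeded"
  );
}

// Send the turn; on a context overflow, compress and retry exactly once.
async function sendWithReactiveCompression(
  messages: Message[],
  send: (messages: Message[]) => Promise<string>,
  compress: (messages: Message[]) => Promise<Message[]>,
): Promise<string> {
  try {
    return await send(messages);
  } catch (err) {
    // Anything that is not an overflow is not ours to fix.
    if (!isContextOverflow(err)) throw err;
    const compressed = await compress(messages);
    // No try/catch here: a second overflow propagates to the caller
    // instead of starting a compress-and-retry loop.
    return send(compressed);
  }
}
```

Because the retry sits outside any loop, the worst case is two provider calls and one compression per turn, which keeps the cost of a failed recovery bounded.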

Long coding sessions hit context limits in boring ways. Logs grow. Diffs grow. Tool results include files. Users paste stack traces. The agent retries commands. A build failure drags in dependency output. The conversation contains old plans that are no longer useful but still occupy tokens. Proactive summarization helps, but provider-side limits can still arrive unpredictably. Reactive compression is not a substitute for good context management; it is the seatbelt for when good context management loses.

Teams should test this explicitly. If your coding agent dies after a large refactor or a noisy test run, the relevant question is not “does the model have a bigger context window?” Bigger windows get filled. The question is whether the runtime can degrade gracefully: compress stale context, preserve the current task state, avoid duplicating tool results, and retry without pretending it remembers every detail perfectly. Compression should move a session from dead to recoverable, not from precise to vaguely confident.

Auto-generated skills are useful supply-chain artifacts

The most interesting and riskiest feature is autoSkill from PR #3673. It is disabled by default with memory.enableAutoSkill: false. After a session reaches a default threshold of 20 tool calls, Qwen Code can fork a background review agent to extract reusable project-level workflows into ${projectRoot}/.qwen/skills/. That is clever because project-specific agent behavior is rarely designed upfront. It emerges from repeated debugging rituals, build commands, conventions, deployment quirks, and “always check this file first” habits.

The safety posture is the part that makes it publishable. autoSkill is opt-in. Review-agent writes are constrained to the skills directory. Files must contain source: auto-skill frontmatter before edit or write operations are allowed, which prevents overwriting user-authored skills. Those guardrails matter because generated skills are not inert notes. They are instructions future agents may load and follow. If code can be supply chain, agent instructions can be supply chain too.
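Those two guardrails can be approximated with a path check and a frontmatter check. The sketch below is an assumption-heavy illustration, not the logic from PR #3673: the function names are hypothetical, the frontmatter parsing is deliberately naive, and allowing brand-new files inside the skills directory is a guess about behavior the article does not specify.

```typescript
import * as path from "node:path";

// True when `target` resolves to a location strictly inside `skillsDir`.
function isInsideSkillsDir(skillsDir: string, target: string): boolean {
  const rel = path.relative(path.resolve(skillsDir), path.resolve(target));
  const escapes = rel === ".." || rel.startsWith(".." + path.sep);
  return rel !== "" && !escapes && !path.isAbsolute(rel);
}

// True when the file starts with frontmatter containing `source: auto-skill`.
function hasAutoSkillMarker(fileText: string): boolean {
  const match = /^---\r?\n([\s\S]*?)\r?\n---/.exec(fileText);
  if (!match) return false;
  return (match[1] ?? "")
    .split(/\r?\n/)
    .some((line) => /^source:\s*auto-skill\s*$/.test(line));
}

// Hypothetical gate for a review-agent edit or write.
// `existingText` is null when the target file does not exist yet.
function mayReviewAgentWrite(
  skillsDir: string,
  target: string,
  existingText: string | null,
): boolean {
  if (!isInsideSkillsDir(skillsDir, target)) return false;
  // Assumption: new files are allowed; existing files need the marker,
  // so user-authored skills are never overwritten.
  return existingText === null || hasAutoSkillMarker(existingText);
}
```

A write to `.qwen/skills/../../src/main.ts` fails the first check, and an existing skill whose frontmatter lacks the marker fails the second.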

The right operational model is to review generated skills like code. Put them in version control. Require pull requests if they affect shared repos. Check that they encode stable project knowledge rather than one-off session mistakes. Remove secrets, machine-specific paths, or brittle assumptions. A skill that says “always run this cleanup command before tests” may save hours. A skill that learned the wrong cleanup command from a broken session may quietly damage the next one.

Qwen Code 0.15.10 also fixes practical paper cuts, including Edit/WriteFile false rejections for regular source files such as .kt, .cpp, .py, and .ts after partial or truncated reads. That is less philosophically interesting than ToolSearch, but it matters. Coding agents lose trust through dumb file-handling failures faster than through weak benchmark scores.

For builders, the action item is simple: measure your agent’s overhead. Count tools. Count schema tokens. Track how much context is spent before the task starts. Move rare tools behind discovery where the framework supports it. Treat generated skills as reviewable artifacts, not magical memory. And test long-session recovery with deliberately noisy logs and large diffs, because production coding work is not kind to clean demos.
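A first measurement does not need a tokenizer. The sketch below estimates schema overhead with a crude four-characters-per-token heuristic, which is an assumption and will drift from real counts; the `ToolSpec` shape and both function names are hypothetical. For real numbers, use the provider's tokenizer or its reported usage.

```typescript
// Hypothetical shape of a tool declaration as sent to the model.
interface ToolSpec {
  name: string;
  description: string;
  parameters: unknown; // JSON Schema for the arguments
}

// Crude heuristic: roughly four characters per token.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Summarize how much context the tool list consumes before the task starts.
function measureToolOverhead(declarations: ToolSpec[], contextWindow: number) {
  const perTool = declarations
    .map((d) => ({ name: d.name, tokens: estimateTokens(JSON.stringify(d)) }))
    .sort((a, b) => b.tokens - a.tokens); // heaviest schemas first
  const totalTokens = perTool.reduce((sum, t) => sum + t.tokens, 0);
  return {
    toolCount: declarations.length,
    totalTokens,
    shareOfContext: totalTokens / contextWindow,
    perTool,
  };
}

const report = measureToolOverhead(
  [
    { name: "read_file", description: "Read a file", parameters: { type: "object" } },
    {
      name: "query_database",
      description: "Run a read-only SQL query against the analytics replica",
      parameters: { type: "object", properties: { sql: { type: "string" } }, required: ["sql"] },
    },
  ],
  128000,
);
```

For scale, the figures quoted from PR #3589 work out to about 385 tokens per tool (15,000 divided by 39). The sorted `perTool` list shows which schemas to move behind discovery first.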

The next coding-agent bottleneck is not only model quality. It is tool budgets, context budgets, and instruction budgets. Qwen Code 0.15.10 is useful because it names those constraints in code.

Sources: Qwen Code v0.15.10 release, PR #3589, PR #3673, PR #3879, PR #4002
