AXPO Shows Why Tool-Using Models Need Different RL Than Chat Models

The interesting thing about AXPO is not that it makes Qwen3-VL-Thinking score a little better on multimodal benchmarks. The interesting thing is the failure mode it catches: models that know a tool exists, talk about using it, and then retreat back into pure text because acting has become too expensive in the reward loop. That is not a minor training quirk. It is the difference between an agent that can operate in the world and a very articulate autocomplete with commitment issues.

The paper calls this the Thinking-Acting Gap. In standard GRPO-style reinforcement learning, tool use is high variance early in training. The model may decide to zoom into an image, run a search, crop a region, or take some other action in the environment, but those attempts often fail before the policy has enough signal to improve. If the reward mostly sees failed actions, the model learns the wrong lesson: thinking is safer than acting. The authors report that tool use appears in only about 30% of rollouts, and that on roughly 40% of the questions where tools are attempted, every tool-using rollout in the group is wrong. That is how you accidentally train an agent to become more passive.
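To see why an all-wrong tool subgroup pushes the policy away from acting, it helps to run the arithmetic. The sketch below uses the usual GRPO-style group-relative advantage (reward minus group mean, divided by group standard deviation). The group of eight rollouts, the 0/1 rewards, and the names group_advantages and used_tool are illustrative assumptions, not data or code from the paper.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Group-relative advantage: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical group of 8 rollouts sampled for one question.
# used_tool marks rollouts that actually invoked a tool;
# reward is 1.0 when the final answer is correct, else 0.0.
used_tool = [True, True, False, False, False, False, False, False]
rewards   = [0.0,  0.0,  1.0,   1.0,   0.0,   1.0,   0.0,   1.0]

adv = group_advantages(rewards)
tool_adv = [a for a, t in zip(adv, used_tool) if t]
text_adv = [a for a, t in zip(adv, used_tool) if not t]

# Every tool-using rollout gets a negative advantage, so the update
# lowers the probability of the tokens that led to the tool call.
mean_tool_adv = mean(tool_adv)
mean_text_adv = mean(text_adv)
print(f"tool rollouts: {mean_tool_adv:+.2f}, text-only rollouts: {mean_text_adv:+.2f}")
```

In this toy group the text-only rollouts are right only four times out of six, yet they still come out ahead on average, because the tool subgroup contributes nothing but failures.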

Tool use is not just another token sequence

AXPO’s fix is targeted rather than mystical. When tool-using subgroups fail, it freezes the question plus the thinking prefix, then resamples the tool-call continuation. In other words, it does not throw away the whole trajectory or reward the model for avoiding action. It gives the model another chance at the exact part of the behavior distribution that matters: the decision to act, the concrete tool call, and the follow-through after the tool returns.
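A minimal sketch of that resampling step, as described above, might look like the following. It is not the authors' implementation. The Rollout fields, the sample_continuation and score callables, the n_resamples parameter, and the choice to append the new rollouts to the group are all assumptions made for illustration; the paper's actual bookkeeping may differ.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Rollout:
    prefix: str        # question plus thinking text up to the tool decision
    continuation: str  # tool call and everything after it ("" if no tool call)
    used_tool: bool
    reward: float

def resample_failed_tool_subgroup(
    group: List[Rollout],
    sample_continuation: Callable[[str], str],  # hypothetical: policy sampler given a frozen prefix
    score: Callable[[str, str], float],         # hypothetical: reward for prefix + continuation
    n_resamples: int = 4,
) -> List[Rollout]:
    """If every tool-using rollout in the group failed, resample tool-call
    continuations from the frozen prefixes; otherwise return the group as-is."""
    tool_rollouts = [r for r in group if r.used_tool]
    if not tool_rollouts or any(r.reward > 0 for r in tool_rollouts):
        return group  # no tool subgroup, or it already contains a success

    extra = []
    for r in tool_rollouts:
        for _ in range(n_resamples):
            # The prefix is reused verbatim; only the action part is redrawn.
            cont = sample_continuation(r.prefix)
            extra.append(Rollout(r.prefix, cont, True, score(r.prefix, cont)))
    # Assumption: resampled rollouts join the group before advantages are computed.
    return group + extra
```

The design point is that the extra sampling budget goes to the action part of the trajectory only, so a success found there credits the tool call rather than a fresh round of thinking.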

That detail matters for practitioners because most agent evals still hide this pathology behind final accuracy. A model can improve on aggregate while using tools less often. That may look like progress until the task distribution shifts toward work that actually requires search, visual inspection, shell execution, UI manipulation, or retrieval. If your benchmark only asks whether the final answer is right, you will miss the fact that the model’s operating surface is shrinking.

The reported numbers are modest in one framing and more provocative in another. The abstract says SFT+AXPO beats SFT+GRPO by +1.8 percentage points Pass@1 and +1.8 percentage points Pass@4 at 8B on average. The project page emphasizes larger baseline-relative gains: +7.9pp Pass@1 and +6.2pp Pass@4 at 8B. The authors train three scales of Qwen3-VL-Thinking across nine multimodal benchmarks covering reasoning, perception, and search, and they report that the 8B SFT+AXPO model surpasses the 32B Base on Pass@4 with a quarter of the parameters.

That last point is where the work stops being just a training paper and starts becoming an economics paper. Agent workloads are not cheap chat completions. They branch, retry, call tools, carry long context, and sometimes need multiple rollouts before they produce a trustworthy result. If an 8B model can be trained to act reliably enough to beat a larger base model on tool-using work, that changes routing decisions. The cheapest usable agent is often not the smallest model or the highest-scoring chat model. It is the model whose behavior distribution matches the environment.

The metric your agent dashboard probably lacks

Teams building coding agents, browser agents, and multimodal assistants should steal the diagnostic even if they never use AXPO. Track tool-attempt rate. Track successful-tool rate. Track all-wrong tool subgroups. Track whether the model stops after one search when the task needs two. Track whether it mentions a tool but never invokes it. Track whether tool use collapses over training, fine-tuning, prompt changes, or model routing updates.
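Several of these counters fall out of ordinary episode logs. The sketch below assumes a hypothetical Episode record with four fields; the field names, and the definition of "successful-tool rate" as the share of tool-using episodes that end in a correct answer, are illustrative choices rather than anything specified by the paper.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Episode:
    question_id: str
    mentioned_tool: bool  # the reasoning text names a tool
    tool_calls: int       # number of tool invocations actually made
    correct: bool         # final answer judged correct

def tool_metrics(episodes: List[Episode]) -> Dict[str, float]:
    n = len(episodes)
    with_tool = [e for e in episodes if e.tool_calls > 0]

    # Group tool-using episodes by question to find all-wrong tool subgroups.
    by_question: Dict[str, List[bool]] = {}
    for e in with_tool:
        by_question.setdefault(e.question_id, []).append(e.correct)
    all_wrong = sum(1 for results in by_question.values() if not any(results))

    mention_only = sum(1 for e in episodes if e.mentioned_tool and e.tool_calls == 0)
    return {
        "tool_attempt_rate": len(with_tool) / n if n else 0.0,
        "tool_success_rate": (
            sum(e.correct for e in with_tool) / len(with_tool) if with_tool else 0.0
        ),
        "all_wrong_tool_question_rate": all_wrong / len(by_question) if by_question else 0.0,
        "mention_without_call_rate": mention_only / n if n else 0.0,
    }
```

Computing the same dictionary per checkpoint, prompt version, or routing configuration and plotting it over time is one way to notice tool use collapsing before final accuracy moves.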

The qualitative examples are useful because they look familiar. A GRPO-trained model says it should use image_zoom_in but does not actually call it; AXPO commits the call and answers the visual question correctly. In search, the weaker policy stops after the first hop; AXPO performs the second search needed to resolve the answer. That is the agent equivalent of a junior engineer saying “we should check the logs” and then never opening the log viewer.

There is a governance angle here too. Tool avoidance is not always bad. In production, agents should avoid dangerous or unnecessary actions. But that policy has to be explicit. If the model avoids tools because the training loop punished early exploration, you do not have safety. You have learned helplessness with a nicer benchmark score. Proper action governance means permission scopes, dry runs, audit logs, and evals that separate “correctly declined to act” from “failed to act because acting is hard.”
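One way to make that separation concrete in an eval harness is to label each item with whether a tool is actually required, then bucket outcomes. The sketch below assumes such a tool_required annotation exists for every eval item, which is a real labeling cost; the bucket names are hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple

def classify_action(tool_required: bool, tool_called: bool, correct: bool) -> str:
    """Bucket one eval outcome by whether acting was needed and whether it happened."""
    if tool_required and not tool_called:
        # Counted as a failure to act even if the answer happened to be right,
        # on the assumption that tool_required means the task cannot be solved reliably without it.
        return "failed_to_act"
    if not tool_required and not tool_called:
        return "correctly_declined"
    if not tool_required and tool_called:
        return "unnecessary_action"
    return "acted_and_correct" if correct else "acted_and_wrong"

def action_breakdown(records: Iterable[Tuple[bool, bool, bool]]) -> Counter:
    """records: (tool_required, tool_called, correct) triples."""
    return Counter(classify_action(*r) for r in records)
```

A rising failed_to_act count alongside a stable correctly_declined count is the pattern that distinguishes learned passivity from deliberate restraint.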

The caveat is straightforward: this is a paper and project-page result, not a broadly reproduced deployment pattern yet. The gains need to be retested on the tools a real system exposes: browsers, IDEs, shells, image crops, internal search, databases, spreadsheets, and ticketing systems. Tool semantics matter. A crop tool is not a shell. A search tool is not a payment action. AXPO names the failure class; production teams still need environment-specific evidence.

Still, the lesson is clean. Tool-using agents need training signals designed for action, not chat-shaped reward recipes with tools bolted on afterward. If your model only thinks about using tools, it is not an agent. It is a reviewer leaving TODO comments in its own head.

Sources: arXiv, AXPO project page, Hugging Face Daily Papers