Prove Trains Tool-Using Models Where the Tools Actually Work

Tool-use papers usually fail the reality test in one of two ways. Either the “tools” are static JSON fixtures with no meaningful state, or the reward quietly teaches the model that spraying extra calls is fine as long as the final answer looks right. Prove is worth reading because it attacks both problems: it trains models where the tools actually execute, the state actually changes, and pointless tool spam is explicitly punished.

IBM Research’s paper, “Synthesize and Reward: Learning Multi-Step Tool Use in Live Environments,” introduces Prove (Programmatic Rewards On Verified Environments). The system trains small Qwen and Granite-family models against 20 live stateful MCP servers exposing 343 tools across finance, productivity, commerce, travel, social, IoT, developer tools, and knowledge management. This is exactly the terrain where agent products are heading: not toy function calls, but stateful workflows where a result from step one becomes the only valid input for step three.

The timing matters. MCP has become the default shape of the tool layer across Claude, ChatGPT, VS Code, Cursor, MCPJam, and a growing pile of agent clients. That makes tool exposure easier. It does not make tool use reliable. If anything, it increases the blast radius of models that can call more systems without understanding state, dependency order, or when no tool should be called at all.

The useful trick is boring: validate against the real environment

Prove’s training corpus totals 13,517 examples: 10,895 multi-turn MCP conversations, 1,500 clarification trajectories, and 1,122 abstention examples, including 806 from When2Call and 316 from xLAM-Irrelevance. The data pipeline auto-discovers dependency graphs over tool pairs, extracts length-2 to length-5 tool chains, samples real entities from server state, then replay-validates traces against reset environments. Conversations with schema or execution error rates above 30% are discarded, and traces are deduplicated by Jaccard similarity on tool-call sequences at a 0.70 threshold.
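The filter-and-deduplicate step at the end of that pipeline is easy to picture in code. What follows is a minimal sketch, not the paper's implementation: the trace format (a list of dicts with a `tool` key and an optional `error` flag), the function names, and the choice to compute Jaccard similarity over the set of tool names are all assumptions made for illustration. Only the 30% error limit and the 0.70 threshold come from the description above.

```python
from typing import Dict, List, Sequence

ERROR_RATE_LIMIT = 0.30   # from the article: discard traces above 30% errors
JACCARD_THRESHOLD = 0.70  # from the article: dedup threshold on tool-call sequences


def error_rate(calls: Sequence[Dict]) -> float:
    """Fraction of calls flagged as schema or execution errors."""
    if not calls:
        return 0.0
    return sum(1 for c in calls if c.get("error")) / len(calls)


def jaccard(a: Sequence[str], b: Sequence[str]) -> float:
    """Jaccard similarity over the sets of tool names (an assumed definition)."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def filter_and_dedupe(traces: List[List[Dict]]) -> List[List[Dict]]:
    """Drop error-heavy traces, then drop near-duplicates of traces already kept."""
    kept: List[List[Dict]] = []
    for trace in traces:
        if error_rate(trace) > ERROR_RATE_LIMIT:
            continue
        names = [c["tool"] for c in trace]
        # Assumption: a similarity at or above the threshold counts as a duplicate.
        if any(
            jaccard(names, [c["tool"] for c in k]) >= JACCARD_THRESHOLD
            for k in kept
        ):
            continue
        kept.append(trace)
    return kept
```

The greedy first-seen-wins order is a simplification. A real pipeline might prefer the longer or cleaner trace among near-duplicates.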

That may sound like plumbing. It is the contribution. Synthetic tool-use data is famously prone to inventing users, account IDs, SKUs, calendar entries, files, bookings, and device names that do not exist. Models trained on that data learn workflows that only function inside the prompt writer’s imagination. Prove’s sampler probes live server state first, grounds query generation in real entities, and then replays the result. If you are building internal agent training data for CRM, ticketing, repo automation, billing, or support workflows, this is the part to copy.

The paper also gives every rollout a unique session ID, which prevents one rollout’s tool calls from contaminating another rollout’s state. That is a small implementation detail with large consequences. Without session isolation, multi-agent or multi-rollout training can accidentally reward models for state left behind by previous attempts. In production terms, it is the difference between a test suite that resets its database and one that occasionally passes because yesterday’s row is still lying around.
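A toy version makes the isolation property concrete. This is a sketch under stated assumptions: `ToyServer`, `deposit`, `balance`, and `new_session_id` are hypothetical names invented here, and an in-memory dict stands in for whatever state a real MCP server holds. The only idea taken from the paper is that each rollout carries its own session ID.

```python
import uuid
from collections import defaultdict
from typing import Dict


class ToyServer:
    """Hypothetical stand-in for a stateful tool server, keyed by session ID."""

    def __init__(self) -> None:
        # session_id -> {account name -> balance}
        self._state: Dict[str, Dict[str, int]] = defaultdict(dict)

    def deposit(self, session_id: str, account: str, amount: int) -> int:
        """Mutate only the state that belongs to this session."""
        accounts = self._state[session_id]
        accounts[account] = accounts.get(account, 0) + amount
        return accounts[account]

    def balance(self, session_id: str, account: str) -> int:
        """Read from this session's state; other sessions are invisible."""
        return self._state[session_id].get(account, 0)


def new_session_id() -> str:
    """One fresh ID per rollout, so no two rollouts share state."""
    return uuid.uuid4().hex


server = ToyServer()
rollout_a, rollout_b = new_session_id(), new_session_id()
server.deposit(rollout_a, "checking", 50)
# rollout_b never sees the deposit made by rollout_a.
```

If both rollouts shared one state dict, the second rollout could be rewarded for a balance it never created, which is exactly the contamination the session ID prevents.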

Reward design is where agent habits are made

Prove decomposes tool-use quality into five reward components: validity, dependency-ordered coverage, efficiency, tool-name guidance, and argument-value matching. Validity itself is tiered: the function name must exist, required parameters must appear with compatible JSON types, and live execution must succeed without error. That hierarchy matches the failure modes engineers actually see: the model calls a nonexistent tool, passes malformed arguments, uses the right tool in the wrong order, or succeeds only because the environment is forgiving.
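The three validity tiers map naturally onto a short checker. The sketch below is illustrative only: the tier scores (0.0, 0.25, 0.5, 1.0), the schema layout, and the `execute` callback are assumptions, not values or interfaces from the paper. What it borrows is the ordering: name first, then required parameters and types, then live execution.

```python
from typing import Any, Callable, Dict

JSON_TYPES = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "array": list,
    "object": dict,
}


def type_ok(value: Any, json_type: str) -> bool:
    """Check a Python value against a JSON type name."""
    expected = JSON_TYPES.get(json_type)
    if expected is None:
        return True  # unknown or missing type: do not penalize
    if json_type in ("integer", "number") and isinstance(value, bool):
        return False  # bool is a subclass of int in Python
    return isinstance(value, expected)


def validity_score(
    call: Dict[str, Any],
    schemas: Dict[str, Dict[str, Any]],
    execute: Callable[[str, Dict[str, Any]], Any],
) -> float:
    """Tiered validity with hypothetical scores for each tier reached."""
    schema = schemas.get(call["name"])
    if schema is None:
        return 0.0  # tier 1 failed: tool name does not exist
    args = call.get("arguments", {})
    props = schema.get("properties", {})
    for param in schema.get("required", []):
        if param not in args:
            return 0.25  # tier 2 failed: required parameter missing
        if not type_ok(args[param], props.get(param, {}).get("type", "")):
            return 0.25  # tier 2 failed: incompatible JSON type
    try:
        execute(call["name"], args)  # tier 3: run against the live environment
    except Exception:
        return 0.5
    return 1.0
```

Returning partial credit per tier gives the policy a gradient toward well-formed calls even before any call succeeds. Whether the paper scores tiers this way is not stated here.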

The adaptive efficiency budget is especially important. Too many evals treat extra tool calls as harmless. They are not. Every unnecessary call adds latency, cost, audit noise, and security exposure. In a production agent, tool spam can also trigger rate limits, mutate state unexpectedly, or leak data through integrations. Prove’s reward does not merely tell the model to be concise; it scales the allowed call budget with task complexity. A one-step lookup should not get the same call budget as a five-step refund workflow. That idea belongs in runtime policy as much as training.
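One way to picture a budget that scales with task complexity is the sketch below. The formula is invented for illustration and is not the one in the paper: it assumes the reference trace length is a reasonable proxy for complexity, allows 50% slack (at least one extra call), and decays the reward linearly once the budget is exceeded. `call_budget`, `efficiency_reward`, and `slack_ratio` are hypothetical names.

```python
import math


def call_budget(reference_calls: int, slack_ratio: float = 0.5) -> int:
    """Allowed calls: the reference length plus proportional slack (assumed rule)."""
    return reference_calls + max(1, math.ceil(reference_calls * slack_ratio))


def efficiency_reward(actual_calls: int, reference_calls: int) -> float:
    """Full reward within budget, linear decay to zero beyond it."""
    budget = call_budget(reference_calls)
    if actual_calls <= budget:
        return 1.0
    excess = actual_calls - budget
    return max(0.0, 1.0 - excess / budget)


# A one-step lookup gets a budget of 2; a five-step workflow gets 8.
print(call_budget(1), call_budget(5))
# Four calls on a one-step task earn nothing; ten calls on a five-step task earn 0.75.
print(efficiency_reward(4, 1), efficiency_reward(10, 5))
```

The same function could sit in a runtime policy layer: compute a budget from the planned steps and stop or escalate once the agent exceeds it.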

The paper’s robustness knobs are also practical: 40% distractor injection with 3 to 8 unrelated tools, 30% enum stripping, 5% irrelevance queries, and hidden-tool variants for missing-function or clarification behavior. These are not academic decorations. Real tool catalogs are noisy. Schemas are incomplete. The relevant tool may be missing. A reliable agent needs to ask a clarification question, abstain, or say the environment cannot satisfy the request instead of hallucinating the closest API.
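Two of those knobs, distractor injection and enum stripping, can be sketched as a catalog augmentation pass. This is an illustration under assumptions: the tool dict layout, the function names, and the decision to apply each knob per example and to strip enums from every tool at once are guesses. Only the 40% and 30% rates and the 3-to-8 distractor range come from the paper as described above.

```python
import copy
import random
from typing import Any, Dict, List


def strip_enums(node: Any) -> Any:
    """Return a copy of a schema with every key named 'enum' removed."""
    if isinstance(node, dict):
        return {k: strip_enums(v) for k, v in node.items() if k != "enum"}
    if isinstance(node, list):
        return [strip_enums(v) for v in node]
    return node


def augment_catalog(
    relevant: List[Dict[str, Any]],
    pool: List[Dict[str, Any]],
    rng: random.Random,
    p_distract: float = 0.40,
    p_strip: float = 0.30,
) -> List[Dict[str, Any]]:
    """Build a noisier tool catalog for one training example."""
    tools = [copy.deepcopy(t) for t in relevant]
    if rng.random() < p_strip:
        # Hide enum constraints so the model must infer valid values.
        tools = [strip_enums(t) for t in tools]
    if rng.random() < p_distract:
        names = {t["name"] for t in tools}
        candidates = [t for t in pool if t["name"] not in names]
        k = min(rng.randint(3, 8), len(candidates))
        tools.extend(copy.deepcopy(t) for t in rng.sample(candidates, k))
    rng.shuffle(tools)  # avoid leaking relevance through position
    return tools
```

Passing an explicit `random.Random` instance keeps augmentation reproducible, which matters when a trace needs to be replayed later.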

Results are useful without being overclaimed. The authors train Qwen3-4B, Qwen3-8B, Qwen2.5-7B, and Granite-4.1-8B for 350 GRPO steps. Qwen3-4B gains +10.2 on BFCL Multi-Turn overall. Qwen2.5-7B gains +6.8 on τ²-bench average and +6.5 on T-Eval overall. All four models improve across the three benchmark families, although Granite’s BFCL overall gain is nearly flat at +0.1. That caveat matters: tool-use RL is not magic seasoning. Base model, learning rate, environment coverage, and reward shaping all affect the outcome.

The comparison to AgenticQwen is also revealing. Prove uses about 13K examples and no judge model, while the paper describes AgenticQwen as using roughly 100K branching trajectories and LLM-as-judge rewards with Qwen3-235B. That does not make one approach universally better, but it highlights an important design tradeoff. If you can build programmatic rewards against verified environments, you can reduce dependence on expensive judge models and subjective scoring. For tool use, executable truth is usually better than another model’s opinion.

For teams already exposing MCP servers, the action item is not necessarily to train a model tomorrow. It is to audit the tool environment like it might become training infrastructure. Can tools reset cleanly? Are traces recorded deterministically? Can a session be replayed? Are failures typed well enough for an agent to recover? Do schemas encode constraints, enums, and required fields clearly? Can your harness distinguish “no applicable tool” from “malformed call” from “state changed under us”? Those properties improve production reliability even before they improve training data.
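The last audit question, telling the three failure types apart, is mostly a matter of giving outcomes explicit types. The sketch below is hypothetical throughout: the `Outcome` enum, the `Attempt` record, the catalog shape (tool name mapped to required parameter names), and the version-counter check for stale state are assumptions chosen for illustration, not part of MCP or the paper.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict, List, Optional


class Outcome(Enum):
    OK = "ok"
    NO_APPLICABLE_TOOL = "no_applicable_tool"
    MALFORMED_CALL = "malformed_call"
    STATE_CONFLICT = "state_conflict"


@dataclass
class Attempt:
    tool: Optional[str]   # None means the agent declined to call any tool
    expected_version: int  # state version the agent observed before acting
    arguments: Dict[str, Any] = field(default_factory=dict)


def classify(
    attempt: Attempt,
    catalog: Dict[str, List[str]],
    current_version: int,
) -> Outcome:
    """Assign one typed outcome to an attempted tool call."""
    if attempt.tool is None:
        return Outcome.NO_APPLICABLE_TOOL  # the agent abstained
    required = catalog.get(attempt.tool)
    if required is None:
        return Outcome.MALFORMED_CALL  # tool name is not in the catalog
    if any(param not in attempt.arguments for param in required):
        return Outcome.MALFORMED_CALL  # required argument missing
    if attempt.expected_version != current_version:
        return Outcome.STATE_CONFLICT  # state changed since the agent read it
    return Outcome.OK
```

Typed outcomes like these give a recovery policy something to branch on: retry with fixed arguments, re-read state, or tell the user no tool fits.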

There is one raised eyebrow: the paper says the code, server library, synthesized data, and checkpoints will be released soon. Until that happens, Prove is a credible design paper rather than a drop-in artifact. That is fine, but reproducibility is part of the value proposition for infrastructure papers. Builders should track the release before treating the numbers as portable.

The broader lesson is simple. MCP makes tools easy to connect; it does not make agents competent. Competence comes from live state, valid traces, isolated rollouts, rewards that punish waste, and environments that can prove whether the agent actually did the job. Prove is not the final answer for tool-use training, but it is pointing at the right target: stop training agents on laminated API schemas and start training them where the world can push back.

Sources: arXiv, arXiv HTML, Model Context Protocol docs, Berkeley Function-Calling Leaderboard, τ-bench