AsyncTool Shows Tool-Calling Agents Still Don't Know How to Wait

Tool calling is not solved just because a model can emit valid JSON.

AsyncTool lands on the part of agent engineering that demos usually edit out: time. In most benchmark traces, a model calls a tool, gets the result immediately, and continues as if the world were a blocking function call. Real systems are not that polite. CI queues. Search APIs lag. Ticketing systems rate-limit. Browsers hang. Humans approve or cancel in the middle. A useful agent has to keep working while results are pending, remember which result belongs to which task, and resume without confusing stale state for current truth.

The benchmark is simple in concept and brutal in implication. AsyncTool contains 712 multitasking instances where agents receive dual-task and tri-task workloads. Tool responses can be delayed. Tasks can be similar or cross-domain. The evaluated categories include Data Management, Filesystem, Data Generation, MessageAPI, Number Operations, SocialConnect, String Manipulation, TicketPurchase, TradingBot, TravelPlanning, DataFormat, and Machine Operation. The agent is scored at step level, sub-task level, and task level, with extra coordination metrics to capture whether it actually handles asynchronous work efficiently.

The result is a cold shower for anyone treating function-call accuracy as agent competence. GPT-4.1 reaches 96.22 function-call F1 and still only 38.06 task-level overall. Qwen-Max reaches 86.22 function-call F1 and 25.56 overall. Open models show the same cliff: Qwen2.5-32B-Instruct gets 94.24 function-call F1 but only 24.86 overall; DeepSeek-V3.1-Terminus reaches 86.10 function-call F1 and 28.93 overall. The model can know which tool to call and still fail the job.
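A toy illustration of how that gap can open up. This is not AsyncTool's scoring code; the `Step` record, both scoring functions, and the trace are hypothetical, and the only point is that every call can be well-formed while a task still fails because one step consumed a result that belonged to a different task.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One agent step in a hypothetical multitask trace."""
    task_id: str         # task this step is supposed to advance
    call_ok: bool        # right tool, right arguments
    used_result_of: str  # task whose tool result the step actually consumed


def step_accuracy(steps: list[Step]) -> float:
    """Fraction of steps whose call was well-formed, ignoring state mix-ups."""
    return sum(s.call_ok for s in steps) / len(steps)


def task_accuracy(steps: list[Step]) -> float:
    """Fraction of tasks where every step was well-formed AND bound to its own task."""
    tasks = {s.task_id for s in steps}
    passed = {
        t for t in tasks
        if all(s.call_ok and s.used_result_of == t
               for s in steps if s.task_id == t)
    }
    return len(passed) / len(tasks)


# Hypothetical trace: six clean calls, one of which reads task A's result
# while working on task B.
TRACE = [
    Step("A", True, "A"), Step("A", True, "A"), Step("A", True, "A"),
    Step("B", True, "B"), Step("B", True, "A"), Step("B", True, "B"),
]
```

On this trace `step_accuracy(TRACE)` is 1.0 and `task_accuracy(TRACE)` is 0.5. Real benchmarks score far more carefully, but the shape of the failure is the same: call quality and task completion are different measurements.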

The missing skill is coordination, not syntax

Closed-model task-level scores put GPT-4.1 at 38.06, Gemini 2.5 Pro at 32.44, GPT-4o at 31.74, GPT-5 at 31.32, Qwen-Max at 25.56, and Kimi-K2 at 24.44. That spread is interesting, but the absolute numbers are the story. These systems are not failing because they cannot write a function name. They are failing because asynchronous execution creates state-management pressure.

That distinction matters for product teams. A support agent handling three customer workflows might wait for a billing API, continue investigating logs for another ticket, then receive the billing result after the user has changed the request. A coding agent might start tests, inspect nearby code while waiting, then need to bind the test output to the correct patch version. A research agent might fire multiple searches and then merge results that arrive out of order. If the model’s only state is a scratchpad full of prose, you have built a race condition with a friendly voice.
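The research-agent case is easy to reproduce. The sketch below uses Python's `asyncio` with a stand-in `fake_search` coroutine (hypothetical, it only sleeps) and made-up call IDs. It shows results arriving in a different order than the calls were issued, and one way to keep each result attached to the call that produced it.

```python
import asyncio


async def fake_search(query: str, delay: float) -> str:
    """Stand-in for a slow search API; the delay simulates network latency."""
    await asyncio.sleep(delay)
    return f"results for {query!r}"


async def run_searches(
    calls: dict[str, tuple[str, float]],
) -> tuple[dict[str, str], list[str]]:
    """Issue all searches at once and collect results as they arrive.

    Every result travels with its call ID, so arrival order never decides
    which query a payload belongs to.
    """
    async def tagged(call_id: str, query: str, delay: float) -> tuple[str, str]:
        return call_id, await fake_search(query, delay)

    tasks = [
        asyncio.create_task(tagged(call_id, query, delay))
        for call_id, (query, delay) in calls.items()
    ]
    results: dict[str, str] = {}
    arrival_order: list[str] = []
    for finished in asyncio.as_completed(tasks):
        call_id, payload = await finished
        arrival_order.append(call_id)
        results[call_id] = payload
    return results, arrival_order
```

Calling `asyncio.run(run_searches({...}))` with three calls whose delays differ returns the results keyed by call ID plus the order in which they landed. A merge step that reads from the keyed dictionary is safe; one that assumes "the first result answers the first question" is the race condition described above.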

AsyncTool therefore exposes an evaluation gap. Many “tool-use” benchmarks reward correct calls in isolation. That is table stakes. The harder question is whether the runtime and model together can maintain a dependency graph: task A is waiting on tool result X; task B can proceed; result Y belongs to a now-cancelled branch; result Z arrived after permissions changed. Without that structure, the model has to infer the entire scheduler from conversation history. Sometimes it will. Often it will not. The benchmark numbers show what “often” looks like.
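A minimal sketch of that dependency structure, assuming a runtime that tracks which pending results each task is blocked on. `TaskNode`, `runnable`, and `resolve` are illustrative names, not part of AsyncTool or any library.

```python
from dataclasses import dataclass, field


@dataclass
class TaskNode:
    """A task plus the IDs of the tool results it is still blocked on."""
    task_id: str
    waiting_on: set[str] = field(default_factory=set)
    cancelled: bool = False


def runnable(tasks: list[TaskNode]) -> list[str]:
    """Tasks that can make progress right now: live and not blocked."""
    return sorted(
        t.task_id for t in tasks if not t.cancelled and not t.waiting_on
    )


def resolve(tasks: list[TaskNode], result_id: str) -> list[str]:
    """Mark a tool result as arrived and return the tasks it wakes.

    Results owed to cancelled tasks are cleared but wake nothing.
    """
    woken = []
    for t in tasks:
        if result_id in t.waiting_on:
            t.waiting_on.discard(result_id)
            if not t.cancelled and not t.waiting_on:
                woken.append(t.task_id)
    return woken
```

With task A waiting on `X`, task B unblocked, and a cancelled task C waiting on `Y`, `runnable` returns only B; resolving `Y` wakes nothing, and resolving `X` wakes A. The model never has to infer any of this from the transcript.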

Do not outsource your scheduler to the model

The engineering lesson is not “wait for better models.” Better models will help, but asynchronous coordination should be a runtime contract. Give every task a stable ID. Give every tool call a correlation ID. Store pending results outside the prompt. Track dependency edges. Record deadlines, cancellations, policy state, and resume points. When a tool result returns, the runtime should know which task branch it can wake and what has changed since the call was made.
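One possible shape for that contract, as a sketch rather than a reference design. The `Runtime` class, its method names, and the `resume_point` strings are assumptions made for illustration; the idea is simply that pending calls live in a registry keyed by correlation ID, outside the prompt.

```python
import uuid
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass(frozen=True)
class PendingCall:
    correlation_id: str
    task_id: str
    tool: str
    resume_point: str  # label for where the task should pick up again


@dataclass
class Runtime:
    """Keeps pending tool calls outside the prompt, keyed by correlation ID."""
    pending: dict[str, PendingCall] = field(default_factory=dict)
    cancelled_tasks: set[str] = field(default_factory=set)

    def dispatch(self, task_id: str, tool: str, resume_point: str) -> str:
        """Register an in-flight call and return its correlation ID."""
        correlation_id = uuid.uuid4().hex
        self.pending[correlation_id] = PendingCall(
            correlation_id, task_id, tool, resume_point
        )
        return correlation_id

    def cancel(self, task_id: str) -> None:
        """Abandon a task; late results for it will be dropped."""
        self.cancelled_tasks.add(task_id)

    def deliver(self, correlation_id: str, result: Any) -> Optional[dict]:
        """Route a returning result to its task, or drop it.

        Returns a clean state summary for the model, or None when the result
        is unknown, a duplicate, or belongs to a cancelled task.
        """
        call = self.pending.pop(correlation_id, None)
        if call is None or call.task_id in self.cancelled_tasks:
            return None
        return {
            "task_id": call.task_id,
            "tool": call.tool,
            "resume_point": call.resume_point,
            "result": result,
        }
```

The summary returned by `deliver` is the only thing the model needs to see. Deadlines, dependency edges, and policy state would be further fields on `PendingCall` in a fuller version.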

Then the model can do what it is good at: decide what to do next given a clean state summary. It should not be responsible for reconstructing an event loop from a chat transcript. That is not intelligence; that is missing infrastructure.

This applies directly to coding agents. If the agent runs tests asynchronously, the test result must be tied to a commit hash or file snapshot. If it launches multiple subagents, each output needs provenance and a merge policy. If it opens a browser tab while editing code, the browser result should not leak into an unrelated task without explicit routing. The more autonomous the agent becomes, the more mundane the runtime needs to be. Boring IDs beat clever prompts.
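For the test-result case, a hedged sketch of snapshot binding: hash the files the tests ran against, store the hash with the result, and refuse to trust the result once the files have moved on. `snapshot_id`, `RunResult`, and `result_is_current` are hypothetical helpers; a real agent working in a git repository would more likely use the commit hash directly.

```python
import hashlib
from dataclasses import dataclass


def snapshot_id(files: dict[str, str]) -> str:
    """Content hash over a set of files, independent of dict ordering."""
    digest = hashlib.sha256()
    for path in sorted(files):
        digest.update(path.encode("utf-8"))
        digest.update(b"\0")
        digest.update(files[path].encode("utf-8"))
        digest.update(b"\0")
    return digest.hexdigest()


@dataclass(frozen=True)
class RunResult:
    """A test outcome bound to the snapshot it was produced from."""
    snapshot: str
    passed: bool


def result_is_current(run: RunResult, files: dict[str, str]) -> bool:
    """True only if the files are unchanged since the tests were started."""
    return run.snapshot == snapshot_id(files)
```

If the agent edits a file while tests are in flight, `result_is_current` returns False when the result arrives, and the runtime can rerun the tests instead of letting a stale green check bless a new patch.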

Latency is also a security problem

Delayed tool feedback is not only an efficiency issue. It is a policy issue. A permission may change while a call is pending. A user may revoke consent. A secret-bearing result may arrive after the model has switched context. A previous branch may be abandoned but still receive API output. If the runtime cannot bind delayed results to the correct policy state, the agent can act on information it should no longer use.
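A minimal sketch of that binding, assuming permissions are modeled as a set of scope strings with a version counter. `Grant` and `may_consume` are illustrative names, and real policy engines are considerably richer than this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grant:
    """Permissions in force at one moment, for one task."""
    scopes: frozenset[str]
    version: int  # bumped whenever the user or an admin changes permissions


def may_consume(required_scope: str, at_call: Grant, at_result: Grant) -> bool:
    """A delayed result is usable only if the scope held at both moments.

    Checking the call-time grant alone misses revocations that happened
    while the call was pending.
    """
    return required_scope in at_call.scopes and required_scope in at_result.scopes
```

If a user revokes `billing:read` while a billing lookup is in flight, `may_consume` returns False when the result lands, and the runtime can discard the payload before the model ever sees it.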

That is why AsyncTool belongs in the governance conversation, not just the benchmark conversation. Tool-call logs should include timestamps, task IDs, inputs, outputs, permissions in force at call time, permissions in force at result time, and the model turn that consumed the result. This sounds bureaucratic until a production agent sends the wrong message, books the wrong trip, or applies a stale patch because it confused which delayed result it was looking at.
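The fields listed above map onto a small record type. The sketch below is one possible layout, not a standard schema; the field names and the `correlation_id` field are assumptions, and timestamps are kept as plain ISO 8601 strings for simplicity.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ToolCallRecord:
    """One audit-log entry per tool call, written when the result is consumed."""
    task_id: str
    correlation_id: str
    called_at: str    # ISO 8601 timestamp of the call
    returned_at: str  # ISO 8601 timestamp of the result
    inputs: dict
    outputs: dict
    permissions_at_call: list[str]
    permissions_at_result: list[str]
    consumed_by_turn: int  # index of the model turn that read the result


def to_log_line(record: ToolCallRecord) -> str:
    """Serialize to one JSON line with stable key order for diffing."""
    return json.dumps(asdict(record), sort_keys=True)


def permissions_changed(record: ToolCallRecord) -> bool:
    """Flag records where policy shifted while the call was pending."""
    return set(record.permissions_at_call) != set(record.permissions_at_result)
```

Filtering a log for `permissions_changed` records gives a reviewer the short list of calls where a delayed result crossed a policy boundary, which is where the post-incident questions usually start.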

The AsyncTool GitHub repo was new at the time of writing: MIT licensed, created on 2026-05-27, updated 2026-05-29, with benchmark data and evaluation scripts present and a tiny early footprint of 7 stars. There was no meaningful Hacker News discussion. That is fine. The best infrastructure benchmarks often arrive before the discourse finds vocabulary for the failure mode.

Practitioners should use AsyncTool as a checklist. Do your agent evals include delayed tools? Do they include multiple simultaneous tasks? Do you score task completion separately from function-call correctness? Can your runtime cancel a pending branch? Can it prove which result influenced which action? If not, your “agent” may just be a synchronous demo wearing a hard hat.

The editorial take is blunt: tool calling is not solved until agents can wait, switch, resume, and prove they did not mix up the state. AsyncTool does not make that problem glamorous. It makes it measurable. That is better.

Sources: arXiv, AsyncTool GitHub repository, arXiv HTML.