The Evolution of Tool Use in LLM Agents: From Single-Tool Call to Multi-Tool Orchestration
A comprehensive new survey traces the full arc of how LLM agents have learned to use tools — from the earliest experiments asking "did the model pick the right tool?" all the way to today's multi-agent systems navigating long, branching trajectories with real-world side effects. The paper establishes a five-level taxonomy: single-tool selection, sequential multi-tool calling, parallel execution, nested hierarchical orchestration with sub-agents, and adaptive orchestration that rewrites its own plan based on live tool outputs. Each level has its own benchmarks, training approaches, and failure modes, and the survey maps all of them with notable clarity.
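The taxonomy lends itself to a simple encoding. As a minimal sketch, here is one way the five levels might be represented and inferred from a logged tool-call trajectory; the level names follow the survey, but the trace fields (`replanned`, `spawned_subagent`, `parallel_group`) are illustrative assumptions, not the paper's schema:

```python
from enum import IntEnum

class OrchestrationLevel(IntEnum):
    # Hypothetical encoding of the survey's five levels
    SINGLE_TOOL = 1   # one call, judged on invocation correctness
    SEQUENTIAL = 2    # ordered chain, each output feeds the next call
    PARALLEL = 3      # independent calls executed concurrently
    NESTED = 4        # sub-agents run their own tool loops
    ADAPTIVE = 5      # the plan is rewritten from live tool outputs

def classify(trajectory: list[dict]) -> OrchestrationLevel:
    """Rough heuristic: infer the level from a trace of tool events.
    The field names checked here are illustrative, not a real schema."""
    if any(step.get("replanned") for step in trajectory):
        return OrchestrationLevel.ADAPTIVE
    if any(step.get("spawned_subagent") for step in trajectory):
        return OrchestrationLevel.NESTED
    if any(step.get("parallel_group") for step in trajectory):
        return OrchestrationLevel.PARALLEL
    if len(trajectory) > 1:
        return OrchestrationLevel.SEQUENTIAL
    return OrchestrationLevel.SINGLE_TOOL
```

Because the levels are ordered (`IntEnum`), a classifier like this can double as a severity scale when triaging agent traces: higher levels imply more places for a trajectory to go wrong.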
The central insight is a shift in what the hard problem actually is. Early tool-use research was about invocation correctness — did the agent call the right function? At levels four and five, the challenge becomes orchestration correctness: can the agent reason coherently across a long trajectory with incomplete information, compounding state, and irreversible side effects? Safety, cost, and verifiability only become urgent design constraints at those higher levels, which is where most production systems now live. For teams hitting reliability walls with their coding agents, this taxonomy offers a useful diagnostic frame — it helps identify which level of orchestration complexity is the source of the problem, and what the documented failure patterns look like at that level.
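To make the level-five challenge concrete, here is a hedged sketch of an adaptive orchestration loop: the remaining plan is rewritten after every tool result, and any tool flagged as irreversible must pass a confirmation gate before executing. Every name here (`run_adaptive`, `replan`, `confirm`, the `irreversible` flag) is hypothetical, chosen for illustration rather than taken from the survey:

```python
def run_adaptive(plan, tools, replan, confirm, max_steps=20):
    """Execute a plan, replanning after each result.

    plan: list of {"tool": name, "args": {...}} steps (hypothetical format)
    tools: name -> {"fn": callable, "irreversible": bool} registry
    replan: (remaining_plan, history) -> new remaining plan
    confirm: step -> bool, gating irreversible side effects
    """
    history = []
    for _ in range(max_steps):
        if not plan:
            break
        step = plan.pop(0)
        spec = tools[step["tool"]]
        # Safety constraint from the higher levels: irreversible
        # actions are blocked unless explicitly confirmed.
        if spec.get("irreversible") and not confirm(step):
            history.append({"step": step, "status": "blocked"})
            continue
        result = spec["fn"](**step.get("args", {}))
        history.append({"step": step, "result": result})
        # The adaptive part: rewrite the rest of the plan in light
        # of what the tool actually returned.
        plan = replan(plan, history)
    return history
```

In this framing, a level-two (sequential) agent is just the degenerate case where `replan` returns the plan unchanged; the diagnostic question the survey raises is whether a given failure comes from the individual calls or from the replanning and gating logic around them.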