Your Agent's Tool Library Is a Software Artifact — and It's Rotting While You Watch Task Completion Scores
Most agentic coding pipelines measure success by one metric: did the task complete? A new benchmark called EvolveTool-Bench exposes what that metric hides. When agents are allowed to create their own tools at runtime — writing helper functions, API wrappers, and data processors on the fly — those tools accumulate into a library. And that library is quietly rotting.
The research analyzed systems across three domains where agents must execute self-generated tools, and found that systems with nearly identical task completion rates (63–68%) differed by up to 18% on library-health metrics: reuse, redundancy, regression stability, and safety. Two pipelines that look equivalent on the standard scoreboard are building tool collections with dramatically different long-term maintainability profiles. The standard metric simply can't see the difference.
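EvolveTool-Bench's exact metric definitions aren't reproduced here, but the flavor of a library-health audit is easy to sketch. The following is a toy illustration (the `Tool` record and both formulas are hypothetical, not the benchmark's): reuse as the fraction of tools invoked more than once after creation, redundancy as the fraction whose source duplicates an earlier tool.

```python
from dataclasses import dataclass

@dataclass
class Tool:
    name: str
    body: str        # source code the agent generated
    call_count: int  # invocations after creation

def library_health(tools: list[Tool]) -> dict[str, float]:
    """Toy health report covering only reuse and redundancy."""
    total = len(tools)
    # Reuse: a tool that was only ever called once was write-once junk.
    reused = sum(1 for t in tools if t.call_count > 1)
    # Redundancy: whitespace-normalized bodies that duplicate an earlier tool.
    seen: set[str] = set()
    dupes = 0
    for t in tools:
        key = " ".join(t.body.split())
        if key in seen:
            dupes += 1
        seen.add(key)
    return {"reuse_rate": reused / total, "redundancy_rate": dupes / total}

lib = [
    Tool("fetch_json",    "def f(u): return get(u).json()", call_count=7),
    Tool("fetch_json_v2", "def f(u): return get(u).json()", call_count=1),
    Tool("parse_csv",     "def p(s): return s.splitlines()", call_count=1),
]
print(library_health(lib))
```

A report like this makes the divergence visible: two agents with identical completion rates can score very differently here, because neither number is touched by whether tasks succeeded.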
One structural cause: without explicit quality pressure in the training objective, agents almost universally prefer writing a new tool to reusing an existing one. Each iteration adds redundancy, and over time the library grows sprawling and fragile: hidden debt that only surfaces when something downstream breaks. Whether a system evolves tools at the code level or the strategy level also shapes how quickly the rot sets in.
For engineering teams running agentic coding pipelines that accumulate tools across sessions or tasks, the warning is direct: your task completion metrics are not measuring tool library health, and the two don't move together. EvolveTool-Bench provides a concrete framework for auditing what your agents are actually building over time, before the hidden debt surfaces as a production failure.