TRAJEVAL: All Coding Agents Examine 22× More Functions Than Necessary — A Diagnostic Framework That Breaks Open the Black Box
A new diagnostic framework called TRAJEVAL has produced one of the clearest quantifications of coding-agent inefficiency yet recorded: across 16,758 agent trajectories, every agent, regardless of architecture or underlying model, examined roughly 22 times more functions than the reference patch actually required. The framework decomposes agent execution into three interpretable stages (Search, Read, Edit) and measures precision and recall at each stage, making it possible to pinpoint exactly where an agent goes wrong rather than merely observing that the final patch was incorrect.
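The stage-wise decomposition can be made concrete with a small sketch. Everything below is illustrative, not the paper's actual API: the function names, the example trajectory, and the flat `stage -> set of functions` representation are all assumptions.

```python
# Hypothetical sketch of per-stage precision/recall for one agent trajectory.
# All names and data are invented for illustration.

def stage_metrics(agent_items: set[str], reference_items: set[str]) -> tuple[float, float]:
    """Precision: fraction of items the agent touched that were actually needed.
    Recall: fraction of needed items the agent actually touched."""
    if not agent_items:
        return 0.0, 0.0
    hits = agent_items & reference_items
    precision = len(hits) / len(agent_items)
    recall = len(hits) / len(reference_items) if reference_items else 0.0
    return precision, recall

# Functions the reference patch actually modified:
reference = {"parse_config", "validate_schema"}

# Functions the agent touched at each stage of its trajectory:
trajectory = {
    "search": {"parse_config", "validate_schema", "load_file", "main"},
    "read":   {"parse_config", "load_file", "main", "log_error"},
    "edit":   {"parse_config", "main"},
}

for stage, items in trajectory.items():
    p, r = stage_metrics(items, reference)
    print(f"{stage}: precision={p:.2f} recall={r:.2f}")
```

In this toy run, search recall is perfect but read precision is low, which is exactly the over-reading pattern the paper quantifies: the agent pulls in far more functions than the patch needs.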
The architecture-specific failure modes uncovered by TRAJEVAL are particularly valuable. GPT-5 correctly localizes the relevant files but then targets the wrong functions for editing. Qwen-32B fails much earlier, at file discovery itself. These are completely different root causes producing the same surface-level failure, a fact that binary Pass@1 metrics conceal entirely. This diagnostic resolution matters because the fix for a model that can't find the right file differs from the fix for a model that finds the file but edits the wrong function.
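The two failure signatures can be separated with a toy example. All file names, function names, and agent behaviors here are invented; they only mimic the patterns the paper attributes to the two models.

```python
# Hypothetical sketch: the same failed patch (Pass@1 = 0) can hide two
# different root causes once the trajectory is split by stage and granularity.

def recall(touched: set, needed: set) -> float:
    """Fraction of needed items the agent actually touched."""
    return len(touched & needed) / len(needed) if needed else 0.0

needed_files = {"config.py"}
needed_funcs = {"config.py::parse"}

# Agent A (GPT-5-like pattern): finds the right file, edits the wrong function.
a_files = {"config.py"}
a_edits = {"config.py::validate"}

# Agent B (Qwen-32B-like pattern): never discovers the right file at all.
b_files = {"utils.py", "main.py"}
b_edits = {"utils.py::helper"}

print("A file recall:", recall(a_files, needed_files))  # 1.0: localization OK
print("A edit recall:", recall(a_edits, needed_funcs))  # 0.0: wrong function
print("B file recall:", recall(b_files, needed_files))  # 0.0: discovery failed
```

Both agents score zero on the final patch, but the stage-level recalls point at different interventions: better edit targeting for A, better search for B.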
The paper doesn't stop at diagnosis. Using TRAJEVAL's trajectory-stage signals as real-time feedback during agent execution improved two state-of-the-art models by 2.2 to 4.6 percentage points while simultaneously cutting inference costs by 20 to 31 percent. The cost reduction comes directly from eliminating unnecessary reads: the same 22× over-reading that makes agents expensive is also what makes their edits imprecise. Teams can instrument their own agents with these trajectory signals today, without waiting for better base models.