Agents Beat Two Decades of Human Bug-Hunting Research — The Simple Trick That Does It

Bug attribution, the task of figuring out which commit introduced a defect, has been one of software engineering's most stubborn unsolved problems. The SZZ algorithm, which just received the 2026 ACM SIGSOFT Impact Award, was published twenty years ago and still underpins most fault-attribution research today. The best traditional approach published in 2025 pushed the F1 score from 0.54 to 0.64 on the gold-standard Linux kernel dataset, a modest 0.10 improvement that took years of careful engineering. A new paper from Niklas Risse and colleagues obliterates that ceiling in a single step: a simple agentic workflow reaches 0.81, a 0.17 gain that comfortably exceeds what the 2025 work achieved.

The mechanism is almost disarmingly simple. The agent reads a fix commit's diff and message, derives short "greppable" pattern strings from the semantic content of the change, then iterates over candidate commits using those patterns as targeted search queries. The key is that agents are good at constructing precise, context-aware search expressions that a static pattern matcher cannot replicate, and, crucially, they can adapt those expressions mid-trajectory when the initial ones fail. This adaptive, semantically grounded search is what pushes performance so far beyond hand-engineered baselines.
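To make the loop above concrete, here is a minimal Python sketch. It is not the paper's implementation. The agent's reasoning step is replaced by a deliberately crude heuristic (pick long identifiers from the lines the fix removed), and the names `Commit`, `derive_patterns`, `rank_candidates`, and `search_with_fallback`, as well as the sample diff and commit hashes, are all hypothetical. The fallback loop only gestures at the "adapt when the first patterns fail" behavior by relaxing the minimum token length.

```python
import re
from dataclasses import dataclass


@dataclass
class Commit:
    sha: str
    added_lines: list[str]  # lines this candidate commit introduced


def derive_patterns(fix_diff: str, min_len: int = 6, max_patterns: int = 5) -> list[str]:
    """Heuristic stand-in for the agent: collect identifiers from removed lines."""
    seen: list[str] = []
    for line in fix_diff.splitlines():
        # Removed lines start with '-'; skip the '---' file header line.
        if not line.startswith("-") or line.startswith("---"):
            continue
        for token in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", line):
            if len(token) >= min_len and token not in seen:
                seen.append(token)
    # Assumption: longer identifiers are more specific, so try them first.
    seen.sort(key=len, reverse=True)
    return seen[:max_patterns]


def rank_candidates(patterns: list[str], candidates: list[Commit]) -> list[tuple[str, int]]:
    """Score each candidate by how many patterns appear in the lines it added."""
    scored = []
    for commit in candidates:
        text = "\n".join(commit.added_lines)
        hits = sum(1 for p in patterns if p in text)
        if hits:
            scored.append((commit.sha, hits))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored


def search_with_fallback(fix_diff: str, candidates: list[Commit]):
    """Try strict patterns first, then relax them if nothing matches."""
    for min_len in (6, 3):
        patterns = derive_patterns(fix_diff, min_len=min_len)
        ranking = rank_candidates(patterns, candidates)
        if ranking:
            return patterns, ranking
    return [], []


# Hypothetical fix diff and candidate history, for illustration only.
fix_diff = """--- a/drivers/foo.c
+++ b/drivers/foo.c
@@ -10,3 +10,3 @@
-    buf = kmalloc(len, GFP_KERNEL);
+    buf = kzalloc(len, GFP_KERNEL);
"""

candidates = [
    Commit("aaa111", ["    buf = kmalloc(len, GFP_KERNEL);"]),
    Commit("bbb222", ["    ptr = vmalloc(size);"]),
    Commit("ccc333", ["    page = alloc_page(GFP_KERNEL);"]),
]

patterns, ranking = search_with_fallback(fix_diff, candidates)
print(patterns)
print(ranking)
```

In a real workflow the pattern-derivation step is where the agent earns its keep: it would read the commit message and diff, choose expressions that capture the semantics of the bug, and rewrite them when a search comes back empty or too noisy. The substring ranking here is only a placeholder for that judgment.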

The implications extend well beyond bug attribution. Any task where the answer is buried in a large candidate corpus (log triage, incident root-cause analysis, dependency audits, change impact analysis) may benefit from the same pattern. Agents-as-search-query-generators is a powerful mental model: rather than encoding domain knowledge into fixed rules, you let the agent derive search expressions from the specific context of each query. The 0.64 → 0.81 jump on a 20-year benchmark is the clearest single-number evidence yet that a well-designed agentic workflow can obsolete highly tuned traditional systems.

Read the full paper on arXiv →