AI Code Review Catches Only 15–31% of Real Issues — and Adding More Context Makes It Worse
A new benchmark called SWE-PRBench evaluated 8 frontier AI models on 350 real pull requests with human-annotated review quality, and the results offer two calibrations every engineering team running AI code review should know. The first is a baseline number: at the diff-only level, frontier models detect 15–31% of the issues that human reviewers flag. That's a useful accelerator, roughly a fifth to a third of human-level coverage, but it's not a replacement, and teams should understand the gap before wiring AI review into CI pipelines with high confidence. The second finding inverts the intuition of almost everyone who has tuned an AI code review system: providing more context makes performance worse, not better, across all 8 tested models.
The benchmark tested three frozen context configurations: diff only, diff plus full file content, and full context with AST-extracted function dependencies and import-graph resolution. All 8 models degraded monotonically from the first configuration to the third. The proposed mechanism is attention dilution: as the context window fills with resolved files and graph data, the model's ability to detect issues requiring cross-file reasoning collapses, even though that extra context is exactly what cross-file reasoning would seem to need. A structured 2,000-token diff-with-summary outperformed both richer configurations. If you're currently sending full repository context or resolved AST graphs to your code review agent, this paper provides evidence-based justification for rolling that back. The effect isn't a model-specific quirk: it holds across every model tested, which makes it a structural finding.
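To make the "structured diff-with-summary" idea concrete, here is a minimal sketch of how a review pipeline might assemble a trimmed context under a fixed token budget. Everything here is an assumption for illustration: the function names, the summary-then-diff layout, and the characters-divided-by-four token estimate are not from the paper, and a real system would use the target model's tokenizer.

```python
def build_review_context(diff: str, summary: str, budget_tokens: int = 2000) -> str:
    """Assemble a trimmed 'diff-with-summary' review prompt under a rough
    token budget. Hypothetical sketch: token counts are approximated as
    len(text) // 4, and the section layout is an assumption, not the
    benchmark's exact format."""

    def approx_tokens(text: str) -> int:
        # Crude heuristic: ~4 characters per token for English/code.
        return len(text) // 4

    header = f"## Change summary\n{summary}\n\n## Diff\n"
    remaining = budget_tokens - approx_tokens(header)

    kept_lines = []
    used = 0
    for line in diff.splitlines():
        cost = approx_tokens(line) + 1  # +1 for the newline
        if used + cost > remaining:
            # Truncate rather than spill past the budget; attention
            # dilution suggests the tail is worth less than the budget.
            kept_lines.append("... [diff truncated to fit budget]")
            break
        kept_lines.append(line)
        used += cost

    return header + "\n".join(kept_lines)
```

The design choice worth noting is that the summary is placed first and never truncated: under the attention-dilution finding, the high-level framing of the change appears to matter more than exhaustive raw context.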
The broader implication extends beyond code review: any agent task where you're unsure whether more context helps should be empirically tested against a trimmed alternative, not assumed to benefit from additional information. The attention dilution mechanism is domain-agnostic, and SWE-PRBench is now a reproducible harness for measuring it in the code review context specifically.
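The recommendation to test context configurations empirically, rather than assume more is better, can be sketched as a small A/B harness. This is a hypothetical shape, not the SWE-PRBench API: `review_fn` stands in for an actual model call, and the PR records with `annotated_issues` sets mimic human-annotated ground truth.

```python
from typing import Callable


def issue_recall(flagged: set[str], annotated: set[str]) -> float:
    """Fraction of human-annotated issues the reviewer flagged."""
    if not annotated:
        return 0.0
    return len(flagged & annotated) / len(annotated)


def compare_contexts(
    review_fn: Callable[[str, str], set[str]],
    prs: list[dict],
    configs: list[str],
) -> dict[str, float]:
    """Run the same reviewer under each context configuration and report
    mean recall per configuration. Hypothetical harness: review_fn(config,
    pr_id) returns the set of issue IDs the model flagged for that PR."""
    results = {}
    for config in configs:
        recalls = [
            issue_recall(review_fn(config, pr["id"]), pr["annotated_issues"])
            for pr in prs
        ]
        results[config] = sum(recalls) / len(recalls)
    return results
```

Run over a held-out set of annotated PRs, a harness like this turns "does more context help?" into a per-configuration recall number instead of an intuition.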