Code Review Agent Benchmark (c-CRAB): Current Agents Solve Only 40% of Real PR Reviews

A new benchmark from researchers at the National University of Singapore puts AI code review agents to the test, and the results are humbling. c-CRAB (Code Review Agent Benchmark) is the first evaluation framework designed specifically to measure how well agents review pull requests, as opposed to writing code. Using real human PR reviews as ground truth, the benchmark generates test suites that check whether an agent's review covers the same concerns a human engineer would raise.

Four systems were evaluated: the open-source PR-Agent and the commercial agents Devin, Claude Code, and Codex. Even the best agents solved only about 40% of c-CRAB tasks, a striking gap given their code-generation capabilities. Perhaps more importantly, the research found that agents don't just miss things; they systematically prioritize different concerns than human reviewers do. That distinction matters enormously for any team treating automated review as a quality gate, since it suggests current agents aren't simply incomplete: they're operating on a different set of priorities altogether.

The benchmark's methodology is itself a contribution worth noting. By generating executable test suites from human reviews, c-CRAB opens the door to fully automated evaluation loops for review agents, with no human judge required. For engineering teams building agentic coding pipelines, the 40% figure is a useful calibration point, and the finding points clearly toward human-agent pairing as an architectural necessity rather than a temporary workaround.
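To make the idea of an automated evaluation loop concrete, here is a minimal sketch of what "grading an agent's review against human-derived checks" could look like. This is purely illustrative: the class names, the keyword-matching heuristic, and the scoring function are all assumptions for exposition, not c-CRAB's actual implementation.

```python
# Hypothetical sketch of an automated review-evaluation loop in the
# spirit of c-CRAB: each human review comment is distilled into a
# "concern check", and an agent's review passes a check if it raises
# the same concern. All names and the matching heuristic here are
# illustrative assumptions, not the benchmark's real test generator.
from dataclasses import dataclass

@dataclass
class ConcernCheck:
    """One concern a human reviewer raised, reduced to matchable keywords."""
    description: str
    keywords: tuple[str, ...]  # agent review must mention at least one

    def passes(self, agent_review: str) -> bool:
        text = agent_review.lower()
        return any(kw in text for kw in self.keywords)

def score_review(agent_review: str, checks: list[ConcernCheck]) -> float:
    """Fraction of human-derived concern checks the agent's review covers."""
    if not checks:
        return 1.0
    return sum(c.passes(agent_review) for c in checks) / len(checks)

# Example: two concerns distilled from a hypothetical human PR review.
checks = [
    ConcernCheck("missing null check", ("null", "none check", "nonetype")),
    ConcernCheck("unbounded retry loop", ("retry", "backoff", "infinite loop")),
]
agent_review = "The handler may raise on None input; add a null guard."
print(score_review(agent_review, checks))  # covers 1 of 2 concerns -> 0.5
```

A real harness would need far more robust matching than keyword lookup (the paper's executable test suites are presumably much richer), but even this toy version shows why human ground truth plus automatic checks removes the need for a human judge at evaluation time.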

Read the full article at arXiv (cs.SE + cs.AI) →