Code Review Agent Benchmark (c-CRAB): Current Agents Solve Only 40% of Real PR Reviews
A new benchmark from the National University of Singapore puts today's best AI code review agents to the test, and the results are sobering. c-CRAB (Code Review Agent Benchmark) is the first evaluation framework built specifically around reviewing pull requests rather than writing code. By generating automated test suites from real human PR reviews, the researchers created a rigorous way to check whether an agent's feedback actually covers the same concerns a human reviewer would raise. Four systems were evaluated: the open-source PR-Agent alongside the commercial agents Devin, Claude Code, and Codex. Even the best agents solved only about 40% of the benchmark tasks, a striking gap given how capable these same systems are at code generation.
What makes this more than a simple performance ranking is the nature of the failure. The agents weren't just missing details; they were actively prioritizing different concerns than their human counterparts, suggesting a fundamental difference in how AI and human reviewers read code. That finding has real architectural implications: human-agent pairing in code review isn't just a transitional workaround; it may be a structural necessity for quality-critical pipelines. The evaluation methodology itself is a reusable pattern worth noting: generating held-out test suites from human reviews opens a path toward fully automated review-quality loops, exactly the kind of infrastructure scaling teams will need as agentic coding becomes standard practice.
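To make the evaluation pattern concrete, here is a minimal sketch of a review-coverage check. This is not the c-CRAB implementation (the paper generates executable test suites from human reviews); it only illustrates the underlying idea of scoring an agent's comments against human-raised concerns. All function names, the token-overlap matching, and the threshold are illustrative assumptions.

```python
# Hypothetical sketch: score how many human review concerns an agent's
# comments cover. The real benchmark uses generated test suites; this
# approximation uses simple token-overlap (Jaccard) matching instead.

def tokenize(text: str) -> set[str]:
    # Crude normalization: lowercase, strip punctuation, drop short words.
    return {w.strip(".,:;()").lower() for w in text.split() if len(w) > 2}

def concern_covered(human_concern: str,
                    agent_comments: list[str],
                    threshold: float = 0.3) -> bool:
    """A human concern counts as covered if any agent comment shares
    enough vocabulary with it (Jaccard similarity >= threshold)."""
    h = tokenize(human_concern)
    for comment in agent_comments:
        a = tokenize(comment)
        if h and a and len(h & a) / len(h | a) >= threshold:
            return True
    return False

def coverage_score(human_concerns: list[str],
                   agent_comments: list[str]) -> float:
    # Fraction of human concerns matched by at least one agent comment.
    covered = sum(concern_covered(c, agent_comments) for c in human_concerns)
    return covered / len(human_concerns) if human_concerns else 1.0

human_concerns = [
    "missing null check on user input before dereference",
    "SQL query built by string concatenation risks injection",
]
agent_comments = [
    "Consider adding a null check for the user input value",
    "Rename variable x",
]
print(coverage_score(human_concerns, agent_comments))  # → 0.5
```

A production version would replace token overlap with semantic matching (embeddings or an LLM judge), which is closer in spirit to deriving held-out checks from the human review itself.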