Claude Reviewing Claude Is Structurally Circular — Without Executable Specs, AI Code Review Just Echoes Its Own Mistakes
The most common response to AI-generated code quality problems is deploying an AI reviewer. A new paper argues that response is structurally circular when you don't have executable specifications in place. The argument is not that AI code review is bad — it's that when the reviewing agent and the generating agent share the same training distribution, they share the same blind spots. The reviewer checks code against itself, not against what the code was supposed to do. Without an external reference, both agents reason from the same artifact, and their failures correlate rather than cancel.
The paper supports this with experiments pitting Claude against Claude-generated code, then cross-checking with a panel of four models from three different families. Same-family review consistently catches fewer errors than cross-family review — and neither catches errors at the rate that human experts do. The explanation is structural: errors that are invisible to the generating model tend to be invisible to models trained on similar data, regardless of how capable the reviewer appears on other tasks. The proposed solution is not a better reviewer model but executable specifications — formal contracts, property-based tests, type-checked schemas — that give the reviewing agent an external anchor. With executable specs, the agent detects deviations from a defined reference rather than reasoning about intent from the code surface alone.
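To make the "external anchor" idea concrete, here is a minimal sketch of a property-based check serving as an executable spec. Everything in it is illustrative, not from the paper: `dedupe_preserve_order` stands in for an AI-generated function under review, and `check_properties` encodes what the code was *supposed* to do, independently of how it was written. A reviewer (human or model) that runs this check is comparing the code against a defined reference rather than against its own reading of the source.

```python
import random

# Hypothetical generated function under review: intended to return a
# list's elements de-duplicated, preserving first-seen order.
def dedupe_preserve_order(items):
    seen = set()
    out = []
    for x in items:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out

# Executable spec: properties the output must satisfy regardless of
# implementation. Checked against random inputs, so the verdict does
# not depend on reading (or sharing blind spots with) the code itself.
def check_properties(fn, trials=200):
    for _ in range(trials):
        xs = [random.randint(0, 9) for _ in range(random.randint(0, 20))]
        out = fn(xs)
        assert len(out) == len(set(out)), "output contains duplicates"
        assert set(out) == set(xs), "output changed the set of elements"
        # First-seen order: elements sorted by index of first occurrence.
        assert out == sorted(set(xs), key=xs.index), "order not preserved"
    return True

check_properties(dedupe_preserve_order)
```

A full property-based testing library (e.g. Hypothesis) adds shrinking and smarter input generation, but the structural point is the same: the spec, not the reviewer's intuition, defines correctness.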
The practical implication for teams building agentic CI/CD pipelines is direct: invest in specification infrastructure before deploying AI reviewers, not after. Executable specs are the quality gate that makes AI code review non-circular. Running Claude to review Claude-generated code without them is less reliable than it appears — not because Claude is a poor reviewer, but because the correlation structure of errors between generator and reviewer is an architectural problem that a better model alone cannot fix.
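The pipeline ordering described above can be sketched as a gate step. This is a hypothetical harness, not an API from the paper or from any CI system: the spec suite runs first and must pass before any AI review verdict is considered, so the review step always operates against code already anchored to an external contract.

```python
import subprocess
import sys

# Hypothetical CI gate: run the executable spec suite before the AI
# review step. If the specs fail, reject immediately; the AI reviewer
# never sees (and cannot rubber-stamp) code that violates the contract.
def spec_gate(spec_command):
    result = subprocess.run(spec_command, capture_output=True, text=True)
    if result.returncode != 0:
        print("spec suite failed; rejecting before AI review")
        return False
    print("spec suite passed; proceeding to AI review")
    return True

# Illustrative usage with stand-in commands (a real pipeline would
# invoke its actual spec runner, e.g. a pytest suite of contracts):
passing = spec_gate([sys.executable, "-c", "pass"])
failing = spec_gate([sys.executable, "-c", "raise SystemExit(1)"])
```

The design choice is the ordering, not the tooling: placing the spec gate upstream of the reviewer is what breaks the generator-reviewer error correlation the article describes.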