SlopCodeBench: No Agent Survives Iterative Coding — Code Quality Degrades in 80% of Agent Trajectories Across All 11 Models Tested
A new benchmark called SlopCodeBench has produced what may be the most sobering result in agentic coding research to date: not a single agent from any of the 11 models tested could complete an iterative coding task end-to-end. The highest checkpoint solve rate across all models was just 17.2%. More troubling than the failure rate is what happens to code quality over time — structural erosion rises in 80% of agent trajectories, and verbosity (redundant or duplicated code) appears in 89.8% of runs. Compared against 48 real open-source Python repositories, agent-written code measured 2.2× more verbose, and its erosion deepened with each successive iteration.
What makes this research particularly important for engineering teams is that SlopCodeBench is the first benchmark designed to simulate what developers actually do in production: extend and modify existing code under evolving specifications, repeatedly. Prior benchmarks like SWE-bench measure single-shot correctness, which turns out to systematically miss the degradation that accumulates across multiple engineering cycles. The paper tracked 20 open-source repos over time and found human code stays stable while agent code deteriorates — and crucially, better initial prompts improved quality at the start of trajectories without halting the degradation that followed. If prompts can't fix it, the paper argues, the solution requires architectural changes: better review loops, task decomposition, or fundamentally different agent designs.
For any team shipping agents into production codebases today, this paper introduces the vocabulary and metrics to instrument your own systems: verbosity ratios, structural complexity concentration, and per-trajectory quality signals. The benchmark is language-agnostic and the findings replicated across all 11 models tested — making this a result that's hard to explain away as a model-specific limitation.
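The paper's exact metric definitions aren't reproduced here, but the general idea of a per-trajectory verbosity signal is simple to instrument. As a rough illustration (not the benchmark's actual formula), a crude verbosity ratio could be computed as the fraction of non-blank code lines that are exact duplicates, then tracked at each checkpoint of an agent run:

```python
from collections import Counter

def verbosity_ratio(source: str) -> float:
    """Fraction of non-blank code lines that are exact duplicates.

    A crude stand-in for a verbosity metric; SlopCodeBench's actual
    definition of verbosity (redundant or duplicated code) may differ.
    """
    lines = [ln.strip() for ln in source.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    counts = Counter(lines)
    # Count every line that occurs more than once, including all its copies.
    duplicated = sum(c for c in counts.values() if c > 1)
    return duplicated / len(lines)

# Hypothetical checkpoint snapshots of one agent trajectory.
checkpoints = [
    "x = load()\nvalidate(x)\nsave(x)\n",
    "x = load()\nvalidate(x)\nvalidate(x)\nvalidate(x)\nsave(x)\n",
]
for i, snapshot in enumerate(checkpoints):
    print(f"checkpoint {i}: verbosity = {verbosity_ratio(snapshot):.2f}")
```

A rising curve across checkpoints is the kind of per-trajectory degradation signal the paper argues teams should monitor; a real implementation would use token- or AST-level duplication rather than exact line matches.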