110,000 Real PRs Later: What Happens When Coding Agents Actually Ship Code — Activity, Churn, and Long-Term Survivability

A new large-scale empirical study has done something the agentic coding field has been waiting for: it went outside the benchmarks entirely and examined 110,000 real open-source pull requests from five production coding agents — OpenAI Codex, Claude Code, GitHub Copilot, Google Jules, and Devin — to measure what actually happens when these agents contribute to real software projects. The results are striking, and they carry a practical message that benchmark leaderboards can't convey.

The headline finding from the study, presented at MSR 2026, is a code survival analysis: agent-authored code churns significantly more than human-authored code over time. That is, code written by agents gets edited, refactored, or reverted at a measurably higher rate — a signal that, despite passing review, it's less aligned with the long-term architecture of the codebase. Crucially, agents that received more review comments before merge produced code with better survival rates, suggesting that the human review loop isn't just a safety check — it's a quality mechanism that actually shapes the code toward longer-term viability.
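To make the churn metric concrete, here is a minimal sketch of how line-level survival might be computed. This is not the study's actual methodology — the data model (`MergedLine`), the toy numbers, and the 30-day horizon are all hypothetical assumptions for illustration:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MergedLine:
    """One line of code from a merged PR (hypothetical data model)."""
    merged_day: int                 # day the PR containing this line was merged
    changed_day: Optional[int]      # day the line was later edited/deleted, or None if it survived

def churn_rate(lines: list[MergedLine], horizon: int) -> float:
    """Fraction of merged lines that were edited or removed within `horizon` days."""
    churned = sum(
        1 for line in lines
        if line.changed_day is not None and line.changed_day - line.merged_day <= horizon
    )
    return churned / len(lines)

# Toy data: an agent-authored PR vs. a human-authored PR (made-up numbers).
agent_lines = [MergedLine(0, 5), MergedLine(0, 10), MergedLine(0, 40), MergedLine(0, None)]
human_lines = [MergedLine(0, 80), MergedLine(0, None), MergedLine(0, None), MergedLine(0, None)]

print(churn_rate(agent_lines, horizon=30))  # 0.5 — two of four lines churned within 30 days
print(churn_rate(human_lines, horizon=30))  # 0.0 — no human line churned within the horizon
```

A higher churn rate at a fixed horizon corresponds to lower survival, which is the signal the study attributes more strongly to agent-authored code.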

The five agents also differ meaningfully from each other: in merge frequency, the file types they favor, the back-and-forth they generate before a merge, and their long-term churn rates. For engineering teams making decisions about which agents to deploy and how much autonomous activity to permit without human review, this study offers the closest thing yet to a real-world differentiation dataset — far more actionable than synthetic benchmark scores on curated tasks.

Read the full article at arXiv →