SWE-CI: The First Benchmark Designed to Test Whether AI Agents Can Maintain a Codebase Long-Term
The dominant way to evaluate AI coding agents — give them a bug, see if they fix it — has always felt like grading a developer on a single take-home assignment. SWE-CI is a new benchmark that throws that paradigm out and replaces it with something far closer to how software actually gets built: continuous maintenance over time. Built around 100 real-world repositories, SWE-CI tasks agents with keeping code quality intact across evolution histories averaging 233 days and 71 consecutive commits per repository. Instead of one-shot repair, agents must reason about how a codebase has changed and keep it healthy as requirements evolve.
The mechanics matter here. SWE-CI is structured around the CI loop — the same feedback cycle professional engineers use every day — and it forces agents through dozens of analysis-and-coding iterations per task rather than a single attempt. This design surfaces failure modes that static benchmarks simply can't see: an agent that patches a bug perfectly in isolation might break something three commits later, or drift out of alignment with the project's conventions as context compounds.
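The iterative shape described above can be sketched as a simple loop: for each commit-level task, the agent gets repeated rounds of pass/fail CI feedback rather than a single attempt. This is only an illustration of the idea — `run_ci_loop`, `attempt_patch`, and `difficulty` are hypothetical names, not SWE-CI's actual harness or API:

```python
# Hypothetical sketch of a CI-style evaluation loop; names are
# illustrative, not SWE-CI's real interface.

def run_ci_loop(tasks, attempt_patch, max_iterations=5):
    """Walk a repo's commit-level tasks in order. At each task, let the
    agent iterate against pass/fail CI feedback until the build is green
    or attempts run out. Returns the number of tasks kept green."""
    green = 0
    for task in tasks:
        for attempt in range(1, max_iterations + 1):
            if attempt_patch(task, attempt):  # True means CI passed
                green += 1
                break
    return green

# Toy agent: succeeds once it has seen enough rounds of CI feedback.
tasks = [{"id": 1, "difficulty": 1}, {"id": 2, "difficulty": 3}]
toy_agent = lambda task, attempt: attempt >= task["difficulty"]
print(run_ci_loop(tasks, toy_agent))  # prints 2: both tasks end green
```

The point of the loop structure is that failure is recoverable within a task but compounds across tasks: an agent that merely limps through one commit inherits its own technical debt at the next one.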
The benchmark arrives at a timely moment. With SWE-Bench Verified scores cracking 80% across multiple model families, there's real pressure to ask harder questions. SWE-CI reframes what "good" looks like — not "did it fix the bug?" but "could it replace a long-term engineering contributor?" That's a much tougher standard, and it's probably the right one to be measuring against as the field pushes toward genuine engineering autonomy.