SWE-Bench's Dirty Secret: 77% of Instances Accept Incorrect Patches — STING Fixes the Benchmark
SWE-bench has become the de facto benchmark for evaluating AI coding agents, but a new study reveals a troubling reality: 77% of its instances accept incorrect patches. This means the top coding agents reporting high success rates are often exploiting weak test suites rather than genuinely fixing bugs.

STING introduces a mutation-guided framework that exposes this problem by generating semantically altered variants of ground-truth patches and checking whether the regression tests catch them. The results are stark: existing test suites are too weak to distinguish correct fixes from plausible-but-wrong solutions. When the researchers re-evaluated the top 10 coding repair agents against strengthened test suites, their solve rates dropped by 4.2-9.0%, revealing that a significant portion of reported resolutions exploited benchmark weaknesses rather than fixing the actual issues.

This isn't just an academic concern: it directly affects teams making deployment decisions based on SWE-bench scores. The gap between reported performance and actual capability is large enough to change which agents teams choose to trust with production code.

STING doesn't just identify the problem; it also provides a framework for strengthening test suites in any regression-suite-driven benchmark, not just SWE-bench. For engineering teams building CI-based agent evaluation pipelines, the methodology is directly applicable: identify plausible-but-wrong patches, check whether your tests catch them, and harden your verification suite accordingly. This matters all the more as coding agents take on critical work in production systems, where incorrect patches have real consequences.

The findings also underscore a broader truth in agentic engineering: optimizing for visible metrics (like pass rates) without understanding the underlying quality of those metrics leads to misplaced confidence.
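The hardening loop described above can be sketched in a few lines of Python. Everything here is illustrative, not STING's actual implementation: the toy `ground_truth` patch, the hand-written mutants, and the `kill_rate` metric are stand-ins for real patch mutation and test execution, but the logic mirrors the idea of checking whether a regression suite rejects semantically altered variants of the correct fix.

```python
# Minimal sketch of mutation-guided suite hardening (hypothetical names,
# not STING's API). A "mutant" is a semantically altered variant of the
# ground-truth patch; a strong suite should reject ("kill") every mutant.

def ground_truth(a, b):
    """The correct patch: sum clamped to be non-negative."""
    return max(a + b, 0)

# Plausible-but-wrong variants of the patch. Each one still passes a
# naive happy-path test, which is exactly the failure mode being probed.
mutants = [
    lambda a, b: a + b,           # drops the clamp entirely
    lambda a, b: max(a + b, 1),   # wrong clamp bound
    lambda a, b: abs(a + b),      # clamps the wrong way
]

def run_suite(fn, suite):
    """Return True if fn passes every (args, expected) case."""
    return all(fn(*args) == expected for args, expected in suite)

def kill_rate(suite):
    """Fraction of mutants the suite rejects; higher means stronger."""
    killed = sum(1 for m in mutants if not run_suite(m, suite))
    return killed / len(mutants)

# A weak suite: one happy-path case that every mutant also passes,
# so it cannot distinguish the real fix from the wrong ones.
weak_suite = [((2, 3), 5)]

# A hardened suite adds cases that expose each surviving mutant.
hard_suite = [((2, 3), 5), ((-4, 1), 0), ((0, 0), 0)]

print(kill_rate(weak_suite))  # 0.0 — all mutants survive
print(kill_rate(hard_suite))  # 1.0 — every mutant is killed
```

The key design point is the feedback loop: surviving mutants tell you exactly which test cases are missing, so the suite is strengthened against the specific plausible-but-wrong patches it currently accepts, rather than by adding tests blindly.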
STING pushes the field toward more rigorous evaluation, shifting the focus from whether an agent can complete a task to whether it completes it correctly. Read more about STING's mutation-guided framework for hardening coding agent benchmarks at the original source.