SWE-Bench Leaderboard March 2026: Claude Opus 4.5 Hits 80.9%, Open-Weight Model Cracks 80%

The March 2026 SWE-Bench Verified leaderboard tells a clear story: the 80% threshold that seemed distant a year ago is now being crossed by multiple model families at once. Claude Opus 4.5 leads at 80.9%, up from roughly 65% twelve months prior. Gemini 3.1 Pro surged to 80.6% to claim the third spot. Most notably, MiniMax M2.5 reached 80.2%, the first time an open-weight model has cleared that bar and a signal that frontier-level coding-agent performance is no longer the exclusive domain of closed, proprietary systems.

The optimism comes with an important asterisk. Scale AI's harder SWE-Bench Pro, designed to surface real-world code-quality reasoning rather than pattern-matched bug fixes, tells a different story: top agents score only around 23%. The gap between 80% on the standard benchmark and 23% on the harder one isn't a rounding error; it's a structural reminder that current agents still hit a ceiling when problems get genuinely messy and context-dependent, the way most real engineering work actually is.

Taken together, the two numbers are a useful calibration. The rapid convergence above 80% on SWE-Bench Verified suggests the benchmark may be approaching saturation as a meaningful signal, while SWE-Bench Pro and the newly introduced SWE-CI point toward where the real frontier challenges lie: not one-shot bug repair, but sustained, reliable performance across the full complexity of production software.

Read the full article at marc0.dev →