
The Real LLM Rankings Story Is the Gap Between Prestige and Production

Anatoliy Kolodkin

28 Apr 2026 • 4 min read

There are two LLM rankings that matter, and right now they are telling different stories.

Arena AI's Text leaderboard, the one benchmark that developers actually trust, is basically a frozen lake this week. Anthropic holds four of the top five positions. Claude Opus 4.7 Thinking sits at 1503, and its stablemate 4.6 Thinking is statistically tied with it at the same score. No one landed a meaningful punch overnight. No one displaced anyone in the visible top 20. The frontier, as measured by head-to-head human preference voting, is stable.

Then there's OpenRouter.

OpenRouter is not a benchmark. It's a routing layer — the infrastructure plumbing that moves billions of tokens worth of actual production traffic through models from dozens of providers. Its rankings are not about who looks smartest in a controlled eval. They're about who builders are actually paying to use, right now, at scale. And that board is not frozen at all.

Tencent's Hy3 preview (free) debuted at #18 this week, displacing Qwen3.6 Plus from the visible top 20. Step 3.5 Flash climbed two spots to #8, driven by 721 billion tokens in weekly volume and 98% week-over-week growth. Meanwhile, MiniMax M2.5 — which looked like a contender just days ago — dropped from #9 to #12, the sharpest fall in the current top 20. MiMo-V2-Pro also fell two places despite posting 61% weekly growth.

That last detail is the tell. Usage is growing, yet rank is slipping. That means the models that overtook MiMo-V2-Pro grew faster still, and routed traffic is concentrating around a smaller set of winners even as the underlying demand diversifies. The mid-table is getting crowded in a way that benefits the top more than it helps the climbers.
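To make the "growing but slipping" mechanic concrete, here is a minimal sketch. The model names, volumes, and two of the three growth rates are invented for illustration and are not OpenRouter data; only the 61% figure echoes the article. It shows that a model can post strong growth and still lose two places if the models just below it grow faster.

```python
# Hypothetical weekly token volumes (in billions) and week-over-week growth.
# These numbers are illustrative assumptions, not OpenRouter figures.
last_week = {"model_a": 400.0, "model_b": 380.0, "model_c": 300.0}
growth = {"model_a": 0.61, "model_b": 0.98, "model_c": 1.40}


def rank_by_volume(volumes):
    """Return {model: rank}, where rank 1 is the highest volume."""
    ordered = sorted(volumes, key=volumes.get, reverse=True)
    return {name: position for position, name in enumerate(ordered, start=1)}


def apply_growth(volumes, rates):
    """Scale each model's volume by its own growth rate."""
    return {name: volume * (1 + rates[name]) for name, volume in volumes.items()}


this_week = apply_growth(last_week, growth)
before = rank_by_volume(last_week)
after = rank_by_volume(this_week)

for name in last_week:
    print(f"{name}: #{before[name]} -> #{after[name]} ({growth[name]:.0%} growth)")
```

In this toy setup, model_a grows 61% and still falls from first to third, because rank depends on everyone else's growth as much as your own.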

The $0 price angle is worth dwelling on. Hy3 preview's entry as a free model is not a coincidence. On OpenRouter, free-tier availability is a distribution mechanism. Developers route traffic to it not because it's necessarily the best model (it's not topping Arena) but because it has zero cost friction for experimentation and production integration. Tencent is buying distribution the same way a SaaS product gives away a free tier: not because it's charity, but because usage begets lock-in. Once a dev workflow is built around a model's API shape, context window behavior, and failure modes, switching costs accumulate even without contractual obligations.

NVIDIA's Nemotron 3 Super (free) sitting at #11 on OpenRouter with 650 billion tokens tells a similar story from a different angle. NVIDIA describes it as a 120B total / 12B active hybrid MoE Mamba-Transformer, with a 1 million token context window and claims of up to 7.5x throughput advantage over Qwen3.5-122B on long-output tasks. On a benchmark, it might not displace Claude Opus 4.7. In production routing, it is finding a real niche among developers who need long-context document processing or agentic pipelines and don't want to pay frontier prices for every token.

This is the split worth understanding: benchmarks measure intelligence, and routing measures economics plus workflow fit. The two do not always agree, and conflating them is how teams end up with expensive production setups that underperform what a cheaper, less-hyped model would have delivered for their specific use case.

Look at the token volumes on OpenRouter's top five. Kimi K2.6: 1.58 trillion tokens routed. Claude Sonnet 4.6: 1.36 trillion. DeepSeek V3.2: 1.28 trillion. Claude Opus 4.7: 1.15 trillion. Gemini 3 Flash Preview: 1.04 trillion. These are not small numbers. This is real infrastructure carrying real product traffic. The fact that DeepSeek V3.2 — which does not top Arena — is third in actual routed volume while Gemini 3 Flash Preview sits fifth despite its strong eval showing tells you that price-performance and availability matter as much as raw capability in the decisions that actually get made.
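For a sense of how tightly packed that top five is, the short sketch below turns the quoted volumes into shares of the combined top-five total. The figures are the ones cited above; the share calculation itself is just arithmetic added here for illustration.

```python
# Routed token volumes in trillions, as quoted in the paragraph above.
top_five = {
    "Kimi K2.6": 1.58,
    "Claude Sonnet 4.6": 1.36,
    "DeepSeek V3.2": 1.28,
    "Claude Opus 4.7": 1.15,
    "Gemini 3 Flash Preview": 1.04,
}

# Combined volume of the five models, then each model's fraction of it.
total = sum(top_five.values())
shares = {name: volume / total for name, volume in top_five.items()}

for name, share in shares.items():
    print(f"{name}: {share:.1%} of top-five volume")
```

The five add up to about 6.41 trillion tokens, and the leader carries roughly a quarter of that. No single model dominates the top of the board.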

One more data point worth sitting with: Anthropic held Claude Opus 4.7 pricing flat at $5 per million input and $25 per million output tokens — the same as 4.6 — while improving advanced software engineering benchmarks. Their cited wins include CursorBench at 70% versus 58% for Opus 4.6, and a 13% coding lift on a 93-task benchmark. That is a meaningful quality improvement at the same price point, and it is exactly the kind of move that reinforces Anthropic's position in the upper tier of both benchmarks and production routing. When you can hold price steady while improving coding performance, you make the "should we re-evaluate our model choice" conversation much harder for competitors to win.
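Those list prices are easy to turn into a back-of-the-envelope budget. The sketch below uses the $5 and $25 per-million rates quoted above; the function name and the example workload (2 million input tokens, 400,000 output tokens) are hypothetical, chosen only to show the arithmetic.

```python
# Prices per million tokens, as quoted for Claude Opus 4.7 in the text.
INPUT_PRICE_PER_M = 5.0
OUTPUT_PRICE_PER_M = 25.0


def token_cost(input_tokens, output_tokens,
               input_price_per_m=INPUT_PRICE_PER_M,
               output_price_per_m=OUTPUT_PRICE_PER_M):
    """Estimate spend in dollars for a given number of input and output tokens."""
    input_cost = (input_tokens / 1_000_000) * input_price_per_m
    output_cost = (output_tokens / 1_000_000) * output_price_per_m
    return input_cost + output_cost


# Hypothetical workload: 2M input tokens and 400K output tokens.
example_cost = token_cost(2_000_000, 400_000)
print(f"Estimated cost: ${example_cost:.2f}")
```

Because output tokens cost five times as much as input tokens at these rates, the 400,000 output tokens in this example cost as much as the 2 million input tokens. That ratio is why output-heavy agentic workloads are where cheaper models get a serious look.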

The community reaction backs this up, at least on the eval side. The Claude Opus 4.7 launch drew roughly 1,959 points on Hacker News — the kind of engagement that only happens when developers think a model might change their day-to-day tooling. NVIDIA's Nemotron 3 Super, by contrast, got around 13 HN points. The open-source model enthusiasts still exist, but they need more than a good benchmark to translate that enthusiasm into production traffic. Distribution, integration points, and default positioning in popular frameworks do the work that raw performance cannot do alone.

So what's the practical read for someone building this week?

If you're choosing a model for a production AI feature today, the question is not "who won Arena this week." It's "which model am I confident will stay competitive in routing costs, remain available, and improve without requiring me to re-integrate every six weeks." The stability at Arena's top is actually a useful signal: it means you can pick Anthropic's top tier and have reasonable confidence the choice won't be made obsolete by a surprise displacement in the next few weeks. But the OpenRouter volatility below that tells you that cost-conscious and workflow-specific use cases have a wider and more dynamic set of viable options, and those are worth evaluating on their actual economics, not just their benchmark scores.

The mid-table movers — Step 3.5 Flash, Nemotron 3 Super, Hy3 preview — are not threatening the frontier leaders. They are eating into the second-tier production workloads where developers are price-sensitive, need long contexts, or are building agentic flows that generate high token volumes. That is a real and growing market, and the fact that these models are gaining usage against models that are also growing suggests the total demand is expanding faster than even the leaders can absorb it. That is a healthy market signal, and it means the "right model" question is more nuanced than it was six months ago.

The frozen Arena leaderboard is not boring. It's a sign that the competition has moved to a different layer: not raw intelligence, but intelligence delivered at the right price, latency, and context length for the specific job. That is a harder and more interesting problem, and it's playing out right now on OpenRouter, not on the eval boards.

Sources: LM Arena Leaderboard, OpenRouter Rankings, Anthropic Claude Opus 4.7, NVIDIA Nemotron 3 Super
