The Best Model Isn’t the Most-Used Model

Anatoliy Kolodkin

07 May 2026 • 5 min read

The interesting thing about this week's LLM rankings is not that Anthropic still owns the top of the benchmark table. It does. Claude Opus 4.7 Thinking is still #1 on Arena Text at 1503 Elo, Claude Opus 4.6 Thinking is still #2 at 1502, and Claude Opus 4.6 is still #3 at 1497. The benchmark story is basically frozen.

The usage story is not frozen at all.

On OpenRouter, Tencent's Hy3 preview has become the most-used model by token volume, processing 3.66 trillion tokens this week. That is up from 3.03 trillion the week before, a gain of roughly 21%. More importantly, Hy3 is now doing more volume than Kimi K2.6 and Claude Sonnet 4.6 combined. That is not a leaderboard footnote. That is the market telling us something the Elo table does not.

The simple version: the best model is not necessarily the model people actually use. In 2026, that distinction matters more than another one-point Elo move.

Benchmarks still crown Claude. Workloads are voting for cheap.

Arena's text leaderboard is stable enough to be boring in the way only a mature benchmark can be boring. Anthropic holds the top three spots, with Claude Opus 4.7 Thinking at 1503 Elo, Claude Opus 4.6 Thinking at 1502, and Claude Opus 4.6 at 1497. Google is close behind with Gemini 3.1 Pro Preview at 1493, while Claude Opus 4.7 sits at 1491, tied with Meta's Muse-Spark. OpenAI's GPT-5.5 High is #7 at 1488.

The code leaderboard is even more Anthropic-heavy. Claude Opus 4.7 Thinking moved from 1568 to 1570 Elo and remains #1. Claude Opus 4.7 is #2 at 1561, Claude Opus 4.6 Thinking is #3 at 1548, and Claude Opus 4.6 is #4 at 1543. Anthropic owns six of the top ten code slots, and the thinking variants continue to outperform their non-thinking siblings by roughly 5 to 9 Elo points.

If you are selecting a model for a hard reasoning task, a production coding agent, or anything where correctness costs more than tokens, that still matters. Claude is not winning these tables by accident. The consistency across text and code suggests Anthropic's top models remain the safest default when you need quality under ambiguity.

But if you run a product with millions or billions of routine calls, the benchmark table is not the only table that matters. OpenRouter's usage rankings are showing a different layer of the market: not the model people admire, but the model people can afford to call all day.

Hy3 is the store-brand moment for LLMs

Hy3 preview is the story because it exposes the split between benchmark prestige and deployment gravity. According to the research brief, Hy3 sits at roughly 1485 Elo on Arena Text, a few points behind GPT-5.5 High's 1488 at #7: competitive, useful, not best-in-class. Yet on OpenRouter it is #1 by volume at 3.66 trillion tokens, ahead of Kimi K2.6 at 1.80 trillion, Claude Sonnet 4.6 at 1.34 trillion, and Gemini 3 Flash Preview at 974 billion.

That shape should look familiar to anyone who has built infrastructure under a budget. Most systems do not need the best possible model for every call. They need the cheapest model that clears the quality bar for the specific job: summarize this support ticket, classify this message, rewrite this paragraph, generate a first-pass SQL query, produce boilerplate, extract fields from a document, explain an error log. For a large class of tasks, “good enough and free” beats “excellent and expensive” before the architect has finished drawing the routing diagram.

That does not mean Hy3 is secretly better than Claude. It means distribution and pricing can move faster than quality. OpenRouter's free-tier access removes the adoption tax. Developers can test the model without procurement, without a sales call, and without feeling a meter running in their head. Once a model becomes the default for low-risk workloads, it starts accumulating real usage before the benchmark conversation catches up.

This is the store-brand moment for LLMs. The premium product still wins blind taste tests. The store brand wins the grocery bill.

DeepSeek is proving that “Flash” is no longer a secondary tier

Tencent is not the only price-pressure story. DeepSeek now has two models in OpenRouter's top seven: DeepSeek V4 Flash at #6 with 819 billion tokens and 158% week-over-week growth, and DeepSeek V3.2 at #7 with 815 billion tokens and 33% growth. Combined, DeepSeek is doing roughly 1.63 trillion tokens, up from about 1.07 trillion last week.

The detail that matters is not merely that DeepSeek is growing. It is that V4 Flash has overtaken V3.2 in volume. That suggests users are not just adopting DeepSeek as a cheaper alternative to the frontier labs; they are trading down inside the DeepSeek family itself. The lighter variant is becoming the primary choice for many workloads.

That should change how teams design model stacks. Too many LLM integrations still behave like there is one “main model” and maybe a fallback if it fails. The usage data argues for a more explicit routing layer: premium reasoning models for high-stakes work, fast/value models for repeatable transformations, and free or near-free models for bulk processing where errors can be detected downstream.
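To make the three-way split concrete, here is a minimal sketch of a routing decision in Python. The tier names, the `Task` fields, and the `pick_tier` function are all hypothetical, invented for illustration; a real router would likely weigh more signals than two booleans.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    PREMIUM = "premium"  # frontier reasoning model for high-stakes work
    VALUE = "value"      # fast, low-cost model for repeatable transformations
    BULK = "bulk"        # free or near-free model for bulk processing


@dataclass(frozen=True)
class Task:
    kind: str                    # e.g. "classify", "summarize", "code_review"
    high_stakes: bool            # would an error be costly or hard to undo?
    verifiable_downstream: bool  # can a later step detect bad output?


def pick_tier(task: Task) -> Tier:
    """Map a task to a tier; the rules are an assumed policy, not a standard."""
    if task.high_stakes:
        return Tier.PREMIUM
    if task.verifiable_downstream:
        return Tier.BULK
    return Tier.VALUE
```

The ordering of the checks is the design choice: risk is evaluated first, so a high-stakes task never lands on the bulk tier just because something downstream could catch the error.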

Engineers should stop asking, “Which model should we use?” as if the answer is singular. The better question is, “Which parts of the workload deserve the expensive model?” If you cannot answer that, you are probably either overpaying for simple tasks or underestimating the risk of cheap models in places where they can silently corrupt output.

What practitioners should do with this

The first move is measurement. If your application sends all requests to one frontier model, instrument task type, latency, retry rate, human correction rate, and downstream failure rate. Token cost by itself is not enough. A cheaper model that causes a 3% increase in manual review might be more expensive than the premium model. A cheaper model that performs identically on low-risk classification jobs is found money.
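The manual-review point is easy to check with arithmetic. The sketch below folds review cost into a per-request figure. Every number in it (token counts, prices, review rates, cost per review) is made up for illustration and is not taken from any provider's price list.

```python
def effective_cost_per_request(
    tokens_per_request: int,
    price_per_million_tokens: float,
    review_rate: float,
    cost_per_review: float,
) -> float:
    """Token cost plus the expected cost of human review for one request."""
    token_cost = tokens_per_request / 1_000_000 * price_per_million_tokens
    return token_cost + review_rate * cost_per_review


# Illustrative assumptions: 2,000 tokens per request, $2.00 per manual review.
premium = effective_cost_per_request(2_000, 15.0, review_rate=0.02, cost_per_review=2.0)
# The cheap model is 30x cheaper per token but triggers review 3 points more often.
cheap = effective_cost_per_request(2_000, 0.5, review_rate=0.05, cost_per_review=2.0)

print(f"premium: ${premium:.3f}  cheap: ${cheap:.3f}")
```

Under these assumed inputs the premium model costs about $0.070 per request and the cheap one about $0.101, because review labor dominates token spend. Change the review cost or the rates and the answer flips, which is why the instrumentation has to come first.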

The second move is evaluation by workload, not by leaderboard. Arena Elo is useful signal, especially for general preference and coding capability, but it is not your product. Build a small internal eval set from real prompts: the weird support tickets, the messy JSON extraction jobs, the codebase-specific questions, the customer-facing generations where tone matters. Then run Claude, Gemini, GPT, DeepSeek, Hy3, Kimi, and whatever else is cheap enough to tempt you. The winner for your workload may not be the model with the most impressive public score.
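An internal eval does not need a framework to get started. Here is a rough sketch, assuming each model is wrapped as a plain function from prompt to output and each eval case carries its own pass/fail check. The `pass_rates` helper and the stub models are hypothetical stand-ins; real wrappers would call whatever provider SDK you use.

```python
from typing import Callable, Dict, List, Tuple

Model = Callable[[str], str]   # prompt -> model output
Check = Callable[[str], bool]  # model output -> did it pass?


def pass_rates(
    models: Dict[str, Model], cases: List[Tuple[str, Check]]
) -> Dict[str, float]:
    """Run every model over every case and return the fraction passed."""
    if not cases:
        raise ValueError("eval set is empty")
    rates = {}
    for name, model in models.items():
        passed = sum(1 for prompt, check in cases if check(model(prompt)))
        rates[name] = passed / len(cases)
    return rates


# Stub models standing in for real API calls.
def stub_upper(prompt: str) -> str:
    return prompt.upper()


def stub_echo(prompt: str) -> str:
    return prompt


cases = [
    ("refund", lambda out: out == "REFUND"),
    ("ok", lambda out: out.lower() == "ok"),
]
rates = pass_rates({"stub_upper": stub_upper, "stub_echo": stub_echo}, cases)
```

Keeping the checks next to the prompts means the same eval set can be rerun unchanged whenever a new cheap model shows up in the usage rankings.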

The third move is routing with graceful escalation. Start cheap when the task is reversible, structured, or easy to verify. Escalate to a stronger model when confidence is low, output violates schema, retrieval evidence conflicts, or the user is about to see the answer. This is not glamorous architecture. It is just the LLM version of using a cache, a queue, and a database index before buying a bigger machine.
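The escalation rules above can be sketched in a few lines. This version assumes both models return JSON and that the cheap model reports a numeric `confidence` field. That field, the 0.7 threshold, and the function names are all assumptions for illustration; self-reported confidence is a weak signal and would need calibrating against real outcomes.

```python
import json
from typing import Callable, Optional, Set, Tuple


def parse_if_valid(raw: str, required_keys: Set[str]) -> Optional[dict]:
    """Return the parsed object only if it is a dict with all required keys."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or not required_keys.issubset(data):
        return None
    return data


def answer_with_escalation(
    prompt: str,
    cheap: Callable[[str], str],
    strong: Callable[[str], str],
    required_keys: Set[str],
    user_facing: bool = False,
    min_confidence: float = 0.7,  # assumed threshold, tune per workload
) -> Tuple[dict, str]:
    """Try the cheap model first; escalate on bad schema or low confidence."""
    if not user_facing:
        data = parse_if_valid(cheap(prompt), required_keys)
        if data is not None:
            conf = data.get("confidence", 0.0)
            if isinstance(conf, (int, float)) and conf >= min_confidence:
                return data, "cheap"
    data = parse_if_valid(strong(prompt), required_keys)
    if data is None:
        raise ValueError("strong model output failed schema validation")
    return data, "strong"
```

The second element of the returned tuple records which tier answered, so the escalation rate itself becomes a metric you can watch.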

The fourth move is vendor-risk hygiene. Free-tier usage explosions are useful signal, but they can also produce fragile dependencies. Free previews change pricing, rate limits, context windows, and availability. If Hy3 is now in your critical path because it was cheap and convenient, treat that as a temporary optimization until you have fallback coverage and model-agnostic tests.
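Fallback coverage can start as something this small. The sketch assumes each provider is wrapped as a plain function that raises on failure (rate limit, timeout, model withdrawn); `call_with_fallback` and the provider names in the usage note are hypothetical.

```python
from typing import Callable, List, Tuple


def call_with_fallback(
    prompt: str, providers: List[Tuple[str, Callable[[str], str]]]
) -> Tuple[str, str]:
    """Try providers in order; return (provider_name, output) from the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # broad on purpose: any provider failure triggers fallback
            errors.append(f"{name}: {exc!r}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

A call such as `call_with_fallback(prompt, [("free_preview", call_preview), ("paid_backup", call_backup)])` keeps the free preview as the first choice while making its disappearance survivable. Pairing this with evals that run against every provider in the list is what keeps the tests model-agnostic.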

The rankings are giving us a clean split this week. Anthropic remains the quality leader on Arena, especially for code. Tencent and DeepSeek are showing that usage share is increasingly won by models that are cheap enough to become infrastructure. Those are not contradictory facts. They are the two halves of the market finally separating.

My read: the next serious LLM engineering advantage will not come from picking the single smartest model. It will come from knowing when not to use it. Premium models are becoming the senior engineer in the loop. The rest of the stack is about to get a lot more cost-conscious.

Sources: OpenRouter Rankings, OpenRouter Model Catalog, Arena AI Text Leaderboard, LM Arena Leaderboard
