
LLM rankings: Production traffic is routing around prestige

Anatoliy Kolodkin

24 May 2026 • 5 min read

The most useful LLM ranking today is not the one with the prettiest medal table. Arena’s Text and WebDev leaderboards are effectively frozen: Anthropic’s Opus line still owns the top of the preference-test stack, and the public pages were last updated several days ago. The more interesting signal is downstream, where people are paying for tokens and routing real workloads.

OpenRouter’s weekly usage ranking moved enough this week to be worth writing up, and the diff says the quiet part out loud: production AI systems are optimizing for throughput, latency, and unit economics, not just top-line Elo. DeepSeek V4 Flash remains the traffic monster at #1 with 3.29T weekly tokens and 99% week-over-week growth. Tencent’s Hy3 preview is still #2 at 3.01T tokens. Those are not “interesting alternatives” anymore; they are infrastructure-scale choices.

That does not make the Arena leaderboard irrelevant. It makes it incomplete. Arena is still answering a valuable question: which models win head-to-head preference tests on hard prompts? OpenRouter is answering a different one: which models are builders actually sending work to all week? If you are shipping AI features, the second question may be closer to your production reality.

The leaderboard split is now an architecture pattern

Arena’s Text top five is unchanged: claude-opus-4-6-thinking at 1502 Elo, claude-opus-4-7-thinking at 1500, claude-opus-4-6 at 1498, claude-opus-4-7 at 1492, and Meta’s muse-spark at 1489. The WebDev board tells the same story with a coding-adjacent accent: claude-opus-4-7-thinking leads at 1567, followed by claude-opus-4-7 at 1560, claude-opus-4-6-thinking at 1545, claude-opus-4-6 at 1540, and GLM 5.1 at 1532.

That is a clean win for Anthropic at the frontier-quality layer. If your workload is complex planning, deep code review, architectural reasoning, or high-stakes synthesis, the leaderboard is still pointing at premium reasoning models. Nobody serious should look at this and conclude that quality stopped mattering.

But usage is not behaving like a simple quality contest. OpenRouter’s top four are DeepSeek V4 Flash, Hy3 preview, Claude Sonnet 4.6 at 1.77T tokens, and Claude Opus 4.7 at 1.7T tokens. Anthropic remains heavily used, but the very top of router traffic belongs to models that look less like prestige purchases and more like high-volume workhorses.

The mid-table movement makes the pattern clearer. Gemini 3 Flash Preview moved from #6 to #5 with 1.14T weekly tokens, pushing Owl Alpha to #6 even though Owl Alpha still posted 1.11T tokens and 62% week-over-week growth. Gemini 2.5 Flash Lite made the biggest top-20 rank move, climbing three places from #14 to #11 with 609B tokens. Gemini 3.1 Pro Preview also moved up, from #17 to #16, with 490B tokens and 27% growth.

Meanwhile, some models grew and still lost rank. MiniMax M2.7 fell from #11 to #13 despite 593B tokens and 23% week-over-week growth. Claude Opus 4.6 dropped from #15 to #17 while still growing 19%. GLM 5.1 slipped from #19 to #20 despite 340B tokens and 17% growth. That is the part worth underlining: in a market growing this quickly, “up and to the right” is not enough. You can gain usage and still lose rank if the cheaper, faster lane is expanding faster around you.
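The arithmetic behind "grew and still lost rank" is easy to check. The sketch below uses made-up model names and token counts (they are not the OpenRouter figures, which do not include every model's prior-week volume) to show a model growing 23% while dropping a place and losing share.

```python
def apply_growth(tokens, growth):
    """Project next week's tokens from this week's tokens and a growth rate."""
    return {name: tokens[name] * (1 + growth[name]) for name in tokens}


def rank_by_tokens(tokens):
    """Return {model: rank}, where rank 1 is the highest token volume."""
    ordered = sorted(tokens, key=tokens.get, reverse=True)
    return {name: position + 1 for position, name in enumerate(ordered)}


def share(tokens, name):
    """Fraction of total traffic that goes to one model."""
    return tokens[name] / sum(tokens.values())


# Hypothetical weekly token volumes in billions, for illustration only.
last_week = {"steady_premium": 480.0, "fast_lite": 400.0, "other": 300.0}
growth = {"steady_premium": 0.23, "fast_lite": 0.55, "other": 0.10}

this_week = apply_growth(last_week, growth)
rank_before = rank_by_tokens(last_week)
rank_after = rank_by_tokens(this_week)

for name in last_week:
    print(
        f"{name}: rank {rank_before[name]} -> {rank_after[name]}, "
        f"share {share(last_week, name):.1%} -> {share(this_week, name):.1%}"
    )
```

In this toy table the premium model adds more than 100B tokens and still slides from first to second, because the lite model next to it grew more than twice as fast.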

Stop asking for one best model

The practitioner mistake is treating these rankings like a fantasy draft. Pick the best model, wire it into the product, move on. That was always a little lazy; now it is visibly behind the market.

The usage data points toward model portfolios becoming the default architecture. Premium models handle the expensive thinking: plan generation, final answer synthesis, code review, ambiguous support escalations, security-sensitive reasoning, and tasks where a bad answer costs more than a larger inference bill. Flash-class and Lite-class models take the bulk traffic: extraction, classification, search-query rewriting, summarization, moderation prechecks, context compression, test generation, agent substeps, and the thousand small calls that make AI products feel responsive.
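As a minimal sketch of that split, the mapping below sends the task types listed above to one of two lanes. The task labels, the lane names, and the placeholder model ids are all assumptions made for illustration, and defaulting unknown tasks to the premium lane is one possible policy, not the only reasonable one.

```python
from enum import Enum


class Lane(Enum):
    PREMIUM = "premium"
    BULK = "bulk"


# Task labels are hypothetical names for the workloads described in the text.
PREMIUM_TASKS = {
    "plan_generation",
    "final_synthesis",
    "code_review",
    "support_escalation",
    "security_reasoning",
}
BULK_TASKS = {
    "extraction",
    "classification",
    "query_rewriting",
    "summarization",
    "moderation_precheck",
    "context_compression",
    "test_generation",
    "agent_substep",
}

# Placeholder model ids; swap in whatever your own evals select.
MODEL_FOR_LANE = {
    Lane.PREMIUM: "premium-model-placeholder",
    Lane.BULK: "flash-model-placeholder",
}


def lane_for(task_type: str) -> Lane:
    """Map a task type to a lane.

    Assumption: unknown task types fall back to the premium lane, trading
    cost for safety until someone classifies them.
    """
    if task_type in BULK_TASKS:
        return Lane.BULK
    return Lane.PREMIUM


def model_for(task_type: str) -> str:
    """Resolve a task type to the model id configured for its lane."""
    return MODEL_FOR_LANE[lane_for(task_type)]
```

Calling `model_for("classification")` resolves to the bulk placeholder, while `model_for("code_review")` resolves to the premium one. The point of the indirection is that application code names a task, not a vendor.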

This is not just about saving money. It is about system design. A slower premium model may be the correct choice for a single-shot benchmark prompt and the wrong choice for an agent loop that needs twenty tool calls, three retries, and a user still watching the spinner. A cheaper model with predictable latency can improve the product even when it loses a preference fight in isolation. Users do not experience your model selection as a leaderboard score. They experience it as latency, reliability, cost limits, and whether the product does the job.
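To put rough numbers on the spinner problem, here is a back-of-the-envelope sketch. It assumes a strictly sequential loop in which every tool call and every retry costs one model call, and the per-call latencies are invented for illustration rather than measured from any real model.

```python
def loop_latency_s(tool_calls: int, retries: int, per_call_latency_s: float) -> float:
    """Wall-clock time for a strictly sequential agent loop.

    Assumes each tool call and each retry is one model call with roughly the
    same latency, and that nothing runs in parallel.
    """
    if tool_calls < 0 or retries < 0 or per_call_latency_s < 0:
        raise ValueError("inputs must be non-negative")
    return (tool_calls + retries) * per_call_latency_s


# Hypothetical per-call latencies in seconds, not benchmarks.
PREMIUM_LATENCY_S = 4.0
FAST_LATENCY_S = 0.8

premium_total = loop_latency_s(20, 3, PREMIUM_LATENCY_S)
fast_total = loop_latency_s(20, 3, FAST_LATENCY_S)

print(f"premium lane: {premium_total:.1f}s, fast lane: {fast_total:.1f}s")
```

Under these assumptions the same twenty-call loop with three retries takes about a minute and a half on the slower model and under twenty seconds on the faster one. Real loops overlap some work, so treat this as an upper-bound intuition, not a forecast.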

The OpenRouter catalog reinforces that point. Qwen3.7 Max is positioned as a 1,000,000-token-context model for agent-centric workloads. Gemini 3.5 Flash is listed as multimodal with a 1,048,576-token context window and Flash-tier cost/speed positioning. Long context plus lower marginal cost changes what gets routed where. The model that is “best” for final reasoning is not necessarily the model you want chewing through a huge repository, compressing logs, or preprocessing a support history before the premium model sees the distilled version.
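The "distill first, reason second" shape can be sketched as a two-stage pipeline. Everything here is an assumption for illustration: the function names are hypothetical, the two model calls are passed in as plain callables so no vendor SDK is implied, and chunking by character count stands in for real token-aware splitting.

```python
from typing import Callable, List


def chunk_text(text: str, max_chars: int) -> List[str]:
    """Split text into consecutive pieces of at most max_chars characters."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


def distill_then_reason(
    document: str,
    question: str,
    cheap_summarize: Callable[[str], str],
    premium_answer: Callable[[str], str],
    max_chars: int = 4000,
) -> str:
    """Compress a long input with a cheap model, then ask the premium model.

    cheap_summarize and premium_answer are caller-supplied wrappers around
    whichever models occupy the bulk and premium lanes.
    """
    summaries = [cheap_summarize(chunk) for chunk in chunk_text(document, max_chars)]
    distilled = "\n".join(summaries)
    # The premium model only ever sees the distilled context.
    return premium_answer(f"{question}\n\nContext:\n{distilled}")
```

The premium model is called once regardless of input size, while the number of cheap calls scales with the document. Whether the summaries preserve enough detail for the final answer is exactly what an eval should check before trusting this split.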

Engineers should read today’s ranking diff as a nudge to make routing explicit. Do not keep model choice buried in application code as a vendor string. Treat it like policy. Define lanes by task type, latency budget, quality threshold, context size, privacy requirement, and retry behavior. Then measure cost per successful task, not cost per token in isolation. Cheap tokens that fail twice are expensive. Expensive tokens that prevent a human escalation may be cheap.
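One way to make "cost per successful task" concrete is an expected-cost calculation that charges for retries and for the human escalation that follows when every attempt fails. This is a simplified sketch: it assumes attempts succeed independently with a fixed probability, and the prices, success rates, and escalation cost below are invented for illustration.

```python
def expected_cost_per_task(
    cost_per_attempt: float,
    success_rate: float,
    max_retries: int,
    escalation_cost: float,
) -> float:
    """Expected spend on one task, including retries and human escalation.

    Assumes each attempt succeeds independently with probability
    success_rate, and that a task escalates to a human (at escalation_cost)
    only after the first attempt and all retries have failed.
    """
    if not 0.0 <= success_rate <= 1.0:
        raise ValueError("success_rate must be between 0 and 1")
    if max_retries < 0:
        raise ValueError("max_retries must be non-negative")

    attempts_allowed = max_retries + 1
    fail_rate = 1.0 - success_rate
    # Attempt k+1 only happens if the first k attempts all failed.
    expected_attempts = sum(fail_rate ** k for k in range(attempts_allowed))
    p_all_fail = fail_rate ** attempts_allowed
    return cost_per_attempt * expected_attempts + escalation_cost * p_all_fail


# Hypothetical lanes: a cheap model that fails often versus a pricier one
# that rarely does, both with two retries and a 5.00 human escalation.
cheap = expected_cost_per_task(0.01, 0.60, max_retries=2, escalation_cost=5.00)
premium = expected_cost_per_task(0.10, 0.95, max_retries=2, escalation_cost=5.00)

print(f"cheap lane: {cheap:.4f} per task, premium lane: {premium:.4f} per task")
```

With these made-up inputs the model that is ten times more expensive per attempt comes out at roughly a third of the cost per task, almost entirely because it escalates less often. Change the escalation cost to zero and the ranking flips, which is why the comparison has to be made per workload.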

Your evals need a price column

If your internal eval dashboard only ranks models by answer quality, it is missing the deployment question. Add p50 and p95 latency. Add task success rate after retries. Add tool-call validity. Add context-window failure behavior. Add refusal and over-compliance rates for your domain. Add cost per completed workflow. If you operate at scale, add availability and vendor degradation notes, because the theoretical best model is not useful when it turns into an incident dependency.
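A few of those columns can be computed from per-run records with very little code. The sketch below is one possible shape, not a standard: the `RunRecord` fields are hypothetical, percentiles use the simple nearest-rank method, and cost per completed workflow divides total spend (including failed runs) by the number of successes.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class RunRecord:
    """One end-to-end workflow run; field names are illustrative."""
    latency_s: float
    succeeded: bool          # success after any retries
    tool_calls_valid: bool   # every tool call parsed and validated
    cost_usd: float          # total spend, including failed attempts


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(records: List[RunRecord]) -> Dict[str, float]:
    """Build the deployment-facing columns for one model."""
    if not records:
        raise ValueError("records must not be empty")
    latencies = [r.latency_s for r in records]
    completed = sum(1 for r in records if r.succeeded)
    total_cost = sum(r.cost_usd for r in records)
    return {
        "p50_latency_s": percentile(latencies, 50),
        "p95_latency_s": percentile(latencies, 95),
        "success_rate": completed / len(records),
        "tool_call_validity": sum(1 for r in records if r.tool_calls_valid) / len(records),
        # Failed runs still cost money, so they inflate this number.
        "cost_per_completed_usd": total_cost / completed if completed else math.inf,
    }
```

Run `summarize` once per candidate model on the same task set and the quality-only ranking gains the latency and price columns it was missing. Refusal rates, context-window failures, and availability need their own fields and are left out here.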

The most practical move this week is to rerun your evals with at least three lanes: frontier reasoning, fast general-purpose, and cheap bulk-processing. Compare Claude Opus or Sonnet against the Flash-family models, DeepSeek V4 Flash, Hy3 preview, and whatever regional or open-weight option is viable for your constraints. Then route by evidence. The goal is not to crown a winner. The goal is to stop using a single model as a substitute for architecture.
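"Route by evidence" can be as simple as picking, for each lane, the cheapest candidate that clears that lane's quality and latency bars. The sketch below assumes you already have eval results per model; the model names, scores, and thresholds are placeholders, not measurements of any model named above.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class EvalResult:
    model: str
    quality: float                 # e.g. task success rate on your own eval set
    p95_latency_s: float
    cost_per_completed_usd: float


@dataclass(frozen=True)
class LaneRequirement:
    min_quality: float
    max_p95_latency_s: float


def pick_model(results: List[EvalResult], req: LaneRequirement) -> Optional[str]:
    """Return the cheapest model that meets the lane's bars, or None."""
    eligible = [
        r for r in results
        if r.quality >= req.min_quality and r.p95_latency_s <= req.max_p95_latency_s
    ]
    if not eligible:
        return None
    return min(eligible, key=lambda r: r.cost_per_completed_usd).model


# Placeholder results and thresholds for the three lanes.
results = [
    EvalResult("premium-model", quality=0.93, p95_latency_s=12.0, cost_per_completed_usd=0.40),
    EvalResult("fast-model", quality=0.84, p95_latency_s=2.0, cost_per_completed_usd=0.05),
    EvalResult("bulk-model", quality=0.74, p95_latency_s=1.0, cost_per_completed_usd=0.01),
]
lanes: Dict[str, LaneRequirement] = {
    "frontier_reasoning": LaneRequirement(min_quality=0.90, max_p95_latency_s=30.0),
    "fast_general": LaneRequirement(min_quality=0.80, max_p95_latency_s=5.0),
    "cheap_bulk": LaneRequirement(min_quality=0.70, max_p95_latency_s=3.0),
}

routing = {lane: pick_model(results, req) for lane, req in lanes.items()}
print(routing)
```

With these placeholder numbers each lane ends up with a different model, and a lane with no eligible candidate returns `None` instead of silently falling back. That empty result is useful: it tells you the lane's requirements and the available models do not currently match.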

There is also a product lesson here. Users rarely ask for “the smartest model” in the abstract. They ask for a feature to finish, a ticket to be triaged, a diff to be reviewed, a spreadsheet to be cleaned, a document to be understood. The winning system may call a premium model once and a cheaper model fifteen times. Or it may do the reverse for a high-risk workflow. Either way, the system wins by matching model capability to job shape.

So yes, Arena still says Claude Opus is the quality king. That matters. But OpenRouter’s usage table is showing where production gravity is pulling: toward throughput, long context, lower latency, and portfolios of models instead of one blessed endpoint. The builders who internalize that will ship AI features that are faster, cheaper, and more resilient. The builders still arguing about the single “best LLM” are reviewing the wrong diff.

Sources: OpenRouter Rankings, Arena AI Text leaderboard, LM Arena leaderboard, OpenRouter model catalog API
