
The LLM Leaderboard Has Split in Two

Anatoliy Kolodkin

12 May 2026 • 5 min read

The most useful LLM leaderboard this week is not the one with the cleanest Elo score. It is the one where agents are quietly burning trillions of tokens.

Arena AI’s text and WebDev/Code boards are stable enough to look almost boring: Anthropic still owns the top end, Claude Opus 4.7 Thinking is still first in both text and coding, and the top five has the shape of a vendor moat. But OpenRouter’s usage rankings are telling a different story. The models gaining traffic are not simply the models winning human preference tests. They are the models with long context, tool support, aggressive pricing, and enough latency discipline to survive production agent loops.

That split matters because most engineering teams still talk about “the best model” as if they were choosing a database. They are not. They are choosing a routing policy.

The benchmark winner and the deployment winner are drifting apart

On Arena’s Text leaderboard, the top slot remains claude-opus-4-7-thinking at 1503 Elo, followed by claude-opus-4-6-thinking at 1502, claude-opus-4-6 at 1498, Google’s gemini-3.1-pro-preview at 1492, and claude-opus-4-7 at 1491. The WebDev/Code board is even more Anthropic-heavy: Claude Opus 4.7 Thinking leads at 1570 Elo, with Claude Opus 4.7, Opus 4.6 Thinking, and Opus 4.6 filling the next three positions.

That is a real signal. If a human is comparing outputs side by side, Anthropic is still the vendor to beat, especially for code. For teams buying a premium model for hard reasoning, architecture review, migration planning, or high-stakes code generation, ignoring that would be malpractice dressed up as contrarianism.

But OpenRouter’s weekly traffic board is where the practitioner story gets sharper. Hy3 preview (free) remains number one with 2.68 trillion weekly tokens. Kimi K2.6 follows at 1.61 trillion. Claude Sonnet 4.6 sits at 1.45 trillion, and Claude Opus 4.7 at 1.24 trillion. Then the interesting movement starts: DeepSeek V4 Flash moved up to fifth with 1.11 trillion tokens, paid Hy3 preview jumped from sixteenth to eighth with 857 billion tokens, and Owl Alpha rose from twentieth to seventeenth with 405 billion tokens.

No new model entered the top 20. This is not launch-day confetti. It is reshuffling inside an already crowded production market, which makes it more useful. Novelty can put a model on a leaderboard for a day. Sustained token volume usually means someone wired it into a workflow.

Hy3 is priced like infrastructure, not a trophy model

Tencent’s Hy3 is the clearest example of the new playbook. According to Tencent, Hy3 preview is a 295-billion-parameter mixture-of-experts model with 21 billion activated parameters and a context window of up to 256K tokens. The company also claims 40% better inference efficiency, 54% lower time to first token, 47% lower end-to-end response time, and a success rate above 99.99% across CodeBuddy and WorkBuddy workloads.

Those are not the metrics you lead with when your only customer is a benchmark judge. Those are the metrics you lead with when your customer is an agent framework that may perform hundreds of tool calls, retry failed steps, read half a repository, and generate enough intermediate text to make a CFO develop a twitch.

OpenRouter’s catalog gives the practical version of the pitch: paid Hy3 lists at $0.066 per million input tokens, $0.26 per million output tokens, and $0.029 per million cached input tokens, with a 262,144-token context window and tool support. Tencent explicitly names OpenClaw, OpenCode, and KiloCode among supported open-source agent frameworks, and says the model has powered complex agent workflows up to 495 steps.

The free Hy3 preview sitting at number one could be promotion-driven. Fine. Free models distort usage charts the way free pizza distorts office attendance. But the paid Hy3 preview jumping eight spots matters more because it suggests developers are testing whether the economics hold after the novelty layer peels away. If you operate agents at scale, this is the part worth caring about: not whether Hy3 wins every human preference matchup, but whether it completes long-running workflows cheaply, quickly, and with fewer recovery loops.

DeepSeek is making “good enough” a system architecture

DeepSeek’s movement points in the same direction. DeepSeek V4 Flash now sits at fifth on OpenRouter with 1.11 trillion weekly tokens, up 58%. OpenRouter metadata describes it as a 284B total / 13B active MoE model with a 1M-token context window, priced at $0.14 per million input tokens and $0.28 per million output tokens. DeepSeek V4 Pro slipped to ninth but still posted 816 billion tokens, up 99%, with 1.6 trillion total parameters, 49 billion activated parameters, a 1M-token context window, and pricing of $0.435/M input and $0.87/M output.
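To make those list prices concrete, here is a minimal sketch that turns them into a per-job cost. The prices are the ones quoted above; the workload (2M input tokens, 100K output tokens per agent job), the dictionary labels, and the `job_cost` helper are assumptions for illustration, not anything OpenRouter or the vendors publish.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Price:
    """USD per million tokens, as quoted in the article."""
    input: float
    output: float
    cached_input: Optional[float] = None  # None: no cached rate was quoted


# List prices cited above. The keys are illustrative labels, not API model IDs.
PRICES = {
    "hy3-preview-paid": Price(input=0.066, output=0.26, cached_input=0.029),
    "deepseek-v4-flash": Price(input=0.14, output=0.28),
    "deepseek-v4-pro": Price(input=0.435, output=0.87),
}


def job_cost(price, input_tokens, output_tokens, cache_hit_rate=0.0):
    """Cost in USD of one job.

    cache_hit_rate is the fraction of input tokens billed at the cached rate.
    If no cached rate was quoted, cached tokens are billed at the normal rate.
    """
    cached_rate = price.input if price.cached_input is None else price.cached_input
    cached = input_tokens * cache_hit_rate
    fresh = input_tokens - cached
    total = fresh * price.input + cached * cached_rate + output_tokens * price.output
    return total / 1_000_000


if __name__ == "__main__":
    # Hypothetical agent job: 2M input tokens read, 100K output tokens written.
    for name, price in PRICES.items():
        print(f"{name}: ${job_cost(price, 2_000_000, 100_000):.3f} per job")
```

Under that assumed workload, the listed prices work out to roughly $0.16 per job on paid Hy3, $0.31 on DeepSeek V4 Flash, and $0.96 on DeepSeek V4 Pro. A 50% cache hit rate would pull the Hy3 figure down to about $0.12. The absolute numbers matter less than the ratio, which is what a routing decision actually trades on.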

Simon Willison’s read on DeepSeek V4, “almost on the frontier, a fraction of the price,” captures the deployment consequence better than any leaderboard rank. A model does not need to be the absolute best at every task to become the default workhorse. It needs to be reliable enough, cheap enough, and available in the right shape for the workload.

That changes how engineers should design AI systems. The naive architecture is a single hard-coded premium model behind every request. The better architecture is a portfolio: a cheap long-context model for retrieval-heavy planning and document digestion, an agent-specialized model for tool-heavy execution, and a premium preference winner for final synthesis, sensitive reasoning, or tasks where quality deltas actually show up in user outcomes.
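The portfolio idea can be sketched as a small routing table keyed by workflow stage. Everything named here is hypothetical: the `Stage` enum, the placeholder model slots, and the `sensitive` escalation flag are assumptions chosen to mirror the three roles described above, not a reference to any particular router or provider API.

```python
from enum import Enum


class Stage(Enum):
    DIGEST = "digest"          # retrieval-heavy planning, document digestion
    EXECUTE = "execute"        # tool-heavy execution
    SYNTHESIZE = "synthesize"  # final synthesis, sensitive reasoning


# Placeholder slots. Fill these with whichever models win your own evals.
ROUTES = {
    Stage.DIGEST: "cheap-long-context-model",
    Stage.EXECUTE: "agent-tool-model",
    Stage.SYNTHESIZE: "premium-model",
}


def route(stage, sensitive=False):
    """Pick a model slot for one step of an agent workflow.

    Sensitive steps escalate to the premium slot regardless of stage.
    """
    if sensitive:
        return ROUTES[Stage.SYNTHESIZE]
    return ROUTES[stage]


if __name__ == "__main__":
    plan = [
        (Stage.DIGEST, False),
        (Stage.EXECUTE, False),
        (Stage.EXECUTE, True),   # e.g. a step that touches production data
        (Stage.SYNTHESIZE, False),
    ]
    for stage, sensitive in plan:
        print(f"{stage.value:<10} sensitive={sensitive!s:<5} -> {route(stage, sensitive)}")
```

Keeping the policy in a table rather than scattered through call sites is the design choice that matters: swapping a model becomes a one-line change, and the table is the thing you re-evaluate when the rankings move.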

This is not theoretical optimization theater. If your agent spends 70% of its tokens reading, summarizing, planning, and checking tool output, routing all of that through a top-shelf model may be indistinguishable from setting margin on fire. Conversely, if the final answer is wrong, cheap tokens did not save you anything. The work is not picking a model. The work is measuring which part of the workflow deserves which model.

Agent traffic is the community reaction

Public discussion around this exact ranking shift is thin, but the usage panel is more useful than a comment thread. OpenRouter’s visible top apps show agent products consuming serious volume: Hermes Agent at 246B tokens, OpenClaw at 189B, Kilo Code at 130B, pi at 46.5B, and Claude Code at 35.3B. That is practitioner sentiment with a bill attached.

Owl Alpha’s rise is also worth watching, with the usual caveat that free models can turn leaderboards into weather reports. OpenRouter describes Owl Alpha as an agentic foundation model with native tool use, structured outputs, long-context support, a 1,048,576-token context window, 262,144 max output tokens, and $0/M pricing. If it keeps climbing after rate limits tighten or pricing changes, that becomes a signal. Until then, it is a useful experiment and a reminder that agent workloads are hungry for context length and tool discipline more than brand prestige.

The practical advice is blunt: stop evaluating models only with chat transcripts. For production agent systems, track task success rate, latency to completed workflow, retry count, tool-call correctness, context retention, cache hit rate, and cost per completed job. Cost per token is a component metric, not the scoreboard. A cheap model that causes retries may be expensive. An expensive model that eliminates three recovery loops may be cheap.
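The retry arithmetic is simple enough to write down. This is a rough model under stated assumptions: every attempt costs about the same, failed jobs consume as many attempts as successful ones, and the dollar figures below are invented for illustration rather than measured.

```python
def cost_per_completed_job(cost_per_attempt, mean_retries, success_rate):
    """Average spend needed to get one successfully completed job.

    cost_per_attempt: USD for a single pass through the workflow
    mean_retries:     average number of recovery loops per job
    success_rate:     fraction of jobs that eventually complete, in (0, 1]
    """
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    spend_per_job = cost_per_attempt * (1 + mean_retries)
    # Spend on failed jobs is still spend, so divide by the completion rate.
    return spend_per_job / success_rate


if __name__ == "__main__":
    # Hypothetical numbers, not measurements.
    cheap = cost_per_completed_job(cost_per_attempt=0.10, mean_retries=3, success_rate=0.80)
    premium = cost_per_completed_job(cost_per_attempt=0.40, mean_retries=0, success_rate=0.95)
    print(f"cheap model:   ${cheap:.2f} per completed job")
    print(f"premium model: ${premium:.2f} per completed job")
```

With these made-up inputs the model that is four times pricier per attempt comes out at about $0.42 per completed job against $0.50 for the cheap one, because it skips three recovery loops and fails less often. Change the retry count and the ranking flips, which is exactly why the inputs have to be measured rather than guessed.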

For teams doing model selection this week, the eval matrix should include at least three buckets: the premium Arena leader, the cheap long-context workhorse, and the agent-specialized tool-use model. Run the same real workflows through all three. Include ugly cases: large repos, noisy logs, partial failures, stale tool output, ambiguous tickets, and requests that require saying no. Benchmarks are useful, but your production system is where the bill arrives.
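A three-bucket eval matrix does not need a framework to get started. The sketch below assumes you already log one record per workflow run; the `Run` fields, bucket names, case names, and sample values are all hypothetical stand-ins for whatever your own harness records.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Run:
    bucket: str       # "premium", "long-context", or "agent-tool"
    case: str         # name of the workflow under test
    succeeded: bool
    retries: int
    latency_s: float  # wall-clock time to completed workflow


def summarize(runs):
    """Group runs by bucket and report success, retries, latency, and failures."""
    by_bucket = defaultdict(list)
    for run in runs:
        by_bucket[run.bucket].append(run)

    summary = {}
    for bucket, rows in by_bucket.items():
        summary[bucket] = {
            "success_rate": sum(r.succeeded for r in rows) / len(rows),
            "mean_retries": mean(r.retries for r in rows),
            "mean_latency_s": mean(r.latency_s for r in rows),
            "failed_cases": sorted(r.case for r in rows if not r.succeeded),
        }
    return summary


if __name__ == "__main__":
    # Invented sample data: the same two ugly cases run through each bucket.
    runs = [
        Run("premium", "large-repo", True, 0, 90.0),
        Run("premium", "noisy-logs", True, 1, 120.0),
        Run("long-context", "large-repo", True, 2, 150.0),
        Run("long-context", "noisy-logs", False, 4, 300.0),
        Run("agent-tool", "large-repo", True, 0, 80.0),
        Run("agent-tool", "noisy-logs", True, 1, 100.0),
    ]
    for bucket, stats in summarize(runs).items():
        print(bucket, stats)
```

The `failed_cases` list is the part worth reading first. Averages tell you which bucket is cheaper to run; the named failures tell you which workflows cannot be routed there yet.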

The headline is not that Anthropic lost. It did not. The headline is that the market now has two leaderboards: one for what humans prefer in isolation, and one for what software systems can afford to run all day. If your product still has a single hard-coded “best model,” that is not a strategy anymore. It is technical debt with an API key.

Sources: OpenRouter Rankings, Arena AI Leaderboard, Tencent Hy3 announcement, OpenRouter Hy3 catalog, OpenRouter Owl Alpha catalog, Simon Willison on DeepSeek V4
