The Best Model Is the One Your Router Knows When Not to Use

Anatoliy Kolodkin

09 May 2026 • 5 min read

The leaderboard story this week is not that one model moved one slot. That is leaderboard theater. The useful story is that two different markets are now visible at the same time: the models developers admire in benchmarks, and the models production systems can afford to call a few billion times.

OpenRouter’s current usage table makes the split hard to miss. Tencent’s Hy3 preview is still the runaway traffic leader at 3.74 trillion weekly tokens, more than double Kimi K2.6 at 1.78 trillion and roughly 2.7 times Claude Sonnet 4.6 at 1.38 trillion. Meanwhile, GPT-5.5 has finally entered OpenRouter’s top 20 at 302 billion weekly tokens with 52% week-over-week growth. Arena’s Text leaderboard barely moved, and Anthropic still owns the high ground there. But usage is where product architecture leaks into public view, and the leak says: cheap, high-throughput models carry the volume; premium frontier models carry the risk.

The default route is no longer the prestige model

Hy3 preview is not leading OpenRouter because every engineer woke up and decided Tencent had won the philosophical argument about intelligence. It is leading because modern AI applications are becoming token furnaces: routers, agents, code assistants, inbox triage tools, extraction pipelines, summarizers, test generators, browser agents, and all the awkward glue work between APIs. When those systems scale, the default model matters less as a trophy and more as an operating expense.

The model’s spec sheet explains why it fits that layer. Tencent describes Hy3 preview as a 295B-parameter Mixture-of-Experts model with only 21B activated parameters, 192 experts with top-8 activated, 80 layers, 64 GQA attention heads, and a 256K context window. The important part is not the raw parameter count; it is the combination of long context, sparse activation, and “good enough” agent behavior. That is exactly the shape teams want for high-volume substeps where a frontier model would be expensive overkill.
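A quick back-of-envelope check makes the sparse-activation point concrete. The sketch below uses only the figures quoted above; the variable names are illustrative, and the ratios say nothing about actual latency or price, which depend on the serving stack.

```python
# Figures quoted from Tencent's description of Hy3 preview above.
TOTAL_PARAMS_B = 295     # total parameters, in billions
ACTIVE_PARAMS_B = 21     # parameters activated per token, in billions
NUM_EXPERTS = 192
EXPERTS_PER_TOKEN = 8    # top-8 routing

# Share of the model that does work for any single token.
active_param_share = ACTIVE_PARAMS_B / TOTAL_PARAMS_B
active_expert_share = EXPERTS_PER_TOKEN / NUM_EXPERTS

print(f"active parameters per token: {active_param_share:.1%}")
print(f"experts consulted per token: {active_expert_share:.1%}")
```

Roughly 7% of the parameters and about 4% of the experts are active for any one token, which is the property that makes a model of this size plausible as a high-volume default.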

OpenRouter’s app usage reinforces the point. The tracked apps on the same ranking page are dominated by coding and agent workflows: OpenClaw at 269B tokens, Hermes Agent at 258B, Kilo Code at 174B, and Claude Code at 78.8B. Those are not casual chatbot numbers. They are the shape of systems repeatedly reading context, planning, editing, checking, retrying, and occasionally getting stuck in loops because software remains software.

That should change how engineering teams think about model selection. The lazy architecture is one model endpoint for everything. The grown-up architecture is a routing policy: cheap or free models for low-risk transformation, extraction, draft generation, classification, and agent scaffolding; stronger models for ambiguous reasoning, security-sensitive changes, customer-facing answers, and final review. If your system sends every request to a premium model because “quality,” you may be paying senior-engineer rates for intern-ticket work.
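A routing policy like the one described above can start as a few lines of code. This is a minimal sketch, not a recommended production design: the task labels, the `Request` type, the `Tier` names, and the choice to escalate unknown tasks are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    CHEAP = "cheap"
    PREMIUM = "premium"


# Task groups follow the paragraph above; the label strings are hypothetical.
LOW_RISK = {"transformation", "extraction", "draft", "classification", "scaffolding"}
HIGH_RISK = {"ambiguous_reasoning", "security_change", "customer_answer", "final_review"}


@dataclass(frozen=True)
class Request:
    task: str
    customer_facing: bool = False


def route(req: Request) -> Tier:
    """Pick a model tier for a request."""
    if req.customer_facing or req.task in HIGH_RISK:
        return Tier.PREMIUM
    if req.task in LOW_RISK:
        return Tier.CHEAP
    # Assumption: an unrecognized task is treated as risky and escalated.
    return Tier.PREMIUM


print(route(Request("extraction")))
print(route(Request("draft", customer_facing=True)))
```

Defaulting unknown tasks to the premium tier trades cost for safety; a team with good evals might reasonably flip that default.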

Claude still owns the coding crown, but that is not the whole procurement decision

Arena’s leaderboard tells the other half of the story. On Text, the top three are still Anthropic: Claude Opus 4.7 Thinking at 1503 Elo with 8,945 votes, Claude Opus 4.6 Thinking at 1502 with 23,616 votes, and Claude Opus 4.6 at 1498 with 25,089 votes. On WebDev/Code, the top four are also Claude Opus variants, led by Opus 4.7 Thinking at 1570 Elo. If the job is “make the hardest coding decision with the fewest retries,” Anthropic remains the safest shortlist.

But the middle of the coding board is where the market gets interesting. GLM 5.1 sits at #5 on WebDev with 1531 Elo, above Claude Sonnet 4.6 at 1524 and Kimi K2.6 at 1523. Qwen3.6 Max Preview entered the WebDev top 20 at #11 / 1478 Elo / 1,343 votes, pushing Gemini 3 Pro to #21. OpenAI’s GPT-5.5-high codex-harness moved to #9 at 1491 Elo, just ahead of Claude Opus 4.5 Thinking.

The practical read is not “Claude wins, ignore everyone else.” It is “Claude is the expensive default for hard coding, while the rest of the field is close enough that workflow-specific evaluation matters.” A model that is five leaderboard slots lower can still be the right production choice if it is cheaper, faster, more available, better at your tool schema, or less likely to burn context on verbose self-commentary. Elo is a signal; your repo is the test suite.

That matters especially for coding agents. Public leaderboards reward broad preference across many prompts. Your agent may care about a narrower set of tasks: patching flaky tests, migrating React components, reading a messy monorepo, editing Terraform, writing SQL migrations, or doing mechanical refactors without creative interpretation. A model can rank lower globally and still win your workload. Conversely, a leaderboard champion can be a poor fit if its tool-calling behavior, latency profile, or completion length creates bad economics.

Usage rankings are market telemetry, not quality scores

There is a trap here: treating OpenRouter usage as a second benchmark. It is not. Token volume is shaped by price, availability, free-preview promotions, app defaults, bot loops, context length, and how aggressively developers instrument or fail to instrument retries. A free model can generate absurd traffic without proving it is better. A premium model can have lower share because teams reserve it for the few calls that actually decide outcomes.

That caveat is exactly why the GPT-5.5 movement is worth watching. Entering the top 20 at 302B weekly tokens with 52% growth suggests it is crossing from prestige evaluation into real routing. OpenRouter’s own cost-analysis note adds an operational wrinkle: GPT-5.5 can produce 19–34% fewer output tokens above 10K input tokens, while prompts in the 2K–10K range can produce 52% longer completions. In other words, “price per million tokens” is not enough. Completion behavior changes total cost.
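To see why completion length matters, it helps to price a single call. The sketch below is illustrative only: the per-million-token prices and the token counts are invented placeholders, not GPT-5.5 pricing, and the 52% figure is applied to an arbitrary 1,000-token baseline completion.

```python
def call_cost(input_tokens: int, output_tokens: int,
              usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Cost of one call in USD, given per-million-token prices."""
    return (input_tokens * usd_per_m_input
            + output_tokens * usd_per_m_output) / 1_000_000


# Hypothetical prices (USD per million tokens), chosen only for the arithmetic.
PRICE_IN = 5.0
PRICE_OUT = 30.0

# Same 5K-token prompt, with a baseline completion and one that is 52% longer.
baseline = call_cost(5_000, 1_000, PRICE_IN, PRICE_OUT)
longer = call_cost(5_000, 1_520, PRICE_IN, PRICE_OUT)

print(f"baseline: ${baseline:.4f}  longer: ${longer:.4f}")
print(f"increase: {longer / baseline - 1:.0%}")
```

With these invented numbers the longer completion raises the cost of the call by about 28% even though the listed prices are identical, because only the output side of the bill grows.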

Claude Opus 4.7 moving ahead of Gemini 3 Flash Preview on OpenRouter is a similar signal. Opus 4.7 reached 1.05T weekly tokens, just above Gemini 3 Flash Preview at 1.03T. That is not a landslide, but it is a useful reminder that premium models still cross trillion-token scale when the workflow justifies them. Developers are not only optimizing for cheap. They are optimizing for expected cost: inference price multiplied by the number of attempts, plus the cost of review time, failed edits, bad answers, and user trust damage.
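One way to make "expected cost" concrete is a small model of a task that is retried until it succeeds. Everything here is an assumption for illustration: the function name, the simplification that attempts are independent (so expected attempts are 1 divided by the success rate), and every number in the two example profiles.

```python
def expected_cost(cost_per_attempt: float, success_rate: float,
                  review_cost: float, escape_rate: float,
                  escape_damage: float) -> float:
    """Expected USD cost of one completed task.

    Assumes independent attempts retried until one succeeds, a fixed human
    review cost, and a probability that a bad result ships anyway.
    """
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    expected_attempts = 1 / success_rate
    return (cost_per_attempt * expected_attempts
            + review_cost
            + escape_rate * escape_damage)


# Hypothetical profiles; none of these numbers come from a real model.
cheap = expected_cost(0.002, success_rate=0.5, review_cost=0.50,
                      escape_rate=0.02, escape_damage=20.0)
premium = expected_cost(0.07, success_rate=0.9, review_cost=0.20,
                        escape_rate=0.005, escape_damage=20.0)

print(f"cheap: ${cheap:.3f}  premium: ${premium:.3f}")
```

With these made-up inputs the premium profile comes out cheaper per completed task (about $0.38 against about $0.90) despite a much higher price per attempt. Different inputs flip the result, which is the reason to measure them rather than guess.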

The teams that will benefit from this market split are the ones that stop arguing about “the best model” in the abstract. Build a small eval harness from your own work: twenty bug fixes, twenty refactors, ten nasty codebase-navigation tasks, ten support-answer tasks, ten long-context extraction jobs. Track not just pass/fail, but latency, output tokens, retries, human edits, and failure severity. Then route accordingly. Use Arena to build the shortlist. Use OpenRouter rankings to understand what the market is stress-testing. Use your telemetry to decide what ships.
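The tracking described above fits in a small record type and one aggregation function. This is a sketch under stated assumptions: the `EvalResult` fields mirror the metrics listed in the paragraph, but the field names, the integer severity scale, and the sample data are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class EvalResult:
    model: str
    category: str          # e.g. "bug_fix", "refactor", "navigation"
    passed: bool
    latency_s: float
    output_tokens: int
    retries: int
    human_edits: int
    failure_severity: int  # hypothetical scale: 0 = none, higher is worse


def summarize(results: list[EvalResult]) -> dict:
    """Aggregate results per (model, category) pair."""
    groups: dict[tuple[str, str], list[EvalResult]] = {}
    for r in results:
        groups.setdefault((r.model, r.category), []).append(r)
    return {
        key: {
            "pass_rate": mean(1.0 if r.passed else 0.0 for r in rs),
            "mean_latency_s": mean(r.latency_s for r in rs),
            "mean_output_tokens": mean(r.output_tokens for r in rs),
            "mean_retries": mean(r.retries for r in rs),
            "mean_human_edits": mean(r.human_edits for r in rs),
            "worst_severity": max(r.failure_severity for r in rs),
        }
        for key, rs in groups.items()
    }


# Made-up sample data for a placeholder model name.
sample = [
    EvalResult("model-a", "bug_fix", True, 2.0, 400, 0, 0, 0),
    EvalResult("model-a", "bug_fix", False, 4.0, 900, 2, 3, 2),
]
summary = summarize(sample)
print(summary[("model-a", "bug_fix")])
```

Keeping the worst severity alongside the averages matters because a model with a good pass rate can still be unusable if its rare failures are destructive.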

The editorial take: the model market is maturing into an infrastructure market. Benchmarks still matter, but the winning production stack will look less like a single crowned champion and more like a scheduler: cheap models carrying the traffic, expensive models handling the judgment calls, and evaluations deciding when to escalate. The best model is no longer the model at the top of a leaderboard. It is the model your system knows when not to use.

Sources: OpenRouter rankings, Arena AI Text leaderboard, Arena AI WebDev/Code leaderboard, Tencent Hy3 preview model card, OpenRouter GPT-5.5 cost analysis
