
Hy3 Has the Traffic, Claude Has the Taste Test

Anatoliy Kolodkin

16 May 2026 • 5 min read

The leaderboard story this week is not that Tencent beat Anthropic, or that Anthropic beat OpenAI, or that one more model name with a decimal point moved two slots in a table. The useful story is messier: traffic and taste have split. OpenRouter says Tencent’s Hy3 preview is where a huge amount of production-style token volume is going. Arena AI says humans still prefer Claude at the top of both text and WebDev tasks. If you are building with models, that split matters more than the trophy count.

OpenRouter’s latest rankings put Hy3 preview at #1 with 2.4 trillion weekly tokens, up a cartoonish 526,637% week over week. Claude Opus 4.7 and Claude Sonnet 4.6 follow at #2 and #3 with 1.56 trillion and 1.54 trillion tokens respectively. That is not a rounding error. It is a routing economy with real gravity, and Tencent is currently sitting in the center of it.

But the Arena AI text leaderboard tells a different story. Claude Opus 4.6 thinking remains #1 at 1502 Elo, followed by Claude Opus 4.7 thinking at 1500, Claude Opus 4.6 at 1498, and Claude Opus 4.7 at 1492. On Arena WebDev, Anthropic’s grip is even cleaner: Claude Opus 4.7 thinking leads at 1567 Elo, Opus 4.7 base is #2 at 1559, Opus 4.6 thinking is #3 at 1546, and Opus 4.6 base is #4 at 1541.

So no, Hy3 has not “beaten Claude” in the way engineers usually mean when they ask whether a model is better. It has beaten Claude in a different game: distribution, routing, price, availability, and possibly launch incentives. That game is still real. It is just not the same game as blind human preference.

Traffic is not taste, and taste is not deployment

Tencent’s Hy3 pitch is explicitly a deployment pitch. The company describes Hy3 preview as a 295-billion-parameter Mixture-of-Experts model with 21 billion active parameters and up to a 256K context window. It claims a 54% reduction in time to first token, a 47% reduction in end-to-end response time, more than 99.99% success rate in CodeBuddy and WorkBuddy deployments, complex agent workflows up to 495 steps, and a 40% inference-efficiency improvement. Pricing starts around $0.18 per million input tokens, $0.06 per million cached input tokens, and $0.59 per million output tokens on Tencent Cloud TokenHub.
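
To see how those list prices turn into a bill, here is a minimal sketch of the arithmetic. The per-million prices are the approximate Tencent Cloud TokenHub figures quoted above; the `Pricing` class, the `request_cost` helper, and the example workload (20,000 input tokens, 1,000 output tokens, a 50% cache hit rate) are illustrative assumptions, not anything Tencent publishes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """USD per million tokens."""
    input_per_m: float
    cached_input_per_m: float
    output_per_m: float


# Approximate starting prices quoted for Hy3 preview on Tencent Cloud TokenHub.
HY3_PREVIEW = Pricing(input_per_m=0.18, cached_input_per_m=0.06, output_per_m=0.59)


def request_cost(pricing: Pricing, input_tokens: int, output_tokens: int,
                 cache_hit_rate: float = 0.0) -> float:
    """Estimated USD cost of one request.

    cache_hit_rate is the assumed fraction of input tokens billed at the
    cached-input rate; the rest is billed at the normal input rate.
    """
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    cached = input_tokens * cache_hit_rate
    fresh = input_tokens - cached
    total = (fresh * pricing.input_per_m
             + cached * pricing.cached_input_per_m
             + output_tokens * pricing.output_per_m)
    return total / 1_000_000


if __name__ == "__main__":
    # Hypothetical agent step: 20k tokens in, 1k tokens out.
    cold = request_cost(HY3_PREVIEW, 20_000, 1_000)
    warm = request_cost(HY3_PREVIEW, 20_000, 1_000, cache_hit_rate=0.5)
    print(f"no cache:  ${cold * 1_000_000:,.0f} per million requests")
    print(f"50% cache: ${warm * 1_000_000:,.0f} per million requests")
```

Under those assumptions, a million such steps cost roughly $4,190 uncached and $2,990 at a 50% cache hit rate. Swap in your own token counts and measured hit rate before drawing conclusions.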

That is the language of a model vendor trying to win the boring, expensive middle of AI usage: long-running agents, document-heavy workflows, retrieval pipelines, internal copilots, and repeatable automation where latency and cost matter because the model is not being used once. It is being used thousands or millions of times. A model does not need to be the most beloved answer generator on Arena to become a default workhorse in that environment. It needs to be good enough, cheap enough, fast enough, and available where the traffic already flows.

The free listing complicates the victory lap. Hy3 preview (free) fell from #7 to #18 on OpenRouter, dropping to 473 billion weekly tokens, even as the main Hy3 preview listing held #1 and grew. That is the signal to watch. If the paid listing keeps its lead while the free variant cools, Hy3 has retention beyond subsidy traffic. If the main listing follows the free one down after launch energy fades, the story becomes less “Tencent found product-market fit” and more “free tokens are undefeated at generating charts.”

Practitioners should not sneer at either outcome. Promo-driven usage can still expose a model to enough real workloads to make it better, cheaper, and easier to integrate. But teams should not mistake a launch spike for a migration plan. Before routing serious work to Hy3, measure it on your own failure modes: tool-call reliability, JSON discipline, code-edit accuracy, refusal behavior, latency under concurrency, cache hit economics, and how often a cheaper answer creates a more expensive human cleanup.
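
One of those failure modes, JSON discipline, is cheap to measure. The sketch below is one possible check, not a standard benchmark: the `json_discipline_rate` name and the required keys in the usage note are hypothetical, and it only verifies that each raw model output parses as a JSON object containing the keys your tool layer expects.

```python
import json
from typing import Iterable, Set


def json_discipline_rate(outputs: Iterable[str], required_keys: Set[str]) -> float:
    """Fraction of raw outputs that parse as a JSON object with all required keys."""
    total = 0
    ok = 0
    for raw in outputs:
        total += 1
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            # Chatty preambles, trailing commentary, and truncation all land here.
            continue
        if isinstance(parsed, dict) and required_keys.issubset(parsed):
            ok += 1
    return ok / total if total else 0.0
```

Run it over a few hundred captured outputs per candidate model, for example `json_discipline_rate(samples, {"tool", "args"})`, and compare the rates side by side rather than trusting a vendor's success-rate claim.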

Claude still owns the judgment layer

Arena’s WebDev board is the strongest evidence that Claude remains the model family to beat when taste and judgment matter. UI work is not just syntax. It requires visual hierarchy, product intuition, iterative correction, and the ability to avoid turning a small request into a decorative crime scene. Claude holding the top four WebDev slots says users still prefer Anthropic’s outputs when the task demands more than throughput.

That does not make Claude the right default for every agent loop. Opus-class models are expensive enough that using them as the first stop for every low-stakes operation is often architectural laziness wearing a quality hat. The better pattern is tiered routing: expensive model for planning, review, ambiguity, architectural decisions, UI judgment, and failed-task escalation; cheaper high-throughput model for repetitive execution; verifier around both. The bill should reflect the shape of the work, not the leaderboard screenshot someone pasted into Slack.
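
The tiered pattern described above can be sketched in a few lines. Everything here is an assumption made for illustration: the task-kind labels, the `route` function, and the plain callables standing in for real model clients and a real verifier. It shows the shape of the policy, not a production router.

```python
from dataclasses import dataclass
from typing import Callable

ModelFn = Callable[[str], str]         # prompt -> output (stand-in for a model client)
Verifier = Callable[[str, str], bool]  # (prompt, output) -> acceptable?

# Hypothetical labels for work that goes straight to the expensive tier.
JUDGMENT_KINDS = {"planning", "review", "ambiguity", "architecture", "ui"}


@dataclass
class Routed:
    tier: str       # "expensive", "cheap", or "escalated"
    output: str
    verified: bool


def route(kind: str, prompt: str, cheap: ModelFn, expensive: ModelFn,
          verify: Verifier, cheap_attempts: int = 2) -> Routed:
    """Send judgment work to the expensive model; escalate failed cheap runs."""
    if kind in JUDGMENT_KINDS:
        out = expensive(prompt)
        return Routed("expensive", out, verify(prompt, out))
    for _ in range(cheap_attempts):
        out = cheap(prompt)
        if verify(prompt, out):
            return Routed("cheap", out, True)
    # Failed-task escalation: the cheap tier never passed verification.
    out = expensive(prompt)
    return Routed("escalated", out, verify(prompt, out))
```

The verifier wraps both tiers on purpose. An expensive answer that fails verification is still a failure, and logging the `tier` field shows how much of the bill comes from escalation.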

OpenRouter’s app rankings make this more urgent. A scrape of those rankings shows Hermes Agent at 353 billion tokens, OpenClaw at 195 billion, Kilo Code at 166 billion, and Claude Code at 70.5 billion in visible daily app traffic. Agent traffic is no longer a cute demo category. It is large enough to move model rankings, distort “popularity” signals, and punish teams that pick models by fandom instead of instrumentation.

There is also a useful caution in the reaction to Claude Opus 4.7. Hacker News discussion around Opus 4.7 has focused less on leaderboard victory and more on behavior changes: adaptive thinking defaults, whether longer context actually improves coding reliability, and anecdotal reports that Opus 4.7 sometimes gets first-try edits right less often than 4.6. That does not invalidate Arena’s results. It reinforces the point: “newer” and “higher-ranked” are not the same as “better for your repo, your prompts, and your review process.”

OpenAI’s coding signal is back on the board, but not in first place

The most interesting Arena WebDev change is GPT-5.5-xhigh entering at #9 with 1501 Elo and 3,220 votes under the codex-harness label. That pushed several incumbents down and helped knock MiMo-v2.5 out of the top 20. It is not a top-four Anthropic-level result, but it is a meaningful coding-board entry because harnessed coding performance is what teams actually feel day to day: edit selection, test repair, patch coherence, and whether the model can stop before it “improves” unrelated files.

For teams already invested in OpenAI tooling, this is worth testing, not worshipping. Put GPT-5.5-xhigh into the same evaluation harness you use for Claude and your cheaper execution model. Give it dirty branches, partial test failures, migration scripts, frontend regressions, and ambiguous product requests. The question is not whether it can impress a benchmark. The question is whether it reduces review burden per dollar.
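
“Review burden per dollar” can be made concrete with a small calculation. The sketch below is one possible definition, not an established metric: it assumes you log model spend, reviewer minutes, and an accept/reject outcome for every run, and the `Run` fields and reviewer hourly rate are hypothetical inputs you would supply.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Run:
    model_cost_usd: float   # API spend for this attempt
    review_minutes: float   # human time spent reviewing or cleaning up
    accepted: bool          # did the change ship?


def cost_per_accepted_change(runs: Sequence[Run], reviewer_usd_per_hour: float) -> float:
    """Total model plus reviewer cost, divided by the number of accepted changes."""
    accepted = sum(1 for r in runs if r.accepted)
    if accepted == 0:
        return float("inf")
    model_cost = sum(r.model_cost_usd for r in runs)
    review_cost = sum(r.review_minutes for r in runs) / 60 * reviewer_usd_per_hour
    return (model_cost + review_cost) / accepted
```

Rejected runs still count toward cost, which is the point: a model that is cheap per token but produces patches reviewers throw away can lose to a pricier model on this number.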

Owl Alpha is the other “needs review” item. It jumped from #16 to #10 on OpenRouter, reaching 621 billion weekly tokens, up 159% week over week. That is too large to ignore and too under-explained to trust. It is not present in Arena’s Text or WebDev top 20, and research for this piece did not surface a strong public practitioner thread explaining the move. That usually means one of three things: a platform integration, a routing/default change, or concentrated app traffic. Popularity is a smoke alarm, not a root-cause analysis.

The practical move is boring and correct: build a routing matrix. Use Claude Opus 4.7 or 4.6 where failure is expensive and judgment matters. Trial Hy3 for high-volume, latency-sensitive, verification-friendly workflows where its economics can compound. Keep an eye on Owl Alpha, but benchmark it against known baselines before moving anything important. Test GPT-5.5-xhigh specifically on coding workflows where harness behavior matters more than chat charm.
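
A routing matrix can start as a plain lookup table checked into the repo. The sketch below encodes the recommendations in the paragraph above; the workload names, the `Policy` fields, and the model labels are illustrative placeholders, not real API model identifiers.

```python
from typing import NamedTuple, Optional


class Policy(NamedTuple):
    primary: str
    fallback: Optional[str]
    status: str  # "default" or "trial"


# Hypothetical workload classes mapped to the models discussed in this piece.
ROUTING_MATRIX = {
    "high_stakes_judgment": Policy("claude-opus-4.7", None, "default"),
    "high_volume_verifiable": Policy("hy3-preview", "claude-opus-4.7", "trial"),
    "coding_harness": Policy("gpt-5.5-xhigh", "claude-opus-4.7", "trial"),
}

# Models to benchmark against known baselines before they get any row above.
WATCHLIST = ["owl-alpha"]


def pick_model(workload: str, allow_trials: bool) -> str:
    """Return the model label for a workload, skipping trial models if asked."""
    policy = ROUTING_MATRIX[workload]
    if policy.status == "trial" and not allow_trials:
        return policy.fallback or policy.primary
    return policy.primary
```

Keeping trial status in the table makes the launch-promo question explicit: promoting a row from “trial” to “default” becomes a reviewed change backed by your own measurements, not a reaction to a chart.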

The industry keeps trying to compress model selection into one table because tables are comforting. This week’s rankings argue for the opposite. The best model is increasingly not a model. It is a policy: route by workload, verify by risk, escalate by uncertainty, and keep measuring after the launch promo ends. Hy3 has the traffic. Claude has the taste test. The teams that win will use both facts without confusing them.

Sources: OpenRouter Rankings, Arena AI Text Leaderboard, Arena AI WebDev Leaderboard, Tencent Hy3 launch notes, OpenRouter Hy3 model page, OpenRouter Claude Opus 4.7 model page, Reddit LocalLLaMA Hy3 discussion, Hacker News Claude Opus 4.7 discussion.
