Muse Spark Cracks Arena Top 3 in One Week — and Llama 4 Is Already Gone
One week ago, Meta launched Muse Spark — the first model from its Superintelligence Labs division. Today it sits at #3 on the Arena AI text leaderboard with 1495 Elo, displacing Llama 4, Meta's own previous flagship. The Llama brand didn't lose to a competitor. It lost to a successor.
The leaderboard shuffle tells a story of corporate impatience done right. Llama 4 held #3 at 1493 Elo. Muse Spark enters at 1495 — a 2-point margin that's statistically thin but symbolically enormous. Meta didn't iterate on Llama. It started over.
The MSL Gambit
Meta Superintelligence Labs exists because Mark Zuckerberg got tired of watching OpenAI and Anthropic lap his AI division. The response was characteristically aggressive: poach Alexandr Wang from Scale AI, invest $14.3 billion for a 49% stake in his company, and give him a blank check to rebuild Meta's AI stack from scratch. Muse Spark is the first output from that pipeline.
The model is natively multimodal with visual chain-of-thought and tool use baked in — not bolted on. It also ranks #3 on the Arena Vision leaderboard at 1292 Elo, beating nearly every specialized vision model. That's not a coincidence. MSL designed for multi-category competence from day one, and the Arena results suggest the architecture delivers.
But the most technically interesting feature is what Meta calls "Contemplating mode" — a parallel multi-agent approach in which multiple AI agents work the same problem simultaneously and synthesize their outputs. This is test-time compute scaling taken in a different direction than Anthropic's extended thinking or OpenAI's chain-of-thought: instead of one model thinking longer, Muse Spark spawns multiple reasoning paths and merges them. If it works reliably at scale, it's a genuinely different paradigm.
The Score Is Competitive, Not Dominant
Let's be precise about what 1495 Elo means. Claude Opus 4.6 Thinking leads at 1502 — 7 points ahead. Claude Opus 4.6 (non-thinking) is at 1496. Gemini 3.1 Pro Preview sits at 1493. The gap between #1 and #7 is 21 Elo points. This is a crowded top tier, and Muse Spark is in the mix but not above it.
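To make those margins concrete: under the standard Elo expected-score formula, a rating gap translates directly into a head-to-head win probability. (Arena's published ratings come from a Bradley-Terry-style fit, which behaves similarly for small gaps — treat this as an illustrative approximation, not Arena's exact math.)

```python
# Expected head-to-head win probability implied by an Elo gap.
# Standard Elo formula; used here only to illustrate how small
# these leaderboard margins are in practice.
def expected_win_rate(elo_gap: float) -> float:
    """Probability the higher-rated model wins a single pairwise vote."""
    return 1.0 / (1.0 + 10.0 ** (-elo_gap / 400.0))

# Muse Spark (1495) vs Llama 4 (1493): a 2-point gap
print(f"{expected_win_rate(2):.3f}")   # ~0.503 -- effectively a coin flip
# Claude Opus 4.6 Thinking (1502) vs Muse Spark (1495): 7 points
print(f"{expected_win_rate(7):.3f}")   # ~0.510
```

In other words, the entire #1-to-#7 spread of 21 points implies the leader wins barely 53% of pairwise matchups against the #7 model. "Crowded top tier" is an understatement.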
What's more revealing is where Muse Spark doesn't appear. The Arena Code leaderboard — arguably more relevant to the people reading this — has no Muse Spark in the top 10. Anthropic holds four of the top five spots on code, with Claude Opus 4.6 Thinking at 1548 and Claude Opus 4.6 at 1545. GLM 5.1 from Z.ai breaks the sweep at #3 with 1537. GPT-5.4 High (Codex) is down at #7 with 1457.
For developers choosing models, this is the split that matters: Muse Spark for general reasoning and vision tasks, Claude for anything involving code, and Gemini quietly holding its own across both. The era of one model ruling everything is over. The leaderboard is a specialist's market now.
Llama Is the Subtext
Forbes ran a piece headlined "Muse Spark: Meta's Rebuilt AI Stack After Llama's Disappointment," and the framing is correct. Llama 4 wasn't bad — it was #3 on Arena at one point. But "not bad" wasn't enough for a company with Meta's resources and ambitions. The Llama brand carried the baggage of incremental progress while competitors were making generational jumps.
The message from MSL is unambiguous: the Llama pipeline is a dead end. Future models come from the MSL stack. For the thousands of developers who built on Llama's open-source promise, this is a pivot they didn't ask for. Zuckerberg has promised future open-source models from MSL, but Muse Spark itself is closed — available only through Meta's AI app and website, requiring a Facebook or Instagram login. That's not an open ecosystem. That's a walled garden with a "coming soon" sign on the gate.
The open-source LLM community now faces a genuine gap. Llama was the default for self-hosted models. If MSL doesn't follow through on open-sourcing, the alternatives are Qwen (Alibaba), GLM (Z.ai), and Mistral — all capable but none with Llama's mindshare or ecosystem momentum. The vacuum is real, and it's Meta's to fill or abandon.
The Distribution Advantage
Muse Spark's real edge isn't the Elo score. It's distribution. Meta can put this model in front of billions of users through Facebook, Instagram, and WhatsApp without anyone installing an app or signing up for a new service. When Zuckerberg said Meta would win at AI through scale, this is what he meant: the model doesn't need to be the best. It needs to be good enough, everywhere, immediately.
For builders, this creates a tension. Muse Spark is good — top-3-on-Arena good. But it's not accessible the way Llama was. You can't fine-tune it. You can't self-host it. You can't build a product on it without depending on Meta's API and its authentication requirements. The "Contemplating" mode is architecturally exciting, but you can only access it through Meta's consumer interface, not as a developer tool.
This is Meta playing to its strength: consumer AI at scale, not developer tools. And that's fine — it's arguably the right strategy for the company. But developers who hoped MSL would be Llama 2's spiritual successor should adjust expectations. This is a consumer play first.
What the Full Board Looks Like
Stepping back from Muse Spark specifically, the Arena leaderboard this week shows a remarkably stable top tier with meaningful motion underneath:
- Claude Opus 4.6 Thinking holds #1 at 1502 (-2 Elo), with the non-thinking variant at #2 (1496). Anthropic's lead is narrow but consistent across text, code, and vision.
- Gemini 3.1 Pro Preview gained 1 Elo to reach 1493, holding #4. Google's steady improvement continues without fanfare.
- GPT-5.4 High dropped 3 Elo to 1481, now at #7. OpenAI's flagship sliding while competitors hold or gain is a signal worth watching.
- xAI's Grok 4.20 variants hold three spots in the top 10 (#6, #8, #10) with mixed movement. Volume play, not a winner.
- OpenRouter has stopped publishing model rankings entirely, redirecting to app/agent token volumes. The top apps by volume: MiMo-V2-Pro integration (393B tokens), Hermes Agent (223B), Kilo Code (196B), Claude Code (135B). The infrastructure layer is shifting from model-centric to application-centric — a telling move.
The Bottom Line for Builders
Three things worth doing with this information:
First, if you're building on Llama, start evaluating alternatives now. Muse Spark isn't self-hostable, and the Llama pipeline appears frozen. Qwen 3.6 Plus (#9 on Arena Code at 1453) and GLM 5 (#10 at 1440) are viable open-source options that are actively improving.
Second, watch the code leaderboard more than the text leaderboard if you're building developer tools. The text rankings measure general chat quality; the code rankings measure whether a model can actually do useful work. Anthropic's dominance there (4 of top 5) is a stronger signal than the 7-point text lead.
Third, the "Contemplating" multi-agent paradigm is worth studying even if you can't use Muse Spark directly. The idea of parallel agent collaboration as a reasoning strategy is showing up in research from multiple labs. If you're building agent architectures, this pattern — spawn, reason in parallel, synthesize — is likely to be a standard primitive within a year.
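The pattern itself is simple enough to sketch. Below is a minimal, hypothetical version: `call_model` is a placeholder for whatever LLM client you use, and the temperature spread and majority-vote synthesis are illustrative choices — Meta has not published how "Contemplating" mode actually spawns or merges paths.

```python
# Minimal sketch of spawn / reason-in-parallel / synthesize.
# `call_model`, the temperature schedule, and the majority vote are
# all assumptions for illustration, not Muse Spark's actual design.
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def call_model(prompt: str, temperature: float) -> str:
    # Stand-in for a real API call; in practice, higher temperature
    # diversifies the reasoning paths.
    return f"answer for {prompt!r} at t={temperature}"

def contemplate(prompt: str, n_paths: int = 4) -> str:
    temps = [0.3 + 0.2 * i for i in range(n_paths)]
    # Spawn: run the same problem down several paths concurrently.
    with ThreadPoolExecutor(max_workers=n_paths) as pool:
        answers = list(pool.map(lambda t: call_model(prompt, t), temps))
    # Synthesize: simple majority vote here; a real system might
    # instead feed every path to a final "judge" model.
    return Counter(answers).most_common(1)[0][0]
```

A production version would vary prompts as well as temperature, and replace the vote with a judge-model merge — but the three-phase skeleton is the part likely to become a standard primitive.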
Muse Spark is good news for the leaderboard. More competition at the top pushes everyone forward. But a closed model from a company that built its AI reputation on open source is a complicated kind of progress. The Elo score says "worthy competitor." The distribution strategy says "consumer platform." The Llama-shaped hole says "we'll believe the open-source promise when we see it."
Sources: Arena AI Leaderboard, Meta AI Blog, TechCrunch, Forbes