
The Most Interesting New LLM Benchmark Today Is Not a Model Launch, It’s a Cleanup Pass on Arabic Benchmarks

Anatoliy Kolodkin

21 Apr 2026 • 5 min read

Leaderboards are cheap. Cleaning up the benchmark before you publish the leaderboard is the part almost nobody wants to pay for.

That is what makes TII UAE’s new QIMMA Arabic LLM leaderboard more interesting than the usual model-ranking churn. The headline result, that Qwen/Qwen3.5-397B-A17B-FP8 lands at the top with an average score of 68.06, is fine as far as it goes. But the more consequential story is that the team behind QIMMA is arguing that Arabic evaluation has been compromised for a while by bad translations, inconsistent labeling, and benchmarks that were never built with native Arabic use in mind. In other words, the problem is not only that models need to improve. The ruler has been bent.

That is a useful corrective in a market that keeps treating benchmark tables like hard science while quietly skipping the part where somebody checks whether the prompts make sense to the people who actually speak the language.

The discard rates are the giveaway

QIMMA pulls together 109 subsets from 14 source benchmarks into a suite of more than 52,000 samples across seven domains: culture, STEM, legal, medical, safety, poetry and literature, and coding. The authors say 99 percent of the content is native Arabic, with code evaluation as the only language-agnostic exception. That alone would make it notable. What makes it worth paying attention to is the evidence that the cleanup step was not cosmetic.

According to the Hugging Face post, QIMMA ran a two-stage quality process that used Qwen3-235B-A22B-Instruct and DeepSeek-V3-671B for automated scoring, then escalated disputed or culturally sensitive cases to native Arabic speakers for human review. The discard rates are the tell. TII says it removed 23.3 percent of MizanQA, 8.3 percent of PalmX, 6.7 percent of MedAraBench, and 3.1 percent of ArabicMMLU. If you are throwing out nearly a quarter of one benchmark before you trust the scores, the issue is not edge-case noise. The issue is that the benchmark was not publication-ready.
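The routing logic behind a two-stage process like this is easy to sketch. What follows is a minimal illustration, not QIMMA's actual pipeline: the 0-to-1 score scale, the three thresholds, and the `Sample` fields are all assumptions made for the example. Two automated judges score each sample, and anything they disagree on, or anything flagged as culturally sensitive, goes to a human.

```python
from dataclasses import dataclass

# Hypothetical thresholds on a 0-1 quality scale; tune them on a hand-labeled sample.
KEEP_AT = 0.8   # mean judge score at or above this is kept automatically
DROP_AT = 0.4   # mean judge score at or below this is discarded automatically
MAX_GAP = 0.3   # judges further apart than this count as "disputed"


@dataclass
class Sample:
    sample_id: str
    judge_a: float           # quality score from the first automated judge
    judge_b: float           # quality score from the second automated judge
    sensitive: bool = False  # flagged as culturally sensitive


def triage(sample: Sample) -> str:
    """Return 'keep', 'discard', or 'human_review' for one sample."""
    if sample.sensitive:
        return "human_review"
    if abs(sample.judge_a - sample.judge_b) > MAX_GAP:
        return "human_review"
    mean = (sample.judge_a + sample.judge_b) / 2
    if mean >= KEEP_AT:
        return "keep"
    if mean <= DROP_AT:
        return "discard"
    # Middling scores are ambiguous, so a person decides.
    return "human_review"


samples = [
    Sample("q1", 0.95, 0.90),
    Sample("q2", 0.20, 0.30),
    Sample("q3", 0.90, 0.40),
    Sample("q4", 0.90, 0.85, sensitive=True),
]
decisions = {s.sample_id: triage(s) for s in samples}
print(decisions)
```

The design choice worth copying is that automation only makes the easy calls. Disagreement between judges is treated as information, not averaged away.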

That matters because benchmark discourse tends to flatten everything into deltas between models. A model beats another by two points, somebody posts the chart, and the industry moves on. But if the task itself is poorly localized or structurally inconsistent, those deltas do not tell you what people think they tell you. They may mostly be measuring who handled awkward translation artifacts, broken answer choices, or culturally off-target phrasing more gracefully.

For Arabic in particular, that is not a minor concern. A lot of English-first evaluation work assumes translation is a mechanical preprocessing step. It is not. Dialect, register, idiom, legal language, religious and cultural references, and writing quality all shape difficulty in ways that a naive translation pipeline can destroy. QIMMA’s core contribution is less “here is a new leaderboard” than “stop pretending localization quality is separate from model evaluation.”

The coding subset is the most practical warning

The sharpest operational lesson in the whole release is buried in the coding details. QIMMA says it is the first Arabic leaderboard to include code evaluation, using Arabic adaptations of HumanEval+ and MBPP+. More importantly, the team says 88 percent of Arabic HumanEval+ prompts and 81 percent of Arabic MBPP+ prompts required modification for clarity, consistency, structure, or semantics.

That should make every team benchmarking non-English coding models a little uncomfortable. If eight or nine prompts out of ten need cleanup before they are even fit for evaluation, then a lot of multilingual coding claims are probably grading the translation layer as much as the model. This is exactly how bad evals turn into bad product decisions. A team concludes that Model A underperforms in Arabic coding, when the real issue is that the prompt became ambiguous or unnatural during localization. The model gets blamed for the benchmark’s sloppy plumbing.

The broader point is that multilingual coding is not solved by translating English problem statements and hoping for the best. If you ship coding assistants, education tools, or developer support systems in Arabic, you need a validation pass on localized prompts, not just translated unit tests. Otherwise you are building confidence on top of corrupted inputs.
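One cheap piece of that validation pass can be automated. The sketch below assumes HumanEval-style prompts, meaning a Python function signature plus a docstring with doctest examples, and checks that localization touched only the prose. The helper names are hypothetical, and this is a structural check only: it says nothing about whether the Arabic reads naturally, which still needs a native speaker.

```python
import re


def code_skeleton(prompt: str) -> list[str]:
    """Collect lines that should be identical before and after localization:
    function signatures and doctest calls."""
    keep = []
    for line in prompt.splitlines():
        stripped = line.strip()
        is_signature = re.match(r"(async\s+)?def\s+\w+\s*\(", stripped)
        if is_signature or stripped.startswith(">>>"):
            keep.append(stripped)
    return keep


def skeleton_preserved(original: str, localized: str) -> bool:
    """True if the localized prompt kept every signature and doctest call intact."""
    return code_skeleton(original) == code_skeleton(localized)


original = '''def add(a: int, b: int) -> int:
    """Return the sum of two integers.
    >>> add(2, 3)
    5
    """
'''

# Only the prose sentence is translated; the code is untouched.
good = '''def add(a: int, b: int) -> int:
    """أعد مجموع عددين صحيحين.
    >>> add(2, 3)
    5
    """
'''

# The doctest call picked up an Arabic comma, so it is no longer valid Python.
broken = '''def add(a: int, b: int) -> int:
    """أعد مجموع عددين صحيحين.
    >>> add(2، 3)
    5
    """
'''

print(skeleton_preserved(original, good))
print(skeleton_preserved(original, broken))
```

A check like this catches the mechanical damage, such as translated identifiers or localized punctuation inside code. Ambiguity and unnatural phrasing are the harder failures, and those do not show up in a diff.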

One leaderboard, multiple kinds of “best”

QIMMA’s published table also makes a point the industry keeps relearning: there is no single meaningful answer to “what is the best Arabic model?” Qwen3.5-397B-A17B-FP8 leads the overall ranking at 68.06. Karnak follows at 66.20. Jais-2-70B-Chat comes in at 65.81. But the authors’ own summary is more nuanced than the average leaderboard tweet. Arabic-specialized models perform especially well on cultural and linguistically grounded tasks, while larger multilingual models still dominate the coding subsets.

That split tracks what practitioners should expect in production. “Arabic” is not one workload. Legal retrieval in Modern Standard Arabic, educational tutoring, medical assistance, consumer support, safety moderation, and code generation all stress different parts of a model. A model that excels at culturally grounded QA may still be mediocre for code-heavy developer workflows. A multilingual giant that cruises through coding tasks may still miss tone, idiom, or domain nuance in real user-facing Arabic applications.

That is why overall scores should be treated as triage, not truth. They are useful for narrowing the field. They are not a substitute for testing against your own task mix. If you are building Arabic products, the right question is not which model tops QIMMA. The right question is which model wins on your slice of QIMMA-like work, under your latency and cost constraints, with your risk tolerance for hallucination and cultural misread.
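Re-weighting a leaderboard by your own task mix takes a few lines. In the sketch below, the model names, domain scores, and mix weights are invented placeholders, not QIMMA results; the point is only that the same per-domain table can produce different winners for different products.

```python
def weighted_score(domain_scores: dict[str, float], task_mix: dict[str, float]) -> float:
    """Average the per-domain scores, weighted by how much traffic each domain gets."""
    total = sum(task_mix.values())
    if total <= 0:
        raise ValueError("task_mix weights must sum to a positive number")
    return sum(domain_scores[d] * w for d, w in task_mix.items()) / total


def best_for(scores: dict[str, dict[str, float]], task_mix: dict[str, float]) -> str:
    """Name of the model with the highest weighted score for this task mix."""
    return max(scores, key=lambda model: weighted_score(scores[model], task_mix))


# Placeholder numbers for illustration only.
scores = {
    "model_a": {"culture": 80.0, "legal": 70.0, "coding": 40.0},
    "model_b": {"culture": 65.0, "legal": 60.0, "coding": 75.0},
}

support_mix = {"culture": 0.6, "legal": 0.3, "coding": 0.1}  # consumer support product
devtool_mix = {"culture": 0.1, "legal": 0.0, "coding": 0.9}  # developer tooling product

print(best_for(scores, support_mix))
print(best_for(scores, devtool_mix))
```

With these placeholder numbers the support-heavy mix favors `model_a` and the coding-heavy mix favors `model_b`. Latency, cost, and risk tolerance would enter as further filters before or after this step.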

Why this matters beyond Arabic

There is a bigger industry lesson here. QIMMA is nominally an Arabic leaderboard, but the methodology critique applies almost everywhere multilingual evaluation gets treated as a checkbox. The frontier labs have become very good at publishing polished model cards and cross-language benchmark wins. They are much less consistent about showing the sample-level messiness underneath. QIMMA’s choice to publish auditable outputs and foreground dataset repair is a reminder that evaluation quality is infrastructure. It is not a marketing accessory.

That matters even more now that model launches are increasingly incremental. We are in a phase where providers keep shipping better packaging, better routing, better tooling, and modestly better raw capability. In that environment, evaluation integrity becomes more important, not less. If benchmark quality is shaky, small model-to-model differences can disappear entirely into the benchmark’s own noise floor.
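A rough way to see how small that margin is: even with perfect labels, sampling noise alone puts a floor under what a score gap can mean. The sketch below uses a simple binomial approximation and treats the two models' results as independent, which is an assumption (comparisons on the same items are usually paired and somewhat tighter). The accuracies are illustrative, and label errors would add noise on top of what is computed here.

```python
import math


def gap_in_standard_errors(acc_a: float, acc_b: float, n_items: int) -> float:
    """Size of the accuracy gap between two models, in units of its standard error.

    Assumes each model's accuracy is a binomial proportion over n_items
    and that the two results are independent.
    """
    variance = acc_a * (1 - acc_a) / n_items + acc_b * (1 - acc_b) / n_items
    return abs(acc_a - acc_b) / math.sqrt(variance)


# A two-point gap at several benchmark sizes.
for n in (500, 5_000, 50_000):
    print(n, round(gap_in_standard_errors(0.68, 0.66, n), 2))
```

Under these assumptions, a two-point gap on a 500-item subset is less than one standard error, so it is indistinguishable from chance. The same gap only becomes convincing with tens of thousands of items, and only if those items are clean.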

There is also a geopolitical angle worth noticing. Arabic AI work is often covered in English-language tech media only when a frontier model vendor mentions MENA expansion or when a sovereign AI project raises money. QIMMA is more interesting than that. It is an example of regional evaluation work asserting that local language quality cannot be an afterthought delegated to translated English benchmarks. That is a healthier direction for the ecosystem than waiting for global labs to decide which languages deserve first-class measurement.

What engineers should do with this

If you run multilingual model evaluations, steal the process lesson immediately. Audit the dataset before you compare the models. Track discard rates. Flag culturally sensitive samples for human review. Publish sample-level outputs wherever licensing allows. If your benchmark cannot survive that scrutiny, it should not be deciding roadmap priorities.
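Tracking discard rates only requires that the audit leaves a log behind. A minimal sketch, assuming a hypothetical log of `(benchmark, decision)` pairs and an arbitrary 10 percent alert threshold; the benchmark names and entries are toy data.

```python
from collections import Counter

ALERT_THRESHOLD = 0.10  # hypothetical: flag any source losing more than 10 percent


def discard_rates(audit_log: list[tuple[str, str]]) -> dict[str, float]:
    """Fraction of samples discarded per benchmark."""
    totals: Counter[str] = Counter()
    dropped: Counter[str] = Counter()
    for benchmark, decision in audit_log:
        totals[benchmark] += 1
        if decision == "discard":
            dropped[benchmark] += 1
    return {name: dropped[name] / totals[name] for name in totals}


# Toy audit log for illustration only.
audit_log = [
    ("bench_x", "keep"), ("bench_x", "discard"), ("bench_x", "keep"), ("bench_x", "keep"),
    ("bench_y", "keep"), ("bench_y", "keep"), ("bench_y", "keep"),
    ("bench_y", "keep"), ("bench_y", "keep"),
]

rates = discard_rates(audit_log)
flagged = sorted(name for name, rate in rates.items() if rate > ALERT_THRESHOLD)

print(rates)
print(flagged)
```

Publishing the per-benchmark rate alongside the scores is what lets a reader judge how much repair each source needed before it was trusted.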

If you build Arabic user-facing systems, segment your evals by workflow instead of shopping for a mythical single best model. Test cultural QA separately from coding, legal retrieval separately from safety, and support flows separately from long-form generation. QIMMA is valuable precisely because it shows those buckets diverge.

And if you sell multilingual AI, expect buyers to get less patient with leaderboard theater. The next generation of serious evaluation will ask not only how high your model scored, but what got removed, rewritten, normalized, or hand-reviewed before the score existed. That is a good thing. It is how benchmarks stop being PR props and start becoming engineering tools again.

The strongest thing QIMMA does is make benchmark hygiene visible. That should embarrass a chunk of the leaderboard economy, and honestly, good. We have enough ranking tables already. What we need are more evaluation efforts willing to admit that measuring model quality starts with measuring whether the test itself deserves to exist.

Sources: TII UAE, QIMMA Arabic leaderboard, Qwen model card, Jais-2 model card, Hugging Face LightEval
