Android Bench Says GPT 5.5 Beats Gemini. That’s Not the Whole Story.

Anatoliy Kolodkin

26 May 2026 • 5 min read

The funniest thing about Google’s Android Bench leaderboard is not that GPT 5.5 sits above Gemini on a Google-run Android coding benchmark. It is that the ranking is the least interesting part of the table. The useful signal is hiding in the adjacent columns: latency, tokens, cost, confidence intervals, verification behavior, and the very Android-specific ways agents fail when Kotlin, Gradle, screenshots, and multi-file repos enter the chat.

The New Stack’s fresh writeup frames the headline cleanly: Google ranks the best AI models for building Android apps, and the winner is not Gemini. Fine. That will get the clicks. But for engineering teams choosing a coding agent, “winner” is the wrong abstraction. Android Bench is valuable precisely because it makes the model-selection question messier than a one-number scoreboard.

Google’s Android Bench is a benchmark for LLMs solving Android development tasks. The official Android Developers framing is practical: can the model fix issues in real Android projects and validate the work with unit or instrumentation tests? That is a better question than “can the model emit plausible Kotlin in a vacuum?” Android engineering is rarely a single-file puzzle. It is Gradle, lifecycle state, Compose quirks, API migrations, screenshots, generated code, flaky build environments, and the grim ritual of waiting for the test suite to tell you which assumption just died.

The leaderboard is useful because it is not just a leaderboard

When this article was written, Android Bench showed GPT 5.5 leading with a 74.0 pass@1 score, a confidence interval of 66.8–80.5, average latency of 15.5 hours, average total tokens of 64.5 million, and an average cost of $133.90 for a full benchmark run. GPT 5.4 and Gemini 3.1 Pro Preview were tied at 72.4 pass@1, but the cost and time profile changed the story. Gemini 3.1 Pro Preview was listed at $49.00 and 11.5 hours, while GPT 5.4 came in at $91.70 and 21.2 hours.

That is the kind of table teams should want. The 1.6-point gap between first place and the tie for second can look decisive in a chart, yet it sits well inside GPT 5.5’s own 66.8–80.5 confidence interval and can look irrelevant in a budget review. If a model is slightly better but dramatically more expensive, slower, or less predictable under your harness, the top row may not be the best production choice. In agentic coding, the unit of value is not benchmark rank. It is accepted change per dollar, per hour, per reviewer interruption.
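To make the cost argument concrete, here is a rough Python sketch that turns the three quoted rows into dollars and hours per solved task. It assumes the quoted cost and latency are averages for one full run over the benchmark’s 100 tasks, and that pass@1 can be read as the expected share of those tasks solved; the class and function names are illustrative, not part of Android Bench.

```python
from dataclasses import dataclass

TASK_COUNT = 100  # the Android Bench methodology describes 100 tasks


@dataclass(frozen=True)
class LeaderboardRow:
    model: str
    pass_at_1: float     # percent of tasks solved on the first attempt
    run_cost_usd: float  # average cost of one full benchmark run
    run_hours: float     # average latency of one full benchmark run


# Figures as quoted in this article for the top three rows.
ROWS = [
    LeaderboardRow("GPT 5.5", 74.0, 133.90, 15.5),
    LeaderboardRow("GPT 5.4", 72.4, 91.70, 21.2),
    LeaderboardRow("Gemini 3.1 Pro Preview", 72.4, 49.00, 11.5),
]


def solved_tasks(row: LeaderboardRow) -> float:
    """Expected number of solved tasks per run, derived from pass@1."""
    return TASK_COUNT * row.pass_at_1 / 100.0


def usd_per_solved_task(row: LeaderboardRow) -> float:
    """Crude proxy for 'accepted change per dollar' (assumption, not a leaderboard column)."""
    return row.run_cost_usd / solved_tasks(row)


def hours_per_solved_task(row: LeaderboardRow) -> float:
    """Crude proxy for 'accepted change per hour' (assumption, not a leaderboard column)."""
    return row.run_hours / solved_tasks(row)


if __name__ == "__main__":
    # Cheapest model per solved task is printed first.
    for row in sorted(ROWS, key=usd_per_solved_task):
        print(
            f"{row.model:24s} "
            f"${usd_per_solved_task(row):.2f} per solved task, "
            f"{hours_per_solved_task(row) * 60:.1f} min per solved task"
        )
```

Sorted this way, the row order no longer matches the pass@1 ranking, which is the whole point. A solved benchmark task is still not the same thing as a patch your reviewers accept.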

The broader leaderboard reinforces the point. Claude Opus 4.7 appeared at 68.7, GPT 5.3 Codex at 67.7, Claude Opus 4.6 at 66.6, GPT 5.2 Codex at 62.5, Kimi K2.6 at 58.6, Claude Sonnet 4.6 at 58.4, DeepSeek V4 Pro at 55.4, Gemini 3 Flash Preview at 42.0, Gemma 4 31B IT at 33.2, and Gemini 2.5 Pro at 29.1. That spread is useful, but it is not a procurement policy. It is a shortlist generator.

The methodology matters more than the bragging rights. Android Bench’s technical report describes 100 Android development tasks structured like SWE-Bench, including Kotlin, Android APIs, Gradle, large multi-file codebases, and 20 multimodal tasks requiring screenshot understanding. Twelve tasks were written manually by Android domain experts; 88 came from existing GitHub pull requests. The tasks span 33 repositories, including Jerboa, compose-rich-editor, Pocket Casts Android, Now in Android, and Thunderbird Android. Google says it runs the benchmark 10 times per model and reports pass@1 with bootstrap confidence intervals, while enforcing a $10 inference-cost limit per task and a 250-turn limit.
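For readers who have not worked with bootstrap intervals, the following Python sketch shows one common way to get a percentile bootstrap confidence interval for pass@1 by resampling tasks with replacement. It is an assumption about the general technique, not a reproduction of Google’s exact procedure, and the per-task pass rates are synthetic numbers invented for the example.

```python
import random
from typing import List, Tuple


def pass_at_1(per_task_rates: List[float]) -> float:
    """Mean per-task pass rate, expressed as a percentage."""
    return 100.0 * sum(per_task_rates) / len(per_task_rates)


def bootstrap_ci(
    per_task_rates: List[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval, resampling tasks with replacement."""
    rng = random.Random(seed)
    n = len(per_task_rates)
    estimates = sorted(
        pass_at_1(rng.choices(per_task_rates, k=n)) for _ in range(n_resamples)
    )
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Synthetic data (NOT real Android Bench results): 100 tasks, each with the
# fraction of 10 runs that passed. 60 always pass, 20 pass half the time,
# 20 never pass.
SYNTHETIC_RATES = [1.0] * 60 + [0.5] * 20 + [0.0] * 20

if __name__ == "__main__":
    low, high = bootstrap_ci(SYNTHETIC_RATES)
    print(f"pass@1 = {pass_at_1(SYNTHETIC_RATES):.1f}, 95% CI = [{low:.1f}, {high:.1f}]")
```

With only 100 tasks, intervals of this kind tend to be wide, which is why the leaderboard’s confidence-interval column deserves as much attention as the score next to it.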

The “sed loop” is the real warning label

The most useful detail in Google’s own error analysis is not which model won. It is how weaker agents lost. The report calls out models falling into “sed loops,” attempting brittle regex edits to Kotlin, skipping verification, or hallucinating local variable availability. That is painfully familiar if you have watched an agent confidently mangle a structured language through shell substitutions because the harness made text surgery easier than semantic editing.

This is where Android Bench becomes a harness benchmark, not just a model benchmark. A model that understands Android architecture can still fail if the tool surface nudges it toward bad edits. A weaker model can look stronger if the environment forces small patches, compiler feedback, test execution, and safe file operations. Builders should treat agent behavior as a systems problem: model, prompt, patch tool, sandbox, verifier, retry policy, and reviewer workflow all contribute to the final diff.
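A harness can watch for this failure mode directly. The Python sketch below flags a run when the agent issues several consecutive sed commands with nothing in between. The report names the “sed loop” pattern but does not define a detector, so the streak heuristic, the threshold of three, and the function names here are all assumptions made for illustration.

```python
import shlex
from typing import Iterable


def is_sed_edit(command: str) -> bool:
    """True when the shell command's first token is sed."""
    try:
        tokens = shlex.split(command)
    except ValueError:
        # Unbalanced quotes or similar: not parseable, so not counted.
        return False
    return bool(tokens) and tokens[0] == "sed"


def longest_sed_streak(commands: Iterable[str]) -> int:
    """Length of the longest run of back-to-back sed commands."""
    longest = current = 0
    for command in commands:
        current = current + 1 if is_sed_edit(command) else 0
        longest = max(longest, current)
    return longest


def looks_like_sed_loop(commands: Iterable[str], threshold: int = 3) -> bool:
    """Hypothetical heuristic: N or more sed edits in a row, with no build or test between them."""
    return longest_sed_streak(commands) >= threshold


if __name__ == "__main__":
    log = [
        "sed -i 's/val count/var count/' app/src/main/java/Counter.kt",
        "sed -i 's/var count/val count/' app/src/main/java/Counter.kt",
        "sed -i 's/val count/var count/' app/src/main/java/Counter.kt",
        "./gradlew test",
    ]
    print(looks_like_sed_loop(log))
```

A flag like this is most useful as a trigger: when it fires, the harness can force a compile or test step, or steer the agent toward a structured patch tool.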

For Android teams, that means the eval you run internally should measure more than pass rate. Track whether the agent runs ./gradlew test or ./gradlew assembleDebug. Track whether it edits production code or quietly rewrites tests. Track the number of tool calls, wall-clock time, token spend, rollback rate, and reviewer corrections. Track whether it can interpret Compose screenshot failures or simply keeps changing padding until the snapshot stops yelling. The difference between a useful coding agent and an expensive intern simulator often shows up in those operational metrics before it shows up in a leaderboard score.
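One way to start collecting those signals is a small per-run record. The Python sketch below assumes your harness can hand over the list of shell commands the agent ran and the paths it changed; the `RunRecord` class and its field names are hypothetical. Only the two Gradle tasks named above are checked, and test files are recognized by the standard `src/test` and `src/androidTest` source sets.

```python
from dataclasses import dataclass
from typing import List

VERIFY_TASKS = ("test", "assembleDebug")         # the two Gradle tasks named in the text
TEST_DIRS = ("/src/test/", "/src/androidTest/")  # standard Android test source sets


@dataclass
class RunRecord:
    commands: List[str]       # shell commands issued by the agent, in order
    changed_paths: List[str]  # repo-relative paths in the final diff
    wall_clock_s: float
    total_tokens: int
    reviewer_corrections: int = 0
    rolled_back: bool = False

    @property
    def tool_calls(self) -> int:
        return len(self.commands)

    @property
    def ran_verification(self) -> bool:
        """True if any command invoked gradlew with one of VERIFY_TASKS."""
        for command in self.commands:
            parts = command.split()
            if parts and parts[0].endswith("gradlew"):
                if any(part in VERIFY_TASKS for part in parts[1:]):
                    return True
        return False

    @property
    def test_files_touched(self) -> List[str]:
        """Changed paths inside a test source set, for a reviewer to inspect."""
        return [p for p in self.changed_paths if any(d in p for d in TEST_DIRS)]


if __name__ == "__main__":
    record = RunRecord(
        commands=["./gradlew assembleDebug", "./gradlew test"],
        changed_paths=[
            "app/src/main/java/com/example/Counter.kt",
            "app/src/test/java/com/example/CounterTest.kt",
        ],
        wall_clock_s=412.0,
        total_tokens=180_000,
    )
    print(record.tool_calls, record.ran_verification, record.test_files_touched)
```

A touched test file is not automatically a problem, since a legitimate fix often adds a test. The record just makes sure a human sees it.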

Public benchmarks are maps, not terrain

Android Bench is more credible because Google published a result where an OpenAI model leads. Vendor-owned benchmarks become marketing PDFs when the vendor always wins. This one is more interesting because the house model is not automatically the house champion. That gives the benchmark some integrity, but it does not remove the usual trapdoor: the benchmark is public, many tasks come from GitHub, and contamination risk is real.

The New Stack quotes Zencoder CEO Andrew Filev making the right point: public domain-specific benchmarks are useful, but private evals can reorder rankings. He said a small change in test-case framing shifted model spread from six percentage points to 26 in Zencoder’s own research. That is not a reason to ignore Android Bench. It is a reason to stop treating public leaderboards as if they know your monorepo.

Your app has its own gravity. Maybe it uses a legacy Gradle setup, a custom design system, heavy native bindings, a private SDK, or an instrumentation suite that only passes on the cursed emulator image in CI. Maybe your biggest pain is not bug fixing but migration: Java to Kotlin, XML to Compose, one navigation stack to another, API-level cleanup, flaky screenshot tests, or dependency updates. Android Bench can tell you which models are worth testing. It cannot tell you which one will survive your repo’s particular swamp.

The practitioner move is straightforward: use Android Bench to build the candidate list, then run private evals. Include GPT 5.5, Gemini 3.1 Pro Preview, Claude’s strongest coding models, and cheaper open-weight or local options if cost or data boundaries matter. Give each agent the same harness, the same repository constraints, the same tests, and the same acceptance rubric. Score cost per accepted patch, time-to-green, review burden, and failure recoverability. If your team cannot reproduce the benchmark’s claims in a miniature version of your own workflow, the leaderboard is decorative.
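The scoring step can be as plain as a per-agent summary over logged attempts. The Python sketch below is one possible shape for it, under the assumption that every attempt is charged to the agent whether or not its patch was accepted; the `Attempt` fields and metric names are hypothetical, and failure recoverability is left out because it needs a definition specific to your workflow.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import median
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Attempt:
    agent: str
    cost_usd: float
    accepted: bool                     # did a reviewer accept the patch?
    minutes_to_green: Optional[float]  # None if the build never went green
    review_comments: int


def summarize(attempts: List[Attempt]) -> Dict[str, Dict[str, Optional[float]]]:
    """Aggregate attempts into per-agent metrics."""
    by_agent: Dict[str, List[Attempt]] = defaultdict(list)
    for attempt in attempts:
        by_agent[attempt.agent].append(attempt)

    summary: Dict[str, Dict[str, Optional[float]]] = {}
    for agent, runs in by_agent.items():
        accepted = [r for r in runs if r.accepted]
        green = [r.minutes_to_green for r in runs if r.minutes_to_green is not None]
        total_cost = sum(r.cost_usd for r in runs)  # failed attempts still cost money
        summary[agent] = {
            "acceptance_rate": len(accepted) / len(runs),
            "usd_per_accepted_patch": total_cost / len(accepted) if accepted else None,
            "median_minutes_to_green": median(green) if green else None,
            "review_comments_per_attempt": sum(r.review_comments for r in runs) / len(runs),
        }
    return summary


if __name__ == "__main__":
    # Invented example data, not measurements of any real model.
    attempts = [
        Attempt("agent-a", 2.0, True, 10.0, 1),
        Attempt("agent-a", 3.0, False, None, 0),
        Attempt("agent-a", 1.0, True, 20.0, 3),
        Attempt("agent-b", 0.5, False, None, 0),
    ]
    for agent, metrics in summarize(attempts).items():
        print(agent, metrics)
```

Dividing total spend by accepted patches, rather than averaging cost over successful attempts only, keeps a cheap-but-flaky agent from looking better than it is.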

The forward-looking pressure is on Google. If Android Bench is serious, it should become part of how Android Studio, Gemini Code Assist, Antigravity, and Google’s developer tooling choose models for tasks. The honest future is not “always use the vendor’s flagship model.” It is routing: this model for Compose screenshot reasoning, that model for Gradle breakage, a cheaper one for refactors, a local one for sensitive code, and a verifier that does not care whose logo is on the API response.
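At its simplest, routing of that kind is a lookup table with a privacy override. The Python sketch below is purely illustrative: the task categories come from the paragraph above, while the model labels are placeholders and do not describe how any Google tool selects models.

```python
from typing import Dict

# Placeholder labels; in practice these would come from your own private evals.
ROUTES: Dict[str, str] = {
    "compose_screenshot": "multimodal-model",
    "gradle_breakage": "build-specialist-model",
    "refactor": "cheap-model",
}
LOCAL_MODEL = "local-model"      # used whenever code must not leave the machine
DEFAULT_MODEL = "general-model"  # fallback for task kinds with no specific route


def route(task_kind: str, sensitive: bool = False) -> str:
    """Pick a model label for a task; sensitive code always stays local."""
    if sensitive:
        return LOCAL_MODEL
    return ROUTES.get(task_kind, DEFAULT_MODEL)


if __name__ == "__main__":
    print(route("gradle_breakage"))
    print(route("compose_screenshot", sensitive=True))
    print(route("dependency_update"))
```

The verifier stays outside the table on purpose: whichever model is chosen, the same build and test gate decides whether the diff survives.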

The editorial take: Android Bench is LGTM because it moves the industry from coding-model vibes toward domain-specific evidence. But the table is a starting point, not a verdict. The best Android coding agent is the model-harness-tooling combination that gets your codebase to green cheaply, repeatably, and with a diff a senior engineer can approve without developing a facial twitch.

Sources: The New Stack, Android Bench, Android Bench methodology, Android Bench on GitHub, Android Developers Blog
