MobileDev-Bench: The First Production-Grade Coding Agent Benchmark on Mobile Apps — All Top Models Fail on Real Android, React Native, and Flutter Issues

Mobile development has always been a harder target for coding agents than web or general-purpose library work, and now there is benchmark data to prove it. MobileDev-Bench is the first production-grade evaluation covering all three major mobile stacks: Android Native (Java/Kotlin), React Native (TypeScript), and Flutter (Dart). Its 384 tasks are drawn from real issue trackers across 18 production apps, with fixes that average 12.5 files and 324 lines changed. Nearly 36% of instances require coordinated changes across multiple artifact types (source code and manifest files together, for example), which is precisely the kind of cross-artifact coordination that models trained on general-purpose code struggle with.

Four state-of-the-art models (GPT-5.2, Claude Sonnet 4.5, Gemini Flash 2.5, and Qwen3-Coder) were evaluated end-to-end with automated validation inside real mobile build environments, and all showed low resolution rates. The failure taxonomy the paper surfaces is immediately useful as a debugging checklist: platform API misuse (models ignoring mobile lifecycle constraints), cross-artifact coordination failures (correct code, wrong manifest), and framework lifecycle errors specific to React Native and Flutter. For teams running coding agents on mobile issues today, this taxonomy tells you exactly where to look when an agent's fix builds cleanly but ships broken behavior.
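The "correct code, wrong manifest" failure mode is easy to see in miniature. As a hedged illustration (this is not tooling from the paper, and the class and method names are invented), the Java sketch below diffs the Android permissions a patch's code actually uses against the `<uses-permission>` declarations in `AndroidManifest.xml`; any permission in the first set but not the second will compile and build cleanly, then fail at runtime with a `SecurityException`:

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical cross-artifact check: compare permissions referenced in code
// against <uses-permission> declarations in AndroidManifest.xml. A mismatch
// builds cleanly but throws SecurityException when the code path runs.
public class ManifestCheck {

    private static final Pattern DECLARED = Pattern.compile(
            "<uses-permission\\s+android:name=\"([^\"]+)\"");

    // Extract the permissions declared in a manifest snippet.
    static Set<String> declaredPermissions(String manifestXml) {
        Set<String> out = new LinkedHashSet<>();
        Matcher m = DECLARED.matcher(manifestXml);
        while (m.find()) out.add(m.group(1));
        return out;
    }

    // Permissions the code uses that the manifest never declares.
    static Set<String> missing(Set<String> usedInCode, String manifestXml) {
        Set<String> gap = new LinkedHashSet<>(usedInCode);
        gap.removeAll(declaredPermissions(manifestXml));
        return gap;
    }

    public static void main(String[] args) {
        String manifest = "<manifest>\n"
                + "  <uses-permission android:name=\"android.permission.INTERNET\"/>\n"
                + "</manifest>";
        // Suppose the agent's patch opens the camera: the Kotlin/Java side is
        // correct, but the coordinated manifest edit was never made.
        Set<String> used = Set.of("android.permission.INTERNET",
                                  "android.permission.CAMERA");
        System.out.println(missing(used, manifest));
        // prints [android.permission.CAMERA]
    }
}
```

A real agent harness would extract the "used" set from API call sites rather than hard-coding it, but the shape of the check, and of the failure, is the same.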

Read the full article at arXiv →