
MobileGym Makes GUI-Agent Benchmarks Look Less Like Vibes and More Like Engineering

Anatoliy Kolodkin

27 May 2026 • 4 min read

Mobile-agent demos have been winning the wrong argument. The impressive part is not that a model can tap through a shopping flow on a phone-shaped screenshot in a clip that plays well on social media. The hard part is proving what happened after the tap: which state changed, which side effects leaked, whether the task can be replayed from the same start, and whether the result survives contact with something more rigorous than vibes.

That is why MobileGym, a new browser-hosted Android-like simulation platform for mobile GUI agents, is more interesting than its leaderboard. It gives agent researchers a programmable world with 28 simulated apps, 416 parameterized task templates, deterministic judges, rollback-friendly state, and enough cheap parallelism to support reinforcement learning instead of one-off demo theater. For an industry trying to turn agents from “watch this” clips into dependable software, that is the right layer to improve.

The benchmark is really an operating system for evaluation

MobileGym’s app set spans 12 everyday apps and 16 system apps, including WeChat, Alipay, Bilibili, RedNote, eBay, Spotify, Maps, Reddit, Calendar, Files, Settings, and Browser. The benchmark splits its 416 templates into 256 test and 160 train templates, with finite parameter ranges producing more than 27,000 distinct task instances before counting continuous ranges. That matters because agents trained on a small fixed list of tasks learn benchmark trivia. Parameterized tasks force the model to learn a pattern, not memorize a path.
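To make the counting concrete, here is a minimal sketch of how a parameterized template expands into task instances. The `TaskTemplate` class, its fields, and the alarm task are hypothetical illustrations, not MobileGym's actual template format.

```python
from dataclasses import dataclass
from itertools import product
from math import prod


@dataclass(frozen=True)
class TaskTemplate:
    """Hypothetical template: an instruction pattern plus finite parameter ranges."""
    pattern: str
    params: dict[str, tuple]

    def count(self) -> int:
        # The number of distinct instances is the product of the range sizes.
        return prod(len(values) for values in self.params.values())

    def instances(self):
        # Yield one concrete instruction per combination of parameter values.
        names = list(self.params)
        for combo in product(*(self.params[n] for n in names)):
            yield self.pattern.format(**dict(zip(names, combo)))


template = TaskTemplate(
    pattern="Set an alarm for {hour}:00 labeled '{label}'",
    params={"hour": tuple(range(6, 10)), "label": ("gym", "work", "flight")},
)

print(template.count())            # 12
print(next(template.instances()))  # Set an alarm for 6:00 labeled 'gym'
```

Two small ranges already give 12 instances from one template. Multiply that across hundreds of templates and memorizing paths stops being a viable strategy.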

The engineering numbers are the quiet story. Each browser instance uses roughly 400 MB of RAM, around 50 MB of disk, and cold-starts in about three seconds. The project page says 256 parallel instances can run on one server. That moves GUI-agent experimentation away from fragile device farms, disposable real accounts, and expensive emulator orchestration toward something closer to normal ML infrastructure: resettable environments, batch rollouts, state inspection, and repeatable failures.
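A quick back-of-the-envelope check shows why those figures fit on a single machine. The per-instance numbers are the approximate ones quoted above; the use of decimal gigabytes and the omission of host overhead are simplifying assumptions.

```python
RAM_PER_INSTANCE_MB = 400   # "roughly 400 MB" per browser instance
DISK_PER_INSTANCE_MB = 50   # "around 50 MB" per browser instance
INSTANCES = 256             # parallel instances on one server


def fleet_footprint_gb(instances: int) -> tuple[float, float]:
    """Return (ram_gb, disk_gb) using decimal units (1 GB = 1000 MB)."""
    ram_gb = instances * RAM_PER_INSTANCE_MB / 1000
    disk_gb = instances * DISK_PER_INSTANCE_MB / 1000
    return ram_gb, disk_gb


ram_gb, disk_gb = fleet_footprint_gb(INSTANCES)
print(f"{ram_gb:.1f} GB RAM, {disk_gb:.1f} GB disk")  # 102.4 GB RAM, 12.8 GB disk
```

Roughly 100 GB of RAM is ordinary for a training server, which is the point: the environment budget looks like a line item, not a device lab.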

The evaluator gets privileged access to structured JSON state while the agent sees only screenshots and actions. That separation is exactly what a good benchmark should enforce. Nobody wants the agent to cheat by reading hidden state, but evaluators absolutely need ground truth. Otherwise “success” becomes whatever a vision-language judge guessed from the final screenshot.
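The separation can be sketched as two different views onto one environment. Everything here is a toy: `SimulatedPhone`, its method names, and the state layout are assumptions for illustration, not MobileGym's API.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class SimulatedPhone:
    """Toy environment; the state layout and method names are illustrative."""
    _state: dict = field(default_factory=lambda: {"settings": {"wifi": True}})

    def observe(self) -> bytes:
        # Agent-facing view: pixels only. A real environment would render the UI;
        # this placeholder returns opaque bytes with no structured state attached.
        return b"<screenshot bytes>"

    def act(self, action: dict) -> None:
        # Agent-facing actions mutate the hidden state, as UI input would.
        if action == {"type": "tap", "target": "wifi_toggle"}:
            self._state["settings"]["wifi"] = not self._state["settings"]["wifi"]

    def privileged_state(self) -> dict:
        # Evaluator-only view: a deep copy, so the judge cannot mutate the world.
        return copy.deepcopy(self._state)


def judge_wifi_off(state: dict) -> bool:
    # Deterministic check on ground truth, not a guess from the final screenshot.
    return state["settings"]["wifi"] is False


env = SimulatedPhone()
env.act({"type": "tap", "target": "wifi_toggle"})
print(judge_wifi_off(env.privileged_state()))  # True
```

The agent loop only ever calls `observe()` and `act()`, while the judge only ever calls `privileged_state()`. Keeping those call sites apart is what makes the verdict both fair and reproducible.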

MobileGym makes that weakness explicit. The authors report a VLM-judge audit with a 10.2% misjudgment rate. In consumer demos, that might look tolerable. In any agent that sends messages, changes settings, buys things, books appointments, or moves money, it is not. A screenshot can tell you that a button changed color. It cannot reliably tell you whether the agent mutated the wrong record, left a stale preference behind, or triggered an unrelated cross-app side effect.

Side effects are the metric agent teams keep under-measuring

The side-effect story is where MobileGym becomes useful beyond mobile UI research. Modern agent systems are basically permissioned mutation engines. A coding agent changes files and opens pull requests. A personal assistant changes calendars and sends messages. A support agent updates tickets and customer records. In all of those systems, “completed the task” is not enough. The agent must complete the task without quietly damaging adjacent state.

MobileGym’s full-state diffing and deterministic judges point toward the evaluation pattern teams should be using everywhere: start from a known snapshot, run the agent, inspect the complete terminal state, and measure intended changes separately from unintended ones. That is not glamorous, but it is how agent systems become reviewable. Audit logs, rollback, sandboxing, permission scopes, and deterministic replay are not enterprise paperwork. They are the difference between debugging a system and arguing with a transcript.
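The snapshot, run, diff, classify loop fits in a few lines. This is a minimal sketch assuming state is a nested dictionary and that each task declares the paths it is allowed to change; the function names and the calendar example are hypothetical.

```python
ABSENT = "<absent>"  # sentinel for paths that exist on only one side


def flatten(state: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into {"a.b.c": value} for key-by-key comparison."""
    flat = {}
    for key, value in state.items():
        path = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def diff_states(before: dict, after: dict) -> dict:
    """Return {path: (old, new)} for every path that changed, appeared, or vanished."""
    a, b = flatten(before), flatten(after)
    changes = {}
    for path in sorted(a.keys() | b.keys()):
        old, new = a.get(path, ABSENT), b.get(path, ABSENT)
        if old != new:
            changes[path] = (old, new)
    return changes


def split_diff(changes: dict, intended_paths: set) -> tuple[dict, dict]:
    """Separate the changes the task asked for from everything else."""
    intended = {p: c for p, c in changes.items() if p in intended_paths}
    unintended = {p: c for p, c in changes.items() if p not in intended_paths}
    return intended, unintended


# Task: move the standup to 10:00. The agent also flipped do-not-disturb.
before = {"calendar": {"standup": "09:00"}, "settings": {"dnd": False}}
after = {"calendar": {"standup": "10:00"}, "settings": {"dnd": True}}

intended, unintended = split_diff(diff_states(before, after), {"calendar.standup"})
print(intended)    # {'calendar.standup': ('09:00', '10:00')}
print(unintended)  # {'settings.dnd': (False, True)}
```

A screenshot judge would likely score this run as a success. The diff shows a completed task plus one side effect, which is a different and more honest verdict.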

The leaderboard reinforces the point that GUI control is still hard. Gemini 3.1 Pro leads at 58.8% success, followed by Doubao-Seed-2.0-Pro at 52.0% and Qwen3.6-Plus at 45.7%. Open-source GUI specialists trail far behind: AutoGLM-Phone-9B at 20.0%, UI-Venus-1.5-8B at 15.4%, GUI-Owl-1.5-8B-Think at 15.1%, and UI-TARS-1.5-8B at 13.8%. Those numbers should cool some product-roadmap optimism. A phone GUI looks familiar to humans, but it is a nasty partially observable control problem for models.

The encouraging result is that training in the simulated environment appears to transfer. On the 256-template test set, Qwen3-VL-4B starts at 9.4% success and reaches 22.2% after GRPO training. In a 59-task sim-to-real signal subset, simulation success rises from 33.9% to 76.7%, while real-device success rises from 32.2% to 72.9%. The authors report that 95.1% of the simulation-side gain is retained on the real-device subset.
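The retained-gain figure can be reproduced from the four success rates above, assuming "retained" means the real-device gain divided by the simulation gain. That definition is an inference from the numbers, not a quoted formula.

```python
def retained_gain(sim_before: float, sim_after: float,
                  real_before: float, real_after: float) -> float:
    """Fraction of the simulation-side improvement that shows up on real devices."""
    sim_gain = sim_after - sim_before     # percentage points gained in simulation
    real_gain = real_after - real_before  # percentage points gained on real devices
    return real_gain / sim_gain


ratio = retained_gain(33.9, 76.7, 32.2, 72.9)
print(f"{ratio:.1%}")  # 95.1%
```

The simulation gain is 42.8 points and the real-device gain is 40.7 points, so the ratio lands at about 0.951.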

That last number should be read carefully, not worshipped. Simulation fidelity is always the tax. MobileGym is not a perfect reproduction of every proprietary Android backend, payment rail, encrypted store, or app-specific weirdness. It is a controlled approximation focused on interaction fidelity: screenshots, navigation, state transitions, cross-app tasks, and measurable outcomes. The 95.1% retained gain is strong evidence that this particular simulator is useful, not a universal law that simulated GUI training always transfers.

What builders should steal from it

If you are evaluating GUI agents, do not accept screenshot-only grading. Require deterministic task state, reset and fork support, parameterized tasks, side-effect detection, cost and runtime accounting, and a clear distinction between query tasks and mutation tasks. MobileGym’s typed AnswerSheet protocol for query tasks is a small but important example: the answer format should be machine-checkable instead of left to free-text matching and judge-model vibes.
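What "machine-checkable" buys you is easiest to see in a sketch. The schema below is a guess at the general idea of a typed answer sheet; `AnswerField`, `grade`, and the inbox example are hypothetical and do not reflect MobileGym's actual AnswerSheet protocol.

```python
from dataclasses import dataclass
from typing import Union

Answer = Union[int, float, str, bool]


@dataclass(frozen=True)
class AnswerField:
    """One typed slot the agent must fill, with its ground-truth value."""
    name: str
    type_: type
    expected: Answer


def grade(spec: list[AnswerField], submitted: dict) -> bool:
    """Pass only if every field is present, exactly typed, and equal to ground truth."""
    if set(submitted) != {f.name for f in spec}:
        return False  # missing or extra fields fail outright
    for f in spec:
        value = submitted[f.name]
        # Exact type match: rejects "3" for an int slot, and True for an int slot.
        if type(value) is not f.type_ or value != f.expected:
            return False
    return True


spec = [AnswerField("unread_count", int, 3), AnswerField("sender", str, "Alice")]

print(grade(spec, {"unread_count": 3, "sender": "Alice"}))    # True
print(grade(spec, {"unread_count": "3", "sender": "Alice"}))  # False
```

The second submission would probably pass a lenient free-text matcher. A typed sheet makes the grading rule explicit, so disagreements are about the spec rather than about what a judge model felt like that day.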

If you are building production agents, the same lesson applies even if you never touch MobileGym. Your agent runtime needs pre-state, post-state, intended diff, unintended diff, tool-call trace, permission boundary, and rollback story. Without those, you are not measuring reliability. You are collecting anecdotes until an incident gives you a better dataset.

MobileGym is early: the GitHub repository had 30 stars when this piece was researched, and a Hacker News search for “MobileGym” and “GUI agent” returned no matching discussions. That is fine. Infrastructure usually arrives before the crowd notices. The important thing here is not that another benchmark exists. It is that this benchmark treats mobile agents like systems that change state, not chatbots that happen to click.

The LGTM take: agents do not need prettier demos. They need verifiable worlds where success, side effects, rollback, and training cost are first-class measurements. MobileGym is a serious step in that direction.

Sources: arXiv, MobileGym project page, GitHub, Hugging Face Papers
