iOSWorld Shows Phone Agents Still Fail Where Real Assistants Have to Work: Across Apps, Memory, and Personal Context

Phone-agent demos have always had a suspiciously clean-room quality. The model opens an app, taps a visible button, maybe fills a form, and everyone pretends we are one launch away from a personal assistant that can run your digital life. iOSWorld is useful because it removes that comfort. It gives the agent a persistent fictional person, Jordan Avery, spreads that person’s life across 26 custom iOS apps, and then asks the system to work across apps, memory, preferences, history, and state. The result is exactly the cold shower this category needed.

The best reported configuration, Claude Opus 4.6 with vision plus XML observations, reaches 51.9% overall. That headline is not terrible. The split is the problem. The same configuration scores 81.5% on single-app tasks, but only 36.7% on multi-app tasks and 54.3% on memory tasks. In other words, the system does reasonably well when the job looks like remote-controlling one app. It falls apart where a real assistant has to earn trust: connecting facts across apps and remembering what matters about the user.

The hard part is not tapping the screen

iOSWorld includes 133 tasks: 27 single-app tasks, 60 multi-app tasks spanning two to eight apps, and 46 memory or personalization tasks. The seeded data covers transactions, messages, travel, social relationships, financial activity, calendars, notes, food, shopping, fitness, sports, utilities, and professional networking. That matters because a phone is not a UI benchmark. It is a messy personal database with weak boundaries and lots of implied context.
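As a quick consistency check, the per-category rates quoted earlier for the best configuration can be combined with these task counts to recover the overall score. The sketch below assumes the overall number is a plain task-weighted average, which the article does not state explicitly; the variable names are illustrative only.

```python
# Task counts and pass rates as quoted in this article.
categories = {
    "single_app": (27, 0.815),
    "multi_app": (60, 0.367),
    "memory": (46, 0.543),
}

total_tasks = sum(count for count, _ in categories.values())

# Assumption: overall score is a task-weighted average, so we recover
# the whole number of passed tasks per category and add them up.
passed = sum(round(count * rate) for count, rate in categories.values())
overall = passed / total_tasks

print(total_tasks, passed, f"{overall:.1%}")
```

Under that assumption the numbers line up: 69 of 133 tasks, or 51.9%. The multi-app category is both the largest and the weakest, which is why it drags the headline down so far.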

The multi-app category is the honest one. A useful assistant might need to check a delivery receipt, compare it with a bank charge, find the related message, update a note, and decide whether the result matches the user’s preference. No single app knows the whole answer. The agent has to preserve intent while switching interfaces, and it has to avoid making irreversible mistakes in the process. That is where the best configuration still fails: at 36.7%, it misses nearly two out of three multi-app tasks.

This should inform product strategy immediately. If you are building consumer assistants, do not ship broad “drive my phone” autonomy as the default trust model. Ship scoped workflows with confirmations, app-native integrations, and narrow permissions. Let the agent suggest actions, gather evidence, and prepare drafts; require confirmation before money moves, messages are sent, appointments are changed, or records are deleted. The benchmark is not saying phone agents are useless. It is saying the reliable unit of automation is smaller than the marketing unit.
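One way to picture that trust model is a small gate that sorts actions by risk and blocks anything irreversible until the user says yes. This is a minimal sketch, not anything from iOSWorld: the `Risk` tiers, `ProposedAction`, and `execute` are hypothetical names invented for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Risk(Enum):
    READ = "read"      # gathers evidence, changes nothing
    DRAFT = "draft"    # prepares content the user can review
    COMMIT = "commit"  # moves money, sends, reschedules, or deletes


@dataclass(frozen=True)
class ProposedAction:
    name: str
    risk: Risk


def requires_confirmation(action: ProposedAction) -> bool:
    """Only irreversible, state-changing actions need an explicit yes."""
    return action.risk is Risk.COMMIT


def execute(action: ProposedAction, user_confirmed: bool = False) -> str:
    """Run the action, or refuse if it needs a confirmation it lacks."""
    if requires_confirmation(action) and not user_confirmed:
        return f"blocked: '{action.name}' needs user confirmation"
    return f"ran: '{action.name}'"
```

A draft goes through unattended, while `execute(ProposedAction("send_message", Risk.COMMIT))` is refused until it is called with `user_confirmed=True`. The point of the design is that the agent never decides for itself which tier an action belongs to.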

More context is not automatically better context

The XML observation result is one of the most useful findings in the paper. Privileged accessibility-style structure helps strong frontier models, improving some by up to 26 percentage points. But smaller models can get worse. GPT-5.4 Mini drops from 26% vision-only to 16% vision plus XML. Qwen3.5 35B-A3B drops from 13% to 11% overall and from 7% to 0% on multi-app tasks.

That is a perfect warning for every agent team stuffing ever-larger state blobs into context. More UI structure is not the same thing as better state. It adds tokens, distractors, action possibilities, and attention burden. A stronger model may use it to anchor decisions. A smaller one may drown in it. The same pattern shows up in enterprise agents when teams dump schemas, logs, tickets, docs, chat history, and policy pages into one prompt and call it grounding. Sometimes it is grounding. Sometimes it is a haystack budget with a model attached.

For engineers, the takeaway is to design observation layers, not merely expose all available state. Compress the interface around task-relevant affordances. Prefer semantic labels over raw DOM or accessibility tree noise. Keep screenshot or visual state available for verification and recovery, but do not assume the model should reason over every element on the screen every step. Context engineering is interface design. If your state representation would confuse a human operator, it will probably confuse the model faster and more expensively.
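An observation layer in this sense can be very small. The sketch below assumes a flattened list of UI elements with a role, a label, and an interactivity flag; that shape, the `UIElement` type, and the keyword heuristic are all assumptions for illustration, not the format iOSWorld actually emits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UIElement:
    role: str          # e.g. "button", "text", "image"
    label: str         # accessibility label; may be empty
    interactive: bool  # can the agent act on it?


def compress_observation(elements, task_keywords, max_items=20):
    """Keep labeled elements that are interactive or mention a task keyword."""
    keywords = [k.lower() for k in task_keywords]
    kept = []
    for el in elements:
        label = el.label.strip()
        if not label:
            continue  # unlabeled nodes are mostly tree noise
        relevant = any(k in label.lower() for k in keywords)
        if el.interactive or relevant:
            kept.append(f"{el.role}: {label}")
    return kept[:max_items]  # hard cap so the prompt cannot balloon
```

Given a screen with a "Pay now" button, an unlabeled image, an order-status line, and an unrelated ad, a task keyword of "order" keeps only the button and the status line. A real filter would need something better than substring matching, but the shape is the same: decide what the model sees before the model has to.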

Tools beat robot fingers when the app belongs to you

The paper’s MCP tool-use ablation points toward the practical path. For Qwen3.5 35B-A3B, structured tools improve strict pass rate from 12.8% to 24.8% and mean rubric score from 0.33 to 0.683 across the 133 tasks. That is still not solved, but it is a meaningful improvement from giving the model task-shaped handles instead of asking it to infer everything from pixels and trees.

This is the architectural lesson phone-agent vendors should steal. Raw UI control should be the fallback for systems you do not own. If you own the app, expose intent-level actions: search orders, fetch active subscriptions, compare transactions, draft a message, add a calendar event, update a note. Then use the visual interface as a verification channel. The assistant should not have to tap through five screens to learn something an API can safely expose in one typed call.

That also reduces privacy risk. Broad screen control encourages broad observation. Intent-level tools can enforce least privilege, redact unnecessary fields, log access, and validate parameters before execution. A phone agent that can see everything and tap anything is operationally convenient right up until it is a compliance incident. Tool APIs make permission boundaries legible. Robot fingers make them theatrical.
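To make the contrast concrete, here is a rough sketch of what one intent-level tool might look like: it validates parameters, logs the access, and returns only an allowlisted set of fields. Everything in it is hypothetical, including the in-memory `_ORDERS` store, the `search_orders` name, and the field allowlist; it is a plain function, not the MCP servers the authors ship.

```python
# Hypothetical in-memory store standing in for an app backend.
_ORDERS = [
    {"id": "o-1", "merchant": "Cafe Uno", "total": 18.50,
     "card_number": "4111111111111111"},
    {"id": "o-2", "merchant": "Book Barn", "total": 42.00,
     "card_number": "4111111111111111"},
]

# Least privilege: the agent only ever sees these fields.
ALLOWED_FIELDS = ("id", "merchant", "total")

access_log = []  # every call is recorded for later audit


def search_orders(merchant, limit=5):
    """Return redacted orders whose merchant name contains `merchant`."""
    # Validate parameters before touching any data.
    if not merchant.strip():
        raise ValueError("merchant must be a non-empty string")
    if not 1 <= limit <= 20:
        raise ValueError("limit must be between 1 and 20")

    access_log.append(("search_orders", merchant, limit))

    needle = merchant.lower()
    matches = [o for o in _ORDERS if needle in o["merchant"].lower()]
    # Redact: copy only the allowlisted fields into the result.
    return [{k: o[k] for k in ALLOWED_FIELDS} for o in matches[:limit]]
```

Calling `search_orders("cafe")` returns the matching order without its card number, and the call itself lands in `access_log`. None of that is possible to enforce when the agent simply reads whatever happens to be on screen.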

The failure taxonomy reinforces the point. Among frontier-model failures with vision plus XML, iOSWorld reports that 51% exhausted their budget, 26% gave up, and 23% stopped prematurely. These are not exotic cognition failures. They are runtime design failures: too much search, unclear progress, weak stopping criteria, insufficient planning, and poor recovery. Better models will help, but the real product work is around budgets, checkpoints, task decomposition, and explicit uncertainty.
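Those three failure modes map fairly directly onto a run loop with an explicit step budget, a progress check, and a completion check that is separate from the agent's own opinion. The sketch below is one possible shape, assuming the caller supplies `step` and `is_done` callables; the names and the `Outcome` labels are illustrative and are not taken from the benchmark's evaluation code.

```python
from enum import Enum


class Outcome(Enum):
    DONE = "done"
    BUDGET_EXHAUSTED = "budget_exhausted"
    STALLED = "stalled"


def run_with_budget(step, is_done, max_steps=30, max_stalled=3):
    """Run step() until is_done() passes, progress stalls, or budget ends.

    step() performs one agent action and returns True if it made
    observable progress. is_done() is an external check of the goal,
    so the loop never stops just because the agent claims success.
    Returns (outcome, steps_used).
    """
    stalled = 0
    for used in range(1, max_steps + 1):
        progressed = step()
        if is_done():
            return Outcome.DONE, used
        # Count consecutive steps without progress; reset on any progress.
        stalled = 0 if progressed else stalled + 1
        if stalled >= max_stalled:
            return Outcome.STALLED, used
    return Outcome.BUDGET_EXHAUSTED, max_steps
```

Returning a labeled outcome instead of a bare failure is the useful part: a stalled run can be handed to the user with a question, while a budget-exhausted run is a signal to decompose the task before retrying.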

iOSWorld’s artifact release is also important. The authors ship apps, seeded data, tasks, rubrics, evaluation code, MCP servers, and AWS EC2 Mac runner support. That is the right bar for this category. If a benchmark claims to measure personal assistants but cannot be rerun, inspected, or instrumented, it is just a demo wearing a lab coat.

The editorial read is simple: phone agents are not blocked by one more impressive tap sequence. They are blocked by cross-app state, personal memory, permission boundaries, and tool interfaces that do not lie to the model. Builders should stop asking “can the model use the phone?” and start asking “what is the smallest safe interface the model needs to complete this task?” That is less flashy. It is also how this becomes a product instead of a keynote trick.

Sources: arXiv, iOSWorld project page, iOSWorld GitHub, AndroidWorld