DIRECT Says Bigger VLM Planners Are Often Just Slower. Route the Robot's Brain Instead
Robots make the agent-cost problem impossible to ignore. A browser agent can waste twenty seconds “thinking” and merely irritate you. A robot arm doing the same thing is just standing there, burning latency in the physical world while a banana waits for a frontier model to discover object permanence.
That is why DIRECT: When and Where Should You Allocate Test-Time Compute in Embodied Planners? is more interesting than its robotics wrapper suggests. The paper studies high-level vision-language model planners for embodied agents and asks a practical question: when should the system spend more inference-time compute, and when is that just expensive ritual?
The answer is refreshingly unromantic. Bigger planners, deeper chain-of-thought, and longer memory histories can help, but none of them helps uniformly. DIRECT routes each task to the cheapest capable planner using scene and instruction context. On simulated benchmarks and a physical Franka DROID setup, the router matches or beats stronger fixed planners while cutting latency substantially, by up to 65% in the reported hardware settings. That is not just a robotics result. It is a design pattern for every agent stack that currently treats “use the best model” as architecture.
Test-time compute is a budget, not a virtue
The paper evaluates three scaling axes: chain-of-thought depth, model size, and memory history. Each sounds like an obvious way to improve an agent. Let the model think more. Use a larger model. Give it more history. In practice, each is conditional. Chain-of-thought helps when the task requires hidden semantic, spatial, or physical reasoning. A larger model broadens the range of commands and perceptual situations the planner can handle. Memory helps when the current step depends on previous actions. But each can also waste time, add noise, or slow execution without changing the outcome.
DIRECT’s empirical setup spans VLABench, RoboMME, and physical Franka DROID hardware. The project page reports more than 270,000 simulated routing decisions and 245 hardware trajectories, which is exactly the sort of scale you want before trusting a routing claim. The paper’s most quotable VLABench result is brutal for “always think harder” architectures: on 44% of tasks, Qwen3-VL 8B Instruct matches or beats a Thinking configuration at under 2% of the latency. The project page describes one setup as roughly 63× faster: 1.9 seconds versus 118 seconds.
That is not a small optimization. That is the difference between an interactive system and a philosophical appliance.
The hardware numbers make the tradeoff more concrete. On a multi-step grocery-bagging task with a Franka DROID setup, Qwen3.5-VL 9B No Thinking reaches 47.62% success at 2.19 seconds. The Thinking version jumps to 90.48% success but takes 19.58 seconds. DIRECT reaches 95.24% success at 6.85 seconds. In other words, routing keeps the success benefit of the slow path, and here slightly exceeds it, at roughly a third of the Thinking latency, because it stops forcing every task through that path.
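The arithmetic behind those hardware numbers is worth doing once by hand. The short sketch below uses only the three figures quoted above; the dictionary keys and the helper name are illustrative, not anything from the paper.

```python
# Success (%) and latency (s) as quoted above for the grocery-bagging task.
configs = {
    "no_thinking": {"success": 47.62, "latency_s": 2.19},
    "thinking": {"success": 90.48, "latency_s": 19.58},
    "direct": {"success": 95.24, "latency_s": 6.85},
}


def latency_reduction(baseline_s, candidate_s):
    """Fraction of the baseline latency that the candidate saves."""
    return 1.0 - candidate_s / baseline_s


reduction = latency_reduction(
    configs["thinking"]["latency_s"], configs["direct"]["latency_s"]
)
success_delta = configs["direct"]["success"] - configs["thinking"]["success"]

print(f"latency saved vs Thinking: {reduction:.1%}")
print(f"success change vs Thinking: {success_delta:+.2f} points")
```

The saving comes out at about 65%, which lines up with the headline latency figure, and the success rate moves up rather than down.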
The router is the architecture
DIRECT conditions on a scene image and instruction, encodes multimodal context with frozen vision and text encoders, and predicts which planner offers the best quality-cost tradeoff. The overhead is small: the paper puts embedding and router cost at around 20–50 milliseconds, negligible next to VLM planner calls that take more than a second. That matters because routing only works operationally if the router does not become the new bottleneck.
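To make the shape of that decision concrete, here is a minimal sketch of a quality-cost router. It is not the paper's implementation. The two-element feature vector stands in for the output of frozen encoders, the per-planner weights are invented, and the utility rule (predicted success minus a latency penalty) is one plausible objective assumed for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Planner:
    name: str
    latency_s: float   # typical call latency, illustrative
    weights: tuple     # hypothetical learned router weights
    bias: float


def predict_success(planner, features):
    """Logistic estimate of success for this planner on this task."""
    z = planner.bias + sum(w * x for w, x in zip(planner.weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def route(planners, features, latency_penalty=0.02):
    """Pick the planner with the best predicted success minus latency cost."""
    def utility(p):
        return predict_success(p, features) - latency_penalty * p.latency_s
    return max(planners, key=utility)


# Hypothetical features: (needs_hidden_reasoning, needs_history).
PLANNERS = [
    Planner("fast", latency_s=2.0, weights=(-4.0, -1.0), bias=2.0),
    Planner("thinking", latency_s=20.0, weights=(0.5, 0.0), bias=2.5),
]

print(route(PLANNERS, (0.0, 0.0)).name)  # easy task
print(route(PLANNERS, (1.0, 0.0)).name)  # task needing hidden reasoning
```

The key property is that the router runs once, before any planner call, and its own cost is a dot product per candidate. Tuning `latency_penalty` is where a product decides how much a second of waiting is worth.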
The deeper point is that the router is not a fallback ladder. Many production agent systems still use a crude pattern: try the cheap model, retry on failure, escalate to the expensive model, maybe add more context if the first two attempts embarrass everyone. That is better than nothing, but it spends the routing decision after the system has already failed. DIRECT points toward pre-call routing based on task shape. The system should decide whether the situation demands expensive reasoning before it pays for expensive reasoning.
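A back-of-the-envelope comparison shows why the order of operations matters. The sketch below reuses the hardware latencies quoted earlier (2.19 and 19.58 seconds) and the upper end of the quoted router overhead as illustrative constants. It assumes a ladder that always tries the cheap planner first and a router that guesses correctly, which real routers will not always do.

```python
ROUTER_S = 0.05      # upper end of the router overhead quoted earlier
CHEAP_S = 2.19       # No Thinking latency from the hardware task
EXPENSIVE_S = 19.58  # Thinking latency from the hardware task


def ladder_latency(cheap_succeeds):
    """Fallback ladder: always pay for the cheap call, escalate on failure."""
    return CHEAP_S if cheap_succeeds else CHEAP_S + EXPENSIVE_S


def precall_latency(route_to_cheap):
    """Pre-call routing: pay the router, then exactly one planner call."""
    return ROUTER_S + (CHEAP_S if route_to_cheap else EXPENSIVE_S)


for label, easy in (("easy task", True), ("hard task", False)):
    print(
        f"{label}: ladder {ladder_latency(easy):.2f}s, "
        f"pre-call {precall_latency(easy):.2f}s"
    )
```

On easy tasks the ladder is marginally faster because it skips the router. On hard tasks it pays for a doomed cheap attempt first, and for a robot that attempt may be a physical action rather than a discarded string.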
This is where the robotics paper starts looking like an enterprise agent paper in disguise. Coding agents have the same structure. Some tasks need a frontier model with long-horizon planning: cross-repo refactors, ambiguous failing tests, security-sensitive changes, migrations where the failure mode is subtle. Many tasks do not: renaming a variable, updating a dependency pin, drafting boilerplate tests, applying a documented API change. A stack that sends both through the same premium model is not “maximizing quality.” It is failing to schedule work.
Browser agents have the same problem. So do data-analysis agents, support-ticket agents, and internal ops agents. The expensive call should be attached to task demand, not user anxiety or vendor marketing. Test-time compute is a resource like memory, CPU, database locks, or human reviewer attention. Serious systems allocate it. Amateur systems spray it.
Memory is not automatically wisdom
One useful nuance in DIRECT is that memory history is treated as a compute axis, not a moral good. Agent builders often assume more history is better because it sounds more human. In practice, longer history can help history-dependent tasks while hurting others by adding irrelevant state. Anyone who has watched an LLM fixate on an outdated earlier instruction has seen this failure mode. Context is not free just because the model accepts it.
That should change how teams design agent memory. Do not only ask how much history the model can hold. Ask which tasks need history, what kind of history they need, and when the retrieval layer should deliberately omit stale facts. For robotics, that might mean prior object movements or failed grasp attempts. For coding agents, it might mean the last test failure, not every previous command. For support agents, it might mean the current customer’s last unresolved issue, not the entire conversation archive since account creation.
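One way to make "which tasks need which history" explicit is a small policy table in front of the planner. The sketch below is an assumption-heavy illustration: the task kinds, event kinds, and limits are all hypothetical, and a real system would likely learn or tune them rather than hard-code them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    step: int
    kind: str   # hypothetical labels such as "grasp_failed" or "chat"
    text: str


# Hypothetical policy: task kind -> (relevant event kinds, max events kept).
HISTORY_POLICY = {
    "multi_step_manipulation": ({"grasp_failed", "object_moved"}, 3),
    "single_step_pick": (set(), 0),
}


def select_history(events, task_kind):
    """Return only the recent events this task kind is allowed to see."""
    kinds, limit = HISTORY_POLICY.get(task_kind, (set(), 0))
    if limit == 0:
        return []  # deliberately omit history for tasks that do not need it
    relevant = [e for e in events if e.kind in kinds]
    return relevant[-limit:]


log = [
    Event(1, "chat", "user greeted the robot"),
    Event(2, "object_moved", "banana moved to the bag"),
    Event(3, "grasp_failed", "slipped on the milk carton"),
]
print([e.step for e in select_history(log, "multi_step_manipulation")])
print(select_history(log, "single_step_pick"))
```

Unknown task kinds fall through to an empty history here, which is a conservative default chosen for the example and not a recommendation from the paper.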
The same goes for chain-of-thought. DIRECT’s results do not say “thinking is bad.” They say thinking is situational. That distinction matters because the post-DeepSeek/RL reasoning era has trained everyone to equate longer visible reasoning with better model performance. Sometimes it is. Sometimes it is latency cosplay. The system-level question is whether extra reasoning changes the action selected, the validation result, or the user outcome. If not, it is just a tax.
What practitioners should steal
The actionable move is to build routing metrics before building another model menu. Instrument task classes, context length, tool failure rates, ambiguity signals, validation outcomes, retry counts, and human correction rates. Then use those signals to decide when to call smaller models, non-thinking modes, larger models, or deeper reasoning configurations. If your agent stack only logs total token spend and final success, you do not have enough data to route intelligently.
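As a starting point, the signals listed above can live in a single per-call record. The sketch below is one possible schema, with field names invented for illustration, plus a summary grouped by task class and planner so the cheap and expensive paths can be compared on the same kind of work.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RoutingRecord:
    task_class: str
    planner: str
    context_tokens: int
    latency_s: float
    retries: int
    validated: bool        # did automated validation pass?
    human_corrected: bool  # did a person have to fix the result?


def summarize(records):
    """Aggregate success, latency, and retries per (task_class, planner)."""
    groups = defaultdict(list)
    for r in records:
        groups[(r.task_class, r.planner)].append(r)
    summary = {}
    for key, rows in groups.items():
        n = len(rows)
        clean = sum(r.validated and not r.human_corrected for r in rows)
        summary[key] = {
            "n": n,
            "success_rate": clean / n,
            "mean_latency_s": sum(r.latency_s for r in rows) / n,
            "mean_retries": sum(r.retries for r in rows) / n,
        }
    return summary


records = [
    RoutingRecord("rename", "small", 800, 1.0, 0, True, False),
    RoutingRecord("rename", "small", 900, 3.0, 1, True, True),
    RoutingRecord("refactor", "large", 12000, 40.0, 0, True, False),
]
for key, stats in summarize(records).items():
    print(key, stats)
```

Counting a human-corrected result as a failure is a design choice made for this example. It keeps the router from learning that a path "works" when a person quietly cleans up after it.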
Teams should also separate “planner quality” from “system quality.” A stronger planner that takes too long can be worse for the product. A cheaper planner that succeeds on easy tasks and escalates only when the task shape demands it can produce a better aggregate user experience. In physical systems, the latency penalty is obvious. In software systems, it hides inside cloud bills, slow PR cycles, and users who stop trusting agents because they feel expensive and unpredictable.
DIRECT also suggests a healthier benchmark habit. Instead of reporting one best model score, report quality-cost frontiers. Success rate at what latency? At what token budget? With what router overhead? How often did the system choose the cheap path? How often did it regret that choice? Leaderboards that ignore routing will increasingly describe components rather than products.
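A quality-cost frontier is easy to compute once each configuration has a latency and a success rate. The sketch below applies a plain Pareto filter to the three hardware figures quoted earlier in this piece; the function name and the tuple layout are assumptions made for the example.

```python
def pareto_frontier(points):
    """Keep configurations that no other one beats on both latency and success.

    points: list of (name, latency_s, success_pct) tuples.
    """
    frontier = []
    for name, lat, succ in points:
        dominated = any(
            other_lat <= lat
            and other_succ >= succ
            and (other_lat < lat or other_succ > succ)
            for _, other_lat, other_succ in points
        )
        if not dominated:
            frontier.append((name, lat, succ))
    return sorted(frontier, key=lambda p: p[1])


# Grocery-bagging hardware numbers quoted earlier in the article.
hardware = [
    ("no_thinking", 2.19, 47.62),
    ("thinking", 19.58, 90.48),
    ("direct", 6.85, 95.24),
]
for name, lat, succ in pareto_frontier(hardware):
    print(f"{name}: {succ:.2f}% at {lat:.2f}s")
```

On these numbers the always-Thinking configuration drops off the frontier, since the routed system is both faster and more successful. That is the kind of statement a single best-score leaderboard cannot express.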
The forward-looking take: agent systems in 2026 will not be one giant brain. They will be schedulers wrapped around multiple imperfect brains, validators, tools, memory policies, and escalation paths. DIRECT is valuable because it makes that architecture visible in a domain where bad routing has nowhere to hide. The robot arm either moves or it waits. Software agents should be held to the same standard.
Sources: arXiv, DIRECT project page, DIRECT interactive demos, FrugalGPT routing context