LocateAnything Fixes a Small-Looking VLM Bottleneck That Breaks Real Agents

Visual grounding is the kind of model capability that looks like plumbing right up until an agent clicks the wrong thing. A coding assistant can recover from a bad suggestion. A screen-operating agent that taps the wrong delete button, selects the wrong invoice total, or drags the wrong bounding box has a different failure mode: it turns perception error into action.

NVIDIA Research’s LocateAnything is interesting because it attacks that unglamorous bottleneck directly. The framework replaces the usual coordinate-token approach for boxes and points with Parallel Box Decoding, a method that predicts a whole box or point as an atomic unit instead of serializing coordinates token by token. That sounds small. For agents that operate in pixels — GUIs, documents, robotics, OCR workflows, and dense visual scenes — it is not small at all.

A box is not four unrelated words

Most vision-language grounding systems have inherited a language-shaped output format. The model emits coordinates as text-like sequences: x1, y1, x2, y2, each step conditioned on the previous tokens. That is convenient because it fits the decoder machinery, but it is awkward for geometry. A bounding box is a coupled spatial object. The coordinates constrain each other. Serializing them as text tokens, one decoding step per coordinate, adds latency and creates avoidable error modes such as malformed coordinate sequences.

LocateAnything’s Parallel Box Decoding treats each box or point as a constant-length atomic unit. The model predicts the full coordinate set, such as (x1, y1, x2, y2), in one parallel step. It is built on a Moon-ViT vision encoder and a Qwen2.5 language decoder connected by an MLP projector. The framework offers Fast Mode, Slow Mode, and Hybrid Mode. Hybrid Mode uses fast decoding by default, then falls back to autoregressive re-decoding when format irregularity or spatial ambiguity appears.

That hybrid design is the production-minded part. Pure speed is attractive until it silently degrades on the cases where precision matters. Pure autoregressive decoding is safer but expensive. Confidence-triggered escalation is how real agent systems should behave: take the cheap path when the output is clean, pay the tax when ambiguity demands it.
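The escalation logic can be sketched in a few lines. This is a minimal illustration, not NVIDIA's implementation: the `Decoded` record, the `fast` and `slow` callables, the normalized coordinate range, and the `min_confidence` threshold are all hypothetical stand-ins, and a simple confidence score stands in for whatever ambiguity signal the real system uses.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# (x1, y1, x2, y2) in normalized [0, 1] image coordinates (an assumption of this sketch).
Box = Tuple[float, float, float, float]


@dataclass
class Decoded:
    box: Optional[Box]  # None when the decoder produced nothing parseable
    confidence: float   # hypothetical score in [0, 1]
    mode: str           # "fast" or "slow"


def is_well_formed(box: Optional[Box]) -> bool:
    """Return True for four in-range coordinates with positive width and height."""
    if box is None or len(box) != 4:
        return False
    x1, y1, x2, y2 = box
    in_range = all(0.0 <= v <= 1.0 for v in box)
    return in_range and x1 < x2 and y1 < y2


def hybrid_decode(
    query: str,
    fast: Callable[[str], Decoded],
    slow: Callable[[str], Decoded],
    min_confidence: float = 0.5,  # illustrative threshold, not a published value
) -> Decoded:
    """Use the cheap path when its output is clean; otherwise re-decode slowly."""
    first = fast(query)
    if is_well_formed(first.box) and first.confidence >= min_confidence:
        return first
    return slow(query)
```

The useful property is that the returned record carries its `mode`, so downstream logging can count how often the expensive path was actually needed.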

The reported throughput numbers are substantial. LocateAnything reaches 12.7 boxes per second (BPS) on a single NVIDIA H100 in Hybrid Mode, more than 10 times the 1.1 BPS of textual Qwen3-VL and about 2.5 times the 5.0 BPS of quantized Rex-Omni. As the number of target boxes grows from 20 to 300, Parallel Box Decoding shows 2x to 6x speedups. In agent loops, lower latency is not just a nicer benchmark line. It changes how often the system can observe, verify, retry, and recover before the user gives up or the environment changes.
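To make those ratios concrete, here is a back-of-the-envelope calculation using only the throughput figures quoted above. The 20-box screen is an illustrative choice, and the sketch assumes each model's rate stays constant regardless of box count, which is a simplification given that the reported speedup itself varies with the number of boxes.

```python
# Throughput figures as quoted in the text (boxes per second on a single H100).
REPORTED_BPS = {
    "LocateAnything (Hybrid Mode)": 12.7,
    "Rex-Omni (quantized)": 5.0,
    "Qwen3-VL (textual)": 1.1,
}


def speedup(ours_bps: float, baseline_bps: float) -> float:
    """How many times faster `ours_bps` is than `baseline_bps`."""
    return ours_bps / baseline_bps


def seconds_for(num_boxes: int, bps: float) -> float:
    """Time to decode `num_boxes` at a constant rate (a simplifying assumption)."""
    return num_boxes / bps


def report(num_boxes: int = 20) -> None:
    """Print per-model decode time and LocateAnything's relative speedup."""
    ours = REPORTED_BPS["LocateAnything (Hybrid Mode)"]
    for name, bps in REPORTED_BPS.items():
        print(
            f"{name}: {seconds_for(num_boxes, bps):.1f} s for {num_boxes} boxes, "
            f"LocateAnything is {speedup(ours, bps):.1f}x this rate"
        )


if __name__ == "__main__":
    report()
```

Under that assumption, a 20-box screen takes about 1.6 seconds at 12.7 BPS and about 18 seconds at 1.1 BPS, which is the difference between an agent that can re-observe after every action and one that cannot.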

The data mix points beyond object detection

LocateAnything-Data includes 12 million unique images, more than 138 million language queries or training samples, and 785 million boxes. The dataset mix is revealing: 66.9% of queries and 83.1% of boxes are general object detection, but the rest reaches into the areas agents actually need. Of the remaining queries, GUI element grounding accounts for 16.5% of the total, referring expression comprehension for 7.3%, OCR and text localization for 3.6%, layout grounding for 3.5%, and point localization for 2.2%.

That breadth matters because the next useful visual agent is not just recognizing “a bicycle” in an image. It needs to find the disabled-looking billing toggle, the third subtotal in a PDF, the tiny close icon in a modal, the selected row in a data grid, or the object the user described indirectly. GUI grounding, layout, OCR, referring expressions, and dense detection are not separate product problems when the agent’s job is to act in a mixed visual workspace.

The reported results are similarly broad. LocateAnything improves mean F1 by 3.8% on LVIS and 1.8% on COCO over Rex-Omni at identical model size. At IoU=0.95 on LVIS, it reports 31.1 versus 20.7, which is the kind of high-precision localization improvement that matters when the click target is small. It reaches 58.7 mean F1 on Dense200 and 39.9 on VisDrone, compared with Rex-Omni’s 58.3 and 35.8. For GUI grounding, it reports 60.3 mean F1 on ScreenSpot-Pro, ahead of Qwen3-VL-30B-A3B and GUI-Owl-32B according to the project page. For document-style tasks, it reports 76.8 on DocLayNet, 70.1 on M6Doc, and 43.3 on TotalText OCR.
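It helps to see how strict IoU=0.95 is on a small target. The sketch below uses the standard intersection-over-union formula for axis-aligned boxes; the 40-pixel icon and the shifted predictions are made-up examples, not data from the paper.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Overlap width and height, clamped at zero when the boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


# A hypothetical 40x40 px icon and two predictions shifted to the right.
icon = (100, 100, 140, 140)
off_by_one = (101, 100, 141, 140)
off_by_two = (102, 100, 142, 140)

print(round(iou(icon, off_by_one), 3))  # 0.951, barely passes IoU=0.95
print(round(iou(icon, off_by_two), 3))  # 0.905, fails IoU=0.95
```

On a 40-pixel icon, a two-pixel horizontal shift is already enough to miss the 0.95 threshold, which is why gains at that threshold say more about small click targets than gains at looser thresholds do.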

The ablation is also useful. PBD Slow Mode reaches 52.1 F1 on COCO. Hybrid Mode keeps 51.6 F1 while preserving most speed gains at 13.2 BPS. That is the tradeoff practitioners should care about: not maximum speed in isolation, and not maximum accuracy with unusable latency, but a mode that holds quality while making the loop cheaper.

Builders should validate it where their UI actually fails

There are caveats. NVIDIA’s page is a strong primary research source, but teams should not assume a reported grounding suite equals production fit. GUI agents fail in messy enterprise apps, remote desktops, custom design systems, browser zoom states, dark mode, virtualization artifacts, canvas UIs, weird PDF renderers, and mobile layouts that compress labels into icons. If you are replacing or augmenting a detector, run your own screens, documents, and failure cases through it before changing the stack.
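One way to run that check is to tag each of your own failure cases by condition and measure click hit rate per tag. The sketch below is a hypothetical harness, not part of LocateAnything: the tag names, the `(tag, predicted_point, target_box)` case format, and the point-in-box success criterion are all assumptions you would adapt to your own labeling.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels
Case = Tuple[str, Point, Box]            # (condition tag, predicted click, labeled target)


def point_in_box(point: Point, box: Box) -> bool:
    """True when the predicted click lands inside the labeled target (edges included)."""
    x, y = point
    x1, y1, x2, y2 = box
    return x1 <= x <= x2 and y1 <= y <= y2


def hit_rate_by_tag(cases: Iterable[Case]) -> Dict[str, float]:
    """Fraction of clicks that land on target, grouped by condition tag."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for tag, point, box in cases:
        totals[tag] += 1
        hits[tag] += int(point_in_box(point, box))
    return {tag: hits[tag] / totals[tag] for tag in totals}


# Made-up cases standing in for your own labeled screenshots.
cases = [
    ("dark-mode", (10, 10), (0, 0, 20, 20)),
    ("dark-mode", (30, 10), (0, 0, 20, 20)),
    ("canvas-ui", (5, 5), (0, 0, 10, 10)),
]
print(hit_rate_by_tag(cases))
```

Grouping by condition matters more than the aggregate number: a model that looks fine overall can still fail most of the time on the one rendering mode your users actually hit.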

The larger lesson is architectural. Agent teams should separate observation, grounding, action, and verification. Do not let the same model casually describe the screen, infer intent, choose a target, and execute a click without intermediate artifacts. Grounding outputs should be inspectable: box, confidence, source query, fallback mode, screenshot hash, and post-action verification. If a system cannot show why it clicked somewhere, it will be miserable to debug and impossible to trust at scale.
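As an illustration of what an inspectable grounding artifact could look like, here is a minimal record that carries the fields listed above. The class name, field names, and JSON log format are hypothetical choices for this sketch, not an interface defined by LocateAnything.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class GroundingRecord:
    query: str                       # the source query that produced the box
    box: Tuple[int, int, int, int]   # (x1, y1, x2, y2) in screenshot pixels
    confidence: float                # hypothetical score in [0, 1]
    decode_mode: str                 # e.g. "fast" or "slow" (fallback taken)
    screenshot_sha256: str           # ties the box to the exact frame it came from
    verified: Optional[bool] = None  # filled in after post-action verification


def make_record(query, box, confidence, decode_mode, screenshot_bytes) -> GroundingRecord:
    """Build a record, hashing the raw screenshot bytes for later audit."""
    digest = hashlib.sha256(screenshot_bytes).hexdigest()
    return GroundingRecord(query, tuple(box), confidence, decode_mode, digest)


def click_point(record: GroundingRecord) -> Tuple[int, int]:
    """Center of the box, a common choice of click target."""
    x1, y1, x2, y2 = record.box
    return ((x1 + x2) // 2, (y1 + y2) // 2)


def to_log_line(record: GroundingRecord) -> str:
    """Serialize the record as one JSON line for an audit log."""
    return json.dumps(asdict(record), sort_keys=True)


record = make_record("close icon in modal", (100, 100, 140, 140), 0.93, "fast", b"fake screenshot bytes")
print(click_point(record))  # (120, 120)
# After acting, store the verification result without mutating the original record.
print(to_log_line(replace(record, verified=True)))
```

Keeping the record immutable and writing a new one after verification means the log shows both what the agent believed before the click and what it observed afterward.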

LocateAnything also nudges the industry away from text-maximalism. Not every problem should be squeezed through token generation. When the output object has structure — boxes, points, trajectories, tables, plans, typed tool calls — the model interface should respect that structure. Parallel Box Decoding is one example. The broader pattern is what matters: align decoding with the artifact, not with whatever was easiest to bolt onto a language model.

Community reaction is still quiet. Hacker News had no exact hits for “LocateAnything” or “Parallel Box Decoding” during research, and broader searches were polluted by unrelated tracker products. That is normal for method papers. The impact will show up if GUI-agent, OCR, document, robotics, and labeling teams can reproduce the latency and precision gains in their own domains.

The LGTM take: if your agent’s world is pixels, box decoding is not a footnote. It is the difference between observing and operating.

Sources: NVIDIA Research, arXiv, arXiv HTML, Hugging Face Papers