
Ai2’s MolmoAct 2 Is the More Useful Robotics Model Story Because You Can Actually Inspect It

Anatoliy Kolodkin

07 May 2026 • 5 min read

The most useful robotics model release this week is not the one with the glossiest demo. It is the one you can actually inspect.

Ai2’s MolmoAct 2 is not as cinematic as a humanoid hand cracking eggs or playing piano, but it ships the pieces practitioners need: model weights, datasets, an action tokenizer, benchmark numbers, architectural details, and enough limitations to make the release falsifiable. In robotics foundation models, that is the difference between a launch and a contribution.

The headline claim is solid: MolmoAct 2 outperforms Physical Intelligence’s π0.5 across several reported benchmarks while running far faster than the original MolmoAct. Ai2 says inference latency on LIBERO, using one NVIDIA H100, drops from 6,700 milliseconds per action call to about 180 milliseconds for base MolmoAct 2 (roughly 37× faster), or about 790 milliseconds for MolmoAct 2 with adaptive depth reasoning (roughly 8.5× faster). That is not a footnote. It changes what kind of robot behavior is even plausible.

A robot that waits 6.7 seconds between action decisions is not really operating in a live scene. It is doing batch planning with motors attached. A robot that responds in sub-second time is still not necessarily reliable, but it can participate in a control loop with moving objects, shifted placements, distractors, and rephrased instructions. Latency is where robotics demos quietly become products or quietly stay demos.
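To make the latency argument concrete, here is a minimal sketch that converts the reported per-call latencies into an upper bound on decision rate. The three latency figures come from the numbers quoted above; the function names are illustrative, and the calculation assumes calls run back to back with no other overhead.

```python
# Latency per action call in milliseconds, as reported by Ai2 in the text above.
LATENCY_MS = {
    "MolmoAct (original)": 6700,
    "MolmoAct 2 (base)": 180,
    "MolmoAct 2 (adaptive depth reasoning)": 790,
}


def max_decision_rate_hz(latency_ms: float) -> float:
    """Upper bound on action calls per second if calls run back to back."""
    return 1000.0 / latency_ms


def speedup(old_ms: float, new_ms: float) -> float:
    """How many times faster the new latency is than the old one."""
    return old_ms / new_ms


baseline = LATENCY_MS["MolmoAct (original)"]
for name, ms in LATENCY_MS.items():
    rate = max_decision_rate_hz(ms)
    print(f"{name}: {rate:.2f} calls/s, {speedup(baseline, ms):.1f}x vs original")
```

If a single call returns a chunk of several actions, the effective control rate is higher than this bound suggests; the sketch deliberately ignores that and any sensor or actuation delay.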

Open artifacts beat closed choreography

MolmoAct 2 is built on Molmo 2-ER, an embodied-reasoning variant trained on roughly 3 million additional examples spanning pointing, object detection, spatial reasoning, multi-image reasoning, and image/video spatial question answering. Ai2 reports Molmo 2-ER averaging 63.8 out of 100 across 13 embodied-reasoning benchmarks, ahead of GPT-5, Gemini 2.5 Pro, Qwen3-VL-8B, and GR-ER 1.5 in Ai2’s own evaluation.

The action layer pairs that vision-language backbone with a dedicated action expert using flow matching and a KV-cache bridge to the VLM. MolmoAct 2-Think adds depth perception tokens, while adaptive depth routing avoids full dense prediction and gives a reported 17% speedup compared with full depth-token prediction. Ai2 also released MolmoAct 2-FAST Tokenizer, an open action tokenizer positioned as a fully open alternative to Physical Intelligence’s FAST tokenizer, including training data.
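For readers unfamiliar with flow matching, the sketch below shows the general sampling idea only: start from noise and integrate a velocity field toward an action. It is not Ai2's implementation. The velocity function here is a hand-written toy that points at one fixed target (a learned network conditioned on the VLM would take its place), and the 7-dimensional action vector is a hypothetical example.

```python
import random
from typing import List


def toy_velocity(x: List[float], t: float, target: List[float]) -> List[float]:
    """Velocity of the straight-line path from noise to a single target action.

    Hypothetical stand-in for a learned action expert.
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def sample_action(target: List[float], steps: int = 10, seed: int = 0) -> List[float]:
    """Euler-integrate the velocity field from t=0 (noise) to t=1 (action)."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # start from Gaussian noise
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t stays below 1, so the division in toy_velocity is safe
        v = toy_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Hypothetical 7-DoF action (e.g. six pose deltas plus a gripper command).
target_action = [0.1, -0.2, 0.05, 0.0, 0.3, -0.1, 1.0]
print(sample_action(target_action))
```

The number of integration steps is one of the knobs that trades action quality against latency in this family of methods, which is part of why the latency figures above matter.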

That last part matters more than it sounds. Robotics research has been sliding toward a world where the best demos are closed, the data is private, the hardware assumptions are hidden, and the benchmark methodology is whatever the launch blog says it is. Open weights and open-ish datasets do not make a result automatically trustworthy, but they make it arguable. Other labs can test, fine-tune, inspect failure modes, and discover whether the reported gains survive contact with their hardware.

The dataset story is especially important. MolmoAct 2-Bimanual YAM includes more than 700 hours (Ai2 describes it as 720-plus hours in some materials) of bimanual tabletop manipulation demonstrations covering tasks like towel folding, grocery scanning, smartphone charging, and table bussing. The original MolmoAct used 22 hours of curated in-house data, roughly 10,600 trajectories collected over three months. Moving to a dataset more than 30× larger by hours (700-plus versus 22) is the kind of unglamorous scaling that robotics actually needs.

Ai2 also re-annotated robot demonstrations, increasing unique labels from about 71,000 to 146,000. That sounds like dataset plumbing until you remember that language quality is part of generalization. A robot policy trained on impoverished instructions learns a narrow interface. A policy trained on richer, more varied labels has a better chance of responding to how humans actually ask for things, which is almost never in benchmark-perfect phrasing.

The numbers are good, and still not deployment numbers

Ai2’s reported benchmark results are credible enough to deserve attention and incomplete enough to deserve skepticism. In simulation, MolmoAct 2 scores 20.6% average success on MolmoBot versus 10.3% for π0.5. On RoboEval, it scores 0.443 versus 0.405 for π0.5. In real-world zero-shot Franka tests, Ai2 reports 87.1% average success across tasks, compared with 48.4% for MolmoBot and 45.2% for π0.5. Individual task results include 100% on apple-on-plate, 86.7% on pipette-in-tray, 93.3% on red-cube-in-tape-roll, 93.3% on knife-in-box, and 62% for a longer-horizon multi-object bowl task.
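A quick arithmetic check helps when reading numbers like these. The sketch below averages the five per-task results listed above and computes relative gains over the baselines. It assumes those five tasks are the full set behind the 87.1% average, which the text does not state explicitly; the variable and function names are illustrative.

```python
# Per-task real-world success rates (%) as listed in the text above.
per_task_success = {
    "apple-on-plate": 100.0,
    "pipette-in-tray": 86.7,
    "red-cube-in-tape-roll": 93.3,
    "knife-in-box": 93.3,
    "multi-object bowl": 62.0,
}

# Unweighted mean across the listed tasks.
mean_success = sum(per_task_success.values()) / len(per_task_success)


def relative_gain(new: float, baseline: float) -> float:
    """Ratio of the new score to the baseline score."""
    return new / baseline


print(f"mean of listed tasks: {mean_success:.1f}%")
print(f"real-world vs pi0.5: {relative_gain(87.1, 45.2):.2f}x")
print(f"simulation vs pi0.5: {relative_gain(20.6, 10.3):.2f}x")
print(f"RoboEval vs pi0.5: {relative_gain(0.443, 0.405):.2f}x")
```

The spread matters as much as the mean: the longest-horizon task sits about 25 points below the average, and the RoboEval margin is far smaller than the real-world one.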

Those are meaningful gains. They are also not an invitation to put the system into an unsupervised kitchen, warehouse, wet lab, or elder-care facility. A 20.6% simulation success rate that doubles a baseline is still 20.6%. An 87.1% average across selected real-world tasks is promising, but the failures are what determine whether a system can be trusted. Robotics benchmarks are often least informative exactly where operational risk is highest: edge cases, messy objects, sensor occlusion, human interruption, and tasks that require recovery after partial failure.
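One way to keep a headline percentage in perspective is to put an interval around it. The sketch below computes a Wilson score interval for a binomial success rate. The trial count is an assumption: 86.7% and 93.3% are consistent with 13 and 14 successes out of 15, but the source does not state how many trials were run per task.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical trial counts: 15 trials per task is an assumption, not a reported figure.
for successes in (13, 14, 15):
    low, high = wilson_interval(successes, 15)
    print(f"{successes}/15: {low:.3f} to {high:.3f}")
```

With only a handful of trials per task, even a perfect score leaves a wide range of plausible true success rates, which is one reason independent replication on more trials matters.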

Ai2 deserves credit for naming limitations rather than pretending the release is magic. The team points to gripper occlusion, control-system speed mismatches, fine-grained manipulation failures, and depth-axis errors from 2D visual traces. That honesty is useful for practitioners because it maps directly to integration work. If your application depends on precise depth estimation, small-object manipulation, or recovery from occluded hands, MolmoAct 2’s strengths may not transfer cleanly.

The third-party signal is also worth watching. Ai2 says MolmoAct 2 scored 0.51 on Cortex AI’s benchmark, ahead of OpenVLA-OFT at 0.36, π0.5 at 0.32, Cosmos Policy at 0.16, and X-VLA at 0.05, ranking first on seven of eight tasks. Independent replication will matter more than launch-week numbers, but third-party benchmark movement is healthier than demo-only progress.

What builders should actually do with this

For robotics teams, MolmoAct 2 is best treated as a reference architecture, not a product you drop into production. Study the split between embodied reasoning backbone, action expert, tokenizer, depth tokens, adaptive routing, and post-training. That decomposition is the useful pattern. It suggests robotics foundation models are moving away from monolithic “vision in, action out” policies toward modular stacks where perception, language grounding, action generation, and control timing can be tuned separately.

If you are building in a regulated physical workflow — lab automation, healthcare-adjacent operations, manufacturing QA, or food handling — the first question is not whether MolmoAct 2 beats π0.5. It is whether the failure modes are observable and recoverable in your environment. Can the system tell when it is uncertain? Can a human pause or correct it? Can you log the perception trace, instruction, action sequence, and outcome? Can you replay failures? Open artifacts help here because they let teams build evaluation harnesses instead of trusting vendor demos.
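The logging and replay questions above can be sketched as a small episode log. This is a minimal illustration, not part of any MolmoAct 2 tooling: the `EpisodeRecord` fields, the outcome labels, and the JSON Lines file layout are all hypothetical choices.

```python
import json
import time
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class EpisodeRecord:
    """One logged robot episode (all field names are hypothetical)."""
    instruction: str                 # natural-language instruction given
    perception_refs: List[str]       # paths or IDs of stored camera frames
    actions: List[List[float]]       # action vectors sent to the robot
    outcome: str                     # e.g. "success", "failure", "aborted"
    operator_intervened: bool = False
    notes: str = ""
    timestamp: float = field(default_factory=time.time)


def append_record(path: str, record: EpisodeRecord) -> None:
    """Append one episode as a single JSON line."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(record)) + "\n")


def load_failures(path: str) -> List[EpisodeRecord]:
    """Return every episode whose outcome is not 'success', for replay."""
    failures = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                record = EpisodeRecord(**json.loads(line))
                if record.outcome != "success":
                    failures.append(record)
    return failures
```

An append-only log like this keeps failures queryable after the fact; a real deployment would also need to store the referenced frames and any uncertainty signal the policy exposes.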

The open action tokenizer may become one of the more durable parts of the release. Tokenization sounds boring until a field needs interoperability. Text models benefited from shared abstractions around tokens, datasets, and benchmarks. Robotics still lacks equivalent infrastructure because action spaces are hardware-specific and demonstrations are expensive. A usable open tokenizer, paired with open datasets, gives researchers a shared target for comparison. That is how an ecosystem starts getting less bespoke.

Community reaction so far looks appropriately modest. Hugging Face showed early engagement on the MolmoAct 2 paper and Molmo 2-ER model collection, but searches did not turn up a major high-signal Hacker News or Reddit thread at research time. That is fine. Robotics foundation models should be judged less by launch-week discourse and more by whether labs actually fork, fine-tune, evaluate, and break them over the next six months.

My read: Genesis has the better video; Ai2 has the better artifact. The future of robotics foundation models will need both impressive systems integration and inspectable research substrate, but practitioners should overweight inspectability right now. Closed demos tell you what a company can choreograph. Open weights, datasets, tokenizer details, latency numbers, and reproducible benchmarks tell you what the field can build on.

MolmoAct 2 does not mean robots are solved. It means open robotics models are getting fast enough, data-rich enough, and architecturally specific enough to become serious engineering inputs. That is less glamorous than “human-level robot brain.” It is also much more useful.

Sources: SiliconANGLE, Ai2, MolmoAct 2 models on Hugging Face, MolmoAct 2 datasets on Hugging Face, The AI Economy
