ai-models

Claw-Anything Shows Always-On Assistants Are Failing the Context Test

Anatoliy Kolodkin

26 May 2026 • 5 min read

Always-on assistants keep failing the same quiet test: they are useful only if they understand context you did not paste into the chat box. Claw-Anything, a new benchmark from LiberCoders, is valuable because it makes that failure measurable. It gives agents a simulated user world with long-horizon activity, interconnected services, and GUI plus CLI interaction across devices — then asks them to do work inside the mess.

The result is appropriately humbling. GPT-5.5 reaches 34.5% pass@1. Claude Opus 4.7 reaches 31.8. Claude Sonnet 4.5 reaches 28.0. GLM-5.1, an open-source baseline, reaches 31.7. Those numbers are not an indictment of one model. They are a better description of the product category. Frontier reasoning still struggles when the task stops being a clean tool call and starts resembling someone’s actual digital life.

The missing context is the product

Claw-Anything expands agent evaluation along three axes: long-horizon event streams, interconnected services, and cross-device GUI+CLI interaction. The official repo says it includes 200 human-verified evaluation tasks and 2,000 training environments. Its mock environment includes 35 FastAPI services, including Gmail, Calendar, Slack, Notion, Feishu, WeChat, and Zotero-style services. Tasks are split into skill tasks, where the agent dynamically loads tools; tool tasks, where the full tool set is preloaded; and gui tasks, involving Android GUI interaction.

That shape matters because most agent benchmarks are too polite. They hand the model a narrow slice of state, expose a small tool set, and ask for a deterministic action. Real users do not live that way. Their work is scattered across calendars, email threads, Slack channels, documents, browsers, phones, terminals, tickets, local files, and old decisions that nobody wrote down properly. If an assistant cannot reconcile that world, it is not an assistant. It is a command palette with a personality.

The benchmark’s context scale is the blunt-force part. Claw-Anything lists 191.7k words of context, compared with 12.1k for QwenClawBench and 5.3k for Claw-Eval. It averages 10.1 services per task, maxing out at 18, while compared benchmarks sit between 0.1 and 3.9 average services. That is the important move. Long context is not useful because it lets you paste a novella into a prompt. It is useful if the model can find the relevant thread, ignore stale noise, reconcile conflicting events, and act through the right interface without inventing state.

Capability and governance are now the same conversation

Claw-Anything is also a governance benchmark whether it says so or not. Broader access is not free. If an assistant can see Gmail, Calendar, Slack, Notion, mobile GUI state, and CLI tools, the product needs scoped permissions, approval policies, audit logs, sandboxing, secret handling, retention limits, revocation, and recovery workflows. The model-capability question — can it act? — cannot be separated from the operating question — should it act, who authorized it, and how do we unwind the result?

The runner’s design points in that direction. It uses a Think → Act → Observe loop with OpenAI-compatible backends and per-trial Docker sandboxing with port isolation. Graders score completion, robustness, communication, and safety. That is closer to the evaluation surface serious agent products need. A pass/fail result is useful, but not sufficient. Teams should also count unsafe actions, unnecessary tool calls, excessive retries, token burn, hidden state assumptions, permission boundary violations, and recovery behavior after a bad observation.

The skill/tool split is especially useful. Preloading every tool may improve immediate recall, but it bloats the prompt surface, increases the chance of irrelevant tool use, and expands the permission blast radius. Dynamic skill loading is closer to how governed assistants should behave, but it creates discovery, routing, and provenance problems. That is where MCP servers, skill registries, package signing, policy engines, and audit trails meet model quality. Claw-Anything gives teams a way to argue about that tradeoff with numbers instead of architecture vibes.

The benchmark also exposes a cost reality. GPT-5.5’s main run used 77.7M input tokens and 0.9M output tokens. That is not a rounding error; it is an operating expense and a latency signal. Agent evaluation has to include token accounting because assistants that need to reread a simulated life for every action may be accurate enough in a lab and too expensive in production. Retrieval, memory compaction, state indexes, permission-aware context assembly, and tool-observation hygiene become systems requirements, not optimizations for later.

The fine-tuning result is the commercial tell

The most commercially interesting result may not be GPT-5.5 at 34.5 pass@1. It may be Claw-Anything-Qwen3.5-27B. Fine-tuning Qwen3.5-27B on generated environments pushes it from 9.8 to 33.5 pass@1, a +23.7 gain that brings it close to GPT-5.5 on this benchmark and ahead of several larger or closed baselines. That does not mean a fine-tuned open model is suddenly a better personal assistant everywhere. It means environment-shaped data matters.

For enterprises, that is the actionable point. A generic frontier model may be strong at reasoning, but your assistant lives inside your workflows, permissions, data silos, naming conventions, UI quirks, and failure modes. Synthetic-but-executable environments can teach those patterns more directly than another general benchmark chase. The right test world includes months of state, noisy irrelevant activity, conflicting meetings, stale documents, duplicated contacts, renamed projects, broken tool calls, and user preferences that are implied rather than neatly declared. If your eval world is clean, your production assistant will be surprised by dirt.

Community attention is still early. Hugging Face showed 11 upvotes during research, Hacker News had no visible hits for “Claw-Anything,” and GitHub interest moved from 4 stars in HF metadata to 13 stars via API during the research window. That modest reception is not a problem. Benchmarks like this often become useful before they become popular, because the teams who need them are busy discovering that single-tool demos do not survive contact with real workflows.

For builders, the next step is straightforward and annoying: stop evaluating agents on isolated tasks. Build fake worlds. Seed weeks or months of state. Add irrelevant messages. Create conflicting calendar events. Require cross-service actions. Include GUI tasks, not only API calls. Measure pass@1 and pass@k, but also cost, unsafe actions, retries, recovery, permissions, and auditability. Test dynamic skill loading separately from preloaded tool use. If the assistant cannot handle a fake messy life, it is not ready for a real one.

Claw-Anything’s core message is not that models are weak. It is that the assistant promise is bigger than the benchmark set we have been using. Tool calling was the first milestone. Context survival is the next one. The products that win will not merely connect more apps to a chat box. They will build governed, observable systems that can navigate a user’s world without pretending access and understanding are the same thing.

Sources: Hugging Face Papers, arXiv, GitHub, Hugging Face dataset

The missing context is the product

Capability and governance are now the same conversation

The fine-tuning result is the commercial tell

Sign up for more like this.