
RealClawBench Says Agent Benchmarks Are Still Too Clean

Anatoliy Kolodkin

03 Jun 2026 • 4 min read

Most coding-agent benchmarks still have the same quiet flaw: they are too tidy. They ask models to solve tasks that look like engineering work, but often strip away the parts that make engineering work irritating — half-stated user intent, local files, environment assumptions, command output, artifacts, and the long tail of “please fix the thing you just broke.” RealClawBench is interesting because it starts from the mess instead of sanding it off.

The new arXiv paper builds a benchmark from deployed OpenClaw developer-agent sessions rather than public issue queues or researcher-authored prompts. That sounds like a data-source detail. It is not. It changes the benchmark from “can this model solve a clean task?” to “can this agent operate inside the kind of workspace developers actually hand to automation?” Those are different questions, and the second one is the one teams should care about before they give an agent write access to a repo.

The headline result is not flattering, which is why it is useful. Across 281 released executable tasks, the best model in the paper, Claude Opus 4.7, solves 65.8% of tasks by sample success. GPT-5.5 is close at 65.0%, MiMo V2.5 Pro reaches 60.1%, DeepSeek V4 Pro hits 59.8%, Kimi K2.6 is reported at 57.7%, and GLM 5.1 lands at 57.4%. In other words: strong models are strong, but even the leaders still fail roughly one-third of realistic developer-agent work.

Real sessions make the leaderboard less comfortable

The construction pipeline is the story. RealClawBench starts from 450,766 raw calls, folds them into 110,170 sessions, filters to 76,155 tool-use sessions, cleans that down to 40,713 sessions, then narrows again to a 6,995-item high-quality pool, 5,260 scorable items, 414 candidates, and finally 281 released tasks. That amount of filtering is not bureaucratic overhead. It is the work required to turn real agent traces into tasks that can be executed, verified, and shared without simply leaking someone’s workspace.
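To see where the funnel actually narrows, it helps to compute how much survives each step. The sketch below uses only the counts quoted above; the stage labels are shorthand for this example, not the paper's terminology. Note that the first step groups calls into sessions, so its ratio is an aggregation factor rather than a filter rate.

```python
# Counts are the ones quoted in the article; stage labels are shorthand
# chosen for this sketch, not names taken from the paper.
FUNNEL = [
    ("raw calls", 450_766),
    ("sessions", 110_170),
    ("tool-use sessions", 76_155),
    ("cleaned sessions", 40_713),
    ("high-quality pool", 6_995),
    ("scorable items", 5_260),
    ("candidates", 414),
    ("released tasks", 281),
]


def step_ratios(funnel):
    """Return (stage, count, count / previous count) for each stage.

    The first stage has no predecessor, so its ratio is None.
    """
    rows = []
    previous = None
    for name, count in funnel:
        ratio = None if previous is None else count / previous
        rows.append((name, count, ratio))
        previous = count
    return rows


def overall_retention(funnel, start_stage, end_stage):
    """Fraction of `start_stage` items that survive to `end_stage`."""
    counts = dict(funnel)
    return counts[end_stage] / counts[start_stage]


if __name__ == "__main__":
    for name, count, ratio in step_ratios(FUNNEL):
        shown = "n/a" if ratio is None else f"{ratio:.2%}"
        print(f"{name:<20} {count:>8,}  kept vs previous: {shown}")
    kept = overall_retention(FUNNEL, "sessions", "released tasks")
    print(f"sessions -> released tasks: {kept:.3%}")
```

Run as a script, this prints one row per stage. The steepest cuts are the ones into the high-quality pool and from scorable items to candidates, which is where the executability and shareability requirements bite.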

The paper compares benchmark “realness” against a reference distribution of 76,155 deployed OpenClaw tool-use sessions, where a lower score means a closer match. RealClawBench reports a Jensen-Shannon divergence of 0.146 and a total variation of 0.271, while SWE-bench comes in at 0.615 / 0.804, Terminal-Bench 2.0 at 0.905 / 0.971, and WebArena and OSWorld at 1.000 / 1.000, which means essentially no measured overlap with that distribution. Metrics like JSD and TV will not make anyone’s product demo sparkle, but they answer the question that matters: does this benchmark resemble the thing it claims to measure?
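For readers who have not used these two metrics, here is a minimal sketch of both over categorical distributions. It assumes base-2 logarithms, which bounds Jensen-Shannon divergence at 1 and is consistent with the 1.000 scores quoted above. How the paper buckets sessions into categories is not described here, so the category names in the example are invented.

```python
import math


def total_variation(p, q):
    """Total variation distance: half the L1 distance between p and q."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def _kl_base2(p, m):
    """KL(p || m) in bits; terms with zero probability in p contribute zero."""
    return sum(pk * math.log2(pk / m[k]) for k, pk in p.items() if pk > 0)


def jensen_shannon(p, q):
    """Jensen-Shannon divergence in bits, bounded to [0, 1]."""
    keys = set(p) | set(q)
    mixture = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}
    return 0.5 * _kl_base2(p, mixture) + 0.5 * _kl_base2(q, mixture)


if __name__ == "__main__":
    # Hypothetical task-type mixes, for illustration only.
    deployed = {"edit_files": 0.5, "run_commands": 0.3, "browse": 0.2}
    benchmark = {"edit_files": 0.9, "run_commands": 0.1}
    print(jensen_shannon(deployed, benchmark), total_variation(deployed, benchmark))
```

Two distributions with no categories in common score exactly 1.0 on both metrics, and identical distributions score 0.0. That is why a 1.000 / 1.000 result reads as "this benchmark measures something else."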

The user-intent signal is similarly telling. RealClawBench scores 6.37 out of 8, compared with 3.85 for SWE-bench, 4.73 for Terminal-Bench 2.0, and 0.95 for GAIA. The benchmark also contains much more local and artifact-dependent work: local signals appear in 38.4% of tasks, artifacts in 96.8%, constraints in 91.8%, and environment dependence in 75.4%. That is exactly where agent products break. Not because the model cannot write a function in isolation, but because it must infer the surrounding contract.

The use of deterministic verifiers is the other important design choice. RealClawBench evaluates final workspace state, stdout, and bounded subprocess behavior rather than asking another model to judge whether the agent sounded correct. That matters because LLM-as-judge is especially dangerous for coding-agent evals: it rewards plausible explanations and can miss broken diffs, hidden environment failures, or tool-use shortcuts. Production teams should steal this idea directly. If your agent workflow cannot define an acceptance condition outside the model’s own prose, it is not ready to be automated.
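A deterministic acceptance check can be small. The sketch below is not RealClawBench's verifier. It is a hypothetical `verify` function that applies the same three ideas: run a command under a timeout, compare stdout, and compare the final contents of named files. The function name, parameters, and file layout are all assumptions made for illustration.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def verify(workspace, command, expected_stdout, expected_files, timeout_s=10.0):
    """Return True only if every deterministic check passes.

    Checks, in order: the command finishes within `timeout_s`, exits with
    status 0, prints the expected stdout (ignoring surrounding whitespace),
    and leaves each file in `expected_files` with exactly the given text.
    """
    try:
        result = subprocess.run(
            command,
            cwd=workspace,
            capture_output=True,
            text=True,
            timeout=timeout_s,  # bounded subprocess behavior
        )
    except subprocess.TimeoutExpired:
        return False
    if result.returncode != 0:
        return False
    if result.stdout.strip() != expected_stdout.strip():
        return False
    for relative_path, expected_text in expected_files.items():
        path = Path(workspace) / relative_path
        if not path.is_file() or path.read_text() != expected_text:
            return False
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Stand-in for the state an agent left behind.
        (Path(tmp) / "VERSION").write_text("1.2.0\n")
        ok = verify(
            tmp,
            [sys.executable, "-c", "print(open('VERSION').read().strip())"],
            expected_stdout="1.2.0",
            expected_files={"VERSION": "1.2.0\n"},
        )
        print("accepted" if ok else "rejected")
```

Nothing in this check reads the agent's explanation of what it did. The verdict depends only on what the workspace and the process actually produce.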

The cost table is a routing memo

The ranking is useful, but the cost analysis is where the benchmark becomes operational. The paper reports Claude Opus 4.7 at $46.54 for a full benchmark run, GPT-5.5 at $61.87 while ranking slightly lower, and MiMo V2.5 Pro at $11.87 while sitting close enough to the frontier to be interesting. At the cheap end, Gemma 4 31B and GPT-OSS 120B cost $1.35 and $2.04 respectively, but trail the top systems by more than 20 sample-average points.
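One way to read those numbers together is cost per solved task. The sketch below makes a loud simplifying assumption: that the reported run cost covers one attempt at each of the 281 tasks, and that sample success approximates the fraction of tasks solved. The article does not state how many samples a full run includes, so treat the output as a rough comparison, not a figure from the paper.

```python
TASKS = 281  # released tasks, as quoted in the article

# (full-run cost in USD, sample success rate), as quoted in the article.
REPORTED = {
    "Claude Opus 4.7": (46.54, 0.658),
    "GPT-5.5": (61.87, 0.650),
    "MiMo V2.5 Pro": (11.87, 0.601),
}


def cost_per_solved(run_cost_usd, success_rate, tasks=TASKS):
    """Approximate USD per solved task.

    ASSUMPTION: one attempt per task, so solved ~= success_rate * tasks.
    """
    solved = success_rate * tasks
    if solved <= 0:
        raise ValueError("no solved tasks; cost per solved task is undefined")
    return run_cost_usd / solved


if __name__ == "__main__":
    ranked = sorted(REPORTED.items(), key=lambda item: cost_per_solved(*item[1]))
    for model, (cost, rate) in ranked:
        print(f"{model:<16} ${cost_per_solved(cost, rate):.3f} per solved task")
```

Under that assumption, MiMo V2.5 Pro works out to roughly a quarter of the per-solved-task cost of Claude Opus 4.7, and GPT-5.5 is the most expensive of the three. The ordering matters more than the exact cents.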

The right conclusion is not “buy the cheapest model” or “always use Opus.” It is that model routing should be workload-specific. Broad repo exploration, file summarization, first-pass repair loops, and low-risk scaffolding do not need the most expensive model on every turn. Ambiguous architecture decisions, final review, security-sensitive edits, and tasks where a wrong patch is expensive probably do. If a benchmark does not include cost per completed task, it is only half a benchmark for teams running agents at scale.
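As a rough illustration of workload-specific routing, the sketch below maps the task types named above onto two tiers. The tier names, task labels, and the `wrong_patch_is_expensive` flag are all hypothetical. A real router would also weigh context size, latency, and measured pass rates per task type.

```python
from dataclasses import dataclass

# Hypothetical labels for the workloads described in the paragraph above.
BUDGET_KINDS = {
    "repo_exploration",
    "file_summarization",
    "first_pass_repair",
    "low_risk_scaffolding",
}
FRONTIER_KINDS = {
    "architecture_decision",
    "final_review",
    "security_sensitive_edit",
}


@dataclass(frozen=True)
class Task:
    kind: str
    wrong_patch_is_expensive: bool = False


def route(task):
    """Return "budget" or "frontier" for a task.

    Expensive-to-get-wrong work always goes to the frontier tier. Unknown
    task kinds also default to the frontier tier, on the assumption that
    over-spending is the safer failure mode than a bad patch.
    """
    if task.wrong_patch_is_expensive or task.kind in FRONTIER_KINDS:
        return "frontier"
    if task.kind in BUDGET_KINDS:
        return "budget"
    return "frontier"


if __name__ == "__main__":
    for task in (
        Task("file_summarization"),
        Task("first_pass_repair", wrong_patch_is_expensive=True),
        Task("final_review"),
    ):
        print(task.kind, "->", route(task))
```

The risk flag overrides the task type on purpose: a first-pass repair on a billing path is not the same workload as one on a README.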

The 65.8% ceiling also puts a needed dent in the “agent owns the ticket” narrative. A one-third failure rate does not make coding agents useless. It means they are powerful assistants that still need tests, rollback paths, command policies, diff review, and humans who understand the code. The danger in agent adoption is not the use of agents itself; it is pretending a model that can complete two-thirds of realistic tasks has earned unsupervised authority over the remaining third.

There is also a privacy tradeoff that should not be hand-waved. Real session data makes benchmarks better, but real sessions can contain proprietary code, identifiers, credentials, customer data, and local context that users never expected to become evaluation material. RealClawBench’s approach — sanitized workspaces, reconstructed tasks, and unreleasable artifacts excluded — is the right direction. But benchmark builders should publish privacy review details with the same seriousness they publish leaderboard tables. The future of realistic agent evals cannot be “trust us, we cleaned it.”

For practitioners, the immediate move is clear: build a small internal RealClawBench-shaped eval. Pull 50 to 100 sanitized tasks from your own agent logs. Reconstruct the workspace state. Write deterministic pass/fail checks. Measure pass rate, cost per accepted change, false-done rate, command count, retries, wall time, and reviewer intervention. The result will be uglier than a public leaderboard. Good. Your production environment is uglier too.
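The metrics in that list can be rolled up from per-task run records. The sketch below assumes a hypothetical `RunRecord` shape and two definitions the article does not spell out: an "accepted change" is a run that passes the deterministic check, and the false-done rate is the share of runs where the agent claimed success but the check failed.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunRecord:
    passed: bool               # verdict of the deterministic check
    claimed_done: bool         # the agent reported the task as finished
    cost_usd: float
    commands: int
    retries: int
    wall_time_s: float
    reviewer_intervened: bool


def summarize(runs):
    """Aggregate per-task records into the eval metrics named above."""
    if not runs:
        raise ValueError("need at least one run to summarize")
    accepted = sum(r.passed for r in runs)
    claimed = [r for r in runs if r.claimed_done]
    false_done = sum(not r.passed for r in claimed)
    total_cost = sum(r.cost_usd for r in runs)
    return {
        "pass_rate": accepted / len(runs),
        # ASSUMPTION: accepted change == run that passed the check.
        "cost_per_accepted_change": total_cost / accepted if accepted else float("inf"),
        # ASSUMPTION: measured over runs where the agent claimed success.
        "false_done_rate": false_done / len(claimed) if claimed else 0.0,
        "mean_commands": mean(r.commands for r in runs),
        "mean_retries": mean(r.retries for r in runs),
        "mean_wall_time_s": mean(r.wall_time_s for r in runs),
        "reviewer_intervention_rate": sum(r.reviewer_intervened for r in runs) / len(runs),
    }


if __name__ == "__main__":
    # Made-up records, for illustration only.
    demo = [
        RunRecord(True, True, 1.0, 12, 0, 90.0, False),
        RunRecord(False, True, 2.0, 30, 2, 240.0, True),
        RunRecord(False, False, 0.5, 8, 1, 60.0, True),
        RunRecord(True, True, 0.5, 10, 0, 75.0, False),
    ]
    for name, value in summarize(demo).items():
        print(f"{name}: {value:.3f}")
```

Cost per accepted change divides the total spend, including failed runs, by the number of passes. That is the figure a budget owner feels, and it is usually worse than the per-run average suggests.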

RealClawBench’s most valuable contribution is not crowning a winner. It is making “best coding model” look like the wrong question. The better question is: which model, inside which harness, under which budget, with which verifier, can complete the work your engineers actually delegate? That is less clean than a leaderboard slide. It is also much closer to engineering.

Sources: arXiv, arXiv HTML, OpenClaw GitHub, SWE-bench, Terminal-Bench
