LiveBrowseComp Catches Search Agents Cheating With Memory

Search-agent benchmarks have a credibility problem: too many “search” traces look like a model confirming what it already knows. The agent writes plausible queries, opens a few pages, cites something adjacent, and the final answer appears grounded. But if the answer was already sitting in the model’s weights, the tool did not discover evidence. It performed citation theater with a latency penalty.

LiveBrowseComp is useful because it attacks that problem directly. The paper argues that many search-agent evaluations accidentally measure intrinsic model knowledge rather than actual web discovery. Its authors diagnose BrowseComp-style tasks, then introduce a 335-question benchmark built from recent, long-tail facts published within the 90 days before construction. The results are the kind of ugly that improves a field: closed-book accuracy falls below 2%, search-augmented scores drop 25 to 40 points versus BrowseComp, and familiar leaderboard rankings become less comfortable.

The failure mode is intrinsic knowledge dependence

The paper’s core term is Intrinsic Knowledge Dependence, or IKD. It describes agents that use web search to verify internally generated hypotheses rather than to discover unknown facts. That is not a subtle distinction. A real research agent should let retrieved evidence constrain the answer. An IKD-heavy agent starts with a belief, then looks for support. The output may still be correct, but the evaluation is measuring memory plus post-hoc sourcing, not research capability.

The diagnostics are sharp. On BrowseComp, agents can answer up to 44.5% of questions (pass@4) without tools. That alone should make benchmark readers suspicious. If a supposedly browsing-heavy benchmark has a large closed-book path, it is partly testing what the model already absorbed during pretraining. Worse, the LiveBrowseComp paper reports that on BrowseComp, evidence-blocked search performs worse than closed-book answering for every evaluated model. Average pass@4 drops from 26.1 closed-book to 6.2 when answer-supporting evidence is removed. MiniMax M2.5 falls from 44.5 closed-book pass@4 to 8.0 with evidence-blocked search; Kimi-K2.6 falls from 25.5 to 2.3.
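To make the size of that gap concrete, here is a minimal sketch that recomputes the closed-book versus evidence-blocked difference from the figures quoted above. The `pass_at_k` helper assumes the common reading of pass@k (a question counts as solved if any of the first k attempts is correct); the paper's exact estimator may differ, and all names here are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiagnosticRow:
    model: str
    closed_book: float       # pass@4 with no tools, in percent
    evidence_blocked: float  # pass@4 with search, supporting evidence removed


def search_penalty(row: DiagnosticRow) -> float:
    """Points lost when search runs but the supporting evidence is blocked."""
    return round(row.closed_book - row.evidence_blocked, 1)


def pass_at_k(attempts: list[list[bool]], k: int = 4) -> float:
    """Percent of questions with at least one correct attempt among the first k.

    Assumption: this is the plain "any of k" definition, not necessarily
    the estimator used in the paper.
    """
    if not attempts:
        return 0.0
    solved = sum(any(per_question[:k]) for per_question in attempts)
    return 100.0 * solved / len(attempts)


# Figures as quoted in the article.
ROWS = [
    DiagnosticRow("Average", 26.1, 6.2),
    DiagnosticRow("MiniMax M2.5", 44.5, 8.0),
    DiagnosticRow("Kimi-K2.6", 25.5, 2.3),
]

if __name__ == "__main__":
    for row in ROWS:
        print(f"{row.model}: -{search_penalty(row)} points with evidence blocked")
```

A positive penalty means the model did better answering from memory than it did with a search tool that could not surface the supporting evidence.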

That is the result practitioners should tape to the wall. Search can make a model worse when the retrieved documents are hard negatives or missing the necessary evidence. The agent does not reliably distinguish between “I found proof,” “I found something topically related,” and “I am now more confused but still need to answer.” This is how deep-research products produce confident nonsense with a footnote trail that looks expensive.

LiveBrowseComp’s construction tries to remove the shortcut. Its 335 human-authored questions depend on recent long-tail facts from sources including GDELT/news, TMDB, RAWG, CVE/NVD, sports match data, and USGS earthquake data, while filtering away globally salient events. That makes the closed-book route much harder. It also makes the benchmark closer to the actual value proposition of search agents: find information the model probably does not already know.
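The recency and salience filters can be sketched in a few lines. This is not the authors' pipeline: the `CandidateFact` fields, the `globally_salient` flag, the inclusive 90-day boundary, and the example dates are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class CandidateFact:
    source: str             # e.g. "USGS", "CVE/NVD"
    published: date
    globally_salient: bool  # hypothetical label; how salience is judged is not specified here


def eligible(fact: CandidateFact, built_on: date, window_days: int = 90) -> bool:
    """Keep facts published within the window before construction and not globally salient."""
    age = built_on - fact.published
    in_window = timedelta(0) <= age <= timedelta(days=window_days)
    return in_window and not fact.globally_salient


if __name__ == "__main__":
    built_on = date(2025, 6, 1)  # hypothetical construction date
    candidates = [
        CandidateFact("USGS", date(2025, 4, 15), False),     # recent, long-tail
        CandidateFact("CVE/NVD", date(2025, 1, 1), False),   # too old
        CandidateFact("GDELT/news", date(2025, 5, 20), True),  # recent but salient
    ]
    kept = [c.source for c in candidates if eligible(c, built_on)]
    print(kept)
```

The same two checks are a reasonable starting point for building a private, rolling eval set from whatever fresh data your own domain produces.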

Tool use is not grounding unless evidence controls the answer

The main LiveBrowseComp scores show the reset. GPT 5.4 reaches 43.2, Seed 2.0 reaches 41.5, Claude Sonnet 4.6 reaches 41.4, Gemini 3.1 Pro reaches 40.0, DeepSeek V4 Pro reaches 38.3, and MiniMax M2.5 reaches 28.0. These are far below corresponding BrowseComp scores in the 51 to 77 range. That is not necessarily because the models got dumber. The benchmark removed a crutch.

The shared scaffold is also worth noting: `search(query)` through serper.dev, `visit(url, goal)` through Jina retrieval, a sandboxed Python interpreter, a 256k max context, and a 250-step iteration budget. In other words, the models had room to work. This is not a tiny-tool setting designed to punish browsing. It is an evaluation of whether the browsing actually changes what the agent can know.
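For readers who want a mental model of that scaffold, here is a rough sketch of a step-budgeted tool loop. Only the tool signatures `search(query)` and `visit(url, goal)` and the 250-step and 256k figures come from the description above; the `ScaffoldConfig` and `run_episode` names, the policy interface, and the stub tools are hypothetical, and the sketch makes no real calls to serper.dev or Jina.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ScaffoldConfig:
    max_context_tokens: int = 256_000  # "256k" read as 256,000; the exact count is an assumption
    max_steps: int = 250               # iteration budget from the article


def run_episode(policy: Callable, tools: dict, config: ScaffoldConfig):
    """Run a policy until it answers or exhausts the step budget.

    policy(history) returns (tool_name, args_tuple) or ("answer", (text,)).
    The context limit is recorded in the config but not enforced in this sketch.
    """
    history = []
    for _ in range(config.max_steps):
        action, args = policy(history)
        if action == "answer":
            return args[0], history
        observation = tools[action](*args)
        history.append((action, args, observation))
    return None, history  # budget exhausted without an answer


# Stub tools standing in for the real search and page-retrieval backends.
def search(query: str) -> list[str]:
    return ["https://example.org/result"]


def visit(url: str, goal: str) -> str:
    return f"stub page text from {url}"


def stub_policy(history):
    if not history:
        return "search", ("recent long-tail fact",)
    if len(history) == 1:
        return "visit", (history[0][2][0], "find the fact")
    return "answer", ("stub answer",)


if __name__ == "__main__":
    tools = {"search": search, "visit": visit}
    answer, trace = run_episode(stub_policy, tools, ScaffoldConfig())
    print(answer, len(trace))
```

Keeping the loop this explicit makes it easy to log every observation, which is what any later evidence-dependence analysis needs.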

For engineering teams, the action item is not “use LiveBrowseComp and call it a day.” It is to add evidence-dependence tests to your own evals. Always run a closed-book baseline. Remove or perturb supporting documents. Insert hard negatives. Track whether search queries are generated from retrieved leads or from internally invented hypotheses. Require answer spans or structured evidence links when possible. Measure whether the final answer changes appropriately when evidence changes. If the system gives the same answer after you remove the supporting evidence, your tool trace may be decorative.
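A minimal version of that check might look like the sketch below. It assumes your eval set labels which documents support the answer (the `supports_answer` field is hypothetical) and that the agent can be called with an explicit document list. The two toy agents exist only to show the contrast; a real harness would wrap your actual agent and retrieval layer.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Doc:
    url: str
    text: str
    supports_answer: bool  # hypothetical label supplied by your eval set


Agent = Callable[[str, Sequence[Doc]], str]


@dataclass(frozen=True)
class DependenceReport:
    closed_book: str
    with_evidence: str
    evidence_blocked: str
    answer_tracks_evidence: bool  # False suggests the tool trace is decorative


def evidence_dependence(agent: Agent, question: str, docs: Sequence[Doc]) -> DependenceReport:
    """Run the same question closed-book, with all docs, and with support removed."""
    blocked = [d for d in docs if not d.supports_answer]
    closed = agent(question, [])
    full = agent(question, docs)
    no_support = agent(question, blocked)
    return DependenceReport(closed, full, no_support, full != no_support)


# Toy agents for illustration only.
def memory_agent(question: str, docs: Sequence[Doc]) -> str:
    return "4.7"  # ignores retrieved documents entirely


def grounded_agent(question: str, docs: Sequence[Doc]) -> str:
    for d in docs:
        if "magnitude " in d.text:
            return d.text.split("magnitude ")[1].split()[0]
    return "unknown"


DOCS = [
    Doc("https://example.org/a", "Report: magnitude 5.1 quake", True),
    Doc("https://example.org/b", "Regional history of earthquakes", False),  # hard negative
]

if __name__ == "__main__":
    q = "What was the magnitude?"
    print(evidence_dependence(memory_agent, q, DOCS))
    print(evidence_dependence(grounded_agent, q, DOCS))
```

The memory agent returns the same answer in all three conditions, so `answer_tracks_evidence` is false. The grounded agent falls back to "unknown" once the supporting document is removed, which is the behavior you want to see.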

Cost governance belongs in the same conversation. Long traces are not proof of rigor. A search agent can burn tokens, visit pages, compress context, and still behave like a model laundering its own memory. The useful metric is not pages visited or number of searches. It is evidence acquisition: did a source introduce a fact that constrained, corrected, or completed the final answer? If not, the trace is just a very expensive confidence costume.
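One crude way to operationalize evidence acquisition is sketched below. It assumes you log each step's retrieved text and token cost, and that you also have a closed-book answer for the same question. The substring match is a deliberately simple stand-in for real span or claim matching, and the `Step` and `acquisition_summary` names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    tool: str
    tokens: int
    retrieved_text: str


def acquisition_summary(steps: list[Step], final_answer: str, closed_book_answer: str) -> dict:
    """Summarize cost alongside whether retrieval plausibly supplied the answer.

    Assumption: evidence counts as "acquired" when the final answer appears in
    some retrieved text and differs from what the model said closed-book.
    """
    answer = final_answer.strip().lower()
    supporting = [s for s in steps if answer and answer in s.retrieved_text.lower()]
    changed = answer != closed_book_answer.strip().lower()
    return {
        "steps": len(steps),
        "tokens": sum(s.tokens for s in steps),
        "supporting_steps": len(supporting),
        "evidence_acquired": bool(supporting) and changed,
    }


if __name__ == "__main__":
    trace = [
        Step("search", 400, "Ten results about regional earthquakes"),
        Step("visit", 3200, "The agency reported a magnitude 5.1 event"),
        Step("visit", 2900, "Unrelated page on building codes"),
    ]
    print(acquisition_summary(trace, "5.1", "4.7"))   # retrieval changed the answer
    print(acquisition_summary(trace, "4.7", "4.7"))   # long trace, no acquisition
```

Reporting `tokens` next to `evidence_acquired` keeps the cost conversation honest: an expensive trace with no supporting step is spend without discovery.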

This matters for coding agents too. Many development tasks now include documentation lookup, issue search, package investigation, vulnerability research, or API migration work. A coding agent that “searches” by confirming a remembered API shape can silently use stale knowledge. For teams evaluating AI coding assistants, LiveBrowseComp’s lesson transfers cleanly: freshness, source attribution, and negative-evidence handling should be part of the harness, not an afterthought.

There are limits. LiveBrowseComp is still a benchmark, and benchmark construction choices always shape behavior. Recent long-tail facts are a strong antidote to memorization, but they do not cover every research workflow. The arXiv paper is new and had drawn little public discussion when this article was researched, so independent reproduction matters. Still, the conceptual fix is correct. Search agents should be judged by discovery, not by whether they can attach citations to things they already believed.

The editorial line is simple: tool use is not grounding unless retrieved evidence controls the answer. If the model cannot tell “I found this” from “I remembered this,” it is not a search agent. It is a language model with browser choreography.

Sources: arXiv, arXiv HTML, Hugging Face dataset