VibeSearchBench Shows Why Deep Research Still Misses What Users Actually Want
Most “deep research” benchmarks quietly assume the user has already done the hardest part: specifying exactly what they want. That is convenient for leaderboard construction and wildly unlike real work. Real users start with mush. They ask for a good vendor, a sensible travel plan, a market landscape, a replacement laptop, a policy comparison, or “the thing I should know before making this decision.” Then they refine as they learn. The research process is not one prompt; it is negotiation.
VibeSearchBench is valuable because it points the benchmark at that mess. Instead of evaluating agents on fully specified one-shot prompts, it tests long-horizon proactive search where user intent is progressively revealed through multi-turn dialogue. The benchmark contains 200 manually curated bilingual tasks across 20 domains, split between VibeSearch-Pro and VibeSearch-Daily. Its ground truth is not a neat fixed schema but schema-free knowledge graphs averaging 212.43 nodes, 298.32 triples, and 139.70 source URLs per task. That is the right shape for research work: entities, relationships, evidence, and a user who does not hand you the ontology upfront.
The headline result is not flattering. Seven frontier models all perform poorly. The best reported setup, Claude Opus 4.6 under OpenClaw, reaches only 30.30 average F1. DeepSeek-V4-Pro under OpenClaw reaches 27.03, Kimi K2.6 about 26.17, Gemini 3.1 Pro 23.62, and GPT-5.4 21.92. If your product page says “deep research” and your eval looks nothing like progressive intent discovery, this is the part where the room gets quiet.
Vague intent is not a UX edge case
The word “vibe” is annoying because it is accurate. Users often know the feel of the answer before they know the structure of the question. They can reject a candidate result, clarify a constraint, add a preference, or realize that their original target was wrong. A strong research agent should be able to ask useful follow-ups, search in stages, preserve what it learned, and converge on a structured answer without forcing the user to become a prompt engineer.
That is where many agent evals under-test the system. Fully specified prompts reward models that follow instructions and retrieve facts. Real research requires product sense: when to ask, when to search, when to summarize partial findings, when to challenge a premise, and when to stop. VibeSearchBench’s progressive-disclosure user simulator is an imperfect but necessary move toward evaluating that loop.
The graph-based ground truth is also a smart compromise. Free-text grading is too mushy. Fixed schemas are too artificial. Knowledge graphs let the benchmark ask whether the agent recovered relevant entities and relationships without pretending every domain shares the same output format. VibeSearch-Pro is especially heavy, averaging 373.56 triples and 158.29 source URLs; VibeSearch-Daily averages 223.07 triples and 121.11 source URLs. That is not a quick answer. It is an information-gathering workflow.
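To make the graph-scoring idea concrete, here is a minimal sketch of triple-level precision, recall, and F1. It assumes exact matching after light normalization, which is a deliberate simplification: the benchmark relies on judge models to decide matches, and its actual scoring procedure is not reproduced here. The `triple_f1` function and the example entities are hypothetical.

```python
from typing import Dict, Iterable, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)


def _clean(text: str) -> str:
    """Lowercase and collapse whitespace so formatting noise is not a miss."""
    return " ".join(text.lower().split())


def normalize(triple: Triple) -> Triple:
    subject, relation, obj = triple
    return (_clean(subject), _clean(relation), _clean(obj))


def triple_f1(predicted: Iterable[Triple], gold: Iterable[Triple]) -> Dict[str, float]:
    """Score predicted triples against gold triples by exact match.

    Assumption: exact string match after normalization. A real grader
    would need fuzzy or judge-based matching for paraphrased relations.
    """
    pred_set = {normalize(t) for t in predicted}
    gold_set = {normalize(t) for t in gold}
    matched = len(pred_set & gold_set)
    precision = matched / len(pred_set) if pred_set else 0.0
    recall = matched / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical example data, not taken from the benchmark.
gold = [
    ("Acme Corp", "headquartered in", "Berlin"),
    ("Acme Corp", "founded in", "2011"),
    ("Acme Corp", "competes with", "Globex"),
    ("Globex", "headquartered in", "Oslo"),
]
predicted = [
    ("acme corp", "headquartered in", "berlin"),
    ("Acme Corp", "founded in", "2011"),
    ("Acme Corp", "acquired", "Initech"),
]
scores = triple_f1(predicted, gold)
print(scores)
```

Here the agent recovers two of four gold triples and adds one unsupported triple, so precision is 2/3 and recall is 1/2. Both the missing coverage and the invented edge cost points, which is the property that makes a graph target harder to game than polished prose.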
The paper reports a human-agreement check in which three judge models reach above 98.5% agreement with human experts on sampled trajectories, with Kimi K2.6 at 98.92%. LLM-as-judge remains a caveat, as always, but high sampled agreement is at least evidence that the evaluation is not pure vibes grading vibes. The more important question is whether the benchmark pressure encourages agents to behave more like collaborators and less like verbose answer machines.
Verbosity is an agent-cost bug, not a personality trait
The model behavior analysis is where practitioners should pay attention. Claude Opus 4.6 shows the highest assistant/user work ratio in ReAct at 8.26 and also the best F1. Gemini 3.1 Pro has the lowest ratio at 2.84 and limited coverage. That suggests a real tradeoff: agents that do more work can recover more of the graph, but only if that work is directed. Being concise can become shallow; being verbose can become waste.
GPT-5.4 is the cautionary example. Under OpenClaw it hits 19.9 user turns and 1.27 context compressions, creating a loop of verbose output, context overflow, and redundant re-search. That failure mode should feel familiar to anyone running long-horizon agents. The system talks too much, compresses away useful state, searches again, repeats itself, and spends tokens rebuilding context it already had. This is not merely annoying UX. It is degraded reasoning caused by context-budget mismanagement.
Teams building research agents should instrument interaction quality, not just final answer quality. Track follow-up-question usefulness. Track user-turn efficiency. Track repeated searches. Track context compression events and whether they correlate with lost constraints. Track whether retrieved facts survive into the final structure. Track the ratio of exploration to synthesis. If an agent’s research process cannot be inspected as a process, you will only find failures after the final answer disappoints a user.
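A rough sketch of what that instrumentation could look like over a logged trajectory follows. Everything here is an assumption for illustration: the `Event` shape, the event kinds, the word-count ratio used as a proxy for assistant/user work (the paper's own definition may differ), and the naive substring check for lost constraints.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Event:
    kind: str  # hypothetical kinds: "user", "assistant", "search", "compression"
    text: str = ""


def summarize_trajectory(
    events: List[Event], constraints: List[str], final_answer: str
) -> Dict[str, object]:
    """Compute simple process metrics from one agent trajectory."""
    user_turns = sum(1 for e in events if e.kind == "user")
    compressions = sum(1 for e in events if e.kind == "compression")

    # A search counts as repeated if its normalized query was already issued.
    queries = Counter(
        " ".join(e.text.lower().split()) for e in events if e.kind == "search"
    )
    repeated_searches = sum(count - 1 for count in queries.values())

    # Word counts as a crude proxy for how much each side contributed.
    user_words = sum(len(e.text.split()) for e in events if e.kind == "user")
    assistant_words = sum(
        len(e.text.split()) for e in events if e.kind == "assistant"
    )
    work_ratio = assistant_words / user_words if user_words else None

    # Naive check: a constraint is "lost" if it never appears in the answer.
    answer = final_answer.lower()
    lost_constraints = [c for c in constraints if c.lower() not in answer]

    return {
        "user_turns": user_turns,
        "compressions": compressions,
        "repeated_searches": repeated_searches,
        "work_ratio": work_ratio,
        "lost_constraints": lost_constraints,
    }


# Hypothetical trajectory: the agent re-issues a search after a compression
# and drops the budget constraint from its final answer.
events = [
    Event("user", "find me a good laptop"),
    Event("assistant", "What is your budget and main use case?"),
    Event("user", "under 1500 USD for video editing"),
    Event("search", "best laptops video editing"),
    Event("compression"),
    Event("search", "Best laptops  video editing"),
    Event("assistant", "Here are three laptops suited to video editing."),
]
report = summarize_trajectory(
    events,
    constraints=["1500 USD", "video editing"],
    final_answer="Here are three laptops suited to video editing.",
)
print(report)
```

The useful signal is the combination: a repeated search that follows a compression event, plus a constraint missing from the final answer, is the verbose-overflow-re-search loop showing up in your own logs.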
There is a direct coding-agent parallel. “Vibe debugging” often starts with a developer describing symptoms imprecisely: the test is flaky, deploy got weird, performance feels worse, the generated code “seems off.” A useful agent must elicit constraints, inspect evidence, form hypotheses, and avoid drowning the context window in logs. VibeSearchBench is about web search, but the operating lesson applies to debugging, code review, incident triage, and migration planning: long-horizon assistance is a dialogue management problem plus an evidence management problem. The model alone is not the product.
The benchmark’s limitations are real. Persona simulators can overfit to evaluator assumptions. Graph matching depends on judge behavior. Bilingual tasks are valuable but still a sample of the world, not the world. And a 30.30 F1 best score does not tell you exactly how a user would rate the experience. But it tells us enough: current frontier agents are still weak at converting vague evolving intent into complete, structured knowledge without wasting context.
The practical move is to stop evaluating deep-research products only on over-specified prompts. Add progressive disclosure. Force agents to ask clarifying questions. Measure context waste. Grade partial knowledge structures, not just polished prose. If your agent can answer a perfect prompt but cannot help shape a vague one, it is not a researcher. It is a report generator waiting for a better PM.
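For teams that want a starting point, a progressive-disclosure harness can be very small. The sketch below is a toy under stated assumptions, not the benchmark's simulator: `SimulatedUser`, `run_episode`, and `toy_agent` are hypothetical names, the user reveals one hidden constraint per clarifying question, and a question is detected by a trailing question mark.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (role, text)


@dataclass
class SimulatedUser:
    opening_request: str
    hidden_constraints: List[str]
    revealed: List[str] = field(default_factory=list)

    def reply(self, agent_message: str) -> str:
        """Reveal the next hidden constraint only if the agent asked a question."""
        asked = agent_message.strip().endswith("?")
        if asked and len(self.revealed) < len(self.hidden_constraints):
            constraint = self.hidden_constraints[len(self.revealed)]
            self.revealed.append(constraint)
            return constraint
        return "Okay."


def run_episode(
    agent: Callable[[List[Turn]], str], user: SimulatedUser, max_turns: int = 6
) -> List[Turn]:
    """Alternate agent and user turns until the agent emits a FINAL answer."""
    history: List[Turn] = [("user", user.opening_request)]
    for _ in range(max_turns):
        message = agent(history)
        history.append(("assistant", message))
        if message.startswith("FINAL:"):
            break
        history.append(("user", user.reply(message)))
    return history


def toy_agent(history: List[Turn]) -> str:
    """Stand-in agent: asks until it has heard three user messages."""
    user_messages = [text for role, text in history if role == "user"]
    if len(user_messages) < 3:
        return "Is there anything else I should know?"
    return "FINAL: " + "; ".join(user_messages[1:])


user = SimulatedUser(
    opening_request="Find me a vendor.",
    hidden_constraints=["Must be EU-based.", "Budget under 10k."],
)
history = run_episode(toy_agent, user)
disclosure_rate = len(user.revealed) / len(user.hidden_constraints)
print(history[-1][1], disclosure_rate)
```

An agent that answers immediately never sees the hidden constraints, so its disclosure rate stays at zero and its final answer can be graded against requirements it never bothered to elicit.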
Sources: arXiv, arXiv HTML, VibeSearchBench project page