VitaBench 2.0 Finds the Missing Capability in Personal Agents: Remembering Without Making a Mess
The personal-agent industry keeps treating memory like a feature toggle. Add embeddings, store a few preferences, retrieve the “relevant” chunks, and suddenly the assistant is supposed to know you. VitaBench 2.0 is useful because it says the quiet part out loud: storing user context is the easy half. Using the right context, updating it safely, and staying consistent over time is where current systems start to wobble.
The benchmark, released by Meituan LongCat researchers, targets long-term personalized and proactive agents across multi-session interactions. It evaluates whether models can extract preferences, use them correctly, update them when the user changes their mind, and recognize when they need to ask for missing information before acting. That is a better proxy for “personal assistant” than another single-turn reasoning score because real assistance is mostly state management under uncertainty.
Full context is the expensive upper bound, not the product plan
VitaBench 2.0 compares three memory settings: Full Context, Agentic Memory, and RAG Memory. Full Context is the brute-force ceiling: put everything in the prompt and see what the model can do. Agentic Memory is closer to the product dream, where the agent decides what to remember and retrieve. RAG Memory is the familiar engineering baseline: embed, retrieve, append, hope.
The results are not flattering. Even under the most generous Full Context setting, the strongest thinking model reported, Claude Opus 4.6, reaches only 0.503 Avg@4. Doubao-Seed-2.0-pro lands at 0.474, DeepSeek-V4-Pro at 0.472, and GPT-5 at 0.441. Non-thinking scores are lower still: DeepSeek-V4-Pro at 0.456, Doubao-Seed-2.0-pro at 0.428, and GLM-5.1 at 0.420.
The sharper result is what happens when memory systems enter the loop. Claude Opus 4.6 drops from 0.503 under Full Context to 0.454 with Agentic Memory and 0.430 with RAG Memory. GPT-5 drops from 0.441 to 0.421 and 0.410. Doubao-Seed-2.0-pro falls from 0.474 to 0.428 and then to 0.339 under RAG Memory. In other words, in each of these cases the memory layer leaves the agent worse off than simply handing it the full history.
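To make the size of those drops easier to compare across models, here is a minimal sketch that turns the scores quoted above into relative losses against Full Context. The dictionary layout and the function names `relative_drop` and `drop_table` are illustrative choices, not part of the benchmark.

```python
# Avg@4 scores exactly as quoted in the paragraph above.
# The dict layout and helper names are illustrative, not from VitaBench 2.0.
reported = {
    "Claude Opus 4.6": {"full": 0.503, "agentic": 0.454, "rag": 0.430},
    "GPT-5": {"full": 0.441, "agentic": 0.421, "rag": 0.410},
    "Doubao-Seed-2.0-pro": {"full": 0.474, "agentic": 0.428, "rag": 0.339},
}


def relative_drop(full: float, other: float) -> float:
    """Fraction of the Full Context score lost under another memory setting."""
    return (full - other) / full


def drop_table(scores: dict) -> dict:
    """Relative loss per model for the Agentic Memory and RAG Memory settings."""
    return {
        model: {
            setting: relative_drop(s["full"], s[setting])
            for setting in ("agentic", "rag")
        }
        for model, s in scores.items()
    }


if __name__ == "__main__":
    for model, drops in drop_table(reported).items():
        print(f"{model}: agentic -{drops['agentic']:.1%}, rag -{drops['rag']:.1%}")
```

On these numbers, RAG Memory gives up roughly 7% of the Full Context score for GPT-5 and roughly 28% for Doubao-Seed-2.0-pro, so the penalty varies widely by model.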
That should be uncomfortable for anyone building “personalization” by wiring a vector database to a chat model. Retrieval can add stale context. It can surface weak signals as if they were durable preferences. It can bury the decisive fact under semantically similar noise. It can also fail to retrieve the relevant memory because the user’s current task is phrased differently from the historical event that matters. Memory is not just storage; it is an inference, policy, and product-design problem.
Personalization is a governance surface
The most useful framing in VitaBench 2.0 is that personalization is not a vibe. It has sub-capabilities. Preference extraction asks whether the model can infer what the user likes or needs from fragmented interactions. Preference utilization asks whether it can apply that preference in a later task. Preference updating asks whether it can revise the old belief when the user changes behavior or explicitly says something new. Proactiveness asks whether the agent knows when acting would be premature because it lacks information.
Those capabilities map directly onto product risk. If an agent remembers too little, it becomes a normal chatbot with a longer bill. If it remembers too much, it becomes creepy or legally awkward. If it remembers the wrong thing, it makes confident mistakes that feel personal because they are. If it updates preferences too aggressively, a one-off exception becomes a default. If it refuses to update, the assistant becomes haunted by stale context.
This is why memory belongs in the same governance conversation as tool permissions and audit logs. Long-term personalization means retaining user information, deciding how long it lives, exposing it for correction, and defining what can be inferred without explicit consent. A memory layer that silently persists sensitive preferences is a privacy surface. A memory layer that cannot explain why it retrieved something is an observability failure. A memory layer that cannot forget is a product liability with embeddings.
VitaBench 2.0’s metrics also deserve attention. Avg@4 measures mean success across four rollouts. Pass@4 measures whether the task was solved at least once. Pass^4 measures whether it was solved all four times. That distinction matters: Pass@4 rewards a single lucky rollout, while Pass^4 rewards only consistency. Personal agents cannot be “sometimes right” on tasks that act on behalf of users. A system that succeeds once in four attempts is not reliable; it is a slot machine with a nicer onboarding flow.
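The three metrics are simple to compute from per-task rollout outcomes. The sketch below follows the definitions in the paragraph above; it assumes each task has a list of boolean success flags, and the function names are hypothetical. The official evaluation scripts may aggregate differently.

```python
# Empirical versions of the three metrics, following the article's wording.
# Input: task id -> list of per-rollout success flags (k = 4 in the benchmark).
# Function names are hypothetical; this is not the official scoring code.

def avg_at_k(results: dict[str, list[bool]]) -> float:
    """Mean success rate across rollouts, averaged over tasks."""
    per_task = [sum(runs) / len(runs) for runs in results.values()]
    return sum(per_task) / len(per_task)


def pass_at_k(results: dict[str, list[bool]]) -> float:
    """Share of tasks solved in at least one rollout."""
    return sum(any(runs) for runs in results.values()) / len(results)


def pass_hat_k(results: dict[str, list[bool]]) -> float:
    """Share of tasks solved in every rollout (Pass^k)."""
    return sum(all(runs) for runs in results.values()) / len(results)


if __name__ == "__main__":
    demo = {
        "always": [True, True, True, True],
        "lucky_once": [True, False, False, False],
        "never": [False, False, False, False],
    }
    print(avg_at_k(demo), pass_at_k(demo), pass_hat_k(demo))
```

In the demo data, the "lucky_once" task counts fully toward Pass@4 and not at all toward Pass^4, which is exactly the gap between a capable agent and a dependable one.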
The runnable artifact still has a caveat
The release is early. The GitHub repository was created on May 26 and pushed on May 27, but the README still lists the datasets, evaluation scripts, and code as “released soon.” At the time of writing, the repository had 4 stars and 1 fork, and a Hacker News search returned no exact hits for “VitaBench 2.0.” So this is not yet a fully reproducible benchmark package that teams can drop into CI today.
That caveat matters. Leaderboard numbers should be independently reproduced once the evaluation scripts and datasets land. Model names and scores in new benchmark papers are useful signals, not final law. But the benchmark’s problem statement is exactly right. The industry has been over-indexing on context-window size and under-indexing on memory correctness. Bigger prompts can raise the ceiling, but they do not solve preference lifecycle, retrieval precision, contradiction handling, or user control.
For practitioners, the action item is clear: evaluate memory as behavior, not architecture. Build tests where a user preference is implied, later contradicted, conditionally applied, and sometimes irrelevant. Track consistency across repeated runs. Measure false personalization, meaning cases where the agent applies a memory that should not have been used. Log what was retrieved, why, and how it influenced the answer. Give users an inspectable and editable memory surface before you let the agent act on their behalf.
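As one way to start on that checklist, the sketch below scores logged runs for false personalization and run-to-run consistency. Everything in it is an assumption for illustration: the `RunRecord` fields, the idea that a test author labels whether a memory should apply, and the function names. It is not taken from VitaBench 2.0.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    """One logged agent run on a test case (hypothetical schema)."""
    case_id: str
    memory_should_apply: bool  # ground-truth label written by the test author
    memory_was_applied: bool   # observed from the agent's logs or answer


def false_personalization_rate(records: list[RunRecord]) -> float:
    """Among runs where the memory was irrelevant, how often was it used anyway?"""
    irrelevant = [r for r in records if not r.memory_should_apply]
    if not irrelevant:
        return 0.0
    return sum(r.memory_was_applied for r in irrelevant) / len(irrelevant)


def consistency_rate(records: list[RunRecord]) -> float:
    """Share of cases whose repeated runs all made the same apply/skip decision."""
    by_case: dict[str, set[bool]] = defaultdict(set)
    for r in records:
        by_case[r.case_id].add(r.memory_was_applied)
    if not by_case:
        return 0.0
    return sum(len(decisions) == 1 for decisions in by_case.values()) / len(by_case)


if __name__ == "__main__":
    runs = [
        RunRecord("implied_pref", True, True),
        RunRecord("implied_pref", True, True),
        RunRecord("irrelevant_pref", False, True),   # false personalization
        RunRecord("irrelevant_pref", False, False),
    ]
    print(false_personalization_rate(runs), consistency_rate(runs))
```

Keeping the ground-truth label separate from the observed behavior means the same harness can score any memory architecture, which is the point of evaluating behavior rather than design.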
Also, stop treating RAG as a synonym for memory. Retrieval is one mechanism. Personal memory needs schemas, timestamps, confidence, consent, scope, decay, conflict resolution, and deletion. If that sounds less magical than a demo, good. The non-magical parts are where reliable products live.
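To make that list concrete, here is a minimal sketch of a memory record that carries a timestamp, confidence, consent, scope, decay, and deletion, with a simple conflict rule. The field names, the 90-day half-life, the 0.2 threshold, and the "highest decayed confidence wins" rule are all assumptions chosen for illustration, not a recommended or published design.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class MemoryRecord:
    """Hypothetical schema for one remembered preference."""
    key: str                 # e.g. "spice_level"
    value: str               # e.g. "mild"
    scope: str               # where the memory may be used, e.g. "food_ordering"
    confidence: float        # 0..1 at the time of observation
    consented: bool          # user agreed to retain this item
    observed_at: datetime    # timestamp of the supporting interaction
    half_life_days: float = 90.0  # assumed decay rate
    deleted: bool = False

    def effective_confidence(self, now: datetime) -> float:
        """Confidence after exponential decay with the record's half-life."""
        age_days = max((now - self.observed_at).total_seconds() / 86400.0, 0.0)
        return self.confidence * 0.5 ** (age_days / self.half_life_days)


def forget(records: list[MemoryRecord], key: str) -> None:
    """Honor a deletion request: mark matching records and clear their values."""
    for r in records:
        if r.key == key:
            r.deleted = True
            r.value = ""


def resolve(records: list[MemoryRecord], key: str, scope: str, now: datetime,
            min_confidence: float = 0.2) -> Optional[MemoryRecord]:
    """Pick the usable record with the highest decayed confidence, if any."""
    usable = [r for r in records
              if r.key == key and r.scope == scope and r.consented and not r.deleted]
    if not usable:
        return None
    best = max(usable, key=lambda r: r.effective_confidence(now))
    return best if best.effective_confidence(now) >= min_confidence else None
```

Ranking by decayed confidence rather than by recency alone is one way to keep a low-confidence one-off from overriding a well-established preference while still letting stale beliefs fade. Returning `None` gives the agent a natural cue to ask the user instead of guessing.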
The LGTM take: VitaBench 2.0 is a reality check for the “memory makes agents personal” crowd. The hard part is not remembering. The hard part is remembering without making a mess.
Sources: arXiv, VitaBench 2.0 GitHub, Hugging Face Papers, prior VitaBench repository