Anthropic’s Biology-Agent Benchmark Says the Model Is Not the Whole System

Anatoliy Kolodkin

08 Jun 2026 • 5 min read

The most useful thing in Anthropic’s new biology-agent paper is not that GPT-5.5 beat Claude Sonnet 4 on a benchmark. That is scoreboard trivia. The useful thing is that both numbers got much less important once the researchers stopped asking models to reverse-engineer a messy scientific website and gave them a deterministic tool boundary instead.

That is the story hiding inside Anthropic’s “Paving the way for agents in biology”, a research post based on the arXiv preprint “Deterministic access to global viral sequence data enables robust agentic scientific discovery.” The team built VirBench, a 120-query benchmark for viral sequence retrieval from NCBI Virus, then tested Claude Sonnet 4, Claude Opus 4.7, Biomni, Edison Analysis, GPT-5.2-pro, and GPT-5.5. Without a dedicated retrieval layer, mean accuracy ranged from 16.9% for Claude Sonnet 4 to 91.3% for GPT-5.5. With gget virus, a deterministic query tool for NCBI Virus-style filtering, every evaluated system reached at least 90%, and GPT-5.5 reached 99.7%.

The lazy reading is “newer frontier models are getting better at biology.” True, but incomplete. The sharper reading is that high-stakes agents are infrastructure products wearing a model interface. If the agent has to click through human-oriented workflows, infer undocumented filtering semantics, reconcile IDs, handle pagination, and guess which metadata fields mean what, you have not built an AI scientist. You have hired a very expensive intern to do browser archaeology.

Dataset construction is not a vibes benchmark

VirBench asks agents to retrieve viral sequence datasets from NCBI Virus. The benchmark spans 120 realistic queries, 40 pathogens, multiple taxonomic levels, and combinatorial metadata filters. These are not toy questions like “summarize Ebola.” They are operational queries: find sequences for this virus, from this host, in this geography, collected in this time window, with these completeness and length constraints, excluding samples that would contaminate downstream analysis.

That distinction matters because the output is not prose. The output is a dataset. A plausible-looking answer can still be wrong in a way that breaks phylogenetic inference, diagnostic assay design, vaccine target selection, protein-model training data, or outbreak surveillance. Anthropic’s post gives the example of a Zaire ebolavirus query where a Sonnet 4 agent returned 106 sequences in one run, 15 in another, and 5 in a third, against an expected count of 266. One downstream tree pushed the inferred time to the most recent common ancestor back to 1922. That is not a harmless hallucination; that is a bad data dependency with scientific consequences.
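To put the Ebola example in proportion, here is the arithmetic on the counts quoted above. It assumes, generously, that every sequence the agent returned was a correct one, so each figure is an upper bound on how much of the expected dataset a run recovered; the post's own accuracy metric may be defined differently.

```python
EXPECTED = 266              # expected record count quoted in the post
RUN_COUNTS = [106, 15, 5]   # sequences returned in three runs of the same query

# Upper bound on the share of the expected dataset each run could have recovered.
fractions = [returned / EXPECTED for returned in RUN_COUNTS]

for run, (returned, share) in enumerate(zip(RUN_COUNTS, fractions), start=1):
    print(f"run {run}: {returned}/{EXPECTED} = at most {share:.1%} of the expected records")

# How far apart the largest and smallest runs are.
spread = max(RUN_COUNTS) / min(RUN_COUNTS)
print(f"largest run is {spread:.1f}x the smallest")
```

Even the best of the three runs misses at least three fifths of the dataset, and the same prompt varies by a factor of about 21 between runs.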

This is why 91.3% accuracy for GPT-5.5 in the unsupported setting should not be treated as “basically solved.” For a chat answer, 91% may be impressive. For dataset construction, it is a warning label. One missing geography, one confused RefSeq/GenBank boundary, or one pagination cutoff can change the downstream conclusion while leaving the agent’s final response looking perfectly competent.

The winning move was to move semantics out of the prompt

The intervention, gget virus, is not glamorous in the way model demos are glamorous. That is exactly the point. It formalizes viral sequence retrieval as a reproducible programmatic workflow: stage retrieval, apply metadata constraints before sequence download, coordinate across REST, Datasets, and E-utilities APIs, fetch structured GenBank records where needed, preserve relevant record information, log the query plan, and return outputs machines and humans can inspect.

The paper reports that this cuts data transfer by more than 98% for representative high-volume queries. That number is not just an optimization footnote. NCBI Virus sits over large, messy biological resources; the paper calls out datasets such as Influenza A with more than 1.5 million records and SARS-CoV-2 with more than 9 million records. “Download everything and filter locally” is not a serious agent strategy at that scale. It is a denial-of-service attack with a lab coat.
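The metadata-first ordering is easy to sketch. What follows is a minimal illustration, not gget virus's code or API: the `Record` fields, the `FAKE…` accessions, and both function names are hypothetical, and an in-memory list stands in for the remote index. The point is only the shape: filter on cheap metadata, record the plan, then fetch sequences for the survivors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Metadata for one sequence record (illustrative fields only)."""
    accession: str
    host: str
    country: str
    year: int
    length: int
    complete: bool


# Made-up metadata standing in for a remote index; these are not real accessions.
METADATA = [
    Record("FAKE0001", "Homo sapiens", "Guinea", 2014, 18959, True),
    Record("FAKE0002", "Homo sapiens", "Guinea", 2014, 9120, False),
    Record("FAKE0003", "Homo sapiens", "Liberia", 2015, 18957, True),
    Record("FAKE0004", "Pan troglodytes", "Guinea", 2014, 18960, True),
]

# Made-up sequence store; in a real system this is the expensive download.
SEQUENCES = {r.accession: "ACGT" * 3 for r in METADATA}


def plan_and_filter(records, *, host, country, years, min_length, complete_only):
    """Stage 1: filter on metadata only and return the plan that was applied."""
    plan = {
        "host": host,
        "country": country,
        "years": list(years),
        "min_length": min_length,
        "complete_only": complete_only,
    }
    first_year, last_year = years
    hits = sorted(
        r.accession
        for r in records
        if r.host == host
        and r.country == country
        and first_year <= r.year <= last_year
        and r.length >= min_length
        and (r.complete or not complete_only)
    )
    return plan, hits


def fetch_sequences(accessions, store):
    """Stage 2: fetch only the accessions that survived filtering."""
    return {acc: store[acc] for acc in accessions}


plan, hits = plan_and_filter(
    METADATA,
    host="Homo sapiens",
    country="Guinea",
    years=(2014, 2014),
    min_length=15000,
    complete_only=True,
)
sequences = fetch_sequences(hits, SEQUENCES)
print(plan)
print(hits, f"({len(sequences)} of {len(SEQUENCES)} sequences fetched)")
```

In this toy case one of four sequences is transferred. The saving in a real system depends on how selective the filters are, but the ordering is what keeps the download proportional to the answer and not to the database.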

With gget virus, accuracy rose to 92.8% for Claude Sonnet 4, 90.0% for Biomni, 93.1% for Edison Analysis, 98.9% for GPT-5.2-pro, 98.3% for Claude Opus 4.7, and 99.7% for GPT-5.5. Stability across repeated runs rose to 0.92–1.00. Runtime and tool-call counts generally fell. The performance gap between models narrowed because the brittle part of the task moved from probabilistic reasoning into deterministic infrastructure.

That should make platform teams sit up. The agent reliability playbook is not “buy the newest model and pray.” It is “turn domain expert behavior into callable, versioned, testable tools.” The LLM should decide when to use the tool, translate user intent into structured parameters, inspect results, and explain implications. It should not be recreating the database’s hidden semantics from scratch on every run.

This applies well beyond biology

Anthropic frames the paper around scientific agents, but the pattern is embarrassingly familiar to anyone deploying enterprise agents. Internal finance systems, security consoles, cloud dashboards, HR tools, CRM exports, legal repositories, ticket queues, and CI systems are full of “expert workflows” that are actually undocumented interface rituals. A senior employee knows which dropdown matters. The API exposes half the state. The CSV export silently changes column names. The dashboard filter and backend filter disagree. Then someone points an LLM at the whole mess and calls it automation.

The result is the same failure mode VirBench exposes: the agent often understands the task well enough to attempt it, but lacks a reliable way to execute and verify it. That is the dangerous middle. A totally incapable agent fails loudly. A mostly capable agent produces answers that look right until they contaminate the next step.

For engineers, the practical instruction is simple: audit agent workflows for places where the model is being used as an undocumented API adapter. If correctness depends on pagination, metadata normalization, ID reconciliation, policy-specific filters, schema conventions, or exact reproducibility, build a deterministic adapter. Return structured outputs. Include provenance. Log query plans. Version the behavior. Add fixtures and regression tests. Make the boring road before asking the model to drive faster.
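As a rough sketch of what such an adapter can look like, the example below walks every page of a paginated source, filters deterministically, and returns structured output with provenance. Everything in it is an assumption made for illustration: the `ROWS` data, `fetch_page`, the `TOOL_VERSION` string, and the provenance fields are hypothetical, and a real adapter would wrap an actual API.

```python
import hashlib
import json

TOOL_VERSION = "0.1.0"  # hypothetical version string for this adapter
PAGE_SIZE = 2

# Made-up backend rows; a real adapter would call a paginated API here.
ROWS = [
    {"id": f"ROW{i:03d}", "status": "open" if i % 2 else "closed"}
    for i in range(1, 8)
]


def fetch_page(offset, limit):
    """Stand-in for one paginated API call."""
    return ROWS[offset:offset + limit]


def query(status):
    """Read every page, filter deterministically, and attach provenance."""
    params = {"status": status}
    matched, offset, pages = [], 0, 0
    while True:
        page = fetch_page(offset, PAGE_SIZE)
        if not page:
            break  # stop only when the source is exhausted, never at page one
        pages += 1
        matched.extend(row["id"] for row in page if row["status"] == status)
        offset += len(page)
    canonical = json.dumps(params, sort_keys=True)
    return {
        "ids": sorted(matched),
        "provenance": {
            "tool_version": TOOL_VERSION,
            "params": params,
            "query_sha256": hashlib.sha256(canonical.encode()).hexdigest(),
            "pages_read": pages,
        },
    }


result = query("open")
# Regression fixture: the same query must always return the same IDs.
assert result["ids"] == ["ROW001", "ROW003", "ROW005", "ROW007"]
assert result == query("open")
print(json.dumps(result, indent=2))
```

The two assertions at the end are the regression test in miniature: a pinned fixture for the expected IDs, and a check that repeating the query changes nothing. An adapter like this gives the model one call to make and gives reviewers one record to audit.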

This also changes model-routing economics. If a deterministic tool lifts every agent to 90% or better and the best ones near 100%, cheaper models become viable for more of the workload. Expensive frontier models still matter for hypothesis generation, ambiguity resolution, and final scientific judgment. But they should not be burned on re-learning how a database filter works. Good tools compress the model gap, which is good news for anyone paying the inference bill and bad news for vendors trying to sell “bigger brain” as the only reliability strategy.

The safety layer is the interface

The broader claim here is not that gget virus solves biology agents. It solves one narrow but consequential retrieval class. It does not decide which biological question is worth asking, validate an experimental design, or replace domain review. But narrow deterministic tools are exactly how serious automation gets built. Reliability compounds at the seams.

The industry keeps talking about agent safety as if it were mostly about model behavior: refusals, jailbreaks, hidden reasoning, eval scores. Those matter. But in real deployments, safety also lives in the interface between model and world. Can the agent call the right tool? Are the inputs typed? Are the outputs auditable? Is the query reproducible? Does the system make illegal states unrepresentable, or does it ask a language model to remember the difference between “complete genome” and “complete enough” because the prompt said so?
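A small sketch shows what typed inputs buy. The `Completeness` enum, the `SequenceQuery` fields, and `parse_query` are hypothetical names invented for this example, not part of any real tool. The idea is that arguments emitted by a model either become a valid typed query or are rejected before anything touches the database.

```python
from dataclasses import dataclass
from enum import Enum


class Completeness(Enum):
    """The only completeness values the tool accepts (illustrative)."""
    COMPLETE = "complete"
    PARTIAL = "partial"


@dataclass(frozen=True)
class SequenceQuery:
    virus: str
    completeness: Completeness
    min_length: int

    def __post_init__(self):
        if not self.virus.strip():
            raise ValueError("virus must be a non-empty name")
        if self.min_length < 0:
            raise ValueError("min_length must be non-negative")


def parse_query(raw: dict) -> SequenceQuery:
    """Turn model-emitted arguments into a typed query or fail loudly."""
    allowed = {"virus", "completeness", "min_length"}
    unknown = set(raw) - allowed
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    return SequenceQuery(
        virus=str(raw["virus"]),
        # Enum lookup raises ValueError for anything that is not a member.
        completeness=Completeness(raw["completeness"]),
        min_length=int(raw.get("min_length", 0)),
    )


ok = parse_query(
    {"virus": "Zaire ebolavirus", "completeness": "complete", "min_length": 15000}
)
print(ok)

try:
    parse_query({"virus": "Zaire ebolavirus", "completeness": "complete enough"})
except ValueError as err:
    print("rejected:", err)
```

With this boundary in place, "complete enough" is not a judgment call the model gets to make. It is simply not a value the query type can hold.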

Anthropic’s biology-agent work is valuable because it is an anti-demo. It says the quiet part plainly: high-stakes agents need paved roads. The model is the driver, not the road network. In science, medicine, finance, security, and production engineering, the winning teams will not be the ones that paste the longest instructions into a chat window. They will be the ones that turn local expert knowledge into deterministic infrastructure and let models operate above it.

That is less magical than the usual agent pitch. Good. Magic is hard to test.

Sources: Anthropic Research, arXiv, gget GitHub repository, NCBI Virus
