Gemini for Science Shows Agent Systems Need Evaluation Loops, Not Lab Coats

Anatoliy Kolodkin

21 May 2026 • 5 min read

Google’s new Gemini for Science announcement is easy to misread as another “AI will accelerate discovery” victory lap. The more useful read is narrower and more interesting: Google is turning scientific work into agent workflows with explicit search, critique, ranking, tool access, and human validation. That is a much better product thesis than asking a chatbot to cosplay as a principal investigator.

The company introduced Gemini for Science at I/O 2026 as a set of Google Labs experiments for researchers: Hypothesis Generation, Computational Discovery, and Literature Insights. The launch sits on top of two Nature papers published May 19: one describing Co-Scientist, a Gemini-based multi-agent system for hypothesis generation, and one describing ERA, an empirical research assistant that uses a large language model plus tree search to write and optimize scientific software. Google is also adding Science Skills, a specialized bundle for agentic platforms like Antigravity that connects to more than 30 life-science databases and tools, including UniProt, AlphaFold Database, AlphaGenome API, and InterPro.

That list sounds like a product manager swallowed a conference agenda. Underneath it is a genuinely important pattern for builders: the frontier model is not the product. The workflow scaffold is.

The agent is useful because it has a review loop

Co-Scientist is the cleanest example. Google describes it as a coalition of Gemini-based agents: a generation agent proposes hypotheses, a proximity agent maps the search space, a reflection agent acts as a virtual peer reviewer, a ranking agent runs an “idea tournament,” an evolution agent refines the best candidates, and a meta-review agent synthesizes the debate into something a scientist can inspect. A supervisor agent plans and coordinates the work in parallel.

The Nature abstract is careful about what is being claimed. Co-Scientist generates and refines hypotheses for experimental verification; it does not magically convert language into truth. Google’s validation examples focus on biomedical applications such as drug repurposing, novel target discovery, and antimicrobial resistance. The paper says Co-Scientist helped identify drug-repurposing candidates and synergistic combination therapies for acute myeloid leukemia that were validated through in vitro experiments. Google DeepMind also says a Stanford liver-fibrosis collaboration found a candidate that blocked 91% of a scarring-linked response in lab tests.

The important engineering lesson is not “agents can do biology.” It is that Google built a system that separates proposal from critique, ranking from generation, and final judgment from automated output. That architecture is portable. Legal research, security triage, architecture review, data-science automation, incident analysis, and procurement diligence all have the same failure mode: one fluent answer is less useful than a structured process that proposes options, attacks them, ranks them, cites evidence, and leaves a human with the decision.

If you are building domain agents, steal the loop before you steal the branding. Define what a candidate is. Define who critiques it. Define what evidence is admissible. Define how candidates are ranked. Define where the human sees the disagreement, not just the winner. A multi-agent system without those controls is usually just a more expensive way to generate confident mush.
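As a rough illustration of that checklist, here is a minimal Python sketch of a single propose, critique, rank pass. It is not Google’s implementation: the `Candidate` type, `review_loop`, and the three `demo_*` functions are hypothetical names invented for this example, and the stubs stand in for what would be separate model or tool calls in a real system.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Candidate:
    """One proposed option, carried together with every objection raised against it."""
    text: str
    critiques: List[str] = field(default_factory=list)
    score: float = 0.0


def review_loop(
    propose: Callable[[], List[str]],
    critique: Callable[[str], List[str]],
    score: Callable[[str, List[str]], float],
    top_k: int = 3,
) -> List[Candidate]:
    """Run one propose -> critique -> rank pass and return a ranked shortlist."""
    candidates = [Candidate(text=t) for t in propose()]
    for c in candidates:
        c.critiques = critique(c.text)        # the critic is separate from the proposer
        c.score = score(c.text, c.critiques)  # the ranker is separate from both
    ranked = sorted(candidates, key=lambda c: c.score, reverse=True)
    return ranked[:top_k]  # objections stay attached for the human reviewer


# Hypothetical stand-ins (assumptions for this sketch, not a real API).
def demo_propose() -> List[str]:
    return ["option A", "option B", "untested option C"]


def demo_critique(text: str) -> List[str]:
    return ["no supporting evidence cited"] if "untested" in text else []


def demo_score(text: str, critiques: List[str]) -> float:
    return 1.0 - 0.5 * len(critiques)


shortlist = review_loop(demo_propose, demo_critique, demo_score)
for c in shortlist:
    print(f"{c.score:.1f}  {c.text}  objections={c.critiques}")
```

The design choice worth copying is that the function returns the objections along with the ranking, so the reviewer sees the disagreement and not only the winner.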

ERA is the more engineer-native story

ERA may be less glamorous than a “co-scientist,” but it is closer to how serious engineering teams already work. The system takes a scientific problem and a quality metric, searches literature, writes code, explores variations, combines techniques, evaluates results, and uses tree search to navigate thousands of possible solutions. Google’s Research post describes it bluntly: ERA addresses one of the most time-consuming parts of computational research, the slow loop of writing and refining software for experiments.

The Nature abstract gives the useful numbers. In bioinformatics, ERA discovered 40 novel methods for single-cell data analysis that outperformed top human-developed methods on a public leaderboard. In epidemiology, it generated 14 models that outperformed the CDC ensemble and all other individual models for forecasting COVID-19 hospitalizations. Google Research says the system has now been applied across eight manuscripts, including respiratory-virus forecasting, California snow-fed runoff prediction, CO2 mapping from geostationary satellite data, 3D solar-energy maximization, and retail forecasting.

This is where the AI-models story gets practical. One-shot prompting is a poor fit for tasks with measurable objectives and large solution spaces. Test-time compute, search, executable evaluation, and mutation are a much better fit. That is not new to people who have shipped optimizers, AutoML systems, fuzzers, compilers, or reinforcement-learning loops. What is changing is that LLMs make the search space more semantic: the system can read papers, write code, compose methods, and test variants without every move being hand-specified.

For practitioners, the action item is boring and valuable: when a workflow has a metric, do not stop at “ask the model for the answer.” Build the loop. Generate candidates, execute them, score them, mutate them, log the path, and keep the evaluator outside the model’s vibes. If the output can be tested, make the test harness the center of the product. The model should be the worker, not the judge and jury.
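A toy version of that loop fits in a few dozen lines of Python. This is a sketch under loud assumptions, not ERA’s algorithm: candidates here are plain floats rather than programs, `toy_mutate` stands in for a model proposing a variant, `toy_evaluate` stands in for an executable test harness, and best-first search is only one simple form of tree search. All function names are hypothetical.

```python
import heapq
import random
from typing import Callable, List, Tuple


def search(
    seed_candidate: float,
    mutate: Callable[[float, random.Random], float],
    evaluate: Callable[[float], float],
    budget: int = 200,
    branching: int = 4,
    rng_seed: int = 0,
) -> Tuple[float, float, List[Tuple[float, float]]]:
    """Best-first search: expand the best node so far, score children externally."""
    rng = random.Random(rng_seed)
    first_score = evaluate(seed_candidate)
    log = [(seed_candidate, first_score)]  # every evaluation, including failures
    # heapq is a min-heap, so scores are negated to pop the best node first.
    frontier = [(-first_score, seed_candidate)]
    best, best_score = seed_candidate, first_score
    evaluations = 1
    while frontier and evaluations < budget:
        _, parent = heapq.heappop(frontier)
        for _ in range(branching):
            if evaluations >= budget:
                break
            child = mutate(parent, rng)    # the "worker" proposes a variant
            child_score = evaluate(child)  # the judge lives outside the worker
            evaluations += 1
            log.append((child, child_score))
            heapq.heappush(frontier, (-child_score, child))
            if child_score > best_score:
                best, best_score = child, child_score
    return best, best_score, log


# Hypothetical toy problem: the metric peaks at x = 3.
def toy_evaluate(x: float) -> float:
    return -(x - 3.0) ** 2


def toy_mutate(x: float, rng: random.Random) -> float:
    return x + rng.gauss(0.0, 0.5)


best, best_score, log = search(0.0, toy_mutate, toy_evaluate, budget=200)
print(f"best={best:.3f} score={best_score:.4f} evaluations={len(log)}")
```

Swapping the float for generated code and `toy_evaluate` for a real benchmark changes the cost of each step, but not the shape: the score comes from the harness, and the log preserves the path as well as the result.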

Domain tools are the moat, and also the liability surface

Google’s Science Skills bundle points at the next layer of competition. The bundle integrates data and tools from more than 30 major life-science resources, and Google says researchers can use those skills inside agentic platforms such as Antigravity for workflows like structural bioinformatics and genomic analysis. This is where agent platforms move from demo to infrastructure: they need authenticated tools, trusted databases, reproducible environments, citations, audit logs, and guardrails around dangerous or dual-use work.

That is also where teams should be skeptical. A model with access to domain tools is not automatically grounded; it is merely better connected. Grounding requires provenance, versioning, conflict handling, and explicit uncertainty. If a biological database changes, if a paper is retracted, if a tool returns a warning, or if two sources disagree, the agent needs to surface that mess instead of flattening it into a confident paragraph. Scientific software has spent decades learning that reproducibility is a systems property. Agents do not get an exemption because the demo has citations.
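One way to make that concrete is to attach provenance to every tool result and refuse to merge disagreements silently. The Python sketch below is an assumption-laden illustration, not part of Science Skills: the `Evidence` record, the `summarize` function, and the `db_a` and `db_b` sources are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Evidence:
    """One tool or database result, with enough provenance to re-check it later."""
    source: str                    # which tool or database answered
    version: str                   # release or snapshot identifier
    claim_key: str                 # what the result is about
    value: str                     # what the source said
    warning: Optional[str] = None  # any warning the tool returned


def summarize(evidence: List[Evidence]) -> Dict[str, dict]:
    """Group results by claim, keeping warnings and flagging disagreements."""
    report: Dict[str, dict] = {}
    for e in evidence:
        entry = report.setdefault(
            e.claim_key, {"values": {}, "warnings": [], "conflict": False}
        )
        # Record which versioned source supports each distinct value.
        entry["values"].setdefault(e.value, []).append(f"{e.source}@{e.version}")
        if e.warning:
            entry["warnings"].append(f"{e.source}: {e.warning}")
    for entry in report.values():
        # More than one distinct value means the sources disagree.
        entry["conflict"] = len(entry["values"]) > 1
    return report


# Hypothetical results from two made-up sources.
results = [
    Evidence("db_a", "2026-01", "protein_x.function", "kinase"),
    Evidence("db_b", "v12", "protein_x.function", "phosphatase",
             warning="low-confidence annotation"),
    Evidence("db_a", "2026-01", "protein_x.length", "412"),
]

report = summarize(results)
for key, entry in report.items():
    status = "CONFLICT" if entry["conflict"] else "ok"
    print(key, status, entry["values"], entry["warnings"])
```

The point of the structure is that a conflict or a warning becomes a field the interface has to display, not a detail the model is free to smooth over.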

Google appears aware of the risk. The company says Co-Scientist underwent internal and external safety evaluations, including CBRN misuse evaluations, and that it developed custom safety classifiers to flag unethical research goals and unsafe information. That is necessary, not sufficient. The more capable these systems become at literature search, target discovery, code generation, and lab-adjacent planning, the more they need policy controls that are inspectable by institutions, not just trust in a vendor’s launch post.

The near-term version of this technology is not “AI replaces scientists.” It is more like “AI turns parts of scientific work into reviewable queues.” That is still a big deal. Literature review, hypothesis search, computational experiment generation, and method optimization are expensive bottlenecks. If agents can compress those cycles while keeping citations, code, intermediate failures, and human approvals visible, research teams get leverage without pretending the model discovered truth by itself.

For software teams outside science, Gemini for Science is worth watching because it shows the shape of serious agent products. The winners will not be the ones with the most anthropomorphic assistant name. They will be the ones that encode a domain’s real workflow: proposal, critique, ranking, execution, measurement, audit, and approval. Google’s strongest claim here is not that Gemini can wear a lab coat. It is that agents become useful when they are forced to behave less like chatbots and more like disciplined research systems.

That is the take. Models matter, but in expert domains the model is only one component. The durable advantage is the loop around it: the tools it can call, the evidence it can cite, the experiments it can run, the failures it records, and the human review it cannot skip. Ship the loop. Keep the lab coat optional.

Sources: Google, Nature Co-Scientist paper, Nature ERA paper, Google Research, Google DeepMind
