
Google agents-cli 0.3.0 Makes Evals the Default, Not the Afterthought

Anatoliy Kolodkin

02 Jun 2026 • 5 min read

Google’s agents-cli v0.3.0 is the kind of release most teams will underestimate because it does not promise a smarter chatbot. It changes eval plumbing. It migrates datasets. It adds commands with names like eval grade, eval submit, and eval analyze. Not exactly conference-demo material.

Good. This is the part of agentic software that decides whether anything survives contact with production.

The June 1 release rebuilds Google’s agent evaluation workflow around Vertex AI / Gemini Enterprise Agent Platform EvaluationDataset objects instead of the older ADK EvalSet format. Existing tests/eval/evalsets/*.evalset.json files are no longer read by agents-cli eval generate and related commands. The migration guide points users toward converted files under tests/eval/datasets/, while scaffold upgrade detects legacy evalsets, prints a notice, and avoids overwriting existing destinations.
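Before running the migration, it helps to know how many legacy files a repository still carries. The following is a minimal sketch, not part of agents-cli: it only uses the two directory paths named above, and the function names (find_legacy_evalsets, migration_report) are hypothetical helpers invented for illustration. It does not convert anything and makes no assumption about how converted files are named.

```python
from pathlib import Path


def find_legacy_evalsets(project_root):
    """Return legacy *.evalset.json files under tests/eval/evalsets/, sorted."""
    legacy_dir = Path(project_root) / "tests" / "eval" / "evalsets"
    if not legacy_dir.is_dir():
        return []
    return sorted(legacy_dir.glob("*.evalset.json"))


def migration_report(project_root):
    """Summarize legacy evalset files and whether the new datasets dir exists.

    Hypothetical helper: it inspects the filesystem only and never
    reads or rewrites the eval files themselves.
    """
    legacy = find_legacy_evalsets(project_root)
    datasets_dir = Path(project_root) / "tests" / "eval" / "datasets"
    return {
        "legacy_files": [p.name for p in legacy],
        "datasets_dir_exists": datasets_dir.is_dir(),
        "has_legacy_files": bool(legacy),
    }
```

Calling migration_report(".") from a project root gives a quick inventory to check against whatever the migration guide and scaffold upgrade report.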

The annoying migration is the story. Google is pushing agent developers away from local demo fixtures and toward platform-grade evaluation artifacts: datasets, traces, grading, cloud-side submissions, results retrieval, failure-mode analysis, and built-in metric discovery. The release also ships a rewritten eval skill organized around a “Quality Flywheel” workflow of dataset, generate, grade, analyze, and optimize.

The agent is not done when it answers the demo prompt

Most agent projects fail in the gap between “it worked once” and “it behaves reliably under variation.” A coding assistant can generate a plausible workflow from a hand-picked prompt. A support agent can answer the example question in the founder’s deck. A multi-agent system can pass a happy-path scripted run. None of that tells you what happens when the user asks a partial question, changes their mind on turn three, uploads a messy file, hits a permission boundary, or triggers a tool failure halfway through the task.

agents-cli v0.3.0 names the missing loop. eval dataset synthesize can generate user-simulation datasets with an LLM. eval generate runs agent inference over an EvaluationDataset and emits traces. eval grade scores those traces against built-in or custom metrics. eval submit sends an end-to-end cloud-side evaluation run to Vertex AI Eval Service. eval results fetches completed results. eval analyze looks for failure modes. eval metric list discovers built-in metrics.

That is not just CLI surface area. It is a product philosophy: evaluation should be part of the agent-building workflow, not a QA ritual bolted on after deployment. For coding teams, this matters because coding agents are increasingly being used to build other agents. If the tool that scaffolds your agent also scaffolds the evaluation loop, you are less likely to ship a charming, unreproducible demo with no regression suite. That is progress.

Schema migrations are irritating because contracts matter

The move from ADK EvalSet to Agent Platform EvaluationDataset will frustrate early adopters. Nobody likes being told their existing files are no longer read. But the direction is defensible. Local eval formats are convenient until they trap you in local-only workflows. A platform schema can represent the cases, turns, agent topology, tool traces, grading inputs, and result metadata needed for repeatable evaluation across local and cloud execution.

The migration guide’s continued-conversation support is especially important. The new eval shape can represent single-prompt cases and “N+1” conversation cases using agent_data.turns and an agent_data.agents topology map. That is not decoration. Multi-agent systems fail through attribution. The planner chooses the wrong subtask. The retriever misses context. The executor calls the wrong tool. The reviewer approves a bad patch. If your trace flattens that into one assistant voice, you cannot debug the system; you can only complain about the model.

Serious eval data needs to preserve who did what, when, with which tools, and under which topology. Otherwise failure analysis becomes folklore. One engineer says the prompt is weak. Another says the tool schema is bad. A third says the model needs more context. Everyone is guessing because the trace is not structured enough to prove anything. Google’s schema direction is a bet that agent evaluation needs to model the system, not just the final answer.
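To make the attribution point concrete, here is a small sketch of a trace that keeps "who did what, when, with which tools" as structured fields. It is illustrative only: TraceEvent and its field names are hypothetical and are not the EvaluationDataset schema or the agent_data layout, which the migration guide defines.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TraceEvent:
    """One step in a multi-agent run (hypothetical shape for illustration)."""
    turn: int            # conversation turn the step belongs to
    agent: str           # which agent in the topology acted
    action: str          # e.g. "plan", "retrieve", "tool_call", "review"
    tool: Optional[str]  # tool name when the step was a tool call
    ok: bool             # whether a grader accepted this step


def failures_by_agent(events):
    """Count failed steps per agent so blame is attributed, not guessed."""
    return Counter(e.agent for e in events if not e.ok)


def first_failure(events):
    """Return the earliest failed step by turn number, or None if all passed."""
    failed = [e for e in events if not e.ok]
    return min(failed, key=lambda e: e.turn) if failed else None
```

With events recorded this way, "the retriever failed on turn two" becomes a query result rather than an opinion. A flattened transcript cannot answer that question at all.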

Synthetic evals are a bootstrap, not a benchmark

The new dataset synthesis command is useful and easy to abuse. LLM-driven user simulation can quickly create coverage: common tasks, edge cases, follow-up turns, malformed requests, and adversarial-ish inputs. That is valuable when a team is starting from zero. Blank eval suites are where quality goes to die.

But synthetic datasets can also encode the model’s idea of user behavior rather than actual user behavior. They may overrepresent tidy prompts, underrepresent organizational mess, and miss the failure modes that cost real money: ambiguous permissions, stale tickets, weird internal acronyms, duplicate records, partial outages, and users who ask for the wrong thing confidently. A synthetic eval suite that passes beautifully can become a mirror the model holds up to itself.

The right workflow is hybrid. Use synthesis to bootstrap breadth. Seed the suite with production traces once you have them. Hand-curate high-priority regressions. Add cases for every incident, escalation, and embarrassing demo failure. Separate smoke tests from deep evals. Track model, prompt, tool-schema, and policy changes alongside eval results. And keep a small, stable “never regress this” set that reflects failures your team has actually suffered.
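One way to picture that hybrid suite is to tag each case with where it came from and which tier it belongs to, then treat the “never regress this” tier as a hard gate. The sketch below is an assumption-laden illustration: EvalCase, the tier and source labels, and the 0.9 default threshold are all invented here, not agents-cli concepts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    source: str  # "synthetic", "production", or "incident" (hypothetical labels)
    tier: str    # "smoke", "deep", or "never_regress" (hypothetical labels)


def select_suite(cases, tier):
    """Pick the cases for one tier, e.g. a fast smoke run."""
    return [c for c in cases if c.tier == tier]


def release_gate(cases, passed_ids, min_pass_rate=0.9):
    """Return (ok, reasons). Any never-regress failure blocks the release,
    and so does an overall pass rate below the (assumed) threshold."""
    if not cases:
        return False, ["no eval cases defined"]
    reasons = []
    broken = sorted(
        c.case_id
        for c in cases
        if c.tier == "never_regress" and c.case_id not in passed_ids
    )
    if broken:
        reasons.append("never-regress failures: " + ", ".join(broken))
    pass_rate = sum(c.case_id in passed_ids for c in cases) / len(cases)
    if pass_rate < min_pass_rate:
        reasons.append(f"pass rate {pass_rate:.2f} below {min_pass_rate:.2f}")
    return (not reasons), reasons
```

The design choice worth noting is that the never-regress check is independent of the pass rate. A suite padded with easy synthetic cases cannot average away a failure your team has already paid for once.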

This applies even if you never touch Vertex AI. The practical pattern is portable: write datasets before granting autonomy, generate traces for each run, grade against explicit metrics, inspect failures before optimizing prompts, and compare regressions over time. If your agent cannot be evaluated, it cannot be responsibly delegated. “Experimental” is not a magic word that makes missing tests okay.
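The portable pattern fits in a few functions. This is a deliberately tiny sketch with no dependency on Vertex AI or agents-cli: the dataset is a list of dicts, the "agent" is any callable from prompt to response, and contains_expected is a toy stand-in for an explicit metric. All names here are hypothetical.

```python
def generate_traces(agent, dataset):
    """Run the agent over every case and record what it was asked and said."""
    return [
        {
            "case_id": case["case_id"],
            "prompt": case["prompt"],
            "response": agent(case["prompt"]),
        }
        for case in dataset
    ]


def contains_expected(response, expected):
    """Toy metric: pass when the expected text appears in the response."""
    return expected.lower() in response.lower()


def grade(traces, dataset, metric):
    """Score each trace against its case's expected value with one metric."""
    expected = {case["case_id"]: case["expected"] for case in dataset}
    return {
        t["case_id"]: metric(t["response"], expected[t["case_id"]])
        for t in traces
    }


def regressions(baseline, current):
    """Case ids that passed in the baseline run but do not pass now."""
    return sorted(
        cid for cid, ok in baseline.items() if ok and not current.get(cid, False)
    )
```

Run it once per change to the model, prompt, or tool schema, keep the graded results, and diff them with regressions before touching the prompt. A real metric will be richer than a substring check, but the order of operations is the point.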

Google is selling the control plane underneath everyone’s coding assistant

The cross-tool positioning is the other subtle move. The agents-cli docs describe the project as a CLI and skill set that can turn coding assistants into experts at creating, evaluating, and deploying AI agents on Google Cloud. The docs name Gemini CLI, Claude Code, OpenAI Codex, Google Antigravity, and more. That is smart. Google is not pretending every developer will abandon their preferred coding assistant. It is offering an eval/deploy layer those assistants can call.

That is where the agent platform fight is heading. IDE ownership matters, but the durable value may live in the control plane: datasets, evals, traces, metrics, deployment, policy, observability, and failure analysis. A team may use Codex in the terminal, Claude Code in a repo, Gemini in a cloud workflow, and Antigravity for another surface. The platform that makes all those agents measurable and deployable has leverage.

For practitioners, the takeaway is immediate. Do not build an agent and then ask how to test it. Start with the eval loop. Define the tasks the agent is allowed to perform. Capture representative conversations. Decide which metrics matter: correctness, tool accuracy, policy compliance, latency, cost, citation quality, escalation behavior, or patch quality. Add failure-mode analysis before prompt optimization, because optimizing the wrong prompt against the wrong metric is just automated self-deception.

agents-cli v0.3.0 will not get the loudest Hacker News thread. Fine. The loudest agent demos rarely include the regression suite. This release is a reminder that the winning agent stacks will not be the ones with the fanciest first answer. They will be the ones that make failure measurable before deployment.

Sources: Google agents-cli release, agents-cli docs, eval dataset migration guide, Google Cloud agent evaluation docs
