AG2 0.13.2 Makes Agent Evaluation a First-Class Runtime Concern, Not a Notebook Ritual
Agent evals are finally moving out of the notebook and into the runtime. That is the real story in AG2 0.13.2, a release whose version number undersells the point: the AutoGen successor is starting to treat quality, cost, traces, provider compatibility, and tool safety as one production problem instead of five separate chores.
The headline feature is the beta evaluation framework under autogen.beta.eval. It does the thing agent teams eventually need and too few teams build early: run an agent over a dataset, capture what actually happened, score the recorded trace, and compare results over time. That trace-first framing matters. Agents do not just answer questions; they choose tools, retry, fetch context, spend tokens, emit intermediate messages, and sometimes succeed for reasons you would never allow in production. A final answer can look fine while the run behind it is a compliance, cost, or reliability incident waiting for traffic.
AG2’s docs draw the right boundary between tests and evals. Tests assert correctness. Evals measure behavior across properties: did the agent call the expected tool, did the final answer mention the required thing, did the run stay under budget, did a release regress a task family that used to pass? That is not academic terminology. It is the difference between “the weather agent returned Boston weather once” and “the weather agent reliably calls get_weather, stays within budget, and does not start improvising another API call after a model update.”
The trace is the unit of quality now
The most useful design choice is that AG2 grades traces, not merely transcripts. The quick-start example uses a two-task weather dataset, a custom @scorer, and a prebuilt tool_called("get_weather") scorer. The sample output includes token counts (input=423, output=78, total=501) because cost is no longer a billing afterthought. In agent systems, cost is behavior. A run that reaches the correct answer by wandering through ten unnecessary tool calls has failed an engineering property even if the final prose is correct.
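To make "grade the trace, not the transcript" concrete, here is a minimal sketch in plain Python. It does not use AG2's API: the Trace and ToolCall types and the called_tool and max_tool_calls helpers are hypothetical stand-ins, and only the token counts are borrowed from the quick-start output quoted above.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    name: str
    arguments: dict = field(default_factory=dict)


@dataclass
class Trace:
    """Hypothetical recorded run; not AG2's actual trace type."""
    final_answer: str
    tool_calls: list[ToolCall]
    input_tokens: int
    output_tokens: int

    @property
    def total_tokens(self) -> int:
        return self.input_tokens + self.output_tokens


def called_tool(name: str):
    """Build a scorer that passes when the trace contains a call to `name`."""
    def score(trace: Trace) -> bool:
        return any(call.name == name for call in trace.tool_calls)
    return score


def max_tool_calls(limit: int):
    """Build a scorer that fails runs that wander through too many tool calls."""
    def score(trace: Trace) -> bool:
        return len(trace.tool_calls) <= limit
    return score


# Illustrative trace; the answer text is made up for the example.
trace = Trace(
    final_answer="Cloudy in Boston.",
    tool_calls=[ToolCall("get_weather", {"city": "Boston"})],
    input_tokens=423,
    output_tokens=78,
)

# Each property is scored against the recorded run, not just the final prose.
scores = {
    "called_get_weather": called_tool("get_weather")(trace),
    "at_most_two_calls": max_tool_calls(2)(trace),
    "under_600_tokens": trace.total_tokens <= 600,
}
```

The point of the shape is that the final answer is only one field among several. A scorer that never looks at final_answer can still fail the run.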
The API surface reinforces that. run_agent() writes a JSON run artifact and returns RunResult helpers for summary(), pass_rate(), score_stats(), value_counts(), tag slicing, aggregates, per-task traces, budget status, and diff(load_run(...)) with a .regressions view. That last piece is where this becomes operationally interesting. Agent teams need to compare behavior across releases the way backend teams compare latency, error rates, and test failures. “It feels better” is not a release criterion. “Pass rate improved, no high-risk task regressed, and token spend stayed under threshold” is much closer.
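The regression idea can be sketched without the framework. The snippet below is an assumption-laden illustration, not AG2's diff() implementation: runs are reduced to hypothetical task-id-to-pass maps, and find_regressions and pass_rate are names invented for the example.

```python
def pass_rate(results: dict[str, bool]) -> float:
    """Fraction of tasks that passed; 0.0 for an empty run."""
    return sum(results.values()) / len(results) if results else 0.0


def find_regressions(baseline: dict[str, bool], candidate: dict[str, bool]) -> list[str]:
    """Task ids that passed in the baseline but fail (or are missing) in the candidate."""
    return sorted(
        task_id
        for task_id, passed in baseline.items()
        if passed and not candidate.get(task_id, False)
    )


# Hypothetical task ids and outcomes for two releases.
baseline = {"weather-boston": True, "weather-paris": True, "weather-unknown-city": False}
candidate = {"weather-boston": True, "weather-paris": False, "weather-unknown-city": True}

regressions = find_regressions(baseline, candidate)
```

Both runs here have the same pass rate, yet one task that used to pass now fails. That is why a per-task regression view matters more than the aggregate alone.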
AG2 also supports deterministic CI through TestConfig cassettes, so pre-merge checks do not require live model calls every time. That is an important concession to reality. Live LLM evals are expensive, slow, and noisy; deterministic replay is how you make quality gates part of normal engineering instead of a ritual performed before a big launch. The right workflow is not to replace live evals entirely, but to layer them: deterministic cassettes for fast regression checks, scheduled live evals for provider drift, and production trace grading for what users actually do.
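For readers who have not used record-and-replay testing, the mechanism looks roughly like this. The Cassette class below is a hypothetical illustration of the general technique, not AG2's TestConfig, and fake_model stands in for a live provider call.

```python
import hashlib
import json


class Cassette:
    """Hypothetical record/replay store keyed by a hash of the request."""

    def __init__(self, recordings=None):
        self.recordings = dict(recordings or {})

    @staticmethod
    def key(request: dict) -> str:
        # Canonical JSON so that key order does not change the hash.
        canonical = json.dumps(request, sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def call(self, request: dict, live_call=None):
        k = self.key(request)
        if k in self.recordings:
            return self.recordings[k]  # replay: no network, no tokens
        if live_call is None:
            raise LookupError("no recording for request; refusing to call a live model")
        response = live_call(request)  # record mode: hit the model once
        self.recordings[k] = response
        return response


def fake_model(request: dict) -> dict:
    """Stand-in for a live model call."""
    return {"content": "stub reply to: " + request["prompt"]}


cassette = Cassette()
request = {"model": "example-model", "prompt": "Weather in Boston?"}
recorded = cassette.call(request, live_call=fake_model)  # first run records
replayed = cassette.call(request)                        # CI run replays
```

Failing loudly on an unrecorded request is the useful design choice: a pre-merge check that silently falls back to a live call is no longer deterministic.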
Interop is the quiet power move
The framework can grade existing traces through evaluate_traces, including OpenTelemetry GenAI semantic-convention spans and OpenInference spans. That is the feature platform teams should notice. Framework-native evals are useful, but production estates are messy: one team uses AG2, another uses LangGraph, another has a homegrown orchestration layer wrapped around OpenTelemetry, and everyone eventually wants a single answer to “did this agent get worse?”
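One way to picture cross-framework grading is a small normalization step that maps both span dialects onto the same fields before any scorer runs. This sketch is not evaluate_traces and makes no claim about how AG2 does it. The attribute keys reflect my reading of the OpenTelemetry GenAI and OpenInference conventions and should be checked against the current specs; normalize_span and summarize are hypothetical helpers.

```python
def normalize_span(attributes: dict) -> dict:
    """Map one span's attributes to {'tool', 'input_tokens', 'output_tokens'}."""
    if "openinference.span.kind" in attributes:
        # OpenInference-style span (attribute keys assumed; verify against the spec).
        is_tool = attributes["openinference.span.kind"] == "TOOL"
        return {
            "tool": attributes.get("tool.name") if is_tool else None,
            "input_tokens": attributes.get("llm.token_count.prompt", 0),
            "output_tokens": attributes.get("llm.token_count.completion", 0),
        }
    # Otherwise treat it as an OpenTelemetry GenAI-style span (keys assumed).
    is_tool = attributes.get("gen_ai.operation.name") == "execute_tool"
    return {
        "tool": attributes.get("gen_ai.tool.name") if is_tool else None,
        "input_tokens": attributes.get("gen_ai.usage.input_tokens", 0),
        "output_tokens": attributes.get("gen_ai.usage.output_tokens", 0),
    }


def summarize(spans: list[dict]) -> dict:
    """Collapse a list of span attribute dicts into one gradable summary."""
    normalized = [normalize_span(s) for s in spans]
    return {
        "tools": [n["tool"] for n in normalized if n["tool"]],
        "total_tokens": sum(n["input_tokens"] + n["output_tokens"] for n in normalized),
    }


otel_spans = [
    {"gen_ai.operation.name": "chat",
     "gen_ai.usage.input_tokens": 423, "gen_ai.usage.output_tokens": 78},
    {"gen_ai.operation.name": "execute_tool", "gen_ai.tool.name": "get_weather"},
]
openinference_spans = [
    {"openinference.span.kind": "LLM",
     "llm.token_count.prompt": 423, "llm.token_count.completion": 78},
    {"openinference.span.kind": "TOOL", "tool.name": "get_weather"},
]
```

Once both inputs summarize to the same structure, the same tool-call and budget scorers can run over either source.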
If the same scorers can evaluate AG2 traces and captured traces from elsewhere, evals stop being a local developer convenience and become a governance layer. You can imagine a release pipeline where a candidate agent version is checked against curated datasets, then a nightly job samples production traces by task tag, grades tool-call correctness and budget behavior, and flags regressions before the help desk becomes your monitoring system. That is the grown-up version of agent evaluation: not leaderboard theater, but behavioral accounting.
There is a practical warning here for teams building their own eval stack. Do not build only final-answer graders. Add scorers for tool choice, call count, retry count, structured-output validity, refusal behavior, token budget, elapsed time, and policy-sensitive actions. If your agent can write files, open tickets, send messages, query databases, or call MCP tools, you need to grade the path, not just the destination. The trace is where the blast radius lives.
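A few of those path-level checks fit in one small function. The sketch below is a hand-rolled illustration, not part of any framework: the tool names, the SENSITIVE_TOOLS set, and the use of repeated calls as a rough proxy for retries are all assumptions made for the example.

```python
import json
from collections import Counter

# Hypothetical names for tools with side effects worth flagging.
SENSITIVE_TOOLS = {"write_file", "send_message", "open_ticket"}


def score_path(tool_calls: list[str], final_output: str, max_calls: int = 5) -> dict:
    """Grade how the run got there, independent of whether the answer reads well."""
    counts = Counter(tool_calls)
    try:
        json.loads(final_output)
        valid_json = True
    except json.JSONDecodeError:
        valid_json = False
    return {
        "call_count_ok": len(tool_calls) <= max_calls,
        # Repeated calls to the same tool, used here as a crude retry signal.
        "repeat_calls": sum(n - 1 for n in counts.values()),
        "valid_json_output": valid_json,
        "sensitive_actions": sorted(set(tool_calls) & SENSITIVE_TOOLS),
    }


report = score_path(
    tool_calls=["search", "search", "query_db", "send_message"],
    final_output='{"status": "done"}',
)
```

A real retry counter would look at error-then-repeat sequences with matching arguments. Counting repeats by name is only a starting point.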
The security fixes are part of the same maturity curve
AG2 0.13.2 also ships security hardening that should not be treated as a footnote. The release tells all users on prior versions to upgrade for a ContextExpression code-injection fix, tracked as GHSA-9fvw-gr53-m7fw. PR #2891 escaped string values before eval. PR #2689 blocked shell operators in readonly/allowed-command mode. Those are exactly the classes of defects that show up when agent frameworks get close to execution paths: expression languages, shell-ish tools, and “readonly” modes that turn out to be more aspirational than enforceable.
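The general bug class is easy to reproduce in a few lines. This is not the ContextExpression code or the actual patch, just a generic sketch of what happens when a string value is interpolated into an expression before eval, and two ways to avoid it. The function names and payload are invented for the illustration.

```python
def build_unsafe(value: str) -> str:
    # Interpolates the raw value between quotes: the value can close the quote.
    return f"'{value}' == 'admin'"


def build_escaped(value: str) -> str:
    # repr() emits a properly quoted and escaped Python string literal.
    return f"{value!r} == 'admin'"


# Emptying builtins narrows what eval can reach; it is NOT a sandbox.
SAFE_GLOBALS = {"__builtins__": {}}

payload = "x' == 'x' or 'admin"

unsafe_result = eval(build_unsafe(payload), SAFE_GLOBALS, {})
escaped_result = eval(build_escaped(payload), SAFE_GLOBALS, {})

# Better still: never splice the value into the source; pass it as a variable.
bound_result = eval("value == 'admin'", SAFE_GLOBALS, {"value": payload})
```

In the unsafe build the payload rewrites the condition so it evaluates to true. With escaping or variable binding it is compared as an ordinary string and the check fails as intended.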
The practitioner move is simple: upgrade, then audit anything that relied on permissive behavior. If an agent routes user-controlled or model-controlled content into expression evaluation, assume a hostile input will eventually find it. If a shell tool claims to be readonly or allowlisted, test operators, redirection, pipes, command substitution, and chained commands. Do not trust the label. Trust the enforcement.
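A probe list for that audit might look like the sketch below. is_allowed is a hypothetical, deliberately conservative stand-in for whatever check your shell tool applies; it is not AG2's implementation, and the allowlist is made up. It assumes the accepted command is later executed as an argument list rather than through a shell.

```python
import shlex

ALLOWED_COMMANDS = {"ls", "cat", "grep"}  # hypothetical allowlist
# Operators, redirection, pipes, substitution, grouping, and newlines.
SHELL_METACHARACTERS = set(";&|<>`$(){}\n")


def is_allowed(command: str) -> bool:
    """Reject any command containing shell metacharacters or a non-allowlisted program."""
    if any(ch in SHELL_METACHARACTERS for ch in command):
        return False
    try:
        argv = shlex.split(command)
    except ValueError:  # e.g. unbalanced quotes
        return False
    return bool(argv) and argv[0] in ALLOWED_COMMANDS


# Probes that a "readonly" mode should refuse even though they start with ls or cat.
HOSTILE = [
    "ls; rm -rf /tmp/x",            # chained command
    "cat notes.txt > /etc/passwd",  # redirection
    "cat notes.txt | sh",           # pipe
    "ls $(whoami)",                 # command substitution
    "ls `whoami`",                  # backtick substitution
    "ls && curl example.com",       # logical operator
]

rejected = [cmd for cmd in HOSTILE if not is_allowed(cmd)]
```

This check rejects metacharacters even inside quotes, which produces false positives but keeps the rule easy to reason about. It also does nothing about dangerous arguments to allowed programs, so treat it as a floor rather than a complete policy.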
The provider additions (V2 Anthropic and Bedrock clients), the TinyFish search/fetch tools, and LlamaIndex 0.13 workflow-agent support are also part of the runtime story. Multi-provider support only helps if behavior remains measurable across providers. Search and fetch tools only help if their use can be traced and scored. Workflow-agent composition only helps if it does not make debugging impossible. AG2 appears to be moving toward a stack where extensibility and evaluation advance together, which is the right pairing. Adding more ways for agents to act without adding more ways to grade and constrain them is how demos become incident reports.
There is one caveat: this is still beta, and AG2 is in the middle of a larger transition. The roadmap says the current autogen.agentchat line is headed for maintenance mode, autogen.beta is intended to become the official v1.0 API, v0.13 is the transition period, and v0.14 is planned as the final current-line release before the original codebase moves to an ag2-original branch. That is a lot of motion. Adopt the eval framework where it pays off immediately, but do not spray beta APIs across a large estate without a migration plan.
The useful path is targeted: put AG2 evals around high-value agents, provider-comparison runs, tool-call correctness checks, and budget gates. Start with a small curated dataset per workflow, tag tasks by risk and intent, and make regressions visible in CI. Then add trace grading for real runs. If a model upgrade improves answer quality but doubles tokens or starts calling the wrong tool 8% of the time, your release process should know before your cloud bill or your users do.
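A release gate built on tags and a token threshold can be very small. The sketch below is a hypothetical illustration in plain Python: the result records, the "high-risk" tag, and the 1.25 token ratio are assumptions chosen for the example, not AG2 defaults.

```python
from collections import defaultdict


def pass_rate_by_tag(results: list[dict]) -> dict[str, float]:
    """results: dicts with 'tags' (list of str) and 'passed' (bool)."""
    totals, passes = defaultdict(int), defaultdict(int)
    for result in results:
        for tag in result["tags"]:
            totals[tag] += 1
            passes[tag] += int(result["passed"])
    return {tag: passes[tag] / totals[tag] for tag in totals}


def release_gate(results, baseline_tokens, candidate_tokens, max_token_ratio=1.25):
    """Return a list of blocking reasons; an empty list means the gate passes."""
    rates = pass_rate_by_tag(results)
    failures = []
    if rates.get("high-risk", 1.0) < 1.0:
        failures.append("high-risk task failed")
    if candidate_tokens > baseline_tokens * max_token_ratio:
        failures.append("token spend over threshold")
    return failures


# Hypothetical candidate run: quality holds on risky tasks, but tokens doubled.
results = [
    {"tags": ["high-risk", "refund"], "passed": True},
    {"tags": ["low-risk", "faq"], "passed": False},
]
blockers = release_gate(results, baseline_tokens=1000, candidate_tokens=2000)
```

Here the gate blocks on cost alone, which is the scenario the paragraph above describes: the answers look fine and the bill does not.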
AG2 0.13.2 is not flashy in the way agent launches usually try to be flashy. Good. The useful agent-framework work now looks like traces, budgets, deterministic CI, provider clients, and security hardening landing in the same release. That is what maturity looks like: less “watch these agents talk to each other,” more “prove this agent still behaves when the runtime changes.” LGTM, with the usual beta-API caution sticker still attached.
Sources: AG2 release notes, AG2 evaluation docs, AG2 roadmap, PR #2797, PR #2905, PR #2891, PR #2689