Pydantic AI 1.97 Makes Failure Evals and MCP Boundaries the Pre-v2 Story

Pydantic AI v1.97.0 is not the kind of release that gets a keynote slide. Good. The agent framework market has been drowning in demos that make toy agents look inevitable and production agents look easier than they are. This release points in the opposite direction: fewer ambiguous surfaces, more explicit failure handling, and a cleaner MCP boundary before the v2 migration window opens.

The release, published May 15 at 22:15 UTC, adds OnlineEvaluator.run_on_errors, introduces a new MCPToolset built on fastmcp-slim[client], promotes the builder-style pydantic_graph API out of beta, normalizes Google provider IDs, and marks several older paths for deprecation. That sounds like housekeeping. In practice, it is the work that separates a framework people demo from one teams can operate.

The useful feature is scoring the failures, not celebrating the happy path

The most important change is probably the least flashy: OnlineEvaluator.run_on_errors. When enabled, evaluators can run after an agent call raises an exception, with the exception passed through as EvaluatorContext.output. The exception still propagates after evaluator dispatch, which is exactly the right design. Evaluation should observe the failure path, not accidentally turn it into a success path.
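The observe-then-propagate behavior can be sketched in plain Python. This is not Pydantic AI's implementation and does not use its API; run_with_error_evals and the Evaluator alias are hypothetical names, and the only thing the sketch borrows from the release is the idea that the exception stands in for the output and is re-raised afterward.

```python
import asyncio
from typing import Any, Awaitable, Callable, List

# Hypothetical stand-in for an evaluator: it receives whatever the run produced,
# which is either a normal output or the exception that was raised.
Evaluator = Callable[[Any], None]


async def run_with_error_evals(
    task: Callable[[], Awaitable[Any]],
    evaluators: List[Evaluator],
) -> Any:
    """Run `task`, hand its output or its exception to evaluators, then re-raise."""
    try:
        output = await task()
    except Exception as exc:
        for evaluate in evaluators:
            evaluate(exc)  # the exception is passed in place of the output
        raise  # the failure still propagates to the caller
    for evaluate in evaluators:
        evaluate(output)
    return output
```

The bare raise is the design point: the evaluators see the failure, but the caller's error handling is unchanged.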

That matters because production agent systems rarely fail as clean “wrong answer” events. They fail through tool exceptions, timeout paths, malformed provider responses, schema mismatches, partial streams, bad retries, and permission edges that were not in the demo script. Most eval setups over-index on successful runs because those are easy to score. You get a response, compare it against a rubric, maybe let an LLM judge it, and log a green-ish number. The failures become separate logging plumbing, if they get captured at all.

Pydantic AI’s move is small but correct: make the failure itself evaluable. Teams should use this to build classifiers for their own agent failure modes: transient provider error, unsafe tool request, user input outside policy, schema drift, bad retrieval, missing credential, and model refusal. The point is not to hide exceptions under analytics. The point is to turn yesterday’s broken run into tomorrow’s regression case.
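A failure classifier of this kind can start very small. The sketch below is framework-independent: FailureMode and classify_failure are hypothetical names, the mapping from built-in exception types to failure modes is an assumption that a real system would replace with its own provider and tool exceptions, and it covers only a few of the modes listed above.

```python
import json
from enum import Enum


class FailureMode(str, Enum):
    TRANSIENT_PROVIDER_ERROR = "transient_provider_error"
    MISSING_CREDENTIAL = "missing_credential"
    SCHEMA_DRIFT = "schema_drift"
    UNKNOWN = "unknown"


def classify_failure(exc: BaseException) -> FailureMode:
    """Map an exception to a coarse failure mode (illustrative rules only)."""
    if isinstance(exc, (TimeoutError, ConnectionError)):
        return FailureMode.TRANSIENT_PROVIDER_ERROR
    if isinstance(exc, PermissionError):
        return FailureMode.MISSING_CREDENTIAL
    if isinstance(exc, (json.JSONDecodeError, KeyError, TypeError)):
        # Malformed or reshaped payloads often surface as one of these.
        return FailureMode.SCHEMA_DRIFT
    return FailureMode.UNKNOWN
```

An evaluator that receives the exception could record the returned label, and a growing UNKNOWN bucket then shows where the taxonomy needs another rule.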

The PR details are also a useful signal. The change shipped with tests across decorator and agent-capability paths, and its test plan covered the full tests/evals/ suite (450 tests), the doc examples (26 tests), lint, and pyright. Framework quality is not just API design; it is whether maintainers treat the test plan as part of the product. Here, they did.

MCP is a trust boundary, not a convenience wrapper

The new pydantic_ai.mcp.MCPToolset is the strategic part of the release. It accepts a URL, script path, FastMCP server, MCPConfig dictionary, ClientTransport, or pre-built FastMCP client. More importantly, it exposes the knobs teams actually need: auth, progress and message handlers, roots, TLS verification, timeout controls, retries, tool error behavior, tool and resource caching, process_tool_call, return schema inclusion, and sampling_model shortcuts.

That is the right abstraction level. MCP is often presented as “USB-C for AI tools,” which is useful marketing and incomplete engineering. In production, an MCP client is a security and reliability boundary. It decides which tools are visible, what schemas are exposed, how errors behave, whether results are cached, how auth is passed, and whether the caller can verify the server it is talking to. Those are not transport details. They are policy details.

The release deprecates legacy MCPServer* and FastMCPToolset paths while keeping them working through the 1.x line. The PR calls this “Phase A” of the v2 MCP overhaul, which is the right migration posture: introduce the recommended surface early, warn clearly, avoid breaking production users before they have somewhere better to land. Frameworks get deprecations wrong all the time by either moving too slowly or burning users with abrupt cuts. This is closer to how it should be done.

Practitioners should not treat this as optional cleanup. If your Pydantic AI stack uses MCP, start migrating to MCPToolset in staging now. Pay attention to verification defaults, cache behavior, retries, and tool error semantics. Add logging around process_tool_call if you need auditability. The cost of doing this before v2 is measured in a few integration tests. The cost of waiting is discovering your tool boundary during an incident.
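An audit hook for tool calls might look like the sketch below. It assumes the hook keeps the shape documented for the existing MCP server classes, a coroutine taking (ctx, call_tool, name, tool_args) where call_tool accepts a name, arguments, and optional metadata; whether MCPToolset uses exactly this signature should be checked against the release docs. The function name audited_tool_call and the logger name are hypothetical.

```python
import asyncio
import logging
from typing import Any, Awaitable, Callable, Dict, Optional

logger = logging.getLogger("mcp.audit")

# Assumed shape of the callable the framework passes in: (name, args, metadata).
CallTool = Callable[[str, Dict[str, Any], Optional[Dict[str, Any]]], Awaitable[Any]]


async def audited_tool_call(
    ctx: Any,
    call_tool: CallTool,
    name: str,
    tool_args: Dict[str, Any],
) -> Any:
    """Log each tool call's start, outcome, and argument names, then pass it through."""
    # Log argument names only, so secrets in argument values stay out of the logs.
    logger.info("tool call start: %s args=%s", name, sorted(tool_args))
    try:
        result = await call_tool(name, tool_args, None)
    except Exception:
        logger.exception("tool call failed: %s", name)
        raise  # keep the toolset's own error semantics intact
    logger.info("tool call ok: %s", name)
    return result
```

If the signature matches, the function would be passed as the process_tool_call argument. Because it re-raises, it adds an audit trail without changing retry or error behavior.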

Graph APIs are becoming application code

The promotion of the pydantic_graph.beta builder API to top-level exports is another useful signal. GraphBuilder, StepNode, StepContext, Fork, Decision, Join, reducers, and TypeExpression are no longer positioned as experimental side streets. They are part of the public surface.

This is where Pydantic AI’s lane becomes clearer. LangGraph has been making the strongest argument for explicit stateful agent graphs. CrewAI optimizes for role-and-task composition. Pydantic AI is trying to make typed Python-native agents, structured outputs, eval hooks, and graph workflows feel like normal application code rather than a separate orchestration religion. That is a defensible niche, especially for teams already bought into Pydantic as a boundary layer.

The warning behavior matters too. Beta imports now emit a visible PydanticGraphDeprecationWarning. That sounds minor until you have maintained a Python estate where important deprecations disappear under default warning filters. Migration debt usually accumulates quietly. Visible warnings are annoying by design; they make the future breakage someone’s current problem.
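One way to make such warnings impossible to miss is to escalate them to errors in CI. The sketch below uses only the standard library; GraphBetaDeprecationWarning is a hypothetical stand-in for the library's warning class, whose import path is not assumed here, and legacy_import simulates code that still touches a deprecated path.

```python
import warnings
from typing import Any, Callable


class GraphBetaDeprecationWarning(DeprecationWarning):
    """Hypothetical stand-in for the library's own deprecation warning class."""


def legacy_import() -> str:
    # Simulates code that still uses a deprecated import path.
    warnings.warn(
        "beta import path is deprecated",
        GraphBetaDeprecationWarning,
        stacklevel=2,
    )
    return "ok"


def call_strict(func: Callable[[], Any]) -> Any:
    """Call `func` with the stand-in warning category escalated to an exception."""
    with warnings.catch_warnings():
        warnings.simplefilter("error", category=GraphBetaDeprecationWarning)
        return func()
```

In a pytest project the same effect is usually configured through the filterwarnings option rather than a wrapper, which turns each remaining beta import into a failing test that someone has to own.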

The provider ID normalization is in the same category. google-gla: becomes google:, and google-vertex: becomes google-cloud:. Nobody wants to spend a sprint on provider naming. But naming is API surface, and provider identifiers tend to leak into config files, traces, dashboards, allowlists, and cost attribution. Normalizing them before v2 is tedious, useful work.
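Because these identifiers leak into config files and allowlists, a small rewrite helper can be run over stored model strings during migration. This is a sketch: normalize_model_id is a hypothetical helper, the prefix mapping comes from the renames described above, and the model names in the usage lines are illustrative only.

```python
# Old prefix -> new prefix, per the renames described in the release.
PROVIDER_RENAMES = {
    "google-gla:": "google:",
    "google-vertex:": "google-cloud:",
}


def normalize_model_id(model_id: str) -> str:
    """Rewrite a legacy provider prefix; leave every other identifier untouched."""
    for old, new in PROVIDER_RENAMES.items():
        if model_id.startswith(old):
            return new + model_id[len(old):]
    return model_id


# Illustrative usage; the model names are placeholders.
print(normalize_model_id("google-gla:example-model"))
print(normalize_model_id("google-vertex:example-model"))
```

Matching on the prefix with its trailing colon avoids rewriting strings that merely contain the old name elsewhere, such as dashboard labels.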

There is also a governance angle to the deprecation of Agent.to_a2a() and the bundled fasta2a integration after fasta2a was adopted by DataLayer. Interop layers are valuable, but framework maintainers should be careful about owning adapters that move faster than the core. Shipping fewer bundled integrations can be a sign of maturity, not neglect, if the replacement path is clear.

For teams already on Pydantic AI, the action list is short: test v1.97 in staging, replace legacy MCP classes with MCPToolset, add at least one evaluator that runs on exceptions, and turn deprecation warnings into tracked migration work. For teams still choosing an agent framework, this release sharpens the Pydantic AI pitch: pick it when typed boundaries, explicit eval hooks, and Python-maintainable APIs matter more than demo-friendly orchestration metaphors.

The broader lesson is that the agent framework fight is moving away from “who can build the cutest multi-agent example” and toward “who can make failure, tools, state, and migration boring.” Pydantic AI v1.97 is boring in the right places. That is a compliment.

Sources: Pydantic AI v1.97.0 release, MCPToolset PR, error evaluator PR, pydantic_graph API PR