Pydantic AI 1.92 Turns Token Budgets and MCP Shutdowns Into First-Class Reliability Problems

Anatoliy Kolodkin

08 May 2026 • 4 min read

Pydantic AI 1.92.0 is the kind of agent framework release that looks underwhelming if you judge by demo value and very useful if you have ever had to restart a production service that was stuck waiting for a tool transport to die. The release adds Anthropic task budget support, runtime output_retries overrides, cancellation cleanup for streaming responses, an MCP cancel-scope fix, richer run context, and guaranteed eval teardown. Translation: fewer mysteries when long-running agents spend tokens, cancel work, share MCP connections, or shut down badly.

That is exactly where agent frameworks need to mature. The first wave of tooling optimized for expressiveness: define a tool, call a model, wire a loop, ship a demo. The production wave is about budget accounting, cancellation semantics, trace identity, teardown guarantees, and provider-specific invariants that fail before the request leaves your process. Pydantic AI has been unusually consistent about treating those as API design problems rather than documentation footnotes.

Token budgets are control plane, not accounting

The headline feature is Anthropic task budget support, implemented in PR #5140. That PR merged on May 7 and changed 9 files with 627 additions and 300 deletions. It maps typed anthropic_task_budget configuration to Anthropic’s output_config.task_budget, auto-enables the task-budgets-2026-03-13 beta, validates malformed configurations, rejects unsupported models, and rejects contradictory combinations before they turn into opaque provider-side errors.

The contradiction worth watching is task_budget.remaining combined with AnthropicCompaction. Server-side compaction can track budget itself; passing an explicit remaining value alongside it creates inconsistent accounting. Pydantic AI rejects that locally with a clear UserError instead of letting Anthropic return a server-side 400. That is a small thing with a large developer-experience payoff. Frameworks earn their keep when they translate provider footguns into local invariants.
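The idea of turning a provider footgun into a local invariant can be sketched in a few lines. This is a minimal illustration using only the standard library, not Pydantic AI's implementation: TaskBudget, ConfigError, validate_budget, and the server_side_compaction flag are hypothetical names, and the real field names and checks may differ.

```python
from dataclasses import dataclass
from typing import Optional


class ConfigError(ValueError):
    """Raised locally, before any request is sent (hypothetical stand-in for a UserError)."""


@dataclass(frozen=True)
class TaskBudget:
    # Hypothetical fields for illustration; not the real Pydantic AI or Anthropic schema.
    total: int
    remaining: Optional[int] = None


def validate_budget(budget: TaskBudget, server_side_compaction: bool) -> None:
    """Reject malformed or contradictory settings before they reach the provider."""
    if budget.total <= 0:
        raise ConfigError("total must be a positive token count")
    if budget.remaining is not None:
        if budget.remaining < 0:
            raise ConfigError("remaining must not be negative")
        if server_side_compaction:
            # Two sources of truth for the same number: fail here, not with a remote 400.
            raise ConfigError(
                "remaining cannot be combined with server-side compaction: "
                "the server already tracks the budget"
            )


# Accepted: no explicit remaining value, so server-side compaction can do the accounting.
validate_budget(TaskBudget(total=100_000), server_side_compaction=True)
```

The payoff is in where the failure surfaces: a ConfigError raised at configuration time points at the offending setting, while a provider-side 400 arrives later and with less context.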

Task budgets sound like bookkeeping until the agent is doing multi-turn reasoning, summarizing context, compacting memory, calling tools, streaming partial results, and trying to complete work across a long session. Then budget becomes part of the control plane. It determines whether an agent continues, compresses, asks for help, drops context, or fails. Leaving that as loose provider-specific JSON is a recipe for inconsistent behavior across environments.

The PR also makes an important limitation explicit: task-budget carryover for client-side compaction is out of scope. If users summarize earlier turns themselves, they must compute and pass remaining budget manually. That is not a flaw; it is honesty. Frameworks should not pretend they can infer accounting state after an application has rewritten the conversation history outside the provider’s budget model.
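For the client-side compaction case, the manual step is plain arithmetic once the application has kept its own record of usage. The sketch below assumes the application stored the token count for each turn it later summarized away; remaining_budget is a hypothetical helper, and exactly which tokens count against a task budget is defined by the provider, not by this code.

```python
def remaining_budget(total: int, used_per_turn: list[int]) -> int:
    """Tokens left after the turns that were summarized away, floored at zero."""
    return max(total - sum(used_per_turn), 0)


# Hypothetical numbers: a 50,000-token budget and three turns recorded before compaction.
carry_over = remaining_budget(50_000, [12_000, 8_500, 4_000])
print(carry_over)  # 25500
```

The application would then pass that value as the explicit remaining budget on the next request, since nothing else knows what the rewritten history originally cost.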

MCP shutdown is where the demo ends

The other important fix is PR #4514, merged on May 7, which changed 2 files with 521 additions and 75 deletions. It fixes the "attempted exit cancel scope in different task" error by running the MCP session in a dedicated task, and it closes issue #2818. That one sentence is a decent summary of why MCP integration should be treated as distributed systems plumbing, not a fancy function-call adapter.

MCP servers are becoming the tool layer for agents: internal data access, browser automation, ticket systems, docs, databases, file operations, domain-specific tools. They can also be slow, unreachable, deadlocked, shared across concurrent runs, or backed by subprocesses that do not exit cleanly. The PR history here is more informative than the release note. Follow-up commits added regression tests for parallel agent runs sharing one connection, lifespan plus sibling-task handlers sharing one connection, independent span trees, stale-runner state protection, and bounded shutdown when a transport deadlocks.

That is the real production surface. A naive fix for a cancellation bug can break shared singleton behavior. A recycled session can corrupt readiness state. A deadlocked transport can hang shutdown forever. A test can leak subprocesses. None of this shows up in the “agent calls a weather tool” demo, but it absolutely shows up in services that run all day and need to deploy without leaving zombie processes behind.

Pydantic AI’s choice to run the MCP session in a dedicated task is not just an implementation detail. It is a boundary decision. It says the framework owns the lifecycle of that connection separately from whatever individual agent run is currently using it. That makes cancellation, sharing, tracing, and teardown tractable. It also gives application developers a better place to reason about failure: the tool transport is not the same thing as the model turn.
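The dedicated-task pattern is easy to show with plain asyncio. What follows is a sketch of the general technique under stated assumptions, not the code from PR #4514: fake_session stands in for an MCP client session, and SessionOwner is a hypothetical helper. The point is that the same task enters and exits the session, while agent runs in other tasks only borrow it.

```python
import asyncio
from contextlib import asynccontextmanager
from typing import Optional


@asynccontextmanager
async def fake_session(log):
    # Stand-in for an MCP client session; records which task enters and exits it.
    log.append("enter:" + asyncio.current_task().get_name())
    try:
        yield "session"
    finally:
        log.append("exit:" + asyncio.current_task().get_name())


class SessionOwner:
    """Hypothetical helper: one dedicated task both enters and exits the session."""

    def __init__(self, log, shutdown_timeout=1.0):
        self._log = log
        self._timeout = shutdown_timeout
        self._ready = asyncio.Event()
        self._stop = asyncio.Event()
        self._task: Optional[asyncio.Task] = None
        self.session = None

    async def _run(self):
        try:
            async with fake_session(self._log) as session:
                self.session = session
                self._ready.set()
                await self._stop.wait()  # keep the session open until asked to stop
        finally:
            self._ready.set()  # unblock start() even if entering the session failed

    async def start(self):
        self._task = asyncio.create_task(self._run(), name="session-owner")
        await self._ready.wait()
        if self._task.done():
            await self._task  # re-raise a startup failure to the caller

    async def close(self):
        self._stop.set()
        try:
            # Bounded wait: on timeout, wait_for cancels the owner task.
            await asyncio.wait_for(self._task, timeout=self._timeout)
        except asyncio.TimeoutError:
            pass


async def main():
    log = []
    owner = SessionOwner(log)  # created inside the running event loop
    await owner.start()

    async def agent_run(n):
        # Each run is its own task; it borrows the session but never closes it.
        return f"run-{n}:{owner.session}"

    results = await asyncio.gather(agent_run(0), agent_run(1))
    await owner.close()
    return log, results


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Both log entries carry the name of the owner task, which is the property that scope-based cancellation depends on. Note the limit of the timeout in close: it bounds shutdown only if the owner task responds to cancellation, so a transport that blocks without yielding would still need a harder stop, such as terminating its subprocess.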

Run identity is observability glue

The release also populates RunContext with run_id, conversation_id, and metadata, cleans up streaming responses on cancellation, supports runtime output_retries override, and guarantees eval lifecycle teardown. Individually, these are modest. Together, they represent the shape of a framework that expects systems to be observed, evaluated, retried, and shut down under failure.

Run identifiers and conversation identifiers matter because agent behavior is rarely a single request. A complaint from a user may refer to a session that touched multiple model calls, tools, streams, summaries, and retries. Without stable IDs flowing through context, observability becomes log archaeology. Metadata matters for the same reason: routing, tenant identity, experiment flags, policy decisions, and evaluation context all need a place to travel without being jammed into prompts.
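One way to keep identity out of prompts and in the logs is to bind it to a logger once per run. The sketch below uses only the standard logging module; RunInfo is a hypothetical stand-in for the identity fields the release exposes on RunContext, and logger_for and the tenant metadata key are illustrative assumptions.

```python
import io
import logging
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RunInfo:
    # Stand-in for run identity; not the real RunContext class.
    run_id: str
    conversation_id: str
    metadata: dict = field(default_factory=dict)


def logger_for(run: RunInfo, stream: io.StringIO) -> logging.LoggerAdapter:
    """Return a logger that stamps every record with the run's identity."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(run_id)s %(conversation_id)s %(tenant)s %(message)s")
    )
    base = logging.getLogger("agent." + run.run_id)
    base.handlers = [handler]
    base.propagate = False
    base.setLevel(logging.INFO)
    extra = {
        "run_id": run.run_id,
        "conversation_id": run.conversation_id,
        "tenant": run.metadata.get("tenant", "-"),
    }
    return logging.LoggerAdapter(base, extra)


stream = io.StringIO()
log = logger_for(RunInfo("run-1", "conv-9", {"tenant": "acme"}), stream)
log.info("tool call started")
log.info("retrying output validation")
lines = stream.getvalue().splitlines()
```

Every line now starts with the same run and conversation identifiers, so a user complaint about one session becomes a filter expression instead of a search through unrelated output.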

The guaranteed eval teardown is also not merely hygienic. Evaluation harnesses for agents increasingly create resources: mock servers, temporary state, traces, subprocesses, network connections, model fixtures. If teardown is best-effort, your evaluation suite can become a source of flakiness that makes teams trust it less. And once teams stop trusting evals, they ship by instinct. We have enough of that already.
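The difference between best-effort and guaranteed teardown is mostly a matter of registering cleanup at the moment a resource is created. A minimal sketch with contextlib.ExitStack follows; run_case and the resource names are hypothetical, and this is not how Pydantic AI's eval lifecycle is implemented.

```python
from contextlib import ExitStack


def run_case(case, events):
    """Run one eval case; every registered teardown runs even if the case raises."""
    with ExitStack() as stack:
        events.append("start mock server")
        stack.callback(events.append, "stop mock server")

        events.append("create temp state")
        stack.callback(events.append, "delete temp state")

        return case()


def failing_case():
    raise RuntimeError("assertion failed in eval")


events = []
try:
    run_case(failing_case, events)
except RuntimeError:
    events.append("case failed")
```

Callbacks run in reverse order of registration, so the temporary state is deleted before the mock server stops, and the failure still reaches the caller afterwards. For async resources, contextlib.AsyncExitStack offers the same shape.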

For practitioners, the action list is concrete. If you use Anthropic through Pydantic AI, model task budgets explicitly and test compaction behavior rather than waiting for provider errors. If you use MCP, add regression cases for parallel agent runs, unreachable servers, cancellation, app shutdown, and deadlocked transports. If you rely on streaming, test cancellation paths, not just successful streams. If your eval suite leaves resources behind, treat that as a production bug in waiting.
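A cancellation-path test for streaming can be small. The sketch below uses a plain async generator as a stand-in for a streamed model response; stream_chunks, consume, and cancel_mid_stream are hypothetical names, and a real test would substitute the application's own stream and assert on its own cleanup signal.

```python
import asyncio


async def stream_chunks(state):
    # Stand-in for a streamed model response holding a connection open.
    state["open"] = True
    try:
        while True:
            await asyncio.sleep(0.01)
            yield "chunk"
    finally:
        state["open"] = False  # cleanup must run on cancellation too


async def consume(state, received, first_chunk):
    async for chunk in stream_chunks(state):
        received.append(chunk)
        first_chunk.set()


async def cancel_mid_stream():
    state, received = {"open": False}, []
    first_chunk = asyncio.Event()
    task = asyncio.create_task(consume(state, received, first_chunk))
    await first_chunk.wait()  # the stream is definitely active here
    was_open = state["open"]
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return was_open, state["open"], task.cancelled()


if __name__ == "__main__":
    print(asyncio.run(cancel_mid_stream()))
```

Waiting on the event rather than a fixed delay makes the test deterministic: the cancellation always lands while the stream is open, which is exactly the path a success-only test never exercises.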

Pydantic AI 1.92 is not glamorous. Good. The boring parts are now the product. Agent frameworks that win in production will be the ones that make budgets, shutdown, retry behavior, tracing, and provider quirks boring enough that application teams can build on them without rediscovering every edge at 2 a.m.

Sources: Pydantic AI v1.92.0 release, PR #5140, PR #4514, Pydantic AI documentation
