claude-code

Claude Agent SDK Python Now Tests Asyncio and Trio — Reliability Starts Below the API Surface

Anatoliy Kolodkin

05 Jun 2026 • 4 min read

Agent reliability usually fails below the layer people are watching. The model gets the blame, the prompt gets rewritten, and somewhere underneath, a cancelled task leaks, a background thread keeps running, a subprocess does not clean up, or a session store writes half of what the rest of the system thinks it wrote.

That is why Claude Agent SDK Python 0.2.91 is more interesting than its changelog sounds. Anthropic moved its async test suite from pytest-asyncio to AnyIO’s pytest plugin so the same async tests now run under both asyncio and trio. It also updates the bundled Claude CLI to 2.1.165, keeping the SDK aligned with the day’s Claude Code runtime. No fireworks. Just the kind of runtime discipline teams need before embedding agents into long-running Python services.

The release was published June 5 at 05:58 UTC, with PyPI showing claude-agent-sdk 0.2.91 uploaded a minute earlier. The repository had roughly 7,206 stars, 1,074 forks, and 307 open issues during research — in other words, this is not a toy dependency quietly fixing its private test harness. It is part of the Python surface area for developers building Claude Code-style agents outside the terminal.

Async correctness is not test-suite housekeeping

The PR text says the project now runs 212 async tests × 2 backend variants, with verification reporting 966 passed and 5 skipped. Test annotations move from @pytest.mark.asyncio to @pytest.mark.anyio. Fixtures switch from pytest_asyncio.fixture to normal pytest.fixture. Asyncio-specific primitives are replaced with AnyIO equivalents: task groups, fail_after, to_thread.run_sync(abandon_on_cancel=True), and get_cancelled_exc_class.

If that reads like internal cleanup, zoom out. An agent SDK streams events, supervises subprocesses, launches tools, forwards stderr, stores sessions, handles user aborts, talks to MCP servers, and coordinates cleanup when something fails halfway through. Those are exactly the surfaces where event-loop assumptions become bugs. Cancellation timing differs. Task lifetime differs. Cleanup semantics differ. Thread abandonment differs. A library can claim AnyIO compatibility while quietly relying on asyncio behavior until Trio forces the assumption into daylight.

That is the useful part of this release: Anthropic is no longer only testing the common backend. It is testing the abstraction it claims to use. In agent systems, that is not academic. A CLI user can restart a broken session. A service embedding the SDK needs cancellation to work when a request times out, a tool hangs, a user aborts, or a deployment rolls while an agent is mid-run. “It usually exits eventually” is not a reliability plan.

The S3 session-store example makes the point concrete. boto3 is synchronous, so async code has to run it in worker threads. The exact cancellation behavior matters: should the thread be abandoned, awaited, shielded, or allowed to wedge the run? The example moved from asyncio.to_thread, Semaphore, and gather to AnyIO equivalents. That is not just syntactic cleanup; it makes the concurrency contract portable and explicit.

The release is also honest about the boundaries. Redis and Postgres example tests remain pinned to asyncio because redis.asyncio and asyncpg are asyncio-only. That is the right answer. Backend-agnostic at the top level does not magically make every dependency backend-agnostic underneath. Serious runtime engineering is partly knowing where the abstraction stops.

The agent-runtime lesson is bigger than Python

The broader lesson for builders is to test the failure paths your agent system actually uses. If your internal runtime claims to be AnyIO-compatible, run tests under multiple backends. If it only supports asyncio, say so and pin it. Add tests for cancellation during long-running tools, MCP calls, subprocess output streaming, session-store writes, and user aborts. Check that background work is cleaned up. Check that partial session state does not get treated as complete. Check that your wrappers do not smuggle asyncio.create_task() into code that is otherwise supposed to honor structured concurrency.

This is where the gap between demos and infrastructure lives. A demo asks whether the agent can solve the task. A production runtime asks what happens when the task is cancelled at 73%, the network flakes, the storage write stalls, the user closes the browser, the MCP server returns malformed output, and the surrounding service still needs to serve other tenants. The model can be brilliant and the system can still be bad if cancellation and cleanup are sloppy.

The bundled Claude CLI update to 2.1.165 also matters. The same day’s Claude Code runtime release was a terse reliability patch, sitting on top of the larger 2.1.163 governance changes around managed version gates, plugin inventory, hook behavior, session IDs, and permission/runtime fixes. Keeping the Agent SDK aligned with that runtime means Python embedders get the current CLI behavior instead of drifting behind the terminal product. Version drift between local Claude Code, CI actions, and embedded SDK runtimes is how “works in my terminal” becomes a support queue.

There was no broad social reaction within the release window, which is exactly what you should expect. Async backend coverage does not get applause. It prevents the kind of bug that only becomes visible when a customer cancels a run at the wrong millisecond under load. The absence of hype is not absence of value; it is just the market being predictably bad at rewarding maintenance before it becomes an outage.

For teams embedding Claude agents in Python, the action list is practical. Upgrade to claude-agent-sdk==0.2.91 if you depend on the bundled CLI or AnyIO behavior. Mirror the test strategy in your own wrapper layer. Add backend-specific tests if your storage clients are asyncio-only. Treat cancellation as a first-class feature, not an exception path. And if your agent can run tools with side effects, make cleanup behavior part of the approval and audit story.

The verdict: this is a reliability release, not a feature release. Good. The future of coding agents is not just larger context windows and more autonomous workflows. It is boring async correctness under cancellation, storage, tools, streaming, and subprocess supervision. That is where agent systems either earn trust or quietly leak it.

Sources: Claude Agent SDK Python release, PR/issue #1021, commit 113f359, PyPI package metadata, Claude Code v2.1.165 release

Async correctness is not test-suite housekeeping

The agent-runtime lesson is bigger than Python

Sign up for more like this.