TDAD: Two Papers, One Acronym, and the Real State of Test-Driven Development with AI Coding Agents

There's a naming collision worth knowing about. In March 2026, two independent research groups published papers under the same acronym — TDAD — and both are genuinely interesting for different reasons. Together, they represent the state of the art on what it actually means to practice test-driven development with AI coding agents.

This post covers both papers, the tools built around them, and the practical TDD workflow that developers are running with Claude Code and Codex today.

TDAD Paper 1: The Regression Problem

What it is

The first paper (arxiv:2603.17973) — "Test-Driven Agentic Development: Reducing Code Regressions in AI Coding Agents via Graph-Based Impact Analysis" — addresses a specific failure mode that anyone running agents at scale has hit: the agent fixes the target issue but silently breaks three other tests in the process.

In baseline experiments on SWE-bench Verified, a vanilla agent caused 562 pass-to-pass test failures across 100 instances — an average of roughly 5.6 broken tests per instance. On one instance (astropy-13977), a single patch broke all 322 previously passing tests. On another (django-13089), TDD prompting alone turned 4 failures into 352.

These aren't edge cases. They're what capable agents do by default when they have no structural awareness of how code and tests relate to each other.

How TDAD solves it

TDAD builds an AST-derived code–test dependency graph, applies weighted impact analysis to rank which tests are most likely affected by a proposed change, and surfaces the results as a lightweight agent skill: a static test map and a 20-line instruction file.
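A toy version of the graph-building step — assuming nothing about TDAD's actual internals — maps each test file to the top-level modules it imports, so a change to a module can be traced back to the tests that exercise it:

```python
import ast

def imported_modules(source: str) -> set[str]:
    """Top-level module names imported by a piece of Python source."""
    mods = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            mods.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            mods.add(node.module.split(".")[0])
    return mods

def affected_tests(changed_module: str, test_sources: dict[str, str]) -> list[str]:
    """Tests that directly import the changed module.

    The real analysis is weighted and transitive; direct imports are
    the simplest first-order approximation.
    """
    return sorted(path for path, src in test_sources.items()
                  if changed_module in imported_modules(src))
```

The point of the sketch: this is purely static information, computable once before the session starts, which is why the runtime footprint can stay as small as grep plus pytest.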

The design is deliberately minimal. The agent needs only grep and pytest at runtime — no graph database, no MCP server, no API calls. Install it with:

pip install tdad

Run it against your codebase before a coding session to generate the test map, then feed the map to the agent as context. With the map in hand, the agent knows exactly which tests it needs to keep passing before committing a patch.

The key finding — and the counterintuitive result

Results on SWE-bench Verified with Qwen3-Coder 30B:

  • Regression rate dropped 70%: from 6.08% → 1.82% (562 → 155 P2P failures)
  • Resolution improved from 24% → 32%
  • An autonomous self-improvement loop raised resolution from 12% → 60% on a 10-instance subset with 0% regression

The counterintuitive result: TDD prompting alone — without the graph — increased regressions to 9.94%, worse than vanilla. Telling the agent "use test-driven development" without telling it which tests to check actively made things worse.

The paper's framing: agents need context (which tests to verify), not procedure (how to do TDD). Smaller models benefit far more from structural information about their codebase than from methodological instructions. For teams running TDD workflows with agents today, this is probably the most practically important finding.


TDAD Paper 2: Prompts as Compiled Artifacts

What it is

The second paper (arxiv:2603.08806) — "Test-Driven AI Agent Definition: Compiling Tool-Using Agents from Behavioral Specifications" — applies TDD principles at a completely different layer: not to the code the agent writes, but to the agent prompt itself.

The problem it targets: deploying tool-using LLM agents in production requires measurable behavioral compliance, and current practices can't provide it. Small prompt changes cause silent regressions. Tool misuse goes undetected. Policy violations surface only after deployment. Sound familiar?

The compilation pipeline

TDAD introduces three roles — all implemented as coding agents:

  • TestSmith: converts a YAML behavioral specification into executable tests (visible + hidden split)
  • PromptSmith: iteratively refines the agent prompt until visible tests pass
  • MutationSmith: generates plausible faulty prompt variants post-compilation to measure whether the test suite would actually catch them

The output is a "compiled prompt" — an agent artifact with a measurable, version-controlled behavioral contract. When requirements evolve, you run the pipeline again and get a regression safety score.
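MutationSmith's scoring step reduces to a simple ratio. A minimal sketch, assuming a test suite can be modeled as a predicate over prompts (the paper's actual harness is an agent pipeline, not a pure function):

```python
from typing import Callable, Iterable

def mutation_score(suite: Callable[[str], bool],
                   compiled_prompt: str,
                   mutants: Iterable[str]) -> float:
    """Fraction of faulty prompt variants the test suite rejects.

    suite(prompt) returns True when all behavioral tests pass. A mutant
    is 'killed' when the suite fails on it; a high score means the tests
    would catch plausible prompt regressions.
    """
    if not suite(compiled_prompt):
        raise ValueError("compiled prompt must pass its own suite")
    mutants = list(mutants)
    killed = sum(1 for m in mutants if not suite(m))
    return killed / len(mutants)
```

A low score is the useful signal: it tells you the tests were easy to satisfy without actually encoding the spec.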

Results

Evaluated on SpecSuite-Core, a benchmark of four deeply specified agents (policy compliance, grounded analytics, runbook adherence, deterministic enforcement):

  • 92% v1 compilation success with 97% mean hidden pass rate across 24 trials
  • 78% v2 hidden pass rate for evolved specifications
  • 86–100% mutation scores — the test suites catch nearly all intentionally broken prompt variants
  • 97% regression safety score when requirements change

The anti-gaming mechanisms matter here. If the only tests driving compilation are the visible ones, PromptSmith will overfit to them. The hidden test split and mutation testing are what make the compiled prompt actually generalize.
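The split itself is mechanical. One way to do it — the partitioning ratio and random strategy here are assumptions, not the paper's method:

```python
import random

def split_visible_hidden(tests: list[str], visible_frac: float = 0.6,
                         seed: int = 0) -> tuple[list[str], list[str]]:
    """Randomly partition a test suite into visible and hidden subsets.

    The prompt-refinement loop iterates only against the visible tests;
    the hidden tests measure whether the compiled prompt generalizes
    rather than overfitting to what it was tuned on.
    """
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    shuffled = tests[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * visible_frac)
    return shuffled[:cut], shuffled[cut:]
```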


The Practical TDD Stack for Claude Code and Codex

Research aside — here's what developers are actually running for TDD workflows with coding agents today.

TDD Skills (SKILL.md format)

The skills ecosystem has produced several TDD skill implementations, and they differ meaningfully in approach:

Red-Green-Refactor enforcement — The canonical pattern, formalized in Simon Willison's Agentic Engineering Patterns guide. The key discipline: confirm tests fail before implementing. "Use red/green TDD" has become widely understood shorthand for the loop — write tests first, confirm red, implement until green, refactor. Willison's framing pinpoints exactly where agents go wrong: skipping the red phase. A test that passes before you implement anything isn't testing anything.
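The red-phase gate is easy to enforce mechanically. A minimal sketch: run the freshly written tests and refuse to proceed unless they fail. In practice the command would be something like `pytest tests/test_new_feature.py`:

```python
import subprocess

def confirm_red(test_cmd: list[str]) -> bool:
    """True only if the test command fails (nonzero exit status).

    Gate the implementation step on this: a newly written test that
    already passes isn't exercising the new behavior.
    """
    result = subprocess.run(test_cmd, capture_output=True, text=True)
    return result.returncode != 0
```

If `confirm_red` returns False, the correct move is to stop and fix the tests, not to start implementing.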

Multi-agent TDD skill (glebis/claude-skills) — Separates test authoring and implementation into distinct subagents to prevent the common failure mode where single-context TDD produces tests that mirror implementation details. The test writer never sees the implementation; it only sees the spec. Includes interactive mode (pauses at each RED checkpoint for human review) and autonomous mode (/tdd --auto) for high-confidence flows.

# Interactive — pauses at each RED checkpoint
/tdd "add user authentication with JWT tokens"

# Autonomous — runs all slices, stops only on errors
/tdd --auto "add user authentication with JWT tokens"

Superpowers skill — Includes TDD as one stage in a structured lifecycle (brainstorm → TDD → debug → review). For teams that want the methodology enforced across an entire feature development arc, not just the test-writing step.

TDD Guide skill (alirezarezvani) — Focused on coverage remediation: analyzes existing test coverage, identifies gaps, and guides agents through closing them with failing-first discipline.

MCP Servers That Enable TDD Workflows

Playwright MCP — The most important MCP for end-to-end TDD. The agent can inspect your live running app, generate tests against the actual rendered DOM, then verify them end-to-end. This closes the loop that pure unit-test TDD misses: an implementation that passes its unit tests but behaves incorrectly in the browser. The Playwright MCP + webapp-testing skill combination is what teams doing full-stack TDD with agents are running.

claude mcp add playwright npx @playwright/mcp@latest

Filesystem MCP — Specifically for the test-creation phase: Claude needs to create test files in the correct directories before it writes any implementation code. Explicit Filesystem MCP access ensures the agent can write to test directories without also touching implementation files in the same operation.

Sequential Thinking MCP — Pairs well with TDD for complex features. Before writing any tests, the agent uses sequential thinking to plan: what behaviors need to be verified, what edge cases matter, what the test boundary is. This planning step is what separates TDD from "write some tests and hope." The explicit thought trace also makes the agent's test strategy reviewable before it starts implementing.

GitHub MCP — For CI-integrated TDD: the agent can check whether its tests are passing in CI, pull GitHub Actions logs when they fail, and iterate without you copying build output manually. The full red-green loop, but running in your actual CI pipeline.

The TDAD Tool Itself

For teams with existing test suites and Python codebases, the tool from Paper 1 is directly usable:

pip install tdad
tdad analyze --repo . --output test-map.json

Feed test-map.json to your agent at the start of a coding session. The agent now knows the dependency structure between code and tests — which is the actual information it needs to avoid regressions, as distinct from the instruction to "use TDD."

The Prompt Engineering Layer

Based on the research and developer practice, the most effective TDD prompting pattern for coding agents is:

  1. Spec first — give a behavioral specification in a structured format (YAML or structured prose), not an implementation description
  2. Test boundary explicit — tell the agent exactly what it's allowed to touch when writing tests vs. when implementing
  3. Red confirmation required — the agent must report test failure before proceeding to implementation (no skipping the red phase)
  4. Impact context provided — give the agent the TDAD test map or an equivalent dependency list so it knows what existing tests its changes might affect
  5. Mutation awareness — for high-stakes features, have a second agent (or a second pass) verify that the tests actually fail when the implementation is intentionally broken
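Stitched together, the five elements make a reusable prompt scaffold. A sketch — the section wording is illustrative, not a standard format:

```python
def build_tdd_prompt(spec: str, test_boundary: list[str],
                     impacted_tests: list[str]) -> str:
    """Assemble a TDD task prompt covering the five elements above."""
    return "\n\n".join([
        "## Behavioral spec\n" + spec,
        "## Test boundary\nWhile writing tests, touch only: "
        + ", ".join(test_boundary),
        "## Red confirmation\nRun the new tests and report their failure "
        "output BEFORE writing any implementation code.",
        "## Impact context\nThese existing tests must still pass: "
        + ", ".join(impacted_tests),
        "## Mutation check\nAfter going green, intentionally break the "
        "implementation once and confirm at least one test fails.",
    ])
```

The impacted-tests list is where a TDAD-style test map plugs in directly.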

What This Changes

The most important shift from both papers: TDD with AI agents is not about teaching the agent a process. It's about giving the agent the structural information it needs to verify its own work.

Telling an agent "use test-driven development" without test coverage maps, dependency graphs, and explicit test boundaries is approximately as useful as telling a new hire "write good tests." The instruction is correct; the context that makes it actionable is missing.

Both TDAD papers operationalize the same underlying insight from opposite directions: one measures which existing tests a code change affects (regression prevention), the other measures whether a new test suite actually exercises the behavioral spec it was built from (prompt verification). Together they point toward a future where agent-generated code ships with behavioral contracts that are as measurable as unit test pass rates.

The tools are here. The methodology is validated. The gap is adoption.