Mock-First Agent Evaluation: The Testing Pattern That Lets You Validate AI Tool Calls Before They Touch Real Systems
Most agent tool evaluation falls into one of two inadequate buckets: "did it work in the demo?" or "did it work in production?" A paper from the IndustriConnect team introduces a rigorous methodology that sits squarely between them — a mock-first evaluation pattern that catches error-handling failures, concurrency issues, and recovery failures before they ever surface in a live system. The core idea is straightforward: before connecting an AI agent to any real external system, you run the full evaluation suite against a deterministic local mock that faithfully simulates the external interface, its error conditions, and its edge cases.
The paper operationalizes this with a four-scenario benchmark structure — normal operation (480 runs), fault injection (210 runs across 7 distinct fault scenarios), stress testing (120 runs across 12 stress scenarios), and endpoint restart recovery (60 runs) — designed to be fully deterministic and therefore statistically comparable across runs. Normal operation achieved 100% success, fault injection exposed the boundaries of structured error handling, stress testing surfaced concurrency limits and race conditions, and recovery testing validated session restoration after unexpected disconnections. Each scenario class maps onto a real failure mode that "it worked in staging" workflows routinely miss.
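The scenario structure above lends itself to a simple deterministic harness. The run counts and scenario names mirror the paper's benchmark; the runner itself, the `run_once` callable, and the seed-per-run convention are assumptions added for illustration.

```python
from collections import Counter
from typing import Callable

# Scenario classes and run counts from the paper's benchmark structure.
SCENARIOS = {
    "normal":           {"runs": 480, "variants": 1},
    "fault_injection":  {"runs": 210, "variants": 7},   # 30 runs per fault
    "stress":           {"runs": 120, "variants": 12},  # 10 runs per stress case
    "restart_recovery": {"runs": 60,  "variants": 1},
}

def run_benchmark(run_once: Callable[[str, int, int], bool]) -> dict[str, float]:
    """Execute every scenario with a fixed seed per run (hypothetical
    convention: seed = run index) and report per-scenario success rates,
    which stay statistically comparable across benchmark executions."""
    rates = {}
    for name, cfg in SCENARIOS.items():
        per_variant = cfg["runs"] // cfg["variants"]
        results = Counter(
            run_once(name, variant, seed)
            for variant in range(cfg["variants"])
            for seed in range(per_variant)
        )
        rates[name] = results[True] / cfg["runs"]
    return rates
```

Keeping the scenario matrix declarative makes the determinism claim checkable: the same `(scenario, variant, seed)` triple always identifies the same run.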
Beyond the benchmark itself, the paper formalizes three contributions directly portable to any agent tool integration project: an MCP adapter architecture that exposes external tool interfaces as schema-discoverable AI tools; the mock-first evaluation methodology with fault injection and stress testing; and a reproducible benchmark suite. For teams using MCP to connect coding agents to CI pipelines, code repositories, linters, or test runners, the methodology is an extractable template. The "validate locally with faults before connecting to real systems" principle is especially valuable for tool calls with side effects — code commits, PR creation, deploys — where a misbehaving agent loop can do real damage before anyone notices.
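To make "schema-discoverable" concrete, here is a minimal sketch of the adapter idea: each external operation is published with a JSON-Schema description of its inputs, so an agent can enumerate tools before calling any. This is not the paper's code and not tied to any specific MCP SDK; the `run_tests` tool, `list_tools`, and `call_tool` names are hypothetical.

```python
# One tool entry: name, human-readable description, and an input schema
# the agent can inspect to construct valid calls.
TOOLS = [
    {
        "name": "run_tests",
        "description": "Run the project's test suite and return a summary.",
        "inputSchema": {
            "type": "object",
            "properties": {
                "path": {"type": "string", "description": "Test directory"},
                "timeout_s": {"type": "integer", "minimum": 1},
            },
            "required": ["path"],
        },
    },
]

def list_tools() -> list[dict]:
    # Discovery step: the agent asks the adapter which tools exist
    # and what arguments they take, before making any call.
    return TOOLS

def call_tool(name: str, args: dict) -> dict:
    # Dispatch step: a real adapter would forward to the external system;
    # here the side effect is stubbed and a structured result is returned.
    if name != "run_tests":
        return {"ok": False, "error": f"unknown tool: {name}"}
    if "path" not in args:
        return {"ok": False, "error": "missing required argument: path"}
    return {"ok": True, "summary": f"ran tests in {args['path']}"}
```

Swapping the stubbed dispatch for a call into the deterministic mock is exactly what lets the same evaluation suite run first against the mock and later against the real system.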