Treat Your Coding Agent Like a Distributed System — The AgentOps Engineering Discipline for Production LLM Agents
The third article in an ongoing agentic architectures series makes a clear and uncomfortable argument: agentic systems are distributed systems, and they fail in distributed ways — partially, silently, and at the worst possible time. The piece opens with a concrete production incident: an agent made 47 API calls, hit a rate limit on call 12, and quietly spun for 20 minutes accumulating token costs while doing nothing — because there were no logs. From there, it builds a full AgentOps discipline covering observability and distributed tracing for multi-agent pipelines, a structured taxonomy of cascade failures and runaway loops, human-in-the-loop interrupt patterns at high-stakes decision points, evaluation using golden trajectory comparison, and cost governance enforced at the agent boundary rather than the model boundary. A comparative table maps five levels of agentic architecture complexity to their observability requirements and primary failure modes.
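The incident above and the article's call for cost governance at the agent boundary can be made concrete with a small sketch. Nothing below comes from the article itself: the class name, limits, and logging fields are illustrative assumptions. The idea is that token, call, and progress budgets are tracked per agent run rather than per model call, so a silent retry spin trips an alarm instead of burning tokens unlogged.

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("agentops")

class AgentBudgetExceeded(RuntimeError):
    """Raised when an agent run exhausts its budget or stalls."""

class AgentRunGuard:
    """Hypothetical guard enforcing limits at the agent boundary,
    not the model boundary. All names and defaults are illustrative."""

    def __init__(self, run_id, max_tokens=50_000, max_calls=40, max_idle_s=60.0):
        self.run_id = run_id
        self.max_tokens = max_tokens
        self.max_calls = max_calls
        self.max_idle_s = max_idle_s
        self.tokens_used = 0
        self.calls = 0
        self.last_progress = time.monotonic()

    def record_call(self, tokens, made_progress=True):
        """Account for one model/tool call; raise if any budget is blown."""
        self.calls += 1
        self.tokens_used += tokens
        now = time.monotonic()
        if made_progress:
            self.last_progress = now
        # Every call logs the run id, so spans can be correlated in a trace.
        log.info("run=%s call=%d tokens_total=%d",
                 self.run_id, self.calls, self.tokens_used)
        if self.tokens_used > self.max_tokens or self.calls > self.max_calls:
            raise AgentBudgetExceeded(f"run {self.run_id}: budget exhausted")
        if now - self.last_progress > self.max_idle_s:
            # A silent rate-limit retry loop that makes no progress trips here
            # long before 20 minutes of token spend accumulates.
            raise AgentBudgetExceeded(
                f"run {self.run_id}: no progress for {self.max_idle_s}s")
```

In this framing, the rate-limit incident becomes an SRE event: the guard raises a structured, attributable error with a run id, instead of the agent spinning invisibly.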
The practical shift this piece argues for is treating agentic system failures as SRE problems, not prompt engineering problems. When a coding agent loop fails silently or burns tokens in a retry state, the fix is structured incident response, distributed trace correlation, and defined SLOs at the agent boundary — not a revised system prompt. The L1-to-L5 complexity taxonomy is a self-assessment tool that tells you exactly which AgentOps practices apply to your current system, preventing the common failure modes of over-engineering observability for a single-agent workflow or under-instrumenting a five-agent production system.
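The evaluation practice the piece names, golden trajectory comparison, can also be sketched briefly. The article does not specify a trajectory format; representing each step as a `(tool_name, canonical_args)` tuple is an assumption made here for illustration, as are the function names.

```python
def first_divergence(golden, actual):
    """Return the index of the first step where the actual run departs from
    the golden trajectory, or None if they match exactly. Trajectories are
    lists of (tool_name, canonical_args) tuples -- an assumed encoding."""
    for i, (g, a) in enumerate(zip(golden, actual)):
        if g != a:
            return i
    if len(actual) != len(golden):
        # One run ended early or added extra steps: diverges at the shorter length.
        return min(len(golden), len(actual))
    return None

def trajectory_score(golden, actual):
    """Fraction of the golden trajectory reproduced before divergence."""
    if not golden:
        return 1.0
    d = first_divergence(golden, actual)
    matched = len(golden) if d is None else d
    return matched / len(golden)
```

A score of 1.0 means the agent replayed the golden run step for step; the divergence index tells you exactly where to start reading the trace, which is the SRE-style debugging move the article advocates.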