Multi-Agent Systems Need Observability at the Execution-Graph Level, Not Another Prompt Log
The first generation of agent observability treated the model call as the unit of truth. Capture the prompt, capture the completion, count the tokens, maybe attach a trace ID, and call it visibility. That was barely enough for single-agent copilots. It is not enough for multi-agent systems. Once agents hand work to other agents, branch through tools, retry, summarize, and carry context across policy boundaries, the thing you need to debug is no longer a prompt. It is an execution graph.
The New Stack’s “Who’s monitoring the agents?” piece is useful because it names the production failure that does not show up as a clean crash: a request that should take one or two steps turns into dozens of model calls, nothing throws an exception, and nothing alerts. The user still gets an answer. It may even look reasonable. Meanwhile latency has ballooned, cost has doubled, a partial failure was hidden inside a handoff, and sensitive context may have moved through three agents before anyone noticed. That is not a logging problem. That is an operations problem.
Frameworks such as CrewAI, AutoGen, and LangGraph are moving from demos into incident response, internal copilots, automation pipelines, and other workflows where “the agent said something plausible” is not a sufficient service-level objective. These systems behave less like chatbots and more like dynamic control flow: nodes, edges, branches, retries, tools, memories, handoffs, approvals, and terminal decisions. Observability has to follow that shape.
The missing signal is the path, not the prompt
A normal service trace answers familiar questions: which services handled the request, how long each span took, where errors happened, and which dependency got slow. An agent trace needs those answers plus harder ones. Why did the system choose this tool? Which context object made it branch? Did it retry because the model was uncertain, the tool failed, or a policy check blocked it? Which agent summarized the sensitive data? Did the final answer depend on an observation or on another model’s inference? Where did the cost come from?
Prompt logs flatten those questions. They show artifacts, not causality. They may tell you what text entered the model, but not whether that text came from a verified database query, a stale retrieval result, a hallucinated intermediate answer, or a summarizer that removed the caveat. For production debugging, “model saw this prompt” is often the least interesting fact. The interesting fact is how the system assembled that prompt and which branch it took afterward.
OpenTelemetry’s emerging GenAI semantic conventions are a useful base layer. Attributes for provider, model, request, response, token usage, and related telemetry are necessary. Jaeger v2’s OpenTelemetry-first direction, Arize Phoenix, and LangSmith all point toward a healthier ecosystem of traces, spans, evals, datasets, prompt inspection, and LLM call analysis. But agent systems need one more level of modeling: graph-native causality. A span per model call is not enough if nobody can reconstruct the graph that produced the final decision.
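Here is a minimal sketch of what "reconstruct the graph" could mean in practice. It assumes every recorded step stores the IDs of the steps that caused it. The `Step` schema and the `causal_ancestors` helper are hypothetical names for illustration, not part of OpenTelemetry or any framework mentioned here.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One node in a run's execution graph (hypothetical schema)."""
    step_id: str
    kind: str                                            # "model_call", "tool_call", "handoff", ...
    caused_by: list[str] = field(default_factory=list)   # IDs of upstream steps


def causal_ancestors(steps: dict[str, Step], target: str) -> set[str]:
    """Return the IDs of every step the target transitively depends on."""
    seen: set[str] = set()
    stack = list(steps[target].caused_by)
    while stack:
        step_id = stack.pop()
        if step_id in seen:
            continue
        seen.add(step_id)
        stack.extend(steps[step_id].caused_by)
    return seen


run = {
    "retrieve": Step("retrieve", "tool_call"),
    "web_search": Step("web_search", "tool_call"),       # ran, but fed nothing downstream
    "summarize": Step("summarize", "model_call", ["retrieve"]),
    "answer": Step("answer", "model_call", ["summarize"]),
}
print(sorted(causal_ancestors(run, "answer")))  # ['retrieve', 'summarize']
```

A flat list of spans would show four steps. The causal edges show that only two of them shaped the final answer, and that the web search was pure cost.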
Agent failures drift before they explode
The phrase “nothing crashes, so nothing alerts” should be taped to every agent platform roadmap. Traditional production systems often fail with visible symptoms: exceptions, timeouts, HTTP 500s, queue backlogs. Agent systems frequently fail as waste, drift, or plausible wrongness. The support agent calls retrieval eight times instead of twice. The code-review agent loops through a tool catalog because two tools have overlapping descriptions. The incident agent does not register that a tool call timed out, asks another agent to summarize the partial data, and returns a confident diagnosis with the missing evidence quietly erased.
Cost drift is the easiest version to measure, so start there. Track token usage, model calls, tool calls, branch depth, retries, and wall-clock latency per run type. Build baselines. If the normal customer-support path is retrieval → policy check → draft → human approval, then retrieval → web search → filesystem → external model → export is not just “more thorough.” It is a different graph and should be treated as one. Budget enforcement should be a runtime feature: max calls, max depth, max tokens, max tool categories, max sensitive-context hops.
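The budget idea can be sketched as a small per-run counter that the orchestrator charges before each step. This assumes the orchestrator can call a hook at every node. `RunBudget`, `BudgetExceeded`, and the default limits are all hypothetical, chosen for illustration rather than as recommendations.

```python
from dataclasses import dataclass, field
from typing import Optional


class BudgetExceeded(RuntimeError):
    """Raised when a run crosses one of its graph-level limits."""


@dataclass
class RunBudget:
    # Limits (illustrative defaults, not recommendations).
    max_model_calls: int = 8
    max_tool_calls: int = 8
    max_depth: int = 4
    max_tokens: int = 20_000
    max_tool_categories: int = 3
    max_sensitive_hops: int = 1
    # Running totals for the current run.
    model_calls: int = 0
    tool_calls: int = 0
    tokens: int = 0
    sensitive_hops: int = 0
    tool_categories: set = field(default_factory=set)

    def charge(self, kind: str, *, tokens: int = 0, depth: int = 0,
               tool_category: Optional[str] = None,
               carries_sensitive: bool = False) -> None:
        """Record one step, then raise if any limit is exceeded."""
        if kind == "model_call":
            self.model_calls += 1
        elif kind == "tool_call":
            self.tool_calls += 1
        self.tokens += tokens
        if tool_category is not None:
            self.tool_categories.add(tool_category)
        if carries_sensitive:
            self.sensitive_hops += 1

        usage = {
            "model_calls": (self.model_calls, self.max_model_calls),
            "tool_calls": (self.tool_calls, self.max_tool_calls),
            "tokens": (self.tokens, self.max_tokens),
            "depth": (depth, self.max_depth),
            "tool_categories": (len(self.tool_categories), self.max_tool_categories),
            "sensitive_hops": (self.sensitive_hops, self.max_sensitive_hops),
        }
        for name, (used, limit) in usage.items():
            if used > limit:
                raise BudgetExceeded(f"{name} {used} exceeds limit {limit}")
```

Raising is only one possible response. The same check could instead downgrade the run to a cheaper path or hand it to a human, depending on the workflow's failure policy.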
Correctness drift is harder. The final answer may be wrong because one agent made a small mistake three steps earlier and another agent polished it into confidence. That is why traces must distinguish observations from inferences. “Tool returned customer balance: $42” is different from “agent inferred customer is eligible.” Mixing them in a transcript turns debugging into archaeology. Production traces should mark tool outputs, model reasoning summaries, policy decisions, human approvals, and generated artifacts as different event types.
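One way to keep observations and inferences apart is to type every trace event and record what it was derived from. The sketch below is an assumption about how that could look. `EventType`, `TraceEvent`, and `is_grounded` are hypothetical names, and "grounded" here only means that a tool observation exists somewhere in the event's ancestry, not that the inference is correct.

```python
from dataclasses import dataclass
from enum import Enum


class EventType(Enum):
    TOOL_OBSERVATION = "tool_observation"
    MODEL_INFERENCE = "model_inference"
    POLICY_DECISION = "policy_decision"
    HUMAN_APPROVAL = "human_approval"
    GENERATED_ARTIFACT = "generated_artifact"


@dataclass(frozen=True)
class TraceEvent:
    event_id: str
    type: EventType
    summary: str
    derived_from: tuple[str, ...] = ()   # IDs of the events this one was built on


def is_grounded(events: dict[str, TraceEvent], event_id: str, _seen=None) -> bool:
    """True if the event is, or transitively derives from, a tool observation."""
    seen = _seen if _seen is not None else set()
    if event_id in seen:          # already explored (or a cycle): adds no new evidence
        return False
    seen.add(event_id)
    event = events[event_id]
    if event.type is EventType.TOOL_OBSERVATION:
        return True
    return any(is_grounded(events, parent, seen) for parent in event.derived_from)


trace = {
    "balance": TraceEvent("balance", EventType.TOOL_OBSERVATION,
                          "Tool returned customer balance: $42"),
    "eligible": TraceEvent("eligible", EventType.MODEL_INFERENCE,
                           "Agent inferred customer is eligible", ("balance",)),
    "guess": TraceEvent("guess", EventType.MODEL_INFERENCE,
                        "Agent inferred a refund was already issued"),
}
print(is_grounded(trace, "eligible"), is_grounded(trace, "guess"))  # True False
```

With typed events, "which claims in the final answer rest on nothing but another model's output" becomes a query instead of a transcript read.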
Data movement is the security story
The security risk is not only prompt injection or a rogue tool call. It is gradual data propagation. Agent A reads a customer ticket. Agent B summarizes it into an internal plan. Agent C includes that plan in a prompt to an external model or writes it into a long-lived memory store. No single step looks like “exfiltrate secret,” but the graph crossed a boundary. If your observability system cannot show data lineage and policy zones, it cannot answer the question that matters after an incident: where did this information go?
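The Agent A, B, C scenario above can be sketched as a lineage check. The sketch assumes each hop is recorded with a data class, and that summaries inherit the classification of their inputs. The zone map, the allow-list, and names such as `Hop` and `boundary_crossings` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hop:
    source: str       # agent or store that held the data
    target: str       # agent, model provider, or store receiving it
    data_class: str   # classification carried by the data, e.g. "customer_pii"


# Hypothetical policy: which zone each node lives in, and which zones
# each data class may enter.
ZONES = {
    "agent_a": "internal",
    "agent_b": "internal",
    "agent_c": "internal",
    "external_model": "external",
    "memory_store": "long_lived",
}
ALLOWED_ZONES = {
    "customer_pii": {"internal"},
    "public": {"internal", "external", "long_lived"},
}


def boundary_crossings(hops, zones=ZONES, allowed=ALLOWED_ZONES):
    """Return the hops that deliver a data class into a zone it may not enter.

    Unknown targets and unknown data classes count as violations (fail closed).
    """
    return [
        hop for hop in hops
        if zones.get(hop.target, "unknown") not in allowed.get(hop.data_class, set())
    ]


lineage = [
    Hop("ticket_system", "agent_a", "customer_pii"),   # Agent A reads the ticket
    Hop("agent_a", "agent_b", "customer_pii"),         # Agent B summarizes it
    Hop("agent_b", "agent_c", "customer_pii"),         # Agent C receives the plan
    Hop("agent_c", "external_model", "customer_pii"),  # the plan leaves the zone
]
for hop in boundary_crossings(lineage):
    print(f"{hop.data_class}: {hop.source} -> {hop.target}")
```

No single hop here is suspicious in isolation. The violation only appears once the classification travels with the data through every summary and handoff.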
This is why raw prompt capture is a dangerous comfort blanket. Capturing everything creates its own sensitive data store. Capturing too little leaves you blind. The practical middle ground is structured telemetry with redaction and classification. Log tool names, input categories, output categories, data classes, user/session/run IDs, policy decisions, and hashes or references for large payloads where possible. Preserve enough state for replay and audit, but do not turn your traces into a second breach surface.
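As a rough illustration of that middle ground, the sketch below builds a telemetry record that carries categories and a payload reference but never the payload itself. The field names and the `telemetry_record` helper are assumptions, not a standard schema. It uses a keyed hash (HMAC) rather than a bare hash because short, guessable values such as account numbers can be recovered from an unkeyed digest by brute force.

```python
import hashlib
import hmac


def payload_ref(payload: str, key: bytes) -> str:
    """Return a keyed SHA-256 reference that stands in for the raw payload."""
    digest = hmac.new(key, payload.encode("utf-8"), hashlib.sha256).hexdigest()
    return "hmac-sha256:" + digest[:32]


def telemetry_record(*, run_id: str, tool: str, input_category: str,
                     output_category: str, data_class: str,
                     policy_decision: str, payload: str, key: bytes) -> dict:
    """Build a structured event that describes a tool call without storing its content."""
    return {
        "run_id": run_id,
        "tool": tool,
        "input_category": input_category,
        "output_category": output_category,
        "data_class": data_class,
        "policy_decision": policy_decision,
        "payload_ref": payload_ref(payload, key),        # comparable across events
        "payload_bytes": len(payload.encode("utf-8")),   # size without content
    }
```

Identical payloads produce identical references under the same key, so an auditor can still see that the same content appeared in two places without reading it. Where the raw payload is needed for replay, it would live in a separate, access-controlled store keyed by that reference.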
For builders using LangGraph, CrewAI, AutoGen/AG2, OpenAI Agents, or homegrown orchestration, the next instrumentation layer should be graph-native from day one. Record nodes, edges, branch reasons, retries, handoffs, tool permissions, context objects, summaries, terminal decisions, and human approval points. Put budgets on the graph, not just the API call. Separate “model said” from “tool observed.” Assign every run an ID that survives across agents and async jobs. Make it possible to replay the path with redacted payloads and to explain why the system stopped.
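For the run ID in particular, here is a minimal sketch using Python's standard `contextvars` and `asyncio`. The agents, the `emit` helper, and the in-memory `events` list are hypothetical stand-ins for real agents and a real telemetry exporter.

```python
import asyncio
import contextvars
import uuid

# Holds the current run's ID. asyncio tasks copy the context they are created in.
run_id_var = contextvars.ContextVar("run_id", default="unset")

events = []  # stand-in for a telemetry exporter


def emit(node: str, **fields) -> None:
    """Record an event tagged with whatever run ID is current in this context."""
    events.append({"run_id": run_id_var.get(), "node": node, **fields})


async def research_agent(query: str) -> str:
    emit("research_agent", kind="tool_call")
    return f"notes on {query}"


async def writer_agent(notes: str) -> str:
    emit("writer_agent", kind="model_call")
    return f"draft from {notes}"


async def handle_request(query: str) -> str:
    run_id = uuid.uuid4().hex
    run_id_var.set(run_id)
    # The child task inherits a copy of this context, including the run ID.
    notes = await asyncio.create_task(research_agent(query))
    await writer_agent(notes)
    return run_id


first = asyncio.run(handle_request("billing dispute"))
print(all(event["run_id"] == first for event in events))  # True
```

This only covers a single process. Once work crosses a queue, a worker pool, or another service, the context does not follow, so the run ID has to travel explicitly in the message or request headers.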
The ownership model matters too. Agent observability cannot be a dashboard bolted on by the platform team after product teams ship workflows. Each agent graph needs an owner, expected shape, allowed tools, data classification boundaries, and failure policy. If the graph enters an unexpected branch, exceeds a call budget, accesses a new sensitive source, or routes data to a new provider, the system should block, degrade, or page. “It completed successfully” is not a useful status if success means “spent $18 and leaked a customer summary into the wrong memory store.”
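An owned graph could be expressed as a small contract checked at every step. The sketch below is one possible shape. `GraphContract`, `evaluate`, the team name, the tool and provider names, and especially the mapping from violation to action are assumptions made for illustration; each team would pick its own.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    ALLOW = "allow"
    DEGRADE = "degrade"
    BLOCK = "block"
    PAGE = "page"


@dataclass(frozen=True)
class GraphContract:
    owner: str
    expected_edges: frozenset      # of (source_node, target_node) tuples
    allowed_tools: frozenset
    allowed_providers: frozenset
    max_model_calls: int


def evaluate(contract: GraphContract, *, edge, model_calls: int,
             tool=None, provider=None) -> Action:
    """Decide what to do with the next step. The severity order is an arbitrary example."""
    if provider is not None and provider not in contract.allowed_providers:
        return Action.PAGE       # data would be routed to a new provider
    if tool is not None and tool not in contract.allowed_tools:
        return Action.BLOCK      # tool outside the allow-list
    if edge not in contract.expected_edges:
        return Action.BLOCK      # unexpected branch
    if model_calls > contract.max_model_calls:
        return Action.DEGRADE    # over the call budget
    return Action.ALLOW


support = GraphContract(
    owner="support-platform-team",   # hypothetical owner
    expected_edges=frozenset({
        ("retrieval", "policy_check"),
        ("policy_check", "draft"),
        ("draft", "human_approval"),
    }),
    allowed_tools=frozenset({"kb_search"}),
    allowed_providers=frozenset({"primary_llm"}),
    max_model_calls=6,
)
print(evaluate(support, edge=("retrieval", "web_search"), model_calls=2).value)  # block
```

The contract doubles as documentation: anyone paged at 2 a.m. can see who owns the graph and what shape it was supposed to have.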
The industry is learning the same lesson it learned with microservices, event buses, and serverless: abstraction does not remove operations work; it moves it. Multi-agent frameworks are becoming execution engines. Execution engines need traces that expose graph shape, data movement, policy violations, and cost drift before the postmortem starts with “the answer looked plausible.” Prompt logs are evidence. They are not observability.
Sources: The New Stack, “Who’s monitoring the agents?”; OpenTelemetry GenAI semantic conventions; The New Stack on Jaeger v2 and AI observability; Arize Phoenix docs; LangSmith observability docs