
Why Agentic AI Sessions Break Conventional Inference — and What NVIDIA's Extreme Co-Design Stack Actually Fixes

Anatoliy Kolodkin

05 May 2026 • 6 min read

Here is a number that should concern anyone running or buying AI infrastructure today: a single live Claude Code session, tracked end-to-end over 33 minutes, made 283 inference requests — 58 from the main agent and 225 from sub-agents — and processed 3.5 million tokens before the first context compaction event. Context grew from 15,000 tokens to 156,000 tokens before that compaction. That is not a benchmark. That is a production trace from a real coding agent doing real work.

NVIDIA published this data on May 5 as part of a technical post explaining why agentic AI sessions are a fundamentally different inference workload from chatbots or standard coding tasks — and why no single processor can solve all the resulting bottlenecks simultaneously. The post is the most honest technical document NVIDIA has published in months, because it shows actual measurement data instead of architecture slides, and it argues from first principles about what is broken rather than selling a predetermined solution.

What the Trace Actually Reveals

The 33-minute Claude Code trace is the anchor of the argument. In a conventional chatbot session, a user types a prompt, the model generates a response, and the session ends or resets. In an agentic coding session, the agent spawns sub-agents, each of which makes its own inference requests, all of which share context accumulated from the parent session. The result is a multiplicative token volume problem: 225 sub-agent requests against 58 main-agent turns, with each sub-agent turn averaging around 85,000 input tokens.

That is where the 15x figure from Anthropic's multi-agent research becomes concrete: Anthropic reported that multi-agent systems use roughly 15 times as many tokens as chat interactions. Agentic systems do not just make more requests. They make requests with much larger contexts, because every sub-agent turn carries forward the accumulated state of the parent session. A coding agent that can browse files, run tests, call APIs, and spawn specialist sub-agents is not a chatbot with extra steps. It is a distributed inference pipeline where the "conversation" is actually a growing state object that has to be maintained, restored, and compacted across every pause and resumption.

NVIDIA's Dynamo data on KV cache hit rates adds another layer. After the first request in a session, cache hit rates settle between 85% and 97%. Multi-agent setups with eleven Opus-class teammates reach 97.2% aggregate cache hit rate with an 11.7-to-1 read-to-write ratio. That sounds like a win for caching — and it is — but only if the serving infrastructure can efficiently maintain and restore large KV prefixes across pauses, sub-agent spawns, and compaction events. The bottleneck is not the model context window. It is the memory system that has to hold 156,000-token contexts in fast storage and restore them quickly enough that the sub-agent resumption latency is tolerable.
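To get a feel for why holding a 156,000-token prefix is a memory-system problem, here is a rough sizing sketch. It uses the standard per-token KV cache formula for a transformer with grouped-query attention. The model shape below is hypothetical (the models in the trace do not have published dimensions), and the sketch ignores paging overhead, quantization, and any MoE-specific layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical transformer dimensions, chosen only for illustration."""
    layers: int
    kv_heads: int
    head_dim: int
    bytes_per_element: int = 2  # assumes FP16/BF16 cache entries


def kv_cache_bytes(shape: ModelShape, context_tokens: int) -> int:
    """KV cache size for one sequence: keys and values for every layer."""
    per_token = (
        2  # one key tensor and one value tensor
        * shape.layers
        * shape.kv_heads
        * shape.head_dim
        * shape.bytes_per_element
    )
    return per_token * context_tokens


# Assumption: an 80-layer model with 8 KV heads of dimension 128.
HYPOTHETICAL = ModelShape(layers=80, kv_heads=8, head_dim=128)

for tokens in (15_000, 156_000):
    gb = kv_cache_bytes(HYPOTHETICAL, tokens) / 1e9
    print(f"{tokens:>7,} tokens -> {gb:.1f} GB of KV cache")
```

Under these assumed dimensions the cache for a single session grows from about 5 GB at 15,000 tokens to about 51 GB at 156,000 tokens, before counting any concurrent sessions or sub-agents. Real figures depend entirely on the actual model shape and cache precision.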

The Prompt Caching Math That Changes the Economics

The caching numbers have a direct cost implication that is worth spelling out. At a 95% cache hit rate, input processing cost drops roughly 85% compared to no caching. Put the other way, the same agentic session without caching would cost roughly six to seven times as much for input processing. For a single 33-minute coding session, that is the difference between a session that costs a few dollars and one that costs tens of dollars, which is manageable at human-session timescales. At the query volumes required to run an AI-assisted engineering team with hundreds of concurrent developers, it is the difference between a tool people use and a line item that gets flagged in a CFO review.
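The arithmetic behind the 95% hit-rate figure can be sketched as follows. The key assumption, which is not stated in the NVIDIA post, is that a cached input token costs 10% of an uncached one. The sketch also ignores cache-write premiums and output tokens, so treat it as a back-of-the-envelope model rather than a pricing calculator.

```python
def relative_input_cost(hit_rate: float, cached_price_ratio: float = 0.10) -> float:
    """Input cost with caching, as a fraction of the no-cache input cost.

    cached_price_ratio is an assumption: the price of a cached token
    relative to an uncached token.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    uncached_part = 1.0 - hit_rate
    cached_part = hit_rate * cached_price_ratio
    return uncached_part + cached_part


cost = relative_input_cost(0.95)
savings = 1.0 - cost
multiplier = 1.0 / cost  # how much more the no-cache session costs

print(f"relative input cost: {cost:.3f}")
print(f"savings vs no cache: {savings:.1%}")
print(f"no-cache multiplier: {multiplier:.1f}x")
```

Under that assumption the relative cost is 0.145, which is about 85% savings and a no-cache multiplier of roughly 6.9x. Changing `cached_price_ratio` or the hit rate moves both numbers quickly, which is the reason to plug in your own measured values.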

The catch is that the 95%+ cache hit rates in NVIDIA's data come from coding agent sessions where tool outputs are small and deterministic. The cache is effective because the agent is doing things that produce consistent intermediate results — compiling a function, running a test, checking a file — rather than generating open-ended creative content. For other agentic domains where outputs are less deterministic, cache hit rates will be lower and the economics will be worse. The 95% figure is real but domain-specific, and teams should measure their own workloads before building cost models on it.
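Measuring your own workload can start from per-request usage logs. The sketch below computes a token-weighted cache hit rate. The record fields `input_tokens` and `cached_tokens` are hypothetical names, and the sample values are made up purely to exercise the function; map them to whatever your serving stack actually reports.

```python
from typing import Iterable, Mapping


def cache_hit_rate(requests: Iterable[Mapping[str, int]]) -> float:
    """Token-weighted hit rate: cached input tokens / all input tokens."""
    total = 0
    cached = 0
    for record in requests:
        total += record["input_tokens"]
        cached += record["cached_tokens"]
    if total == 0:
        raise ValueError("no input tokens recorded")
    return cached / total


# Synthetic records for illustration only, not data from the NVIDIA trace.
sample = [
    {"input_tokens": 1000, "cached_tokens": 0},     # first request: cold cache
    {"input_tokens": 1200, "cached_tokens": 1000},  # prefix reused
    {"input_tokens": 1500, "cached_tokens": 1200},
]

overall = cache_hit_rate(sample)
after_first = cache_hit_rate(sample[1:])
print(f"overall hit rate:       {overall:.1%}")
print(f"excluding first request: {after_first:.1%}")
```

Reporting the rate both with and without the first request matters because the first request in a session always starts cold, and the figures quoted above describe behavior after it.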

Why No Single Processor Can Win This Alone

The extreme co-design argument is the core of NVIDIA's case, and it is architecturally sound even if it sounds like vendor justification for a long bill of materials. The problem is that agentic inference creates four distinct bottleneck categories that map to different hardware characteristics:

Raw compute throughput for large-context processing maps to GPU HBM capacity and bandwidth — Vera Rubin NVL72 is the answer NVIDIA is selling here, with the explicit claim of "one-tenth the cost per million tokens of Blackwell." Low-latency SRAM-first token generation maps to Groq 3 LPX, which NVIDIA acquired in December 2025 for $20 billion and is now positioning as a jitter absorber in the agent pipeline rather than a standalone inference target. Tool execution latency and CPU-GPU unified execution maps to Vera CPU. And the fabric between all of these — NVLink 6, ConnectX-9 SuperNIC, BlueField-4 DPU, Spectrum-X — has to move large KV cache payloads between heterogeneous processors without introducing the latency variance that breaks sub-agent responsiveness.

The software layer doing the orchestration work is Dynamo with Attention-FFN Disaggregation (AFD). The technical insight is that in mixture-of-experts (MoE) agentic systems, the expert computation (the feed-forward network, or FFN) and the attention computation run at different rates and scale differently across heterogeneous accelerators. Serving them on separate processor types and reassembling the results requires a coordination layer that most inference stacks do not have. NVIDIA is arguing that AFD is that coordination layer, and that without it the extreme co-design hardware stack is just a set of expensive components that do not work together coherently.

The Groq Integration Is the Quiet Signal Worth Noting

The inclusion of Groq 3 LPX in the extreme co-design stack deserves separate attention. NVIDIA paid $20 billion for Groq in December 2025. The strategic question since has been whether Groq would remain a standalone inference product, get absorbed into the NVIDIA portfolio as a feature, or play a specific architectural role. This post answers that question: Groq's SRAM-first, low-jitter token generation is being integrated as a pipeline latency variance absorber — the thing that keeps sub-agent response times predictable even when the main GPU inference path is handling large-context requests.

That is a specific and defensible architectural role. The problem Groq solves is not raw throughput but the latency histogram. A GPU cluster doing large-context inference has a wider latency distribution than an SRAM-first accelerator because memory access patterns vary more at scale. If your agentic pipeline has sub-agents that need to respond within a tight latency window (typing predictions, autocomplete, tool call responses), you need a low-jitter path for those turns even if the larger context turns go to GPU HBM. Groq is that path. The $20 billion price tag makes sense if the alternative is losing the latency-sensitive portion of agentic pipelines to a competitor, or building a worse solution in-house.
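The "latency histogram" framing can be made concrete with a small sketch: summarize a latency sample by its median and tail, then route a turn by its latency budget and context size. This is an illustration of the idea only, not how Dynamo or any NVIDIA component schedules work. The path names and both thresholds are hypothetical.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile for p in (0, 100]."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]


def choose_path(
    context_tokens: int,
    latency_budget_ms: float,
    *,
    max_low_jitter_context: int = 8_000,  # hypothetical threshold
    tight_budget_ms: float = 200.0,       # hypothetical threshold
) -> str:
    """Send small, latency-critical turns to the low-jitter path."""
    is_tight = latency_budget_ms <= tight_budget_ms
    is_small = context_tokens <= max_low_jitter_context
    return "low_jitter" if is_tight and is_small else "gpu_hbm"


# Synthetic latencies in milliseconds: mostly fast, one slow outlier.
latencies_ms = [10.0, 20.0, 30.0, 40.0, 1000.0]
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
print(f"p50={p50} ms, p99={p99} ms, tail ratio={p99 / p50:.1f}x")
print(choose_path(context_tokens=2_000, latency_budget_ms=100))
print(choose_path(context_tokens=150_000, latency_budget_ms=100))
```

The useful quantity is the gap between the median and the tail, not the average. A path with a similar median but a much smaller p99 is what keeps autocomplete-style turns predictable.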

What the Rubin Timeline Means for Agentic Economics

The post explicitly positions Vera Rubin NVL72 as the platform that makes long-context agentic pipelines economically tractable. The "one-tenth the cost per million tokens versus Blackwell" claim is specific and verifiable once Rubin ships — but it comes with the HBM4 supply caveat that has already cut Rubin production targets from 2 million to 1.5 million units for 2026 due to SK Hynix qualification delays. Samsung is now the lead Rubin memory supplier, and the ramp has slipped from June to September 2026.

For teams planning inference infrastructure for agentic workloads, the practical implication is that Blackwell fills the gap longer than expected, and the cost improvement from Rubin may arrive later and in smaller volume than NVIDIA's original roadmap implied. That is not a reason to change plans — it is a reason to build cost models that work on both Blackwell and Rubin timelines rather than assuming Rubin's economics arrive on a specific schedule.
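A cost model that works on both timelines can be as simple as the sketch below. It normalizes Blackwell to a cost of 1.0 per unit of monthly token volume and takes NVIDIA's "one-tenth" claim at face value for Rubin. The share of traffic that can move to Rubin once it arrives is a hypothetical parameter, as is the 12-month horizon starting in January 2026.

```python
def horizon_cost(
    months: int,
    rubin_start_month: int,
    rubin_share: float,
    *,
    blackwell_cost: float = 1.0,    # normalized cost per unit of tokens
    rubin_cost_ratio: float = 0.1,  # NVIDIA's claimed ratio vs Blackwell
    tokens_per_month: float = 1.0,  # normalized, assumed constant
) -> float:
    """Total inference cost over the horizon, in normalized units."""
    if not 0.0 <= rubin_share <= 1.0:
        raise ValueError("rubin_share must be between 0 and 1")
    total = 0.0
    for month in range(months):
        # Before Rubin arrives, all traffic stays on Blackwell.
        share = rubin_share if month >= rubin_start_month else 0.0
        blended = (1.0 - share) + share * rubin_cost_ratio
        total += tokens_per_month * blackwell_cost * blended
    return total


# Month indices count from January (0); June is 5 and September is 8.
# Assumption: half of traffic can move to Rubin once it is available.
june = horizon_cost(12, rubin_start_month=5, rubin_share=0.5)
september = horizon_cost(12, rubin_start_month=8, rubin_share=0.5)
print(f"June ramp:      {june:.2f}")
print(f"September ramp: {september:.2f}")
print(f"cost of slip:   {september / june - 1:.1%}")
```

With these placeholder inputs the three-month slip raises the 12-month cost from 8.85 to 10.20 normalized units, roughly 15%. The exact figure matters less than checking that the budget still holds under the later date and a smaller Rubin share.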

The post is ultimately an argument for thinking about agentic AI as a platform problem, not a model problem. The models matter, but the serving infrastructure — memory systems, fabric, disaggregation software, caching strategies — determines whether agentic AI is economically viable at production scale. NVIDIA is betting that the team which solves the platform captures the infrastructure lock-in regardless of which model wins. That bet is coherent, the technical arguments are honest, and the live Claude Code trace is the kind of measurement data the industry needs more of. Whether the extreme co-design stack is accessible to teams outside hyperscalers remains the open question — and the one worth watching as Rubin ramps.

Sources: NVIDIA Developer Blog, NVIDIA Developer Blog (in-vehicle agents post), Anthropic Multi-Agent Research
