NVIDIA’s Agent Stack Is Growing Up at the Cache Layer, Not the Chat Layer
NVIDIA keeps talking about agentic AI as if the hard part is giving a model tools. That is the demo-layer version of the problem. The production-layer version is uglier: long-lived coding sessions keep resending huge prefixes, subagents fan out and reconverge, tool calls create awkward pauses, and every cache miss turns “smart assistant” into “expensive reheating of the same context.” The most interesting part of NVIDIA’s latest Dynamo push is that it treats agent inference like a systems problem again.
In its new technical post on Dynamo, NVIDIA argues that coding-agent workloads are dominated by KV-cache behavior rather than raw decode throughput. The company says Claude Code-style sessions can hit 85% to 97% KV-cache hit rates after the first request, and that a four-agent Opus-style setup reached a 97.2% aggregate hit rate with an 11.7x read-to-write ratio. That is a useful framing shift. If those numbers are even directionally right for self-hosted open-model deployments, then the winning inference stack for coding agents will not be the one with the prettiest benchmark chart. It will be the one that avoids recomputing the same 50,000 tokens over and over because a router forgot where the context lives.
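To see why the hit rate dominates the economics, here is a back-of-envelope sketch in Python. Only the 97% hit rate comes from NVIDIA's post; the 50,000-token prefix, the 100-call session length, and the simplifying assumption that the first call is a full miss are illustrative.

```python
# Back-of-envelope: what a KV-cache hit rate means in recomputed tokens.
# All figures other than the 97% hit rate are illustrative assumptions.

context_tokens = 50_000    # repeated prefix carried by each request (assumed)
calls_per_session = 100    # agent session length (assumed)

def prefill_tokens(hit_rate: float) -> int:
    """Tokens that must actually be prefilled across the session,
    assuming the first call is always a full cache miss."""
    misses = context_tokens + (calls_per_session - 1) * context_tokens * (1 - hit_rate)
    return int(misses)

cold = prefill_tokens(0.0)    # no reuse: every call recomputes the prefix
warm = prefill_tokens(0.97)   # NVIDIA's reported hit rate after the first request

print(f"no cache reuse : {cold:,} prefill tokens")
print(f"97% hit rate   : {warm:,} prefill tokens")
print(f"reduction      : {1 - warm / cold:.0%}")
```

Under those assumptions the cached session prefills roughly 96% fewer tokens, which is the difference the article is pointing at: the cluster's bill is set by misses, not by decode speed.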
NVIDIA’s answer is to push agent awareness into three layers at once. At the frontend, Dynamo now exposes structured hints for things that ordinary chat APIs mostly ignore: request priority, estimated output length, and a speculative-prefill signal that lets the system start warming cache before a likely tool result comes back. It also adds Anthropic-style cache-control semantics with TTLs, which is basically an admission that agent workloads do not behave like stateless prompt-response traffic. They behave like a messy distributed application with expensive shared state.
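The shape of such a hint-carrying request can be sketched as follows. Every field name below is an assumption for illustration, not Dynamo's actual wire schema; the `cache_control` block mirrors Anthropic's published API style, which the post says Dynamo adopts with TTLs.

```python
import json

# Hypothetical request body illustrating the kinds of frontend hints the
# post describes. Field names are assumptions, not Dynamo's real schema.
request = {
    "model": "qwen-coder",
    "messages": [
        {
            "role": "system",
            "content": "...large repo context...",
            # Anthropic-style cache-control with a TTL: keep this prefix
            # warm across the tool-call gap instead of letting it evict.
            "cache_control": {"type": "ephemeral", "ttl": "5m"},
        },
        {"role": "user", "content": "Run the tests and fix the failure."},
    ],
    # Scheduling hints: priority and an output-length estimate let the
    # scheduler plan decode capacity instead of guessing.
    "priority": "interactive",
    "estimated_output_tokens": 400,
    # Speculative prefill: start warming cache for the likely follow-up
    # before the tool result actually comes back.
    "speculative_prefill": True,
}

print(json.dumps(request, indent=2))
```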
That part matters more than it sounds. Much of the industry is still benchmarking inference as if the unit of work were a single conversation turn from a generic chatbot. But agent harnesses like Codex, Claude Code, OpenClaw, and similar systems create a different access pattern entirely. The user is not paying for one answer. They are paying for a session that may span dozens or hundreds of calls, each of which carries a huge amount of repeated context. In that world, cache locality is not a micro-optimization. It is the margin.
The routing layer is where NVIDIA’s story becomes more concrete. Dynamo’s router keeps a global index of KV-cache blocks, scores overlap by worker, and then balances cache reuse against decode load rather than blindly round-robining across replicas. NVIDIA says its Flash Indexer work reaches 170 million operations per second for cluster-scale KV routing. That sounds like classic infrastructure chest-thumping, but the more credible proof point is the external one: Baseten reports a 50% reduction in average time to first token, a 34% drop in time per output token, 48% lower P95 latency, 49% lower P99 latency, and roughly 61% to 62% higher throughput on long-context Qwen Coder traffic when KV-aware routing is enabled. Those are not vanity gains. Those are the kind of numbers that change whether a self-hosted coding agent feels viable or just principled.
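The reuse-versus-load tradeoff the router makes can be sketched in a few lines. This is a toy scoring function under assumed weights, not Dynamo's actual cost model, which the post describes as balancing a global KV-block index against decode load.

```python
from dataclasses import dataclass

@dataclass
class Worker:
    name: str
    cached_blocks: set[int]   # hashes of KV blocks resident on this worker
    decode_load: float        # active decode work, normalized to [0, 1]

def route(request_blocks: set[int], workers: list[Worker],
          reuse_weight: float = 1.0, load_weight: float = 0.5) -> Worker:
    """Pick the worker that maximizes cache overlap minus decode load.

    Overlap is the fraction of the request's prefix blocks already
    resident on a worker. The weights are illustrative assumptions.
    """
    def score(w: Worker) -> float:
        overlap = len(request_blocks & w.cached_blocks) / len(request_blocks)
        return reuse_weight * overlap - load_weight * w.decode_load
    return max(workers, key=score)

workers = [
    Worker("gpu-0", cached_blocks={1, 2, 3, 4}, decode_load=0.9),  # hot but busy
    Worker("gpu-1", cached_blocks={1, 2}, decode_load=0.1),        # partial, idle
    Worker("gpu-2", cached_blocks=set(), decode_load=0.0),         # cold, idle
]
print(route({1, 2, 3, 4}, workers).name)  # reuse wins over load balancing
```

The instructive case is the one a round-robin balancer gets wrong: the busiest worker is still the cheapest choice when it already holds the whole prefix, while a mostly cold request should fall through to the idle replica.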
There is also a subtle strategic move here. NVIDIA is climbing the stack without needing to win the model war outright. Open models will keep rotating. Qwen, MiniMax, GLM, DeepSeek, and whatever comes next can take turns being fashionable. But if Dynamo becomes the orchestration layer that makes those models economically usable for agent backends, NVIDIA still owns a critical control point. That is a much stickier position than merely saying “our GPUs are fast.” Fast is table stakes. Cheap repeated thinking is the real product.
This is also where NVIDIA’s recent infrastructure messaging starts to line up. Earlier this week the company argued that AI buyers should stop purchasing on FLOPS and start purchasing on token economics. Dynamo is the operational expression of that thesis. You do not get lower cost per useful token from silicon alone. You get it from routing, cache retention, engine scheduling, and knowing when not to redo work the cluster has already paid for. The market has spent a year obsessing over the intelligence of agents. NVIDIA is making a credible case that the next round of differentiation comes from how efficiently those agents remember.
Practitioners should read this with both interest and caution. Interest, because NVIDIA is clearly targeting a real bottleneck. If you are serving coding models, you should already be measuring cache-hit rate, time to first token under repeated-prefix workloads, eviction behavior during tool-call gaps, and how often a session lands on the wrong worker. If you are not, you are probably optimizing the wrong layer. Caution, because the feature surface is sprawling. Multi-protocol support, backend normalization across SGLang, vLLM, and TensorRT-LLM, custom routers, speculative prefill, distributed KV hierarchies, cache pinning, and priority scheduling are all good ideas individually. In aggregate, they form a large blast radius for operational bugs.
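The metrics above are cheap to compute from request logs. A minimal sketch, assuming hypothetical log fields (`ttft_s`, `cached_prefix_tokens`, `prompt_tokens`) that your serving stack may name differently:

```python
import statistics

# Illustrative per-session request logs; field names and values are assumed.
logs = [
    {"ttft_s": 2.10, "cached_prefix_tokens": 0,      "prompt_tokens": 52_000},
    {"ttft_s": 0.31, "cached_prefix_tokens": 50_000, "prompt_tokens": 53_000},
    {"ttft_s": 0.35, "cached_prefix_tokens": 51_000, "prompt_tokens": 54_500},
    {"ttft_s": 1.80, "cached_prefix_tokens": 4_000,  "prompt_tokens": 55_000},
]

# Token-weighted cache-hit rate across the session.
hit_rate = (sum(r["cached_prefix_tokens"] for r in logs)
            / sum(r["prompt_tokens"] for r in logs))
median_ttft = statistics.median(r["ttft_s"] for r in logs)
# After the first call, a mostly-cold request suggests the session landed
# on the wrong worker (threshold of 50% reuse is an assumed heuristic).
misroutes = sum(1 for r in logs[1:]
                if r["cached_prefix_tokens"] / r["prompt_tokens"] < 0.5)

print(f"session cache-hit rate : {hit_rate:.1%}")
print(f"median TTFT            : {median_ttft:.2f}s")
print(f"likely misroutes       : {misroutes}")
```

The fourth log line is the pattern worth alerting on: a mid-session TTFT spike paired with near-zero prefix reuse usually means routing, not the model, is the problem.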
That maturity question is the main thing to watch next. Dynamo’s architecture is pointing in the right direction, but the hard part is not publishing a blog post full of smart abstractions. The hard part is making those abstractions boring enough that infra teams trust them under real load. The GitHub project is already gaining meaningful traction, and that usually brings the right kind of pressure: fewer keynote claims, more issue threads about the exact failure mode that woke someone up at 3 a.m.
My take is simple. NVIDIA’s latest agent announcement is not really about agents. It is about turning cache management into a product category and treating long-context inference like the stateful distributed system it has quietly become. That is a better read of where coding-agent infrastructure is headed than another demo about tool use. The next moat in agent systems may be less about making models think harder and more about making them stop paying to think the same thought twice.
Sources: NVIDIA Technical Blog, Baseten, Dynamo GitHub