AgentPerf Is the Coding-Agent Benchmark NVIDIA Needed — and the Warning Label Agent Infrastructure Deserves

Coding agents are no longer a chatbot feature with a shell bolted on. They are a strange, expensive inference workload: long prompts, growing context, repeated turns, tool-call gaps, cache reuse, and enough concurrency to make tail latency matter more than the happy-path demo. That is why NVIDIA’s AgentPerf result is interesting even if you ignore the obvious vendor victory lap.

The company published a technical breakdown of Artificial Analysis’ new AA-AgentPerf benchmark, which measures hardware for agentic coding workloads rather than one-off completions. NVIDIA’s headline number is dramatic: its GB300 NVL72 rack-scale system supports up to roughly 20x more concurrent coding agents per megawatt than H200-class systems under the benchmark’s launch configuration. More specifically, NVIDIA reports 61.4K concurrent agents per megawatt and 57.5 concurrent agents per GPU for GB300 NVL72 at the 30 tokens/sec SLO tier, versus 2.6K concurrent agents per megawatt and 1.4 per GPU for H200.
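
As a quick sanity check, the ratios implied by those four figures can be worked out directly. This is a minimal sketch that uses only the rounded numbers quoted above; the constant names are ours, and the watts-per-agent line is derived arithmetic rather than anything NVIDIA reports. On the rounded figures, the per-megawatt ratio at this tier comes out somewhat above the "roughly 20x" headline, which may simply reflect rounding or a different comparison point.

```python
# Figures as quoted in the article for the 30 tokens/sec SLO tier.
GB300_AGENTS_PER_MW = 61_400
H200_AGENTS_PER_MW = 2_600
GB300_AGENTS_PER_GPU = 57.5
H200_AGENTS_PER_GPU = 1.4

WATTS_PER_MEGAWATT = 1_000_000

# How many times more agents GB300 NVL72 sustains, per megawatt and per GPU.
per_mw_ratio = GB300_AGENTS_PER_MW / H200_AGENTS_PER_MW
per_gpu_ratio = GB300_AGENTS_PER_GPU / H200_AGENTS_PER_GPU

# Derived view: the power budget behind each concurrent agent.
gb300_watts_per_agent = WATTS_PER_MEGAWATT / GB300_AGENTS_PER_MW
h200_watts_per_agent = WATTS_PER_MEGAWATT / H200_AGENTS_PER_MW

print(f"per megawatt: {per_mw_ratio:.1f}x")
print(f"per GPU:      {per_gpu_ratio:.1f}x")
print(f"watts per agent: GB300 {gb300_watts_per_agent:.1f}, H200 {h200_watts_per_agent:.1f}")
```

The per-GPU gap is wider than the per-megawatt gap, which is what you would expect if a GB300 NVL72 GPU draws more power than an H200 while doing far more work.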

That is the number NVIDIA wants procurement teams to remember. The more useful part for builders is the benchmark shape. AA-AgentPerf is trying to measure the factory underneath the agent, not the agent’s marketing page.

The benchmark finally looks like the workload

Most AI infrastructure comparisons still orbit around tokens per second, time-to-first-token, or a model’s score on a coding eval. Those numbers are not useless, but they flatten the weirdest parts of agent behavior. A coding agent does not just answer once. It reads files, asks for more context, calls tools, waits for test output, appends logs, retries failed plans, carries cached prefixes, and keeps a long-lived session warm while the user expects the whole thing to feel interactive.

AA-AgentPerf captures some of that mess by replaying prerecorded agentic coding trajectories built from public repositories across more than 12 programming languages. NVIDIA says request sequence lengths range from 5K to 131K tokens, with a mean around 27K tokens. Tool calls are mapped to representative CPU-side tasks and simulated with a distribution using a one-second median delay, then held constant across systems. The test set is private, which matters because benchmarks that become too predictable eventually become products’ favorite training dataset wearing a lab coat.
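
To make the simulated tool gap concrete, here is a minimal sketch of drawing delays with a one-second median and reusing the same draws for every system under test. The log-normal shape, the sigma value, and the fixed seed are all assumptions for illustration; the source only states the median and that the delays are held constant across systems.

```python
import math
import random
import statistics


def sample_tool_delays(n, median_s=1.0, sigma=0.8, seed=0):
    """Draw n simulated tool-call delays in seconds.

    Assumption: a log-normal distribution. Its median is exp(mu),
    so mu = log(median_s) pins the median where we want it.
    """
    rng = random.Random(seed)  # fixed seed so every system sees the same delays
    mu = math.log(median_s)
    return [rng.lognormvariate(mu, sigma) for _ in range(n)]


# Two systems under test get an identical delay schedule.
delays_system_a = sample_tool_delays(20_000)
delays_system_b = sample_tool_delays(20_000)

print(f"median delay: {statistics.median(delays_system_a):.2f} s")
print("identical across systems:", delays_system_a == delays_system_b)
```

Holding the schedule fixed is the point: any difference in results then comes from the inference system, not from luck in the tool delays.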

At launch, the benchmark focuses on DeepSeek-V4-Pro and evaluates multiple service-level objective (SLO) tiers. For the listed DeepSeek-V4-Pro setup, the tiers include 30 tokens/sec with a P95 time-to-first-token (TTFT) of 10 seconds, 100 tokens/sec with a P95 TTFT of 5 seconds, and 300 tokens/sec with a P95 TTFT of 3 seconds. The official result is the highest concurrency level that still satisfies the target SLO.
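
The scoring rule is simple enough to sketch. The tier values below are the ones listed above; everything else is an assumption for illustration, including the `Measurement` shape, the reading of tokens/sec as a per-session output speed, and the sweep numbers, which are invented. This is not Artificial Analysis' implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloTier:
    min_tokens_per_sec: float  # minimum output speed the tier requires
    max_p95_ttft_s: float      # maximum P95 time-to-first-token, in seconds


# Tiers as listed for the DeepSeek-V4-Pro setup.
TIERS = {
    "30tps": SloTier(30, 10.0),
    "100tps": SloTier(100, 5.0),
    "300tps": SloTier(300, 3.0),
}


@dataclass(frozen=True)
class Measurement:
    concurrency: int       # number of agent sessions running at once
    tokens_per_sec: float  # observed output speed at that concurrency
    p95_ttft_s: float      # observed P95 TTFT at that concurrency


def max_concurrency(measurements, tier):
    """Return the highest tested concurrency that meets both SLO limits, or 0."""
    passing = [
        m.concurrency
        for m in measurements
        if m.tokens_per_sec >= tier.min_tokens_per_sec
        and m.p95_ttft_s <= tier.max_p95_ttft_s
    ]
    return max(passing, default=0)


# Hypothetical sweep, invented for illustration only.
SWEEP = [
    Measurement(64, 120.0, 1.2),
    Measurement(128, 85.0, 2.5),
    Measurement(256, 42.0, 6.0),
    Measurement(512, 18.0, 14.0),
]

for name, tier in TIERS.items():
    print(name, max_concurrency(SWEEP, tier))
```

In this made-up sweep the same hardware scores 256 agents at the loosest tier and zero at the strictest, which is why a result only means something alongside the tier it was measured at.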

That methodology is not perfect, but it is pointed in the right direction. The right question is no longer “how fast can this system answer one prompt?” It is “how many agent sessions can stay usable while carrying realistic context and waiting on realistic tool gaps?” That is a much better proxy for the bill engineering teams are about to receive.

Per megawatt is the adult metric

The most important normalization in AgentPerf is per megawatt. That sounds like data-center plumbing because it is. Agent adoption turns inference from an API feature into a facilities problem: rack power, cooling, memory bandwidth, interconnect topology, KV-cache movement, CPU orchestration, and service-level objectives all become part of the developer experience.

This is where NVIDIA’s GB300 NVL72 story gets credible. The system links 72 Blackwell Ultra GPUs and 36 Grace CPUs into a rack-scale architecture with a high-bandwidth NVLink domain. NVIDIA’s GB300 NVL72 product page lists 130 TB/s of NVLink bandwidth, 37 TB of fast memory, 20 TB of GPU memory, and up to 576 TB/s of GPU memory bandwidth. For MoE-heavy models and long-running agent sessions, those are not decorative numbers. They determine how cheaply the system can keep experts, cache, and intermediate state moving while thousands of agents are active.
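
Dividing the rack-level figures by the GPU count gives a feel for what each GPU contributes. This is a rough sketch using only the numbers quoted above; it assumes the totals split evenly across the 72 GPUs and uses decimal units (1 TB = 1,000 GB), neither of which the product page states in this form.

```python
GPUS_PER_RACK = 72

# Rack-level figures as quoted from the GB300 NVL72 product page.
NVLINK_TBPS = 130          # aggregate NVLink bandwidth, TB/s
GPU_MEMORY_TB = 20         # total GPU memory, TB
GPU_MEMORY_BW_TBPS = 576   # aggregate GPU memory bandwidth, TB/s

# Assumption: an even split across GPUs, decimal units.
nvlink_per_gpu_tbps = NVLINK_TBPS / GPUS_PER_RACK
memory_per_gpu_gb = GPU_MEMORY_TB * 1000 / GPUS_PER_RACK
memory_bw_per_gpu_tbps = GPU_MEMORY_BW_TBPS / GPUS_PER_RACK

print(f"NVLink per GPU:           {nvlink_per_gpu_tbps:.2f} TB/s")
print(f"GPU memory per GPU:       {memory_per_gpu_gb:.0f} GB")
print(f"Memory bandwidth per GPU: {memory_bw_per_gpu_tbps:.1f} TB/s")
```

Combined with the 57.5 agents per GPU quoted earlier, the even-split memory figure works out to a few gigabytes of GPU memory per concurrent agent before model weights are accounted for, which is why cache handling matters so much at this scale.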

NVIDIA also points to the software side: TensorRT LLM, vLLM, and SGLang can use optimizations such as WideEP and DeepEP to spread MoE expert execution across the NVL72 domain. The post calls out DeepGEMM, Mega MoE, MXFP4/MXFP8 kernels, fused MoE execution, and overlapping NVLink communication with tensor-core compute. Translation: the benchmark rewards hardware/software co-design, not just a newer GPU on the same old serving assumptions.

That does not mean every team needs a rack-scale monster. If your “agent” mostly performs short edits over a small repo, the bottleneck may be the model, the tools, or your product workflow rather than the accelerator fabric. But if you are serving thousands of concurrent coding agents with long contexts and strict interactivity targets, the traditional tokens/sec chart is now lying by omission.

The warning label: simulated tools are not production tools

AgentPerf’s biggest caveat is also its most reasonable design choice: tool calls are simulated. That helps isolate accelerator performance and makes cross-system comparison possible. It also means the benchmark does not capture the full end-to-end pain of production agents.

Real tools are rude. Tests fail slowly. Package installs hang. Sandboxes take time to start. Repo search gets weird on monorepos. Internal APIs rate-limit. Vector stores miss. CI queues back up. Filesystems and containers add their own tax. A benchmark with a one-second median tool delay is useful for comparing inference systems, but it should not be mistaken for a full user-experience benchmark.

The model choice matters too. Launching with DeepSeek-V4-Pro makes sense for MoE-heavy coding workloads, but not every team is running that architecture. Some will use smaller dense models for cheap substeps, hosted frontier models for planning, local Qwen or Nemotron variants for private code, or routers that split work across several models. AgentPerf is a capacity-planning input, not a universal verdict.

The right response is trace replay. Before shopping for hardware or providers, teams should capture their own agent workload: context length per request, cached-token ratio, output tokens, tool-call latency, time-to-first-token, wall-clock task time, retry loops, compaction events, model-routing decisions, and accepted-code rate. Then compare systems against the shape of work they actually do. “Concurrent agents per megawatt” is a useful industry metric. “Cost per accepted PR-sized change in our repo” is the metric that will survive the finance meeting.
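
A captured trace does not need to be elaborate to be useful. The sketch below shows one possible record shape and how a few of the metrics listed above fall out of it. The field names, the `Prices` structure, the per-token prices, and the two sample traces are all hypothetical; substitute your own logs and your provider's or cluster's actual costs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskTrace:
    input_tokens: int    # prompt tokens summed over every request in the task
    cached_tokens: int   # portion of input_tokens served from cache
    output_tokens: int
    tool_seconds: float  # time spent waiting on tool calls
    wall_seconds: float  # end-to-end task time
    retries: int
    accepted: bool       # whether the resulting change was accepted


@dataclass(frozen=True)
class Prices:
    """USD per million tokens. The values used below are placeholders."""
    uncached_input: float
    cached_input: float
    output: float


def task_cost(trace, prices):
    """Token cost of one task in USD, pricing cached input separately."""
    uncached = trace.input_tokens - trace.cached_tokens
    total = (
        uncached * prices.uncached_input
        + trace.cached_tokens * prices.cached_input
        + trace.output_tokens * prices.output
    )
    return total / 1_000_000


def summarize(traces, prices):
    """Aggregate a non-empty list of traces into a few planning metrics."""
    total_cost = sum(task_cost(t, prices) for t in traces)
    accepted = sum(1 for t in traces if t.accepted)
    return {
        "cached_token_ratio": sum(t.cached_tokens for t in traces)
        / sum(t.input_tokens for t in traces),
        "tool_time_share": sum(t.tool_seconds for t in traces)
        / sum(t.wall_seconds for t in traces),
        # Rejected work still costs money, so it is charged to accepted changes.
        "cost_per_accepted_change": total_cost / accepted if accepted else float("inf"),
    }


# Hypothetical traces and prices, for illustration only.
traces = [
    TaskTrace(1_000_000, 600_000, 50_000, 120.0, 300.0, 1, True),
    TaskTrace(500_000, 400_000, 20_000, 60.0, 100.0, 3, False),
]
summary = summarize(traces, Prices(uncached_input=2.0, cached_input=0.5, output=10.0))
for key, value in summary.items():
    print(f"{key}: {value:.3f}")
```

Charging rejected attempts to the accepted ones is a deliberate choice here: it keeps retries and abandoned sessions visible in the number finance will actually ask about.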

This changes coding-agent comparisons

The search market is full of “best AI coding agents” comparisons that rank Codex, Claude Code, Cursor, OpenCode, Gemini, and local tools by UX, features, and model quality. That is fine for individual developers. It is incomplete for engineering organizations.

A serious coding-agent comparison now needs an infrastructure section: cache behavior, context growth, retry rate, tool-call latency, sandbox overhead, model routing, observability, runaway-session controls, self-hosted inference options, and SLOs under concurrency. The user buys the product UI. The company pays for the inference system behind it.

That is why AgentPerf matters even if NVIDIA’s launch-day chart should be treated with the usual vendor-benchmark caution. It pushes the conversation from “which model is smartest?” toward “which stack can keep agents responsive, affordable, and measurable when everyone starts using them at once?” That is the right question. Coding agents are not chatbots with a tool belt. They are distributed systems that happen to speak English.

NVIDIA’s win here is not merely that GB300 NVL72 looks good against H200. The stronger win is that NVIDIA helped frame the benchmark around concurrency, context, cache, SLOs, and watts. If AgentPerf becomes a real procurement metric, the industry will at least be arguing about the right layer of the stack.

Sources: NVIDIA Developer Blog, NVIDIA Blog, Artificial Analysis AA-AgentPerf, NVIDIA GB300 NVL72, Together AI + Cursor case study