Vera’s First Public Benchmarks Say the CPU Is Back in the AI Factory Conversation

The funny thing about the GPU era is how quickly everyone started pretending the CPU had become a footnote. NVIDIA’s first public Vera benchmark story is a useful correction. If AI factories are going to run long-lived agents, reinforcement-learning environments, code sandboxes, database queries, KV-cache plumbing, and thousands of tool calls around the model, the host CPU is not plumbing. It is the part of the system that decides whether the very expensive accelerators spend their day working or waiting.

NVIDIA’s new Vera CPU benchmark recap, based on Phoronix testing, lands as more than an Arm-versus-x86 scorecard. Vera uses 88 NVIDIA custom Olympus cores, supports Armv9.2, and is aimed squarely at the sequential, branch-heavy, memory-sensitive work that surrounds inference. NVIDIA says the single-socket Vera system Phoronix tested is rated at 450W TDP, while its LPDDR5X memory subsystem consumes less than 30W and delivers up to 1.2 TB/s of memory bandwidth. That last number is the one operators should circle.

For years, AI infrastructure buying has been reduced to accelerator procurement: how many GPUs, how much HBM, how many FLOPS, how fast can the cluster train or serve the model. That framing was always incomplete, but agentic workloads make it actively misleading. A coding agent does not just emit tokens. It spins up sandboxes, searches files, runs tests, compiles projects, calls tools, manages logs, evaluates results, and starts the loop again. Reinforcement-learning systems for software and browser tasks do the same thing at industrial scale: run thousands of environments, capture outcomes, update policies, and keep the inference side fed.

The benchmark that matters is not just tokens per second

NVIDIA says Vera sustained 90% of peak memory bandwidth in Phoronix’s STREAM TRIAD test, which it calls the highest percentage of rated peak bandwidth Phoronix has measured for a CPU in that workload. The company also claims more than 4x memory bandwidth per core versus traditional x86 CPUs, a 1.6x geometric-mean increase over the prior-generation Grace CPU, and a 1.5x overall performance advantage against a latest-generation 128-core x86 processor.
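
As a rough sanity check, the quoted figures can be turned into per-core numbers. The sketch below uses only values stated in this article (1.2 TB/s rated peak, 90% sustained in STREAM TRIAD, 88 cores). It assumes decimal units (1 TB = 10^12 bytes) and an even split of bandwidth across cores, which is a simplification real workloads will not match exactly.

```python
# Back-of-the-envelope arithmetic from the figures quoted in the article.
# Assumptions (not from the source): decimal units and an even per-core split.

PEAK_BW_BYTES_PER_S = 1.2e12   # "up to 1.2 TB/s" rated peak
SUSTAINED_FRACTION = 0.90      # share of peak reported for STREAM TRIAD
CORES = 88                     # Olympus cores per socket


def sustained_bandwidth(peak: float, fraction: float) -> float:
    """Bandwidth actually delivered, in bytes per second."""
    return peak * fraction


def per_core(bandwidth: float, cores: int) -> float:
    """Even split of a socket-wide bandwidth figure across cores."""
    return bandwidth / cores


sustained = sustained_bandwidth(PEAK_BW_BYTES_PER_S, SUSTAINED_FRACTION)
print(f"sustained: {sustained / 1e12:.2f} TB/s")
print(f"peak per core: {per_core(PEAK_BW_BYTES_PER_S, CORES) / 1e9:.1f} GB/s")
print(f"sustained per core: {per_core(sustained, CORES) / 1e9:.1f} GB/s")
```

Under those assumptions the socket sustains roughly 1.08 TB/s, or about 13.6 GB/s per core at rated peak and 12.3 GB/s per core sustained.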

There are also developer-shaped numbers. NVIDIA says single-socket Vera compiled a default Linux kernel in 20 seconds, the fastest Phoronix measured in that test, and delivered 2x faster per-core Linux kernel compilation than a 128-core processor. Kernel compilation is not a perfect proxy for agent infrastructure, but it is a decent smell test: lots of files, lots of process coordination, real compiler behavior, and enough memory pressure to expose weak system balance.

Michael Larabel of Phoronix gave NVIDIA the kind of quote every silicon vendor wants, calling Vera “the most formidable competition to Intel and AMD x86_64 processors ever realized” and saying it is “packing a heavy-hitting punch with competitiveness to Intel/AMD x86_64 CPUs that I have never seen out of any other ARM or non-x86_64 processors.” Vendor blogs naturally select the flattering bits, but the underlying point is hard to ignore: Arm server CPUs are no longer interesting only when bundled inside a hyperscaler’s opaque instance type.

The better practitioner read is narrower and more useful: AI systems are becoming mixed-workload systems again. The GPU does the dense math. The CPU runs the messy world around it.

Agent infrastructure is where host CPUs become visible again

Vera’s product positioning makes this explicit. NVIDIA describes the CPU as purpose-built for reinforcement learning and agentic AI, claiming software environments can run up to 50% faster with twice the efficiency of traditional CPU infrastructure. Its architecture materials describe up to 1.5 TB of memory capacity, 3.4 TB/s bisection bandwidth in the second-generation NVIDIA Scalable Coherency Fabric, and 1.8 TB/s coherent bandwidth via second-generation NVLink-C2C. This is not a “CPU for office workloads” pitch. It is a CPU for keeping GPUs surrounded by enough fast host-side work that the cluster behaves like a system.

Prime Intellect’s Vera comments sharpen the point. The company says each Vera socket has 88 cores and 176 total threads via Spatial Multithreading, and reports stable operation with 176 VMs simultaneously per socket for sandbox workloads. It also reported roughly 20% average throughput gain from SMT and 30% greater throughput per CPU than an AMD Zen 5 baseline for realistic RL sandbox workloads. That is the benchmark shape agent teams should care about: concurrent environments per socket, startup latency, isolation overhead, and how many GPU-seconds get wasted while the host side catches up.

This is also why the Redpanda numbers matter, even if they come from a different workload category. Redpanda previously reported up to 5.5x lower streaming latencies than AMD EPYC Turin, up to 73% faster cross-core SQL shuffle throughput than Turin, and a Star Schema Benchmark Q4 join finishing in 384.1 ms versus 426.3 ms on Turin and 642.6 ms on Genoa. Agents increasingly sit next to queues, logs, metadata stores, vector stores, streaming systems, and databases. If those systems are slow, the model waits politely while your architecture burns money.

That is the hidden economics of agentic AI. A cluster can have excellent model throughput and still deliver poor product economics if orchestration, cache movement, sandbox boot, database access, or tool execution creates bubbles in the schedule. The expensive failure mode is not “the GPU cannot run the model.” It is “the GPU is idle because the rest of the system was designed like an afterthought.”
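
The idle-GPU argument can be made concrete with a toy model. The sketch below is illustrative only: the function names, the overlap parameter, and every number in the example are hypothetical, not measurements from NVIDIA, Phoronix, or any other vendor. It treats one agent step as GPU time plus whatever host-side time cannot be hidden behind GPU work.

```python
# Toy model of schedule bubbles in an agent step. All names and numbers are
# hypothetical; nothing here is measured data.


def gpu_busy_fraction(gpu_s: float, host_s: float, overlap: float = 0.0) -> float:
    """Fraction of wall-clock time the GPU is doing work in one agent step.

    gpu_s:   seconds of GPU work per step
    host_s:  seconds of host-side work per step (sandbox, tools, database)
    overlap: share of host work hidden behind GPU work, from 0.0 to 1.0
    """
    if gpu_s <= 0 or host_s < 0 or not 0.0 <= overlap <= 1.0:
        raise ValueError("invalid step timings")
    exposed_host_s = host_s * (1.0 - overlap)
    return gpu_s / (gpu_s + exposed_host_s)


def cost_per_step(gpu_hourly_cost: float, gpu_s: float, host_s: float,
                  overlap: float = 0.0) -> float:
    """GPU cost attributed to one step, counting the time the GPU sits idle."""
    wall_s = gpu_s + host_s * (1.0 - overlap)
    return gpu_hourly_cost * wall_s / 3600.0


# Hypothetical example: 2 s of GPU work, 3 s of host work, no overlap.
slow_host = gpu_busy_fraction(gpu_s=2.0, host_s=3.0)
# Same step with host work cut to 2 s (a 1.5x faster host).
fast_host = gpu_busy_fraction(gpu_s=2.0, host_s=2.0)
print(f"GPU busy: {slow_host:.0%} -> {fast_host:.0%}")
```

In this made-up case the GPU goes from 40% busy to 50% busy without any change to the model or the accelerator, which is the kind of effect a faster host is supposed to buy.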

Do not turn this into CPU tribalism

The obvious temptation is to read Vera as NVIDIA taking a swing at Intel and AMD. It is, but that is the least interesting version of the story. Arm versus x86 arguments tend to become religion quickly, and religion is a bad capacity-planning tool. The practical question is whether Vera improves cost per useful agent step, cost per rollout, cost per served request, or cost per compiled-and-tested code change under real concurrency.

There are also caveats. NVIDIA’s post is a vendor recap of independent testing, and Phoronix’s full review was not directly accessible when this piece was written. Compiler support is still maturing: prior Phoronix coverage noted GCC and LLVM/Clang patches for Olympus, including Armv9.2-A and extensions such as SVE2, MEMTAG, FP8DOT2, and cryptographic/vector features, but early enablement does not automatically mean every workload is tuned. NVIDIA says partner systems arrive in the second half of the year, with early systems already delivered to companies including Anthropic, OpenAI, xAI/SpaceXAI, and Oracle Cloud Infrastructure. That is strong validation, not a substitute for your own benchmarks.

So the action item is not “buy Vera immediately.” The action item is to add CPU-side agent economics to infrastructure review. If your team runs coding agents, RL environments, self-hosted inference, or tool-heavy automation, benchmark sandbox boot time, p95 and p99 tool latency, kernel and dependency compilation, Python and JVM behavior, compression, database joins, KV-cache transfer, scheduler overhead, and environment density per socket. Then run those benchmarks under realistic concurrency, because isolated microbenchmarks are where infrastructure lies to you politely.
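
A minimal sketch of what measuring tail latency under concurrency can look like follows. It is a harness outline, not a recommended tool: `fake_tool_call` is a hypothetical stand-in for a real sandbox boot or tool invocation, the call and worker counts are arbitrary, and the nearest-rank percentile is only one of several reasonable definitions.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def percentile(samples: List[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples or not 0 < pct <= 100:
        raise ValueError("need samples and 0 < pct <= 100")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100.0)
    return ordered[rank - 1]


def timed(task: Callable[[], None]) -> float:
    """Run one task and return its wall-clock latency in seconds."""
    start = time.perf_counter()
    task()
    return time.perf_counter() - start


def run_concurrent(task: Callable[[], None], calls: int, workers: int) -> List[float]:
    """Issue `calls` invocations of `task` across `workers` threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda _: timed(task), range(calls)))


def fake_tool_call() -> None:
    """Hypothetical placeholder; swap in a real sandbox boot or tool call."""
    time.sleep(0.001)


latencies = run_concurrent(fake_tool_call, calls=200, workers=16)
print(f"p95={percentile(latencies, 95) * 1e3:.2f} ms "
      f"p99={percentile(latencies, 99) * 1e3:.2f} ms")
```

Threads suit I/O-bound calls such as launching a subprocess or querying a service; CPU-bound Python work would need processes instead. Sweeping `workers` upward toward the intended environment density per socket shows where the tail starts to degrade, which a single-threaded microbenchmark cannot show.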

Vera may end up being the right answer for a subset of those workloads. It may also push the rest of the market to take host-side AI infrastructure more seriously. Either outcome is useful. The GPU era still needs a serious CPU, not because the CPU is glamorous again, but because agents make all the unglamorous work visible. NVIDIA’s smartest move here is not proving that Arm can win a benchmark. It is reminding AI teams that the model is only one process in a much larger system.

Sources: NVIDIA Blog, Phoronix, NVIDIA Vera CPU, Prime Intellect, Redpanda