Blackwell’s MLPerf Sweep Is Really a Software-Stack Story

Anatoliy Kolodkin

16 Jun 2026 • 5 min read

MLPerf results usually arrive dressed as trophy charts: fastest here, largest there, a few bars that get recycled into cloud sales decks before lunch. NVIDIA’s MLPerf Training 6.0 sweep is more useful than that if you read it as an engineering diff. The headline is Blackwell winning every submitted training benchmark. The substance is that modern AI training — especially mixture-of-experts training — is now won in the seams between kernels, fabrics, routers, CPU orchestration, and precision formats.

That matters because MLPerf 6.0 is less stuck in yesterday’s model zoo than prior rounds. MLCommons added two MoE pretraining workloads: DeepSeek-V3 671B and GPT-OSS-20B. Those are not cute academic stand-ins. They represent the shape of the models behind long-context agents, coding assistants, retrieval-heavy enterprise systems, and the next round of “why is my inference bill doing that?” conversations. Sparse, routed models do not behave like dense transformers with a bigger invoice. They stress communication, scheduling, memory movement, and the parts of the stack developers usually ignore until the profiler starts screaming.

NVIDIA says it was the only platform to submit across every MLPerf Training 6.0 benchmark and the only one to submit results on both new MoE workloads. The numbers are appropriately absurd: 2.02 minutes to train DeepSeek-V3 671B on 8,192 GB300 NVL72 GPUs, 7.43 minutes for GPT-OSS-20B on 512 GB300 NVL72 GPUs, 7.07 minutes for Llama 3.1 405B on 8,192 GB200 NVL72 GPUs, and 0.4 minutes for Llama 2 70B LoRA on 512 GB300 NVL72 GPUs. CoreWeave delivered the DeepSeek-V3 result using GB300 NVL72 systems with Spectrum-X Ethernet; Microsoft Azure scaled the Llama 3.1 405B run to 8,192 GB200 NVL72 GPUs.

The benchmark is saying “systems,” not “silicon”

It is tempting to reduce this to “Blackwell fast.” That is technically true and not very interesting. The better read is that NVIDIA keeps turning training into a vertically optimized systems problem. The published stack includes full-iteration CUDA Graphs for token-dropless MoEs, synchronization-free expert operators, paged stashing for GPU-side memory management, CuTe DSL fusions, MXFP8 attention, router and hybrid expert-parallel optimizations, one-forward-one-backward all-to-all overlap, and pipeline-stage balancing.

Those details are not garnish. MoE models introduce dynamic token routing and expert imbalance. That means the training step can spend less time doing useful math and more time negotiating which expert receives which tokens, moving activations across devices, waiting on all-to-all communication, or bouncing through host-side decisions. At 8,192 GPUs, a small orchestration wart stops being a wart and becomes a line item. NVIDIA’s work is aimed at removing those stalls from the hot path.
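To make the imbalance concrete, here is a minimal Python sketch of top-k routing with skewed expert popularity. Everything in it is an assumption for illustration: the `route_tokens` and `imbalance` helpers, the Zipf-like `skew` weighting, and the token and expert counts are hypothetical and do not come from NVIDIA's submission or any real router. It only shows how the busiest expert's load pulls away from the mean as popularity skews.

```python
import random
from collections import Counter

def route_tokens(num_tokens, num_experts, top_k, skew, seed=0):
    """Assign each token to top_k distinct experts; return per-expert token counts.

    skew=0 gives uniform popularity; larger skew favors low-index experts.
    The 1/(rank+1)**skew weighting is a hypothetical stand-in for router scores.
    """
    rng = random.Random(seed)
    weights = [1.0 / (rank + 1) ** skew for rank in range(num_experts)]
    counts = Counter()
    for _ in range(num_tokens):
        chosen = set()
        while len(chosen) < top_k:  # resample until we have top_k distinct experts
            chosen.add(rng.choices(range(num_experts), weights=weights)[0])
        counts.update(chosen)
    return [counts[e] for e in range(num_experts)]

def imbalance(counts):
    """Max load divided by mean load; 1.0 means perfectly balanced."""
    mean = sum(counts) / len(counts)
    return max(counts) / mean

uniform = route_tokens(4096, 8, top_k=2, skew=0.0)
skewed = route_tokens(4096, 8, top_k=2, skew=1.0)
print(f"uniform max/mean: {imbalance(uniform):.2f}")
print(f"skewed  max/mean: {imbalance(skewed):.2f}")
```

In a dropless setup every routed token has to be processed, so a max/mean ratio well above 1.0 roughly means the devices hosting quiet experts wait on the ones hosting popular experts.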

The most revealing phrase in the technical brief is “full-iteration CUDA Graphs.” CUDA Graphs are not new, but using them across a dynamic MoE iteration is hard because the system wants to vary shapes and decisions as tokens route differently. NVIDIA’s answer is to move more decisions GPU-side: paged stashing, GPU-derived shapes, synchronization-free grouped GEMMs, and fused operators that do not need the CPU to keep approving every step. This is the boring future of frontier training: making the host disappear often matters as much as making the GPU faster.
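A toy cost model can show why host involvement hurts. The sketch below is not how CUDA or Megatron-LM measures anything: the function names, the launch and sync costs, and the kernel durations are all hypothetical, and the model simply assumes launches run ahead of the GPU until the host has to read a result back.

```python
def eager_step_us(kernel_us, launch_us, sync_after, sync_us):
    """Rough time for a step launched kernel by kernel from the host.

    kernel_us:  GPU duration of each kernel, in microseconds.
    launch_us:  host cost to launch one kernel (hypothetical).
    sync_after: indices of kernels after which the host must read a GPU result
                (for example, a routed-token count) before launching more work.
    sync_us:    extra latency paid at each such sync (hypothetical).
    Between syncs, a segment costs max(total GPU time, total launch time).
    """
    total, seg_gpu, seg_launch = 0.0, 0.0, 0.0
    for i, k in enumerate(kernel_us):
        seg_gpu += k
        seg_launch += launch_us
        if i in sync_after:
            total += max(seg_gpu, seg_launch) + sync_us
            seg_gpu = seg_launch = 0.0
    total += max(seg_gpu, seg_launch)  # trailing segment after the last sync
    return total

def graphed_step_us(kernel_us, graph_launch_us):
    """Same kernels replayed as one captured graph: one launch, no host syncs."""
    return graph_launch_us + sum(kernel_us)

kernels = [5.0] * 200  # many small kernels, as in a routed MoE layer (assumed sizes)
eager = eager_step_us(kernels, launch_us=8.0, sync_after={49, 99, 149}, sync_us=20.0)
graphed = graphed_step_us(kernels, graph_launch_us=10.0)
print(f"eager: {eager:.0f} us, graphed: {graphed:.0f} us")
```

The design point is that when kernels are shorter than their launch cost, the host becomes the bottleneck, and every shape decision that needs a host readback adds a stall. Deriving shapes on the GPU is what lets the whole iteration sit inside one graph.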

NVIDIA claims CuTe DSL fusions plus CUDA Graph enablement produced more than 8% end-to-end benefit on DeepSeek-V3 and a 93% end-to-end speedup on GPT-OSS. Router optimization moved top-k and score-related elementwise work from FP64 to FP32 and delivered a 5x kernel speedup. HybridEP metadata and kernel tuning contributed about 5% end-to-end. One-forward-one-backward all-to-all overlap improvements reached nearly 100% communication overlap, producing roughly 8% benefit. These are the numbers practitioners should care about because they point to failure modes you can actually inspect in your own stack.
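The gap between a kernel speedup and an end-to-end gain is plain Amdahl's-law arithmetic, and the overlap number follows from how much communication stays exposed. The sketch below uses hypothetical fractions (the 3% router share and the 10% communication share are assumptions, not figures from NVIDIA); only the 5x kernel speedup is taken from the claims above.

```python
def end_to_end_speedup(fraction, local_speedup):
    """Amdahl's law: overall speedup when `fraction` of step time runs `local_speedup` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)

def step_time(compute, comm, overlap):
    """Step time when a share `overlap` (0..1) of communication hides behind compute."""
    return compute + (1.0 - overlap) * comm

# Hypothetical: router kernels are 3% of the step and get the reported 5x kernel speedup.
print(f"router fix, end to end: {end_to_end_speedup(0.03, 5.0):.3f}x")

# Hypothetical: communication costs 10% of compute time; overlap goes from 20% to 100%.
before = step_time(100.0, 10.0, overlap=0.2)
after = step_time(100.0, 10.0, overlap=1.0)
print(f"overlap fix, end to end: {before / after:.3f}x")
```

Running the same arithmetic on your own trace tells you which fix is worth the engineering time before you write a single kernel.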

Ethernet gets serious only when it stops acting generic

The CoreWeave result is also a fabric story. MoE all-to-all traffic is unpleasant: bursty, uneven, and prone to incast when many senders target popular experts at once. Commodity Ethernet with static hashing is not designed around that workload. NVIDIA’s Spectrum-X pitch is that Ethernet can be made suitable for AI fabrics when adaptive routing distributes packets based on real-time link load and congestion control paces senders before buffers melt down.
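The difference between static hashing and load-aware spraying is easy to see in a toy simulation. This is a sketch under stated assumptions, not a model of Spectrum-X internals: `static_hash_load` and `adaptive_load` are hypothetical helpers, flow sizes are counted in packets, a seeded random generator stands in for the hash, and reordering and congestion control are ignored entirely.

```python
import random

def static_hash_load(flow_sizes, num_links, seed=0):
    """Pin each whole flow to one link, chosen without regard to load (ECMP-style)."""
    rng = random.Random(seed)  # stands in for a hash of the flow's header fields
    load = [0] * num_links
    for size in flow_sizes:
        load[rng.randrange(num_links)] += size
    return load

def adaptive_load(flow_sizes, num_links):
    """Place every packet on whichever link currently carries the least traffic."""
    load = [0] * num_links
    for size in flow_sizes:
        for _ in range(size):
            least = load.index(min(load))
            load[least] += 1
    return load

# Hypothetical traffic: 8 heavy flows toward popular experts plus 64 light ones.
flows = [400] * 8 + [10] * 64
static = static_hash_load(flows, num_links=8)
adaptive = adaptive_load(flows, num_links=8)
print("static   per-link load:", static)
print("adaptive per-link load:", adaptive)
```

With static hashing, two heavy flows landing on the same link is a matter of luck, and the most loaded link sets the tail. Per-packet placement evens the load, at the price of out-of-order arrival that the endpoints then have to handle.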

That distinction matters for platform teams. “Ethernet versus InfiniBand” is a shallow procurement argument if you ignore the workload. The real question is whether the fabric can keep tail latency from leaking into the training step. NVIDIA says Spectrum-X Advanced Adaptive Routing and Spectrum-X Congestion Control are designed exactly for that. If you are planning MoE training or high-volume MoE serving, do not ask whether the network has a familiar logo. Ask whether all-to-all is hidden, whether incast is controlled, whether out-of-order delivery is handled cleanly, and whether your profiler shows communication as a tax or a rounding error.

GB300 NVL72 adds another layer. NVIDIA says GB300 delivered up to 1.6x faster training than GB200 NVL72 at the same scale, helped by higher compute density with NVFP4, expanded memory capacity, and a higher power ceiling. Again: useful, but only in context. Better memory and precision formats are meaningful because the software stack can exploit them. Peak FLOPS alone do not train an MoE model. Good kernels, balanced experts, overlapped communication, and sane pipeline stages do.

What builders should do with this

Most engineering teams are not training DeepSeek-V3 on 8,192 GPUs. That does not make the result irrelevant. MLPerf is useful here as a map of bottlenecks. If NVIDIA is spending effort on CPU-free execution paths, expert routing, all-to-all overlap, pipeline balancing, and fused epilogues, that is where serious training and serving systems are bleeding cost.

The practical move is to profile your workload against the same categories. Are you exposing CPU synchronization in the middle of a supposed GPU-bound step? Are memory-bound epilogues unfused? Is expert imbalance making one stage a celebrity bottleneck? Is all-to-all communication hidden behind compute or sitting in the trace like a stalled elevator? Are pipeline bubbles visible? Are precision transitions deliberate, or accidental overhead? These questions apply whether you are pretraining a large model, fine-tuning a domain MoE, or operating long-context agents whose economics depend on sparse routed inference.
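One of those questions, whether communication is hidden behind compute, reduces to interval arithmetic once you have a trace. The sketch below assumes you have already exported compute and communication spans as `(start, end)` pairs from your profiler; the `merge` and `exposed_comm` helpers and the sample spans are hypothetical, not part of any profiling tool.

```python
def merge(intervals):
    """Merge overlapping (start, end) intervals into a sorted, disjoint list."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def exposed_comm(comm, compute):
    """Total communication time that no compute interval covers."""
    busy = merge(compute)
    exposed = 0.0
    for c_start, c_end in merge(comm):
        covered = 0.0
        for b_start, b_end in busy:
            lo, hi = max(c_start, b_start), min(c_end, b_end)
            if hi > lo:
                covered += hi - lo
        exposed += (c_end - c_start) - covered
    return exposed

# Hypothetical spans in milliseconds from one training step.
compute_spans = [(0, 10), (12, 20)]
comm_spans = [(8, 14), (18, 25)]
print(f"exposed communication: {exposed_comm(comm_spans, compute_spans)} ms")
```

Tracking that one number per step over time is a cheap way to notice when a config change quietly pushes all-to-all back onto the critical path.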

Use MLPerf as a reference point, not a procurement oracle. MLCommons itself warns that benchmark variance remains meaningful — roughly plus or minus 2.5% for imaging and plus or minus 5% for other benchmarks. Vendors optimize for benchmarks because that is the point. Your data pipeline, failure recovery, checkpoint cadence, observability, compliance controls, and cost model will be messier. The value is not “your run will take 2.02 minutes.” The value is seeing what a tuned path attacks.
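A crude way to apply that warning when comparing two runs of the same benchmark is to check whether their noise bands overlap. This is a rule-of-thumb sketch, not a statistical test and not an MLCommons procedure: `beyond_noise` is a hypothetical helper, the 5% default comes from the figure quoted above, and the times are made up.

```python
def beyond_noise(time_a, time_b, rel_noise=0.05):
    """True if the +/- rel_noise bands around the two times do not overlap."""
    lo_a, hi_a = time_a * (1 - rel_noise), time_a * (1 + rel_noise)
    lo_b, hi_b = time_b * (1 - rel_noise), time_b * (1 + rel_noise)
    return hi_a < lo_b or hi_b < lo_a

# Hypothetical time-to-train results, in minutes, for the same benchmark and scale.
print(beyond_noise(10.0, 10.4))  # a 4% gap
print(beyond_noise(10.0, 12.0))  # a 20% gap
```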

The larger industry read is straightforward: model economics are being decided below the API layer. AI coding agents, multimodal assistants, and enterprise agent systems depend on models that are sparse, routed, long-running, and expensive to train. If the stack can squeeze waste out of those workloads, more teams get access to better models and lower per-task costs. If not, capability stays trapped behind a few hyperscaler endpoints and pricing pages that quietly punish ambition.

So yes, Blackwell swept MLPerf Training 6.0. Fine. The more important endorsement is of the engineering direction: stop pretending model scale is just a hardware contest. The frontier is now the invisible work between operators. NVIDIA is very good at selling GPUs, but this result is a reminder that its strongest moat may be the software and fabric work that keeps those GPUs from waiting around expensively.

Sources: NVIDIA Developer Blog, NVIDIA Blog, MLCommons MLPerf Training, Megatron-LM CUDA Graph docs
