NVIDIA’s NVFP4 MaxText Recipe Makes 4-Bit Training Look Operational, Not Experimental

Everyone wants to talk about agents because agents demo well. NVIDIA’s more important June 9 story is less photogenic: making 4-bit training boring enough that infrastructure teams can put it into a budget spreadsheet.

The company published a JAX and MaxText recipe for training large language models with NVFP4 on Blackwell systems, claiming 4-bit mixed-precision pretraining with no measurable accuracy loss versus an FP8 baseline. That sentence sounds like the usual precision-format victory lap until you look at the boundaries NVIDIA chose. This is not “flip the whole model to 4-bit and pray.” It is a selective training recipe: quantize the transformer MLP GEMMs, keep attention in higher precision, use 16-element microscaling, preserve transpose consistency with 2D weight scaling, smooth WGRAD outliers with a Random Hadamard Transform, and apply stochastic rounding only where gradient updates need it.

That is the part practitioners should care about. Precision gains do not come from inventing smaller numbers. They come from deciding exactly where smaller numbers do not break convergence.
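To see why the recipe bothers with stochastic rounding on the gradient side, here is a minimal sketch, not Transformer Engine's implementation. It rounds a value onto the E2M1 magnitude grid either deterministically or stochastically. The function names and the fixed test value are assumptions made for illustration. The point is only that round-to-nearest maps a small gradient to zero every time, while stochastic rounding is unbiased on average.

```python
import random

# Non-negative magnitudes representable in E2M1 (the sign is a separate bit).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def round_nearest(x):
    """Deterministic rounding to the closest E2M1 value (saturating at 6)."""
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), E2M1_GRID[-1])
    return sign * min(E2M1_GRID, key=lambda g: abs(g - mag))


def stochastic_round(x, rng):
    """Round to one of the two neighbouring E2M1 values.

    The upper neighbour is chosen with probability proportional to how
    close x is to it, so the expected result equals x (within range).
    """
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), E2M1_GRID[-1])
    for lo, hi in zip(E2M1_GRID, E2M1_GRID[1:]):
        if lo <= mag <= hi:
            p_up = (mag - lo) / (hi - lo)
            return sign * (hi if rng.random() < p_up else lo)
    return sign * mag  # not reached: mag always falls inside the grid


if __name__ == "__main__":
    rng = random.Random(0)  # hypothetical seed, for repeatability only
    small_gradient = 0.2
    draws = [stochastic_round(small_gradient, rng) for _ in range(20000)]
    print("nearest:", round_nearest(small_gradient))
    print("stochastic mean:", sum(draws) / len(draws))
```

Each individual stochastic result is noisier than round-to-nearest, which is a plausible reason the recipe applies it only where gradient updates need it rather than everywhere.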

The useful claim is not 4-bit. It is convergence discipline.

NVIDIA’s post says the MaxText recipe runs through Transformer Engine with two public modes: quantization=te_nvfp4, which includes Random Hadamard Transform, and quantization=te_nvfp4_no_rht, a lower-overhead path that may degrade convergence when outliers are not handled well enough. The recommended container is ghcr.io/nvidia/jax:maxtext, and the public example emits step time, TFLOP/s per device, tokens per second per device, and an Nsight Systems trace.
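What the RHT mode buys can be shown with a toy, assuming a 16-point transform built from a Sylvester Hadamard matrix and random sign flips. This is an illustrative sketch in plain Python, not the kernel Transformer Engine runs, and every helper name here is hypothetical. It shows two properties: a single outlier gets spread across the vector, and because the transform is orthonormal, a dot product between two vectors rotated the same way is unchanged.

```python
import math
import random


def hadamard(n):
    """Orthonormal n x n Hadamard matrix (n must be a power of two)."""
    h = [[1.0]]
    while len(h) < n:
        # Sylvester step: [[H, H], [H, -H]]
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    scale = 1.0 / math.sqrt(n)
    return [[v * scale for v in row] for row in h]


def random_hadamard(n, rng):
    """Hadamard rotation preceded by random sign flips on each input."""
    signs = [rng.choice((-1.0, 1.0)) for _ in range(n)]
    h = hadamard(n)
    return [[h[i][j] * signs[j] for j in range(n)] for i in range(n)]


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    rng = random.Random(0)  # hypothetical seed
    r = random_hadamard(16, rng)
    x = [0.1] * 16
    x[3] = 8.0  # one outlier that would dominate a shared block scale
    w = [0.05 * (i - 8) for i in range(16)]
    rx, rw = matvec(r, x), matvec(r, w)
    print("max |x| before:", max(abs(v) for v in x))
    print("max |x| after: ", max(abs(v) for v in rx))
    print("dot before/after:", dot(x, w), dot(rx, rw))
```

A flatter vector wastes less of the 4-bit range on one extreme value, which is the intuition behind paying the transform's overhead in the te_nvfp4 mode.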

The format itself is extremely tight: NVFP4 uses one sign bit, two exponent bits, and one mantissa bit (E2M1), so the largest representable magnitude is 6. To make that usable for training, NVIDIA layers local FP8 E4M3 scale factors over 16-element blocks under a global FP32 tensor scale. Transformer Engine defaults to 2D scaling for weights and 1D scaling for activations and gradients; the 2D weight scaling is what keeps the rowwise and columnwise views needed across forward and backward passes numerically consistent.
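The layout is easier to reason about with a small simulation. The sketch below quantizes a list of floats in 16-element blocks onto the E2M1 grid and then dequantizes it. It is deliberately simplified, and the simplifications are assumptions of this example: the per-block scale is kept as an ordinary Python float rather than rounded to FP8 E4M3, there is no global FP32 tensor scale, and ties resolve toward the smaller magnitude.

```python
# Non-negative magnitudes representable in E2M1 (the sign is a separate bit).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK = 16  # elements that share one scale factor


def quantize_block(block):
    """Return (scale, codes) where codes are signed E2M1 values.

    The scale maps the block's largest magnitude onto 6, the top of the grid.
    Simplification: a real NVFP4 scale would itself be stored in FP8 E4M3.
    """
    amax = max(abs(v) for v in block)
    scale = amax / E2M1_GRID[-1] if amax > 0 else 1.0
    codes = []
    for v in block:
        mag = min(E2M1_GRID, key=lambda g: abs(g - abs(v) / scale))
        codes.append(-mag if v < 0 else mag)
    return scale, codes


def dequantize_block(scale, codes):
    return [c * scale for c in codes]


def quantize_tensor(values, block=BLOCK):
    """Quantize then dequantize, block by block, to expose the rounding error."""
    out = []
    for i in range(0, len(values), block):
        scale, codes = quantize_block(values[i:i + block])
        out.extend(dequantize_block(scale, codes))
    return out


if __name__ == "__main__":
    x = [12.0, 1.0, -0.7, 0.2] + [0.0] * 12
    print(quantize_tensor(x))
```

With only eight magnitudes per block, everything depends on the scale: a block whose largest value is 12 cannot distinguish anything much smaller than 1.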

The 16-element block size is not trivia. It is half the size of MXFP4’s 32-element blocks, which NVIDIA argues reduces the blast radius of a single outlier sharing a scale with too many neighbors. In an 8B-parameter, 1T-token experiment cited by NVIDIA, MXFP4 required roughly 36% more tokens to match NVFP4’s final loss. That is exactly the kind of detail that separates a training recipe from a benchmark stunt: the value is not just throughput, it is avoiding a silent tax where your “faster” format burns more tokens to reach the same place.
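The blast-radius argument can be checked on a hand-built vector. This toy isolates block size only: it uses the same simplified E2M1 rounding for both block sizes, with an unrounded float scale, and ignores the fact that MXFP4 and NVFP4 also differ in scale format. The input, with one outlier among 32 small values, is a contrived assumption and says nothing about the 8B experiment itself.

```python
# Non-negative magnitudes representable in E2M1 (the sign is a separate bit).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def fake_quantize(values, block):
    """Quantize and dequantize with one shared scale per `block` elements."""
    out = []
    for i in range(0, len(values), block):
        chunk = values[i:i + block]
        amax = max(abs(v) for v in chunk)
        scale = amax / E2M1_GRID[-1] if amax > 0 else 1.0
        for v in chunk:
            mag = min(E2M1_GRID, key=lambda g: abs(g - abs(v) / scale))
            out.append((-mag if v < 0 else mag) * scale)
    return out


def total_abs_error(values, block):
    """Sum of absolute round-trip errors for a given block size."""
    return sum(abs(a - b) for a, b in zip(values, fake_quantize(values, block)))


if __name__ == "__main__":
    values = [0.3] * 32
    values[0] = 6.0  # a single outlier
    print("block=32 error:", total_abs_error(values, 32))
    print("block=16 error:", total_abs_error(values, 16))
```

With 32-element blocks the outlier sets the scale for all 31 neighbors; with 16-element blocks the second half of the vector gets its own scale and round-trips almost exactly.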

The recipe also refuses to quantize attention aggressively. QKV projections, attention output projections, and score/context matmuls stay higher precision because softmax can amplify quantization noise and attention activations tend to carry concentrated outliers. MLPs account for much of the training FLOPs, so NVIDIA is taking the speedup where the model is less fragile and leaving the sharper surfaces alone. That is the correct instinct. The industry keeps trying to turn quantization into a single knob; real systems work looks more like a patch set with careful exclusions.
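Written down as a policy, the exclusion list looks something like the sketch below. The role labels are hypothetical names invented for this example and do not correspond to MaxText or Transformer Engine module names. The precision labels are placeholders too, since the recipe is described only as keeping attention in "higher precision".

```python
# Hypothetical role labels; real module names in MaxText / Transformer Engine differ.
NVFP4_ROLES = {"mlp_up", "mlp_gate", "mlp_down"}
HIGH_PRECISION_ROLES = {"qkv_proj", "attn_out_proj", "attn_scores", "attn_context"}


def precision_for(role):
    """Return the precision class a GEMM with this role should run in.

    Only MLP GEMMs are quantized to 4-bit. Attention paths, and anything
    unrecognized, fall back to the conservative choice.
    """
    if role in NVFP4_ROLES:
        return "nvfp4"
    return "higher_precision"


if __name__ == "__main__":
    for role in sorted(NVFP4_ROLES | HIGH_PRECISION_ROLES):
        print(f"{role:>14} -> {precision_for(role)}")
```

Defaulting unknown roles to the conservative path mirrors the recipe's posture: 4-bit is opt-in for the layers that have earned it.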

The numbers are large enough to matter in procurement, not just papers.

NVIDIA benchmarked Llama 3 8B and Llama 3.1 405B pretraining at sequence length 8,192 on GB200 and GB300 systems, holding model, hyperparameters, parallelism, and global batch size constant against an FP8 baseline. For Llama 3 8B on GB200, throughput moved from 1,497 to 2,017 TFLOP/s per GPU, a 1.35× speedup. On GB300, the same model moved from 1,759 to 2,301 TFLOP/s per GPU, or 1.31×.

The bigger-model results are the procurement slide. Llama 3.1 405B on 128 GB200 GPUs moved from 1,557 to 2,241 TFLOP/s per GPU, a 1.44× speedup. On 128 GB300 GPUs, it moved from 2,103 to 3,633 TFLOP/s per GPU, a 1.73× speedup. NVIDIA summarizes the gain as an additional 500–700 TFLOP/s per GPU across the tested configurations.
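The speedup figures follow directly from the per-GPU throughput numbers quoted above, and the arithmetic is worth keeping in a script when you add your own rows. The wall-clock estimate rests on a stated assumption: for a fixed FLOP budget, training time scales inversely with sustained TFLOP/s, which ignores input-pipeline stalls, communication, and any change in tokens needed to converge.

```python
# (model, system, FP8 TFLOP/s per GPU, NVFP4 TFLOP/s per GPU), as quoted above.
RESULTS = [
    ("Llama 3 8B", "GB200", 1497, 2017),
    ("Llama 3 8B", "GB300", 1759, 2301),
    ("Llama 3.1 405B", "GB200", 1557, 2241),
    ("Llama 3.1 405B", "GB300", 2103, 3633),
]


def speedup(fp8, nvfp4):
    """Throughput ratio of NVFP4 over the FP8 baseline."""
    return nvfp4 / fp8


def time_saved_fraction(fp8, nvfp4):
    """Fraction of wall-clock time saved, assuming time is proportional to 1 / throughput."""
    return 1.0 - fp8 / nvfp4


if __name__ == "__main__":
    for model, system, fp8, nvfp4 in RESULTS:
        print(
            f"{model:<15} {system}: {speedup(fp8, nvfp4):.2f}x, "
            f"about {time_saved_fraction(fp8, nvfp4):.0%} less time per FLOP budget"
        )
```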

For a small lab, that is impressive. For a team training or post-training large models across hundreds or thousands of accelerators, it is calendar time, power budget, opportunity cost, and experiment velocity. If the convergence story holds outside NVIDIA’s controlled configurations, NVFP4 becomes one of the highest-leverage knobs available to Blackwell operators. Not because 4-bit is aesthetically satisfying, but because step time compounds brutally at scale.

NVIDIA’s accuracy evidence is also more serious than the usual “we ran a tiny benchmark and it looked fine.” The blog shows a 10,000-step Llama 3 8B comparison where FP8 and NVFP4 descend from about 12.2 nats to 3.9 nats, with a converged-regime mean gap of +0.026 nats, described as inside step-to-step noise. The accompanying arXiv paper reports a 12B-parameter model trained on 10T tokens, which NVIDIA describes as the longest publicly documented 4-bit precision training run at the time, with training loss and downstream task accuracy comparable to FP8.

That still is not a blank check. Ten thousand steps on one setup and a large documented paper run are evidence, not universal permission. If your architecture has unusual attention behavior, aggressive MoE routing, custom kernels, unusual optimizer settings, or data distributions with nasty outliers, you should treat NVFP4 parity as a hypothesis to validate, not a guarantee to inherit.

JAX support is a strategic detail, not a footnote.

The MaxText angle matters. NVIDIA’s developer center of gravity is often PyTorch, TensorRT-LLM, Triton, CUDA libraries, and NIM-style deployment. JAX shops are a different constituency: teams that care about XLA, SPMD partitioning, compiler behavior, and training stacks shaped by TPU-era infrastructure. Publishing a public MaxText recipe is NVIDIA telling those teams they can chase Blackwell economics without rewriting their training stack from scratch.

That is good for builders, but it is also classic NVIDIA platform strategy. NVFP4’s value depends on Blackwell-native FP4 conversion instructions, Transformer Engine, CUDA and cuDNN libraries, framework integration, and NVIDIA-published recipes. The performance win is real enough to evaluate seriously; the portability story is not free. Teams should document exactly where their training stack becomes NVIDIA-specific, what an FP8 fallback path looks like, and whether the saved training time justifies the operational coupling.

The supported-device line makes the boundary explicit: Transformer Engine lists NVFP4 training support for SM 10.0 and SM 10.3, with inference support on SM 10.0+. Translation: this is Blackwell-generation infrastructure, not a generic GPU checkbox. Hopper teams should keep treating FP8 as the practical production path. Blackwell teams should put NVFP4 on the evaluation roadmap immediately.

The operator playbook is straightforward. Start with a short parity run against your FP8 baseline. Compare loss curves, downstream evals, gradient stability, tokens per second per device, step time, memory behavior, input-pipeline stalls, communication overhead, and actual Nsight traces. Test both te_nvfp4 and te_nvfp4_no_rht; the no-RHT path may be cheaper, but the RHT path exists because discovering convergence drift late is much more expensive than paying overhead early. Keep attention precision conservative until your own ablations say otherwise.
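The loss-curve part of that parity run can be reduced to a small check. The sketch below is one possible criterion, not NVIDIA's methodology: it compares the mean gap over the last steps of two runs against a crude noise floor, taken here as the standard deviation of the baseline's step-to-step loss changes. The function names, the window length, and the synthetic curves in the demo are all assumptions. Downstream evals and gradient statistics still need their own checks.

```python
import statistics


def tail_gap(baseline, candidate, tail):
    """Mean of (candidate - baseline) loss over the last `tail` steps."""
    diffs = [c - b for b, c in zip(baseline[-tail:], candidate[-tail:])]
    return statistics.mean(diffs)


def step_noise(losses, tail):
    """Std dev of consecutive loss differences over the last `tail` steps."""
    window = losses[-tail:]
    deltas = [b - a for a, b in zip(window, window[1:])]
    return statistics.pstdev(deltas)


def within_noise(baseline, candidate, tail):
    """True if the tail gap is no larger than the baseline's step-to-step noise."""
    return abs(tail_gap(baseline, candidate, tail)) <= step_noise(baseline, tail)


if __name__ == "__main__":
    # Synthetic curves for illustration only, not measured data.
    fp8 = [3.9 + 0.05 * (-1) ** i for i in range(100)]
    nvfp4 = [loss + 0.026 for loss in fp8]
    print("tail gap:", tail_gap(fp8, nvfp4, 50))
    print("within noise:", within_noise(fp8, nvfp4, 50))
```

Running the same check for both te_nvfp4 and te_nvfp4_no_rht against the FP8 baseline gives a first, cheap signal on whether the lower-overhead path is drifting.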

There is a broader lesson here for agent and inference teams, even if this post is technically about pretraining. The winning optimization pattern is becoming mixed and workload-aware: quantize what can tolerate error, protect the numerically fragile paths, use hardware-native formats, measure full workflow cost, and stop worshipping isolated kernel benchmarks. “4-bit” is not a product requirement. Cost per correct completed task is.

That is why this quiet JAX/MaxText post is more important than another shiny agent demo. NVIDIA is showing the boring machinery that turns a research precision format into something an infrastructure lead can test: public flags, containers, recipe choices, hardware boundaries, performance counters, and convergence evidence. The take is simple: NVFP4 is not interesting because it is smaller than FP8. It is interesting because NVIDIA is starting to make smaller precision operationally believable.

Sources: NVIDIA Developer Blog, Transformer Engine 2.14 NVFP4 documentation, arXiv: Pretraining Large Language Models with NVFP4, NVIDIA JAX-Toolbox MaxText NVFP4 example