A 100B-Class NVFP4 Quant on 2× DGX Spark Is the Most Useful Kind of Messy Benchmark

Anatoliy Kolodkin

14 May 2026 • 4 min read

The most useful benchmark posts are usually the messy ones. A fresh NVIDIA Developer Forum thread on quantizing a 100B-class Llama-derived model to NVFP4 across two DGX Spark systems is valuable because it does not sand off the parts operators actually trip over: unified-memory misdetection, exporter quirks, vLLM sidecar fixes, calibration checks, kernel paths, and the difference between “the model fits” and “the system is pleasant to use.”

The model is Kaleto/Anubis-Pro-105B-NVFP4, an NVFP4 compressed-tensors quantization of TheDrummer’s Anubis-Pro-105B-v1, itself a Llama-3.3-derived 105B model. The architecture is substantial: 120 layers, hidden size 8192, 64 attention heads, 8 KV heads, and 128-dimensional heads. The BF16 original is roughly 196GB. The quantized release is about 58GB across 12 safetensors shards, with the lm_head kept in BF16.
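Those architecture numbers are enough for a back-of-envelope KV cache estimate. The sketch below is plain arithmetic, not anything taken from the model card: it assumes one key and one value vector per KV head per layer, and the function names are made up for illustration.

```python
# Back-of-envelope KV cache sizing from the architecture figures quoted above.
# Assumption: standard grouped-query attention storage, one K and one V vector
# per KV head per layer, with no paging or allocator overhead included.
LAYERS = 120
KV_HEADS = 8
HEAD_DIM = 128


def kv_cache_bytes_per_token(bytes_per_element: int) -> int:
    """Bytes of KV cache that a single token occupies across all layers."""
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * bytes_per_element


def kv_cache_gib(context_tokens: int, bytes_per_element: int) -> float:
    """KV cache size in GiB for one sequence of the given length."""
    return context_tokens * kv_cache_bytes_per_token(bytes_per_element) / 2**30


if __name__ == "__main__":
    for ctx in (4096, 16384, 32768):
        fp8 = kv_cache_gib(ctx, 1)   # 1 byte per element for an FP8 cache
        bf16 = kv_cache_gib(ctx, 2)  # 2 bytes per element for a BF16 cache
        print(f"{ctx:>6} tokens: FP8 {fp8:.2f} GiB, BF16 {bf16:.2f} GiB")
```

Under these assumptions a single 32,768-token sequence needs about 7.5 GiB of FP8 KV cache, which is why the cache dtype matters on a machine where the weights already take about 58GB.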

That is the headline-friendly part. The better story is the pipeline it took to get there. The author used two DGX Spark systems, each a GB10 machine with 128GB of unified memory, connected over a ConnectX-7 200GbE backbone. The model card reports 44GB/s effective NCCL AllReduce over InfiniBand and about 280W total sustained draw. Quantization used nvidia-modelopt 0.43.0 with NVFP4 W4A4 at group size 16, orchestrated by Ray with two actors owning 60 layers each. Calibration used 256 cnn_dailymail samples between 150 and 1200 tokens, with health checks reporting good=420, zero=0, nan=0 on both shards.
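The layer split and the health-check counts line up with simple arithmetic. The sketch below is not the author's Ray code; split_layers is a hypothetical helper, and the figure of seven Linear modules per layer is an assumption based on the usual Llama-style block (q, k, v, o, gate, up, down projections).

```python
# Hypothetical helper: divide decoder layers into contiguous shards, one per actor.
def split_layers(num_layers: int, num_actors: int) -> list[range]:
    base, extra = divmod(num_layers, num_actors)
    shards, start = [], 0
    for i in range(num_actors):
        size = base + (1 if i < extra else 0)  # spread any remainder over the first shards
        shards.append(range(start, start + size))
        start += size
    return shards


# Assumption: a Llama-style decoder block has 7 quantizable Linear modules
# (q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj).
LINEARS_PER_LAYER = 7

shards = split_layers(120, 2)
linears_per_shard = [len(s) * LINEARS_PER_LAYER for s in shards]

if __name__ == "__main__":
    print(shards)             # two contiguous ranges of 60 layers
    print(linears_per_shard)  # quantized Linear count owned by each actor
```

With 60 layers per actor, that assumption gives 420 Linear modules per shard, which appears consistent with the good=420 health-check count reported for each shard.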

“Fits in memory” is not the same as “ready to serve”

The failure modes are the important material. The author says the standard single-node modelopt hf_ptq.py run is silently OOM-killed on Spark for 100B-class models because accelerate.infer_auto_device_map misdetects GB10 unified memory as roughly a 5.2TB GPU. That is the kind of bug that looks absurd in hindsight and completely normal in frontier-adjacent infrastructure. Unified memory, new hardware, large models, and fast-moving Python stacks make a perfect little swamp.
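One generic guard against this class of bug is to never trust a reported device size that exceeds physical memory. The sketch below is not the author's fix, which used a distributed pipeline instead. capped_max_memory is a hypothetical helper, and the 15 percent headroom is an arbitrary illustration. Its result has the shape of the max_memory mapping that accelerate.infer_auto_device_map accepts.

```python
# Hypothetical guard: clamp a misreported device size to physical memory and
# leave headroom, producing a max_memory-style mapping {device_index: bytes}.
def capped_max_memory(reported_bytes: int, physical_bytes: int, headroom: float = 0.15) -> dict:
    if not 0.0 <= headroom < 1.0:
        raise ValueError("headroom must be in [0, 1)")
    trusted = min(reported_bytes, physical_bytes)  # ignore impossible reports
    return {0: int(trusted * (1.0 - headroom))}


if __name__ == "__main__":
    reported = int(5.2e12)   # the roughly 5.2TB misdetection described above
    physical = 128 * 2**30   # assumption: treat the 128GB unified pool as 128 GiB
    print(capped_max_memory(reported, physical))
    # The mapping could then be passed along the lines of
    # infer_auto_device_map(model, max_memory=capped_max_memory(reported, physical)),
    # which is left as a comment because it needs a loaded model.
```

Whether an explicit cap alone would have rescued the single-node run is not something the thread establishes. The point is only that the detected number should be checked before a device map is built from it.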

There was also an exporter/runtime mismatch. According to the post, ModelOpt 0.43 writes input_activations.dynamic=false without input_scale keys. vLLM then registers uninitialized parameters and produces bad output until input_scale=1.0 sidecar keys are injected for 840 quantized Linear layers. In other words: the files exist, the model loads, and the output can still be wrong because metadata contracts between exporter and server do not line up.
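A cheap pre-flight check for this kind of mismatch is to list which quantized layers lack a scale key before the model is ever served. The sketch below works on tensor names only. The ".weight_packed" suffix and the example key names are assumptions for illustration, not a confirmed description of the checkpoint layout; only the input_scale name comes from the post.

```python
# Hypothetical pre-flight check: given the tensor names in a checkpoint, list the
# input_scale keys that a loader would expect but the exporter did not write.
def missing_input_scale_keys(
    tensor_names,
    weight_suffix: str = ".weight_packed",  # assumption: marker of a quantized Linear
    scale_suffix: str = ".input_scale",
) -> list[str]:
    names = set(tensor_names)
    missing = []
    for name in sorted(names):
        if name.endswith(weight_suffix):
            expected = name[: -len(weight_suffix)] + scale_suffix
            if expected not in names:
                missing.append(expected)
    return missing


if __name__ == "__main__":
    example = [
        "model.layers.0.self_attn.q_proj.weight_packed",
        "model.layers.0.self_attn.q_proj.input_scale",
        "model.layers.0.mlp.down_proj.weight_packed",  # no input_scale written
        "lm_head.weight",                              # unquantized, ignored
    ]
    for key in missing_input_scale_keys(example):
        print("missing:", key)
```

On the checkpoint described in the post, a check like this would be expected to report one missing key per quantized Linear layer, 840 in total, before the sidecar fix is applied.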

That is the practitioner warning. Quantization is no longer a single command. It is a multi-project compatibility contract among the quantizer, compressed-tensors schema, model architecture, loader assumptions, attention backend, KV cache dtype, GPU architecture, and serving engine. A missing scale tensor can invalidate the result while leaving enough artifacts in place to fool a shallow smoke test.

The reported stock vLLM numbers on a single Spark are useful because they are not inflated into a miracle. At 4,096 context, the post reports roughly 340 tokens/sec prompt processing and 3.1 tokens/sec decode using about 109GB memory. At 16,384 context, prompt processing rises to about 650 tokens/sec and decode stays around 2.9 tokens/sec. At 32,768 context, prompt processing is about 850 tokens/sec and decode remains about 2.9 tokens/sec. At concurrency 4 and 4K context, aggregate output is about 10.4 tokens/sec and total throughput about 167 tokens/sec. The vLLM config used compressed-tensors quantization, FP8 KV cache, max 4 sequences, 0.85 GPU memory utilization, chunked prefill, and prefix caching.
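For reference, the settings listed above map onto vLLM engine arguments roughly as follows. This is a sketch, not the author's launch command: the post's exact flags, maximum model length, and vLLM version are not reproduced here, and build_engine is a hypothetical wrapper.

```python
# Engine settings as described in the post, expressed as vLLM keyword arguments.
# Anything not mentioned in the post (for example max_model_len) is deliberately omitted.
ENGINE_KWARGS = {
    "model": "Kaleto/Anubis-Pro-105B-NVFP4",
    "quantization": "compressed-tensors",
    "kv_cache_dtype": "fp8",
    "max_num_seqs": 4,
    "gpu_memory_utilization": 0.85,
    "enable_chunked_prefill": True,
    "enable_prefix_caching": True,
}


def build_engine():
    """Hypothetical wrapper: construct the engine only when actually called."""
    from vllm import LLM  # imported lazily so the config can be inspected without vLLM

    return LLM(**ENGINE_KWARGS)


if __name__ == "__main__":
    for key, value in ENGINE_KWARGS.items():
        print(f"{key} = {value!r}")
```

Keeping the settings in one dictionary makes it easier to record exactly which configuration produced a given set of throughput numbers.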

Local frontier-ish inference is real, but it is ops

Those numbers are not “bad.” They are workload-dependent. For background generation, long-form drafting, storytelling, or offline experimentation, 3 tokens/sec per stream may be acceptable. For an interactive coding agent, it probably feels underwater unless the task is highly asynchronous. The lesson is not that DGX Spark fails. The lesson is that serving economics survive quantization. Prefill, decode, concurrency, KV cache allocation, memory preallocation, and fast-path kernel selection still decide whether the model feels alive.
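To make "acceptable" concrete, the reported rates can be turned into wall-clock time per request. The sketch below is simple arithmetic on the numbers quoted above. The 500-token reply length is an arbitrary assumption, and the model ignores queueing, prefix-cache hits, and chunked-prefill interleaving.

```python
# Rough wall-clock estimate for one request: prefill time plus decode time.
def request_seconds(prompt_tokens: int, output_tokens: int,
                    prefill_tps: float, decode_tps: float) -> float:
    return prompt_tokens / prefill_tps + output_tokens / decode_tps


# (context tokens, prompt-processing tok/s, decode tok/s) as reported in the post.
REPORTED = [
    (4096, 340.0, 3.1),
    (16384, 650.0, 2.9),
    (32768, 850.0, 2.9),
]

OUTPUT_TOKENS = 500  # assumption: a medium-length reply

if __name__ == "__main__":
    for ctx, prefill_tps, decode_tps in REPORTED:
        first_token = ctx / prefill_tps
        total = request_seconds(ctx, OUTPUT_TOKENS, prefill_tps, decode_tps)
        print(f"{ctx:>6} ctx: ~{first_token:.0f}s to first token, ~{total / 60:.1f} min total")
    # Per-stream decode at concurrency 4, from the aggregate figure in the post.
    print(f"per-stream at concurrency 4: ~{10.4 / 4:.1f} tok/s")
```

At 4K context that works out to roughly 12 seconds before the first token and close to three minutes for the full reply, which is fine for a batch job and painful for a chat loop.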

The calibration detail is another practical reminder. The release used cnn_dailymail, a reasonable generic text corpus, but the model card notes that roleplay, storytelling, or code domains might benefit from domain-matched calibration. That is not a footnote. FP4 compression is shaped by activation distributions. If you quantize a coding model, calibrate on code, tool traces, and repository-style prompts. If you quantize a creative model, include the creative distribution you expect users to hit. “Generic text” is not always neutral; sometimes it is just mismatched.
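Swapping the calibration corpus is mostly a data-selection problem. The sketch below shows one way to pick a length-bounded sample from any text source. It is not the pipeline from the model card: select_calibration and the count_tokens callback are hypothetical, and treating the 150 and 1200 bounds as inclusive is an assumption.

```python
import random


# Hypothetical selector: keep texts whose token count falls inside the bounds,
# then draw a fixed-size, reproducible sample for calibration.
def select_calibration(texts, count_tokens, n_samples=256,
                       min_tokens=150, max_tokens=1200, seed=0):
    eligible = [t for t in texts if min_tokens <= count_tokens(t) <= max_tokens]
    if len(eligible) < n_samples:
        raise ValueError(
            f"only {len(eligible)} eligible texts, need {n_samples}"
        )
    return random.Random(seed).sample(eligible, n_samples)


if __name__ == "__main__":
    # Stand-in corpus and tokenizer: whitespace splitting instead of the model tokenizer.
    corpus = [" ".join(["tok"] * n) for n in range(100, 1500)]
    picked = select_calibration(corpus, lambda t: len(t.split()))
    lengths = [len(t.split()) for t in picked]
    print(len(picked), min(lengths), max(lengths))
```

In practice count_tokens would wrap the model's own tokenizer, and the corpus would be code, tool traces, or creative text drawn from the distribution the deployment expects.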

For teams evaluating Blackwell-era FP4/NVFP4, this thread offers a better checklist than a polished launch post. Verify exporter/runtime versions. Run behavioral smoke tests, not just load tests. Capture fast-path logs. Compare outputs before and after quantization on domain prompts. Watch for silent fallback kernels. Measure both prompt processing and decode. Test concurrency. Keep the unquantized baseline around long enough to catch quality regressions. And assume every “minor” metadata field is load-bearing until proven otherwise.
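The "compare outputs before and after quantization" item can start very small. The sketch below compares greedy token ids from the baseline and quantized models on the same prompt. Both helper names are hypothetical, and exact-match agreement is a deliberately crude metric that flags gross breakage, such as uninitialized scales, rather than subtle quality drift.

```python
# Hypothetical smoke-test metrics for two greedy decodes of the same prompt.
def token_agreement(baseline: list[int], quantized: list[int]) -> float:
    """Fraction of positions with identical token ids, measured over the longer sequence."""
    longest = max(len(baseline), len(quantized))
    if longest == 0:
        return 1.0
    matches = sum(a == b for a, b in zip(baseline, quantized))
    return matches / longest


def first_divergence(baseline: list[int], quantized: list[int]):
    """Index of the first differing position, or None if the sequences are identical."""
    for i, (a, b) in enumerate(zip(baseline, quantized)):
        if a != b:
            return i
    if len(baseline) != len(quantized):
        return min(len(baseline), len(quantized))
    return None


if __name__ == "__main__":
    base = [11, 42, 7, 99, 3]
    quant = [11, 42, 8, 99, 3]
    print(token_agreement(base, quant), first_divergence(base, quant))
```

A quantized model will legitimately diverge from its baseline at some point, so the useful signal is the trend across many domain prompts: agreement that collapses within the first few tokens everywhere points at a broken export rather than ordinary quantization noise.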

For NVIDIA, this kind of community artifact may be better marketing than a clean demo. It shows the DGX Spark ecosystem becoming legible: users are finding failure modes, publishing pipelines, sharing Docker images, documenting sidecar fixes, and naming the logs that prove hardware acceleration is actually happening. Platforms mature when the ugly parts become searchable.

The LGTM take: NVFP4 on DGX Spark is promising precisely because the community is publishing the rough edges. Local 100B-class inference is no longer fantasy. It is also not magic. It is distributed quantization, compatibility debugging, calibration discipline, and serving operations wearing a very small workstation-shaped hat.

Sources: NVIDIA Developer Forum, Kaleto/Anubis-Pro-105B-NVFP4 model card, KaletoAI distrib-nvfp4, NVIDIA TensorRT-LLM, NVIDIA Model Optimizer
