
DGX Spark’s Nemotron 3 Super Benchmark Is Useful Because It Measures Stability, Not Just Speed

Anatoliy Kolodkin

13 May 2026 • 5 min read

The most useful number in the latest DGX Spark Nemotron benchmark is not 23.45 tokens per second. It is zero.

Zero crashes. Zero out-of-memory errors. A same-evening NVIDIA Developer Forum post reports Nemotron-3-Super-120B-A12B-NVFP4 running on a single DGX Spark at 23.45 tokens/sec for a tg128 single-session test, stable from d0 to d100000, across 104 tests over 5 hours and 49 minutes. That is a better local-AI result than a prettier screenshot with a higher tokens/sec number, because serious local agents do not fail only by being slow. They fail by being flaky.
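To put the run length in perspective, here is a back-of-the-envelope calculation using only the figures quoted above. It assumes that "tg128" means a 128-token generation measurement and that the d0 to d100000 range refers to context depth; that is my reading of the naming, not something the post spells out.

```python
from datetime import timedelta

# Figures quoted from the forum post.
run_time = timedelta(hours=5, minutes=49)
tests = 104
decode_rate = 23.45   # tokens/sec, tg128 single session
tg_tokens = 128       # assumption: tg128 generates 128 tokens

# Average wall-clock time spent on each of the 104 tests.
seconds_per_test = run_time.total_seconds() / tests

# Time to decode 128 tokens at the quoted rate.
seconds_per_tg128 = tg_tokens / decode_rate

print(f"average wall time per test: {seconds_per_test:.1f} s")
print(f"one tg128 generation at the quoted rate: {seconds_per_tg128:.2f} s")
```

Each test averaged a little over three minutes of wall time, while a single 128-token decode at the quoted rate takes only a few seconds. Under these assumptions, most of the run was presumably spent filling context to the tested depths and repeating measurements, which is the kind of sustained load that exposes memory problems.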

A flaky local model is worse than an honest slow one. It eats the same operator time, corrupts confidence in the stack, and fails at exactly the wrong moment: after the prompt grows, after a tool loop has state, after the user trusts the session, or after memory fragmentation has accumulated invisibly for hours. So yes, 23.45 t/s from a 120B-total-parameter model on a desktop-class NVIDIA system is interesting. But “ran for nearly six hours without OOM” is the part that deserves the underline.

DGX Spark is becoming a recipe machine, not just a spec sheet.

NVIDIA positions DGX Spark as a workstation-class path into local AI: GB10 Grace Blackwell Superchip, up to 1 PFLOP FP4 theoretical sparse performance, 128 GB coherent unified memory, 273 GB/s memory bandwidth, 4 TB NVMe storage, ConnectX-7 networking at 200 Gbps, and local inference for models up to 200B parameters. NVIDIA also says two Spark systems can work with models up to 405B parameters. Those specs are useful, but they are not the deployment story by themselves.

The forum and Spark Arena record make the deployment story concrete. The run used vLLM serving Nemotron-3-Super-120B-A12B-NVFP4 on one DGX Spark with tensor parallel size 1, a vLLM nightly container identified by a sha256 prefix, NVFP4, Marlin, MTP speculative decoding, and a Nemotron reasoning parser. Spark Arena lists a recipe with gpu_memory_utilization: 0.75, max_model_len: 131072, max_num_batched_tokens: 16384, and max_num_seqs: 4. The environment sets VLLM_NVFP4_GEMM_BACKEND=marlin, disables FlashInfer MoE FP4, and allows long maximum model lengths.
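The recipe numbers are easier to reason about as budgets. The sketch below uses only the values quoted above, with two assumptions: gpu_memory_utilization is applied to the full 128 GB of unified memory, and chunked prefill processes at most max_num_batched_tokens prompt tokens per scheduling step. Treat the results as rough bounds, not measurements.

```python
import math

# Values quoted in the article; the interpretation is an assumption.
unified_memory_gb = 128
gpu_memory_utilization = 0.75
max_model_len = 131072
max_num_batched_tokens = 16384
max_num_seqs = 4
deepest_tested_context = 100_000   # the "d100000" end of the sweep

# Memory the serving process may claim, and what is left for everything else.
memory_budget_gb = unified_memory_gb * gpu_memory_utilization
left_for_system_gb = unified_memory_gb - memory_budget_gb

# Minimum number of prefill chunks for a prompt of a given length.
chunks_at_max_len = math.ceil(max_model_len / max_num_batched_tokens)
chunks_at_deepest = math.ceil(deepest_tested_context / max_num_batched_tokens)

# Upper bound on tokens resident if every sequence slot is at full length.
worst_case_resident_tokens = max_num_seqs * max_model_len

print(memory_budget_gb, left_for_system_gb)   # 96.0 32.0
print(chunks_at_max_len, chunks_at_deepest)   # 8 7
print(worst_case_resident_tokens)             # 524288
```

On a unified-memory machine the remaining quarter is shared with the operating system and any CPU-side tooling, which is one plausible reason the recipe stops at 0.75 rather than pushing higher.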

The serving command is not casual either. It includes FP4 quantization, Marlin MoE, FP8 KV cache, float32 Mamba SSM cache, async scheduling, chunked prefill, nemotron_v3 reasoning parsing, Qwen-style tool-call parsing, and an MTP speculative config with one speculative token and a Triton MoE backend. This is what “local large model” means in 2026: not one binary, not one model file, but a pinned runtime recipe with precision choices, parser choices, cache choices, and backend choices.
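One way to treat the recipe as a single artifact is to write it down as a record and refuse to run it unless the container is pinned by digest. This is a minimal sketch, not the benchmark author's tooling: the ServingRecipe class, the registry name, the all-zero digest, and the settings key names are hypothetical, and the values are the ones reported above.

```python
import re
from dataclasses import dataclass, field


@dataclass
class ServingRecipe:
    """Everything that has to stay fixed for a run to be repeatable."""
    model: str
    image: str                  # container reference, ideally with a digest
    tensor_parallel_size: int
    settings: dict = field(default_factory=dict)

    def is_pinned(self) -> bool:
        """True only when the image is referenced by a full sha256 digest."""
        return re.search(r"@sha256:[0-9a-f]{64}$", self.image) is not None


# Placeholder: the article only says the image was identified by a sha256
# prefix, so the registry name and the 64 zeros are made up for illustration.
PLACEHOLDER_DIGEST = "0" * 64

recipe = ServingRecipe(
    model="Nemotron-3-Super-120B-A12B-NVFP4",
    image=f"registry.example/vllm-nightly@sha256:{PLACEHOLDER_DIGEST}",
    tensor_parallel_size=1,
    settings={
        # Values as reported in the article; key names are illustrative labels.
        "gpu_memory_utilization": 0.75,
        "max_model_len": 131072,
        "max_num_batched_tokens": 16384,
        "max_num_seqs": 4,
        "kv_cache_dtype": "fp8",
        "mamba_ssm_cache_dtype": "float32",
        "reasoning_parser": "nemotron_v3",
        "tool_call_parser": "qwen-style (exact name not given in the article)",
        "num_speculative_tokens": 1,
    },
)

# The same recipe on a floating tag is a different dependency every night.
floating = ServingRecipe(
    model=recipe.model,
    image="registry.example/vllm-nightly:latest",
    tensor_parallel_size=1,
)

print(recipe.is_pinned())    # True
print(floating.is_pinned())  # False
```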

That complexity is not a knock against DGX Spark. It is the cost of operating near the edge of what a compact workstation can serve. The win is that the recipe exists, is public enough to inspect, and ran long enough to matter.

A 120B model that fits is table stakes. A 120B model that stays up is news.

Nemotron 3 Super is built for exactly the category of workload NVIDIA wants DGX Spark owners to care about: collaborative agents, high-volume automation, tool use, retrieval-augmented generation, and long-context reasoning. The model card describes a 120B total / 12B active LatentMoE architecture mixing Mamba-2, MoE, and attention, native Multi-Token Prediction layers, up to a 1M-token context, and a minimum GPU requirement of one B200 or one DGX Spark. NVIDIA reports benchmark scores including MMLU-Pro 83.33, GPQA 79.42, LiveCodeBench v6 78.44, TauBench V2 average 60.46, IFBench 73.30, Arena-Hard-V2 76.00, AA-LCR 58.06, and RULER-500 at 512k 96.23.

Those numbers make Super look like a credible local agent model rather than a novelty. But the benchmark’s stability claim addresses the more practical question: can you leave it running?

Local coding agents and private enterprise assistants create ugly serving patterns. They do not just ask one question and disappear. They accumulate context, call tools, maintain partial plans, run retries, ingest logs, and sometimes sit idle before resuming with a much larger prompt. Long-context support on a model card is only useful if the serving stack keeps KV cache behavior, memory allocation, parser state, and speculative decoding under control across repeated turns. One clean request proves almost nothing. A six-hour run with 104 tests proves more, though still not everything.

This is where the benchmark is meaningfully different from the usual “look what fits on my box” post. It measured enough duration to expose basic memory cliffs. It pinned a concrete runtime path. It reported stability as a first-class result. That is the discipline local-AI builders should copy.

It also gives a useful contrast with a separate Jetson Thor thread from the same window, where Nemotron 3 Nano looked fast at single concurrency but degraded sharply under higher concurrency and rejected an MTP speculative path in vLLM 0.20.2. The comparison is not “Super good, Nano bad.” It is more interesting than that. Nano shows that a smaller active model can have great single-session latency while still depending heavily on tuned kernels and runtime support under load. Super shows that a larger model can be operationally credible when the recipe aligns with the architecture.

That is the emerging rule for local AI: the model is not the unit of deployment. The model-runtime-hardware recipe is.

The agent question is whether 23.45 t/s is enough for the job.

For interactive pair programming, 23.45 tokens/sec may feel borderline depending on prompt length, time-to-first-token, and how often the agent pauses to inspect files or run commands. For a background maintainer agent that reviews logs, proposes patches, triages tickets, or drafts migrations overnight, it may be perfectly acceptable. For RAG-heavy enterprise workflows, decode speed may matter less than prefill, retrieval latency, parser reliability, and whether the system can sustain multiple sessions without OOM. There is no universal “fast enough.” There is only fast enough for the interaction contract.
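The "interaction contract" can be made concrete with a little arithmetic. The sketch below estimates response time as time-to-first-token plus output length divided by decode rate. Only the 23.45 tokens/sec figure comes from the benchmark; the time-to-first-token values, the reply lengths, and the latency budgets are hypothetical numbers chosen for illustration.

```python
def response_seconds(output_tokens, decode_rate=23.45, ttft_seconds=0.0):
    """Estimated wall time for one reply: prefill wait plus decode time."""
    return ttft_seconds + output_tokens / decode_rate


def meets_contract(output_tokens, budget_seconds, **kwargs):
    """True if the estimated reply time fits inside the latency budget."""
    return response_seconds(output_tokens, **kwargs) <= budget_seconds


# A short 200-token chat reply against a hypothetical 10-second budget.
print(round(response_seconds(200), 2))               # 8.53
print(meets_contract(200, 10.0))                     # True
print(meets_contract(200, 10.0, ttft_seconds=2.0))   # False

# A 2000-token overnight patch draft against a hypothetical 10-minute budget.
print(round(response_seconds(2000, ttft_seconds=2.0), 2))   # 87.29
print(meets_contract(2000, 600.0, ttft_seconds=2.0))        # True
```

The same decode rate passes or fails depending on the budget and on how long prefill takes, which is why the tokens/sec figure cannot settle the question alone.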

That is the lesson practitioners should take from this. If you are evaluating DGX Spark for local or private agents, do not treat the tokens/sec number as a proxy for product fit. Build a harness that resembles your actual work. For coding agents, run repo inspection, long-context bug fixing, shell-tool loops, and JSON tool-call validation. For enterprise assistants, run RAG prompts with realistic document sizes and concurrent sessions. For automation agents, run multi-hour loops and record every restart, parser failure, malformed tool call, and memory-growth pattern.
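A harness like that does not need to be elaborate to be useful. The sketch below tallies outcomes for a batch of prompts against any callable that sends a prompt and returns the raw reply. Everything here is an assumption for illustration: the tool-call shape with "name" and "arguments" keys, the use of MemoryError and ConnectionError as stand-ins for an OOM and a server restart, and the fake_send stub that replaces a real client.

```python
import json
from collections import Counter


def classify(raw_reply):
    """Return 'ok' or a failure label for one tool-call reply."""
    try:
        call = json.loads(raw_reply)
    except json.JSONDecodeError:
        return "malformed_json"
    if not isinstance(call, dict) or "name" not in call or "arguments" not in call:
        return "invalid_tool_call"
    return "ok"


def run_harness(send, prompts):
    """Send every prompt and count outcomes instead of stopping at the first failure."""
    tally = Counter()
    for prompt in prompts:
        try:
            reply = send(prompt)
        except MemoryError:        # stand-in for an out-of-memory failure
            tally["oom"] += 1
            continue
        except ConnectionError:    # stand-in for a crashed or restarted server
            tally["restart"] += 1
            continue
        tally[classify(reply)] += 1
    return tally


def fake_send(prompt):
    """Stub client with canned replies; swap in a real client for actual runs."""
    canned = {
        "good": '{"name": "read_file", "arguments": {"path": "README.md"}}',
        "truncated": '{"name": "read_file", "arguments": {',
        "wrong_shape": '{"tool": "read_file"}',
    }
    if prompt == "crash":
        raise ConnectionError("server went away")
    return canned[prompt]


tally = run_harness(fake_send, ["good", "truncated", "wrong_shape", "crash", "good"])
print(dict(tally))
```

The design choice that matters is that the loop keeps going after a failure and records what kind it was, so a multi-hour run ends with counts rather than a single stack trace.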

Measure the unglamorous metrics: time-to-first-token, time-per-output-token, p95 latency, OOM frequency, context-depth stability, restart recovery, and correctness under the exact reasoning parser and tool-call parser you plan to use. Change one variable at a time: MTP speculative tokens, KV-cache dtype, maximum model length, max_num_seqs, MoE backend, Mamba cache dtype, and container digest. If the recipe uses a nightly vLLM image, pin it. Tomorrow’s nightly is not the same dependency, no matter how comforting the tag looks.
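Two of those habits are easy to sketch: summarizing per-request timings, and generating configurations that differ from a baseline in exactly one setting. The sample records, field names, and alternative values below are hypothetical. Time-per-output-token is computed with one common definition (decode time divided by output tokens after the first), and p95 uses the nearest-rank method.

```python
import math


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def summarize(samples):
    """samples: dicts with ttft_s, total_s, output_tokens, and oom (bool)."""
    done = [s for s in samples if not s["oom"]]
    tpot = [(s["total_s"] - s["ttft_s"]) / max(s["output_tokens"] - 1, 1) for s in done]
    return {
        "p95_ttft_s": p95([s["ttft_s"] for s in done]),
        "mean_tpot_s": sum(tpot) / len(tpot),
        "p95_latency_s": p95([s["total_s"] for s in done]),
        "oom_rate": (len(samples) - len(done)) / len(samples),
    }


def one_at_a_time(baseline, alternatives):
    """Yield (changed_key, config) pairs that differ from baseline in one setting."""
    for key, values in alternatives.items():
        for value in values:
            if value != baseline[key]:
                yield key, {**baseline, key: value}


# Hypothetical timing records from four requests, one of which hit an OOM.
samples = [
    {"ttft_s": 1.0, "total_s": 6.0, "output_tokens": 101, "oom": False},
    {"ttft_s": 2.0, "total_s": 12.0, "output_tokens": 201, "oom": False},
    {"ttft_s": 0.0, "total_s": 0.0, "output_tokens": 0, "oom": True},
    {"ttft_s": 1.5, "total_s": 4.0, "output_tokens": 51, "oom": False},
]
summary = summarize(samples)
print(summary["p95_ttft_s"], summary["p95_latency_s"], summary["oom_rate"])

# Baseline values follow the article's recipe; the alternatives are made up.
baseline = {"num_speculative_tokens": 1, "kv_cache_dtype": "fp8", "max_num_seqs": 4}
alternatives = {"num_speculative_tokens": [1, 2], "max_num_seqs": [1, 4, 8]}
variants = list(one_at_a_time(baseline, alternatives))
for changed, config in variants:
    print(changed, config)
```

Each variant maps to one benchmark run, so any change in the summary can be attributed to a single setting rather than to a bundle of simultaneous tweaks.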

There is a procurement lesson here too. DGX Spark is not just “a small box with enough memory.” Its value depends on whether NVIDIA and the community keep producing recipes that make the box boringly usable. Spark Arena matters because it turns anecdote into something closer to an operator artifact: command, runtime, settings, hardware, and observed behavior. NVIDIA’s forums matter because the messy edge cases (parser flags, unsupported speculative paths, Marlin versus FlashInfer, long-context settings) are where product reality shows up before polished docs catch up.

The optimistic read is that local AI is maturing. A single DGX Spark running a 120B/12B-active NVFP4 model with MTP and no OOMs over nearly six hours would have sounded absurdly ambitious not long ago. The skeptical read is equally important: the path is not simple, the stack is moving fast, and “minimum GPU requirement” on a model card is not the same as an operationally supported product.

That tension is exactly where builders should live. DGX Spark is becoming a real workstation for large-model experimentation, private agents, and local inference engineering. But the durable advantage is not the spec sheet. It is the stable recipe: the pinned runtime, the right quantization path, the parser that does not break tool calls, the speculative decoding mode that actually starts, and the boring multi-hour run that ends with zero OOMs.

In local AI, zero is sometimes the number worth publishing.

Sources: NVIDIA Developer Forum, Spark Arena benchmark record, NVIDIA Nemotron-3-Super model card, NVIDIA DGX Spark
