
Nemotron 3 Nano Looks Fast on Jetson Thor — Until Concurrency Makes the Runtime Tell the Truth

Anatoliy Kolodkin

13 May 2026 • 5 min read

Jetson Thor can run serious local AI models now. That is no longer the interesting part.

The interesting part is that a same-day NVIDIA Developer Forum benchmark shows exactly where the glossy edge-AI story starts paying rent: concurrency, runtime support, kernel selection, and parser glue. A user tested NVIDIA’s Nemotron-3-Nano-30B-A3B-NVFP4 on a Jetson AGX Thor Developer Kit with vLLM 0.20.2 and found a split personality. At concurrency 1, Nano looked excellent: 65.67 output tokens per second and 14.18 ms inter-token latency. At concurrency 32, inter-token latency degraded 4.7x while the dense Qwen3-8B and Mistral-24B variants stayed comparatively steadier.

That is the right kind of uncomfortable result. Not because it proves Nemotron Nano is bad — it does not — but because it proves edge AI has moved past the “can it fit?” phase. The new question is whether the model/runtime/hardware combination keeps behaving when the workload starts looking like a real agent instead of a booth demo.

The edge box is powerful. The serving path is still the product.

The hardware baseline matters. NVIDIA’s Jetson AGX Thor is not a toy board being asked to cosplay as a data center. NVIDIA lists up to 2070 FP4 TFLOPS of sparse AI compute, 128 GB LPDDR5X memory, 273 GB/s memory bandwidth, a 14-core Arm Neoverse-V3AE CPU, Blackwell fifth-generation Tensor Cores, and a configurable 40 W to 130 W power envelope. The forum user reported Ubuntu 24.04.4, CUDA 13.0, Linux 6.8.12-tegra, Blackwell sm_110/sm_110a, 122 GiB of unified memory, and a local NVMe model cache.

That is enough machine to make local agents plausible at the edge: robotics assistants, on-prem developer tools, lab automation controllers, industrial copilots, and private inference appliances that do not phone every prompt to a cloud API. But “plausible” is doing a lot of work here. The benchmark used a generic ARM64 vllm/vllm-openai:v0.20.2 container, Docker Compose, FlashInfer with JIT autotuning, --async-scheduling, --enable-chunked-prefill, FP8 KV cache, a 16,384 batched-token limit, tensor parallel size 1, and Nemotron-specific MoE environment variables including VLLM_USE_FLASHINFER_MOE_FP4=1 and VLLM_FLASHINFER_MOE_BACKEND=throughput.
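To make that list concrete, here is a minimal sketch of how those settings might be assembled into a launch command. The two boolean flags and both environment variables are quoted from the forum report. The spellings --kv-cache-dtype, --max-num-batched-tokens, and --tensor-parallel-size are an assumed mapping of "FP8 KV cache", "16,384 batched-token limit", and "tensor parallel size 1" onto vLLM's CLI, and MODEL_ID is a placeholder because the report's exact repository path is not reproduced here. The helper names are hypothetical.

```python
import shlex

# Placeholder: the article gives the model name, not a full repository path.
MODEL_ID = "Nemotron-3-Nano-30B-A3B-NVFP4"

# Environment variables quoted in the forum report.
MOE_ENV = {
    "VLLM_USE_FLASHINFER_MOE_FP4": "1",
    "VLLM_FLASHINFER_MOE_BACKEND": "throughput",
}


def build_serve_args(model: str) -> list[str]:
    """Return an argument list mirroring the reported setup (a sketch)."""
    return [
        "vllm", "serve", model,
        "--async-scheduling",                  # named in the report
        "--enable-chunked-prefill",            # named in the report
        "--kv-cache-dtype", "fp8",             # assumed spelling for "FP8 KV cache"
        "--max-num-batched-tokens", "16384",   # assumed spelling for the batched-token limit
        "--tensor-parallel-size", "1",         # assumed spelling for tensor parallel size 1
    ]


def render(model: str = MODEL_ID) -> str:
    """Render the environment prefix plus command as one shell-safe line."""
    env = " ".join(f"{key}={shlex.quote(value)}" for key, value in MOE_ENV.items())
    return f"{env} {shlex.join(build_serve_args(model))}"


if __name__ == "__main__":
    print(render())
```

Keeping the whole launch line in one reviewable place is the point: when two benchmark runs disagree, the first thing to diff is this list, not the model.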

In other words: the result is not “one model is faster than another.” The result is that local inference performance is now a systems integration artifact. Container choice, MoE backend, KV-cache dtype, prefix caching, speculative decoding, parser plugins, and hardware-specific fused kernels all change the shape of the user experience.

That is good news for serious builders and bad news for anyone hoping a model-card badge would replace benchmarking.

Single-user speed is the demo path. Concurrency is the truth serum.

Nemotron 3 Nano is an appealing architecture on paper: 30B total parameters, about 3.5B active parameters, and a claimed 1M-token maximum context. It is a hybrid design that combines Mamba-2 layers, MoE layers, and attention layers, with 128 routed experts plus a shared expert per MoE layer and six experts activated per token. The NVFP4 model card reports serious benchmark numbers: MMLU-Pro 77.4, AIME25 86.7, GPQA 71.9, LiveCodeBench 65.4, TauBench V2 average 45.6, IFBench 70.7, and AA-LCR 33.3. It also describes a careful quantization path: FP8 KV cache, selective BF16 retention around attention and nearby Mamba layers, and Quantization-Aware Distillation.

That is exactly the kind of model you would expect NVIDIA to push for edge and local-agent workloads: small active footprint, Blackwell-friendly precision, long context, tool-use benchmarks, and a commercial-use license. The single-concurrency result supports the pitch. Nano’s inter-token latency at one user beat the dense alternatives in the forum comparison.

But agents are not always one user asking one prompt. They are bursty. They read context, call tools, retry failed operations, stream partial reasoning, and run background work while the user is still typing. A local coding assistant can have one visible conversation and several invisible loops: indexing files, summarizing logs, proposing patches, validating shell output, or maintaining a long-running plan. Robotics and edge assistants can be worse: multiple sensor events, task queues, safety checks, and operator requests compete for the same model server.

That is why the concurrency data matters more than the headline tokens/sec. The forum table showed Qwen3-8B rising from 41.12 tokens/sec at concurrency 1 to 1065.22 tokens/sec at concurrency 32 while time-per-output-token stayed roughly in the 23-28 ms band. Nano rose from 19.04 tokens/sec to 429.57 tokens/sec, but TPOT worsened from 14.32 ms to 65.92 ms and time-to-first-token climbed to 741.70 ms. That is not a rounding error. That is the difference between “local model feels alive” and “local model is rummaging through a drawer.”
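The figures above can be turned into two simple ratios: how much of ideal linear throughput scaling each model keeps at concurrency 32, and how much Nano's TPOT grows. The sketch below uses only the numbers quoted in this paragraph. "Scaling efficiency" is an illustrative metric chosen here, not one the forum thread reports, and the class and function names are hypothetical.

```python
from dataclasses import dataclass

CONCURRENCY = 32


@dataclass(frozen=True)
class Sweep:
    name: str
    tps_c1: float    # tokens/sec at concurrency 1, as quoted
    tps_c32: float   # tokens/sec at concurrency 32, as quoted


def scaling_efficiency(sweep: Sweep, concurrency: int = CONCURRENCY) -> float:
    """Observed throughput gain divided by the ideal (linear) gain."""
    return (sweep.tps_c32 / sweep.tps_c1) / concurrency


qwen = Sweep("Qwen3-8B", 41.12, 1065.22)
nano = Sweep("Nemotron 3 Nano", 19.04, 429.57)

# Nano TPOT in milliseconds at concurrency 32 versus concurrency 1.
nano_tpot_ratio = 65.92 / 14.32

if __name__ == "__main__":
    for sweep in (qwen, nano):
        gain = sweep.tps_c32 / sweep.tps_c1
        print(f"{sweep.name}: {gain:.1f}x throughput, "
              f"{scaling_efficiency(sweep):.1%} of linear")
    print(f"Nano TPOT growth: {nano_tpot_ratio:.2f}x")
```

On these numbers Qwen3-8B keeps about 81% of linear scaling and Nano about 70%, which is a modest gap. The per-token latency is where the two diverge: Nano's TPOT grows about 4.6x while Qwen3-8B's stays inside its band. Aggregate throughput alone would hide that.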

The point is not that Qwen3-8B is automatically the better agent model. It is a smaller, dense model operating under a different tradeoff curve. The point is that MoE efficiency is not free. Routing, fused kernels, cache behavior, scheduler decisions, and hardware-specific configs determine whether “only 3.5B active parameters” translates into stable latency under load.

MTP support is not inherited by family name.

The sharpest runtime detail in the thread was not the throughput table. It was the failed attempt to enable Multi-Token Prediction speculative decoding. The user tried MTP for Nano and vLLM 0.20.2 rejected it at startup with NotImplementedError: Unsupported speculative method: 'mtp'. The related nemotron_h_mtp path failed the same way. The runtime accepted other speculative methods, but not the one the operator wanted for this model.

That matters because speculative decoding is increasingly marketed as part of the practical performance story for reasoning models. Nemotron-family materials discuss native MTP layers, and other Nemotron variants have recipes where MTP is central to the serving command. But support is not transitive. A model family can include MTP-capable members; a runtime can support MTP for one variant; a specific Nano/vLLM/Jetson combination can still fail.

For practitioners, the lesson is simple: treat speculative decoding like a driver feature, not a model feature. Canary it per model, per container, per GPU class, and per runtime version. If your deployment plan assumes MTP, test startup failure, throughput improvement, output quality, parser compatibility, and rollback behavior before you build product latency budgets around it.
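One way to make "canary it per model, per container, per GPU class, and per runtime version" operational is to enumerate the combinations explicitly and classify each startup attempt. The following is a minimal sketch under stated assumptions: the error marker is the text quoted from the thread, while the function names, verdict strings, and check labels are hypothetical and simply restate the checks listed above.

```python
from itertools import product

# Error text quoted from the thread; its presence means the runtime
# rejected the requested speculative method at startup.
UNSUPPORTED_MARKER = "Unsupported speculative method"

# Checks named in the paragraph above, to be run for every combination.
CHECKS = (
    "startup",
    "throughput_improvement",
    "output_quality",
    "parser_compatibility",
    "rollback",
)


def canary_matrix(models, containers, gpu_classes, runtime_versions):
    """Yield one canary case per (model, container, GPU class, runtime) tuple."""
    keys = ("model", "container", "gpu", "runtime")
    for combo in product(models, containers, gpu_classes, runtime_versions):
        yield dict(zip(keys, combo))


def startup_verdict(startup_log: str) -> str:
    """Classify a captured startup log for the speculative-decoding canary."""
    if UNSUPPORTED_MARKER in startup_log:
        return "unsupported"        # do not budget latency around this method
    return "needs_measurement"      # startup passed; the remaining CHECKS still apply


if __name__ == "__main__":
    log = "NotImplementedError: Unsupported speculative method: 'mtp'"
    print(startup_verdict(log))
```

A passing startup is deliberately labeled "needs_measurement" rather than "supported": accepting the flag says nothing yet about throughput gain or output quality.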

There is also a documentation caveat worth taking seriously. The user intentionally did not start from NVIDIA’s Jetson-Thor-specific vLLM container, choosing a generic ARM64 image to keep a like-for-like comparison across models. They also later found that vLLM’s recipe recommends Nano-specific reasoning and tool parser flags, including nano_v3_reasoning_parser.py, --reasoning-parser nano_v3, --enable-auto-tool-choice, and --tool-call-parser qwen3_coder. The startup logs also warned that vLLM was using a default MoE config because a Thor-specific fused-MoE JSON was missing.

That does not invalidate the benchmark. It makes it more realistic. Developers absolutely start from generic containers. They absolutely miss parser plugins. They absolutely benchmark before discovering a tuned kernel file. If the optimized path requires a specific container, a specific MoE config, exact environment variables, and parser glue, then those are part of the product surface. They are not footnotes.
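If parser flags and tuned kernel configs are part of the product surface, they can be checked before a benchmark run is accepted. The sketch below is one hedged way to do that: it looks for the three flag names the recipe recommends and for the warning text quoted from the startup logs. The function name and the idea of a "blocking warning" list are assumptions made for illustration, and the flag check uses exact list membership, so a flag passed as --name=value would need extra handling.

```python
# Flag names the vLLM recipe recommends for Nano, as listed above.
RECOMMENDED_FLAGS = (
    "--reasoning-parser",
    "--enable-auto-tool-choice",
    "--tool-call-parser",
)

# Warning text quoted from the startup logs; matched as a substring.
BLOCKING_WARNINGS = ("Using default MoE config",)


def preflight(serve_args: list[str], startup_log: str) -> list[str]:
    """Return human-readable problems; an empty list means the run may proceed."""
    problems = [
        f"missing flag: {flag}"
        for flag in RECOMMENDED_FLAGS
        if flag not in serve_args
    ]
    problems += [
        f"blocking warning: {warning}"
        for warning in BLOCKING_WARNINGS
        if warning in startup_log
    ]
    return problems


if __name__ == "__main__":
    generic_args = ["--async-scheduling", "--enable-chunked-prefill"]
    for problem in preflight(generic_args, "WARNING Using default MoE config"):
        print(problem)
```

Run against the generic setup described in the thread, a check like this would flag all three missing parser flags and the MoE config warning before any latency number is recorded.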

So what should engineers do with this? First, stop evaluating local agent models at concurrency 1 unless the product is genuinely single-session. Measure TTFT, inter-token latency, output throughput, memory behavior, and failure rates across realistic concurrency. Second, benchmark the tuned vendor path and the generic path. The delta tells you how fragile your deployment is. Third, test tool-calling prompts, not just random chat completions. A model that streams quickly but breaks JSON or parser expectations is not a working agent model. Fourth, treat warnings like “Using default MoE config” as a failed preflight check, not harmless noise.
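The metrics in that checklist can be computed from nothing more than token arrival timestamps, whichever client produces them. Here is a minimal sketch, assuming each request records its start time and one timestamp per streamed output token, with failed requests recorded as None. The function names and the result layout are hypothetical, and real harnesses usually report percentiles as well as means.

```python
from statistics import mean
from typing import Optional


def stream_metrics(request_start: float, token_times: list[float]) -> dict:
    """Compute TTFT and mean inter-token latency (seconds) for one stream.

    `token_times` holds the arrival timestamp of each output token, in order.
    """
    if not token_times:
        raise ValueError("no tokens received")
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    return {
        "ttft": token_times[0] - request_start,
        "itl": mean(gaps) if gaps else 0.0,
        "tokens": len(token_times),
    }


def sweep_summary(results: list[Optional[dict]], wall_seconds: float) -> dict:
    """Aggregate per-request metrics for one concurrency level.

    A `None` entry marks a request that failed or timed out.
    """
    if not results:
        raise ValueError("no requests recorded")
    succeeded = [r for r in results if r is not None]
    summary = {
        "failure_rate": 1 - len(succeeded) / len(results),
        "output_tps": sum(r["tokens"] for r in succeeded) / wall_seconds,
    }
    if succeeded:
        summary["mean_ttft"] = mean(r["ttft"] for r in succeeded)
        summary["mean_itl"] = mean(r["itl"] for r in succeeded)
    return summary
```

Running the same summary at concurrency 1, 4, 8, 16, and 32 gives a curve rather than a headline number, which is what a latency budget actually needs.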

Jetson Thor running Nemotron Nano is still an important signal. It says edge systems can host capable, commercially usable local models with serious benchmark credentials. But the thread’s real value is less flattering and more useful: edge AI is now in the runtime-debugging phase. The model fits. The hard part is making the serving path boring under pressure.

Sources: NVIDIA Developer Forum, NVIDIA Nemotron-3-Nano model card, vLLM Nemotron-3-Nano guide, NVIDIA Jetson Thor
