Dynamo Snapshot Turns LLM Cold Starts Into an Inference-SRE Problem

Anatoliy Kolodkin

27 May 2026 • 4 min read

Cold start is one of those inference problems that sounds like housekeeping until the invoice arrives.

NVIDIA’s new Dynamo Snapshot feature is aimed at a very specific kind of waste: Kubernetes workers sitting on expensive GPUs while they download model artifacts, initialize engines, warm kernels, capture graphs, and generally perform the ritual dance required before the first token can be served. On cheap CPU services, that is annoying. On B200-class infrastructure, it is idle capital wearing a pod name.

The design is straightforward in concept and sharp in where it draws the line. Dynamo Snapshot checkpoints an inference worker after the engine has initialized, but before the worker registers with the distributed runtime. That matters because the checkpoint should preserve the expensive state — model weights, initialized runtime structures, CUDA state, graph-ready process memory — without trying to resurrect stale network connections, pod identities, control-plane relationships, or distributed runtime membership. Anyone who has debugged a “restored” process with old sockets knows why that boundary is not a detail.
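To make that boundary concrete, here is a toy model of the lifecycle ordering. It is a sketch, not Dynamo's API: every name in it (`WorkerState`, `Snapshot`, `checkpoint`, `restore`, `register`) is hypothetical and exists only to show that the snapshot carries engine state while identity and registration are rebuilt after each restore.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class WorkerState:
    """Hypothetical view of a worker; not a Dynamo type."""
    engine_ready: bool = False       # weights loaded, kernels warm, graphs captured
    pod_name: Optional[str] = None   # per-pod identity, never part of a snapshot
    registered: bool = False         # membership in the distributed runtime


@dataclass(frozen=True)
class Snapshot:
    """Stand-in for the checkpoint artifact: engine state only."""
    engine_ready: bool


def initialize_engine(state: WorkerState) -> WorkerState:
    return replace(state, engine_ready=True)


def checkpoint(state: WorkerState) -> Snapshot:
    # The snapshot is taken after engine init and before registration.
    if not state.engine_ready:
        raise RuntimeError("nothing expensive to preserve yet")
    if state.registered:
        raise RuntimeError("too late: the snapshot would capture runtime membership")
    return Snapshot(engine_ready=state.engine_ready)


def restore(snapshot: Snapshot, pod_name: str) -> WorkerState:
    # A restored worker gets a fresh identity and starts unregistered.
    return WorkerState(engine_ready=snapshot.engine_ready, pod_name=pod_name)


def register(state: WorkerState) -> WorkerState:
    if state.pod_name is None:
        raise RuntimeError("cannot register without a pod identity")
    return replace(state, registered=True)


if __name__ == "__main__":
    warm = initialize_engine(WorkerState(pod_name="checkpoint-job"))
    snap = checkpoint(warm)
    replica = register(restore(snap, pod_name="worker-b"))
    print(replica)
```

The design choice the sketch encodes is that `checkpoint` refuses a registered worker outright, so a stale membership can never end up inside the artifact.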

Under the hood, NVIDIA is combining cuda-checkpoint for GPU state with CRIU for host and container state. The Kubernetes path uses a privileged snapshot-agent DaemonSet, a placeholder image, shared checkpoint storage, and Dynamo’s DynamoCheckpoint / DynamoGraphDeployment machinery. The current support matrix is intentionally narrow: experimental, vLLM-focused, x86_64 GPU nodes, NVIDIA driver 580.xx or newer, and ReadWriteMany storage for cross-node restore. Translation: useful enough to test; not something to casually flip on before lunch.

The useful trick is not checkpointing. It is checkpointing less.

The most interesting number in the post is not the headline 21x restore improvement. It is the Qwen3-0.6B checkpoint shrinking from roughly 190 GiB to 6 GiB when Dynamo Snapshot unmaps and releases unused KV cache memory while preserving virtual addresses. That is the systems-engineering move that makes the rest of the feature credible. A cold worker has not served requests yet, so saving a huge empty KV cache is not safety; it is ceremony.

NVIDIA also reports CRIU restore improvements that are large enough to matter operationally: Qwen3-0.6B drops from 6.8 seconds with upstream CRIU to 2.4 seconds with AIO and parallel memfd; Qwen3-8B falls from 24 seconds to 4.7 seconds; gpt-oss-120B goes from 119 seconds to 15 seconds. Those are not just nicer benchmark bars. They change whether autoscaling can react inside a traffic spike rather than after it has already punished users.
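For scale, the same figures can be turned into speedup ratios and seconds saved per restore. The snippet below is plain arithmetic on the numbers quoted above; the function and variable names are illustrative.

```python
# Restore times in seconds as quoted above: (upstream CRIU, AIO + parallel memfd).
RESTORE_SECONDS = {
    "Qwen3-0.6B": (6.8, 2.4),
    "Qwen3-8B": (24.0, 4.7),
    "gpt-oss-120B": (119.0, 15.0),
}


def summarize(times):
    """Map each model to (speedup ratio, seconds saved), rounded to one decimal."""
    return {
        model: (round(before / after, 1), round(before - after, 1))
        for model, (before, after) in times.items()
    }


if __name__ == "__main__":
    for model, (ratio, saved) in summarize(RESTORE_SECONDS).items():
        print(f"{model}: {ratio}x faster, {saved} s saved per restore")
```

The ratio grows with model size in these three data points, from about 2.8x for the smallest model to about 7.9x for the largest.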

The proposed GPU Memory Service is the bigger architectural clue. Once model weights dominate checkpoint size, the classic path — storage to host memory to GPU memory, mostly serialized — becomes the bottleneck. NVIDIA’s split design pulls model weights out of the core CRIU checkpoint: in its example, a gpt-oss-120B core checkpoint falls from 129 GiB to 6.7 GiB, while the weights live in a separate 74 GiB GMS artifact. A proof-of-concept backend striped across eight local NVMe SSDs restored that model in under five seconds.
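A back-of-envelope check shows what "under five seconds" implies for storage. The sketch below assumes the whole 74 GiB weight artifact is read once within the five-second budget and that the load spreads evenly across the eight drives. It ignores the core checkpoint and any overlap between stages, so treat the result as a rough lower bound on read bandwidth rather than a measured figure.

```python
def required_bandwidth_gib_s(artifact_gib: float, restore_seconds: float) -> float:
    """Average read bandwidth needed to move the artifact within the time budget."""
    return artifact_gib / restore_seconds


def per_device_gib_s(aggregate_gib_s: float, devices: int) -> float:
    """Share of the aggregate bandwidth each device carries if striping is even."""
    return aggregate_gib_s / devices


if __name__ == "__main__":
    weights_gib = 74.0     # separate GMS weight artifact, from the article
    budget_seconds = 5.0   # "under five seconds", used here as an upper bound
    drives = 8             # local NVMe SSDs in the proof of concept

    aggregate = required_bandwidth_gib_s(weights_gib, budget_seconds)
    print(f"aggregate: at least {aggregate:.1f} GiB/s")
    print(f"per drive: at least {per_device_gib_s(aggregate, drives):.2f} GiB/s")
```

That works out to at least 14.8 GiB/s in aggregate, or 1.85 GiB/s per drive, which is why the restore path starts to depend on node-local storage and placement rather than on CRIU alone.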

That is where this stops being a Dynamo-only feature and starts looking like the shape of production inference platforms. Fast restore wants local NVMe, GPUDirect Storage, RDMA or NVLink paths, placement-aware scheduling, and a control plane that understands which node already has the useful weight artifact close by. In other words, “cold start” is not a single optimization. It is storage architecture, scheduler policy, CUDA state management, and serving lifecycle design pretending to be one line in a changelog.

Autoscaling inference is useless if replicas warm up like batch jobs

For practitioners, the immediate takeaway is simple: measure cold start as part of serving SLOs, not as a deployment footnote. If a worker takes minutes to become useful, your autoscaler has two bad options. It can keep too much GPU capacity warm, which burns money quietly. Or it can scale late, which burns user trust loudly. The more bursty the workload — agents, coding assistants, document pipelines, multimodal queues, finance analysis around market open — the more visible that trade-off becomes.
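The two bad options can be put in rough numbers. The sketch below is a deliberately simple model. The excess request rate, spare-replica count, and GPUs per replica are hypothetical inputs. The 119-second and 15-second values are the gpt-oss-120B CRIU restore times reported by NVIDIA, used here only as stand-ins for time-to-ready.

```python
def backlog_requests(excess_rps: float, seconds_to_ready: float) -> float:
    """Requests arriving beyond current capacity while the new replica is not ready."""
    return excess_rps * seconds_to_ready


def warm_pool_gpu_hours(spare_replicas: int, gpus_per_replica: int, hours: float) -> float:
    """GPU-hours spent keeping spare replicas warm over a period."""
    return spare_replicas * gpus_per_replica * hours


if __name__ == "__main__":
    excess_rps = 10.0  # hypothetical: traffic above what current replicas absorb

    # Option 1: scale late and let requests queue until the replica is ready.
    for label, seconds in (("slow restore", 119.0), ("fast restore", 15.0)):
        queued = backlog_requests(excess_rps, seconds)
        print(f"{label}: about {queued:.0f} requests queued")

    # Option 2: keep spare capacity warm all day (hypothetical pool size).
    print(f"warm pool: {warm_pool_gpu_hours(2, 8, 24.0):.0f} GPU-hours per day")
```

Under these assumptions a slow restore queues roughly 1,190 requests per scale-out event against 150 for the fast one, while the warm pool costs 384 GPU-hours a day whether or not a spike arrives.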

Dynamo Snapshot is also a reminder that inference SRE is becoming its own discipline. The KPIs are not just tokens per second. Teams should track p95 and p99 scale-out time, idle GPU-minutes during warmup, restore success rate, per-model checkpoint size, storage bandwidth, cache hit behavior, graph capture compatibility, and whether restored workers behave identically under real traffic. A five-second restore that occasionally returns a cursed process is not an optimization; it is a pager with better marketing.
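Several of those KPIs fall out of a plain log of scale-out events. The following is a minimal sketch, assuming each event records the time until the replica was ready (or until the failure was detected), the GPUs it held, and whether the restore succeeded. The `ScaleOutEvent` record and field names are hypothetical, and the percentile uses the nearest-rank method.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ScaleOutEvent:
    seconds_to_ready: float  # time until ready, or until the failure was detected
    gpus: int                # GPUs held by the replica during warmup
    restored_ok: bool        # whether the restore produced a usable worker


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def summarize(events):
    # Latency percentiles cover successful restores; idle time covers every attempt.
    ready = [e.seconds_to_ready for e in events if e.restored_ok]
    idle_gpu_seconds = sum(e.seconds_to_ready * e.gpus for e in events)
    return {
        "p95_scale_out_s": percentile(ready, 95),
        "p99_scale_out_s": percentile(ready, 99),
        "restore_success_rate": len(ready) / len(events),
        "idle_gpu_minutes": idle_gpu_seconds / 60,
    }


if __name__ == "__main__":
    events = [ScaleOutEvent(float(s), 1, s != 20) for s in range(1, 21)]
    print(summarize(events))
```

Failed restores are left out of the latency percentiles but still count toward idle GPU-minutes, since the hardware was held either way.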

The security and operations caveats are real. A privileged DaemonSet doing CRIU operations is not a small platform decision. Driver version constraints matter. Shared storage semantics matter. Multi-GPU, multi-node, TensorRT-LLM, GPU Memory Service, and more complex restore paths are still on the roadmap. If you operate in regulated environments, the checkpoint artifact itself becomes something to classify, protect, expire, and audit. It may contain enough process state to deserve more respect than “some file on a PVC.”

So what should builders do now? If you run vLLM or SGLang-style workloads on Kubernetes, pick one non-critical model and benchmark the whole lifecycle: image pull, model download, engine init, graph capture, checkpoint creation, restore, registration, first-token latency, steady-state latency, and failure recovery. Test with your actual storage class and node topology. Compare it against simpler mitigations — prewarmed pools, smaller models, model locality, better image layering, or explicit traffic shaping. Dynamo Snapshot is promising precisely because it attacks a real cost center, but it needs to beat the boring alternatives under your workload.
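One way to keep that benchmark honest is to time every phase with the same clock and report them side by side. The helper below is a minimal sketch: `LifecycleTimer` is a hypothetical name, the phase labels come from the list above, and the `time.sleep` calls are placeholders for whatever actually triggers each step in your environment.

```python
import time
from contextlib import contextmanager


class LifecycleTimer:
    """Accumulates wall-clock duration per named lifecycle phase."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self.durations = {}

    @contextmanager
    def phase(self, name):
        start = self._clock()
        try:
            yield
        finally:
            # Record the phase even if it raised, so failed runs are still visible.
            elapsed = self._clock() - start
            self.durations[name] = self.durations.get(name, 0.0) + elapsed

    def total(self):
        return sum(self.durations.values())


if __name__ == "__main__":
    timer = LifecycleTimer()
    with timer.phase("restore"):
        time.sleep(0.01)   # placeholder for the real restore step
    with timer.phase("registration"):
        time.sleep(0.01)   # placeholder for runtime registration
    for name, seconds in timer.durations.items():
        print(f"{name}: {seconds:.3f} s")
    print(f"total: {timer.total():.3f} s")
```

Running the same harness against a prewarmed pool or a plain cold start gives a like-for-like comparison instead of a restore number quoted in isolation.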

The forward-looking read is that LLM serving is inheriting the hardest parts of checkpoint/restore because the economics finally justify the work. We tolerated slow cold starts when inference was a demo or a batch job. Agents, interactive copilots, and always-on model services do not have that luxury. Autoscaling only works when a new replica becomes useful fast enough to matter.

LGTM take: Dynamo Snapshot matters because it treats cold start as an inference economics problem, not a Kubernetes annoyance. The feature is experimental, but the abstraction boundary is right: preserve the expensive state, rebuild the distributed identity, and stop paying GPUs to warm themselves up.

Sources: NVIDIA Developer Blog, Dynamo Snapshot documentation, ai-dynamo/dynamo, CRIU, NVIDIA CUDA checkpointing with CRIU
