DeepSeek V4 on NVIDIA Blackwell Is a Long-Context Agent Story Disguised as an Inference Post

Anatoliy Kolodkin

14 May 2026 • 5 min read

DeepSeek V4 is being sold as a million-token model launch. That is the least interesting way to read it.

NVIDIA’s new guidance for running DeepSeek V4 on Blackwell, GPU-accelerated endpoints, NIM, vLLM, SGLang, NemoClaw, AI-Q, and NeMo is really a serving story. The model is large enough to get attention: DeepSeek-V4-Pro is listed at 1.6 trillion total parameters with 49 billion active, while DeepSeek-V4-Flash is 284 billion total with 13 billion active. But the operator question is sharper: can a long-context agent stay useful when the context is full of code, logs, tool calls, retrieval chunks, approvals, and stale memory?

That is where NVIDIA’s post earns coverage. It treats the model launch less like a leaderboard trophy and more like an infrastructure menu. Hosted endpoints for experimentation. NIM containers for self-hosting. vLLM and SGLang recipes for teams willing to own the serving stack. NeMo AutoModel for post-training. NemoClaw, AI-Q Blueprint, and Data Explorer Agent for agent workflows. The message is not subtle: long-context models are becoming systems products, not just model cards.

The million-token claim is a memory topology problem

Both DeepSeek V4 variants support a 1M-token context window, and NVIDIA says the API path allows maximum output length up to 384K tokens. DeepSeek’s own technical framing matters more than the round number: the models combine Compressed Sparse Attention, DeepSeek Sparse Attention, and Heavily Compressed Attention, plus Manifold-Constrained Hyper-Connections and the Muon optimizer. At 1M-token context, DeepSeek reports V4-Pro needs 27% of DeepSeek-V3.2’s single-token inference FLOPs and 10% of V3.2’s KV cache. NVIDIA expresses the same point as a 73% per-token inference FLOP reduction and a 90% KV-cache memory reduction versus V3.2.
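The two framings are the same numbers read from opposite sides. A minimal sketch of the arithmetic follows; the function name is illustrative and not taken from either source.

```python
def reduction_percent(remaining_percent: float) -> float:
    """Convert 'needs X% of the baseline' into 'cuts the baseline by Y%'."""
    if not 0 <= remaining_percent <= 100:
        raise ValueError("remaining_percent must be between 0 and 100")
    return 100 - remaining_percent


# DeepSeek's framing: V4-Pro needs 27% of V3.2's FLOPs and 10% of its KV cache.
flops_reduction = reduction_percent(27)
kv_reduction = reduction_percent(10)

# NVIDIA's framing of the same figures.
print(f"{flops_reduction:.0f}% fewer FLOPs, {kv_reduction:.0f}% less KV cache")
```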

That KV-cache number is the operator hook. Long-context claims are cheap until they meet memory. Every extra token carried through an agent session becomes state the serving system must allocate, move, evict, shard, or compress. A million-token model with naive attention economics is not a product feature; it is a budget fire with a README. If DeepSeek’s attention compression holds up under real workloads, it changes what can be hosted economically. It does not remove the need to understand the serving path.

The vLLM recipe gives away the practical reality. NVIDIA points to B300 and H200 eight-GPU data-parallel plus expert-parallel deployments, while the recipe notes H200 context is capped at 800K tokens to leave KV headroom. It also says GB200 NVL4 needs two trays for V4-Pro because the roughly 960GB mixed-precision checkpoint does not fit on one tray. That is the whole story in miniature: the model says “1M context,” the deployment says “show me your memory topology.”
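The two-tray point is a ceiling division once you know how much GPU memory a tray really offers. Here is a rough sketch of that check. The 960GB checkpoint figure comes from the recipe as quoted above; the per-tray memory value is a placeholder assumption for illustration, not a hardware spec, and the calculation ignores activations and runtime overhead.

```python
import math


def trays_needed(checkpoint_gb: float, tray_memory_gb: float,
                 kv_headroom_gb: float = 0.0) -> int:
    """Smallest number of trays whose combined memory holds weights plus KV headroom."""
    if tray_memory_gb <= 0:
        raise ValueError("tray_memory_gb must be positive")
    return math.ceil((checkpoint_gb + kv_headroom_gb) / tray_memory_gb)


CHECKPOINT_GB = 960      # approximate mixed-precision checkpoint size quoted above
ASSUMED_TRAY_GB = 744    # ASSUMPTION: placeholder; substitute your tray's usable GPU memory

print(trays_needed(CHECKPOINT_GB, ASSUMED_TRAY_GB))
```

Any per-tray figure below the checkpoint size gives the same answer, and adding KV headroom only pushes the count up.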

For agents, context length is not correctness

The temptation will be to treat DeepSeek V4 as a shortcut for agent memory. Just stuff the repository, issue history, logs, docs, previous attempts, and tool results into the window and let the model sort it out. That is how teams turn context windows into landfills.

Long context helps, but it is not an architecture by itself. Coding agents need to preserve task constraints after many tool calls. Research agents need to cite the right source after reading a pile of nearly identical documents. Ops agents need to distinguish current logs from old noise. Workflow agents need to honor approval boundaries even when earlier instructions and later tool output conflict. A model that can ingest 1M tokens can still fail by attending to the wrong evidence, following stale state, or producing a plausible patch that violates a real boundary.

DeepSeek’s agent benchmark numbers are promising enough to take seriously. NVIDIA cites DeepSeek’s V4-Pro Max results including Terminal Bench 2.0 at 67.9, SWE Verified at 80.6 resolved, SWE Pro at 55.4 resolved, MCPAtlas Public at 73.6, and Toolathlon at 51.8. Those are relevant because they sit closer to the work developers actually care about: terminals, software engineering tasks, MCP-style tool use, and multi-step agent behavior. They are still not a substitute for evaluating your workflow.

If you are building on this stack, the test plan should look more like an infrastructure qualification than a chat demo. Measure long repository ingestion. Measure latency after a large prefill, not just the first happy request. Test multi-file patching, tool-call schema adherence, context retention after dozens of tool outputs, and behavior near the advertised context limit. If the application is document analysis, test citation grounding and retrieval precision under long contexts. If the application is a coding agent, test whether it still follows allowed-file rules when the context is noisy and the session has run long.
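The allowed-file test is one of the easier checks to automate. The sketch below assumes the agent returns its patch as a git-style diff and that the allowlist is a set of glob patterns; both are assumptions about your harness, and the function names are illustrative. It reads only `diff --git` header lines, so paths containing spaces would need more careful parsing.

```python
import re
from fnmatch import fnmatchcase

# Matches git diff headers such as: diff --git a/src/app.py b/src/app.py
_DIFF_HEADER = re.compile(r"^diff --git a/(.+?) b/(.+)$")


def touched_files(git_diff: str) -> set[str]:
    """Collect every path named in the diff headers (old and new side of renames)."""
    files: set[str] = set()
    for line in git_diff.splitlines():
        match = _DIFF_HEADER.match(line)
        if match:
            files.update(match.groups())
    return files


def allowlist_violations(git_diff: str, allowed_patterns: list[str]) -> list[str]:
    """Return the touched paths that match none of the allowed glob patterns."""
    return sorted(
        path for path in touched_files(git_diff)
        if not any(fnmatchcase(path, pattern) for pattern in allowed_patterns)
    )


SAMPLE_DIFF = """\
diff --git a/src/app.py b/src/app.py
--- a/src/app.py
+++ b/src/app.py
@@ -1 +1 @@
-print("old")
+print("new")
diff --git a/deploy/prod.yaml b/deploy/prod.yaml
--- a/deploy/prod.yaml
+++ b/deploy/prod.yaml
@@ -1 +1 @@
-replicas: 2
+replicas: 20
"""

print(allowlist_violations(SAMPLE_DIFF, ["src/*", "tests/*"]))
```

Run the same check at several context fill levels. A rule the agent honors at 20K tokens and breaks at 800K is exactly the failure this kind of qualification is meant to catch.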

NVIDIA is selling the stack, not just the silicon

The Blackwell angle is predictable but still important. NVIDIA cites SemiAnalysis InferenceX results showing DeepSeek-V4-Pro on GB200 NVL72 at more than 150 tokens/sec/user and 30x better performance per watt than H200 at similar interactivity levels. That is the number NVIDIA wants buyers to remember. The more durable point is the breadth of the deployment story: NIM for teams that want a supported API-compatible container, vLLM and SGLang for teams optimizing their own serving fleet, and agent frameworks layered above the inference runtime.

SGLang’s cookbook is especially revealing because it exposes the knobs operators actually need: B200, B300, GB200, GB300, H200, H100; Flash versus Pro; low-latency, balanced, max-throughput, context-parallel, and prefill/decode disaggregation recipes. That vocabulary is where long-context models become production systems. Prefill and decode are different bottlenecks. Context parallelism and expert parallelism are not optional jargon when the model and KV cache are this large. Tool-calling and reasoning parsers are compatibility contracts, not decorative flags.
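Because prefill and decode are different bottlenecks, it helps to measure them separately. A minimal sketch, assuming your client records when the request was sent and when each streamed token arrived: time to first token is dominated by prefill, and the gaps between later tokens reflect decode. The names here are illustrative and not part of vLLM or SGLang.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class LatencySplit:
    ttft_s: float                 # time to first token, dominated by prefill
    mean_itl_s: Optional[float]   # mean inter-token latency, dominated by decode
    decode_tokens_per_s: Optional[float]


def split_latency(request_sent_s: float, token_times_s: Sequence[float]) -> LatencySplit:
    """Separate prefill-side and decode-side latency from per-token arrival times."""
    if not token_times_s:
        raise ValueError("need at least one token timestamp")
    ttft = token_times_s[0] - request_sent_s
    gaps = [later - earlier for earlier, later in zip(token_times_s, token_times_s[1:])]
    if not gaps:
        # A single token gives no decode signal.
        return LatencySplit(ttft, None, None)
    mean_itl = sum(gaps) / len(gaps)
    rate = 1.0 / mean_itl if mean_itl > 0 else None
    return LatencySplit(ttft, mean_itl, rate)


# Example: 2.0 s to first token, then one token every 0.5 s.
print(split_latency(0.0, [2.0, 2.5, 3.0]))
```

Tracking the two numbers separately across context sizes shows whether a regression comes from the large prefill or from decode under a full KV cache.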

The mixed-precision detail matters too. DeepSeek’s instruct checkpoints use FP4 and FP8, with MoE expert parameters in FP4 and most other parameters in FP8. That is exactly the kind of detail that decides whether a model is deployable on a given GPU generation. It also creates another failure surface: kernels, quantization formats, parsers, NCCL behavior, memory allocation, and checkpoint layout all have to agree. In 2026, “the model supports it” is not enough. The serving stack has to support it in the exact configuration you plan to run.
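A back-of-envelope bound shows why the precision mix decides deployability. FP4 stores a parameter in half a byte and FP8 in one byte, so an all-FP4 and an all-FP8 copy of the 1.6 trillion parameters bracket the mixed checkpoint. The sketch below is an estimate only: it ignores quantization scale factors and file metadata, and it makes no claim about how the parameters are actually split between experts and the rest.

```python
# Bytes per parameter for each storage format (4 bits and 8 bits).
BYTES_PER_PARAM = {"fp4": 0.5, "fp8": 1.0}


def weights_gb(params: float, fmt: str) -> float:
    """Raw weight storage in GB (1e9 bytes) for `params` parameters in one format."""
    return params * BYTES_PER_PARAM[fmt] / 1e9


def mixed_weights_gb(expert_params: float, other_params: float) -> float:
    """Storage when experts are FP4 and all other parameters are FP8."""
    return weights_gb(expert_params, "fp4") + weights_gb(other_params, "fp8")


TOTAL_PARAMS = 1.6e12  # V4-Pro total parameters, as listed above

lower_gb = weights_gb(TOTAL_PARAMS, "fp4")   # everything in FP4
upper_gb = weights_gb(TOTAL_PARAMS, "fp8")   # everything in FP8
print(f"between {lower_gb:.0f} GB and {upper_gb:.0f} GB")
```

The roughly 960GB checkpoint cited in the vLLM recipe falls inside that range, which is what you would expect if most of the parameters sit in FP4 experts.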

The practical LGTM take: do not adopt DeepSeek V4 because the context number is large. Adopt it if the model/runtime/hardware combination makes your agent more correct, more economical, and more observable than the alternatives. Compare it against smaller local models, hosted frontier APIs, retrieval plus disciplined memory, and workflow-level state management. In many products, better retrieval beats bigger context. In others, a million-token window changes the game. The only way to know is to measure the workflow, not the headline.

Context length is not the product. Serving discipline is.

Sources: NVIDIA Technical Blog, DeepSeek-V4-Pro model card, DeepSeek-V4-Flash model card, vLLM DeepSeek-V4-Pro recipe, SGLang DeepSeek-V4 deployment cookbook
