ai-models

DeepSeek’s Permanent V4 Pro Price Cut Turns Model Routing Into a Product Requirement

Anatoliy Kolodkin

24 May 2026 • 6 min read

DeepSeek making its V4 Pro discount permanent looks like a pricing story. It is really an architecture story. Once AI systems stop answering isolated prompts and start running agent loops — reading repos, calling tools, summarizing logs, drafting patches, retrying tests, and explaining themselves every few seconds — inference cost becomes a product constraint, not a finance footnote.

Bloomberg reported that DeepSeek will make its 75% V4 Pro API discount permanent, and DeepSeek's own pricing page now says the deepseek-v4-pro model API price will be officially adjusted to one quarter of the original price after the promotion ends on May 31 at 15:59 UTC. The listed numbers are aggressive: $0.003625 per million cached input tokens, $0.435 per million uncached input tokens, and $0.87 per million output tokens for V4 Pro. DeepSeek V4 Flash is even cheaper at $0.0028 cached input, $0.14 uncached input, and $0.28 output per million tokens.

Put those beside OpenAI's published GPT-5.5 pricing — $5 per million input tokens, $0.50 cached input, and $30 per million output tokens for standard context lengths under 270K — and the shape of the market changes. The Decoder calculated DeepSeek V4 Pro as roughly 11.5x cheaper than GPT-5.5 on uncached input and about 34.5x cheaper on output. Against GPT-5.5 long-context output pricing, the gap is even larger. The exact comparison will vary by workload, provider, cache behavior, and model quality, but the direction is not subtle.

Cheap tokens do not automatically beat better models. But they punish lazy model selection. If every step in your agent workflow defaults to the most expensive frontier model because nobody designed a routing layer, DeepSeek just turned that into an architectural smell.

Agents burn tokens differently than chatbots

A chatbot interaction is usually simple: user asks, model answers, session ends or continues lightly. An agent behaves more like a junior automation system with a very chatty inner monologue. It reads files, opens more files, runs searches, emits plans, calls tools, parses tool output, writes code, runs tests, explains failures, revises patches, summarizes state, and asks for approval. That loop can be valuable, but it consumes tokens like CI consumes CPU.

This is why the DeepSeek numbers matter. The company lists both V4 Pro and V4 Flash with a 1 million token context length and 384K maximum output, available behind an OpenAI-format base URL and an Anthropic-format base URL. The models support JSON output, tool calls, chat-prefix completion beta, and FIM completion beta in non-thinking mode only. Those details are not trivia; they are exactly the interface features agent builders need when plugging a model into existing harnesses.

The cached-input price is especially provocative. Agent systems often revisit the same repository context, architecture notes, API docs, previous tool outputs, and instruction scaffolding across a session. If cache hits are reliable and the application is designed to preserve stable context, the economics of repeated repo-scale work change dramatically. One Hacker News commenter in the active pricing discussion described adjusting workflows to read project files early so later requests become cache hits. That is the kind of practical behavior a real price gap creates.

Output cost is the other pressure point. Agents produce a lot of output: plans, diffs, explanations, tool arguments, test summaries, and revised attempts. GPT-5.5 output at $30 per million tokens versus V4 Pro at $0.87 is not a rounding error. Even if the stronger model solves more tasks on the first try, teams should measure whether “expensive success” actually beats “cheap attempt plus objective checks” for bounded workflows like formatting migrations, test-log summarization, dependency-update notes, or first-pass code review.

The useful metric is cost per accepted task

The obvious caveat is that token price is not total cost. A cheaper model that needs twice as many tokens, fails more often, produces lower-quality patches, or requires more human review can erase some of its advantage. Gateway markup, enterprise logging, data-residency requirements, latency, uptime, and security review also matter. If a provider path adds operational risk or policy friction, the official API price is not the price your organization pays.

That is why builders should stop comparing models only on cost per million tokens. The metric that matters is cost per accepted task. For a coding agent, that means the full path from task assignment to reviewed, tested, merged change. Track input tokens, cached input tokens, output tokens, retries, wall-clock time, tool calls, test pass rate, review comments, rollback rate, and final acceptance. Then compare models by workflow type.

A frontier model may still be the right choice for high-risk work: security-sensitive changes, architecture decisions, ambiguous debugging, cross-service behavior, or final review. A cheaper model may be more than good enough for repetitive, verifiable tasks: renaming APIs, updating imports, summarizing CI failures, drafting changelog entries, generating test fixtures, or performing second-pass lint fixes. A local model may be appropriate for private data or offline environments. A governed cloud model may be required for enterprise auditability. The point is not “use DeepSeek for everything.” The point is “stop using one default model for everything.”

This is where model routing becomes product infrastructure. A serious agent stack should classify the task, estimate risk, inspect context size, understand tool requirements, apply data policy, choose a model, define escalation conditions, and record why the route was selected. If tests fail repeatedly, escalate. If the task touches auth, payments, crypto, permissions, or production config, escalate. If the output requires long-form reasoning and the cheap model starts thrashing, escalate. If the work is low-risk and objectively checked, route down. This should be policy, not a developer remembering to click the cheaper dropdown.

Cache-aware design is now part of agent engineering

The DeepSeek pricing table also nudges teams toward a more mature view of context. Many agent products treat context as a bottomless junk drawer: stuff the repo summary, tool history, docs, issue text, style guide, and half the conversation into the prompt and hope the model figures it out. That approach is expensive and brittle. With a large cached-input gap, context engineering becomes a cost-control discipline.

Stable system prompts should stay stable. Repository summaries should be reusable and versioned. Long-lived sessions should hydrate important context once, then avoid constantly perturbing the prefix in ways that break cache hits. Tool output should be compact and structured. Agents should know when to retrieve a file versus when to rely on a cached summary, and they should invalidate assumptions when the code changes. Cache optimization cannot come at the expense of correctness, but ignoring cache behavior is leaving money on the floor.

The Hacker News reaction around the pricing update was unusually substantive for a price-page story: roughly 441 points and 250 comments during the research window. Practitioners asked whether the economics survive gateways, whether Azure-hosted variants can offer comparable pricing, what data-retention terms apply, whether DeepSeek uses more tokens per task, and how cache-hit rates affect real bills. Those are the questions teams ask when agent usage starts to look like infrastructure spend.

This pressures everyone, including the premium labs

OpenAI, Anthropic, and Google are not going to compete only on raw token price. Their enterprise pitch includes model quality, safety work, governance controls, managed runtimes, data commitments, integrations, support, and trust. That bundle matters. If you are operating in a regulated environment, “cheapest model” is not the procurement strategy. If the task is genuinely hard, a better model can be cheaper by succeeding quickly.

But the DeepSeek cut still changes the negotiation. It gives product teams a concrete reason to unbundle their agent workloads. Premium models become the escalation layer, review layer, or high-difficulty solver. Cheaper models become the bulk-work layer. Local/open models cover privacy-sensitive or offline cases where feasible. Gateways and routing platforms become more important because no serious buyer wants every model swap to become an application rewrite.

It also puts pressure on developer-tool UX. A model picker is not enough. The product should expose intent and policy: optimize for cost, optimize for quality, keep data local, require enterprise logging, run cheap first then escalate, or use frontier review before merge. The agent should make the routing decision explainable enough that teams can audit it later.

For engineering leaders, the action item is straightforward: instrument before you optimize. Add per-step model logging, track cache hit rates, separate input from output costs, record retries and final acceptance, and label workflows by risk and task class. Then define routing policy and escalation rules.

DeepSeek's permanent discount does not mean the cheapest model wins. It means the one-model-default era is getting harder to justify. Agentic systems are too token-hungry, too varied, and too policy-sensitive for that. The teams that treat model selection as infrastructure will get better economics and safer workflows. The teams that do not will eventually discover that their AI strategy is just a very expensive autocomplete bill with no routing table.

Sources: Bloomberg, DeepSeek pricing docs, The Decoder, OpenAI API pricing

Agents burn tokens differently than chatbots

The useful metric is cost per accepted task

Cache-aware design is now part of agent engineering

This pressures everyone, including the premium labs

Sign up for more like this.