ai-models

Claude Opus 4.7 Looks Less Like a Model Refresh and More Like Anthropic Fixing Agent Reliability

Anatoliy Kolodkin

16 Apr 2026 • 5 min read

Anthropic did not price Claude Opus 4.7 like a moonshot. That is the point.

The company shipped its new flagship generally available model at the same $5 per million input tokens and $25 per million output tokens as Opus 4.6, then spent most of the launch arguing not that Opus 4.7 is smarter in the abstract, but that it is steadier when the work gets long, messy, and tool-heavy. In other words, Anthropic is no longer selling frontier capability as a one-off peak. It is trying to sell reliability as an operational feature. That is a much more important story for engineers than another leaderboard screenshot.

There is plenty of benchmark material if you want it. Anthropic says Opus 4.7 is its most capable generally available model for complex reasoning and agentic coding. The company says it now supports a 1 million token context window, up to 128,000 output tokens, a new xhigh effort setting, task budgets in beta, and significantly stronger high-resolution vision support up to 2,576 pixels on the long edge, about 3.75 megapixels. Anthropic also claims notable gains over Opus 4.6 in instruction following, knowledge work, file-system memory, and long-horizon software tasks, while keeping nominal pricing unchanged.

But the benchmark lines that matter most are the ones that read like operations reports. Cursor says Opus 4.7 hit 70% on CursorBench versus 58% for Opus 4.6. Harvey reports 90.9% on BigLaw Bench at high effort. XBOW says its visual-acuity benchmark jumped to 98.5% from 54.5%. Anthropic’s launch page is packed with customer quotes about fewer tool errors, better self-verification, stronger follow-through, and models that do not stop halfway through a job. The subtext is obvious: frontier labs have realized the next buyer question is not “is it smarter?” but “will it finish the work without babysitting?”

The expensive part of agentic coding is not tokens. It is flakiness.

This is the part too many model launches skip. Teams do not usually abandon an agent because it missed a benchmark by three points. They abandon it because it burns 25 minutes chasing the wrong branch, loops on a broken tool call, quietly drops an instruction, or returns something polished-looking that turns out to be wrong. Those failures are not glamorous, but they are what determine whether a model becomes a daily driver or a demo.

Read Anthropic’s documentation with that lens and Opus 4.7 starts to look less like a generic intelligence bump and more like a harness-friendly model. The new task budgets feature lets developers give Claude an advisory token allowance for a whole agentic loop, not just a hard cap on one response. The new xhigh effort tier makes it clearer when Anthropic expects users to spend more compute on harder coding work. The docs also note more regular progress updates during long traces, fewer subagents by default, and more literal instruction following. That is product language for “we know people are putting this into workflows where predictability matters more than charm.”

There is a second-order implication here that Anthropic does not quite say outright. If models are getting better at self-verification and long-run coherence, some of the scaffolding teams built around earlier generations is becoming technical debt. The company already made this argument in its Managed Agents material last week, where it described old harness assumptions turning into dead weight as models improved. Opus 4.7 extends the same theme. The right question is no longer just what extra wrappers you need around the model. It is which wrappers you can finally delete.

Same sticker price does not mean same bill

The catch is that “no price increase” is not the same as “no migration cost.” Anthropic’s release notes say Opus 4.7 uses a new tokenizer that may consume roughly 1x to 1.35x as many tokens as previous models, depending on the content. That matters. If your prompts are long, your context compaction thresholds are tight, or your workflow depends on lots of document and image inputs, you may absolutely see bigger bills even though the posted rates did not move.

There are breaking changes too. Extended thinking budgets are gone on the Messages API. Sampling controls like temperature, top_p, and top_k now error if you set non-default values. Thinking content is omitted by default unless you opt back in. Anthropic’s migration advice is effectively telling builders to stop treating Opus 4.7 as a drop-in patch release. This is a behaviorally different model with different cost dynamics and different assumptions about how the API should be used.

That is why the smart rollout path is boring. Re-baseline your core prompts. Re-check token counts. Revisit any prompt scaffolding designed to force explicit status updates, repeated self-checks, or aggressive tool invocation, because the model may now do some of that better on its own. And if your product streams visible reasoning, do not miss the thinking-display change or your users will experience a mysterious pause where progress used to be.

Anthropic is making a control-plane argument, not just a model argument

What makes this launch more interesting than a standard flagship refresh is the timing. Anthropic spent the past week pushing two adjacent ideas: Managed Agents as the durable infrastructure layer, and Project Glasswing as a reminder that stronger models come with security consequences. Opus 4.7 sits right between those two messages. It is more capable, but not as cyber-sensitive as Mythos Preview. It adds more steering controls, but also more automated safeguards. Anthropic is clearly trying to define itself as the frontier lab that ships intelligence with operating instructions.

That is a meaningful contrast with OpenAI’s current posture around Codex and GPT-5.4, which has leaned harder into workload segmentation, routing, and product-surface packaging. Anthropic, by comparison, is talking more about long-horizon behavior, agent ergonomics, and migration semantics. Both strategies make sense. But for engineering teams, the comparison is now less “which model tops a benchmark” and more “which vendor is making the failure modes easiest to manage.”

That is also why Simon Willison’s same-day post about a local Qwen 3.6 model drawing a better SVG pelican than Opus 4.7 is worth mentioning, even if the benchmark is intentionally ridiculous. It is a useful antidote to frontier-model triumphalism. The best general model will not win every weird task, and a local open-weight model can still be the better tool for narrow or idiosyncratic workloads. Opus 4.7 does not change that. What it changes is the ceiling for the class of work where enterprises want one model to reason, call tools, inspect artifacts, and keep going without collapsing into nonsense.

If you run coding agents, code review bots, research automations, or document-heavy internal assistants, the practical move is straightforward. Test Opus 4.7 first on the jobs where failure is expensive and supervision is annoying: multi-step refactors, debugging sessions that cross tool boundaries, screenshot-heavy UI work, legal or finance workflows with lots of source material, and long-running tasks where the model needs to notice missing information before it confidently invents an answer. Then compare not just pass rate, but variance, tool-call quality, token burn, and how often your harness has to rescue the model from its own habits.

That is the real LGTM on this release. Anthropic is not merely claiming Opus 4.7 is better. It is claiming the model is more operationally trustworthy. In this market, that is the upgrade path that actually moves budgets.

Sources: Anthropic, Claude Platform docs, Claude Platform release notes, Simon Willison

The expensive part of agentic coding is not tokens. It is flakiness.

Same sticker price does not mean same bill

Anthropic is making a control-plane argument, not just a model argument

Sign up for more like this.