ai-models

Claude Mythos Shows the New Bottleneck in AI Security Is Patching, Not Finding Bugs

Anatoliy Kolodkin

23 May 2026 • 6 min read

The most important number in Anthropic’s Project Glasswing update is not 10,000. It is 75.

Yes, Anthropic says Claude Mythos Preview and roughly 50 partners have found more than 10,000 high- or critical-severity vulnerabilities in systemically important software in about a month. That is the headline. But the operational reality is uglier: Anthropic estimates it has disclosed 530 high- or critical-severity bugs to maintainers so far, and only 75 have been patched, with 65 public advisories. The model story is impressive. The software-maintenance story is the one that should make engineering leaders uncomfortable.

For years, vulnerability discovery was the scarce resource. Good security researchers were rare, manual code review was expensive, fuzzing coverage was uneven, and attackers could concentrate effort on a narrow target while defenders had to cover everything. Project Glasswing suggests that bottleneck is moving. If frontier models can find and validate serious bugs at AI-scale, the new scarcity is not discovery. It is verification, disclosure, patch design, release discipline, user upgrade velocity, and maintainer attention. In other words: the boring machinery that turns a scary finding into safer software.

Discovery is getting cheaper; remediation is still priced in humans

Anthropic says Project Glasswing’s partners have collectively found more than ten thousand high- or critical-severity vulnerabilities with Claude Mythos Preview. Several partners reported bug-finding rates increasing by more than 10x. Cloudflare says it found 2,000 bugs across critical-path systems, including 400 high- or critical-severity findings, with a false-positive rate its team considered better than human testers. Mozilla says an early Mythos Preview build helped it find and fix 271 vulnerabilities in Firefox 150, compared with 22 security-sensitive bugs it previously fixed after scanning Firefox 148 with Claude Opus 4.6.

Those are not normal scanner numbers. A noisy tool can always bury a maintainer under speculative “possibly exploitable” reports. The more interesting claim is that Mythos is better at closing the loop: reasoning through exploitability, generating proofs, compiling and running test cases, and iterating when the first hypothesis fails. Cloudflare’s write-up is blunt about why that matters. A suspected flaw without a proof is a triage tax. A finding with reproduction steps and a working proof is something an engineer can actually act on.

The open-source scan shows the same pattern at larger scale. Anthropic says it scanned more than 1,000 open-source projects and Mythos estimated 23,019 total vulnerabilities, including 6,202 high- or critical-severity findings. Of 1,752 high- or critical-rated findings assessed by independent security firms or Anthropic, 90.6% were valid true positives and 62.4% were confirmed high or critical. Based on those triage rates, Anthropic estimates nearly 3,900 confirmed high- or critical-severity vulnerabilities are already in scope even if Mythos finds nothing else.

Discount the vendor framing if you want. You should. But do not discount it to zero. A 90.6% true-positive rate on assessed high/critical findings is not “an LLM hallucinated CVEs into Jira.” It is a signal that the economics of vulnerability research are changing faster than the institutions around vulnerability response.

The patch queue is now the critical system

The scary part is not that AI can find bugs. The scary part is that defenders may soon know about more real bugs than they can safely fix or disclose.

Anthropic says several maintainers have asked it to slow down disclosures because they need more time to design patches. It says a high- or critical-severity bug found by Mythos Preview takes about two weeks on average to patch. That is not maintainer failure. It is physics. Serious security patches require reproduction, root-cause analysis, regression tests, compatibility checks, release coordination, advisory language, downstream notification, and user adoption. Open source adds volunteer bandwidth and project governance to the queue. Enterprise software adds change-management boards, maintenance windows, QA gates, and the eternal comedy of systems nobody admits they own.

This is where the industry needs to stop congratulating itself on better bug discovery and start measuring the full pipeline. Mean time to discover is becoming less interesting. Mean time to reproduce, mean time to assign ownership, mean time to patch, mean time to release, and mean time for users to upgrade are the metrics that matter. If discovery falls from weeks to hours while deployment remains stuck in quarterly-change-window land, the vulnerability window does not disappear. It gets stranger.

There is also a coordination problem. Coordinated vulnerability disclosure exists because publishing details too early can arm attackers before users patch. But if AI systems can independently rediscover the same classes of bugs at scale, defenders are racing against a diffusion curve, not just a disclosure calendar. Anthropic’s restraint in not broadly releasing Mythos-class models is therefore not a footnote. It is the product story. Cyber-capable frontier models are dual-use in the least abstract sense: the same exploit chain that helps Cloudflare harden infrastructure can help an attacker weaponize it.

Benchmarks are starting to look like warning lights

The UK AI Security Institute’s evaluation adds context for why this is not just a vendor case study. AISI says a newer Claude Mythos Preview checkpoint was the first model to solve both of its cyber ranges end to end, completing “The Last Ones” in 6 of 10 attempts and “Cooling Tower” in 3 of 10 attempts under its setup. The institute also says frontier models’ cyber task horizons had previously appeared to double every 4.7 months since late 2024, and that Mythos Preview and GPT-5.5 substantially exceeded that trend.

Benchmarks are not production networks. AISI is careful about that, and practitioners should be too. Its ranges are controlled environments, and real systems have active defenses, identity boundaries, telemetry, weird state, partial information, and humans interrupting the plan. But benchmark saturation is still useful evidence. When the test rig starts looking underpowered, the right response is not “benchmarks are fake.” It is “we may be measuring a moving object with last quarter’s ruler.”

Mozilla’s post is useful because it strips away some sci-fi. The Firefox team says it has not seen Mythos find categories of bugs that elite humans could not find. The shift is not alien vulnerability magic. It is scale, speed, and persistence applied to the kind of source-code reasoning that previously required scarce experts. That is more actionable than the fantasy version. If models are compressing expert review cycles, defenders can benefit — but only if they can absorb the output.

What builders should actually change

The wrong response is to point a generic coding agent at a monorepo and call it a security program. Vulnerability research is not feature implementation. It wants many parallel hypotheses against narrow slices of code: parsers, auth boundaries, deserialization paths, memory-unsafe components, privilege transitions, sandbox escapes, dependency interfaces, and user-controlled inputs. A single agent holding one broad context window is the wrong shape for that work.

Teams should copy the workflow, not wait for Mythos access. Start by inventorying critical code paths and exposed systems. Build disposable test environments where models can inspect one risk slice at a time, run tests, generate proofs, and produce reports with affected versions, reproduction steps, severity rationale, proof status, and suggested fixes. Separate discovery from triage. Require evidence a maintainer can run. Deduplicate aggressively. Log the model, prompts, tools, commit range, and artifacts behind each finding so the security machine itself can be debugged.

Then fix the part almost nobody wants to fund: patch velocity. Make security releases boring. Keep ownership maps current. Reduce abandoned services. Automate dependency updates where possible. Maintain SBOMs that someone actually uses. Practice emergency patch drills. Harden defaults. Require MFA. Improve detection logs. Make auto-update safer. If you maintain open source, publish an AI-assisted vulnerability-report policy now: what evidence is required, where reports go, what proof-of-concept formats are acceptable, and how duplicate or speculative reports will be handled. Otherwise the AI bug-report flood will turn maintainer inboxes into a denial-of-service attack wearing a responsible-disclosure badge.

For enterprises, the uncomfortable metric is not whether an AI scanner can find a bug in your codebase. It is how long a confirmed critical patch takes to reach production after the fix exists. If that number is measured in months, a better model is not your biggest problem. It is a flashlight pointed at the problem you already had.

Claude Mythos Preview is a model-capability story, but the lasting lesson is operational. The teams that win this phase will not be the ones with the flashiest vulnerability-finding demo. They will be the ones with the dull, disciplined machinery to turn AI-scale findings into safe patches before attackers get the same leverage. Discovery is becoming abundant. Remediation is the product now.

Sources: Anthropic, UK AI Security Institute, Mozilla, Cloudflare, The Decoder

Discovery is getting cheaper; remediation is still priced in humans

The patch queue is now the critical system

Benchmarks are starting to look like warning lights

What builders should actually change

Sign up for more like this.