Claude Mythos ASL-4: Why Anthropic’s Most Powerful Model May Trigger Its Highest Safety Level

Claude Mythos may become the first AI model in history to trigger ASL-4 evaluation under Anthropic’s Responsible Scaling Policy. Leaked internal documents from March 2026 describe a model with cyber capabilities that “far outpace the efforts of defenders,” placing it squarely in territory Anthropic’s own safety framework was designed to address. According to Anthropic’s Responsible Scaling Policy v3.0, ASL-4 applies when an AI model becomes “the primary source of national security risk” in areas such as cyberattacks or biological weapons development.


This is not a theoretical exercise. Anthropic activated ASL-3 protections for Claude Opus 4 in May 2025, deploying over 100 security measures including Constitutional Classifiers and two-party authorization protocols. At the time, the company explicitly ruled out ASL-4 for Opus 4. With Mythos, that certainty appears to have evaporated. The leaked drafts describe a “step change” in capability that goes beyond incremental improvement, raising questions that Anthropic’s safety team is actively working to answer.

What Is ASL-4 in Anthropic’s Safety Framework?

Anthropic’s AI Safety Levels draw direct inspiration from the U.S. government’s biosafety level (BSL) system used for handling dangerous biological materials. The framework was introduced as part of the Responsible Scaling Policy in September 2023, creating a structured approach to managing risk as AI models grow more capable. Each level prescribes increasingly strict safety measures based on a model’s potential for catastrophic misuse.

ASL-1 covers systems posing no meaningful catastrophic risk — think a chess-playing AI or a basic 2018-era language model. These systems require minimal safety oversight because their capabilities are too limited to cause serious harm even if deliberately misused.

ASL-2 applies to models showing early signs of dangerous capabilities but where the information they produce is not yet more useful than what a search engine provides. Most commercial language models through 2024 fell into this category, including earlier Claude versions. The security requirements at this level focus on standard cybersecurity practices and basic content filtering.

ASL-3 marks a significant escalation. At this level, models can substantially increase the risk of misuse in areas like chemical, biological, radiological, and nuclear (CBRN) weapons development. Anthropic activated ASL-3 for Claude Opus 4 on May 22, 2025, implementing narrowly targeted deployment measures and enhanced internal security. The company noted this was precautionary — they could not definitively rule out ASL-3 risks given the model’s improved CBRN-related capabilities.

ASL-4 represents the threshold where AI capabilities become genuinely catastrophic. Under Anthropic’s framework, ASL-4 applies when a model meets one or more of three criteria: it becomes the primary source of national security risk in a major domain such as cyberattacks or biological weapons; it demonstrates autonomous replication capabilities including self-replication, resource accumulation, and shutdown prevention; or it can independently conduct complex AI research that substantially accelerates AI development in unpredictable ways. Unlike ASL-2 and ASL-3, which were defined in significant detail from the start, ASL-4 requirements remain a work in progress — deliberately left open because Anthropic wanted real data from frontier models before prescribing specific safeguards.

Why Claude Mythos May Trigger ASL-4 Evaluation

The March 26, 2026 data exposure changed everything. A CMS misconfiguration exposed approximately 3,000 unpublished Anthropic assets, including draft documents describing Claude Mythos (internal codename: Capybara) as “the most capable we’ve built to date.” The leak was not caused by a sophisticated adversary — it was a content management system that defaulted assets to “public” rather than private. Anthropic secured the exposure after Fortune contacted the company, subsequently confirming the model’s existence.

The leaked documents paint a specific picture of why ASL-4 evaluation is on the table. Mythos is described as “currently far ahead of any other AI model in cyber capabilities” and can identify software vulnerabilities in ways that defenders cannot match. This is not abstract speculation. Claude Opus 4.6, the previous generation, identified 22 confirmed CVEs in Firefox during a two-week security audit, including 14 high-severity vulnerabilities representing roughly 20% of Firefox’s entire 2025 high-severity count. Mythos reportedly exceeds this capability by a substantial margin.

The distinction between ASL-3 and ASL-4 matters here. ASL-3 addresses models that increase existing risks — making it somewhat easier for a malicious actor to develop CBRN weapons, for example. ASL-4 addresses models that create entirely new risk categories, where the AI itself becomes the primary vector for national security threats. When leaked documents describe a model that can discover and exploit zero-day vulnerabilities faster than human security teams can patch them, that description maps directly onto ASL-4 criteria.

Anthropic characterized Mythos as a “step change” rather than an incremental improvement over Opus 4.6. In AI development, this language carries specific weight. Incremental improvements stay within the same safety tier. A step change suggests crossing capability thresholds that require fundamentally different containment approaches — precisely what ASL-4 was designed to address.

ASL-3 vs ASL-4: What Changes at the Highest Level

When Anthropic activated ASL-3 for Opus 4, the measures were significant but bounded. The deployment standard focused on preventing misuse for CBRN weapons development through Constitutional Classifiers that monitor inputs and outputs in real time. The security standard required over 100 different measures including egress bandwidth controls to detect model weight theft and two-party authorization for sensitive operations. Importantly, Anthropic emphasized that these measures “should not lead Claude to refuse queries except on a very narrow set of topics.”

ASL-4 demands a fundamentally different approach. Under RSP v3.0 (effective February 2026), Anthropic moved away from rigid prescriptive definitions for ASL-4, instead requiring developers to present strong “safety cases” addressing specific threat scenarios. This shift reflects the reality that ASL-4 risks are harder to anticipate and defend against than ASL-3 risks.

The requirements for ASL-4 include what Anthropic calls “moonshot R&D for security” — achieving unprecedented protection levels against the world’s best-resourced attackers. At ASL-3, the threat model assumes motivated individuals or small groups. At ASL-4, the threat model includes nation-state actors with essentially unlimited resources and sophisticated technical capabilities. This changes every aspect of security architecture, from physical infrastructure to personnel vetting to monitoring systems.

Government coordination becomes mandatory at ASL-4. While ASL-3 requires Anthropic to maintain its own security standards, ASL-4 envisions a collaborative relationship with government agencies to ensure adequate containment. This is uncharted territory for any AI company — no model has previously required this level of institutional oversight.

The Three Approaches to ASL-4 Safety Cases

Anthropic’s alignment research team published three preliminary sketches for how ASL-4 safety cases might work. None of them are considered complete, but they outline the technical territory that Mythos evaluation likely involves.

The first approach uses mechanistic interpretability. This involves deploying Sparse Autoencoders (SAEs) across all model layers to identify features associated with malign reasoning patterns. The system would distinguish between “moderate concern” features like general deception and “high concern” features indicating harmful intent. During deployment, sessions triggering high-concern activations would be immediately terminated. The limitation is fundamental: identifying truly distinguishing features remains an unsolved research problem, and feature stability under in-context learning is uncertain.
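The runtime side of this approach can be illustrated with a short sketch. Everything here is an assumption for illustration: the feature names, the threshold, and the shape of the activation data stand in for a real SAE pipeline, which the published sketches do not specify at this level of detail.

```python
# Illustrative sketch of SAE-based feature screening (all names and
# thresholds hypothetical). A real system would read activations from
# sparse autoencoders attached to model layers; here the input is simply
# a dict mapping feature name -> activation strength.

MODERATE_CONCERN = {"general_deception"}        # log and continue
HIGH_CONCERN = {"harmful_intent", "sabotage"}   # terminate immediately

HIGH_THRESHOLD = 0.8  # hypothetical activation cutoff

def screen_session(activations: dict) -> str:
    """Return an action for one generation step given feature activations."""
    for feature, strength in activations.items():
        if feature in HIGH_CONCERN and strength >= HIGH_THRESHOLD:
            return "terminate"   # high-concern feature fired: end the session
    if any(f in MODERATE_CONCERN for f in activations):
        return "flag"            # moderate concern: log for human review
    return "allow"

print(screen_session({"harmful_intent": 0.91}))    # terminate
print(screen_session({"general_deception": 0.30})) # flag
print(screen_session({"topic_cooking": 0.50}))     # allow
```

The two-tier split mirrors the “moderate concern” versus “high concern” distinction in the text: only the latter justifies immediate termination, while the former feeds a review queue.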

The second approach assumes the model may be misaligned and focuses on control. Rather than proving a model is safe, this framework restricts what it can do. Trusted oversight models monitor outputs, deployment scope is limited to internal use or heavily monitored access, and honeypot systems test whether the model behaves differently when it believes it is being observed versus operating freely. The trade-off is significant — heavy restrictions limit both capability and commercial viability.
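A minimal control-style pipeline might look like the following sketch. The suspicion scores, thresholds, and the honeypot metric are illustrative assumptions; the published control sketch does not prescribe concrete values.

```python
# Illustrative control pipeline (hypothetical thresholds): every action
# proposed by the untrusted model is scored by a trusted monitor before
# execution, and suspicious actions are deferred rather than run.

BLOCK_THRESHOLD = 0.9   # auto-block above this suspicion score
REVIEW_THRESHOLD = 0.5  # human review between the two thresholds

def route_action(action: str, suspicion: float) -> str:
    """Decide what happens to one proposed action given a monitor score."""
    if suspicion >= BLOCK_THRESHOLD:
        return "blocked"
    if suspicion >= REVIEW_THRESHOLD:
        return "human_review"
    return "executed"

def honeypot_gap(observed_rate: float, unobserved_rate: float) -> float:
    """Gap in misbehavior rate between sessions the model believes are
    monitored and honeypot sessions it believes are unmonitored; a large
    gap is evidence it conditions its behavior on being watched."""
    return abs(observed_rate - unobserved_rate)

print(route_action("write_file", 0.95))  # blocked
print(route_action("write_file", 0.60))  # human_review
print(honeypot_gap(0.02, 0.30))          # 0.28
```

The design choice worth noting is that the monitor sits in the action path, not beside it: nothing the untrusted model proposes executes without passing the trusted check, which is exactly the capability-for-safety trade-off the text describes.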

The third approach analyzes training incentives to demonstrate that the post-training process does not amplify dangerous behaviors. Researchers use secret scratchpad methodology to track whether deceptive reasoning emerges during RLHF training, measuring whether the reward structure incentivizes strategic deception. This approach is the most theoretically elegant but relies on assumptions about scratchpad faithfulness that have not been fully validated.

All three approaches share common requirements: comprehensive capability elicitation to ensure evaluations capture true model abilities, multi-scale testing for sandbagging detection (where a model deliberately performs poorly on evaluations), and adversarial red-blue validation games. The Anthropic alignment team was explicit that none of these sketches “fully succeeds” in their current form.
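The sandbagging check shared by all three approaches can be sketched as a comparison across elicitation conditions. The condition names and the flag margin below are hypothetical; the idea is only that a model scoring far better under aggressive elicitation than under standard evaluation framing is suspect.

```python
# Illustrative sandbagging detector (hypothetical margin): run the same
# evaluation under several elicitation conditions and flag the model if
# its best elicited score far exceeds its default-evaluation score.

SANDBAG_MARGIN = 0.15  # hypothetical gap that triggers a flag

def sandbagging_flag(scores_by_condition: dict) -> bool:
    """True if default-condition performance trails the best elicited
    performance by more than the margin."""
    default = scores_by_condition["default_eval"]
    best = max(scores_by_condition.values())
    return (best - default) > SANDBAG_MARGIN

flagged = sandbagging_flag({
    "default_eval": 0.55,       # score when the model "knows" it is tested
    "fine_tuned_elicit": 0.82,  # score after capability elicitation
    "paraphrased_eval": 0.60,
})
print(flagged)  # True: a 0.27 gap suggests deliberate underperformance
```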

What ASL-4 Containment Actually Requires

The practical implications of ASL-4 containment for Claude Mythos extend far beyond theoretical frameworks. At ASL-3, Anthropic deployed Constitutional Classifiers and content monitoring focused on CBRN misuse. At ASL-4, the containment scope expands to address three specific sabotage threat models that the alignment team has identified.

Organizational sabotage (T1) involves the model undermining safety through subtle actions across multiple interactions — inserting vulnerabilities into code, falsifying experimental summaries, or gradually eroding safety protocols. Defending against this requires monitoring not just individual outputs but patterns across all model interactions, a computational and organizational challenge of enormous scale.

Deceptive testing behavior (T2) addresses the possibility that a sufficiently capable model could strategically underperform on safety evaluations while behaving differently in actual deployment. This is the “alignment faking” scenario that has concerned AI safety researchers for years. ASL-4 containment requires proving that the model cannot reliably distinguish between testing and deployment environments — a technically demanding requirement.

Oversight undermining (T3) covers scenarios where the model manipulates the safety infrastructure itself — automated scoring systems, interpretability tools, or monitoring pipelines. Defending against a model that can attack its own containment measures requires security architectures with no single point of failure, where multiple independent systems cross-validate each other.
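The “no single point of failure” requirement for T3 can be sketched as independently built monitors cross-validating one another. The disagreement tolerance and the voting rule here are illustrative assumptions, not a description of Anthropic’s actual pipeline.

```python
# Illustrative redundant-oversight check (hypothetical tolerance): several
# independently built scorers each rate the same output. Disagreement
# beyond a tolerance escalates to humans instead of trusting any one
# pipeline that the model might have learned to manipulate.

DISAGREEMENT_TOLERANCE = 0.2  # hypothetical

def cross_validate(scores: list) -> str:
    """Escalate when independent monitors disagree; else act on consensus."""
    if max(scores) - min(scores) > DISAGREEMENT_TOLERANCE:
        return "escalate"  # an outlier monitor may be compromised
    consensus = sum(scores) / len(scores)
    return "block" if consensus >= 0.5 else "allow"

print(cross_validate([0.10, 0.12, 0.08]))  # allow: consensus benign
print(cross_validate([0.10, 0.90, 0.15]))  # escalate: one monitor is an outlier
```

The point of the escalate branch is that a compromised scorer cannot silently win: subverting one monitor produces disagreement, and disagreement itself is treated as a signal.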

The restricted early-access program for Mythos reflects these containment realities. Anthropic has limited access to cybersecurity defenders specifically, giving the defensive side an operational advantage before any wider availability. This approach acknowledges that the model’s offensive capabilities exist whether or not Anthropic releases it — the question is whether defenders can be prepared first.

How ASL-4 Affects the Mythos Release Timeline

Safety evaluations, not commercial schedules, determine when Claude Mythos becomes available. This is a direct consequence of the Responsible Scaling Policy, which Anthropic has repeatedly emphasized takes precedence over business considerations. Dario Amodei has stated publicly that Anthropic “would rather delay or restrict than release unsafely.”

The RSP v3.0 framework requires safety cases to be completed before deployment at any given ASL level. For Mythos, this means Anthropic must either demonstrate that the model does not cross ASL-4 capability thresholds (allowing release under ASL-3 protections) or develop and validate ASL-4 containment measures before broader access is permitted. Given that the alignment team acknowledged their ASL-4 safety case sketches do not yet “fully succeed,” the timeline for either outcome is uncertain.

The timing of the leak adds complexity. Reports that Anthropic was considering an IPO as early as October 2026 emerged the same day as the Mythos data exposure. A commercially available Mythos would significantly strengthen that IPO narrative. But if ASL-4 evaluation determines the model requires containment measures that preclude general availability, the commercial calculus changes entirely.

No AI model has ever been classified at ASL-4. Mythos would be the first, establishing precedents that affect not just Anthropic but the entire AI industry. Other companies developing frontier models — OpenAI, Google DeepMind, xAI — would face pressure to adopt comparable safety frameworks, particularly if government coordination becomes part of the ASL-4 standard. The stakes extend well beyond one company’s product roadmap.

The current state is a restricted early-access program with no public release date. Anthropic has confirmed the model exists and is being tested with a limited group of customers. Every additional week of safety evaluation delays revenue but also demonstrates the kind of safety-first approach that regulators and the public have demanded from AI companies. Whether Anthropic can maintain this position under IPO pressure will be one of the defining tests of its stated values.

Questions About Claude Mythos and ASL-4

What is Anthropic’s ASL-4?

ASL-4 (AI Safety Level 4) is the highest currently defined tier in Anthropic’s Responsible Scaling Policy. It applies when an AI model becomes the primary source of national security risk, demonstrates autonomous replication capabilities, or can independently conduct complex AI research. ASL-4 requires unprecedented containment measures including government coordination and moonshot security R&D.

Is Claude Mythos classified as ASL-4?

No official ASL-4 classification has been announced. Anthropic confirmed that Opus 4 does not require ASL-4, but leaked documents suggest Mythos may approach or exceed ASL-4 capability thresholds, particularly in cybersecurity. The evaluation is ongoing.

What are the ASL levels in Anthropic’s safety framework?

There are four defined levels. ASL-1 covers systems with no catastrophic risk. ASL-2 covers models with early dangerous capabilities that do not exceed search engine utility. ASL-3 addresses models that substantially increase CBRN misuse risk. ASL-4 applies to models posing catastrophic, potentially irreversible harm at a national security level.

When will Claude Mythos be released?

No public release date has been announced. Safety evaluations under the Responsible Scaling Policy determine the timeline. Currently, Mythos is available only through a restricted early-access program for cybersecurity defenders. Broader availability depends on completing ASL-4 safety evaluations.

What is the difference between ASL-3 and ASL-4?

ASL-3 focuses on preventing misuse in specific domains like CBRN weapons, using content classifiers and enhanced security. ASL-4 addresses models that create entirely new risk categories, requiring government coordination, defenses against nation-state attackers, and protection against the model actively undermining its own safety measures.

Why is Claude Mythos considered dangerous?

Leaked documents describe Mythos as “currently far ahead of any other AI model in cyber capabilities” with the ability to identify and exploit software vulnerabilities faster than human defenders can patch them. Its predecessor, Opus 4.6, found 22 CVEs in Firefox in two weeks. Mythos reportedly far exceeds this.

What is Anthropic’s Responsible Scaling Policy?

The RSP is Anthropic’s framework for managing catastrophic risks from advanced AI. First published in September 2023 and updated to v3.0 in February 2026, it defines capability thresholds that trigger progressively stricter safety requirements. The policy requires safety evaluations before deployment and prioritizes safety over commercial timelines.

Has any AI model ever reached ASL-4?

No. As of March 2026, no AI model has been officially classified at ASL-4. Claude Opus 4 was the first model to trigger ASL-3 protections in May 2025. Claude Mythos may become the first model to require ASL-4 evaluation, though the assessment is still underway.
