Claude Mythos Capabilities: What Anthropic’s Leaked Model Can Actually Do
Claude Mythos is the first model in Anthropic’s new Capybara tier, and its capabilities are why the AI world has been paying attention since the March 2026 leak. Anthropic’s own internal materials describe it as a general-purpose model with meaningful advances in reasoning, coding, and cybersecurity — but the details go much further than that headline suggests. Six distinct capability areas separate Mythos from every other AI model available today, including Anthropic’s own Claude Opus 4.6.

Claude Mythos Capabilities Overview
The leaked draft blog post and Anthropic’s subsequent confirmation outline a model that does not just improve on its predecessor in one or two areas. Mythos represents what Anthropic calls a “step change” — a term the company chose deliberately to distinguish this advance from the incremental updates that characterize most model releases.
The Six Core Capabilities
Based on the leaked materials and independent analysis, Claude Mythos demonstrates breakthrough performance in six areas:
- Software coding — dramatically higher scores than Opus 4.6 on programming benchmarks
- Academic reasoning — significantly improved multi-step logical analysis
- Cybersecurity — “far ahead of any other AI model in cyber capabilities”
- Complex multi-step reasoning — enhanced ability to connect ideas across domains
- Agent workflows — improved consistency in autonomous task execution
- Vulnerability discovery — proactive identification of security weaknesses in production code
These are not marketing claims from a press release. They come from internal draft materials that Anthropic never intended to publish — which arguably makes them more credible than a typical product announcement. The company was writing for itself, not for an audience.
How Benchmarks Compare to Opus 4.6
Anthropic has not released exact benchmark numbers for Mythos. The leaked draft uses the phrase “dramatically higher scores on tests” when comparing to Opus 4.6 across coding, reasoning, and cybersecurity evaluations. The word “dramatically” is significant — Anthropic’s previous model updates typically used “improved” or “enhanced.” The deliberate choice of a stronger qualifier in an internal document suggests the performance gap is substantial, not marginal.
For context, the jump from Opus 4 to Opus 4.6 brought meaningful but expected improvements. The jump from Opus 4.6 to Mythos (Capybara) is described as a different kind of advance entirely — not just better performance on existing benchmarks, but new capabilities that Opus cannot replicate at all.
Coding and Software Development
Software engineering is where Claude models have already built their strongest reputation, and Mythos pushes that lead further.
Code Generation Performance
Mythos achieves dramatically higher scores than Opus 4.6 on software coding tests. While the exact benchmarks remain unpublished, the performance claims are consistent with what developers have seen in Claude’s trajectory. Claude Sonnet 4.6 already ranks among the top AI coding assistants. Claude Opus 4.6 handles more complex architectural reasoning. Mythos goes beyond both, tackling problems that require understanding entire codebases rather than individual files or functions.
The practical difference is not just writing better code snippets. It is understanding how a change in one module affects behavior across a distributed system, identifying subtle architectural debt, and generating solutions that account for edge cases that simpler models miss entirely.
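What "a change in one module affects behavior across a distributed system" means can be made concrete with a toy dependency graph: a change propagates to every module that transitively depends on the changed one. This is a hypothetical sketch for illustration only; the module names are invented and nothing here reflects Anthropic's tooling.

```python
from collections import deque

# Toy reverse-dependency graph: module -> modules that import it.
# All module names are hypothetical, for illustration only.
reverse_deps = {
    "auth": ["api", "billing"],
    "api": ["gateway"],
    "billing": ["gateway", "reports"],
    "gateway": [],
    "reports": [],
}

def impacted_modules(changed: str) -> set[str]:
    """Breadth-first walk: every module that transitively depends on `changed`."""
    seen, queue = set(), deque([changed])
    while queue:
        mod = queue.popleft()
        for dependent in reverse_deps.get(mod, []):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen

print(sorted(impacted_modules("auth")))  # ['api', 'billing', 'gateway', 'reports']
```

Even in this five-module toy, one edit to `auth` touches four other modules; in a codebase with thousands of files, reasoning about that blast radius is exactly the whole-project coherence the section describes.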
Large Codebase Refactoring
One of the most frequently cited Capybara use cases is large-scale codebase refactoring — reorganizing and improving codebases with thousands of interconnected files. This is a task where Opus 4.6 performs well in isolation but struggles with consistency across an entire project. Mythos is designed to maintain coherent understanding across massive codebases, making refactoring decisions that account for dependencies, side effects, and architectural patterns that span the full project.
This capability has direct commercial value. Enterprise codebases accumulate technical debt over years, and refactoring them manually requires teams of senior engineers working for months. An AI model that can reliably analyze and restructure such codebases at scale could compress that timeline dramatically.
Why Developers Should Care
For developers using Claude Code or the Claude API, Mythos represents a qualitative shift. Current models are excellent assistants — they help you write code faster and catch bugs you might miss. Mythos moves toward being an autonomous development partner that understands not just the code you are writing but the broader system it lives in, the patterns it should follow, and the problems it needs to avoid.
The trade-off is cost. Capybara-tier pricing is expected to be 2-5x higher than Opus, making it impractical for routine coding tasks. The sweet spot is high-value development work: critical infrastructure, security-sensitive applications, and large-scale migrations where the cost of errors exceeds the cost of using a premium model.
Academic Reasoning and Knowledge Synthesis
The second major capability area is reasoning — specifically, the kind of multi-step, cross-domain thinking that separates strong AI models from exceptional ones.
Multi-Step Reasoning Performance
Mythos shows “significantly improved” performance on academic reasoning tests compared to Opus 4.6. This encompasses mathematical proofs, formal logic, scientific hypothesis evaluation, and any task that requires chaining multiple logical steps while maintaining consistency throughout the chain.
Reasoning failures in AI models typically compound — a small error in step two leads to a completely wrong answer by step five. Mythos appears to have reduced this compounding effect, maintaining accuracy across longer reasoning chains than any previous Claude model.
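The compounding effect is easy to quantify under a simple independence assumption: if each reasoning step is correct with probability p, an n-step chain is fully correct with probability p to the n. The per-step accuracies below are illustrative assumptions, not measured figures for any Claude model.

```python
# Illustrative only: per-step accuracies are assumed, not measured model figures.
def chain_accuracy(per_step: float, steps: int) -> float:
    """Probability an n-step chain is fully correct, assuming independent steps."""
    return per_step ** steps

for p in (0.95, 0.99):
    for n in (5, 20):
        print(f"per-step {p:.2f}, {n:2d} steps -> chain accuracy {chain_accuracy(p, n):.2f}")
# per-step 0.95,  5 steps -> chain accuracy 0.77
# per-step 0.95, 20 steps -> chain accuracy 0.36
# per-step 0.99,  5 steps -> chain accuracy 0.95
# per-step 0.99, 20 steps -> chain accuracy 0.82
```

A seemingly small per-step gain (0.95 to 0.99) more than doubles reliability on a 20-step chain, which is why reduced compounding matters far more for long reasoning chains and agents than for single-shot answers.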
Cross-Domain Knowledge Integration
Perhaps the most intriguing capability described in the leaked materials is Mythos’s ability to create “deep connective tissue between ideas and knowledge.” This means the model does not just know things — it understands how concepts in one domain relate to concepts in another.
A medical researcher asking about drug interactions might receive an answer that connects pharmacology with genomics, epidemiological data, and relevant clinical trial methodology — not as a list of separate facts, but as an integrated analysis that reflects how those fields actually intersect. This cross-domain synthesis is qualitatively different from retrieving information and placing it side by side.
Research and Scientific Analysis
For academics and researchers, Mythos positions itself as the most capable AI research assistant available. The combination of enhanced reasoning, cross-domain knowledge, and large context windows enables analysis of complex papers, identification of methodological weaknesses, and generation of novel hypotheses that draw on multiple fields simultaneously.
Anthropic has not specified Mythos’s context window size, but Capybara-tier models are expected to handle the extended context windows already available in Opus (200K tokens) with improved retention and comprehension across the full window.
Cybersecurity Capabilities
This is the capability that generated the most attention — and the most concern.
Vulnerability Discovery at Scale
Anthropic’s leaked draft stated that Mythos is “currently far ahead of any other AI model in cyber capabilities.” The model demonstrated an ability to surface previously unknown vulnerabilities in production codebases. This is not theoretical — the draft materials describe actual testing where Mythos identified security weaknesses that other tools and models could not find.
The practical implication is significant. Traditional vulnerability scanning tools operate on known patterns and signatures. Human penetration testers bring creativity but work slowly. Mythos combines the speed of automated scanning with the creative reasoning of human testers, finding vulnerabilities that are novel — not just variations of known exploits, but genuinely new attack vectors.
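To make the contrast concrete, here is what "known patterns and signatures" means in practice: a minimal signature-based scanner that matches fixed insecure constructs. The rule set is a tiny invented sample; real scanners ship thousands of curated rules, and the point of the comparison is that a model reasoning about code semantics is not limited to string matching at all.

```python
import re

# A tiny, illustrative signature set; real tools ship thousands of curated rules.
SIGNATURES = {
    "hardcoded-secret": re.compile(r"(?i)(password|api_key)\s*=\s*['\"][^'\"]+['\"]"),
    "unsafe-eval": re.compile(r"\beval\("),
}

def scan(source: str) -> list[tuple[int, str]]:
    """Return (line_number, rule_name) for every signature match."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for rule, pattern in SIGNATURES.items():
            if pattern.search(line):
                findings.append((lineno, rule))
    return findings

sample = 'api_key = "sk-123"\nresult = eval(user_input)\n'
print(scan(sample))  # [(1, 'hardcoded-secret'), (2, 'unsafe-eval')]
```

A scanner like this can only flag what its rule authors anticipated; the genuinely new attack vectors the draft describes are, by definition, outside any such rule set.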
The Dual-Use Problem
Every cybersecurity capability is inherently dual-use. The same model that finds vulnerabilities to help defenders patch them can find vulnerabilities to help attackers exploit them. Anthropic acknowledged this directly in the leaked draft, warning that Mythos “presages an upcoming wave of models that can exploit vulnerabilities in ways that far outpace the efforts of defenders.”
This admission is remarkable because it comes from the model’s own creator. Anthropic is not just acknowledging that Mythos could be misused — the company is stating that the entire class of models it represents will shift the balance of power between cyber attackers and defenders, potentially in favor of attackers.
The dual-use problem is why Anthropic chose to restrict initial access to cybersecurity defense organizations. By giving defenders a head start, the company aims to partially offset the advantage that attackers would gain from similar capabilities.
Real-World Precedent: Claude Code Exploits
The cybersecurity concern is grounded in real events, not just hypotheticals. Anthropic’s own leaked materials referenced a previous incident where a Chinese state-sponsored hacking group used Claude Code’s agentic capabilities to infiltrate approximately 30 organizations worldwide. That attack used Claude’s existing capabilities — the current Sonnet and Opus models — which are significantly less powerful than Mythos in the cybersecurity domain.
If existing Claude models enabled attacks against 30 organizations, a model that is “far ahead” in cyber capabilities raises the question of what it could enable at scale. This is the specific concern driving Anthropic’s restricted release strategy and the sharp stock declines among cybersecurity companies like CrowdStrike (down 6.4%), Palo Alto Networks (down 7%), Zscaler (down 5.8%), and Fortinet (down 4%) following the leak.
Agent Workflows and Autonomous Tasks
The fifth capability area extends beyond individual queries into sustained, autonomous task execution.
Enhanced Consistency in Agent Mode
AI agents — models that execute multi-step tasks without human intervention between steps — require a level of consistency that single-query interactions do not. An agent needs to maintain its understanding of the goal, remember what it has already done, adapt to unexpected results, and stay on track across dozens or hundreds of individual actions.
Mythos demonstrates “enhanced consistency” in agent workflows, meaning it maintains coherent task execution over longer autonomous sequences with fewer errors, hallucinations, or goal drift than Opus 4.6. For teams building AI-powered automation, this consistency improvement directly reduces failure rates and the need for human oversight.
Complex Multi-Step Task Execution
The improvement in agent capability connects directly to Mythos’s enhanced reasoning. A model that reasons better across long chains also executes multi-step tasks more reliably. Tasks like conducting market research (search, read, analyze, synthesize, write report), performing code audits (identify patterns, trace dependencies, evaluate risks, generate recommendations), or managing complex data pipelines all benefit from the combination of better reasoning and better agent consistency.
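The loop structure behind such workflows can be sketched minimally: the agent keeps a goal and a memory of completed steps, asks the model for the next action, executes it, and repeats until done or a step cap is hit. Everything here is a hypothetical stand-in; `call_model`, the action names, and the structure are invented for illustration and do not reflect any Anthropic API.

```python
# Hypothetical agent-loop sketch; `call_model` and the actions are stand-ins,
# not real Anthropic API calls.
from dataclasses import dataclass, field

@dataclass
class AgentState:
    goal: str
    history: list[str] = field(default_factory=list)  # steps already completed

def call_model(state: AgentState) -> str:
    """Stand-in for a model call that picks the next action from goal + history."""
    plan = ["search", "read", "analyze", "synthesize", "write_report", "done"]
    return plan[len(state.history)] if len(state.history) < len(plan) else "done"

def run_agent(goal: str, max_steps: int = 10) -> list[str]:
    state = AgentState(goal=goal)
    for _ in range(max_steps):        # hard cap guards against loops and goal drift
        action = call_model(state)
        if action == "done":
            break
        state.history.append(action)  # memory of completed steps feeds the next call
    return state.history

print(run_agent("market research"))  # ['search', 'read', 'analyze', 'synthesize', 'write_report']
```

The "enhanced consistency" claim maps onto this loop directly: every extra step is another chance to forget the goal, repeat work, or drift, so reliability per step compounds across the whole sequence.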
How Claude Mythos Compares to Competitors
Mythos vs Claude Opus 4.6
The comparison is straightforward: Mythos outperforms Opus 4.6 across the board, with the trade-offs being slower speed and higher cost. Opus 4.6 at $5/$25 per million tokens remains the practical choice for most high-complexity tasks. Mythos at an estimated $10-25/$50-125 per million tokens is for the subset of problems where Opus's performance ceiling is the bottleneck.
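The pricing gap is easier to feel with concrete arithmetic. The Opus 4.6 rates come from the figures above; the Capybara numbers are the article's estimates, so the result is a range, not a quoted price. The example workload size is an assumption for illustration.

```python
# Prices per million tokens: Opus 4.6 figures from the text; Capybara figures
# are the article's estimated range, not published pricing.
def job_cost(in_tokens_m: float, out_tokens_m: float,
             in_price: float, out_price: float) -> float:
    return in_tokens_m * in_price + out_tokens_m * out_price

# Hypothetical workload: 10M input tokens, 2M output tokens (e.g. a large code audit).
opus = job_cost(10, 2, 5, 25)        # $5 in / $25 out per million tokens
capy_low = job_cost(10, 2, 10, 50)   # low end of the estimated range
capy_high = job_cost(10, 2, 25, 125) # high end of the estimated range
print(f"Opus 4.6: ${opus:.0f}, Capybara estimate: ${capy_low:.0f}-${capy_high:.0f}")
# Opus 4.6: $100, Capybara estimate: $200-$500
```

The same audit that costs $100 on Opus lands somewhere between $200 and $500 on Capybara under these estimates, which is exactly why the sweet spot is work where the cost of errors dwarfs the cost of tokens.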
The relationship is complementary, not competitive. Opus continues to serve as Anthropic’s premium general-purpose tier. Capybara exists for a narrower set of high-value applications.
Mythos vs GPT-5
OpenAI’s GPT-5 launch was widely described as underwhelming — Futurism called it a “major letdown” compared to pre-release expectations. Anthropic’s internal assessment, which characterizes Mythos as “far ahead of any other AI model” in cybersecurity, implicitly positions it above GPT-5 in at least that domain. Whether Mythos leads in coding and reasoning as well is harder to assess without published benchmarks from both models.
The competitive dynamic matters for enterprises choosing between platforms. If Mythos delivers on its internal benchmarks, Anthropic has a strong argument for being the performance leader in AI, at least at the top tier.
Mythos vs Gemini
Google’s Gemini models continue to evolve, with Gemini Ultra and specialized variants competing in the same space. Anthropic’s leaked claims do not mention Gemini specifically, but the blanket statement about being “far ahead of any other AI model” would include Google’s offerings. The cybersecurity focus is where Anthropic claims the clearest lead — neither Google nor OpenAI has positioned a model as a cybersecurity specialist the way Anthropic has with Mythos.
Why Anthropic Is Restricting Access
Cybersecurity Risk Assessment
Anthropic determined that Mythos “poses unprecedented cybersecurity risks” — a phrase from the leaked draft that drove much of the media coverage. The risk assessment is not about the model being malicious but about its capability being too powerful to release without safeguards. A model that can find zero-day vulnerabilities faster than any human team could be weaponized by adversaries if access is not carefully controlled.
Deliberately Slow Rollout
Anthropic has adopted what it calls a “deliberately slow” rollout approach for Mythos. This means expanded access will come in stages, each contingent on safety evaluations rather than commercial timelines. The company explicitly stated that release timing is “determined by safety evaluations, not commercial schedule” — even though the model’s commercial potential is enormous, especially ahead of the anticipated October 2026 IPO.
Defensive Use First
The initial restricted access prioritizes cybersecurity defense organizations. Anthropic’s rationale is that giving defenders a head start improves the overall security landscape before the model (or equivalent capabilities from competitors) becomes broadly available. Defense teams can use Mythos to audit their own systems, discover vulnerabilities before attackers do, and harden their infrastructure.
This approach mirrors how the security industry handles zero-day disclosures — responsible disclosure gives defenders time to patch before exploits become public knowledge. Anthropic is applying the same principle to an entire model’s capability set.
Questions About Claude Mythos Capabilities
What are Claude Mythos’s main capabilities?
Claude Mythos excels in six core areas: software coding, academic reasoning, cybersecurity, complex multi-step reasoning, agent workflows, and vulnerability discovery. It achieves “dramatically higher scores” than Claude Opus 4.6 across all these domains.
How good is Claude Mythos at coding?
Mythos scores dramatically higher than Opus 4.6 on software coding benchmarks. It handles large codebase refactoring, architectural decisions, and complex dependency analysis at a level no previous Claude model could match. Exact benchmark scores have not been published.
Can Claude Mythos find security vulnerabilities?
Yes. Mythos demonstrated the ability to surface previously unknown vulnerabilities in production codebases during internal testing. Anthropic described it as “currently far ahead of any other AI model in cyber capabilities.”
Is Claude Mythos better than GPT-5?
Anthropic’s internal materials describe Mythos as “far ahead of any other AI model” in cybersecurity, which would include GPT-5. Independent benchmark comparisons across coding and reasoning are not yet available since Mythos is not publicly accessible.
What benchmarks does Mythos score highest on?
The leaked materials highlight cybersecurity as the area of greatest advantage. Mythos is described as “far ahead” of all competitors in cyber capabilities. Software coding and academic reasoning show “dramatically higher” scores compared to Opus 4.6.
How does Mythos compare to Opus?
Mythos outperforms Opus 4.6 across all measured benchmarks. The improvement is described as “dramatic” — a term Anthropic used in internal documents, not marketing materials. The trade-offs are slower speed and an estimated 2-5x higher cost.
Is Claude Mythos a cybersecurity threat?
Anthropic itself acknowledged that Mythos “poses unprecedented cybersecurity risks” due to its dual-use capability. The model can both discover vulnerabilities for defenders and potentially enable exploitation by attackers. A Chinese state-sponsored group previously used weaker Claude models to infiltrate roughly 30 organizations.
What is Claude Mythos agent mode?
Mythos demonstrates enhanced consistency in autonomous agent workflows — executing multi-step tasks without human intervention. It maintains goal coherence across longer task sequences than Opus 4.6, reducing errors and drift during complex automated operations.
