Claude Mythos Benchmark Scores: What the Leaked Data Actually Shows
No official benchmark numbers exist for Claude Mythos — Anthropic has not published test results and the model remains in restricted early access. What we have are claims from leaked internal documents describing the model as scoring “dramatically higher” than Claude Opus 4.6 on tests of software coding, academic reasoning, and cybersecurity.
To understand what “dramatically higher” actually means, we need to look at where Opus 4.6 currently stands — because it already leads or ties for first place on most major AI benchmarks as of March 2026.

- Anthropic claims Mythos achieves dramatically higher scores than Opus 4.6
- No public benchmark numbers have been released for Mythos
- Opus 4.6 already leads Terminal-Bench 2.0, SWE-bench Verified, and GPQA Diamond
- Cybersecurity is described as the domain where Mythos is “far ahead of any other AI model”
- Third-party benchmarks will come when access expands beyond defensive cybersecurity organizations
What Anthropic Claims About Mythos Performance
The leaked draft blog post — one of approximately 3,000 documents exposed through a CMS misconfiguration — described Claude Mythos as “by far the most powerful AI model we have ever developed.” An Anthropic spokesperson later confirmed the model exists and called it “a step change” in AI capability and “the most capable we have built to date.”
Three performance domains are specifically mentioned in the leaked materials. Software coding is the first, where Mythos reportedly generates, debugs, and refactors complex codebases at a level beyond any previous Claude model. Academic reasoning is the second, covering multi-step logical analysis and problem-solving tasks. Cybersecurity is the third and most discussed, with Anthropic stating that Mythos is “currently far ahead of any other AI model in cyber capabilities.”
The absence of specific numbers is deliberate. Unlike a typical model launch where companies race to publish benchmark tables, Anthropic is withholding detailed performance data because the cybersecurity scores raise dual-use concerns. Publishing exact vulnerability-exploitation metrics would effectively advertise offensive capabilities to bad actors.
Claude Opus 4.6 Benchmarks: The Baseline Mythos Must Beat
To gauge what Mythos might achieve, start with Claude Opus 4.6 — the model it reportedly surpasses by a wide margin. Opus 4.6 is already among the top-performing AI models on every major benchmark as of March 2026.
Coding Benchmarks
Opus 4.6 paired with the ForgeCode agent framework holds the #1 spot on Terminal-Bench 2.0 at 81.8%, tied with GPT-5.4. On SWE-bench Verified, the standard for real-world software engineering tasks, Opus 4.6 scores 80.84% averaged across 25 trials, the second-highest result ever recorded. On MCP Atlas, which tests tool-use integration, it reaches 62.7% at high effort.
Reasoning Benchmarks
The reasoning numbers are equally strong. Opus 4.6 scores 91.31% on GPQA Diamond, a graduate-level science question set. On ARC-AGI-2, it reaches 68.8%, up from Opus 4.5’s 37.6% (an 83% relative improvement). On AIME 2025 (competition mathematics), it scores 99.79%. The GDPval-AA evaluation shows Opus 4.6 outperforming GPT-5.2 by 144 Elo points.
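Elo figures are easier to interpret as win rates. Assuming GDPval-AA follows the standard Elo convention (the reporting does not specify the scale), a 144-point lead corresponds to roughly a 70% expected head-to-head preference rate. A minimal sketch:

```python
def elo_win_probability(rating_gap: float) -> float:
    """Expected head-to-head win rate for the higher-rated model,
    under the standard Elo logistic formula (base 10, 400-point scale)."""
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / 400.0))

# Opus 4.6's reported 144-point GDPval-AA lead over GPT-5.2
print(f"{elo_win_probability(144):.1%}")  # ~69.6%
```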
Long Context and Search
Opus 4.6 handles million-token contexts with a 78.3% score on MRCR v2 (8-needle, 1M tokens) compared to Sonnet 4.5’s 18.5%. It leads all models on BrowseComp for information retrieval and holds the highest industry score on DeepSearchQA for multi-step agentic search. On Humanity’s Last Exam, it leads all frontier models when given tool access.
Projected Mythos Benchmark Numbers
The phrase “dramatically higher” in Anthropic’s leaked materials leaves room for interpretation, but it clearly signals more than incremental improvement. Opus 4.6 already represents a major leap over Opus 4.5 — and Mythos is positioned as a comparable leap beyond Opus 4.6.
Coding Projections
If the improvement parallels the Opus 4.5-to-4.6 jump on ARC-AGI-2 (from 37.6% to 68.8%), Terminal-Bench 2.0 scores could approach or exceed 90%. SWE-bench Verified might push past 85-88%, entering territory where the model solves the vast majority of real-world software engineering issues on the first attempt. These projections assume a proportional capability gain — the actual numbers could be higher or lower depending on how Mythos was optimized.
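One illustrative way to make “a proportional capability gain” concrete is to apply the same relative error reduction Opus 4.6 achieved over Opus 4.5. On ARC-AGI-2 that jump cut the error rate exactly in half (62.4 points down to 31.2). This is back-of-envelope arithmetic on the published Opus 4.6 baselines, not a leaked figure:

```python
def project(baseline: float, error_reduction: float = 0.5) -> float:
    """Project a score by shrinking the remaining error rate.

    The Opus 4.5 -> 4.6 jump on ARC-AGI-2 (37.6% -> 68.8%) halved
    the error, so 0.5 is the default reduction factor.
    """
    return 100.0 - (100.0 - baseline) * error_reduction

print(f"{project(81.8):.1f}")   # Terminal-Bench 2.0 projection: 90.9
print(f"{project(80.84):.1f}")  # SWE-bench Verified projection: 90.4
```

Naive error-halving lands both coding benchmarks near 90%; the more conservative 85-88% SWE-bench range above reflects the caveat that gains rarely transfer uniformly across task types.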
Cybersecurity: The Expected Standout
This is where the biggest performance gap is expected. The leaked documents do not describe cybersecurity as marginally better — they describe it as a category where Mythos “presages an upcoming wave of models that can exploit vulnerabilities in ways that far outpace the efforts of defenders.” No existing public benchmark adequately measures this capability, which is part of why Anthropic has been cautious about releasing specifics. The model’s ability to discover and chain zero-day exploits is reportedly what triggered the restricted release strategy.
Reasoning Projections
With Opus 4.6 already at 91.31% on GPQA Diamond and 99.79% on AIME 2025, ceiling effects limit how much improvement is possible on some reasoning benchmarks. The gains are more likely to show on harder evaluations like ARC-AGI-2, Humanity’s Last Exam, and novel multi-step reasoning tasks that current models struggle with.
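The ceiling argument is plain arithmetic: even a perfect AIME 2025 score would move the needle by less than a quarter of a point, while ARC-AGI-2 still leaves room for a jump on the scale of the 4.5-to-4.6 one.

```python
# Headroom remaining on each reasoning benchmark at the Opus 4.6 baseline
baselines = {"GPQA Diamond": 91.31, "AIME 2025": 99.79, "ARC-AGI-2": 68.8}
for name, score in baselines.items():
    print(f"{name}: {100 - score:.2f} points of headroom")
# GPQA Diamond: 8.69 / AIME 2025: 0.21 / ARC-AGI-2: 31.20
```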
How Mythos Compares to GPT-5 and Gemini 3.1 Pro
Even without Mythos scores, the competitive landscape tells a story. Claude Opus 4.6 already beats or matches the best models from OpenAI and Google on most benchmarks. Mythos would extend that lead significantly.
Terminal-Bench 2.0 Leaderboard (March 2026)
| Rank | Agent + Model | Score |
|---|---|---|
| 1 | ForgeCode + Claude Opus 4.6 | 81.8% |
| 2 | ForgeCode + GPT-5.4 | 81.8% |
| 3 | TongAgents + Gemini 3.1 Pro | 80.2% |
| 5 | SageAgent + GPT-5.3-Codex | 78.4% |
| 7 | Capy + Claude Opus 4.6 | 75.3% |
SWE-bench Verified (March 2026)
| Rank | Model | Score |
|---|---|---|
| 1 | Claude Opus 4.5 | 80.9% |
| 2 | Claude Opus 4.6 | 80.8% |
| 3 | Gemini 3.1 Pro | 80.6% |
| 5 | GPT-5.2 | 80.0% |
| 6 | Claude Sonnet 4.6 | 79.6% |
The gap between top models is currently less than 2 percentage points on most benchmarks. If Mythos delivers the claimed “dramatic” improvement, it would break away from this tight cluster and establish clear separation — potentially the widest gap between #1 and #2 that these benchmarks have seen.
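To quantify how tight that cluster is, the five SWE-bench Verified entries listed above span 1.3 percentage points from top to bottom:

```python
# Spread of the SWE-bench Verified leaders (March 2026 table above)
scores = [80.9, 80.8, 80.6, 80.0, 79.6]
print(f"{max(scores) - min(scores):.1f}")  # 1.3 percentage points
```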
GPT-5 Comparison Context
OpenAI’s GPT-5 launched in August 2025 and was widely considered disappointing relative to expectations. The latest variant, GPT-5.4, ties Opus 4.6 on Terminal-Bench 2.0 but trails on reasoning-heavy evaluations such as GDPval-AA, where Opus 4.6 leads GPT-5.2 by 144 Elo points. OpenAI is developing a next-generation model codenamed “Spud,” but no performance details have surfaced. Mythos may already represent the capability level that GPT-5 was expected to reach.
Why Anthropic Is Withholding Benchmark Data
Most AI companies treat benchmark scores as marketing tools — publishing them on launch day to generate headlines. Anthropic’s approach with Mythos is the opposite, and the reasons are specific.
The cybersecurity performance creates a problem that coding and reasoning scores do not. Publishing exact numbers on vulnerability discovery and exploitation rates would tell attackers precisely how capable the tool is. Anthropic acknowledged this directly, stating it wants to “act with extra caution and understand the risks it poses — even beyond what we learn in our own testing.”
The restricted early access program serves a dual purpose: it lets Anthropic gather real-world safety data while giving defensive cybersecurity teams a head start. Only after this evaluation phase will broader access — and presumably public benchmarks — follow. No timeline has been given for when independent evaluators will be able to run their own tests.
Questions About Claude Mythos Benchmarks
What are Claude Mythos benchmark scores?
No official scores have been published. Leaked Anthropic documents describe the model as achieving “dramatically higher scores” than Claude Opus 4.6 on tests of software coding, academic reasoning, and cybersecurity.
How does Claude Mythos compare to Claude Opus 4.6?
Mythos belongs to the new Capybara tier above Opus and is described as a “step change” in capability. The improvement is reportedly significant across all three tested domains, with cybersecurity showing the largest gap.
Is Claude Mythos the best AI model for coding?
Based on leaked claims, likely yes. Claude Opus 4.6 already ties for #1 on Terminal-Bench 2.0 at 81.8% and scores 80.84% on SWE-bench Verified. Mythos reportedly scores dramatically higher.
When will official Claude Mythos benchmarks be released?
No timeline has been announced. Anthropic is prioritizing safety evaluations and restricted testing over public benchmarks. Third-party results will become available when access expands.
How does Claude Mythos compare to GPT-5 on benchmarks?
Claude Opus 4.6 already outperforms GPT-5.2 on most benchmarks, including a 144 Elo point lead on GDPval-AA. The leaked materials claim Mythos dramatically exceeds Opus 4.6, which would place it well ahead of any current GPT-5 variant.
What benchmark is Claude Mythos best at?
Cybersecurity appears to be the standout domain. Anthropic describes Mythos as “currently far ahead of any other AI model in cyber capabilities,” and this is reportedly the area with the largest performance gap over Opus 4.6.
