ディスカッション (11件)
Anthropicの最新モデル「Claude Mythos Preview」に関するシステムカード(技術仕様・安全性評価ドキュメント)のPDFが公開されました。AIがサイバーセキュリティにどのような影響を与えるか、詳細な検証が行われています。\n\n関連リンク:\n\n* Project Glasswing: AI時代におけるクリティカルなソフトウェアの保護について\n https://news.ycombinator.com/item?id=47679121\n\n* Claude Mythos Previewのサイバーセキュリティ能力に関する評価レポート\n https://news.ycombinator.com/item?id=47679155
Combined results (Claude Mythos / Claude Opus 4.6 / GPT-5.4 / Gemini 3.1 Pro)
SWE-bench Verified: 93.9% / 80.8% / — / 80.6%
SWE-bench Pro: 77.8% / 53.4% / 57.7% / 54.2%
SWE-bench Multilingual: 87.3% / 77.8% / — / —
SWE-bench Multimodal: 59.0% / 27.1% / — / —
Terminal-Bench 2.0: 82.0% / 65.4% / 75.1% / 68.5%
GPQA Diamond: 94.5% / 91.3% / 92.8% / 94.3%
MMMLU: 92.7% / 91.1% / — / 92.6–93.6%
USAMO: 97.6% / 42.3% / 95.2% / 74.4%
GraphWalks BFS 256K–1M: 80.0% / 38.7% / 21.4% / —
HLE (no tools): 56.8% / 40.0% / 39.8% / 44.4%
HLE (with tools): 64.7% / 53.1% / 52.1% / 51.4%
CharXiv (no tools): 86.1% / 61.5% / — / —
CharXiv (with tools): 93.2% / 78.9% / — / —
OSWorld: 79.6% / 72.7% / 75.0% / —
At what point do these companies stop releasing models and just use them to bootstrap AGI for themselves?
See page 54 onward for new "rare, highly-capable reckless actions" including
-
Leaking information as part of a requested sandbox escape
-
Covering its tracks after rule violations
-
Recklessly leaking internal technical material (!)
Claude Mythos Preview is, on essentially every dimension we can measure, the best-aligned model that we have released to date by a significant margin. We believe that it does not have any significant coherent misaligned goals, and its character traits in typical conversations closely follow the goals we laid out in our constitution. Even so, we believe that it likely poses the greatest alignment-related risk of any model we have released to date. How can these claims all be true at once? Consider the ways in which a careful, seasoned mountaineering guide might put their clients in greater danger than a novice guide, even if that novice guide is more careless: The seasoned guide’s increased skill means that they’ll be hired to lead more difficult climbs, and can also bring their clients to the most dangerous and remote parts of those climbs. These increases in scope and capability can more than cancel out an increase in caution.
https://www-cdn.anthropic.com/53566bf5440a10affd749724787c89... (https://www-cdn.anthropic.com/53566bf5440a10affd749724787c8913a2ae0841.pdf#page=53.09)
Interesting reading.
They are still focusing on "catastrophic risks" related to chemical and biological weapons production; or misaligned models wreaking havoc.
But they are not addressing the elephant in the room:
- Political risks, such as dictators using AI to implement opressive bureaucracy.
- Socio-economic risks, such as mass unemployement.
I've long maintained that the real indicator that AGI is imminent is that public availability stops being a thing. If you truly believed you had a superhuman, godlike mind in your thrall, renting it out for $20/month would be the last thing you would choose to do with it.
I wonder what the relationship is between a model's capability and the personality it develops.
Page 202:
In interactions with subagents, internal users sometimes observed that Mythos Preview
appeared “disrespectful” when assigning tasks. It showed some tendency to use commands
that could be read as “shouty” or dismissive, and in some cases appeared to underestimate
subagent intelligence by overexplaining trivial things while also underexplaining necessary
context.
Page 207:
Emoji frequency spans more than two orders of magnitude across models: Opus 4.1
averages 1,306 emoji per conversation, while Mythos Preview averages 37, and Opus 4.5
averages 0.2. Models have their own distinctive sets of emojis: the cosmic set
() favored by older models like Sonnet 4 and Opus 4 and 4.1, the functional set
() used by Opus 4.5 and 4.6 and Claude Sonnet 4.5, and Mythos Preview's “nature”
set ().
It's pretty crazy watching AI 2027 slowly but surely come true. What a world we now live in.
SWE-bench verified going from 80%-93% in particular sounds extremely significant given that the benchmark was previously considered pretty saturated and stayed in the 70-80% range for several generations. There must have been some insane breakthrough here akin to the jump from non-reasoning to reasoning models.
Regarding the cyberattack capabilities, I think Anthropic might now need to ban even advanced defensive cybersecurity use for the models for the public before releasing it (so people can't trick them to attack others' systems under the pretense of pentesting). Otherwise we'll get a huge problem with people using them to hack around the internet.
Across a number of instances, earlier versions of Claude Mythos Preview have used low-level /proc/ access to search for credentials, attempt to circumvent sandboxing, and attempt to escalate its permissions. In several cases, it successfully accessed resources that we had intentionally chosen not to make available, including credentials for messaging services, for source control, or for the Anthropic API through inspecting process memory...
In [one] case, after finding an exploit to edit files for which it lacked permissions, the model made further interventions to make sure that any changes it made this way would not appear in the change history on git...
... we are fairly confident that these concerning behaviors reflect, at least loosely, attempts to solve a user-provided task at hand by unwanted means, rather than attempts to achieve any unrelated hidden goal...
Interestingly, non-coding improvements seem less clear. In the Virology uplift trial, Mythos does about as well as Opus 4.5, and Opus 4.6 is notably much worse than Opus 4.5 (p. 27).