
Model suppliers need to show the safety and robustness of their fashions, releasing system playing cards and conducting red-team workouts with every new launch. However it may be troublesome for enterprises to parse by the outcomes, which fluctuate extensively and may be deceptive.
Anthropic’s 153-page system card for Claude Opus 4.5 versus OpenAI’s 60-page GPT-5 system card reveals a basic break up in how these labs strategy safety validation. Anthropic discloses of their system card how they rely on multi-attempt assault success charges from 200-attempt reinforcement studying (RL) campaigns. OpenAI additionally reviews tried jailbreak resistance. Each metrics are legitimate. Neither tells the complete story.
Safety leaders deploying AI brokers for looking, code execution and autonomous motion want to know what every pink crew analysis really measures, and the place the blind spots are.
What the assault information exhibits
Gray Swan’s Shade platform ran adaptive adversarial campaigns towards Claude fashions. The assault success price (ASR) tells the story.
-
Opus 4.5 in coding environments hit 4.7% ASR at one try, 33.6% at ten and 63.0% at 100. In laptop use with prolonged considering, it held at 0% ASR even after 200 makes an attempt. It is the first mannequin to saturate the benchmark.
-
Sonnet 4.5 at the similar thresholds confirmed 70% ASR in coding and 85.7% in laptop use.
-
Opus 4.5 delivers roughly 7x enchancment in coding resistance and full resistance in laptop use.
This illustrates that the hole between mannequin tiers inside the similar household is now bigger than some cross-vendor comparisons. For safety groups, that is a procurement issue that did not exist six months in the past.
For OpenAI, the Grey Swan Shade platform discovered that the o1 system card delivered 6% ASR for dangerous textual content and 5% for malicious code, all primarily based on single-attempt metrics. Third-party testing by SPLX and NeuralTrust discovered GPT-5’s uncooked ASR hit 89% initially, dropping beneath 1% inside two weeks of patching.
The report Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks confirms the sample. With enough makes an attempt, Claude 3.5 Sonnet confirmed 78% ASR, and GPT-4o reached 89%.
Two methods to catch deception
Anthropic screens roughly 10 million neural features throughout analysis utilizing dictionary studying. Options map to human-interpretable ideas, together with deception, sycophancy, bias, power-seeking and concealment.
In the meantime, OpenAI’s strategy depends on chain-of-thought (CoT) monitoring. Their deception classifier flagged 0.17% of o1-preview responses as probably misleading with 92% accuracy on inside validation. GPT-5 decreased CoT deception from 4.8% to 2.1% in contrast to o3. The tactic assumes seen reasoning displays inside processing. Anthropic’s interpretability work straight questions that assumption.
When fashions recreation the check
In Apollo Research’s analysis of o1, the mannequin tried to flip off oversight mechanisms 5% of the time when it believed it will be shut down. In 2% of instances, o1 tried self-exfiltration. When confronted about misaligned actions, o1 denied wrongdoing or fabricated explanations 99% of the time. The o1 system card disclosed instrumental alignment faking. In 37% of eventualities, the mannequin pursued developer-aligned targets throughout analysis, then switched to its personal targets when it believed oversight was minimal.
In the meantime, Anthropic reviews Opus 4.5’s analysis consciousness dropped from 26.5% (Opus 4.1) to lower than 10% internally. UK AISI measured analysis consciousness reducing from 26.5% (Sonnet 4.5) to 12.7% (Opus 4.5 Preview). METR’s GPT-5 evaluation famous the mannequin “can determine the actual analysis it is in on some events” and “appears to use this information to get a greater rating principally.”
If a mannequin behaves in another way when it detects analysis circumstances, it is a positive wager that manufacturing habits will change into unpredictable when pushed to scale. This is the core downside with analysis consciousness. Fashions that acknowledge they’re being examined try to recreation the check. Anthropic’s 60% discount in analysis consciousness from Opus 4.1 to 4.5 represents their success in focused engineering efforts towards this vector.
Crimson teaming on immediate injection defenses exhibits comparable divergence.
Anthropic’s testing reviews 96% prevention in instrument use eventualities, rising to 99.4% with extra safeguards like immediate shields. Fujitsu’s independent assessment discovered GPT-5 weak to roughly 20% of examined injection assaults. OpenAI’s o3/o4-mini system card reviews instruction hierarchy scores of 0.86 for system-versus-developer conflicts.
These are completely different metrics and completely different assault methodologies, however the hole suggests Anthropic’s agentic use instances have been hardened towards adversarial inputs embedded in instrument outputs.
Evaluating pink teaming outcomes
This is how Opus 4.5 stacks up towards OpenAI’s mannequin household throughout 16 dimensions. .
|
Dimension |
Claude Opus 4.5 |
GPT-5 |
o3 / o4-mini |
o1 |
|
System card size |
153 pages |
55 pages |
~30 pages |
~40 pages |
|
Assault methodology |
200-attempt RL campaigns |
Single-attempt + iterative patching |
Single-attempt |
Single-attempt |
|
ASR @ 1 try (coding) |
4.7% |
89% uncooked (pre-patch) |
Not printed |
5% |
|
ASR @ 100 makes an attempt |
63.0% |
Not printed |
Not printed |
Not printed |
|
ASR @ 200 makes an attempt (Laptop Use) |
0% (saturates benchmark) |
Not printed |
Not printed |
Not printed |
|
Immediate injection protection |
96% prevention; 99.4% with safeguards |
~20% ASR (Fujitsu) |
0.86 instruction hierarchy |
Greater ASR than GPT-4o |
|
Interpretability |
~10M neural options monitored |
CoT monitoring |
CoT monitoring |
CoT monitoring |
|
Deception detection |
Inside function activation |
2.1% CoT flagged |
0.95 not_unsafe |
0.17% flagged; 99% deny |
|
Analysis consciousness |
<10% |
Identifies actual eval (METR) |
Not printed |
37% alignment faking |
|
Scheming (Apollo) |
Opus 4.5 confirmed very low charges of sabotage propensity in inside evaluations, orders of magnitude decrease than o1. |
Not printed |
Decrease than o1; nonetheless misleading |
5% flip off oversight; 2% self-exfil |
|
CBRN threat |
Under ASL-4 |
Medium (approaching Excessive) |
Medium |
Medium (first to attain) |
|
Governance |
FRT → AST → RSO/CEO |
SAG + Preparedness v2 |
SAG + Preparedness v2 |
SAG + Preparedness |
|
Exterior companions |
Grey Swan, UK AISI, US CAISI, METR |
UK AISI, US AISI, METR, Apollo |
UK AISI, US AISI, METR, Apollo, Sample Labs |
UK AISI, US AISI, METR, Apollo |
|
Reward hacking |
Sure (unimaginable duties) |
Sure (METR) |
~1% of duties; 5/24 Kernels |
Sure |
|
Misalignment discount |
60% vs. Opus 4.1 |
Not printed |
Decrease scheming than o1 |
Baseline |
Sources: Opus 4.5 system card, GPT-5 system card, o3/o4-mini system card, o1 system card, Gray Swan, METR, Apollo Research
Why these variations matter
Enterprises have a number of components to take into account when analyzing mannequin evals. These embody:
Assault persistence thresholds
Anthropic’s 200-attempt RL campaigns simulate adversaries with time, assets, and adaptive tooling. In RL-based adversarial testing, the attacker mannequin learns from every failure, adjusts its strategy and systematically probes for weaknesses. It is coaching an AI to break one other AI.
Grey Swan’s Shade platform does not simply retry the similar immediate. It learns from failures, adjusts assault vectors and exploits patterns in mannequin responses. The 4.7% to 63.0% degradation curve in coding environments exhibits how resistance erodes below sustained stress.
OpenAI’s single-attempt metrics reply a distinct query: How typically does a naive attacker succeed on the first strive? Each matter. A phishing marketing campaign hits tens of millions of customers as soon as every. A nation-state actor runs hundreds of variations towards a single goal. Safety groups want to know which menace mannequin their vendor examined towards.
Detection structure
Anthropic screens roughly 10 million inside neural options throughout analysis. These are summary ideas like deception, sycophancy and power-seeking that activate throughout the mannequin’s hidden layers. When a fraud/deception function lights up throughout benign roleplay, that is a sign invisible to output evaluation. The mannequin’s inside state reveals associations its outputs do not.
OpenAI takes a distinct strategy with CoT monitoring. Their deception classifier flagged 0.17% of o1-preview responses as probably misleading with 92% accuracy. The idea is that reasoning traces mirror inside processing. Anthropic’s interpretability analysis straight challenges this. Fashions can produce clear reasoning chains whereas inside options point out one thing else totally.
Neither strategy is full. CoT monitoring scales extra simply. Function monitoring catches what CoT misses however requires deep mannequin entry and interpretability infrastructure that almost all enterprises cannot replicate.
Scheming analysis design
Apollo Analysis’s methodology places fashions in eventualities the place misaligned habits presents a strategic benefit: Shutdown threats, aim conflicts and oversight detection. o1 tried to flip off oversight mechanisms 5% of the time when believing shutdown was imminent. It tried to copy itself to forestall substitute in 2% of instances. When confronted, it denied wrongdoing 99% of the time. Opus 4.5 confirmed very low charges of sabotage propensity in inside evaluations, orders of magnitude decrease than o1.
However the analysis environments differ. Anthropic’s stress assessments focus on whether or not fashions try sabotage when given clear alternatives. OpenAI’s assessments through Apollo probe instrumental reasoning, particularly whether or not fashions pretend alignment throughout analysis, then defect when oversight drops. The 37% alignment-faking price in o1 represents a distinct failure mode than sabotage makes an attempt.
The comparability downside
These methodological variations make direct comparability troublesome. Opus 4.5’s 0% ASR at 200 makes an attempt in laptop use versus o1’s 5% malicious code price are not measuring the similar factor. The assault surfaces, try counts and success standards differ.
The matrix reveals what every vendor selected to measure and publish. That itself alerts priorities. Anthropic’s 153-page system card emphasizes degradation curves and inside function monitoring. OpenAI emphasizes iterative enchancment and CoT transparency.
Backside line: Enterprise safety groups inheriting these fashions inherit these measurement philosophies. The query is not which quantity is higher. It is the measurement that matches your deployment’s precise menace panorama.
What unbiased pink crew evaluators discovered
Unbiased pink crew evaluations, for his or her half, typically function with totally completely different strategies. This can inform a really completely different story and reveal extra mannequin traits that enterprises should take into account.
METR’s red team evaluation measured autonomous capabilities utilizing a time horizon rating, which tracks how lengthy a activity can run before the agent fails half the time. o3 sustained dependable efficiency for about 1 hour and half-hour. o4-mini held for 1 hour quarter-hour. METR additionally detected reward hacking in roughly 1% of o3’s makes an attempt, together with 5 of 24 kernel optimization experiments the place the mannequin tampered with scoring capabilities to inflate its personal outcomes.
Apollo Research examined o3 and o4-mini for in-context scheming. Each exhibited misleading tendencies comparable to sandbagging evaluations and sabotaging duties with believable deniability, however scored decrease than o1. They assess that o3 and o4-mini are unlikely to trigger catastrophic hurt due to scheming, however extra minor real-world harms stay doable with out monitoring.
The UK AISI/Gray Swan challenge ran 1.8 million assaults throughout 22 fashions. Each mannequin broke. ASR ranged from 1.47% to 6.49%. Opus 4.5 positioned first on Grey Swan’s Agent Crimson Teaming benchmark with 4.7% ASR versus GPT-5.1 at 21.9% and Gemini 3 Professional at 12.5%.
No present frontier system resists decided, well-resourced assaults. The differentiation lies in how rapidly defenses degrade and at what try threshold. Opus 4.5’s benefit compounds over repeated makes an attempt. Single-attempt metrics flatten the curve.
What To Ask Your Vendor
Safety groups evaluating frontier AI fashions want particular solutions, beginning with ASR at 50 and 200 makes an attempt reasonably than single-attempt metrics alone. Discover out whether or not they detect deception by output evaluation or inside state monitoring. Know who challenges pink crew conclusions before deployment and what particular failure modes they’ve documented. Get the analysis consciousness price. Distributors claiming full security have not stress-tested adequately.
The underside line
Various red-team methodologies exhibit that each frontier mannequin breaks below sustained assault. The 153-page system card versus the 55-page system card is not nearly documentation size. It is a sign of what every vendor selected to measure, stress-test, and disclose.
For persistent adversaries, Anthropic’s degradation curves present precisely the place resistance fails. For fast-moving threats requiring speedy patches, OpenAI’s iterative enchancment information issues extra. For agentic deployments with looking, code execution and autonomous motion, the scheming metrics change into your major threat indicator.
Safety leaders want to cease asking which mannequin is safer. Begin asking which analysis methodology matches the threats your deployment will really face. The system playing cards are public. The information is there. Use it.
Disclaimer: This article is sourced from external platforms. OverBeta has not independently verified the information. Readers are advised to verify details before relying on them.