Anthropic Releases Claude 4: Benchmark Records and a New Approach to Reasoning
Hybrid Thinking
Dynamically switches between fast responses and deep chain-of-thought reasoning, cutting median latency by 40% on conversational tasks.
Benchmark Records
72.5% on SWE-bench Verified and 87.6% on AIME 2025 — both records at launch.
Stronger Agent Safety
Updated Constitutional AI with extensive red-teaming specifically for agentic and multi-step tool-use scenarios.
Anthropic has released Claude 4, its most capable model to date, claiming top scores on a range of coding, mathematics, and scientific reasoning benchmarks. The announcement, made this morning on the company blog, positions Claude 4 as a direct competitor to OpenAI's latest offerings.[1]
Hybrid Thinking Mode
The most notable new feature is what Anthropic calls hybrid thinking — a mechanism that allows the model to decide in real time whether a query warrants extended chain-of-thought reasoning or a faster, more direct response. Earlier reasoning models ran extended thinking on every query, adding latency even for simple requests. Claude 4 now reserves deep reasoning for problems that require it.[1]
Anthropic says internal testing shows a 40% reduction in median response latency on conversational tasks, while preserving the performance benefits of extended thinking on hard problems such as multi-step proofs and complex code generation.
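The announcement does not document the API surface for this feature, but recent Anthropic SDK versions expose an extended-thinking option on chat requests. The following is a minimal sketch of how a developer might toggle it per request; the model id, token budget, and the `build_request`/`hard_problem` names are illustrative assumptions, not details from the announcement.

```python
# Hypothetical sketch: constructing a Messages-style request payload that
# enables extended thinking only for hard problems. The "thinking" field
# mirrors Anthropic's published extended-thinking option; the model id
# and budget values below are illustrative, not from the announcement.

def build_request(prompt: str, hard_problem: bool) -> dict:
    """Build a chat request, opting in to deep reasoning only when needed."""
    payload = {
        "model": "claude-4",  # illustrative model id
        "max_tokens": 4096,
        "messages": [{"role": "user", "content": prompt}],
    }
    if hard_problem:
        # Allocate a chain-of-thought token budget for this request only.
        payload["thinking"] = {"type": "enabled", "budget_tokens": 16000}
    return payload

simple = build_request("What's the capital of France?", hard_problem=False)
hard = build_request("Prove that sqrt(2) is irrational.", hard_problem=True)
```

Keeping extended thinking off by default and opting in per request is one way a caller could mirror the latency behavior described above.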
Benchmark Performance
On SWE-bench Verified, a test of real-world software engineering tasks, Claude 4 scores 72.5% — up from Claude 3.7 Sonnet's 62.3%.[2] On AIME 2025, a competition mathematics benchmark, it achieves 87.6%, a record at the time of publication. On FrontierMath, which tests research-level mathematics, Claude 4 reaches 18.9%, compared to 10.1% for its predecessor.
OpenAI's o3 model, currently the benchmark leader on several of these tasks, remains ahead on AIME by roughly four percentage points, but Anthropic notes that Claude 4 outperforms o3 on the HumanEval coding benchmark.[2]
Safety and Constitutional AI Updates
Alongside capability improvements, Anthropic has updated its Constitutional AI framework for Claude 4. The model undergoes more extensive red-teaming for agentic scenarios — cases where an AI is given tools and asked to complete multi-step tasks autonomously.[3] The company says this is increasingly important as more developers deploy Claude in agent pipelines.
Claude 4 is available today via the Anthropic API and on Claude.ai. Pricing for the standard tier matches Claude 3.7 Sonnet, with a new extended thinking tier priced at a premium for high-complexity workloads.[1]