Anthropic Releases Claude 4: Benchmark Records and a New Approach to Reasoning
Hybrid Thinking
Dynamically switches between fast responses and deep chain-of-thought reasoning, cutting median latency by 40% on conversational tasks.
Benchmark Records
72.5% on SWE-bench Verified and 87.6% on AIME 2025 — both records at launch.
Stronger Agent Safety
Updated Constitutional AI with extensive red-teaming specifically for agentic and multi-step tool-use scenarios.
Anthropic has released Claude 4, its most capable model to date, claiming top scores on a range of coding, mathematics, and scientific reasoning benchmarks. The announcement, made this morning on the company blog, positions Claude 4 as a direct competitor to OpenAI's latest offerings.[1]
Hybrid Thinking Mode
The most notable new feature is what Anthropic calls hybrid thinking — a mechanism that allows the model to decide in real time whether a query warrants extended chain-of-thought reasoning or a faster, more direct response. Earlier reasoning models ran extended thinking on every query, adding latency even for simple requests. Claude 4 now reserves deep reasoning for problems that require it.[1]
Anthropic says internal testing shows a 40% reduction in median response latency on conversational tasks, while preserving the performance benefits of extended thinking on hard problems such as multi-step proofs and complex code generation.
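The announcement does not document the API surface for this feature, but recent Anthropic SDK versions expose an extended-thinking option on chat requests. The following is a minimal sketch of how a developer might toggle it per request; the model id, token budget, and the `build_request`/`hard_problem` names are illustrative assumptions, not details from the announcement.

```python
# Hypothetical sketch: constructing a Messages-style request payload that
# enables extended thinking only for hard problems. The "thinking" field
# mirrors Anthropic's published extended-thinking option; the model id
# and budget values below are illustrative, not from the announcement.

def build_request(prompt: str, hard_problem: bool) -> dict:
    """Build a chat request, opting in to deep reasoning only when needed."""
    payload = {
        "model": "claude-4",  # illustrative model id
        "max_tokens": 4096,
        "messages": [{"role": "user", "content": prompt}],
    }
    if hard_problem:
        # Allocate a chain-of-thought token budget for this request only.
        payload["thinking"] = {"type": "enabled", "budget_tokens": 16000}
    return payload

simple = build_request("What's the capital of France?", hard_problem=False)
hard = build_request("Prove that sqrt(2) is irrational.", hard_problem=True)
```

Keeping extended thinking off by default and opting in per request is one way a caller could mirror the latency behavior described above.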
Benchmark Performance
On SWE-bench Verified, a test of real-world software engineering tasks, Claude 4 scores 72.5% — up from Claude 3.7 Sonnet's 62.3%.[2] On AIME 2025, a competition mathematics benchmark, it achieves 87.6%, a record at the time of publication. On FrontierMath, which tests research-level mathematics, Claude 4 reaches 18.9%, compared to 10.1% for its predecessor.
OpenAI's o3 model, currently the benchmark leader on several of these tasks, remains ahead on AIME by roughly four percentage points, but Anthropic notes that Claude 4 outperforms o3 on the HumanEval coding benchmark.[2]
Safety and Constitutional AI Updates
Alongside capability improvements, Anthropic has updated its Constitutional AI framework for Claude 4. The model undergoes more extensive red-teaming for agentic scenarios — cases where an AI is given tools and asked to complete multi-step tasks autonomously.[3] The company says this is increasingly important as more developers deploy Claude in agent pipelines.
Claude 4 is available today via the Anthropic API and on Claude.ai. Pricing for the standard tier matches Claude 3.7 Sonnet, with a new extended thinking tier priced at a premium for high-complexity workloads.[1]