Stop Paying for Opus: Sonnet 4.6 Closes the Gap to 1.2%

Sonnet 4.6 scores 79.6% on SWE-bench, within 1.2 points of Opus. At one-fifth the price, it changes which model you should reach for.

Hej Team

Anthropic dropped Claude Sonnet 4.6 yesterday. The headline: it scores 79.6% on SWE-bench Verified, just 1.2 points behind Opus 4.6 (80.8%). At $3/$15 per million tokens versus Opus's $5/$25, that gap is hard to justify paying for.

This isn't a minor version bump. Sonnet 4.6 outperforms the previous Opus 4.5 (the flagship from November 2025) on most coding tasks. Claude Code testers preferred it over Sonnet 4.5 about 70% of the time and over the old Opus 59% of the time.

Here's what actually changed and why it matters if you write code for a living.

The benchmarks that matter for coding

[Figure: coding benchmark comparison]

Three numbers stand out.

SWE-bench Verified: 79.6%. This benchmark tests real-world software engineering: finding bugs in open source repos, understanding codebases, writing patches that pass test suites. Sonnet 4.6 is now second only to Opus 4.6 (80.8%) across all models. It beats GPT-5.3 Codex (74.9%) by nearly 5 points.

Terminal-Bench 2.0: 59.1%. This measures interactive coding: multi-step debugging, iterating in a terminal environment, the kind of work you actually do all day. That's an 8-point jump over Sonnet 4.5 (51.0%). GPT-5.3 Codex leads this benchmark at 77.3%, making it the stronger pick for rapid interactive sessions.

OSWorld: 72.5%. The computer-use benchmark. Sonnet 4.6 nearly matches Opus 4.6 (72.7%) and leads GPT-5.3 Codex (64.7%) by 8 points. If you're building agents that interact with desktop applications or browsers, this is significant.

It actually finishes the job now

The most common complaint about Sonnet 4.5 was laziness. You'd ask it to refactor three files and it would do the first one thoroughly, half-finish the second, and tell you "the third file follows the same pattern." Developers had a word for this: cope.

Sonnet 4.6 fixes this. Anthropic specifically targeted the "lazy" behavior, and it shows in practice. The model follows through on multi-step tasks instead of claiming premature completion.

Three concrete improvements you'll notice immediately:

  1. It reads before it writes. Sonnet 4.6 more consistently examines existing code before modifying it. It picks up on patterns in your codebase (naming conventions, error handling style, abstraction levels) and matches them.

  2. It consolidates instead of duplicating. When adding new functionality, it's more likely to extend existing functions than create parallel implementations. This was a persistent pain point with 4.5, which would happily create handleUserV2() next to handleUser().

  3. Fewer false success claims. When something doesn't work, it's more honest about it. Less "I've implemented the changes" when the changes are incomplete or broken.

Adaptive thinking: the new default

Sonnet 4.6 introduces adaptive thinking, and it changes how the model allocates compute. Instead of you deciding how much "thinking" budget to give the model, it decides for itself based on the problem.

Simple questions get fast answers. Complex multi-step reasoning triggers deeper chains. You don't have to tune anything.

For the API, the new pattern:

import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

const response = await anthropic.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 4096,
  thinking: { type: "adaptive" },
  messages: [{ role: "user", content: "..." }],
});

The old thinking: { type: "enabled", budget_tokens: N } approach is deprecated on 4.6 models. If you've been manually setting thinking budgets, you can stop.

Anthropic recommends setting effort: "medium" for most Sonnet 4.6 use cases. This balances speed, cost, and quality. Reserve high or max effort for genuinely hard problems (complex refactors, architectural decisions, tricky debugging).
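Combining the two settings, a request might look like this. This is a sketch only: the exact placement of the top-level effort field is an assumption, so check the current SDK reference before relying on it.

```typescript
// Sketch of a "medium effort + adaptive thinking" request body.
// ASSUMPTION: `effort` sits at the top level of the request, next
// to `thinking` -- verify against the current Anthropic SDK docs.
const requestBody = {
  model: "claude-sonnet-4-6",
  max_tokens: 4096,
  effort: "medium", // bump to "high" or "max" for hard problems only
  thinking: { type: "adaptive" },
  messages: [
    { role: "user", content: "Fix the flaky retry logic in this module." },
  ],
};
```

The point of the shape: effort is the one dial you still tune per task, while the thinking budget manages itself.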

The cost math

Here's where Sonnet 4.6 gets interesting for teams that have been defaulting to Opus.

Model            Input       Output       SWE-bench
Opus 4.6         $5/MTok     $25/MTok     80.8%
Sonnet 4.6       $3/MTok     $15/MTok     79.6%
GPT-5.3 Codex    ~$4/MTok    ~$20/MTok    74.9%

Sonnet 4.6 is about 60% of the cost of Opus for 98.5% of the coding performance. For most everyday coding tasks (writing features, fixing bugs, refactoring, code review), that last 1.2% on SWE-bench doesn't translate to a noticeable quality difference.

Add prompt caching and that gap widens further. Cache reads cost $0.30/MTok, a 90% discount on input. Batch processing cuts everything by 50%. A team running heavy agentic workloads could see costs drop dramatically by switching from Opus to Sonnet 4.6 with caching enabled.
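To see the caching math play out, here's a back-of-the-envelope sketch. The workload numbers are hypothetical; Sonnet's $0.30/MTok cache-read price is from above, and Opus cache reads are assumed to get the same 90% discount ($0.50/MTok).

```typescript
// Hypothetical monthly agentic workload: 500M input tokens with an
// 80% prompt-cache hit rate, plus 50M output tokens. Prices are USD
// per million tokens.
function monthlyCostUSD(
  inputMTok: number,
  outputMTok: number,
  cacheHitRate: number,
  inputPrice: number,
  cacheReadPrice: number,
  outputPrice: number,
): number {
  const cachedInput = inputMTok * cacheHitRate * cacheReadPrice;
  const freshInput = inputMTok * (1 - cacheHitRate) * inputPrice;
  return cachedInput + freshInput + outputMTok * outputPrice;
}

// Opus 4.6: $5 in / $25 out, cache reads at $0.50 (assumed 90% discount)
const opusBill = monthlyCostUSD(500, 50, 0.8, 5, 0.5, 25);
// Sonnet 4.6: $3 in / $15 out, cache reads at $0.30
const sonnetBill = monthlyCostUSD(500, 50, 0.8, 3, 0.3, 15);
```

Under these assumptions Sonnet lands at roughly 60% of the Opus bill, and batch processing's 50% cut stacks on top of that.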

When Opus still wins: deep reasoning tasks (ARC-AGI-2: 75.2% vs 58.3%), novel architecture design, and problems that genuinely need the extra intelligence. But for the 80% of coding work that's well-scoped implementation and iteration, Sonnet 4.6 is the better value.

1M context window changes what's possible

Sonnet 4.6 supports a 1 million token context window in beta. That's roughly 75,000 lines of code in a single prompt.

This matters for a specific class of problems: cross-file refactors where you need the model to understand how a change in one module affects behavior three directories away. With 200K tokens you're constantly deciding what to include. With 1M tokens, you include everything.

The pricing for long context is worth knowing:

                 Under 200K tokens    Over 200K tokens
Input            $3/MTok              $6/MTok
Output           $15/MTok             $22.50/MTok

Double the input cost, 50% more on output. Not cheap for sustained use, but it makes previously impossible single-shot tasks feasible. Load your entire backend into context, ask for a migration plan, get an answer that accounts for every dependency.
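For a feel of the tiered pricing, here's a per-request cost sketch. It assumes the premium rate applies to the entire request once input exceeds 200K tokens, as in Anthropic's earlier 1M-context beta; verify that billing detail in the current pricing docs.

```typescript
// Cost of one long-context request against Sonnet 4.6's tiered rates.
// ASSUMPTION: the whole request is billed at the premium rate once
// input crosses 200K tokens, mirroring the earlier 1M-context beta.
function requestCostUSD(inputTokens: number, outputTokens: number): number {
  const longContext = inputTokens > 200_000;
  const inputPerMTok = longContext ? 6 : 3;
  const outputPerMTok = longContext ? 22.5 : 15;
  return (
    (inputTokens / 1_000_000) * inputPerMTok +
    (outputTokens / 1_000_000) * outputPerMTok
  );
}

// Dumping an 800K-token backend and getting a 10K-token migration plan:
const bigShot = requestCostUSD(800_000, 10_000); // ≈ $5.03
```

A few dollars per shot is real money in a loop, but trivial for a one-off migration plan that would otherwise take a day of manual dependency tracing.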

There's also a new context compaction API (beta) that automatically summarizes earlier parts of a conversation as it approaches the window limit. Effectively infinite conversations without manual /compact management. This is aimed squarely at agent builders running long autonomous coding sessions.

Where it fits in your stack

Sonnet 4.6 is already the default model for Free and Pro users on claude.ai. It's available in Claude Code, GitHub Copilot, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry.

For most developers, the practical question is: when do you still reach for Opus?

Use Sonnet 4.6 for:

  • Day-to-day feature development and bug fixes
  • Code review and refactoring
  • Interactive pair programming sessions
  • Agent workflows where cost matters (Sonnet handles the volume, Opus handles the hard problems)
  • Computer use and browser automation tasks

Use Opus 4.6 for:

  • Deep architectural reasoning and system design
  • Problems that require genuine novelty (not pattern matching)
  • Long autonomous sessions on complex, ambiguous tasks
  • When 1.2% more SWE-bench accuracy actually saves you debugging time

The honest pattern for teams: route simple and medium tasks to Sonnet 4.6, escalate to Opus for the genuinely hard stuff. Your costs drop, your throughput increases, and the quality difference for routine work is negligible.
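That routing pattern reduces to a thin dispatcher. A minimal sketch follows; the difficulty signal and the Opus model ID are assumptions for illustration (only claude-sonnet-4-6 appears in the API example earlier).

```typescript
// Hypothetical model router: routine work goes to Sonnet 4.6,
// genuinely hard problems escalate to Opus. In practice the
// difficulty signal might come from a heuristic, a cheap
// classifier, or a failed first attempt.
type Difficulty = "simple" | "medium" | "hard";

function pickModel(difficulty: Difficulty): string {
  // ASSUMPTION: "claude-opus-4-6" follows Anthropic's naming scheme.
  return difficulty === "hard" ? "claude-opus-4-6" : "claude-sonnet-4-6";
}

// Escalation path: if Sonnet's attempt fails review, retry on Opus.
function escalate(currentModel: string): string {
  return currentModel === "claude-sonnet-4-6" ? "claude-opus-4-6" : currentModel;
}
```

The escalate-on-failure variant is often cheaper than classifying up front: most tasks succeed on Sonnet's first pass, so Opus only sees the residue.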

What this means for the market

A year ago, getting near-80% on SWE-bench required the most expensive frontier model available. Now a mid-tier model does it at $3/$15.

Sonnet 4.6 outscores GPT-5.3 Codex on SWE-bench (79.6% vs 74.9%) and OSWorld (72.5% vs 64.7%). GPT-5.3 Codex fights back hard on Terminal-Bench 2.0 (77.3% vs 59.1%) and still wins on pure math (AIME, GPQA). The competitive picture is more nuanced than last generation, when Claude led on almost everything.

GitHub's VP of Product called it out specifically: "Out of the gate, Claude Sonnet 4.6 is already excelling at complex code fixes, especially when searching across large codebases is essential."

The broader trend: the performance ceiling for coding agents is rising while the price floor is falling. Tasks that were unreliable six months ago (multi-file refactors, automated bug investigation, full PR generation from issue descriptions) are becoming reliable enough to depend on.

Sonnet 4.6 doesn't rewrite the rules. It just makes the top-tier coding performance that was locked behind Opus pricing available to everyone. For most developers, that's the upgrade that actually matters.