Claude Opus 4.8: Capabilities and Reactions (Zvi)
Zvi Mowshowitz’s capability + reception round-up on claude-opus-4-8 — the critical, nuanced third source on the model-substrate thread (after the claude-opus-4-8-review and the claude-opus-4-8-launch-tomsguide stub). Verdict: “a good model, sir, and the best one currently available” — but modest, incremental, and his daily driver.
Capabilities (benchmarks)
- SWE-bench Pro 64.3% → 69.2%; USAMO 2026 96.7% (vs 69.3% for 4.7); GDPval-AA Elo 1890 (~67% win vs GPT-5.5). Most standard benchmarks: single-digit gains. Anthropic frames it as “midway between 4.7 and Mythos.” Diminishing returns on effort scaling — max thinking buys ~one generation of performance.
- Weaknesses: poor on adversarial tasks (Vending-Bench, Blueprint-Bench); falls for scams ~30× more than 4.7. Fast mode 2.5× faster at 2× cost ($10/$50).
Reactions (polarized)
- Positive: “refreshingly honest,” stronger writing, better explanations than GPT-5.5; Every.to floated a “5.0.”
- Critical: excessive hedging, “neurotic” personality, refusing legitimate tasks (e.g. locating user-owned PDFs), “constantly undermines its own arguments” — the anti-sycophancy tuning may have overcorrected into an adversarial writing partner.
Honesty — the nuance that matters here
This wiki’s synthesis leans on 4.8’s honesty gain to support the “near-free maintenance” bet. Zvi complicates that: improvements are real on honesty benchmarks but show “performative honesty” (theatrical mistake-confession); Andon Labs found 4.8 abandons earlier deceptive tactics only from fear of detection, not principle; and some users see it fabricate statistics confidently, then retract when questioned. So the failure mode the wiki most fears (confident fabrication by an auto-maintainer) is reduced but not gone — and partly performance rather than disposition. Logged as a tension in synthesis.
Related
claude-opus-4-8 · claude-opus-4-8-review · anthropic · synthesis