LLM Stats Leaderboard
An independent leaderboard ranking 300+ models by a composite score blending verified benchmarks, live performance, and price — the core source for llm-benchmarks. Pricing refreshes hourly; live metrics update continuously.
What goes into the score
- Benchmarks: GPQA Diamond, SWE-Bench Verified, MMLU, AIME, HumanEval, coding-arena.
- Speed: output throughput (tokens/sec) and time-to-first-token, measured over a rolling 7-day window.
- Price: per-token input cost from official price lists.
- Context window and agentic/coding performance.
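The leaderboard's actual weighting is not described here, so the following is only a minimal sketch of how a composite of this kind could be built: each axis is min-max normalized, lower-is-better axes (latency, price) are flipped, and the results are combined as a weighted sum. The weights, the `ModelStats` fields, and the sample rows are all assumptions made for illustration; none of them come from the leaderboard.

```python
from dataclasses import dataclass


@dataclass
class ModelStats:
    name: str
    benchmark: float       # mean benchmark accuracy in [0, 1]
    tokens_per_sec: float  # output throughput
    ttft_sec: float        # time-to-first-token, in seconds
    price_per_m: float     # USD per million input tokens


# Assumed weights (NOT the leaderboard's); they sum to 1.
WEIGHTS = {"benchmark": 0.5, "throughput": 0.2, "ttft": 0.1, "price": 0.2}


def min_max(values, higher_is_better=True):
    """Rescale values to [0, 1]; flip the scale when lower is better."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5] * len(values)  # no spread: give every model a neutral score
    scaled = [(v - lo) / (hi - lo) for v in values]
    return scaled if higher_is_better else [1.0 - s for s in scaled]


def composite_scores(models):
    """Return {model name: weighted sum of its normalized axes}."""
    axes = {
        "benchmark": min_max([m.benchmark for m in models]),
        "throughput": min_max([m.tokens_per_sec for m in models]),
        "ttft": min_max([m.ttft_sec for m in models], higher_is_better=False),
        "price": min_max([m.price_per_m for m in models], higher_is_better=False),
    }
    return {
        m.name: sum(WEIGHTS[axis] * axes[axis][i] for axis in WEIGHTS)
        for i, m in enumerate(models)
    }


def rank(models):
    """Model names ordered from highest to lowest composite score."""
    scores = composite_scores(models)
    return sorted(scores, key=scores.get, reverse=True)


# Made-up rows for illustration only; not leaderboard data.
EXAMPLE_MODELS = [
    ModelStats("model-a", 0.90, 60.0, 1.2, 15.0),
    ModelStats("model-b", 0.80, 200.0, 0.4, 1.5),
    ModelStats("model-c", 0.70, 700.0, 0.3, 0.5),
]

print(rank(EXAMPLE_MODELS))
```

With these assumed weights the balanced `model-b` ranks first, ahead of both the strongest-benchmark and the fastest model. Shifting weight toward `benchmark` would change that, which is why any composite is only as meaningful as its weighting.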
Top of the board (2026 snapshot; volatile)
- Claude Mythos Preview (Anthropic) — highest reasoning score.
- claude-opus-4-8 (Anthropic) — #2 overall.
- GPT-5.5 (OpenAI) — #3.
- Qwen3.7 Max (Alibaba) — cheapest top-10 at $1.53/M tokens.
- Grok 4 Fast (xAI) — largest context (2.0M tokens).
- Mercury 2 — fastest output (784 tok/s).
Why it matters here
It operationalizes “best model” as a multi-axis tradeoff (intelligence × speed × price × context) rather than a single number, which makes it an antidote to single-benchmark hype. Anthropic and OpenAI lead on reasoning, while Alibaba (Qwen) and xAI compete on cost and context. (The leaderboard is a dated snapshot; rankings churn weekly.)
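One way to make the multi-axis view concrete, without committing to any weighting at all, is a Pareto filter: keep every model that no other model matches or beats on all four axes at once. The sketch below is a generic illustration of that idea, not something the leaderboard is known to compute. The `Model` fields and the sample rows are hypothetical.

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    intelligence: float    # higher is better
    tokens_per_sec: float  # higher is better
    price_per_m: float     # USD per million tokens; lower is better
    context_tokens: int    # higher is better


def _axes(m):
    # Negate price so that every axis reads "higher is better".
    return (m.intelligence, m.tokens_per_sec, -m.price_per_m, m.context_tokens)


def dominates(a, b):
    """True if a is at least as good as b on every axis and strictly better on one."""
    pairs = list(zip(_axes(a), _axes(b)))
    return all(x >= y for x, y in pairs) and any(x > y for x, y in pairs)


def pareto_frontier(models):
    """Models that no other model dominates, in input order."""
    return [m for m in models if not any(dominates(o, m) for o in models)]


# Made-up rows for illustration only; not leaderboard data.
EXAMPLE = [
    Model("reasoner", 0.92, 50.0, 15.0, 200_000),
    Model("budget", 0.80, 120.0, 1.5, 1_000_000),
    Model("speedster", 0.70, 700.0, 0.8, 128_000),
    Model("laggard", 0.65, 100.0, 2.0, 128_000),
]

print([m.name for m in pareto_frontier(EXAMPLE)])
```

Here `laggard` drops out because `budget` is at least as good on every axis, while the other three each win somewhere and so remain defensible "best" choices depending on the workload.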
Related
llm-benchmarks · anthropic · claude-opus-4-8 · llm-provider · llm-api-pricing