Dataset source · updated 2026-06-01 (UTC)

LLM Stats Leaderboard

An independent leaderboard ranking 300+ models by a composite score blending verified benchmarks, live performance, and price — the core source for llm-benchmarks. Pricing refreshes hourly; live metrics update continuously.

What goes into the score

Benchmarks: GPQA Diamond, SWE-Bench Verified, MMLU, AIME, HumanEval, coding-arena.
Speed: output throughput (tokens/sec) and time-to-first-token, measured over a 7-day rolling window.
Price: per-token input cost from official lists.
Context window and agentic/coding performance.
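
The source does not state how these inputs are combined, so the following is only a minimal sketch of one common way to build a composite score: min-max normalize each axis, invert the axes where lower is better, and take a weighted sum. The `ModelStats` fields, the `WEIGHTS` values, and the function names are all hypothetical and are not the leaderboard's actual formula.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelStats:
    """Hypothetical per-model inputs; field names are illustrative only."""
    name: str
    benchmark: float       # mean benchmark accuracy in [0, 1], higher is better
    tokens_per_sec: float  # output throughput, higher is better
    ttft_sec: float        # time-to-first-token in seconds, lower is better
    input_price: float     # USD per million input tokens, lower is better


# Assumed weights for illustration; the real weighting is not published here.
WEIGHTS = {"benchmark": 0.5, "throughput": 0.2, "latency": 0.1, "price": 0.2}


def min_max(values, higher_is_better=True):
    """Scale values to [0, 1]; flip the scale when lower raw values are better."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All models tie on this axis, so it should not separate them.
        return [0.5 for _ in values]
    scaled = [(v - lo) / (hi - lo) for v in values]
    return scaled if higher_is_better else [1.0 - s for s in scaled]


def composite_scores(models):
    """Return {model name: weighted sum of normalized axes}."""
    bench = min_max([m.benchmark for m in models])
    speed = min_max([m.tokens_per_sec for m in models])
    ttft = min_max([m.ttft_sec for m in models], higher_is_better=False)
    price = min_max([m.input_price for m in models], higher_is_better=False)
    return {
        m.name: (
            WEIGHTS["benchmark"] * bench[i]
            + WEIGHTS["throughput"] * speed[i]
            + WEIGHTS["latency"] * ttft[i]
            + WEIGHTS["price"] * price[i]
        )
        for i, m in enumerate(models)
    }


def rank(models):
    """Model names ordered from highest to lowest composite score."""
    scores = composite_scores(models)
    return sorted(scores, key=scores.get, reverse=True)
```

Because min-max scaling is relative to the models currently on the board, a score computed this way shifts whenever a model is added or removed, which is one reason a snapshot like this should be read as dated.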

Top of the board (2026 snapshot; volatile)

Claude Mythos Preview (Anthropic): highest reasoning score.
claude-opus-4-8 (Anthropic): #2 overall.
GPT-5.5 (OpenAI): #3 overall.
Qwen3.7 Max (Alibaba): cheapest model in the top 10, at $1.53/M tokens.
Grok 4 Fast (xAI): largest context window (2.0M tokens).
Mercury 2: fastest output (784 tok/s).

Why it matters here

It operationalizes “best model” as a multi-axis tradeoff (intelligence × speed × price × context) rather than a single number, which makes it an antidote to single-benchmark hype. In this snapshot, Anthropic and OpenAI lead on reasoning, while Alibaba (Qwen) and xAI compete on cost and context. The leaderboard is a dated snapshot; rankings churn weekly.
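
One way to make the multi-axis framing concrete is a Pareto filter: keep every model that no other model beats on all four axes at once. The sketch below is an illustration under assumed names (`Candidate`, `dominates`, `pareto_front` are hypothetical) and is not something the leaderboard itself is known to compute.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    """Hypothetical record of the four axes named in the text."""
    name: str
    intelligence: float    # higher is better
    tokens_per_sec: float  # higher is better
    price: float           # USD per million tokens, lower is better
    context_tokens: int    # higher is better


def _as_maximized(c):
    """Express every axis so that larger always means better."""
    return (c.intelligence, c.tokens_per_sec, -c.price, c.context_tokens)


def dominates(a, b):
    """True if a is at least as good as b on every axis and strictly better on one."""
    av, bv = _as_maximized(a), _as_maximized(b)
    at_least_as_good = all(x >= y for x, y in zip(av, bv))
    strictly_better = any(x > y for x, y in zip(av, bv))
    return at_least_as_good and strictly_better


def pareto_front(candidates):
    """Candidates that no other candidate dominates, in input order."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]
```

A model on the front is a defensible pick for some workload; a model off the front is beaten everywhere by at least one alternative. This stays meaningful even when the weighting behind a single composite number is disputed.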

llm-benchmarks · anthropic · claude-opus-4-8 · llm-provider · llm-api-pricing
