LLM benchmarks & leaderboards
How models are ranked across the llm-provider landscape. The 2026 best practice (llm-leaderboard-stats) is to rank on a composite of several axes rather than on any single benchmark number.
The axes
- Capability benchmarks — GPQA Diamond, SWE-Bench Verified, MMLU, AIME, HumanEval, coding-arena.
- Speed — output throughput (tokens/sec), time-to-first-token.
- Price — per-token cost (llm-api-pricing).
- Context window and agentic/coding performance.
- Aggregators referenced across sources: Artificial Analysis Index, llm-stats.com.
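One way to picture a composite, multi-axis score is a weighted sum of normalized axes. The sketch below is illustrative only: the model names, numbers, weights, and the min-max normalization are all assumptions, not the method used by Artificial Analysis or llm-stats.com.

```python
from dataclasses import dataclass


@dataclass
class ModelStats:
    name: str
    capability: float      # benchmark score, higher is better
    tokens_per_sec: float  # output throughput, higher is better
    price_per_m: float     # USD per million tokens, lower is better


def min_max(values, higher_is_better=True):
    """Scale values to [0, 1]; flip the scale when lower is better."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5 for _ in values]  # no spread on this axis
    scaled = [(v - lo) / (hi - lo) for v in values]
    return scaled if higher_is_better else [1.0 - s for s in scaled]


def composite_scores(models, weights):
    """Weighted average of the normalized axes, one score per model."""
    cap = min_max([m.capability for m in models])
    spd = min_max([m.tokens_per_sec for m in models])
    prc = min_max([m.price_per_m for m in models], higher_is_better=False)
    total = sum(weights.values())
    return {
        m.name: (weights["capability"] * c
                 + weights["speed"] * s
                 + weights["price"] * p) / total
        for m, c, s, p in zip(models, cap, spd, prc)
    }


# Hypothetical placeholder data, not real leaderboard figures.
models = [
    ModelStats("model-a", capability=90.0, tokens_per_sec=50.0, price_per_m=10.0),
    ModelStats("model-b", capability=70.0, tokens_per_sec=200.0, price_per_m=2.0),
    ModelStats("model-c", capability=50.0, tokens_per_sec=100.0, price_per_m=1.0),
]
weights = {"capability": 0.5, "speed": 0.25, "price": 0.25}  # assumed weights
scores = composite_scores(models, weights)
ranking = sorted(scores, key=scores.get, reverse=True)
```

With these placeholder numbers, model-a has the highest capability score but is outranked by the more balanced model-b. Changing the weights changes the order, which is why a composite should always be read together with its weighting.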
2026 snapshot (volatile — churns weekly)
Reasoning leaders are proprietary: Anthropic (Claude Mythos Preview, claude-opus-4-8 #2) and OpenAI (GPT-5.5 #3). Alibaba’s Qwen3.7 Max is the cheapest top-10 model ($1.53/M tokens), xAI’s Grok 4 Fast has the largest context window (2.0M tokens), and Mercury 2 is the fastest (784 tok/s). Among open-weight models, DeepSeek V4 Pro tops the Artificial Analysis Index, and Qwen3.6 reaches 77.2% on SWE-Bench.
The caution
Single-benchmark rankings invite gaming and hype; the open-source-llms-2026 source explicitly argues that practical fit (task performance, license, hardware, cost, speed) matters more than leaderboard rank. Treat every ranking as a dated snapshot, and prefer composite/multi-axis views.
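The practical-fit argument can be sketched as "filter on hard constraints first, then rank by your own task results". Everything in the example below is hypothetical: the candidate names, fields, thresholds, and snapshot date are placeholders chosen for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Candidate:
    name: str
    task_score: float      # score on your own task evaluation, 0..1
    license_ok: bool       # license permits the intended use
    fits_hardware: bool    # runs on the hardware or API budget available
    price_per_m: float     # USD per million tokens
    tokens_per_sec: float  # output throughput
    leaderboard_rank: int  # public rank, kept for reference only
    as_of: date            # when the figures were recorded


def shortlist(candidates, max_price_per_m, min_tokens_per_sec):
    """Drop candidates that fail a hard constraint, then sort by task score."""
    viable = [
        c for c in candidates
        if c.license_ok
        and c.fits_hardware
        and c.price_per_m <= max_price_per_m
        and c.tokens_per_sec >= min_tokens_per_sec
    ]
    return sorted(viable, key=lambda c: c.task_score, reverse=True)


SNAPSHOT = date(2026, 1, 1)  # placeholder date; every entry carries one

candidates = [
    Candidate("big-leader", 0.70, False, True, 4.0, 150.0, 1, SNAPSHOT),
    Candidate("mid-open", 0.82, True, True, 1.0, 120.0, 7, SNAPSHOT),
    Candidate("small-fast", 0.75, True, True, 0.3, 300.0, 15, SNAPSHOT),
    Candidate("pricey", 0.90, True, True, 12.0, 80.0, 2, SNAPSHOT),
]

picks = shortlist(candidates, max_price_per_m=5.0, min_tokens_per_sec=100.0)
pick_names = [c.name for c in picks]
```

In this made-up data, the two best-ranked candidates drop out on license and price, so the shortlist is led by a model ranked seventh on the public leaderboard. The leaderboard rank never enters the sort key, and the as_of field keeps each figure tied to its snapshot date.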
Related
llm-leaderboard-stats · llm-provider · llm-api-pricing · open-weight-models · claude-opus-4-8