Defined Term · concept · updated Mon Jun 01 2026

LLM benchmarks & leaderboards

This page covers how models are ranked across the llm-provider landscape. The 2026 best practice (llm-leaderboard-stats) is to score models on a composite of several axes rather than on a single benchmark number.

The axes

Capability benchmarks — GPQA Diamond, SWE-Bench Verified, MMLU, AIME, HumanEval, coding-arena.
Speed — output throughput (tokens/sec), time-to-first-token.
Price — per-token cost (llm-api-pricing).
Context window and agentic/coding performance.

Aggregators referenced across sources: Artificial Analysis Index, llm-stats.com.
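
As an illustration of what a composite, multi-axis score can look like, here is a minimal sketch that min-max normalizes three of the axes above (capability, speed, price) and combines them with weights. The normalization method, the weights, and the model names and figures are all assumptions made for this example; they are not taken from any aggregator's published methodology.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelStats:
    name: str
    capability: float      # benchmark score, higher is better
    tokens_per_sec: float  # output throughput, higher is better
    price_per_m: float     # USD per million tokens, lower is better


def min_max(values, invert=False):
    """Scale values to [0, 1]; with invert=True the smallest value maps to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All models tie on this axis, so it should not move the ranking.
        return [0.5] * len(values)
    scaled = [(v - lo) / (hi - lo) for v in values]
    return [1.0 - s for s in scaled] if invert else scaled


def composite_scores(models, weights):
    """Return {name: weighted average of the normalized axes}."""
    cap = min_max([m.capability for m in models])
    spd = min_max([m.tokens_per_sec for m in models])
    prc = min_max([m.price_per_m for m in models], invert=True)
    total = sum(weights.values())
    return {
        m.name: (
            weights["capability"] * c
            + weights["speed"] * s
            + weights["price"] * p
        ) / total
        for m, c, s, p in zip(models, cap, spd, prc)
    }


# Hypothetical models and numbers, for illustration only.
MODELS = [
    ModelStats("model-a", capability=80.0, tokens_per_sec=100.0, price_per_m=10.0),
    ModelStats("model-b", capability=70.0, tokens_per_sec=500.0, price_per_m=2.0),
    ModelStats("model-c", capability=60.0, tokens_per_sec=300.0, price_per_m=6.0),
]
WEIGHTS = {"capability": 0.5, "speed": 0.25, "price": 0.25}  # assumed weights

scores = composite_scores(MODELS, WEIGHTS)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

With these made-up numbers, model-a leads on capability alone but model-b ranks first overall once speed and price are weighed in. Changing the weights changes the order, which is why a composite ranking is only meaningful alongside its stated weights.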

2026 snapshot (volatile — churns weekly)

Reasoning leaders are proprietary: Anthropic (Claude Mythos Preview; claude-opus-4-8 at #2) and OpenAI (GPT-5.5 at #3). Alibaba’s Qwen3.7 Max is the cheapest top-10 model ($1.53/M tokens). xAI’s Grok 4 Fast has the largest context window (2.0M tokens). Mercury 2 is the fastest (784 tok/s). Among open-weight models, DeepSeek V4 Pro tops the Artificial Analysis Index, and Qwen3.6 reaches 77.2% on SWE-Bench.

The caution

Single-benchmark rankings invite gaming and hype. The open-source-llms-2026 source explicitly argues that practical fit (task performance, license, hardware, cost, speed) beats leaderboard rank. Treat every ranking as a dated snapshot, and prefer composite or multi-axis views.
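
One way to put practical fit ahead of leaderboard rank is to filter candidates by hard constraints first and only then order the survivors by a score measured on your own task. The sketch below assumes hypothetical field names, thresholds, and candidate data; none of it comes from the sources cited here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    leaderboard_rank: int  # public rank, kept for reference, not used to sort
    license: str
    min_vram_gb: int       # memory needed to self-host (0 for hosted-only)
    price_per_m: float     # USD per million tokens
    task_score: float      # accuracy on your own evaluation set, 0..1


def shortlist(candidates, allowed_licenses, max_vram_gb, max_price_per_m):
    """Drop candidates that break a hard constraint, then sort by task score."""
    fit = [
        c for c in candidates
        if c.license in allowed_licenses
        and c.min_vram_gb <= max_vram_gb
        and c.price_per_m <= max_price_per_m
    ]
    return sorted(fit, key=lambda c: c.task_score, reverse=True)


# Hypothetical candidates, for illustration only.
CANDIDATES = [
    Candidate("model-x", 1, "proprietary", 0, 12.0, 0.90),
    Candidate("model-w", 3, "apache-2.0", 160, 0.8, 0.93),
    Candidate("model-y", 5, "apache-2.0", 48, 1.0, 0.85),
    Candidate("model-z", 9, "mit", 24, 0.5, 0.88),
]

picks = shortlist(
    CANDIDATES,
    allowed_licenses={"apache-2.0", "mit"},
    max_vram_gb=80,
    max_price_per_m=2.0,
)
print([c.name for c in picks])
```

In this made-up case the two highest-ranked models are excluded (one by license, one by hardware), and the model ranked ninth on the public leaderboard comes out on top for the task at hand.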

llm-leaderboard-stats · llm-provider · llm-api-pricing · open-weight-models · claude-opus-4-8
