Web Page source ↗ source url updated Fri Jun 12 2026 00:00:00 GMT+0000 (Coordinated Universal Time)

Artificial Analysis (independent benchmark platform)

The methodology-disclosed, reproducible benchmark the wiki’s open question explicitly asked for — “a neutral, reproducible benchmark would be the highest-value next source.” Artificial Analysis is an independent comparison platform for AI models and API hosting providers, and (per 2026 usage) the most-cited public benchmark source — a step up in rigor from the leaderboard/pricing sources already here (llm-leaderboard-stats, llm-api-pricing-comparison), which the synthesis flags as vendor/SEO-biased. URL-only ingest.

What it measures — the multi-axis frame, made concrete

It scores models on four axes — quality, output speed, price, context — the exact composite shape llm-benchmarks describes:

Quality → the Artificial Analysis Intelligence Index (v4.0): a composite of hard, current evals — GPQA Diamond, Humanity’s Last Exam, τ²-Bench Telecom, Terminal-Bench Hard, SciCode, IFBench, AA-LCR, AA-Omniscience, GDPval-AA, CritPt. Composite-of-hard-benchmarks resists single-test gaming.
Price → a blended price assuming a 7:2:1 ratio of cache-hit : input : output tokens — a single comparable number that operationalizes the synthesis’s “cost is dominated by output tokens; caching/routing sets real cost” thread.
Speed → Time to First Token (request → first token) + output tokens/sec, measured live across provider APIs (so the same model is also compared across hosts, not just labs).

Why it matters here

Answers the neutral-benchmark open question. Transparent, documented, continuously re-run methodology across major APIs — the reproducible yardstick to check the “frontier premium survives?” question against, rather than vendor self-report.
Separates model from host. By benchmarking API providers serving the same weights, it makes the reseller/host layer measurable — where price/speed differ for identical models, the “engineering sets real cost” thesis becomes a number.
The blended-price construct is itself a useful artifact: it encodes that real cost depends on cache hit-rate, sharpening llm-api-pricing beyond sticker $/1M.

Caveat

Still a dated snapshot (rankings/prices churn weekly — the wiki’s standing volatile-data caveat) and the composite weighting + the 7:2:1 blend are editorial choices: neutral and disclosed, but not the only defensible methodology. Independent, not a standards body.

llm-benchmarks · llm-api-pricing · llm-leaderboard-stats · amazon-bedrock · synthesis

Artificial Analysis (independent benchmark platform)

What it measures — the multi-axis frame, made concrete

Why it matters here

Caveat

Related

Linked from