Artificial Analysis (independent benchmark platform)
The methodology-disclosed, reproducible benchmark the wiki’s open question explicitly asked for — “a neutral, reproducible benchmark would be the highest-value next source.” Artificial Analysis is an independent comparison platform for AI models and API hosting providers, and (per 2026 usage) the most-cited public benchmark source — a step up in rigor from the leaderboard/pricing sources already here (llm-leaderboard-stats, llm-api-pricing-comparison), which the synthesis flags as vendor/SEO-biased. URL-only ingest.
What it measures — the multi-axis frame, made concrete
It scores models on four axes — quality, output speed, price, context — the exact composite shape llm-benchmarks describes:
- Quality → the Artificial Analysis Intelligence Index (v4.0): a composite of hard, current evals — GPQA Diamond, Humanity’s Last Exam, τ²-Bench Telecom, Terminal-Bench Hard, SciCode, IFBench, AA-LCR, AA-Omniscience, GDPval-AA, CritPt. Composite-of-hard-benchmarks resists single-test gaming.
- Price → a blended price assuming a 7:2:1 ratio of cache-hit : input : output tokens — a single comparable number that operationalizes the synthesis’s “cost is dominated by output tokens; caching/routing sets real cost” thread.
- Speed → Time to First Token (request → first token) + output tokens/sec, measured live across provider APIs (so the same model is also compared across hosts, not just labs).
Why it matters here
- Answers the neutral-benchmark open question. Transparent, documented, continuously re-run methodology across major APIs — the reproducible yardstick to check the “frontier premium survives?” question against, rather than vendor self-report.
- Separates model from host. By benchmarking API providers serving the same weights, it makes the reseller/host layer measurable — where price/speed differ for identical models, the “engineering sets real cost” thesis becomes a number.
- The blended-price construct is itself a useful artifact: it encodes that real cost depends on cache hit-rate, sharpening llm-api-pricing beyond sticker $/1M.
Caveat
Still a dated snapshot (rankings/prices churn weekly — the wiki’s standing volatile-data caveat) and the composite weighting + the 7:2:1 blend are editorial choices: neutral and disclosed, but not the only defensible methodology. Independent, not a standards body.
Related
llm-benchmarks · llm-api-pricing · llm-leaderboard-stats · amazon-bedrock · synthesis