Vectoring AI
The benchmark that measures what scaling alone cannot solve: fluid intelligence, abstract reasoning, and efficient generalization
A comprehensive benchmark for chart understanding in multimodal LLMs: 2,323 real-world charts from scientific papers with expert-curated questions
A comprehensive benchmark suite from Google DeepMind for systematically evaluating the factuality of LLMs across grounding, parametric knowledge, search, and multimodal tasks
A graduate-level, Google-proof science benchmark where PhD experts reach only 65% — and frontier AI models now surpass them
A participatory physical commonsense reasoning benchmark spanning 116 languages and 65 countries, revealing how LLMs struggle with everyday knowledge across the world
The hardest AI benchmark ever built: 2,500 expert-level questions designed to be the final closed-ended academic exam for AI
How Olympiad medalists judge LLMs in competitive programming — a contamination-free benchmark from Codeforces, ICPC, and IOI where the best model scores 0% on hard problems
Testing LLMs across 14 languages and 57 subjects — OpenAI's professionally human-translated multilingual version of MMLU
A more robust multimodal benchmark — filtering text-solvable questions, expanding to 10 options, and introducing vision-only evaluation across 30 college subjects
Evaluating LLMs on math competitions as they are released — from AIME and Putnam to IMO and USAMO — with proof-writing assessment and real-time, contamination-free evaluation
A comprehensive benchmark for evaluating diverse PDF document parsing — covering text OCR, table recognition, formula extraction, and layout detection across 1,355 real-world pages
A long-context benchmark
A human-validated benchmark of 500 real-world GitHub issues testing whether AI can autonomously resolve software engineering problems
A GUI grounding benchmark for professional high-resolution computer use — testing whether AI can locate tiny UI elements across 23 applications and 5 industries
A factuality benchmark measuring whether language models can answer short, fact-seeking questions — and know when they don't
A dual-control benchmark that tests whether AI agents can guide users through real-world customer service tasks across airline, retail, and telecom domains
A Harbor-native benchmark of 89 expert-crafted tasks measuring how well AI agents master real terminal environments across SWE, ML, security, and data science
A long-horizon benchmark that tests whether LLM agents can coherently operate a vending machine business over months of simulated time
A multi-discipline benchmark evaluating how LLMs acquire knowledge from professional videos through three cognitive stages: Perception, Comprehension, and Adaptation