Vectoring AI
The benchmark that measures what scaling alone cannot solve: fluid intelligence, abstract reasoning, and efficient generalization
A comprehensive benchmark for chart understanding in multimodal LLMs: 2,323 real-world charts from scientific papers with expert-curated questions
A comprehensive benchmark suite from Google DeepMind for systematically evaluating the factuality of LLMs across grounding, parametric knowledge, search, and multimodal tasks
A graduate-level, Google-proof science benchmark where PhD experts reach only 65% — and frontier AI models now surpass them
A participatory physical commonsense reasoning benchmark spanning 116 languages and 65 countries, revealing how LLMs struggle with everyday knowledge across the world
The hardest AI benchmark ever built: 2,500 expert-level questions designed to be the final closed-ended academic exam for AI
How Olympiad medalists judge LLMs in competitive programming — a contamination-free benchmark from Codeforces, ICPC, and IOI where the best model scores 0% on hard problems
Testing LLMs across 14 languages and 57 subjects — OpenAI's professionally human-translated multilingual version of MMLU
A more robust multimodal benchmark — filtering text-solvable questions, expanding to 10 options, and introducing vision-only evaluation across 30 college subjects
Evaluating LLMs on math competitions as they are released — from AIME and Putnam to IMO and USAMO — with proof-writing assessment and real-time, contamination-free evaluation
A comprehensive benchmark for evaluating diverse PDF document parsing — covering text OCR, table recognition, formula extraction, and layout detection across 1,355 real-world pages
A long-context benchmark
A human-validated benchmark of 500 real-world GitHub issues testing whether AI can autonomously resolve software engineering problems
A GUI grounding benchmark for professional high-resolution computer use — testing whether AI can locate tiny UI elements across 23 applications and 5 industries
A factuality benchmark measuring whether language models can answer short, fact-seeking questions — and know when they don't
A dual-control benchmark that tests whether AI agents can guide users through real-world customer service tasks across airline, retail, and telecom domains
A Harbor-native benchmark of 89 expert-crafted tasks measuring how well AI agents master real terminal environments across SWE, ML, security, and data science
A long-horizon benchmark that tests whether LLM agents can coherently operate a vending machine business over months of simulated time
A multi-discipline benchmark evaluating how LLMs acquire knowledge from professional videos through three cognitive stages: Perception, Comprehension, and Adaptation