```mermaid
graph LR
A["Traditional Benchmarks<br/>(MMLU, GPQA, etc.)<br/>Saturated at 90%+"] --> B["Cannot distinguish<br/>frontier models"]
B --> C["ARC-AGI-2<br/>Easy for humans<br/>Hard for AI"]
C --> D["Measures fluid<br/>intelligence gap"]
style A fill:#e74c3c,stroke:#333,color:#fff
style B fill:#f39c12,stroke:#333,color:#fff
style C fill:#27ae60,stroke:#333,color:#fff
style D fill:#3498db,stroke:#333,color:#fff
```
ARC-AGI-2
The benchmark that measures what scaling alone cannot solve: fluid intelligence, abstract reasoning, and efficient generalization
Keywords: ARC-AGI-2, ARC Prize, AGI benchmark, fluid intelligence, abstract reasoning, François Chollet, Mike Knoop, generalization, novel tasks, reasoning systems, efficiency

Introduction
Most AI benchmarks test what models already know — specialized knowledge, pattern recall, or skills that can be prepared for in advance. As frontier models saturate these benchmarks, they become unable to distinguish genuine progress toward general intelligence from incremental scaling.
ARC-AGI-2 takes the opposite approach. It tests what AI systems cannot yet do: adapt efficiently to novel, never-before-seen tasks that require only basic human-level reasoning. Pure LLMs score 0% on ARC-AGI-2, and even the best AI reasoning systems initially scored only single-digit percentages. Yet every task in the benchmark has been solved by at least two humans within two attempts.
This gap — between what is easy for humans and hard for AI — is the essence of what ARC-AGI-2 measures. When this gap reaches zero, we will have achieved AGI.
What Is ARC-AGI-2?
ARC-AGI-2 (Abstraction and Reasoning Corpus for Artificial General Intelligence, version 2) is the second edition of the ARC-AGI benchmark series. It presents AI systems with visual reasoning tasks — input-output grid pairs where the system must discover the underlying transformation rule and apply it to a new input.
Each task is unique and cannot be memorized in advance. Solving them requires the kind of fluid intelligence that humans use effortlessly: recognizing abstract patterns, composing multiple rules, and generalizing from just a few examples.
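To make the format concrete, here is a minimal Python sketch of an ARC-style task. The grids, the candidate rule, and the dictionary layout are invented for illustration; real ARC-AGI-2 tasks are substantially harder and their rules far less obvious.

```python
# Toy illustration of the ARC task format: grids are 2D lists of
# integers, each integer denoting a color. The solver must find a rule
# mapping every train input to its output, then apply it to the test
# input. This task and rule are invented for illustration only.

task = {
    "train": [
        {"input": [[1, 0], [0, 0]], "output": [[0, 0], [0, 1]]},
        {"input": [[0, 2], [0, 0]], "output": [[0, 0], [2, 0]]},
    ],
    "test": [{"input": [[0, 0], [3, 0]]}],
}

def rotate_180(grid):
    """Candidate rule: rotate the grid by 180 degrees."""
    return [row[::-1] for row in grid[::-1]]

# A rule is only accepted if it reproduces every training output exactly.
assert all(rotate_180(p["input"]) == p["output"] for p in task["train"])

prediction = rotate_180(task["test"][0]["input"])
print(prediction)  # [[0, 3], [0, 0]]
```

The point of the format is that verifying a rule against the train pairs is trivial; discovering the rule from two or three examples is the hard part.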
Key Characteristics
| Feature | Details |
|---|---|
| Training set | 1,000 tasks (public, uncalibrated, spectrum of difficulty) |
| Public Eval | 120 tasks (public, calibrated, all solved by humans) |
| Semi-Private Eval | 120 tasks (private, calibrated, Kaggle live leaderboard) |
| Private Eval | 120 tasks (private, calibrated, Kaggle final leaderboard) |
| Format | Input-output grid pairs (colored cells on 2D grids) |
| Measurement | pass@2 (two attempts per task, matching human rules) |
| Human solvability | 100% — every task solved by at least 2 humans in ≤2 attempts |
| Human testing | 400+ participants tested on 1,400+ tasks in controlled settings |
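The pass@2 rule in the table above can be sketched in a few lines of Python. The scoring helper and the data layout are hypothetical simplifications for illustration, not the official harness.

```python
def pass_at_2(attempts_per_task):
    """Score a submission under pass@2: a task counts as solved if
    either of (up to) two predicted grids exactly matches the hidden
    ground-truth grid. Returns the fraction of tasks solved.

    attempts_per_task maps task_id -> (predictions, truth); both this
    helper and the layout are illustrative, mirroring the rule that
    humans also get two attempts per task.
    """
    solved = sum(
        any(pred == truth for pred in preds[:2])
        for preds, truth in attempts_per_task.values()
    )
    return solved / len(attempts_per_task)

results = {
    "task_a": ([[[1]], [[2]]], [[2]]),  # second attempt correct
    "task_b": ([[[0]]], [[5]]),         # one attempt, wrong
}
print(pass_at_2(results))  # 0.5
```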
What Changed from ARC-AGI-1?
ARC-AGI-1, introduced in 2019, was designed to challenge deep learning by resisting memorization. After five years, frontier reasoning systems like OpenAI’s o3-preview reached 75.7% on ARC-AGI-1, demonstrating a nontrivial level of fluid intelligence. ARC-AGI-2 raises the bar significantly:
- Eval sets expanded to 120 tasks each (up from 100)
- Brute-force-resistant tasks — removed tasks susceptible to naive program search
- Controlled human testing to calibrate difficulty and ensure IID (independent and identically distributed) evaluation sets
- New task categories designed to challenge AI reasoning systems: symbolic interpretation, compositional reasoning, and contextual rule application
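The human-calibration step above can be pictured as a simple filter: a candidate task enters an eval set only if enough tested participants solved it within the attempt budget. The data structure and thresholds below are an illustrative sketch of that idea, not the foundation's actual pipeline.

```python
def calibrated(task_results, min_solvers=2, max_attempts=2):
    """Keep a task only if at least `min_solvers` participants solved
    it within `max_attempts` attempts.

    task_results: list with one entry per tested participant, giving
    the attempt on which they solved the task (None = never solved).
    """
    solvers = sum(
        1 for attempts in task_results
        if attempts is not None and attempts <= max_attempts
    )
    return solvers >= min_solvers

print(calibrated([1, 2, None, 3]))  # True: two solvers within 2 tries
print(calibrated([3, None, 1]))     # False: only one solver within 2 tries
```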
```mermaid
graph TD
A["ARC-AGI-1 (2019)<br/>Challenged deep learning<br/>o3: 75.7%"] --> B["Benchmark needs<br/>higher difficulty"]
B --> C["ARC-AGI-2 (2025)<br/>Challenges reasoning<br/>systems"]
C --> D["120 calibrated<br/>eval tasks per set"]
C --> E["Human-validated<br/>400+ testers"]
C --> F["Brute-force<br/>resistant"]
C --> G["New cognitive<br/>challenge categories"]
style A fill:#f39c12,color:#fff,stroke:#333
style C fill:#27ae60,color:#fff,stroke:#333
style D fill:#3498db,color:#fff,stroke:#333
style E fill:#3498db,color:#fff,stroke:#333
style F fill:#3498db,color:#fff,stroke:#333
style G fill:#3498db,color:#fff,stroke:#333
```
Who Built It?
ARC-AGI-2 was developed by the ARC Prize Foundation, a nonprofit advancing open-source AGI research through benchmarks and prizes.
Institutional Support
ARC Prize is trusted by the world’s leading AI labs — OpenAI, Google, Anthropic, and xAI have all acknowledged ARC-AGI as a critical measure of AI progress. Sam Altman, Demis Hassabis, Sundar Pichai, and Elon Musk have publicly endorsed its importance.
| Resource | Link |
|---|---|
| ARC Prize Foundation | arcprize.org |
| ARC-AGI-2 Technical Report | arxiv.org/abs/2505.11831 |
| ARC Prize 2024 Report | arxiv.org/abs/2412.04604 |
What Skills Does It Test?
Unlike benchmarks that test specialized knowledge (PhD-level science, domain expertise), ARC-AGI-2 tests fluid intelligence — the ability to generalize from limited experience and apply knowledge in new, unexpected situations. Every task requires only elementary “Core Knowledge” priors that any human possesses.
```mermaid
graph TD
ARC["ARC-AGI-2<br/>Fluid Intelligence"] --> SI["Symbolic<br/>Interpretation"]
ARC --> CR["Compositional<br/>Reasoning"]
ARC --> CRA["Contextual Rule<br/>Application"]
ARC --> AB["Abstract Pattern<br/>Recognition"]
ARC --> GEN["Generalization<br/>from Few Examples"]
ARC --> EFF["Efficient<br/>Adaptation"]
style ARC fill:#e74c3c,color:#fff,stroke:#333
style SI fill:#3498db,color:#fff,stroke:#333
style CR fill:#27ae60,color:#fff,stroke:#333
style CRA fill:#f39c12,color:#fff,stroke:#333
style AB fill:#8e44ad,color:#fff,stroke:#333
style GEN fill:#e67e22,color:#fff,stroke:#333
style EFF fill:#6cc3d5,color:#fff,stroke:#333
```
Symbolic Interpretation
Frontier AI reasoning systems struggle with tasks requiring symbols to be interpreted as having meaning beyond their visual patterns. Systems attempt symmetry checking, mirroring, and transformations, but fail to assign semantic significance to the symbols themselves.
Compositional Reasoning
AI systems struggle with tasks requiring the simultaneous application of multiple interacting rules. If a task has one simple global rule, AI can discover and apply it. But when multiple rules must be composed together, performance collapses.
Contextual Rule Application
AI systems struggle with tasks where rules must be applied differently based on context. Systems fixate on superficial patterns rather than understanding the underlying selection principles.
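A toy sketch of what rule composition looks like in code, using two invented grid rules. Actual benchmark tasks require discovering such compositions from a few examples, which is precisely where AI performance collapses; merely applying known rules, as below, is the easy part.

```python
# Compositional reasoning sketch: a task whose answer needs two
# interacting rules applied together, not one global rule.
# Both rules here are invented toy examples.

def recolor(grid, mapping):
    """Rule 1: replace colors according to a mapping."""
    return [[mapping.get(c, c) for c in row] for row in grid]

def mirror_horizontal(grid):
    """Rule 2: mirror every row left-to-right."""
    return [row[::-1] for row in grid]

def compose(*rules):
    """Apply rules left to right; the composed rule is the hypothesis."""
    def apply(grid):
        for rule in rules:
            grid = rule(grid)
        return grid
    return apply

hypothesis = compose(lambda g: recolor(g, {1: 2}), mirror_horizontal)
print(hypothesis([[1, 0], [0, 1]]))  # [[0, 2], [2, 0]]
```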
Why These Skills Matter
| Capability | What ARC-AGI-2 Tests | Why AI Fails |
|---|---|---|
| Symbolic interpretation | Assigning meaning to visual patterns | Models treat symbols as shapes, not representations |
| Compositional reasoning | Applying multiple interacting rules | Models can handle single rules, not rule interactions |
| Contextual rule application | Adapting rules to context | Models fixate on surface patterns |
| Efficient generalization | Learning from 2–3 examples | Models require massive data or expensive search |
“Intelligence requires the ability to generalize from limited experience and apply knowledge in new, unexpected situations. AI systems are already superhuman in many specific domains. However, these are narrow, specialized capabilities. The ‘human-AI gap’ reveals what’s missing for general intelligence — highly efficiently acquiring new skills.” — ARC Prize Foundation
Current Leaderboard
The leaderboard below shows ARC-AGI-2 scores from the ARC Prize Leaderboard, which tracks both performance (score) and efficiency (cost per task) — because intelligence is not just about solving problems, but solving them efficiently.
Source: ARC Prize Leaderboard (consulted March 28, 2026). ARC-AGI-2 semi-private eval set (120 tasks), pass@2 scoring.
Top Performers
| Rank | System | Author | Type | ARC-AGI-2 (%) | Cost/Task |
|---|---|---|---|---|---|
| 1 | Human Panel (at least 2) | Human | — | 100.0 | $17.00 |
| 2 | Gemini 3 Deep Think (2/26) | Google | CoT | 84.6 | $13.62 |
| 3 | GPT-5.4 Pro (xHigh) | OpenAI | CoT | 83.3 | $16.41 |
| 4 | Gemini 3.1 Pro (Preview) | Google | CoT | 77.1 | $0.96 |
| 5 | GPT-5.4 (xHigh) | OpenAI | CoT | 74.0 | $1.52 |
| 6 | GPT-5.2 (Refine.) | Johan Land | Refinement | 72.9 | $38.99 |
| 7 | Claude Opus 4.6 (120K, High) | Anthropic | CoT | 69.2 | $3.47 |
| 8 | Claude Opus 4.6 (120K, Max) | Anthropic | CoT | 68.8 | $3.64 |
| 9 | GPT-5.4 (High) | OpenAI | CoT | 67.5 | $1.02 |
| 10 | Claude Opus 4.6 (120K, Medium) | Anthropic | CoT | 66.3 | $2.72 |
| 11 | Grok 4.20 (Reasoning) | xAI | CoT | 65.1 | $0.92 |
| 12 | Claude Opus 4.6 (120K, Low) | Anthropic | CoT | 64.6 | $2.25 |
| 13 | Claude Sonnet 4.6 (High) | Anthropic | CoT | 60.4 | $2.70 |
| 14 | Claude Sonnet 4.6 (Max) | Anthropic | CoT | 58.3 | $2.72 |
| 15 | GPT-5.4 (Medium) | OpenAI | CoT | 55.4 | $0.68 |
How It Started (March 2025 Launch)
For comparison, here are the scores at launch — when ARC-AGI-2 was first introduced:
| System | ARC-AGI-1 (%) | ARC-AGI-2 (%) | Cost/Task |
|---|---|---|---|
| Human Panel (at least 2) | 98.0 | 100.0 | $17.00 |
| o3-preview-low (CoT + Search) | 75.7 | ~4* | $200.00 |
| ARChitects (Kaggle 2024 Winner) | 53.5 | 3.0 | $0.25 |
| o1-pro (CoT + Search) | ~50 | ~1* | $200.00 |
| o3-mini-high (Single CoT) | 35.0 | 0.0 | $0.41 |
| DeepSeek R1 (Single CoT) | 15.8 | 0.3 | $0.08 |
| GPT-4.5 (Pure LLM) | 10.3 | 0.0 | $0.29 |
*Scores marked with an asterisk were in-progress estimates at launch time.
Key takeaway: In one year, frontier reasoning systems went from single-digit scores to over 80% on ARC-AGI-2. But achieving this required expensive reasoning — the top systems cost $13–$17 per task, close to the $17 cost of human performance. The efficiency gap remains a critical measure of progress toward AGI.
```mermaid
graph LR
A["March 2025<br/>Best AI: ~4%<br/>Cost: $200/task"] --> B["March 2026<br/>Best AI: 84.6%<br/>Cost: $13.62/task"]
B --> C["Human baseline<br/>100%<br/>Cost: $17/task"]
style A fill:#e74c3c,color:#fff,stroke:#333
style B fill:#f39c12,color:#fff,stroke:#333
style C fill:#27ae60,color:#fff,stroke:#333
```
Intelligence Is Not Just Capability
A core principle of ARC-AGI-2 is that intelligence must be measured alongside efficiency. Brute-force search could theoretically solve any ARC-AGI task given unlimited compute and time — but that would not represent intelligence.
The ARC Prize leaderboard plots systems on two axes:
- Score (%) — How many tasks the system solves
- Cost per task ($) — How efficiently it solves them
This framing reveals that many high-scoring systems achieve their results through expensive reasoning (thousands of tokens of chain-of-thought per task), while humans solve the same tasks in ~2.3 minutes at ~$17/task.
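The two-axis framing amounts to a Pareto-frontier comparison: a system is only interesting if no other system beats it on both score and cost. The sketch below is illustrative Python, with score/cost pairs taken loosely from the leaderboard above.

```python
# Keep only Pareto-optimal systems: those not beaten on BOTH axes
# (higher-or-equal score AND lower-or-equal cost) by another system.
# Entries are illustrative, not official leaderboard extracts.

systems = {
    "A": (84.6, 13.62),
    "B": (77.1, 0.96),
    "C": (74.0, 1.52),  # dominated by B: lower score, higher cost
}

def pareto_front(entries):
    front = {}
    for name, (score, cost) in entries.items():
        dominated = any(
            s >= score and c <= cost and (s, c) != (score, cost)
            for s, c in entries.values()
        )
        if not dominated:
            front[name] = (score, cost)
    return front

print(pareto_front(systems))  # {'A': (84.6, 13.62), 'B': (77.1, 0.96)}
```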
“The core question being asked is not just, ‘Can AI acquire skill to solve a task?’ but also, ‘At what efficiency or cost?’” — ARC Prize Foundation
ARC Prize Competition
Alongside ARC-AGI-2, the ARC Prize Foundation runs an annual competition on Kaggle to drive open-source progress:
ARC Prize 2025
| Detail | Value |
|---|---|
| Total prizes | $1,000,000 |
| Grand Prize | $700,000 (first team to reach 85% within Kaggle efficiency limits) |
| Top Score Prize | $75,000 |
| Paper Prize | $50,000 (most significant conceptual progress) |
| TBA Prizes | $175,000 |
| Platform | Kaggle |
| Duration | March 26 – November 3, 2025 |
| Compute budget | ~$50 per submission (L4 x4 GPUs) |
| Requirement | Solutions must be open-sourced before private eval scores are awarded |
The 2024 competition attracted over 1,500 teams and generated 40 influential research papers. Winners introduced innovations now adopted across the AI industry.
Note: ARC Prize 2026 has since been announced with over $2,000,000 in prizes and a new ARC-AGI-3 benchmark that measures agentic intelligence.
Where to Explore the Benchmark
Dashboards and Leaderboards
| Resource | Description | Link |
|---|---|---|
| ARC Prize Leaderboard | Official leaderboard with score and efficiency axes | arcprize.org/leaderboard |
| ARC-AGI Task Player | Try ARC-AGI tasks yourself in the browser | arcprize.org/tasks |
| Kaggle Competition | ARC Prize 2025 contest page | arcprize.org/competition |
Dataset and Code
| Resource | Description | Link |
|---|---|---|
| ARC-AGI-2 GitHub | Official dataset repository with training and public eval tasks | github.com/arcprize/ARC-AGI-2 |
| Benchmarking Repo | Run ARC-AGI tasks against multiple model adapters | github.com/arcprize/arc-agi-benchmarking |
| ARC-AGI-2 Technical Report | Full paper with methodology, human testing, and AI results | arxiv.org/abs/2505.11831 |
| ARC Prize 2024 Report | Survey of top approaches and lessons from ARC-AGI-1 | arxiv.org/abs/2412.04604 |
Clone the Dataset
```shell
git clone https://github.com/arcprize/ARC-AGI-2.git
```

Run a Benchmark

```shell
git clone https://github.com/arcprize/arc-agi-benchmarking.git
cd arc-agi-benchmarking
pip install .
python main.py \
    --data_dir data/sample/tasks \
    --config random-baseline \
    --task_id 66e6c45b \
    --save_submission_dir submissions/random-single \
    --log-level INFO
```

Why ARC-AGI-2 Matters
```mermaid
graph LR
A["Benchmark<br/>Saturation"] --> B["Cannot measure<br/>true intelligence"]
B --> C["ARC-AGI-2<br/>fills the gap"]
C --> D["Guides research<br/>toward AGI"]
A2["Scaling alone<br/>≠ Intelligence"] --> B2["Efficiency matters<br/>not just capability"]
B2 --> C
C --> D2["Novel ideas<br/>over brute force"]
style A fill:#e74c3c,color:#fff,stroke:#333
style A2 fill:#e74c3c,color:#fff,stroke:#333
style C fill:#27ae60,color:#fff,stroke:#333
style D fill:#3498db,color:#fff,stroke:#333
style D2 fill:#3498db,color:#fff,stroke:#333
```
- Measures what matters — Fluid intelligence and abstract reasoning, not memorized knowledge
- Easy for humans, hard for AI — Every task is solvable by humans, exposing the real capability gap
- Resists saturation — Designed to challenge the next generation of reasoning systems
- Measures efficiency — Tracks cost per task alongside accuracy, rewarding intelligence over brute force
- Drives open research — $1M+ in prizes requiring open-source solutions, generating influential papers
- Validated by humans — 400+ human participants tested to calibrate difficulty empirically
Conclusion
ARC-AGI-2 represents a fundamental shift in how we measure AI progress:
- 1,360 tasks across training and evaluation sets, designed to resist brute-force and memorization
- Built by the ARC Prize Foundation (François Chollet, Mike Knoop, and team) with support from the world’s leading AI labs
- 100% solvable by humans — validated by 400+ participants in controlled testing
- At launch, the best AI scored ~4%; one year later, the best system reaches 84.6%, though at high cost
- Efficiency matters — the leaderboard tracks both score and cost per task
- Drives open-source research through $1M+ annual competitions on Kaggle
As AI capabilities advance, ARC-AGI-2 provides a clear, measurable signal for genuine progress toward general intelligence. It shows us that scaling alone is not enough — new ideas are needed to close the gap between human and AI reasoning.
As the founders note: “AGI is the most important technology humanity will create, and we believe it is achievable in our lifetime. But ARC shows us we still need new ideas. Maybe they’ll come from you?”
References
- Chollet, F., Knoop, M., Kamradt, G., Landers, B., Pinkard, H. “ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems.” arXiv preprint arXiv:2505.11831 (2025). arxiv.org/abs/2505.11831
- Chollet, F., Knoop, M., Kamradt, G., Landers, B. “ARC Prize 2024: Technical Report.” arXiv preprint arXiv:2412.04604 (2024). arxiv.org/abs/2412.04604
- ARC Prize Foundation. “ARC-AGI-2 + ARC Prize 2025 is Live!” arcprize.org/arc-agi-2 (March 24, 2025)
- ARC Prize Foundation. “ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems — Blog.” arcprize.org/blog/arc-agi-2-technical-report (May 20, 2025)
- ARC Prize Foundation. “ARC Prize Leaderboard.” arcprize.org/leaderboard (consulted March 28, 2026)
- ARC Prize Foundation. “ARC-AGI-2 Dataset.” GitHub. github.com/arcprize/ARC-AGI-2
- ARC Prize Foundation. “ARC-AGI Benchmarking.” GitHub. github.com/arcprize/arc-agi-benchmarking
Read More
- Compare with the hardest academic benchmark — see Humanity’s Last Exam (HLE)
- Deploy models for running your own evaluations — see Deploying and Serving LLM with vLLM
- Scale your evaluation infrastructure — see Scaling LLM Serving for Enterprise Production
- Understand quantization trade-offs for evaluation — see Quantization Methods for LLMs
- Track model costs when running evaluations — see FinOps Best Practices for LLM Applications
- ARC Prize Official Website
- ARC Prize Leaderboard
- Try ARC-AGI Tasks
- ARC-AGI-2 Dataset on GitHub
- ARC Prize Discord Community