ARC-AGI-2

The benchmark that measures what scaling alone cannot solve: fluid intelligence, abstract reasoning, and efficient generalization

Published

August 25, 2025

Keywords: ARC-AGI-2, ARC Prize, AGI benchmark, fluid intelligence, abstract reasoning, François Chollet, Mike Knoop, generalization, novel tasks, reasoning systems, efficiency

Introduction

Most AI benchmarks test what models already know — specialized knowledge, pattern recall, or skills that can be prepared for in advance. As frontier models saturate these benchmarks, the benchmarks lose their ability to distinguish genuine progress toward general intelligence from incremental scaling.

ARC-AGI-2 takes the opposite approach. It tests what AI systems cannot yet do: adapt efficiently to novel, never-before-seen tasks that require only basic human-level reasoning. Pure LLMs score 0% on ARC-AGI-2. Even the best AI reasoning systems initially scored only single-digit percentages. Yet every task in the benchmark has been solved by at least 2 humans within 2 attempts.

This gap — between what is easy for humans and hard for AI — is the essence of what ARC-AGI-2 measures. When this gap reaches zero, we will have achieved AGI.

```mermaid
graph LR
    A["Traditional Benchmarks<br/>(MMLU, GPQA, etc.)<br/>Saturated at 90%+"] --> B["Cannot distinguish<br/>frontier models"]
    B --> C["ARC-AGI-2<br/>Easy for humans<br/>Hard for AI"]
    C --> D["Measures fluid<br/>intelligence gap"]

    style A fill:#e74c3c,stroke:#333,color:#fff
    style B fill:#f39c12,stroke:#333,color:#fff
    style C fill:#27ae60,stroke:#333,color:#fff
    style D fill:#3498db,stroke:#333,color:#fff
```

What Is ARC-AGI-2?

ARC-AGI-2 (Abstraction and Reasoning Corpus for Artificial General Intelligence, version 2) is the second edition of the ARC-AGI benchmark series. It presents AI systems with visual reasoning tasks — input-output grid pairs where the system must discover the underlying transformation rule and apply it to a new input.

Each task is unique and cannot be memorized in advance. Solving them requires the kind of fluid intelligence that humans use effortlessly: recognizing abstract patterns, composing multiple rules, and generalizing from just a few examples.
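Concretely, an ARC task can be sketched as a small Python structure. The task below is a made-up illustration (not from the dataset): the hidden rule is a simple color substitution, inferred from the train pairs and applied to the test input.

```python
# Grids are lists of lists of ints (colors 0-9).
# Hypothetical task: the hidden rule maps each color to another color.
task = {
    "train": [
        {"input": [[1, 2], [2, 1]], "output": [[3, 4], [4, 3]]},
        {"input": [[2, 2], [1, 1]], "output": [[4, 4], [3, 3]]},
    ],
    "test": [{"input": [[1, 1], [2, 2]]}],
}

def infer_color_map(pairs):
    """Learn a cell-wise color substitution from the train pairs."""
    mapping = {}
    for pair in pairs:
        for row_in, row_out in zip(pair["input"], pair["output"]):
            for a, b in zip(row_in, row_out):
                if mapping.setdefault(a, b) != b:
                    raise ValueError("rule is not a simple color map")
    return mapping

def apply_color_map(grid, mapping):
    """Apply the learned substitution to a new grid."""
    return [[mapping[c] for c in row] for row in grid]

rule = infer_color_map(task["train"])
prediction = apply_color_map(task["test"][0]["input"], rule)
print(prediction)  # [[3, 3], [4, 4]]
```

Real ARC-AGI-2 tasks are far harder than this toy rule, but the shape of the problem is the same: a few demonstration pairs, one hidden transformation, one test input.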

Key Characteristics

| Feature | Details |
| --- | --- |
| Training set | 1,000 tasks (public, uncalibrated, spectrum of difficulty) |
| Public Eval | 120 tasks (public, calibrated, all solved by humans) |
| Semi-Private Eval | 120 tasks (private, calibrated, Kaggle live leaderboard) |
| Private Eval | 120 tasks (private, calibrated, Kaggle final leaderboard) |
| Format | Input-output grid pairs (colored cells on 2D grids) |
| Measurement | pass@2 (two attempts per task, matching human rules) |
| Human solvability | 100% — every task solved by at least 2 humans in ≤2 attempts |
| Human testing | 400+ participants tested on 1,400+ tasks in controlled settings |
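The pass@2 rule in the table can be sketched in a few lines: a task counts as solved if either of up to two predicted grids exactly matches the expected output. This is an illustrative sketch, not the official ARC Prize scorer.

```python
# Illustrative pass@2 scoring sketch (not the official scorer).
def task_solved_pass_at_2(attempts, expected):
    """A task is solved if any of up to two predicted grids
    exactly matches the expected output grid."""
    return any(pred == expected for pred in attempts[:2])

def benchmark_score(results):
    """Fraction of tasks solved, where results is a list of
    (attempts, expected_grid) pairs."""
    solved = sum(task_solved_pass_at_2(a, e) for a, e in results)
    return solved / len(results)

results = [
    ([[[1]], [[2]]], [[2]]),  # second attempt matches -> solved
    ([[[0]], [[0]]], [[1]]),  # both attempts wrong -> unsolved
]
print(benchmark_score(results))  # 0.5
```

Note that matching is exact: a single wrong cell anywhere in the grid makes an attempt count as a failure, which is also how human solvers are graded.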

What Changed from ARC-AGI-1?

ARC-AGI-1, introduced in 2019, was designed to challenge deep learning by resisting memorization. After five years, frontier reasoning systems like OpenAI’s o3-preview reached 75.7% on ARC-AGI-1 — the first demonstration of a meaningful degree of fluid intelligence. ARC-AGI-2 raises the bar significantly:

  1. Eval sets expanded to 120 tasks each (up from 100)
  2. Brute-force-resistant tasks — removed tasks susceptible to naive program search
  3. Controlled human testing to calibrate difficulty and ensure IID (independent and identically distributed) evaluation sets
  4. New task categories designed to challenge AI reasoning systems: symbolic interpretation, compositional reasoning, and contextual rule application

```mermaid
graph TD
    A["ARC-AGI-1 (2019)<br/>Challenged deep learning<br/>o3: 75.7%"] --> B["Benchmark needs<br/>higher difficulty"]
    B --> C["ARC-AGI-2 (2025)<br/>Challenges reasoning<br/>systems"]
    C --> D["120 calibrated<br/>eval tasks per set"]
    C --> E["Human-validated<br/>400+ testers"]
    C --> F["Brute-force<br/>resistant"]
    C --> G["New cognitive<br/>challenge categories"]

    style A fill:#f39c12,color:#fff,stroke:#333
    style C fill:#27ae60,color:#fff,stroke:#333
    style D fill:#3498db,color:#fff,stroke:#333
    style E fill:#3498db,color:#fff,stroke:#333
    style F fill:#3498db,color:#fff,stroke:#333
    style G fill:#3498db,color:#fff,stroke:#333
```

Who Built It?

ARC-AGI-2 was developed by the ARC Prize Foundation, a nonprofit advancing open-source AGI research through benchmarks and prizes.

Founders and Authors

  • François Chollet — Creator of Keras, the original ARC-AGI benchmark (2019), and co-founder of ARC Prize Foundation and Ndea. Chollet’s 2019 paper “On the Measure of Intelligence” laid the theoretical foundation for measuring fluid intelligence in AI systems.
  • Mike Knoop — Co-founder of Zapier and Ndea, co-founder of ARC Prize Foundation. Drives the competition infrastructure and community engagement.
  • Gregory Kamradt — President of the ARC Prize Foundation.
  • Bryan Landers — ARC Prize team, technical infrastructure.
  • Henry Pinkard — ARC Prize team, research and human testing.

Institutional Support

ARC Prize is trusted by the world’s leading AI labs — OpenAI, Google, Anthropic, and xAI have all acknowledged ARC-AGI as a critical measure of AI progress. Sam Altman, Demis Hassabis, Sundar Pichai, and Elon Musk have publicly endorsed its importance.

| Resource | Link |
| --- | --- |
| ARC Prize Foundation | arcprize.org |
| ARC-AGI-2 Technical Report | arxiv.org/abs/2505.11831 |
| ARC Prize 2024 Report | arxiv.org/abs/2412.04604 |

What Skills Does It Test?

Unlike benchmarks that test specialized knowledge (PhD-level science, domain expertise), ARC-AGI-2 tests fluid intelligence — the ability to generalize from limited experience and apply knowledge in new, unexpected situations. Every task requires only elementary “Core Knowledge” priors that any human possesses.

```mermaid
graph TD
    ARC["ARC-AGI-2<br/>Fluid Intelligence"] --> SI["Symbolic<br/>Interpretation"]
    ARC --> CR["Compositional<br/>Reasoning"]
    ARC --> CRA["Contextual Rule<br/>Application"]
    ARC --> AB["Abstract Pattern<br/>Recognition"]
    ARC --> GEN["Generalization<br/>from Few Examples"]
    ARC --> EFF["Efficient<br/>Adaptation"]

    style ARC fill:#e74c3c,color:#fff,stroke:#333
    style SI fill:#3498db,color:#fff,stroke:#333
    style CR fill:#27ae60,color:#fff,stroke:#333
    style CRA fill:#f39c12,color:#fff,stroke:#333
    style AB fill:#8e44ad,color:#fff,stroke:#333
    style GEN fill:#e67e22,color:#fff,stroke:#333
    style EFF fill:#6cc3d5,color:#fff,stroke:#333
```

Symbolic Interpretation

Frontier AI reasoning systems struggle with tasks requiring symbols to be interpreted as having meaning beyond their visual patterns. Systems attempt symmetry checking, mirroring, and transformations, but fail to assign semantic significance to the symbols themselves.

Compositional Reasoning

AI systems struggle with tasks requiring the simultaneous application of multiple interacting rules. If a task has one simple global rule, AI can discover and apply it. But when multiple rules must be composed together, performance collapses.
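The difficulty can be illustrated in code: each individual rule is trivial to state, but the solver must discover both the rules and the order in which they compose. A toy sketch with hypothetical rules (not actual ARC tasks):

```python
# Hypothetical illustration: two trivial rules that must be composed.
def recolor(grid, mapping):
    """Rule 1: substitute colors cell by cell."""
    return [[mapping.get(c, c) for c in row] for row in grid]

def mirror_horizontal(grid):
    """Rule 2: flip each row left-to-right."""
    return [list(reversed(row)) for row in grid]

def compose(*steps):
    """Chain several rules into one combined transformation."""
    def combined(grid):
        for step in steps:
            grid = step(grid)
        return grid
    return combined

# A solver must discover BOTH rules and their order from a few examples.
rule = compose(lambda g: recolor(g, {1: 2}), mirror_horizontal)
print(rule([[1, 0], [0, 1]]))  # [[0, 2], [2, 0]]
```

Each rule in isolation is easy to find by search; the combinatorial space of compositions is what makes performance collapse.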

Contextual Rule Application

AI systems struggle with tasks where rules must be applied differently based on context. Systems fixate on superficial patterns rather than understanding the underlying selection principles.

Why These Skills Matter

| Capability | What ARC-AGI-2 Tests | Why AI Fails |
| --- | --- | --- |
| Symbolic interpretation | Assigning meaning to visual patterns | Models treat symbols as shapes, not representations |
| Compositional reasoning | Applying multiple interacting rules | Models can handle single rules, not rule interactions |
| Contextual rule application | Adapting rules to context | Models fixate on surface patterns |
| Efficient generalization | Learning from 2–3 examples | Models require massive data or expensive search |

“Intelligence requires the ability to generalize from limited experience and apply knowledge in new, unexpected situations. AI systems are already superhuman in many specific domains. However, these are narrow, specialized capabilities. The ‘human-AI gap’ reveals what’s missing for general intelligence — highly efficiently acquiring new skills.” — ARC Prize Foundation

Current Leaderboard

The leaderboard below shows ARC-AGI-2 scores from the ARC Prize Leaderboard, which tracks both performance (score) and efficiency (cost per task) — because intelligence is not just about solving problems, but solving them efficiently.

Source: ARC Prize Leaderboard (consulted March 28, 2026). ARC-AGI-2 semi-private eval set (120 tasks), pass@2 scoring.

Top Performers

| Rank | System | Author | Type | ARC-AGI-2 (%) | Cost/Task |
| --- | --- | --- | --- | --- | --- |
| 1 | Human Panel (at least 2) | | Human | 100.0 | $17.00 |
| 2 | Gemini 3 Deep Think (2/26) | Google | CoT | 84.6 | $13.62 |
| 3 | GPT-5.4 Pro (xHigh) | OpenAI | CoT | 83.3 | $16.41 |
| 4 | Gemini 3.1 Pro (Preview) | Google | CoT | 77.1 | $0.96 |
| 5 | GPT-5.4 (xHigh) | OpenAI | CoT | 74.0 | $1.52 |
| 6 | GPT-5.2 (Refine.) | Johan Land | Refinement | 72.9 | $38.99 |
| 7 | Claude Opus 4.6 (120K, High) | Anthropic | CoT | 69.2 | $3.47 |
| 8 | Claude Opus 4.6 (120K, Max) | Anthropic | CoT | 68.8 | $3.64 |
| 9 | GPT-5.4 (High) | OpenAI | CoT | 67.5 | $1.02 |
| 10 | Claude Opus 4.6 (120K, Medium) | Anthropic | CoT | 66.3 | $2.72 |
| 11 | Grok 4.20 (Reasoning) | xAI | CoT | 65.1 | $0.92 |
| 12 | Claude Opus 4.6 (120K, Low) | Anthropic | CoT | 64.6 | $2.25 |
| 13 | Claude Sonnet 4.6 (High) | Anthropic | CoT | 60.4 | $2.70 |
| 14 | Claude Sonnet 4.6 (Max) | Anthropic | CoT | 58.3 | $2.72 |
| 15 | GPT-5.4 (Medium) | OpenAI | CoT | 55.4 | $0.68 |

How It Started (March 2025 Launch)

For comparison, here are the scores at launch — when ARC-AGI-2 was first introduced:

| System | ARC-AGI-1 (%) | ARC-AGI-2 (%) | Cost/Task |
| --- | --- | --- | --- |
| Human Panel (at least 2) | 98.0 | 100.0 | $17.00 |
| o3-preview-low (CoT + Search) | 75.7 | ~4* | $200.00 |
| ARChitects (Kaggle 2024 Winner) | 53.5 | 3.0 | $0.25 |
| o1-pro (CoT + Search) | ~50 | ~1* | $200.00 |
| o3-mini-high (Single CoT) | 35.0 | 0.0 | $0.41 |
| DeepSeek R1 (Single CoT) | 15.8 | 0.3 | $0.08 |
| GPT-4.5 (Pure LLM) | 10.3 | 0.0 | $0.29 |

Scores marked with * were in-progress estimates at launch time.

Key takeaway: In one year, frontier reasoning systems went from single-digit scores to over 80% on ARC-AGI-2. But achieving this required expensive reasoning — the top systems cost $13–$17 per task, close to the $17 cost of human performance. The efficiency gap remains a critical measure of progress toward AGI.

```mermaid
graph LR
    A["March 2025<br/>Best AI: ~4%<br/>Cost: $200/task"] --> B["March 2026<br/>Best AI: 84.6%<br/>Cost: $13.62/task"]
    B --> C["Human baseline<br/>100%<br/>Cost: $17/task"]

    style A fill:#e74c3c,color:#fff,stroke:#333
    style B fill:#f39c12,color:#fff,stroke:#333
    style C fill:#27ae60,color:#fff,stroke:#333
```

Intelligence Is Not Just Capability

A core principle of ARC-AGI-2 is that intelligence must be measured alongside efficiency. Brute-force search could theoretically solve any ARC-AGI task given unlimited compute and time — but that would not represent intelligence.

The ARC Prize leaderboard plots systems on two axes:

  • Score (%) — How many tasks the system solves
  • Cost per task ($) — How efficiently it solves them

This framing reveals that many high-scoring systems achieve their results through expensive reasoning (thousands of tokens of chain-of-thought per task), while humans solve the same tasks in ~2.3 minutes at ~$17/task.
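One way to read the two-axis framing is as a Pareto frontier: a system is interesting only if no other system beats it on both score and cost at once. A sketch using figures from the leaderboard above (the helper function is hypothetical, not ARC Prize code):

```python
# Hypothetical helper: keep systems on the score/cost Pareto frontier.
# A system is dominated if another has score >= and cost <=, and is
# strictly better on at least one axis.
def pareto_frontier(systems):
    frontier = []
    for name, score, cost in systems:
        dominated = any(
            s2 >= score and c2 <= cost and (s2 > score or c2 < cost)
            for _, s2, c2 in systems
        )
        if not dominated:
            frontier.append(name)
    return frontier

# (name, ARC-AGI-2 score %, cost per task $) from the leaderboard above.
systems = [
    ("Gemini 3 Deep Think", 84.6, 13.62),
    ("GPT-5.4 Pro (xHigh)", 83.3, 16.41),
    ("Gemini 3.1 Pro (Preview)", 77.1, 0.96),
    ("GPT-5.4 (xHigh)", 74.0, 1.52),
]
print(pareto_frontier(systems))
# ['Gemini 3 Deep Think', 'Gemini 3.1 Pro (Preview)']
```

On this subset, two systems are dominated outright: they are both outscored and outpriced by another entry, which is exactly the signal a single-axis leaderboard would hide.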

“The core question being asked is not just, ‘Can AI acquire skill to solve a task?’ but also, ‘At what efficiency or cost?’” — ARC Prize Foundation

ARC Prize Competition

Alongside ARC-AGI-2, the ARC Prize Foundation runs an annual competition on Kaggle to drive open-source progress:

ARC Prize 2025

| Detail | Value |
| --- | --- |
| Total prizes | $1,000,000 |
| Grand Prize | $700,000 (first team to reach 85% within Kaggle efficiency limits) |
| Top Score Prize | $75,000 |
| Paper Prize | $50,000 (most significant conceptual progress) |
| TBA Prizes | $175,000 |
| Platform | Kaggle |
| Duration | March 26 – November 3, 2025 |
| Compute budget | ~$50 per submission (4× L4 GPUs) |
| Requirement | Open-source solutions before receiving private eval scores |

The 2024 competition attracted over 1,500 teams and generated 40 influential research papers. Winners introduced innovations now adopted across the AI industry.

Note: ARC Prize 2026 has since been announced with over $2,000,000 in prizes and a new ARC-AGI-3 benchmark that measures agentic intelligence.

Where to Explore the Benchmark

Dashboards and Leaderboards

| Resource | Description | Link |
| --- | --- | --- |
| ARC Prize Leaderboard | Official leaderboard with score and efficiency axes | arcprize.org/leaderboard |
| ARC-AGI Task Player | Try ARC-AGI tasks yourself in the browser | arcprize.org/tasks |
| Kaggle Competition | ARC Prize 2025 contest page | arcprize.org/competition |

Dataset and Code

| Resource | Description | Link |
| --- | --- | --- |
| ARC-AGI-2 GitHub | Official dataset repository with training and public eval tasks | github.com/arcprize/ARC-AGI-2 |
| Benchmarking Repo | Run ARC-AGI tasks against multiple model adapters | github.com/arcprize/arc-agi-benchmarking |
| ARC-AGI-2 Technical Report | Full paper with methodology, human testing, and AI results | arxiv.org/abs/2505.11831 |
| ARC Prize 2024 Report | Survey of top approaches and lessons from ARC-AGI-1 | arxiv.org/abs/2412.04604 |

Clone the Dataset

```shell
git clone https://github.com/arcprize/ARC-AGI-2.git
```
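Task files in the repository use the standard ARC JSON format: a "train" list and a "test" list of {"input": grid, "output": grid} pairs. A minimal loader sketch (the demo file written below is a stand-in so the snippet is self-contained; point `load_task` at any file under `data/training` after cloning):

```python
import json
import tempfile
from pathlib import Path

def load_task(path):
    """Load one ARC task file (standard ARC JSON format:
    "train" and "test" lists of input/output grid pairs)."""
    task = json.loads(Path(path).read_text())
    return task["train"], task["test"]

def describe(train, test):
    """Summarize grid dimensions for each demonstration pair."""
    lines = []
    for i, pair in enumerate(train):
        lines.append(
            f"train {i}: {len(pair['input'])}x{len(pair['input'][0])} "
            f"-> {len(pair['output'])}x{len(pair['output'][0])}"
        )
    lines.append(f"test inputs: {len(test)}")
    return lines

# Stand-in task file for demonstration; replace with a real path like
# ARC-AGI-2/data/training/<task_id>.json after cloning.
demo = {"train": [{"input": [[1, 2]], "output": [[2, 1]]}],
        "test": [{"input": [[3, 4]]}]}
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(demo, f)

train, test = load_task(f.name)
print("\n".join(describe(train, test)))
```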

Run a Benchmark

```shell
git clone https://github.com/arcprize/arc-agi-benchmarking.git
cd arc-agi-benchmarking
pip install .

python main.py \
  --data_dir data/sample/tasks \
  --config random-baseline \
  --task_id 66e6c45b \
  --save_submission_dir submissions/random-single \
  --log-level INFO
```

Why ARC-AGI-2 Matters

```mermaid
graph LR
    A["Benchmark<br/>Saturation"] --> B["Cannot measure<br/>true intelligence"]
    B --> C["ARC-AGI-2<br/>fills the gap"]
    C --> D["Guides research<br/>toward AGI"]

    A2["Scaling alone<br/>≠ Intelligence"] --> B2["Efficiency matters<br/>not just capability"]
    B2 --> C
    C --> D2["Novel ideas<br/>over brute force"]

    style A fill:#e74c3c,color:#fff,stroke:#333
    style A2 fill:#e74c3c,color:#fff,stroke:#333
    style C fill:#27ae60,color:#fff,stroke:#333
    style D fill:#3498db,color:#fff,stroke:#333
    style D2 fill:#3498db,color:#fff,stroke:#333
```

  1. Measures what matters — Fluid intelligence and abstract reasoning, not memorized knowledge
  2. Easy for humans, hard for AI — Every task is solvable by humans, exposing the real capability gap
  3. Resists saturation — Designed to challenge the next generation of reasoning systems
  4. Measures efficiency — Tracks cost per task alongside accuracy, rewarding intelligence over brute force
  5. Drives open research — $1M+ in prizes requiring open-source solutions, generating influential papers
  6. Validated by humans — 400+ human participants tested to calibrate difficulty empirically

Conclusion

ARC-AGI-2 represents a fundamental shift in how we measure AI progress:

  • 1,360 tasks across training and evaluation sets, designed to resist brute-force and memorization
  • Built by the ARC Prize Foundation (François Chollet, Mike Knoop, and team) with support from the world’s leading AI labs
  • 100% solvable by humans — validated by 400+ participants in controlled testing
  • At launch, the best AI scored ~4% — one year later, the best scores 84.6%, but at high cost
  • Efficiency matters — the leaderboard tracks both score and cost per task
  • Drives open-source research through $1M+ annual competitions on Kaggle

As AI capabilities advance, ARC-AGI-2 provides a clear, measurable signal for genuine progress toward general intelligence. It shows us that scaling alone is not enough — new ideas are needed to close the gap between human and AI reasoning.

As the founders note: “AGI is the most important technology humanity will create, and we believe it is achievable in our lifetime. But ARC shows us we still need new ideas. Maybe they’ll come from you?”
