ARC-AGI-2

The benchmark that measures what scaling alone cannot solve: fluid intelligence, abstract reasoning, and efficient generalization

Published

August 25, 2025

Keywords: ARC-AGI-2, ARC Prize, AGI benchmark, fluid intelligence, abstract reasoning, François Chollet, Mike Knoop, generalization, novel tasks, reasoning systems, efficiency

Introduction

Most AI benchmarks test what models already know — specialized knowledge, pattern recall, or skills that can be prepared for in advance. As frontier models saturate these benchmarks, the benchmarks lose their ability to distinguish genuine progress toward general intelligence from incremental scaling.

ARC-AGI-2 takes the opposite approach. It tests what AI systems cannot yet do: adapt efficiently to novel, never-before-seen tasks that require only basic human-level reasoning. Pure LLMs score 0% on ARC-AGI-2. Even the best AI reasoning systems initially scored only single-digit percentages. Yet every task in the benchmark has been solved by at least 2 humans within 2 attempts.

This gap — between what is easy for humans and hard for AI — is the essence of what ARC-AGI-2 measures. When this gap reaches zero, we will have achieved AGI.

```mermaid
graph LR
    A["Traditional Benchmarks<br/>(MMLU, GPQA, etc.)<br/>Saturated at 90%+"] --> B["Cannot distinguish<br/>frontier models"]
    B --> C["ARC-AGI-2<br/>Easy for humans<br/>Hard for AI"]
    C --> D["Measures fluid<br/>intelligence gap"]

    style A fill:#e74c3c,stroke:#333,color:#fff
    style B fill:#f39c12,stroke:#333,color:#fff
    style C fill:#27ae60,stroke:#333,color:#fff
    style D fill:#3498db,stroke:#333,color:#fff
```

What Is ARC-AGI-2?

ARC-AGI-2 (Abstraction and Reasoning Corpus for Artificial General Intelligence, version 2) is the second edition of the ARC-AGI benchmark series. It presents AI systems with visual reasoning tasks — input-output grid pairs where the system must discover the underlying transformation rule and apply it to a new input.

Each task is unique and cannot be memorized in advance. Solving them requires the kind of fluid intelligence that humans use effortlessly: recognizing abstract patterns, composing multiple rules, and generalizing from just a few examples.
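Concretely, an ARC task can be sketched as a small Python structure. The task below is a made-up illustration (not from the dataset): the hidden rule is a simple color substitution, inferred from the train pairs and applied to the test input.

```python
# Grids are lists of lists of ints (colors 0-9).
# Hypothetical task: the hidden rule maps each color to another color.
task = {
    "train": [
        {"input": [[1, 2], [2, 1]], "output": [[3, 4], [4, 3]]},
        {"input": [[2, 2], [1, 1]], "output": [[4, 4], [3, 3]]},
    ],
    "test": [{"input": [[1, 1], [2, 2]]}],
}

def infer_color_map(pairs):
    """Learn a cell-wise color substitution from the train pairs."""
    mapping = {}
    for pair in pairs:
        for row_in, row_out in zip(pair["input"], pair["output"]):
            for a, b in zip(row_in, row_out):
                if mapping.setdefault(a, b) != b:
                    raise ValueError("rule is not a simple color map")
    return mapping

def apply_color_map(grid, mapping):
    """Apply the learned substitution to a new grid."""
    return [[mapping[c] for c in row] for row in grid]

rule = infer_color_map(task["train"])
prediction = apply_color_map(task["test"][0]["input"], rule)
print(prediction)  # [[3, 3], [4, 4]]
```

Real ARC-AGI-2 tasks are far harder than this toy rule, but the shape of the problem is the same: a few demonstration pairs, one hidden transformation, one test input.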

Key Characteristics

| Feature | Details |
| --- | --- |
| Training set | 1,000 tasks (public, uncalibrated, spectrum of difficulty) |
| Public Eval | 120 tasks (public, calibrated, all solved by humans) |
| Semi-Private Eval | 120 tasks (private, calibrated, Kaggle live leaderboard) |
| Private Eval | 120 tasks (private, calibrated, Kaggle final leaderboard) |
| Format | Input-output grid pairs (colored cells on 2D grids) |
| Measurement | pass@2 (two attempts per task, matching human rules) |
| Human solvability | 100% — every task solved by at least 2 humans in ≤2 attempts |
| Human testing | 400+ participants tested on 1,400+ tasks in controlled settings |
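The pass@2 rule in the table can be sketched in a few lines: a task counts as solved if either of up to two predicted grids exactly matches the expected output. This is an illustrative sketch, not the official ARC Prize scorer.

```python
# Illustrative pass@2 scoring sketch (not the official scorer).
def task_solved_pass_at_2(attempts, expected):
    """A task is solved if any of up to two predicted grids
    exactly matches the expected output grid."""
    return any(pred == expected for pred in attempts[:2])

def benchmark_score(results):
    """Fraction of tasks solved, where results is a list of
    (attempts, expected_grid) pairs."""
    solved = sum(task_solved_pass_at_2(a, e) for a, e in results)
    return solved / len(results)

results = [
    ([[[1]], [[2]]], [[2]]),  # second attempt matches -> solved
    ([[[0]], [[0]]], [[1]]),  # both attempts wrong -> unsolved
]
print(benchmark_score(results))  # 0.5
```

Note that matching is exact: a single wrong cell anywhere in the grid makes an attempt count as a failure, which is also how human solvers are graded.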

What Changed from ARC-AGI-1?

ARC-AGI-1, introduced in 2019, was designed to challenge deep learning by resisting memorization. After five years, frontier reasoning systems like OpenAI’s o3-preview reached 75.7% on ARC-AGI-1 — the first demonstration of a meaningful degree of fluid intelligence. ARC-AGI-2 raises the bar significantly:

  1. Eval sets expanded to 120 tasks each (up from 100)
  2. Brute-force-resistant tasks — removed tasks susceptible to naive program search
  3. Controlled human testing to calibrate difficulty and ensure IID (independent and identically distributed) evaluation sets
  4. New task categories designed to challenge AI reasoning systems: symbolic interpretation, compositional reasoning, and contextual rule application

```mermaid
graph TD
    A["ARC-AGI-1 (2019)<br/>Challenged deep learning<br/>o3: 75.7%"] --> B["Benchmark needs<br/>higher difficulty"]
    B --> C["ARC-AGI-2 (2025)<br/>Challenges reasoning<br/>systems"]
    C --> D["120 calibrated<br/>eval tasks per set"]
    C --> E["Human-validated<br/>400+ testers"]
    C --> F["Brute-force<br/>resistant"]
    C --> G["New cognitive<br/>challenge categories"]

    style A fill:#f39c12,color:#fff,stroke:#333
    style C fill:#27ae60,color:#fff,stroke:#333
    style D fill:#3498db,color:#fff,stroke:#333
    style E fill:#3498db,color:#fff,stroke:#333
    style F fill:#3498db,color:#fff,stroke:#333
    style G fill:#3498db,color:#fff,stroke:#333
```

Who Built It?

ARC-AGI-2 was developed by the ARC Prize Foundation, a nonprofit advancing open-source AGI research through benchmarks and prizes.

Founders and Authors

  • François Chollet — Creator of Keras, the original ARC-AGI benchmark (2019), and co-founder of ARC Prize Foundation and Ndea. Chollet’s 2019 paper “On the Measure of Intelligence” laid the theoretical foundation for measuring fluid intelligence in AI systems.
  • Mike Knoop — Co-founder of Zapier and Ndea, co-founder of ARC Prize Foundation. Drives the competition infrastructure and community engagement.
  • Gregory Kamradt — President of the ARC Prize Foundation.
  • Bryan Landers — ARC Prize team, technical infrastructure.
  • Henry Pinkard — ARC Prize team, research and human testing.

Institutional Support

ARC Prize is trusted by the world’s leading AI labs — OpenAI, Google, Anthropic, and xAI have all acknowledged ARC-AGI as a critical measure of AI progress. Sam Altman, Demis Hassabis, Sundar Pichai, and Elon Musk have publicly endorsed its importance.

| Resource | Link |
| --- | --- |
| ARC Prize Foundation | arcprize.org |
| ARC-AGI-2 Technical Report | arxiv.org/abs/2505.11831 |
| ARC Prize 2024 Report | arxiv.org/abs/2412.04604 |

What Skills Does It Test?

Unlike benchmarks that test specialized knowledge (PhD-level science, domain expertise), ARC-AGI-2 tests fluid intelligence — the ability to generalize from limited experience and apply knowledge in new, unexpected situations. Every task requires only elementary “Core Knowledge” priors that any human possesses.

```mermaid
graph TD
    ARC["ARC-AGI-2<br/>Fluid Intelligence"] --> SI["Symbolic<br/>Interpretation"]
    ARC --> CR["Compositional<br/>Reasoning"]
    ARC --> CRA["Contextual Rule<br/>Application"]
    ARC --> AB["Abstract Pattern<br/>Recognition"]
    ARC --> GEN["Generalization<br/>from Few Examples"]
    ARC --> EFF["Efficient<br/>Adaptation"]

    style ARC fill:#e74c3c,color:#fff,stroke:#333
    style SI fill:#3498db,color:#fff,stroke:#333
    style CR fill:#27ae60,color:#fff,stroke:#333
    style CRA fill:#f39c12,color:#fff,stroke:#333
    style AB fill:#8e44ad,color:#fff,stroke:#333
    style GEN fill:#e67e22,color:#fff,stroke:#333
    style EFF fill:#6cc3d5,color:#fff,stroke:#333
```

Symbolic Interpretation

Frontier AI reasoning systems struggle with tasks requiring symbols to be interpreted as having meaning beyond their visual patterns. Systems attempt symmetry checking, mirroring, and transformations, but fail to assign semantic significance to the symbols themselves.

Compositional Reasoning

AI systems struggle with tasks requiring the simultaneous application of multiple interacting rules. If a task has one simple global rule, AI can discover and apply it. But when multiple rules must be composed together, performance collapses.
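The difficulty can be illustrated in code: each individual rule is trivial to state, but the solver must discover both the rules and the order in which they compose. A toy sketch with hypothetical rules (not actual ARC tasks):

```python
# Hypothetical illustration: two trivial rules that must be composed.
def recolor(grid, mapping):
    """Rule 1: substitute colors cell by cell."""
    return [[mapping.get(c, c) for c in row] for row in grid]

def mirror_horizontal(grid):
    """Rule 2: flip each row left-to-right."""
    return [list(reversed(row)) for row in grid]

def compose(*steps):
    """Chain several rules into one combined transformation."""
    def combined(grid):
        for step in steps:
            grid = step(grid)
        return grid
    return combined

# A solver must discover BOTH rules and their order from a few examples.
rule = compose(lambda g: recolor(g, {1: 2}), mirror_horizontal)
print(rule([[1, 0], [0, 1]]))  # [[0, 2], [2, 0]]
```

Each rule in isolation is easy to find by search; the combinatorial space of compositions is what makes performance collapse.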

Contextual Rule Application

AI systems struggle with tasks where rules must be applied differently based on context. Systems fixate on superficial patterns rather than understanding the underlying selection principles.

Why These Skills Matter

| Capability | What ARC-AGI-2 Tests | Why AI Fails |
| --- | --- | --- |
| Symbolic interpretation | Assigning meaning to visual patterns | Models treat symbols as shapes, not representations |
| Compositional reasoning | Applying multiple interacting rules | Models can handle single rules, not rule interactions |
| Contextual rule application | Adapting rules to context | Models fixate on surface patterns |
| Efficient generalization | Learning from 2–3 examples | Models require massive data or expensive search |

“Intelligence requires the ability to generalize from limited experience and apply knowledge in new, unexpected situations. AI systems are already superhuman in many specific domains. However, these are narrow, specialized capabilities. The ‘human-AI gap’ reveals what’s missing for general intelligence — highly efficiently acquiring new skills.” — ARC Prize Foundation

Current Leaderboard

The leaderboard below shows ARC-AGI-2 scores from the ARC Prize Leaderboard, which tracks both performance (score) and efficiency (cost per task) — because intelligence is not just about solving problems, but solving them efficiently.

Source: ARC Prize Leaderboard (consulted March 28, 2026). ARC-AGI-2 semi-private eval set (120 tasks), pass@2 scoring.

Top Performers

| Rank | System | Author | Type | ARC-AGI-2 (%) | Cost/Task |
| --- | --- | --- | --- | --- | --- |
| 1 | Human Panel (at least 2) | | Human | 100.0 | $17.00 |
| 2 | Gemini 3 Deep Think (2/26) | Google | CoT | 84.6 | $13.62 |
| 3 | GPT-5.4 Pro (xHigh) | OpenAI | CoT | 83.3 | $16.41 |
| 4 | Gemini 3.1 Pro (Preview) | Google | CoT | 77.1 | $0.96 |
| 5 | GPT-5.4 (xHigh) | OpenAI | CoT | 74.0 | $1.52 |
| 6 | GPT-5.2 (Refine.) | Johan Land | Refinement | 72.9 | $38.99 |
| 7 | Claude Opus 4.6 (120K, High) | Anthropic | CoT | 69.2 | $3.47 |
| 8 | Claude Opus 4.6 (120K, Max) | Anthropic | CoT | 68.8 | $3.64 |
| 9 | GPT-5.4 (High) | OpenAI | CoT | 67.5 | $1.02 |
| 10 | Claude Opus 4.6 (120K, Medium) | Anthropic | CoT | 66.3 | $2.72 |
| 11 | Grok 4.20 (Reasoning) | xAI | CoT | 65.1 | $0.92 |
| 12 | Claude Opus 4.6 (120K, Low) | Anthropic | CoT | 64.6 | $2.25 |
| 13 | Claude Sonnet 4.6 (High) | Anthropic | CoT | 60.4 | $2.70 |
| 14 | Claude Sonnet 4.6 (Max) | Anthropic | CoT | 58.3 | $2.72 |
| 15 | GPT-5.4 (Medium) | OpenAI | CoT | 55.4 | $0.68 |

How It Started (March 2025 Launch)

For comparison, here are the scores at launch — when ARC-AGI-2 was first introduced:

| System | ARC-AGI-1 (%) | ARC-AGI-2 (%) | Cost/Task |
| --- | --- | --- | --- |
| Human Panel (at least 2) | 98.0 | 100.0 | $17.00 |
| o3-preview-low (CoT + Search) | 75.7 | ~4* | $200.00 |
| ARChitects (Kaggle 2024 Winner) | 53.5 | 3.0 | $0.25 |
| o1-pro (CoT + Search) | ~50 | ~1* | $200.00 |
| o3-mini-high (Single CoT) | 35.0 | 0.0 | $0.41 |
| DeepSeek R1 (Single CoT) | 15.8 | 0.3 | $0.08 |
| GPT-4.5 (Pure LLM) | 10.3 | 0.0 | $0.29 |

Scores marked with * were in-progress estimates at launch time.

Key takeaway: In one year, frontier reasoning systems went from single-digit scores to over 80% on ARC-AGI-2. But achieving this required expensive reasoning — the top systems cost $13–$17 per task, close to the $17 cost of human performance. The efficiency gap remains a critical measure of progress toward AGI.

```mermaid
graph LR
    A["March 2025<br/>Best AI: ~4%<br/>Cost: $200/task"] --> B["March 2026<br/>Best AI: 84.6%<br/>Cost: $13.62/task"]
    B --> C["Human baseline<br/>100%<br/>Cost: $17/task"]

    style A fill:#e74c3c,color:#fff,stroke:#333
    style B fill:#f39c12,color:#fff,stroke:#333
    style C fill:#27ae60,color:#fff,stroke:#333
```

Intelligence Is Not Just Capability

A core principle of ARC-AGI-2 is that intelligence must be measured alongside efficiency. Brute-force search could theoretically solve any ARC-AGI task given unlimited compute and time — but that would not represent intelligence.

The ARC Prize leaderboard plots systems on two axes:

  • Score (%) — How many tasks the system solves
  • Cost per task ($) — How efficiently it solves them

This framing reveals that many high-scoring systems achieve their results through expensive reasoning (thousands of tokens of chain-of-thought per task), while humans solve the same tasks in ~2.3 minutes at ~$17/task.
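One way to read the two-axis framing is as a Pareto frontier: a system is interesting only if no other system beats it on both score and cost at once. A sketch using figures from the leaderboard above (the helper function is hypothetical, not ARC Prize code):

```python
# Hypothetical helper: keep systems on the score/cost Pareto frontier.
# A system is dominated if another has score >= and cost <=, and is
# strictly better on at least one axis.
def pareto_frontier(systems):
    frontier = []
    for name, score, cost in systems:
        dominated = any(
            s2 >= score and c2 <= cost and (s2 > score or c2 < cost)
            for _, s2, c2 in systems
        )
        if not dominated:
            frontier.append(name)
    return frontier

# (name, ARC-AGI-2 score %, cost per task $) from the leaderboard above.
systems = [
    ("Gemini 3 Deep Think", 84.6, 13.62),
    ("GPT-5.4 Pro (xHigh)", 83.3, 16.41),
    ("Gemini 3.1 Pro (Preview)", 77.1, 0.96),
    ("GPT-5.4 (xHigh)", 74.0, 1.52),
]
print(pareto_frontier(systems))
# ['Gemini 3 Deep Think', 'Gemini 3.1 Pro (Preview)']
```

On this subset, two systems are dominated outright: they are both outscored and outpriced by another entry, which is exactly the signal a single-axis leaderboard would hide.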

“The core question being asked is not just, ‘Can AI acquire skill to solve a task?’ but also, ‘At what efficiency or cost?’” — ARC Prize Foundation

ARC Prize Competition

Alongside ARC-AGI-2, the ARC Prize Foundation runs an annual competition on Kaggle to drive open-source progress:

ARC Prize 2025

| Detail | Value |
| --- | --- |
| Total prizes | $1,000,000 |
| Grand Prize | $700,000 (first team to reach 85% within Kaggle efficiency limits) |
| Top Score Prize | $75,000 |
| Paper Prize | $50,000 (most significant conceptual progress) |
| TBA Prizes | $175,000 |
| Platform | Kaggle |
| Duration | March 26 – November 3, 2025 |
| Compute budget | ~$50 per submission (4× L4 GPUs) |
| Requirement | Open-source solutions before receiving private eval scores |

The 2024 competition attracted over 1,500 teams and generated 40 influential research papers. Winners introduced innovations now adopted across the AI industry.

Note: ARC Prize 2026 has since been announced with over $2,000,000 in prizes and a new ARC-AGI-3 benchmark that measures agentic intelligence.

Where to Explore the Benchmark

Dashboards and Leaderboards

| Resource | Description | Link |
| --- | --- | --- |
| ARC Prize Leaderboard | Official leaderboard with score and efficiency axes | arcprize.org/leaderboard |
| ARC-AGI Task Player | Try ARC-AGI tasks yourself in the browser | arcprize.org/tasks |
| Kaggle Competition | ARC Prize 2025 contest page | arcprize.org/competition |

Dataset and Code

| Resource | Description | Link |
| --- | --- | --- |
| ARC-AGI-2 GitHub | Official dataset repository with training and public eval tasks | github.com/arcprize/ARC-AGI-2 |
| Benchmarking Repo | Run ARC-AGI tasks against multiple model adapters | github.com/arcprize/arc-agi-benchmarking |
| ARC-AGI-2 Technical Report | Full paper with methodology, human testing, and AI results | arxiv.org/abs/2505.11831 |
| ARC Prize 2024 Report | Survey of top approaches and lessons from ARC-AGI-1 | arxiv.org/abs/2412.04604 |

Clone the Dataset

```shell
git clone https://github.com/arcprize/ARC-AGI-2.git
```
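Task files in the repository use the standard ARC JSON format: a "train" list and a "test" list of {"input": grid, "output": grid} pairs. A minimal loader sketch (the demo file written below is a stand-in so the snippet is self-contained; point `load_task` at any file under `data/training` after cloning):

```python
import json
import tempfile
from pathlib import Path

def load_task(path):
    """Load one ARC task file (standard ARC JSON format:
    "train" and "test" lists of input/output grid pairs)."""
    task = json.loads(Path(path).read_text())
    return task["train"], task["test"]

def describe(train, test):
    """Summarize grid dimensions for each demonstration pair."""
    lines = []
    for i, pair in enumerate(train):
        lines.append(
            f"train {i}: {len(pair['input'])}x{len(pair['input'][0])} "
            f"-> {len(pair['output'])}x{len(pair['output'][0])}"
        )
    lines.append(f"test inputs: {len(test)}")
    return lines

# Stand-in task file for demonstration; replace with a real path like
# ARC-AGI-2/data/training/<task_id>.json after cloning.
demo = {"train": [{"input": [[1, 2]], "output": [[2, 1]]}],
        "test": [{"input": [[3, 4]]}]}
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(demo, f)

train, test = load_task(f.name)
print("\n".join(describe(train, test)))
```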

Run a Benchmark

```shell
git clone https://github.com/arcprize/arc-agi-benchmarking.git
cd arc-agi-benchmarking
pip install .

python main.py \
  --data_dir data/sample/tasks \
  --config random-baseline \
  --task_id 66e6c45b \
  --save_submission_dir submissions/random-single \
  --log-level INFO
```

Why ARC-AGI-2 Matters

```mermaid
graph LR
    A["Benchmark<br/>Saturation"] --> B["Cannot measure<br/>true intelligence"]
    B --> C["ARC-AGI-2<br/>fills the gap"]
    C --> D["Guides research<br/>toward AGI"]

    A2["Scaling alone<br/>≠ Intelligence"] --> B2["Efficiency matters<br/>not just capability"]
    B2 --> C
    C --> D2["Novel ideas<br/>over brute force"]

    style A fill:#e74c3c,color:#fff,stroke:#333
    style A2 fill:#e74c3c,color:#fff,stroke:#333
    style C fill:#27ae60,color:#fff,stroke:#333
    style D fill:#3498db,color:#fff,stroke:#333
    style D2 fill:#3498db,color:#fff,stroke:#333
```

  1. Measures what matters — Fluid intelligence and abstract reasoning, not memorized knowledge
  2. Easy for humans, hard for AI — Every task is solvable by humans, exposing the real capability gap
  3. Resists saturation — Designed to challenge the next generation of reasoning systems
  4. Measures efficiency — Tracks cost per task alongside accuracy, rewarding intelligence over brute force
  5. Drives open research — $1M+ in prizes requiring open-source solutions, generating influential papers
  6. Validated by humans — 400+ human participants tested to calibrate difficulty empirically

Conclusion

ARC-AGI-2 represents a fundamental shift in how we measure AI progress:

  • 1,360 tasks across training and evaluation sets, designed to resist brute-force and memorization
  • Built by the ARC Prize Foundation (François Chollet, Mike Knoop, and team) with support from the world’s leading AI labs
  • 100% solvable by humans — validated by 400+ participants in controlled testing
  • At launch, the best AI scored ~4% — one year later, the best scores 84.6%, but at high cost
  • Efficiency matters — the leaderboard tracks both score and cost per task
  • Drives open-source research through $1M+ annual competitions on Kaggle

As AI capabilities advance, ARC-AGI-2 provides a clear, measurable signal for genuine progress toward general intelligence. It shows us that scaling alone is not enough — new ideas are needed to close the gap between human and AI reasoning.

As the founders note: “AGI is the most important technology humanity will create, and we believe it is achievable in our lifetime. But ARC shows us we still need new ideas. Maybe they’ll come from you?”
