graph LR
A["Static Math Benchmarks<br/>(AIME 2024, GSM8K)<br/>Contamination risk"] --> B["Can't distinguish<br/>reasoning from<br/>memorization"]
B --> C["MathArena<br/>Live competitions<br/>Continuously updated"]
C --> D["Contamination-free<br/>evaluation of<br/>math reasoning"]
style A fill:#e74c3c,stroke:#333,color:#fff
style B fill:#f39c12,stroke:#333,color:#fff
style C fill:#27ae60,stroke:#333,color:#fff
style D fill:#3498db,stroke:#333,color:#fff
MathArena
Evaluating LLMs on uncontaminated math competitions — from AIME and Putnam to IMO and USAMO — with proof-writing assessment and real-time contamination-free evaluation

Introduction
Mathematical benchmarks like AIME 2024 have become standard tests for frontier LLMs — but many of these problems are widely available online, making it impossible to tell whether a model is genuinely reasoning or simply memorizing solutions. Worse, most math benchmarks only check final numerical answers, completely ignoring whether the model can write a rigorous proof.
MathArena solves both problems. Built by SRI Lab at ETH Zurich and INSAIT, it evaluates LLMs on the latest math competitions and olympiads — problems released so recently that models could not have seen them during training. It is also the first benchmark to evaluate proof-writing capabilities of LLMs on competition mathematics.
“We find strong signs of contamination in AIME 2024. Nonetheless, evaluations on harder competitions, such as CMIMC 2025, demonstrate impressive reasoning capabilities in top-performing models.” — MathArena Paper
What Is MathArena?
MathArena is a platform for rigorous evaluation of LLMs on the latest math competitions and olympiads. Its key insight is that recurring math competitions provide a natural stream of high-quality, challenging problems that can be used for real-time evaluation — effectively eliminating the risk of data contamination.
The platform evaluates models as soon as new competition problems are released. Each model is run 4 times on each problem, and MathArena reports the average score along with the evaluation cost in USD, giving a practical picture of both capability and efficiency.
Key Characteristics
| Feature | Details |
|---|---|
| Problem sources | AIME, USAMO, IMO, IMC, Putnam, Miklós Schweitzer, CMIMC, Project Euler |
| Update frequency | Continuously updated with each new competition |
| Evaluation types | Final-answer competitions and proof-based competitions |
| Runs per problem | 4 (average score computed) |
| Cost tracking | Reports evaluation cost in USD per model |
| Anti-contamination | Real-time evaluation on newly released problems |
| Proof evaluation | First benchmark to assess LLM proof-writing on competition math |
What Makes It Different from Other Math Benchmarks?
graph TD
MA["MathArena"] --> E1["Contamination-Free<br/>Real-time evaluation<br/>on new competitions"]
MA --> E2["Proof-Writing<br/>First benchmark to<br/>assess LLM proofs"]
MA --> E3["Multi-Competition<br/>AIME, IMO, USAMO,<br/>Putnam, and more"]
MA --> E4["Cost Tracking<br/>USD cost per model<br/>evaluation"]
style MA fill:#e74c3c,color:#fff,stroke:#333
style E1 fill:#3498db,color:#fff,stroke:#333
style E2 fill:#27ae60,color:#fff,stroke:#333
style E3 fill:#f39c12,color:#fff,stroke:#333
style E4 fill:#8e44ad,color:#fff,stroke:#333
Two standout features set MathArena apart:
- Proof-writing evaluation — While most math benchmarks only check final numerical answers, MathArena evaluates whether LLMs can construct rigorous mathematical proofs, a crucial capability for real mathematical work
- Contamination detection — By evaluating on freshly released competition problems, MathArena exposes models that perform suspiciously well on older, widely available problems but struggle on new ones
Who Built It?
MathArena was developed by the SRI Lab at ETH Zurich in collaboration with INSAIT (Institute for Computer Science, Artificial Intelligence, and Technology):
- Mislav Balunović — ETH Zurich, SRI Lab
- Jasper Dekoninck — ETH Zurich, SRI Lab (lead contact)
- Ivo Petrov — ETH Zurich, SRI Lab
- Nikola Jovanović — ETH Zurich, SRI Lab
- Martin Vechev — ETH Zurich, SRI Lab (group lead)
Publication
The paper was published in the Datasets and Benchmarks track at NeurIPS 2025, one of the premier venues for benchmark papers in machine learning.
| Resource | Link |
|---|---|
| Main paper | arxiv.org/abs/2505.23281 |
| USAMO paper | arxiv.org/abs/2503.21934 |
| Project page | matharena.ai |
| GitHub | github.com/eth-sri/matharena |
| HuggingFace | huggingface.co/MathArena |
What Skills Does It Test?
MathArena tests the full spectrum of competition-level mathematical reasoning, from final-answer problem solving to formal proof construction, across difficulty levels ranging from advanced high school (AIME) and undergraduate (Putnam) to Olympiad (IMO, USAMO) and graduate level (Miklós Schweitzer).
graph TD
MA["MathArena<br/>Competition Mathematics"] --> A["Algebraic<br/>Reasoning<br/>Equations, inequalities,<br/>number theory"]
MA --> B["Geometric<br/>Reasoning<br/>Euclidean geometry,<br/>transformations"]
MA --> C["Combinatorial<br/>Reasoning<br/>Counting, graph theory,<br/>pigeonhole principle"]
MA --> D["Proof<br/>Construction<br/>Rigorous arguments,<br/>logical deduction"]
MA --> E["Problem<br/>Decomposition<br/>Multi-step solutions,<br/>case analysis"]
MA --> F["Computational<br/>Fluency<br/>Accurate calculation<br/>under constraints"]
style MA fill:#e74c3c,color:#fff,stroke:#333
style A fill:#3498db,color:#fff,stroke:#333
style B fill:#27ae60,color:#fff,stroke:#333
style C fill:#f39c12,color:#fff,stroke:#333
style D fill:#8e44ad,color:#fff,stroke:#333
style E fill:#e67e22,color:#fff,stroke:#333
style F fill:#6cc3d5,color:#fff,stroke:#333
| Capability | What MathArena Tests |
|---|---|
| Algebraic reasoning | Solving equations, inequalities, and number theory problems at competition level |
| Geometric reasoning | Euclidean geometry proofs, coordinate geometry, and geometric transformations |
| Combinatorial reasoning | Counting arguments, graph theory, and discrete mathematics |
| Proof construction | Writing rigorous mathematical proofs (IMO, USAMO-level) |
| Problem decomposition | Breaking complex multi-step problems into manageable subproblems |
| Computational fluency | Performing accurate calculations under time and complexity constraints |
Competition Hierarchy
The competitions covered by MathArena span a wide range of difficulty:
| Competition | Level | Type |
|---|---|---|
| AIME | High school (advanced) | Final-answer |
| CMIMC | Undergraduate | Final-answer |
| Putnam | Undergraduate | Proof-based |
| USAMO | High school (Olympiad) | Proof-based |
| IMO | International Olympiad | Proof-based |
| IMC | International undergraduate | Mixed |
| Miklós Schweitzer | Graduate-level | Proof-based |
| Project Euler | Computational math | Final-answer |
Current Leaderboard
The leaderboard below shows model performance on MathArena as displayed on the official platform. Results represent average scores across 4 runs per problem, with evaluation cost in USD.
Source: MathArena Leaderboard (consulted March 28, 2026). Continuously updated with new competitions and models.
Overall Performance (Final-Answer Competitions)
| Rank | Model | Score (%) | Cost (USD) |
|---|---|---|---|
| 1 | GPT-5.4 (xhigh) | 95.24 | $5.15 |
| 2 | Gemini 3.1 Pro Preview | 74.40 | $2.20 |
| 3 | Claude Opus 4.6 (high) | 47.02 | $13.23 |
| 4 | Step 3.5 Flash | 44.64 | $0.22 |
| 5 | Qwen3.5-397B-A17B | 36.31 | $0.72 |
| 6 | GLM 5 | 35.12 | $1.47 |
Key takeaways:
- GPT-5.4 dominates at 95.24% — but at a cost of $5.15 per evaluation, it is among the most expensive options
- Cost-performance tradeoffs are stark — Step 3.5 Flash achieves 44.64% at just $0.22, while Claude Opus 4.6 costs $13.23 for 47.02%
- Proof-based competitions remain much harder — on IMO 2025, top models achieve slightly less than 40%, showing significant room for improvement
- Contamination matters — models perform suspiciously well on older problems (AIME 2024) but struggle on fresh ones, confirming MathArena’s value as a contamination-free evaluation
- The benchmark is continuously updated with new competitions (USAMO 2026 was added March 28, 2026)
For the full, up-to-date leaderboard across all competitions (AIME, USAMO, IMO, Putnam, and more), visit the official platform linked in the next section.
Where to Explore the Benchmark
Leaderboard and Project
| Resource | Description | Link |
|---|---|---|
| Official platform | Live leaderboard with competition filters, cost tracking, and multi-run analysis | matharena.ai |
| Main paper | Full methodology, contamination analysis, and proof evaluation framework | arxiv.org/abs/2505.23281 |
| USAMO paper | Detailed study of LLMs on 2025 USA Math Olympiad proof problems | arxiv.org/abs/2503.21934 |
| GitHub | Source code, evaluation scripts, and problem data | github.com/eth-sri/matharena |
| HuggingFace | Problem datasets and model outputs | huggingface.co/MathArena |
Understanding the Metrics
Average Score Across 4 Runs
Unlike benchmarks that report a single-run result, MathArena runs each model 4 times per problem and reports the average score. This approach:
- Reduces variance from stochastic sampling
- Reveals consistency — a model that solves a problem 1 out of 4 times is clearly less reliable than one that solves it 4 out of 4
- Provides more robust rankings by smoothing out lucky or unlucky runs
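The multi-run protocol can be sketched in a few lines. This is an illustrative reconstruction, not MathArena's actual harness: `query_model` is a hypothetical callable standing in for an API call that returns a score (e.g. 1.0/0.0 for a checked final answer, or a 0..1 grade for a proof) and a cost in USD for one attempt.

```python
from statistics import mean

def evaluate(model, problems, query_model, n_runs=4):
    """Run a model n_runs times per problem; return (avg score %, total cost USD).

    query_model(model, problem) is a hypothetical stand-in that returns
    (score, cost_usd) for a single attempt.
    """
    per_problem_scores, total_cost = [], 0.0
    for problem in problems:
        run_scores = []
        for _ in range(n_runs):
            score, cost = query_model(model, problem)
            run_scores.append(score)
            total_cost += cost
        # Averaging over runs distinguishes a model that solves a problem
        # 1 out of 4 times (0.25) from one that solves it every time (1.0).
        per_problem_scores.append(mean(run_scores))
    return 100 * mean(per_problem_scores), total_cost
```

The per-problem average is what separates a lucky single run from consistent capability, which is exactly why a multi-run protocol yields more stable rankings than one-shot evaluation.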
Cost in USD
MathArena uniquely reports the API cost of running each model, making it possible to compare models on a cost-efficiency basis. For many use cases, a model that achieves 95% at $5.15 may not be the best choice if another achieves 75% at $2.20.
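Using the figures from the leaderboard table above, a crude cost-efficiency ranking can be computed directly. "Score points per dollar" is just one possible efficiency metric, chosen here for illustration:

```python
# (model, score %, cost USD), taken from the leaderboard table above
results = [
    ("GPT-5.4 (xhigh)", 95.24, 5.15),
    ("Gemini 3.1 Pro Preview", 74.40, 2.20),
    ("Claude Opus 4.6 (high)", 47.02, 13.23),
    ("Step 3.5 Flash", 44.64, 0.22),
    ("Qwen3.5-397B-A17B", 36.31, 0.72),
    ("GLM 5", 35.12, 1.47),
]

# Rank by score points per dollar: higher means cheaper capability
by_efficiency = sorted(results, key=lambda r: r[1] / r[2], reverse=True)
for model, score, cost in by_efficiency:
    print(f"{model:<24} {score:6.2f}%  ${cost:6.2f}  {score / cost:7.1f} pts/$")
```

On this metric Step 3.5 Flash comes out far ahead (about 203 points per dollar), while the top-scoring model ranks near the bottom, which is the stark tradeoff the takeaways above describe.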
Final-Answer vs. Proof-Based
| Type | Evaluation | Competitions |
|---|---|---|
| Final-answer | Model produces a numerical answer; automatically checked | AIME, CMIMC, Project Euler |
| Proof-based | Model writes a mathematical proof; evaluated for rigor | IMO, USAMO, Putnam, Miklós Schweitzer |
Proof-based evaluation is significantly harder — it requires the model to construct step-by-step logical arguments, not just produce a number.
Why MathArena Matters
graph LR
A["Math benchmarks<br/>contaminated by<br/>training data"] --> C["MathArena<br/>uses live competitions<br/>for real-time eval"]
B["No benchmark<br/>tests proof-writing<br/>capabilities"] --> C
C --> D["Contamination-free<br/>genuine reasoning<br/>measurement"]
C --> E["First proof-writing<br/>benchmark for<br/>LLMs"]
style A fill:#e74c3c,color:#fff,stroke:#333
style B fill:#e74c3c,color:#fff,stroke:#333
style C fill:#27ae60,color:#fff,stroke:#333
style D fill:#3498db,color:#fff,stroke:#333
style E fill:#3498db,color:#fff,stroke:#333
- Contamination-free evaluation — By using freshly released competition problems, MathArena provides the cleanest signal of genuine mathematical reasoning
- Proof-writing assessment — The first benchmark to evaluate whether LLMs can construct rigorous mathematical proofs, not just produce final answers
- Cost transparency — Reports evaluation cost in USD, enabling practical cost-performance tradeoffs
- Broad difficulty range — From AIME (advanced high school) to Miklós Schweitzer (graduate-level), covering the full spectrum of mathematical challenge
- Continuously evolving — New competitions are added as they occur, ensuring the benchmark stays relevant and contamination-free
Conclusion
MathArena provides the gold standard for evaluating LLM mathematical reasoning:
- Real-time evaluation on live competitions — AIME, USAMO, IMO, Putnam, and more — eliminates contamination risk
- Proof-writing assessment makes it the first benchmark to test rigorous mathematical argumentation, not just final answers
- Strong signs of contamination found in AIME 2024, validating the need for fresh, uncontaminated evaluation
- Top models achieve ~95% on final-answer problems but less than 40% on proof-based IMO problems — a massive gap that reveals where real mathematical reasoning still falls short
- Cost tracking enables practical decisions about which model to deploy for math-intensive applications
As LLMs advance in mathematical reasoning, MathArena ensures we can measure that progress honestly — with problems the models have never seen and standards that demand genuine understanding, not pattern matching.
References
- Balunović, M., Dekoninck, J., Petrov, I., Jovanović, N., & Vechev, M. “MathArena: Evaluating LLMs on Uncontaminated Math Competitions.” NeurIPS Datasets and Benchmarks 2025. arXiv:2505.23281 (2025). arxiv.org/abs/2505.23281
- Petrov, I., Dekoninck, J., Balunović, M., Jovanović, N., & Vechev, M. “Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad.” arXiv preprint arXiv:2503.21934 (2025). arxiv.org/abs/2503.21934
- MathArena. “Official Platform.” matharena.ai
- SRI Lab at ETH Zurich. “MathArena GitHub Repository.” github.com/eth-sri/matharena
Read More
- Explore the hardest AI benchmark ever built — see Humanity’s Last Exam (HLE)
- Test graduate-level science reasoning — see GPQA Diamond
- Measure abstract reasoning and fluid intelligence — see ARC-AGI-2
- Evaluate competitive programming skills — see LiveCodeBench Pro
- Track model costs when running evaluations — see FinOps Best Practices for LLM Applications