MathArena

Evaluating LLMs on uncontaminated math competitions — from AIME and Putnam to IMO and USAMO — with proof-writing assessment and real-time contamination-free evaluation

Published

September 6, 2025

Keywords: MathArena, math benchmark, LLM evaluation, math olympiad, AIME, IMO, USAMO, Putnam, proof-writing, contamination-free, mathematical reasoning, ETH Zurich, INSAIT, NeurIPS 2025

Introduction

Mathematical benchmarks like AIME 2024 have become standard tests for frontier LLMs — but many of these problems are widely available online, making it impossible to tell whether a model is genuinely reasoning or simply memorizing solutions. Worse, most math benchmarks only check final numerical answers, completely ignoring whether the model can write a rigorous proof.

MathArena solves both problems. Built by SRI Lab at ETH Zurich and INSAIT, it evaluates LLMs on the latest math competitions and olympiads — problems released so recently that models could not have seen them during training. It is also the first benchmark to evaluate proof-writing capabilities of LLMs on competition mathematics.

“We find strong signs of contamination in AIME 2024. Nonetheless, evaluations on harder competitions, such as CMIMC 2025, demonstrate impressive reasoning capabilities in top-performing models.” — MathArena Paper

graph LR
    A["Static Math Benchmarks<br/>(AIME 2024, GSM8K)<br/>Contamination risk"] --> B["Can't distinguish<br/>reasoning from<br/>memorization"]
    B --> C["MathArena<br/>Live competitions<br/>Continuously updated"]
    C --> D["Contamination-free<br/>evaluation of<br/>math reasoning"]

    style A fill:#e74c3c,stroke:#333,color:#fff
    style B fill:#f39c12,stroke:#333,color:#fff
    style C fill:#27ae60,stroke:#333,color:#fff
    style D fill:#3498db,stroke:#333,color:#fff

What Is MathArena?

MathArena is a platform for rigorous evaluation of LLMs on the latest math competitions and olympiads. Its key insight is that recurring math competitions provide a natural stream of high-quality, challenging problems that can be used for real-time evaluation — effectively eliminating the risk of data contamination.

The platform evaluates models as soon as new competition problems are released. Each model is run 4 times on each problem, and MathArena computes both the average score and the cost in USD — giving a practical picture of both capability and efficiency.

Key Characteristics

| Feature | Details |
|---|---|
| Problem sources | AIME, USAMO, IMO, IMC, Putnam, Miklós Schweitzer, CMIMC, Project Euler |
| Update frequency | Continuously updated with each new competition |
| Evaluation types | Final-answer competitions and proof-based competitions |
| Runs per problem | 4 (average score computed) |
| Cost tracking | Reports evaluation cost in USD per model |
| Anti-contamination | Real-time evaluation on newly released problems |
| Proof evaluation | First benchmark to assess LLM proof-writing on competition math |

What Makes It Different from Other Math Benchmarks?

graph TD
    MA["MathArena"] --> E1["Contamination-Free<br/>Real-time evaluation<br/>on new competitions"]
    MA --> E2["Proof-Writing<br/>First benchmark to<br/>assess LLM proofs"]
    MA --> E3["Multi-Competition<br/>AIME, IMO, USAMO,<br/>Putnam, and more"]
    MA --> E4["Cost Tracking<br/>USD cost per model<br/>evaluation"]

    style MA fill:#e74c3c,color:#fff,stroke:#333
    style E1 fill:#3498db,color:#fff,stroke:#333
    style E2 fill:#27ae60,color:#fff,stroke:#333
    style E3 fill:#f39c12,color:#fff,stroke:#333
    style E4 fill:#8e44ad,color:#fff,stroke:#333

Two standout features set MathArena apart:

  1. Proof-writing evaluation — While most math benchmarks only check final numerical answers, MathArena evaluates whether LLMs can construct rigorous mathematical proofs, a crucial capability for real mathematical work
  2. Contamination detection — By evaluating on freshly released competition problems, MathArena exposes models that perform suspiciously well on older, widely available problems but struggle on new ones

Who Built It?

MathArena was developed by the SRI Lab at ETH Zurich in collaboration with INSAIT (Institute for Computer Science, Artificial Intelligence, and Technology):

  • Mislav Balunović — ETH Zurich, SRI Lab
  • Jasper Dekoninck — ETH Zurich, SRI Lab (lead contact)
  • Ivo Petrov — ETH Zurich, SRI Lab
  • Nikola Jovanović — ETH Zurich, SRI Lab
  • Martin Vechev — ETH Zurich, SRI Lab (group lead)

Publication

The paper was published at the NeurIPS 2025 Datasets and Benchmarks track — one of the premier venues for benchmark papers in machine learning.

| Resource | Link |
|---|---|
| Main paper | arxiv.org/abs/2505.23281 |
| USAMO paper | arxiv.org/abs/2503.21934 |
| Project page | matharena.ai |
| GitHub | github.com/eth-sri/matharena |
| HuggingFace | huggingface.co/MathArena |

What Skills Does It Test?

MathArena tests the full spectrum of competition-level mathematical reasoning — from final-answer problem solving to formal proof construction — across difficulty levels ranging from advanced high school (AIME) and undergraduate (Putnam) to Olympiad (IMO, USAMO) and graduate level (Miklós Schweitzer).

graph TD
    MA["MathArena<br/>Competition Mathematics"] --> A["Algebraic<br/>Reasoning<br/>Equations, inequalities,<br/>number theory"]
    MA --> B["Geometric<br/>Reasoning<br/>Euclidean geometry,<br/>transformations"]
    MA --> C["Combinatorial<br/>Reasoning<br/>Counting, graph theory,<br/>pigeonhole principle"]
    MA --> D["Proof<br/>Construction<br/>Rigorous arguments,<br/>logical deduction"]
    MA --> E["Problem<br/>Decomposition<br/>Multi-step solutions,<br/>case analysis"]
    MA --> F["Computational<br/>Fluency<br/>Accurate calculation<br/>under constraints"]

    style MA fill:#e74c3c,color:#fff,stroke:#333
    style A fill:#3498db,color:#fff,stroke:#333
    style B fill:#27ae60,color:#fff,stroke:#333
    style C fill:#f39c12,color:#fff,stroke:#333
    style D fill:#8e44ad,color:#fff,stroke:#333
    style E fill:#e67e22,color:#fff,stroke:#333
    style F fill:#6cc3d5,color:#fff,stroke:#333

| Capability | What MathArena Tests |
|---|---|
| Algebraic reasoning | Solving equations, inequalities, and number theory problems at competition level |
| Geometric reasoning | Euclidean geometry proofs, coordinate geometry, and geometric transformations |
| Combinatorial reasoning | Counting arguments, graph theory, and discrete mathematics |
| Proof construction | Writing rigorous mathematical proofs (IMO, USAMO-level) |
| Problem decomposition | Breaking complex multi-step problems into manageable subproblems |
| Computational fluency | Performing accurate calculations under time and complexity constraints |

Competition Hierarchy

The competitions covered by MathArena span a wide range of difficulty:

| Competition | Level | Type |
|---|---|---|
| AIME | High school (advanced) | Final-answer |
| CMIMC | Undergraduate | Final-answer |
| Putnam | Undergraduate | Proof-based |
| USAMO | High school (Olympiad) | Proof-based |
| IMO | International Olympiad | Proof-based |
| IMC | International undergraduate | Mixed |
| Miklós Schweitzer | Graduate-level | Proof-based |
| Project Euler | Computational math | Final-answer |

Current Leaderboard

The leaderboard below shows model performance on MathArena as displayed on the official platform. Results represent average scores across 4 runs per problem, with evaluation cost in USD.

Source: MathArena Leaderboard (consulted March 28, 2026). Continuously updated with new competitions and models.

Overall Performance (Final-Answer Competitions)

| Rank | Model | Score (%) | Cost (USD) |
|---|---|---|---|
| 1 | GPT-5.4 (xhigh) | 95.24 | $5.15 |
| 2 | Gemini 3.1 Pro Preview | 74.40 | $2.20 |
| 3 | Claude Opus 4.6 (high) | 47.02 | $13.23 |
| 4 | Step 3.5 Flash | 44.64 | $0.22 |
| 5 | Qwen3.5-397B-A17B | 36.31 | $0.72 |
| 6 | GLM 5 | 35.12 | $1.47 |

Key takeaways:

  • GPT-5.4 dominates at 95.24% — but at a cost of $5.15 per evaluation, it is among the most expensive options
  • Cost-performance tradeoffs are stark — Step 3.5 Flash achieves 44.64% at just $0.22, while Claude Opus 4.6 costs $13.23 for 47.02%
  • Proof-based competitions remain much harder — on IMO 2025, top models achieve slightly less than 40%, showing significant room for improvement
  • Contamination matters — models perform suspiciously well on older problems (AIME 2024) but struggle on fresh ones, confirming MathArena’s value as a contamination-free evaluation
  • The benchmark is continuously updated with new competitions (USAMO 2026 was added March 28, 2026)

For the full, up-to-date leaderboard across all competitions (AIME, USAMO, IMO, Putnam, and more), visit the official platform linked in the next section.

Where to Explore the Benchmark

Leaderboard and Project

| Resource | Description | Link |
|---|---|---|
| Official platform | Live leaderboard with competition filters, cost tracking, and multi-run analysis | matharena.ai |
| Main paper | Full methodology, contamination analysis, and proof evaluation framework | arxiv.org/abs/2505.23281 |
| USAMO paper | Detailed study of LLMs on 2025 USA Math Olympiad proof problems | arxiv.org/abs/2503.21934 |
| GitHub | Source code, evaluation scripts, and problem data | github.com/eth-sri/matharena |
| HuggingFace | Problem datasets and model outputs | huggingface.co/MathArena |

Understanding the Metrics

Average Score Across 4 Runs

Unlike benchmarks that report a single-run result, MathArena runs each model 4 times per problem and reports the average score. This approach:

  • Reduces variance from stochastic sampling
  • Reveals consistency — a model that solves a problem 1 out of 4 times is clearly less reliable than one that solves it 4 out of 4
  • Provides more robust rankings by smoothing out lucky or unlucky runs
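The averaging scheme above can be sketched in a few lines of Python. This is an illustrative reconstruction, not MathArena's actual scoring code; the function name and the 0/1 scoring of final-answer problems are assumptions for the example.

```python
# Illustrative sketch of multi-run averaging (not MathArena's actual code).
# runs[i][j] = 1 if run i solved problem j, else 0.

def average_score(runs: list[list[int]]) -> float:
    """Average per-problem solve rate across runs, as a percentage."""
    n_runs = len(runs)
    n_problems = len(runs[0])
    per_problem = [
        sum(run[j] for run in runs) / n_runs for j in range(n_problems)
    ]
    return 100 * sum(per_problem) / n_problems

# A model that solves problem 0 in all 4 runs but problem 1 in only 1 of 4:
runs = [[1, 1], [1, 0], [1, 0], [1, 0]]
print(average_score(runs))  # 62.5
```

A single lucky run here would have reported 100%; the 4-run average of 62.5% is the more honest picture of reliability.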

Cost in USD

MathArena uniquely reports the API cost of running each model, making it possible to compare models on a cost-efficiency basis. Depending on your use case, a model that achieves 95% at $5.15 per evaluation may be a worse choice than one that achieves 75% at $2.20.
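One simple way to read the cost column is as score-per-dollar. The sketch below applies that ratio to the leaderboard figures quoted earlier in this article (the ratio metric itself is our illustration, not an official MathArena statistic):

```python
# Score-per-dollar comparison using the leaderboard figures quoted above.
# The "points per USD" metric is illustrative, not an official MathArena stat.

models = {
    "GPT-5.4 (xhigh)": (95.24, 5.15),
    "Gemini 3.1 Pro Preview": (74.40, 2.20),
    "Step 3.5 Flash": (44.64, 0.22),
}

ranked = sorted(
    models.items(), key=lambda kv: kv[1][0] / kv[1][1], reverse=True
)
for name, (score, cost) in ranked:
    print(f"{name}: {score / cost:.1f} points per USD")
```

On this metric the cheap model wins by an order of magnitude, even though it scores less than half as high in absolute terms — exactly the kind of tradeoff the cost column makes visible.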

Final-Answer vs. Proof-Based

| Type | Evaluation | Competitions |
|---|---|---|
| Final-answer | Model produces a numerical answer; automatically checked | AIME, CMIMC, Project Euler |
| Proof-based | Model writes a mathematical proof; evaluated for rigor | IMO, USAMO, Putnam, Miklós Schweitzer |

Proof-based evaluation is significantly harder — it requires the model to construct step-by-step logical arguments, not just produce a number.
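The automatic checking of final answers can be as simple as normalizing both strings and comparing them. The sketch below is a minimal illustration of that idea; MathArena's actual grader is more involved, and the function names here are our own.

```python
# Minimal sketch of final-answer checking: normalize, then compare.
# Illustrative only — real competition graders handle many more formats.

def normalize(ans: str) -> str:
    """Strip whitespace and $...$ delimiters; canonicalize integers."""
    ans = ans.strip().strip("$").strip()
    if ans.lstrip("-").isdigit():
        return str(int(ans))  # drops leading zeros, e.g. "042" -> "42"
    return ans

def is_correct(model_answer: str, reference: str) -> bool:
    return normalize(model_answer) == normalize(reference)

print(is_correct(" $042$ ", "42"))  # True
print(is_correct("6.28", "42"))     # False
```

Proof grading admits no such shortcut: there is no canonical string to compare against, which is why proof-based competitions require a separate evaluation framework.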

Why MathArena Matters

graph LR
    A["Math benchmarks<br/>contaminated by<br/>training data"] --> C["MathArena<br/>uses live competitions<br/>for real-time eval"]
    B["No benchmark<br/>tests proof-writing<br/>capabilities"] --> C
    C --> D["Contamination-free<br/>genuine reasoning<br/>measurement"]
    C --> E["First proof-writing<br/>benchmark for<br/>LLMs"]

    style A fill:#e74c3c,color:#fff,stroke:#333
    style B fill:#e74c3c,color:#fff,stroke:#333
    style C fill:#27ae60,color:#fff,stroke:#333
    style D fill:#3498db,color:#fff,stroke:#333
    style E fill:#3498db,color:#fff,stroke:#333

  1. Contamination-free evaluation — By using freshly released competition problems, MathArena provides the cleanest signal of genuine mathematical reasoning
  2. Proof-writing assessment — The first benchmark to evaluate whether LLMs can construct rigorous mathematical proofs, not just produce final answers
  3. Cost transparency — Reports evaluation cost in USD, enabling practical cost-performance tradeoffs
  4. Broad difficulty range — From AIME (advanced high school) to Miklós Schweitzer (graduate-level), covering the full spectrum of mathematical challenge
  5. Continuously evolving — New competitions are added as they occur, ensuring the benchmark stays relevant and contamination-free


Conclusion

MathArena provides the gold standard for evaluating LLM mathematical reasoning:

  • Real-time evaluation on live competitions — AIME, USAMO, IMO, Putnam, and more — eliminates contamination risk
  • Proof-writing assessment makes it the first benchmark to test rigorous mathematical argumentation, not just final answers
  • Strong signs of contamination found in AIME 2024, validating the need for fresh, uncontaminated evaluation
  • Top models achieve ~95% on final-answer problems but less than 40% on proof-based IMO problems — a massive gap that reveals where real mathematical reasoning still falls short
  • Cost tracking enables practical decisions about which model to deploy for math-intensive applications

As LLMs advance in mathematical reasoning, MathArena ensures we can measure that progress honestly — with problems the models have never seen and standards that demand genuine understanding, not pattern matching.

References

  • Balunović, M., Dekoninck, J., Petrov, I., Jovanović, N., & Vechev, M. “MathArena: Evaluating LLMs on Uncontaminated Math Competitions.” NeurIPS Datasets and Benchmarks 2025. arXiv:2505.23281 (2025). arxiv.org/abs/2505.23281
  • Petrov, I., Dekoninck, J., Balunović, M., Jovanović, N., & Vechev, M. “Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad.” arXiv preprint arXiv:2503.21934 (2025). arxiv.org/abs/2503.21934
  • MathArena. “Official Platform.” matharena.ai
  • SRI Lab at ETH Zurich. “MathArena GitHub Repository.” github.com/eth-sri/matharena
