graph LR
A["Static Math Benchmarks<br/>(AIME 2024, GSM8K)<br/>Contamination risk"] --> B["Can't distinguish<br/>reasoning from<br/>memorization"]
B --> C["MathArena<br/>Live competitions<br/>Continuously updated"]
C --> D["Contamination-free<br/>evaluation of<br/>math reasoning"]
style A fill:#e74c3c,stroke:#333,color:#fff
style B fill:#f39c12,stroke:#333,color:#fff
style C fill:#27ae60,stroke:#333,color:#fff
style D fill:#3498db,stroke:#333,color:#fff
MathArena
Evaluating LLMs on uncontaminated math competitions — from AIME and Putnam to IMO and USAMO — with proof-writing assessment and real-time contamination-free evaluation

Introduction
Mathematical benchmarks like AIME 2024 have become standard tests for frontier LLMs — but many of these problems are widely available online, making it impossible to tell whether a model is genuinely reasoning or simply memorizing solutions. Worse, most math benchmarks only check final numerical answers, completely ignoring whether the model can write a rigorous proof.
MathArena solves both problems. Built by SRI Lab at ETH Zurich and INSAIT, it evaluates LLMs on the latest math competitions and olympiads — problems released so recently that models could not have seen them during training. It is also the first benchmark to evaluate proof-writing capabilities of LLMs on competition mathematics.
“We find strong signs of contamination in AIME 2024. Nonetheless, evaluations on harder competitions, such as CMIMC 2025, demonstrate impressive reasoning capabilities in top-performing models.” — MathArena Paper
What Is MathArena?
MathArena is a platform for rigorous evaluation of LLMs on the latest math competitions and olympiads. Its key insight is that recurring math competitions provide a natural stream of high-quality, challenging problems that can be used for real-time evaluation — effectively eliminating the risk of data contamination.
The platform evaluates models as soon as new competition problems are released. Each model is run 4 times on each problem, and MathArena reports the average score along with the evaluation cost in USD, giving a practical picture of both capability and efficiency.
Key Characteristics
| Feature | Details |
|---|---|
| Problem sources | AIME, USAMO, IMO, IMC, Putnam, Miklós Schweitzer, CMIMC, Project Euler |
| Update frequency | Continuously updated with each new competition |
| Evaluation types | Final-answer competitions and proof-based competitions |
| Runs per problem | 4 (average score computed) |
| Cost tracking | Reports evaluation cost in USD per model |
| Anti-contamination | Real-time evaluation on newly released problems |
| Proof evaluation | First benchmark to assess LLM proof-writing on competition math |
What Makes It Different from Other Math Benchmarks?
graph TD
MA["MathArena"] --> E1["Contamination-Free<br/>Real-time evaluation<br/>on new competitions"]
MA --> E2["Proof-Writing<br/>First benchmark to<br/>assess LLM proofs"]
MA --> E3["Multi-Competition<br/>AIME, IMO, USAMO,<br/>Putnam, and more"]
MA --> E4["Cost Tracking<br/>USD cost per model<br/>evaluation"]
style MA fill:#e74c3c,color:#fff,stroke:#333
style E1 fill:#3498db,color:#fff,stroke:#333
style E2 fill:#27ae60,color:#fff,stroke:#333
style E3 fill:#f39c12,color:#fff,stroke:#333
style E4 fill:#8e44ad,color:#fff,stroke:#333
Two standout features set MathArena apart:
- Proof-writing evaluation — While most math benchmarks only check final numerical answers, MathArena evaluates whether LLMs can construct rigorous mathematical proofs, a crucial capability for real mathematical work
- Contamination detection — By evaluating on freshly released competition problems, MathArena exposes models that perform suspiciously well on older, widely available problems but struggle on new ones
Who Built It?
MathArena was developed by the SRI Lab at ETH Zurich in collaboration with INSAIT (Institute for Computer Science, Artificial Intelligence, and Technology):
- Mislav Balunović — ETH Zurich, SRI Lab
- Jasper Dekoninck — ETH Zurich, SRI Lab (lead contact)
- Ivo Petrov — ETH Zurich, SRI Lab
- Nikola Jovanović — ETH Zurich, SRI Lab
- Martin Vechev — ETH Zurich, SRI Lab (group lead)
Publication
The paper was published in the Datasets and Benchmarks track at NeurIPS 2025, one of the premier venues for benchmark papers in machine learning.
| Resource | Link |
|---|---|
| Main paper | arxiv.org/abs/2505.23281 |
| USAMO paper | arxiv.org/abs/2503.21934 |
| Project page | matharena.ai |
| GitHub | github.com/eth-sri/matharena |
| HuggingFace | huggingface.co/MathArena |
What Skills Does It Test?
MathArena tests the full spectrum of competition-level mathematical reasoning, from final-answer problem solving to formal proof construction, across difficulty levels ranging from advanced high school (AIME) and undergraduate (Putnam) to Olympiad (IMO, USAMO) and graduate level (Miklós Schweitzer).
graph TD
MA["MathArena<br/>Competition Mathematics"] --> A["Algebraic<br/>Reasoning<br/>Equations, inequalities,<br/>number theory"]
MA --> B["Geometric<br/>Reasoning<br/>Euclidean geometry,<br/>transformations"]
MA --> C["Combinatorial<br/>Reasoning<br/>Counting, graph theory,<br/>pigeonhole principle"]
MA --> D["Proof<br/>Construction<br/>Rigorous arguments,<br/>logical deduction"]
MA --> E["Problem<br/>Decomposition<br/>Multi-step solutions,<br/>case analysis"]
MA --> F["Computational<br/>Fluency<br/>Accurate calculation<br/>under constraints"]
style MA fill:#e74c3c,color:#fff,stroke:#333
style A fill:#3498db,color:#fff,stroke:#333
style B fill:#27ae60,color:#fff,stroke:#333
style C fill:#f39c12,color:#fff,stroke:#333
style D fill:#8e44ad,color:#fff,stroke:#333
style E fill:#e67e22,color:#fff,stroke:#333
style F fill:#6cc3d5,color:#fff,stroke:#333
| Capability | What MathArena Tests |
|---|---|
| Algebraic reasoning | Solving equations, inequalities, and number theory problems at competition level |
| Geometric reasoning | Euclidean geometry proofs, coordinate geometry, and geometric transformations |
| Combinatorial reasoning | Counting arguments, graph theory, and discrete mathematics |
| Proof construction | Writing rigorous mathematical proofs (IMO, USAMO-level) |
| Problem decomposition | Breaking complex multi-step problems into manageable subproblems |
| Computational fluency | Performing accurate calculations under time and complexity constraints |
Competition Hierarchy
The competitions covered by MathArena span a wide range of difficulty:
| Competition | Level | Type |
|---|---|---|
| AIME | High school (advanced) | Final-answer |
| CMIMC | Undergraduate | Final-answer |
| Putnam | Undergraduate | Proof-based |
| USAMO | High school (Olympiad) | Proof-based |
| IMO | International Olympiad | Proof-based |
| IMC | International undergraduate | Mixed |
| Miklós Schweitzer | Graduate-level | Proof-based |
| Project Euler | Computational math | Final-answer |
Current Leaderboard
The leaderboard below shows model performance on MathArena as displayed on the official platform. Results represent average scores across 4 runs per problem, with evaluation cost in USD.
Source: MathArena Leaderboard (consulted March 28, 2026). Continuously updated with new competitions and models.
Overall Performance (Final-Answer Competitions)
| Rank | Model | Score (%) | Cost (USD) |
|---|---|---|---|
| 1 | GPT-5.4 (xhigh) | 95.24 | $5.15 |
| 2 | Gemini 3.1 Pro Preview | 74.40 | $2.20 |
| 3 | Claude Opus 4.6 (high) | 47.02 | $13.23 |
| 4 | Step 3.5 Flash | 44.64 | $0.22 |
| 5 | Qwen3.5-397B-A17B | 36.31 | $0.72 |
| 6 | GLM 5 | 35.12 | $1.47 |
Key takeaways:
- GPT-5.4 dominates at 95.24% — but at a cost of $5.15 per evaluation, it is among the most expensive options
- Cost-performance tradeoffs are stark — Step 3.5 Flash achieves 44.64% at just $0.22, while Claude Opus 4.6 costs $13.23 for 47.02%
- Proof-based competitions remain much harder — on IMO 2025, top models achieve slightly less than 40%, showing significant room for improvement
- Contamination matters — models perform suspiciously well on older problems (AIME 2024) but struggle on fresh ones, confirming MathArena’s value as a contamination-free evaluation
- The benchmark is continuously updated with new competitions (USAMO 2026 was added March 28, 2026)
For the full, up-to-date leaderboard across all competitions (AIME, USAMO, IMO, Putnam, and more), visit the official platform linked in the next section.
Where to Explore the Benchmark
Leaderboard and Project
| Resource | Description | Link |
|---|---|---|
| Official platform | Live leaderboard with competition filters, cost tracking, and multi-run analysis | matharena.ai |
| Main paper | Full methodology, contamination analysis, and proof evaluation framework | arxiv.org/abs/2505.23281 |
| USAMO paper | Detailed study of LLMs on 2025 USA Math Olympiad proof problems | arxiv.org/abs/2503.21934 |
| GitHub | Source code, evaluation scripts, and problem data | github.com/eth-sri/matharena |
| HuggingFace | Problem datasets and model outputs | huggingface.co/MathArena |
Understanding the Metrics
Average Score Across 4 Runs
Unlike benchmarks that report a single-run result, MathArena runs each model 4 times per problem and reports the average score. This approach:
- Reduces variance from stochastic sampling
- Reveals consistency — a model that solves a problem 1 out of 4 times is clearly less reliable than one that solves it 4 out of 4
- Provides more robust rankings by smoothing out lucky or unlucky runs
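The multi-run protocol can be sketched in a few lines. This is an illustrative reconstruction, not MathArena's actual harness: `query_model` is a hypothetical callable standing in for an API call that returns a score (e.g. 1.0/0.0 for a checked final answer, or a 0..1 grade for a proof) and a cost in USD for one attempt.

```python
from statistics import mean

def evaluate(model, problems, query_model, n_runs=4):
    """Run a model n_runs times per problem; return (avg score %, total cost USD).

    query_model(model, problem) is a hypothetical stand-in that returns
    (score, cost_usd) for a single attempt.
    """
    per_problem_scores, total_cost = [], 0.0
    for problem in problems:
        run_scores = []
        for _ in range(n_runs):
            score, cost = query_model(model, problem)
            run_scores.append(score)
            total_cost += cost
        # Averaging over runs distinguishes a model that solves a problem
        # 1 out of 4 times (0.25) from one that solves it every time (1.0).
        per_problem_scores.append(mean(run_scores))
    return 100 * mean(per_problem_scores), total_cost
```

The per-problem average is what separates a lucky single run from consistent capability, which is exactly why a multi-run protocol yields more stable rankings than one-shot evaluation.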
Cost in USD
MathArena uniquely reports the API cost of running each model, making it possible to compare models on a cost-efficiency basis. For many use cases, a model that achieves 95% at $5.15 may not be the best choice if another achieves 75% at $2.20.
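Using the figures from the leaderboard table above, a crude cost-efficiency ranking can be computed directly. "Score points per dollar" is just one possible efficiency metric, chosen here for illustration:

```python
# (model, score %, cost USD), taken from the leaderboard table above
results = [
    ("GPT-5.4 (xhigh)", 95.24, 5.15),
    ("Gemini 3.1 Pro Preview", 74.40, 2.20),
    ("Claude Opus 4.6 (high)", 47.02, 13.23),
    ("Step 3.5 Flash", 44.64, 0.22),
    ("Qwen3.5-397B-A17B", 36.31, 0.72),
    ("GLM 5", 35.12, 1.47),
]

# Rank by score points per dollar: higher means cheaper capability
by_efficiency = sorted(results, key=lambda r: r[1] / r[2], reverse=True)
for model, score, cost in by_efficiency:
    print(f"{model:<24} {score:6.2f}%  ${cost:6.2f}  {score / cost:7.1f} pts/$")
```

On this metric Step 3.5 Flash comes out far ahead (about 203 points per dollar), while the top-scoring model ranks near the bottom, which is the stark tradeoff the takeaways above describe.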
Final-Answer vs. Proof-Based
| Type | Evaluation | Competitions |
|---|---|---|
| Final-answer | Model produces a numerical answer; automatically checked | AIME, CMIMC, Project Euler |
| Proof-based | Model writes a mathematical proof; evaluated for rigor | IMO, USAMO, Putnam, Miklós Schweitzer |
Proof-based evaluation is significantly harder — it requires the model to construct step-by-step logical arguments, not just produce a number.
Why MathArena Matters
graph LR
A["Math benchmarks<br/>contaminated by<br/>training data"] --> C["MathArena<br/>uses live competitions<br/>for real-time eval"]
B["No benchmark<br/>tests proof-writing<br/>capabilities"] --> C
C --> D["Contamination-free<br/>genuine reasoning<br/>measurement"]
C --> E["First proof-writing<br/>benchmark for<br/>LLMs"]
style A fill:#e74c3c,color:#fff,stroke:#333
style B fill:#e74c3c,color:#fff,stroke:#333
style C fill:#27ae60,color:#fff,stroke:#333
style D fill:#3498db,color:#fff,stroke:#333
style E fill:#3498db,color:#fff,stroke:#333
- Contamination-free evaluation — By using freshly released competition problems, MathArena provides the cleanest signal of genuine mathematical reasoning
- Proof-writing assessment — The first benchmark to evaluate whether LLMs can construct rigorous mathematical proofs, not just produce final answers
- Cost transparency — Reports evaluation cost in USD, enabling practical cost-performance tradeoffs
- Broad difficulty range — From AIME (advanced high school) to Miklós Schweitzer (graduate-level), covering the full spectrum of mathematical challenge
- Continuously evolving — New competitions are added as they occur, ensuring the benchmark stays relevant and contamination-free
Conclusion
MathArena provides the gold standard for evaluating LLM mathematical reasoning:
- Real-time evaluation on live competitions — AIME, USAMO, IMO, Putnam, and more — eliminates contamination risk
- Proof-writing assessment makes it the first benchmark to test rigorous mathematical argumentation, not just final answers
- Strong signs of contamination found in AIME 2024, validating the need for fresh, uncontaminated evaluation
- Top models achieve ~95% on final-answer problems but less than 40% on proof-based IMO problems — a massive gap that reveals where real mathematical reasoning still falls short
- Cost tracking enables practical decisions about which model to deploy for math-intensive applications
As LLMs advance in mathematical reasoning, MathArena ensures we can measure that progress honestly — with problems the models have never seen and standards that demand genuine understanding, not pattern matching.
References
- Balunović, M., Dekoninck, J., Petrov, I., Jovanović, N., & Vechev, M. “MathArena: Evaluating LLMs on Uncontaminated Math Competitions.” NeurIPS Datasets and Benchmarks 2025. arXiv:2505.23281 (2025). arxiv.org/abs/2505.23281
- Petrov, I., Dekoninck, J., Balunović, M., Jovanović, N., & Vechev, M. “Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad.” arXiv preprint arXiv:2503.21934 (2025). arxiv.org/abs/2503.21934
- MathArena. “Official Platform.” matharena.ai
- SRI Lab at ETH Zurich. “MathArena GitHub Repository.” github.com/eth-sri/matharena
Read More
- Explore the hardest AI benchmark ever built — see Humanity’s Last Exam (HLE)
- Test graduate-level science reasoning — see GPQA Diamond
- Measure abstract reasoning and fluid intelligence — see ARC-AGI-2
- Evaluate competitive programming skills — see LiveCodeBench Pro
- Track model costs when running evaluations — see FinOps Best Practices for LLM Applications