```mermaid
graph LR
A["MMMU<br/>(Original)<br/>4 options, text shortcuts"] --> B["Models exploit<br/>text-only<br/>shortcuts"]
B --> C["MMMU-Pro<br/>10 options, vision-only,<br/>text-solvable filtered"]
C --> D["True multimodal<br/>understanding<br/>assessment"]
style A fill:#e74c3c,stroke:#333,color:#fff
style B fill:#f39c12,stroke:#333,color:#fff
style C fill:#27ae60,stroke:#333,color:#fff
style D fill:#3498db,stroke:#333,color:#fff
```
MMMU-Pro
A more robust multimodal benchmark — filtering text-solvable questions, expanding to 10 options, and introducing vision-only evaluation across 30 college subjects
Keywords: MMMU-Pro, MMMU, multimodal benchmark, vision-language model, LMM evaluation, college-level reasoning, visual understanding, multi-discipline, ACL 2025, expert AGI, OCR, Chain of Thought, 10-option multiple choice

Introduction
Multimodal benchmarks like MMMU have become central to evaluating vision-language models — but many questions in MMMU can be answered without even looking at the images. Text-only LLMs can exploit option-elimination shortcuts, and with only 4 answer choices, random guessing already gives a 25% baseline.
MMMU-Pro was built to fix these weaknesses. It takes the original MMMU benchmark and applies a rigorous three-step hardening process: (1) filtering out questions answerable by text-only models, (2) expanding from 4 to 10 candidate options, and (3) introducing a vision-only input setting where questions are embedded inside screenshots. The result is a dramatically harder benchmark — performance drops from 69% to 52% for top models, exposing genuine multimodal understanding gaps.
“MMMU-Pro rigorously assesses multimodal models’ true understanding and reasoning capabilities… This setting challenges AI to truly ‘see’ and ‘read’ simultaneously, testing a fundamental human cognitive skill.” — MMMU-Pro Paper
What Is MMMU-Pro?
MMMU-Pro is a hardened version of the MMMU (Massive Multi-discipline Multimodal Understanding) benchmark. It evaluates multimodal AI models on college-level questions across 30 subjects in 6 disciplines — but with three critical improvements that make shortcut-based solving much harder.
The Three-Step Hardening Process
```mermaid
graph TD
A["Step 1: Filter<br/>Remove questions<br/>answerable by<br/>text-only LLMs"] --> B["Step 2: Augment<br/>Expand from 4 to<br/>10 candidate options"]
B --> C["Step 3: Vision-Only<br/>Embed questions in<br/>screenshots — no<br/>separate text input"]
style A fill:#e74c3c,color:#fff,stroke:#333
style B fill:#f39c12,color:#fff,stroke:#333
style C fill:#27ae60,color:#fff,stroke:#333
```
- Filtering text-solvable questions — Four strong open-source LLMs attempt each MMMU question without images. Questions consistently answered correctly are excluded, ensuring the remaining problems truly require visual understanding
- Augmenting candidate options — The number of answer choices increases from 4 to 10, reducing the guessing baseline from 25% to 10% and making option-elimination strategies much less effective
- Vision-only input setting — Questions are embedded within screenshots or photos (with varied backgrounds, fonts, and sizes), requiring models to simultaneously “see” and “read” — a fundamental human cognitive skill
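As a concrete sketch, the filtering in Step 1 can be approximated like this. The question schema, the model-call interface, and the exclusion threshold are illustrative assumptions, not the paper's exact protocol (the paper runs four strong open-source LLMs without images and excludes questions they consistently answer):

```python
# Illustrative sketch of Step 1 (text-only filtering). The model
# interface and threshold here are assumptions, not the paper's setup.

def filter_text_solvable(questions, text_only_models, threshold=0.5):
    """Keep only questions that text-only models fail to answer reliably.

    questions: list of dicts with "prompt", "options", "answer" keys.
    text_only_models: callables mapping (prompt, options) -> option letter.
    """
    kept = []
    for q in questions:
        correct = sum(
            model(q["prompt"], q["options"]) == q["answer"]
            for model in text_only_models
        )
        # Exclude questions most text-only models solve without the image.
        if correct / len(text_only_models) < threshold:
            kept.append(q)
    return kept

# Toy example: a "model" that always guesses option "A".
always_a = lambda prompt, options: "A"
qs = [
    {"prompt": "Q1", "options": ["A", "B"], "answer": "A"},  # text-solvable
    {"prompt": "Q2", "options": ["A", "B"], "answer": "B"},  # needs the image
]
print([q["prompt"] for q in filter_text_solvable(qs, [always_a])])  # ['Q2']
```

Only Q2 survives: the text-only "model" already answers Q1 correctly, so Q1 is treated as not requiring the image and is dropped.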
Key Characteristics
| Feature | Details |
|---|---|
| Base benchmark | MMMU (college-level multimodal questions) |
| Subjects | 30 across 6 disciplines |
| Evaluation modes | Standard (10 options) and Vision-only |
| Total questions | ~1,730 (Standard), ~1,730 (Vision) |
| Question format | Multiple-choice (10 options) |
| Image types | 30 types: diagrams, charts, chemical structures, music sheets, medical images, etc. |
| Anti-shortcut | Text-solvable questions filtered out |
| Random baseline | 10% (vs. 25% in original MMMU) |
| Publication | ACL 2025 Main Conference |
What Makes It Different from MMMU?
| Feature | MMMU | MMMU-Pro |
|---|---|---|
| Answer options | 4 | 10 |
| Text-only filtering | No | Yes — removes text-solvable questions |
| Vision-only mode | No | Yes — question embedded in screenshot |
| Random baseline | 25% | 10% |
| Best model performance | ~85% | ~81% (Standard), ~78% (Vision) |
Who Built It?
MMMU-Pro was developed by a multi-institutional research team led by:
- Xiang Yue — Carnegie Mellon University (lead author)
- Tianyu Zheng, Yuansheng Ni, Yubo Wang — Core contributors
- Kai Zhang, Shengbang Tong, Yuxuan Sun, Botao Yu — Key researchers
- Ge Zhang, Huan Sun, Yu Su — Ohio State University
- Wenhu Chen — University of Waterloo
- Graham Neubig — Carnegie Mellon University
The original MMMU benchmark was published at CVPR 2024, and MMMU-Pro was accepted at the ACL 2025 Main Conference.
Publication and Resources
| Resource | Link |
|---|---|
| arXiv paper | arxiv.org/abs/2409.02813 |
| Homepage | mmmu-benchmark.github.io |
| GitHub | github.com/MMMU-Benchmark/MMMU |
| Dataset (HuggingFace) | huggingface.co/datasets/MMMU/MMMU_Pro |
| EvalAI Server | eval.ai/web/challenges/challenge-page/2179 |
What Skills Does It Test?
MMMU-Pro tests expert-level multimodal understanding and reasoning across six core academic disciplines — specifically targeting capabilities that require genuine integration of visual and textual information.
```mermaid
graph TD
MMMUPro["MMMU-Pro<br/>Multimodal Understanding"] --> A["Art & Design<br/>Paintings, architecture,<br/>design principles"]
MMMUPro --> B["Business<br/>Accounting, economics,<br/>marketing"]
MMMUPro --> C["Science<br/>Physics, chemistry,<br/>biology"]
MMMUPro --> D["Health & Medicine<br/>Clinical knowledge,<br/>medical imaging"]
MMMUPro --> E["Humanities &<br/>Social Science<br/>History, psychology,<br/>sociology"]
MMMUPro --> F["Tech &<br/>Engineering<br/>CS, electrical eng.,<br/>materials science"]
style MMMUPro fill:#e74c3c,color:#fff,stroke:#333
style A fill:#3498db,color:#fff,stroke:#333
style B fill:#27ae60,color:#fff,stroke:#333
style C fill:#f39c12,color:#fff,stroke:#333
style D fill:#8e44ad,color:#fff,stroke:#333
style E fill:#e67e22,color:#fff,stroke:#333
style F fill:#6cc3d5,color:#fff,stroke:#333
```
| Capability | What MMMU-Pro Tests |
|---|---|
| Visual perception | Interpreting diagrams, charts, chemical structures, medical images, music sheets, and 25+ other image types |
| Domain knowledge | College-level expertise across 30 subjects in 6 disciplines |
| Multimodal reasoning | Integrating visual and textual information to solve multi-step problems |
| Vision-text integration | Reading and reasoning from questions embedded in screenshots (vision-only mode) |
| Robustness to shortcuts | Resisting option-elimination and text-only solving strategies |
| OCR capability | Extracting text from images to answer questions in the vision-only setting |
The 30 Image Types
MMMU-Pro questions span an extraordinary range of visual formats:
| Category | Examples |
|---|---|
| Charts & Data | Plots, tables, graphs, timelines |
| Scientific | Chemical structures, DNA sequences, medical images, MRI/CT scans |
| Technical | Diagrams, blueprints, geometric shapes, mathematical notations |
| Artistic | Paintings, sculptures, photographs, portraits |
| Other | Music sheets, maps, comics, logos, advertisements |
Current Leaderboard
The leaderboard below shows model performance on MMMU-Pro (Standard, 10 options), with original MMMU (Val) scores alongside for comparison, as published on the official MMMU benchmark website. Human expert performance is included for reference.
Source: MMMU Benchmark Leaderboard (consulted March 28, 2026). Last updated September 5, 2025. Zero-shot evaluation.
MMMU-Pro (Standard, 10 Options) and MMMU (Val)
| Rank | Model | MMMU-Pro (%) | MMMU Val (%) |
|---|---|---|---|
| — | Human Expert (High) | 85.4 | 88.6 |
| 1 | Gemini 3.0 Pro | 81.0 | — |
| — | Human Expert (Medium) | 80.8 | 82.6 |
| 2 | GPT-5 w/ thinking | 78.4 | 84.2 |
| 3 | o3 | 76.4 | 82.9 |
| 4 | GPT-5.1 | 76.0 | 85.4 |
| — | Human Expert (Low) | 73.0 | 76.2 |
| 5 | dots.vlm1 (37B) | 70.1 | 80.1 |
| 6 | Qwen3-VL 235B-A22B | 68.1 | 78.7 |
| 7 | Gemini 2.5 Pro (05-06) | 68.0 | 79.6 |
| 8 | Seed 1.5-VL Thinking (20B) | 67.6 | 77.9 |
| 9 | Seed 1.6-Thinking (20B) | 66.4 | 74.8 |
| 10 | GLM-4.5V w/ Thinking (12B) | 65.2 | 75.4 |
| 11 | GPT-5 w/o thinking | 62.7 | 74.4 |
| 12 | Seed 1.5-VL (20B) | 59.9 | 73.6 |
| 13 | Skywork-R1V3-38B | 55.4 | 76.0 |
| 14 | GPT-4o (0513) | 51.9 | 69.1 |
| 15 | Claude 3.5 Sonnet | 51.5 | 68.3 |
Key takeaways:
- Gemini 3.0 Pro leads at 81.0% on MMMU-Pro — approaching but not yet matching the high human expert level (85.4%)
- Massive performance drop from MMMU to MMMU-Pro — GPT-5.1 scores 85.4% on MMMU Val but drops to 76.0% on MMMU-Pro, confirming the hardening is effective
- Human experts still lead — the high expert baseline (85.4%) has only recently been approached by the top model
- Thinking/reasoning capabilities matter — GPT-5 with thinking (78.4%) significantly outperforms GPT-5 without thinking (62.7%), a nearly 16-point gap
- Open-source models are competitive — dots.vlm1 (37B) at 70.1% and Qwen3-VL at 68.1% show strong multimodal reasoning from open-weight models
- Random choice baseline is just 12.6% on MMMU-Pro, versus 22.1% on MMMU Val — confirming the 10-option format dramatically reduces guessing
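The idealized 1/k guessing baselines (1/4 = 25%, 1/10 = 10%) can be sanity-checked with a quick simulation; note the benchmark's reported baselines (12.6% and 22.1%) are measured empirically on the real option sets, so they differ slightly from the idealized figures. An illustrative throwaway sketch:

```python
import random

def simulated_guess_accuracy(n_options, trials=100_000, seed=0):
    """Empirical accuracy of uniform random guessing over n_options
    (the correct option is fixed at index 0 without loss of generality)."""
    rng = random.Random(seed)
    hits = sum(rng.randrange(n_options) == 0 for _ in range(trials))
    return hits / trials

print(round(simulated_guess_accuracy(4), 3))   # close to 0.25
print(round(simulated_guess_accuracy(10), 3))  # close to 0.10
```

Moving from 4 to 10 options therefore cuts the score a pure guesser can expect by more than half, which is exactly why the option augmentation sharpens the benchmark's signal.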
For the full leaderboard with all models and settings, visit the official benchmark website linked in the next section.
Where to Explore the Benchmark
Leaderboard and Project
| Resource | Description | Link |
|---|---|---|
| Official Leaderboard | Full rankings across MMMU-Pro, MMMU Val, and MMMU Test with filters for open-source/proprietary | mmmu-benchmark.github.io/#leaderboard |
| arXiv Paper | Full technical paper with methodology, filtering analysis, and results | arxiv.org/abs/2409.02813 |
| EvalAI Server | Submit your own model for official evaluation on the test set | eval.ai/web/challenges/challenge-page/2179 |
Dataset and Code
| Resource | Description | Link |
|---|---|---|
| MMMU-Pro Dataset | Standard (10 options) and Vision subsets on Hugging Face | huggingface.co/datasets/MMMU/MMMU_Pro |
| MMMU Dataset | Original MMMU with 11.5K questions (test answers released Feb 2026) | huggingface.co/datasets/MMMU/MMMU |
| GitHub Repository | Evaluation code and benchmark infrastructure | github.com/MMMU-Benchmark/MMMU |
Load the Dataset
```python
from datasets import load_dataset

# Standard setting (10 options)
mmmu_pro_standard = load_dataset("MMMU/MMMU_Pro", "standard (10 options)")

# Vision-only setting
mmmu_pro_vision = load_dataset("MMMU/MMMU_Pro", "vision")

# Standard setting (4 options, for comparison)
mmmu_pro_4 = load_dataset("MMMU/MMMU_Pro", "standard (4 options)")
```

Understanding the Metrics
Accuracy (Zero-Shot)
Models are evaluated in a zero-shot setting — no fine-tuning or few-shot examples. Each model receives a question (with images in Standard mode, or a screenshot in Vision mode) and must select the correct answer from 10 options.
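A minimal scoring loop for this zero-shot protocol might look like the following; `predict` stands in for any model call, and the record fields are an assumption mirroring the dataset's multiple-choice format:

```python
# Minimal zero-shot accuracy sketch. `predict` is a placeholder for an
# actual model call; the record schema is an illustrative assumption.
import string

OPTION_LETTERS = string.ascii_uppercase[:10]  # "A".."J" for 10 options

def accuracy(records, predict):
    """records: iterable of dicts with "question", "options", "answer".
    predict: callable returning one option letter, e.g. "C"."""
    correct = 0
    total = 0
    for r in records:
        letters = OPTION_LETTERS[: len(r["options"])]
        pred = predict(r["question"], r["options"])
        # Only an in-range letter matching the gold answer counts.
        if pred in letters and pred == r["answer"]:
            correct += 1
        total += 1
    return correct / total if total else 0.0

# Toy check with a fixed "model" that always answers "A":
demo = [
    {"question": "q1", "options": ["x"] * 10, "answer": "A"},
    {"question": "q2", "options": ["x"] * 10, "answer": "B"},
]
print(accuracy(demo, lambda q, o: "A"))  # 0.5
```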
Standard vs. Vision-Only
| Mode | Input | Challenge |
|---|---|---|
| Standard (10 options) | Text question + images, 10 answer choices | Expanded options reduce guessing; text-solvable questions removed |
| Vision-only | Screenshot containing question + images embedded together | Model must “see” and “read” simultaneously from a single image |
| Standard (4 options) | Same as Standard but with 4 options (for comparison) | Intermediate difficulty between MMMU and full MMMU-Pro |
Key Findings on Evaluation Methods
The paper explored two additional prompting strategies:
- OCR prompts — Asking the model to first extract text from images before answering. Result: minimal effect on performance
- Chain of Thought (CoT) — Asking the model to reason step by step. Result: generally improves performance across models
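By way of illustration, the two prompting strategies can be contrasted with simple templates; the exact wording here is an assumption, not the paper's prompts:

```python
# Illustrative direct vs. chain-of-thought prompt templates.
# The wording is an assumption, not the paper's exact prompts.

def format_options(options):
    """Render options as "(A) ...", "(B) ...", etc."""
    return "\n".join(f"({chr(65 + i)}) {o}" for i, o in enumerate(options))

def direct_prompt(question, options):
    """Ask for the answer letter with no intermediate reasoning."""
    return (f"{question}\n{format_options(options)}\n"
            "Answer with the option letter only.")

def cot_prompt(question, options):
    """Ask the model to reason step by step before answering."""
    return (f"{question}\n{format_options(options)}\n"
            "Think step by step, then give the final option letter.")

print(cot_prompt("Which functional group is circled?", ["Ketone", "Ester"]))
```

The only difference is the final instruction line, which is what makes the CoT comparison in the paper a clean ablation of the reasoning step.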
```mermaid
graph LR
A["Standard<br/>(10 options)<br/>Text + Images"] --> D["Harder than<br/>MMMU"]
B["Vision-Only<br/>Screenshot input<br/>only"] --> D
C["OCR / CoT<br/>Prompting<br/>strategies"] --> E["CoT helps,<br/>OCR has<br/>minimal effect"]
style A fill:#3498db,color:#fff,stroke:#333
style B fill:#27ae60,color:#fff,stroke:#333
style C fill:#f39c12,color:#fff,stroke:#333
style D fill:#8e44ad,color:#fff,stroke:#333
style E fill:#e67e22,color:#fff,stroke:#333
```
Why MMMU-Pro Matters
```mermaid
graph LR
A["MMMU questions<br/>solvable without<br/>images"] --> C["MMMU-Pro<br/>filters shortcuts,<br/>10 options,<br/>vision-only"]
B["4-option guessing<br/>inflates scores"] --> C
C --> D["True multimodal<br/>understanding<br/>measurement"]
C --> E["Reveals gap between<br/>perception and<br/>reasoning"]
style A fill:#e74c3c,color:#fff,stroke:#333
style B fill:#e74c3c,color:#fff,stroke:#333
style C fill:#27ae60,color:#fff,stroke:#333
style D fill:#3498db,color:#fff,stroke:#333
style E fill:#3498db,color:#fff,stroke:#333
```
- Eliminates text-only shortcuts — By filtering questions answerable without images, MMMU-Pro ensures models must genuinely understand visual content
- Reduces guessing advantage — 10 options drop the random baseline from 25% to 10%, providing a cleaner signal
- Tests real-world vision-text integration — The vision-only mode mirrors how humans encounter information: text and images embedded together in documents, slides, and photos
- 30 image types across 6 disciplines — From chemical structures and medical imaging to music sheets and architectural blueprints
- Accepted at ACL 2025 — Peer-reviewed and validated by the NLP/multimodal research community
Conclusion
MMMU-Pro sets a higher bar for multimodal AI evaluation:
- Three-step hardening — filtering text-solvable questions, expanding to 10 options, and introducing vision-only input — exposes genuine multimodal understanding gaps
- Published at ACL 2025 and built on the CVPR 2024 MMMU benchmark covering 30 subjects, 6 disciplines, and 30 image types
- The best model (Gemini 3.0 Pro) scores 81.0% on MMMU-Pro, approaching but not yet matching the high human expert level (85.4%)
- Thinking capabilities provide a nearly 16-point boost — GPT-5 with thinking (78.4%) vs. without (62.7%) — showing that reasoning is critical for hard multimodal problems
- Substantial MMMU-to-MMMU-Pro drop confirms that many prior benchmark scores were inflated by text-only shortcuts and narrow option sets
As multimodal AI advances toward expert-level understanding, MMMU-Pro provides the essential stress test: can your model truly see, read, and reason — or is it just exploiting shortcuts?
References
- Yue, X., Zheng, T., Ni, Y., Wang, Y., Zhang, K., Tong, S., Sun, Y., Yu, B., Zhang, G., Sun, H., Su, Y., Chen, W., & Neubig, G. “MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark.” ACL 2025 Main. arXiv:2409.02813 (2024). arxiv.org/abs/2409.02813
- Yue, X., Ni, Y., Zhang, K., Zheng, T. et al. “MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI.” CVPR 2024. mmmu-benchmark.github.io
- MMMU Benchmark. “Official Leaderboard.” mmmu-benchmark.github.io/#leaderboard
- MMMU. “MMMU-Pro Dataset.” Hugging Face. huggingface.co/datasets/MMMU/MMMU_Pro
Read More
- Explore the hardest AI benchmark ever built — see Humanity’s Last Exam (HLE)
- Test graduate-level science reasoning — see GPQA Diamond
- Measure abstract reasoning and fluid intelligence — see ARC-AGI-2
- Evaluate chart and figure understanding — see CharXiv Reasoning
- Test multilingual knowledge across 14 languages — see MMMLU
- Assess competitive programming skills — see LiveCodeBench Pro