MMMU-Pro

A more robust multimodal benchmark — filtering text-solvable questions, expanding to 10 options, and introducing vision-only evaluation across 30 college subjects

Published

September 8, 2025

Keywords: MMMU-Pro, MMMU, multimodal benchmark, vision-language model, LMM evaluation, college-level reasoning, visual understanding, multi-discipline, ACL 2025, expert AGI, OCR, Chain of Thought, 10-option multiple choice

Introduction

Multimodal benchmarks like MMMU have become central to evaluating vision-language models — but many questions in MMMU can be answered without even looking at the images. Text-only LLMs can exploit option-elimination shortcuts, and with only 4 answer choices, random guessing already gives a 25% baseline.

MMMU-Pro was built to fix these weaknesses. It takes the original MMMU benchmark and applies a rigorous three-step hardening process: (1) filtering out questions answerable by text-only models, (2) expanding from 4 to 10 candidate options, and (3) introducing a vision-only input setting where questions are embedded inside screenshots. The result is a dramatically harder benchmark: GPT-4o, for example, falls from 69.1% on MMMU to 51.9% on MMMU-Pro, exposing genuine gaps in multimodal understanding.

“MMMU-Pro rigorously assesses multimodal models’ true understanding and reasoning capabilities… This setting challenges AI to truly ‘see’ and ‘read’ simultaneously, testing a fundamental human cognitive skill.” — MMMU-Pro Paper

graph LR
    A["MMMU<br/>(Original)<br/>4 options, text shortcuts"] --> B["Models exploit<br/>text-only<br/>shortcuts"]
    B --> C["MMMU-Pro<br/>10 options, vision-only,<br/>text-solvable filtered"]
    C --> D["True multimodal<br/>understanding<br/>assessment"]

    style A fill:#e74c3c,stroke:#333,color:#fff
    style B fill:#f39c12,stroke:#333,color:#fff
    style C fill:#27ae60,stroke:#333,color:#fff
    style D fill:#3498db,stroke:#333,color:#fff

What Is MMMU-Pro?

MMMU-Pro is a hardened version of the MMMU (Massive Multi-discipline Multimodal Understanding) benchmark. It evaluates multimodal AI models on college-level questions across 30 subjects in 6 disciplines — but with three critical improvements that make shortcut-based solving much harder.

The Three-Step Hardening Process

graph TD
    A["Step 1: Filter<br/>Remove questions<br/>answerable by<br/>text-only LLMs"] --> B["Step 2: Augment<br/>Expand from 4 to<br/>10 candidate options"]
    B --> C["Step 3: Vision-Only<br/>Embed questions in<br/>screenshots — no<br/>separate text input"]

    style A fill:#e74c3c,color:#fff,stroke:#333
    style B fill:#f39c12,color:#fff,stroke:#333
    style C fill:#27ae60,color:#fff,stroke:#333

  1. Filtering text-solvable questions — Four strong open-source LLMs attempt each MMMU question without images. Questions consistently answered correctly are excluded, ensuring the remaining problems truly require visual understanding
  2. Augmenting candidate options — The number of answer choices increases from 4 to 10, reducing the guessing baseline from 25% to 10% and making option-elimination strategies much less effective
  3. Vision-only input setting — Questions are embedded within screenshots or photos (with varied backgrounds, fonts, and sizes), requiring models to simultaneously “see” and “read” — a fundamental human cognitive skill
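The filtering step (Step 1) can be sketched as a simple vote among text-only models. This is a minimal illustration, not the authors' pipeline: `ask_text_only` is a hypothetical stand-in for querying an LLM with the question text and options but no images, and the threshold is an assumption (the paper used four strong open-source LLMs with repeated sampling).

```python
# Sketch of Step 1: drop questions that text-only models answer correctly.
# `ask_text_only(model, question)` is a hypothetical stand-in for querying
# an LLM with question text and options but NO images.

def is_text_solvable(question, models, ask_text_only, threshold):
    """True if at least `threshold` text-only models pick the gold answer."""
    correct = sum(ask_text_only(m, question) == question["answer"] for m in models)
    return correct >= threshold

def filter_questions(questions, models, ask_text_only, threshold=3):
    """Keep only questions the text-only models could NOT reliably solve."""
    return [
        q for q in questions
        if not is_text_solvable(q, models, ask_text_only, threshold)
    ]

# Toy demo: a "model" that always answers "A" solves question 1 but not 2.
always_a = lambda model, q: "A"
kept = filter_questions(
    [{"id": 1, "answer": "A"}, {"id": 2, "answer": "C"}],
    models=["llm-1", "llm-2", "llm-3"],
    ask_text_only=always_a,
)
print([q["id"] for q in kept])  # [2]
```

Question 1 is "solved" by every text-only model and gets filtered out; question 2 survives because it genuinely requires the image.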

Key Characteristics

| Feature | Details |
|---|---|
| Base benchmark | MMMU (college-level multimodal questions) |
| Subjects | 30 across 6 disciplines |
| Evaluation modes | Standard (10 options) and Vision-only |
| Total questions | ~1,730 (Standard), ~1,730 (Vision) |
| Question format | Multiple-choice (10 options) |
| Image types | 30 types: diagrams, charts, chemical structures, music sheets, medical images, etc. |
| Anti-shortcut | Text-solvable questions filtered out |
| Random baseline | 10% (vs. 25% in original MMMU) |
| Publication | ACL 2025 Main Conference |

What Makes It Different from MMMU?

| Feature | MMMU | MMMU-Pro |
|---|---|---|
| Answer options | 4 | 10 |
| Text-only filtering | No | Yes — removes text-solvable questions |
| Vision-only mode | No | Yes — question embedded in screenshot |
| Random baseline | 25% | 10% |
| Best model performance | ~85% | ~81% (Standard), ~78% (Vision) |
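The changed random baseline is just expected-value arithmetic over uniform guessing, which a tiny check makes concrete:

```python
# Expected accuracy of uniform random guessing over n-way multiple choice.
def random_baseline(num_options: int) -> float:
    return 1.0 / num_options

print(f"MMMU (4 options):      {random_baseline(4):.0%}")   # 25%
print(f"MMMU-Pro (10 options): {random_baseline(10):.0%}")  # 10%

# Option elimination inflates scores: ruling out 2 of 4 choices
# already yields 50% expected accuracy, but only 12.5% with 10 options
# after eliminating the same 2 choices.
print(f"{random_baseline(2):.1%} vs. {random_baseline(8):.1%}")
```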

Who Built It?

MMMU-Pro was developed by a multi-institutional research team led by:

  • Xiang Yue — Carnegie Mellon University (lead author)
  • Tianyu Zheng, Yuansheng Ni, Yubo Wang — Core contributors
  • Kai Zhang, Shengbang Tong, Yuxuan Sun, Botao Yu — Key researchers
  • Ge Zhang, Huan Sun, Yu Su — Ohio State University
  • Wenhu Chen — University of Waterloo
  • Graham Neubig — Carnegie Mellon University

The original MMMU benchmark was published at CVPR 2024, and MMMU-Pro was accepted at the ACL 2025 Main Conference.

Publication and Resources

| Resource | Link |
|---|---|
| arXiv paper | arxiv.org/abs/2409.02813 |
| Homepage | mmmu-benchmark.github.io |
| GitHub | github.com/MMMU-Benchmark/MMMU |
| Dataset (Hugging Face) | huggingface.co/datasets/MMMU/MMMU_Pro |
| EvalAI Server | eval.ai/web/challenges/challenge-page/2179 |

What Skills Does It Test?

MMMU-Pro tests expert-level multimodal understanding and reasoning across six core academic disciplines — specifically targeting capabilities that require genuine integration of visual and textual information.

graph TD
    MMMUPro["MMMU-Pro<br/>Multimodal Understanding"] --> A["Art & Design<br/>Paintings, architecture,<br/>design principles"]
    MMMUPro --> B["Business<br/>Accounting, economics,<br/>marketing"]
    MMMUPro --> C["Science<br/>Physics, chemistry,<br/>biology"]
    MMMUPro --> D["Health & Medicine<br/>Clinical knowledge,<br/>medical imaging"]
    MMMUPro --> E["Humanities &<br/>Social Science<br/>History, psychology,<br/>sociology"]
    MMMUPro --> F["Tech &<br/>Engineering<br/>CS, electrical eng.,<br/>materials science"]

    style MMMUPro fill:#e74c3c,color:#fff,stroke:#333
    style A fill:#3498db,color:#fff,stroke:#333
    style B fill:#27ae60,color:#fff,stroke:#333
    style C fill:#f39c12,color:#fff,stroke:#333
    style D fill:#8e44ad,color:#fff,stroke:#333
    style E fill:#e67e22,color:#fff,stroke:#333
    style F fill:#6cc3d5,color:#fff,stroke:#333

| Capability | What MMMU-Pro Tests |
|---|---|
| Visual perception | Interpreting diagrams, charts, chemical structures, medical images, music sheets, and 25+ other image types |
| Domain knowledge | College-level expertise across 30 subjects in 6 disciplines |
| Multimodal reasoning | Integrating visual and textual information to solve multi-step problems |
| Vision-text integration | Reading and reasoning from questions embedded in screenshots (vision-only mode) |
| Robustness to shortcuts | Resisting option-elimination and text-only solving strategies |
| OCR capability | Extracting text from images to answer questions in the vision-only setting |

The 30 Image Types

MMMU-Pro questions span an extraordinary range of visual formats:

| Category | Examples |
|---|---|
| Charts & Data | Plots, tables, graphs, timelines |
| Scientific | Chemical structures, DNA sequences, medical images, MRI/CT scans |
| Technical | Diagrams, blueprints, geometric shapes, mathematical notations |
| Artistic | Paintings, sculptures, photographs, portraits |
| Other | Music sheets, maps, comics, logos, advertisements |

Current Leaderboard

The leaderboard below shows model performance on the MMMU-Pro Standard setting (10 options) alongside each model's score on the original MMMU validation set, as published on the official MMMU benchmark website. Human expert performance is included for reference.

Source: MMMU Benchmark Leaderboard (consulted March 28, 2026). Last updated September 5, 2025. Zero-shot evaluation.

MMMU-Pro (Standard, 10 Options) and MMMU (Val)

| Rank | Model | MMMU-Pro (%) | MMMU Val (%) |
|---|---|---|---|
| | Human Expert (High) | 85.4 | 88.6 |
| 1 | Gemini 3.0 Pro | 81.0 | |
| | Human Expert (Medium) | 80.8 | 82.6 |
| 2 | GPT-5 w/ thinking | 78.4 | 84.2 |
| 3 | o3 | 76.4 | 82.9 |
| 4 | GPT-5.1 | 76.0 | 85.4 |
| | Human Expert (Low) | 73.0 | 76.2 |
| 5 | dots.vlm1 (37B) | 70.1 | 80.1 |
| 6 | Qwen3-VL 235B-A22B | 68.1 | 78.7 |
| 7 | Gemini 2.5 Pro (05-06) | 68.0 | 79.6 |
| 8 | Seed 1.5-VL Thinking (20B) | 67.6 | 77.9 |
| 9 | Seed 1.6-Thinking (20B) | 66.4 | 74.8 |
| 10 | GLM-4.5V w/ Thinking (12B) | 65.2 | 75.4 |
| 11 | GPT-5 w/o thinking | 62.7 | 74.4 |
| 12 | Seed 1.5-VL (20B) | 59.9 | 73.6 |
| 13 | Skywork-R1V3-38B | 55.4 | 76.0 |
| 14 | GPT-4o (0513) | 51.9 | 69.1 |
| 15 | Claude 3.5 Sonnet | 51.5 | 68.3 |

Key takeaways:

  • Gemini 3.0 Pro leads at 81.0% on MMMU-Pro — approaching but not yet matching the high human expert level (85.4%)
  • Massive performance drop from MMMU to MMMU-Pro — GPT-5.1 scores 85.4% on MMMU Val but drops to 76.0% on MMMU-Pro, confirming the hardening is effective
  • Human experts still lead — the high expert baseline (85.4%) has only recently been approached by the top model
  • Thinking/reasoning capabilities matter — GPT-5 with thinking (78.4%) significantly outperforms GPT-5 without thinking (62.7%), a 16-point gap
  • Open-source models are competitive — dots.vlm1 (37B) at 70.1% and Qwen3-VL at 68.1% show strong multimodal reasoning from open-weight models
  • Random choice baseline is just 12.6% on MMMU-Pro, versus 22.1% on MMMU Val — confirming the 10-option format dramatically reduces guessing

For the full leaderboard with all models and settings, visit the official benchmark website linked in the next section.

Where to Explore the Benchmark

Leaderboard and Project

| Resource | Description | Link |
|---|---|---|
| Official Leaderboard | Full rankings across MMMU-Pro, MMMU Val, and MMMU Test with filters for open-source/proprietary models | mmmu-benchmark.github.io/#leaderboard |
| arXiv Paper | Full technical paper with methodology, filtering analysis, and results | arxiv.org/abs/2409.02813 |
| EvalAI Server | Submit your own model for official evaluation on the test set | eval.ai/web/challenges/challenge-page/2179 |

Dataset and Code

| Resource | Description | Link |
|---|---|---|
| MMMU-Pro Dataset | Standard (10 options) and Vision subsets on Hugging Face | huggingface.co/datasets/MMMU/MMMU_Pro |
| MMMU Dataset | Original MMMU with 11.5K questions (test answers released Feb 2026) | huggingface.co/datasets/MMMU/MMMU |
| GitHub Repository | Evaluation code and benchmark infrastructure | github.com/MMMU-Benchmark/MMMU |

Load the Dataset

from datasets import load_dataset

# Standard setting (10 options)
mmmu_pro_standard = load_dataset("MMMU/MMMU_Pro", "standard (10 options)")

# Vision-only setting
mmmu_pro_vision = load_dataset("MMMU/MMMU_Pro", "vision")

# Standard setting (4 options, for comparison)
mmmu_pro_4 = load_dataset("MMMU/MMMU_Pro", "standard (4 options)")

Understanding the Metrics

Accuracy (Zero-Shot)

Models are evaluated in a zero-shot setting — no fine-tuning or few-shot examples. Each model receives a question (with images in Standard mode, or a screenshot in Vision mode) and must select the correct answer from 10 options.
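Scoring under this protocol is plain accuracy over the predicted option letters, which a minimal sketch makes explicit:

```python
# Zero-shot scoring: fraction of questions whose predicted option letter
# matches the gold answer. No partial credit, no few-shot examples.
def accuracy(predictions, answers):
    assert len(predictions) == len(answers)
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)

print(accuracy(["A", "C", "B", "D"], ["A", "B", "B", "D"]))  # 0.75
```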

Standard vs. Vision-Only

| Mode | Input | Challenge |
|---|---|---|
| Standard (10 options) | Text question + images, 10 answer choices | Expanded options reduce guessing; text-solvable questions removed |
| Vision-only | Screenshot containing question + images embedded together | Model must “see” and “read” simultaneously from a single image |
| Standard (4 options) | Same as Standard but with 4 options (for comparison) | Intermediate difficulty between MMMU and full MMMU-Pro |

Key Findings on Evaluation Methods

The paper explored two additional prompting strategies:

  • OCR prompts — Asking the model to first extract text from images before answering. Result: minimal effect on performance
  • Chain of Thought (CoT) — Asking the model to reason step by step. Result: generally improves performance across models
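A CoT prompt for the 10-option format can be assembled along these lines. The wording below is an illustration for demonstration purposes, not the paper's verbatim prompt:

```python
# Illustrative Chain-of-Thought prompt builder for a 10-option question.
# The instruction wording is an assumption, not the paper's exact prompt.
LETTERS = "ABCDEFGHIJ"

def build_cot_prompt(question: str, options: list[str]) -> str:
    lines = [question, ""]
    lines += [f"({LETTERS[i]}) {opt}" for i, opt in enumerate(options)]
    lines += [
        "",
        "Think through the problem step by step, referring to the image,",
        "then state the final answer as a single option letter.",
    ]
    return "\n".join(lines)

prompt = build_cot_prompt(
    "Which functional group is highlighted in the structure?",
    [f"choice {i}" for i in range(10)],
)
print(prompt.splitlines()[2])  # (A) choice 0
```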

graph LR
    A["Standard<br/>(10 options)<br/>Text + Images"] --> D["Harder than<br/>MMMU"]
    B["Vision-Only<br/>Screenshot input<br/>only"] --> D
    C["OCR / CoT<br/>Prompting<br/>strategies"] --> E["CoT helps,<br/>OCR has<br/>minimal effect"]

    style A fill:#3498db,color:#fff,stroke:#333
    style B fill:#27ae60,color:#fff,stroke:#333
    style C fill:#f39c12,color:#fff,stroke:#333
    style D fill:#8e44ad,color:#fff,stroke:#333
    style E fill:#e67e22,color:#fff,stroke:#333

Why MMMU-Pro Matters

graph LR
    A["MMMU questions<br/>solvable without<br/>images"] --> C["MMMU-Pro<br/>filters shortcuts,<br/>10 options,<br/>vision-only"]
    B["4-option guessing<br/>inflates scores"] --> C
    C --> D["True multimodal<br/>understanding<br/>measurement"]
    C --> E["Reveals gap between<br/>perception and<br/>reasoning"]

    style A fill:#e74c3c,color:#fff,stroke:#333
    style B fill:#e74c3c,color:#fff,stroke:#333
    style C fill:#27ae60,color:#fff,stroke:#333
    style D fill:#3498db,color:#fff,stroke:#333
    style E fill:#3498db,color:#fff,stroke:#333

  1. Eliminates text-only shortcuts — By filtering questions answerable without images, MMMU-Pro ensures models must genuinely understand visual content
  2. Reduces guessing advantage — 10 options drop the random baseline from 25% to 10%, providing a cleaner signal
  3. Tests real-world vision-text integration — The vision-only mode mirrors how humans encounter information: text and images embedded together in documents, slides, and photos
  4. 30 image types across 6 disciplines — From chemical structures and medical imaging to music sheets and architectural blueprints
  5. Accepted at ACL 2025 — Peer-reviewed and validated by the NLP/multimodal research community


Conclusion

MMMU-Pro sets a higher bar for multimodal AI evaluation:

  • Three-step hardening — filtering text-solvable questions, expanding to 10 options, and introducing vision-only input — exposes genuine multimodal understanding gaps
  • Published at ACL 2025 and built on the CVPR 2024 MMMU benchmark covering 30 subjects, 6 disciplines, and 30 image types
  • The best model (Gemini 3.0 Pro) scores 81.0% on MMMU-Pro, approaching but not yet matching the high human expert level (85.4%)
  • Thinking capabilities provide a 16-point boost — GPT-5 with thinking (78.4%) vs. without (62.7%) — showing that reasoning is critical for hard multimodal problems
  • Substantial MMMU-to-MMMU-Pro drop confirms that many prior benchmark scores were inflated by text-only shortcuts and narrow option sets

As multimodal AI advances toward expert-level understanding, MMMU-Pro provides the essential stress test: can your model truly see, read, and reason — or is it just exploiting shortcuts?

References

  • Yue, X., Zheng, T., Ni, Y., Wang, Y., Zhang, K., Tong, S., Sun, Y., Yu, B., Zhang, G., Sun, H., Su, Y., Chen, W., & Neubig, G. “MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark.” ACL 2025 Main. arXiv:2409.02813 (2024). arxiv.org/abs/2409.02813
  • Yue, X., Ni, Y., Zhang, K., Zheng, T. et al. “MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI.” CVPR 2024. mmmu-benchmark.github.io
  • MMMU Benchmark. “Official Leaderboard.” mmmu-benchmark.github.io/#leaderboard
  • MMMU. “MMMU-Pro Dataset.” Hugging Face. huggingface.co/datasets/MMMU/MMMU_Pro
