MMMU-Pro

A more robust multimodal benchmark — filtering text-solvable questions, expanding to 10 options, and introducing vision-only evaluation across 30 college subjects

Published

September 8, 2025

Keywords: MMMU-Pro, MMMU, multimodal benchmark, vision-language model, LMM evaluation, college-level reasoning, visual understanding, multi-discipline, ACL 2025, expert AGI, OCR, Chain of Thought, 10-option multiple choice

Introduction

Multimodal benchmarks like MMMU have become central to evaluating vision-language models — but many questions in MMMU can be answered without even looking at the images. Text-only LLMs can exploit option-elimination shortcuts, and with only 4 answer choices, random guessing already gives a 25% baseline.

MMMU-Pro was built to fix these weaknesses. It takes the original MMMU benchmark and applies a rigorous three-step hardening process: (1) filtering out questions answerable by text-only models, (2) expanding from 4 to 10 candidate options, and (3) introducing a vision-only input setting where questions are embedded inside screenshots. The result is a dramatically harder benchmark: GPT-4o, for example, falls from 69.1% on MMMU to 51.9% on MMMU-Pro, exposing genuine gaps in multimodal understanding.

“MMMU-Pro rigorously assesses multimodal models’ true understanding and reasoning capabilities… This setting challenges AI to truly ‘see’ and ‘read’ simultaneously, testing a fundamental human cognitive skill.” — MMMU-Pro Paper

graph LR
    A["MMMU<br/>(Original)<br/>4 options, text shortcuts"] --> B["Models exploit<br/>text-only<br/>shortcuts"]
    B --> C["MMMU-Pro<br/>10 options, vision-only,<br/>text-solvable filtered"]
    C --> D["True multimodal<br/>understanding<br/>assessment"]

    style A fill:#e74c3c,stroke:#333,color:#fff
    style B fill:#f39c12,stroke:#333,color:#fff
    style C fill:#27ae60,stroke:#333,color:#fff
    style D fill:#3498db,stroke:#333,color:#fff

What Is MMMU-Pro?

MMMU-Pro is a hardened version of the MMMU (Massive Multi-discipline Multimodal Understanding) benchmark. It evaluates multimodal AI models on college-level questions across 30 subjects in 6 disciplines — but with three critical improvements that make shortcut-based solving much harder.

The Three-Step Hardening Process

graph TD
    A["Step 1: Filter<br/>Remove questions<br/>answerable by<br/>text-only LLMs"] --> B["Step 2: Augment<br/>Expand from 4 to<br/>10 candidate options"]
    B --> C["Step 3: Vision-Only<br/>Embed questions in<br/>screenshots — no<br/>separate text input"]

    style A fill:#e74c3c,color:#fff,stroke:#333
    style B fill:#f39c12,color:#fff,stroke:#333
    style C fill:#27ae60,color:#fff,stroke:#333

  1. Filtering text-solvable questions — Four strong open-source LLMs attempt each MMMU question without images. Questions consistently answered correctly are excluded, ensuring the remaining problems truly require visual understanding
  2. Augmenting candidate options — The number of answer choices increases from 4 to 10, reducing the guessing baseline from 25% to 10% and making option-elimination strategies much less effective
  3. Vision-only input setting — Questions are embedded within screenshots or photos (with varied backgrounds, fonts, and sizes), requiring models to simultaneously “see” and “read” — a fundamental human cognitive skill
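The filtering step (Step 1) can be sketched as a simple vote among text-only models. This is a minimal illustration, not the authors' pipeline: `ask_text_only` is a hypothetical stand-in for querying an LLM with the question text and options but no images, and the threshold is an assumption (the paper used four strong open-source LLMs with repeated sampling).

```python
# Sketch of Step 1: drop questions that text-only models answer correctly.
# `ask_text_only(model, question)` is a hypothetical stand-in for querying
# an LLM with question text and options but NO images.

def is_text_solvable(question, models, ask_text_only, threshold):
    """True if at least `threshold` text-only models pick the gold answer."""
    correct = sum(ask_text_only(m, question) == question["answer"] for m in models)
    return correct >= threshold

def filter_questions(questions, models, ask_text_only, threshold=3):
    """Keep only questions the text-only models could NOT reliably solve."""
    return [
        q for q in questions
        if not is_text_solvable(q, models, ask_text_only, threshold)
    ]

# Toy demo: a "model" that always answers "A" solves question 1 but not 2.
always_a = lambda model, q: "A"
kept = filter_questions(
    [{"id": 1, "answer": "A"}, {"id": 2, "answer": "C"}],
    models=["llm-1", "llm-2", "llm-3"],
    ask_text_only=always_a,
)
print([q["id"] for q in kept])  # [2]
```

Question 1 is "solved" by every text-only model and gets filtered out; question 2 survives because it genuinely requires the image.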

Key Characteristics

| Feature | Details |
|---|---|
| Base benchmark | MMMU (college-level multimodal questions) |
| Subjects | 30 across 6 disciplines |
| Evaluation modes | Standard (10 options) and Vision-only |
| Total questions | ~1,730 (Standard), ~1,730 (Vision) |
| Question format | Multiple-choice (10 options) |
| Image types | 30 types: diagrams, charts, chemical structures, music sheets, medical images, etc. |
| Anti-shortcut | Text-solvable questions filtered out |
| Random baseline | 10% (vs. 25% in original MMMU) |
| Publication | ACL 2025 Main Conference |

What Makes It Different from MMMU?

| Feature | MMMU | MMMU-Pro |
|---|---|---|
| Answer options | 4 | 10 |
| Text-only filtering | No | Yes — removes text-solvable questions |
| Vision-only mode | No | Yes — question embedded in screenshot |
| Random baseline | 25% | 10% |
| Best model performance | ~85% | ~81% (Standard), ~78% (Vision) |
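The changed random baseline is just expected-value arithmetic over uniform guessing, which a tiny check makes concrete:

```python
# Expected accuracy of uniform random guessing over n-way multiple choice.
def random_baseline(num_options: int) -> float:
    return 1.0 / num_options

print(f"MMMU (4 options):      {random_baseline(4):.0%}")   # 25%
print(f"MMMU-Pro (10 options): {random_baseline(10):.0%}")  # 10%

# Option elimination inflates scores: ruling out 2 of 4 choices
# already yields 50% expected accuracy, but only 12.5% with 10 options
# after eliminating the same 2 choices.
print(f"{random_baseline(2):.1%} vs. {random_baseline(8):.1%}")
```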

Who Built It?

MMMU-Pro was developed by a multi-institutional research team led by:

  • Xiang Yue — Carnegie Mellon University (lead author)
  • Tianyu Zheng, Yuansheng Ni, Yubo Wang — Core contributors
  • Kai Zhang, Shengbang Tong, Yuxuan Sun, Botao Yu — Key researchers
  • Ge Zhang, Huan Sun, Yu Su — Ohio State University
  • Wenhu Chen — University of Waterloo
  • Graham Neubig — Carnegie Mellon University

The original MMMU benchmark was published at CVPR 2024, and MMMU-Pro was accepted at the ACL 2025 Main Conference.

Publication and Resources

| Resource | Link |
|---|---|
| arXiv paper | arxiv.org/abs/2409.02813 |
| Homepage | mmmu-benchmark.github.io |
| GitHub | github.com/MMMU-Benchmark/MMMU |
| Dataset (Hugging Face) | huggingface.co/datasets/MMMU/MMMU_Pro |
| EvalAI Server | eval.ai/web/challenges/challenge-page/2179 |

What Skills Does It Test?

MMMU-Pro tests expert-level multimodal understanding and reasoning across six core academic disciplines — specifically targeting capabilities that require genuine integration of visual and textual information.

graph TD
    MMMUPro["MMMU-Pro<br/>Multimodal Understanding"] --> A["Art & Design<br/>Paintings, architecture,<br/>design principles"]
    MMMUPro --> B["Business<br/>Accounting, economics,<br/>marketing"]
    MMMUPro --> C["Science<br/>Physics, chemistry,<br/>biology"]
    MMMUPro --> D["Health & Medicine<br/>Clinical knowledge,<br/>medical imaging"]
    MMMUPro --> E["Humanities &<br/>Social Science<br/>History, psychology,<br/>sociology"]
    MMMUPro --> F["Tech &<br/>Engineering<br/>CS, electrical eng.,<br/>materials science"]

    style MMMUPro fill:#e74c3c,color:#fff,stroke:#333
    style A fill:#3498db,color:#fff,stroke:#333
    style B fill:#27ae60,color:#fff,stroke:#333
    style C fill:#f39c12,color:#fff,stroke:#333
    style D fill:#8e44ad,color:#fff,stroke:#333
    style E fill:#e67e22,color:#fff,stroke:#333
    style F fill:#6cc3d5,color:#fff,stroke:#333

| Capability | What MMMU-Pro Tests |
|---|---|
| Visual perception | Interpreting diagrams, charts, chemical structures, medical images, music sheets, and 25+ other image types |
| Domain knowledge | College-level expertise across 30 subjects in 6 disciplines |
| Multimodal reasoning | Integrating visual and textual information to solve multi-step problems |
| Vision-text integration | Reading and reasoning from questions embedded in screenshots (vision-only mode) |
| Robustness to shortcuts | Resisting option-elimination and text-only solving strategies |
| OCR capability | Extracting text from images to answer questions in the vision-only setting |

The 30 Image Types

MMMU-Pro questions span an extraordinary range of visual formats:

| Category | Examples |
|---|---|
| Charts & Data | Plots, tables, graphs, timelines |
| Scientific | Chemical structures, DNA sequences, medical images, MRI/CT scans |
| Technical | Diagrams, blueprints, geometric shapes, mathematical notations |
| Artistic | Paintings, sculptures, photographs, portraits |
| Other | Music sheets, maps, comics, logos, advertisements |

Current Leaderboard

The leaderboard below shows model performance on the MMMU-Pro Standard setting (10 options) alongside each model's score on the original MMMU validation set, as published on the official MMMU benchmark website. Human expert performance is included for reference.

Source: MMMU Benchmark Leaderboard (consulted March 28, 2026). Last updated September 5, 2025. Zero-shot evaluation.

MMMU-Pro (Standard, 10 Options) and MMMU (Val)

| Rank | Model | MMMU-Pro (%) | MMMU Val (%) |
|---|---|---|---|
| | Human Expert (High) | 85.4 | 88.6 |
| 1 | Gemini 3.0 Pro | 81.0 | |
| | Human Expert (Medium) | 80.8 | 82.6 |
| 2 | GPT-5 w/ thinking | 78.4 | 84.2 |
| 3 | o3 | 76.4 | 82.9 |
| 4 | GPT-5.1 | 76.0 | 85.4 |
| | Human Expert (Low) | 73.0 | 76.2 |
| 5 | dots.vlm1 (37B) | 70.1 | 80.1 |
| 6 | Qwen3-VL 235B-A22B | 68.1 | 78.7 |
| 7 | Gemini 2.5 Pro (05-06) | 68.0 | 79.6 |
| 8 | Seed 1.5-VL Thinking (20B) | 67.6 | 77.9 |
| 9 | Seed 1.6-Thinking (20B) | 66.4 | 74.8 |
| 10 | GLM-4.5V w/ Thinking (12B) | 65.2 | 75.4 |
| 11 | GPT-5 w/o thinking | 62.7 | 74.4 |
| 12 | Seed 1.5-VL (20B) | 59.9 | 73.6 |
| 13 | Skywork-R1V3-38B | 55.4 | 76.0 |
| 14 | GPT-4o (0513) | 51.9 | 69.1 |
| 15 | Claude 3.5 Sonnet | 51.5 | 68.3 |

Key takeaways:

  • Gemini 3.0 Pro leads at 81.0% on MMMU-Pro — approaching but not yet matching the high human expert level (85.4%)
  • Massive performance drop from MMMU to MMMU-Pro — GPT-5.1 scores 85.4% on MMMU Val but drops to 76.0% on MMMU-Pro, confirming the hardening is effective
  • Human experts still lead — the high expert baseline (85.4%) has only recently been approached by the top model
  • Thinking/reasoning capabilities matter — GPT-5 with thinking (78.4%) significantly outperforms GPT-5 without thinking (62.7%), a 16-point gap
  • Open-source models are competitive — dots.vlm1 (37B) at 70.1% and Qwen3-VL at 68.1% show strong multimodal reasoning from open-weight models
  • Random choice baseline is just 12.6% on MMMU-Pro, versus 22.1% on MMMU Val — confirming the 10-option format dramatically reduces guessing

For the full leaderboard with all models and settings, visit the official benchmark website linked in the next section.

Where to Explore the Benchmark

Leaderboard and Project

| Resource | Description | Link |
|---|---|---|
| Official Leaderboard | Full rankings across MMMU-Pro, MMMU Val, and MMMU Test with filters for open-source/proprietary models | mmmu-benchmark.github.io/#leaderboard |
| arXiv Paper | Full technical paper with methodology, filtering analysis, and results | arxiv.org/abs/2409.02813 |
| EvalAI Server | Submit your own model for official evaluation on the test set | eval.ai/web/challenges/challenge-page/2179 |

Dataset and Code

| Resource | Description | Link |
|---|---|---|
| MMMU-Pro Dataset | Standard (10 options) and Vision subsets on Hugging Face | huggingface.co/datasets/MMMU/MMMU_Pro |
| MMMU Dataset | Original MMMU with 11.5K questions (test answers released Feb 2026) | huggingface.co/datasets/MMMU/MMMU |
| GitHub Repository | Evaluation code and benchmark infrastructure | github.com/MMMU-Benchmark/MMMU |

Load the Dataset

from datasets import load_dataset

# Standard setting (10 options)
mmmu_pro_standard = load_dataset("MMMU/MMMU_Pro", "standard (10 options)")

# Vision-only setting
mmmu_pro_vision = load_dataset("MMMU/MMMU_Pro", "vision")

# Standard setting (4 options, for comparison)
mmmu_pro_4 = load_dataset("MMMU/MMMU_Pro", "standard (4 options)")

Understanding the Metrics

Accuracy (Zero-Shot)

Models are evaluated in a zero-shot setting — no fine-tuning or few-shot examples. Each model receives a question (with images in Standard mode, or a screenshot in Vision mode) and must select the correct answer from 10 options.
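Scoring under this protocol is plain accuracy over the predicted option letters, which a minimal sketch makes explicit:

```python
# Zero-shot scoring: fraction of questions whose predicted option letter
# matches the gold answer. No partial credit, no few-shot examples.
def accuracy(predictions, answers):
    assert len(predictions) == len(answers)
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)

print(accuracy(["A", "C", "B", "D"], ["A", "B", "B", "D"]))  # 0.75
```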

Standard vs. Vision-Only

| Mode | Input | Challenge |
|---|---|---|
| Standard (10 options) | Text question + images, 10 answer choices | Expanded options reduce guessing; text-solvable questions removed |
| Vision-only | Screenshot containing question + images embedded together | Model must “see” and “read” simultaneously from a single image |
| Standard (4 options) | Same as Standard but with 4 options (for comparison) | Intermediate difficulty between MMMU and full MMMU-Pro |

Key Findings on Evaluation Methods

The paper explored two additional prompting strategies:

  • OCR prompts — Asking the model to first extract text from images before answering. Result: minimal effect on performance
  • Chain of Thought (CoT) — Asking the model to reason step by step. Result: generally improves performance across models
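A CoT prompt for the 10-option format can be assembled along these lines. The wording below is an illustration for demonstration purposes, not the paper's verbatim prompt:

```python
# Illustrative Chain-of-Thought prompt builder for a 10-option question.
# The instruction wording is an assumption, not the paper's exact prompt.
LETTERS = "ABCDEFGHIJ"

def build_cot_prompt(question: str, options: list[str]) -> str:
    lines = [question, ""]
    lines += [f"({LETTERS[i]}) {opt}" for i, opt in enumerate(options)]
    lines += [
        "",
        "Think through the problem step by step, referring to the image,",
        "then state the final answer as a single option letter.",
    ]
    return "\n".join(lines)

prompt = build_cot_prompt(
    "Which functional group is highlighted in the structure?",
    [f"choice {i}" for i in range(10)],
)
print(prompt.splitlines()[2])  # (A) choice 0
```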

graph LR
    A["Standard<br/>(10 options)<br/>Text + Images"] --> D["Harder than<br/>MMMU"]
    B["Vision-Only<br/>Screenshot input<br/>only"] --> D
    C["OCR / CoT<br/>Prompting<br/>strategies"] --> E["CoT helps,<br/>OCR has<br/>minimal effect"]

    style A fill:#3498db,color:#fff,stroke:#333
    style B fill:#27ae60,color:#fff,stroke:#333
    style C fill:#f39c12,color:#fff,stroke:#333
    style D fill:#8e44ad,color:#fff,stroke:#333
    style E fill:#e67e22,color:#fff,stroke:#333

Why MMMU-Pro Matters

graph LR
    A["MMMU questions<br/>solvable without<br/>images"] --> C["MMMU-Pro<br/>filters shortcuts,<br/>10 options,<br/>vision-only"]
    B["4-option guessing<br/>inflates scores"] --> C
    C --> D["True multimodal<br/>understanding<br/>measurement"]
    C --> E["Reveals gap between<br/>perception and<br/>reasoning"]

    style A fill:#e74c3c,color:#fff,stroke:#333
    style B fill:#e74c3c,color:#fff,stroke:#333
    style C fill:#27ae60,color:#fff,stroke:#333
    style D fill:#3498db,color:#fff,stroke:#333
    style E fill:#3498db,color:#fff,stroke:#333

  1. Eliminates text-only shortcuts — By filtering questions answerable without images, MMMU-Pro ensures models must genuinely understand visual content
  2. Reduces guessing advantage — 10 options drop the random baseline from 25% to 10%, providing a cleaner signal
  3. Tests real-world vision-text integration — The vision-only mode mirrors how humans encounter information: text and images embedded together in documents, slides, and photos
  4. 30 image types across 6 disciplines — From chemical structures and medical imaging to music sheets and architectural blueprints
  5. Accepted at ACL 2025 — Peer-reviewed and validated by the NLP/multimodal research community


Conclusion

MMMU-Pro sets a higher bar for multimodal AI evaluation:

  • Three-step hardening — filtering text-solvable questions, expanding to 10 options, and introducing vision-only input — exposes genuine multimodal understanding gaps
  • Published at ACL 2025 and built on the CVPR 2024 MMMU benchmark covering 30 subjects, 6 disciplines, and 30 image types
  • The best model (Gemini 3.0 Pro) scores 81.0% on MMMU-Pro, approaching but not yet matching the high human expert level (85.4%)
  • Thinking capabilities provide a 16-point boost — GPT-5 with thinking (78.4%) vs. without (62.7%) — showing that reasoning is critical for hard multimodal problems
  • Substantial MMMU-to-MMMU-Pro drop confirms that many prior benchmark scores were inflated by text-only shortcuts and narrow option sets

As multimodal AI advances toward expert-level understanding, MMMU-Pro provides the essential stress test: can your model truly see, read, and reason — or is it just exploiting shortcuts?

References

  • Yue, X., Zheng, T., Ni, Y., Wang, Y., Zhang, K., Tong, S., Sun, Y., Yu, B., Zhang, G., Sun, H., Su, Y., Chen, W., & Neubig, G. “MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark.” ACL 2025 Main. arXiv:2409.02813 (2024). arxiv.org/abs/2409.02813
  • Yue, X., Ni, Y., Zhang, K., Zheng, T. et al. “MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI.” CVPR 2024. mmmu-benchmark.github.io
  • MMMU Benchmark. “Official Leaderboard.” mmmu-benchmark.github.io/#leaderboard
  • MMMU. “MMMU-Pro Dataset.” Hugging Face. huggingface.co/datasets/MMMU/MMMU_Pro
