Video-MMMU

A multi-discipline benchmark evaluating how large multimodal models (LMMs) acquire knowledge from professional videos through three cognitive stages: Perception, Comprehension, and Adaptation

Published

September 16, 2025

Keywords: Video-MMMU, video understanding, knowledge acquisition, multi-modal benchmark, LMM evaluation, Bloom’s taxonomy, Perception, Comprehension, Adaptation, Δknowledge, educational video, multi-discipline, NTU, S-Lab

Introduction

Large Multimodal Models (LMMs) can now analyze images, generate code, and reason through complex problems. But can they actually learn from watching a video — the way a student absorbs a lecture, grasps the underlying concepts, and applies that knowledge to solve a new exam problem?

Video-MMMU (Massive Multi-discipline Multimodal Understanding) is a benchmark designed to answer exactly that question. It evaluates LMMs’ ability to acquire and utilize knowledge from educational videos across six professional disciplines — going far beyond simple scene description or action recognition. Grounded in Bloom’s taxonomy, it tests three progressive cognitive stages: perceiving information, comprehending knowledge, and adapting knowledge to novel scenarios.

“Humans acquire knowledge through three cognitive stages: perceiving information, comprehending knowledge, and adapting knowledge to solve novel problems. Videos serve as an effective medium for this learning process.” — Hu et al., arXiv:2501.13826

```mermaid
graph LR
    A["Existing Video Benchmarks<br/>(Video-MME, MVBench, etc.)<br/>Scene understanding"] --> B["Videos treated as<br/>visual scenes"]
    B --> C["Video-MMMU<br/>300 videos × 900 questions<br/>Videos as learning medium"]
    C --> D["Evaluates knowledge<br/>acquisition capability"]

    style A fill:#e74c3c,stroke:#333,color:#fff
    style B fill:#f39c12,stroke:#333,color:#fff
    style C fill:#27ae60,stroke:#333,color:#fff
    style D fill:#3498db,stroke:#333,color:#fff
```

What Is Video-MMMU?

Video-MMMU is a multi-modal, multi-disciplinary benchmark consisting of 300 expert-level educational videos and 900 human-annotated questions spanning 6 disciplines and 30 subjects. Each video is paired with three questions — one for each cognitive stage — creating a systematic framework for evaluating whether models can truly learn from video content.

The benchmark features two types of videos:

  • Concept-Introduction Videos — comprehensive explanations of factual knowledge, fundamental concepts, and theories
  • Problem-Solving Videos — step-by-step demonstrations of solutions, particularly in STEM disciplines

Key Characteristics

| Feature | Details |
| --- | --- |
| Total videos | 300 college-level educational videos |
| Total questions | 900 (3 per video, one per cognitive stage) |
| Disciplines | 6 (Art, Business, Science, Medicine, Humanities, Engineering) |
| Subjects | 30, distributed across the disciplines |
| Avg. video duration | 506.2 seconds (~8.4 minutes) |
| Avg. question length | 75.7 words |
| MCQ options | 10 per question |
| Knowledge metric | Δknowledge (normalized performance gain after video viewing) |
| Evaluation framework | LMMs-Eval |
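
For a quick look at the data, the questions can be pulled from the HuggingFace Hub with the `datasets` library. The sketch below is illustrative only: the split name and record fields (`question`, `options`) are assumptions based on the benchmark description above, so check the dataset card for the actual schema.

```python
# Minimal sketch: browse Video-MMMU records from the HuggingFace Hub.
# Assumptions (verify against the dataset card): the split is named "test"
# and records carry fields like "question" and "options".
from datasets import load_dataset

ds = load_dataset("Video-MMMU/Video-MMMU", split="test")
print(len(ds))                 # expected: 900 questions (3 per video)

record = ds[0]
print(record.get("question"))  # question text (avg. 75.7 words)
print(record.get("options"))   # MCQ options (10 per question)
```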

The Three Cognitive Stages

Aligned with Bloom’s taxonomy, Video-MMMU evaluates knowledge acquisition through three progressively harder stages:

```mermaid
graph TD
    V["Educational Video<br/>300 expert-level videos"] --> P["Stage 1: Perception<br/>Identify key information<br/>(OCR, ASR)"]
    P --> C["Stage 2: Comprehension<br/>Understand underlying concepts<br/>(Concept & Problem-solving)"]
    C --> A["Stage 3: Adaptation<br/>Apply knowledge to novel scenarios<br/>(Case Study & Strategy)"]
    A --> DK["Δknowledge Metric<br/>Quantifies learning gain"]

    style V fill:#8e44ad,color:#fff,stroke:#333
    style P fill:#27ae60,color:#fff,stroke:#333
    style C fill:#f39c12,color:#fff,stroke:#333
    style A fill:#e74c3c,color:#fff,stroke:#333
    style DK fill:#3498db,color:#fff,stroke:#333
```

| Stage | What It Tests | Question Types |
| --- | --- | --- |
| Perception | Identifying key information from the video | OCR (formulas, charts, handwritten notes), ASR (speech transcription) |
| Comprehension | Understanding the presented knowledge | Concept Comprehension (MAMC format), Problem-solving Strategy Comprehension |
| Adaptation | Applying knowledge to new scenarios | Case Study Analysis (novel real-world scenarios), Problem-solving Strategy Adaptation |
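
Put differently, one benchmark item bundles a single video with exactly three questions, one per stage. A hypothetical record shape (the field names are illustrative, not the dataset's actual schema):

```python
from dataclasses import dataclass

# Illustrative shape of one Video-MMMU item: a single video paired with
# three questions, one per cognitive stage. Field names are hypothetical.
@dataclass
class VideoMMMUItem:
    video_id: str
    discipline: str        # one of the 6 disciplines
    subject: str           # one of the 30 subjects
    perception_q: dict     # OCR- or ASR-style question
    comprehension_q: dict  # concept or strategy comprehension
    adaptation_q: dict     # case study or strategy adaptation
```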

Who Built It?

Video-MMMU was developed by researchers at S-Lab, Nanyang Technological University (NTU) and Carnegie Mellon University (CMU):

  • Kairui Hu, Penghao Wu, Fanyi Pu, Wang Xiao, Yuanhan Zhang, Bo Li, Ziwei Liu — S-Lab, NTU
  • Xiang Yue — Carnegie Mellon University

The project builds on MMMU and MMMU-Pro, established multi-modal benchmarks from an overlapping author team, and extends their evaluation from static images to video-based knowledge acquisition.

| Resource | Link |
| --- | --- |
| Project Website | videommmu.github.io |
| arXiv Paper | arxiv.org/abs/2501.13826 |
| HuggingFace Dataset | huggingface.co/datasets/Video-MMMU/Video-MMMU |
| HuggingFace Leaderboard | huggingface.co/spaces/Video-MMMU/Video-MMMU-Leaderboard |

What Skills Does It Test?

Video-MMMU covers 30 subjects across 6 professional disciplines, each requiring domain-specific expertise:

```mermaid
graph TD
    VM["Video-MMMU<br/>300 videos · 900 questions"] --> Art["Art<br/>Art History, Art Theory,<br/>Design, Music"]
    VM --> Biz["Business<br/>Accounting, Economics,<br/>Finance, Management, Marketing"]
    VM --> Sci["Science<br/>Biology, Chemistry,<br/>Geography, Math, Physics"]
    VM --> Med["Medicine<br/>Basic Medical Science,<br/>Clinical Medicine, Pharmacy"]
    VM --> Hum["Humanities<br/>History, Literature,<br/>Psychology, Sociology"]
    VM --> Eng["Engineering<br/>Computer Science, Electronics,<br/>Architecture, Materials"]

    style VM fill:#e74c3c,color:#fff,stroke:#333
    style Art fill:#9b59b6,color:#fff,stroke:#333
    style Biz fill:#3498db,color:#fff,stroke:#333
    style Sci fill:#27ae60,color:#fff,stroke:#333
    style Med fill:#e67e22,color:#fff,stroke:#333
    style Hum fill:#f39c12,color:#fff,stroke:#333
    style Eng fill:#1abc9c,color:#fff,stroke:#333
```

| Capability | What Video-MMMU Tests |
| --- | --- |
| Visual perception | Extracting formulas, charts, diagrams, and handwritten notes from video frames |
| Speech understanding | Transcribing and interpreting spoken lecture content |
| Concept comprehension | Understanding theories and concepts presented in video |
| Quantitative reasoning | Following step-by-step calculations and adapting them to new inputs |
| Knowledge transfer | Applying learned concepts to novel real-world scenarios and exam problems |
| Multi-modal integration | Combining visual, textual, and spoken information from educational videos |

Current Leaderboard

The table below shows model accuracy (%) on Video-MMMU across the three cognitive tracks and six disciplines.

Source: Hu, K. et al. “Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos.” arXiv:2501.13826 (January 2025). Evaluation uses micro-averaged accuracy via LMMs-Eval.

Overall Results by Track

| Model | Overall | Perception | Comprehension | Adaptation |
| --- | --- | --- | --- | --- |
| Human Expert | 74.44 | 84.33 | 78.67 | 60.33 |
| Claude-3.5-Sonnet | 65.78 | 72.00 | 69.67 | 55.67 |
| GPT-4o | 61.22 | 66.00 | 62.00 | 55.67 |
| Gemini 1.5 Pro | 53.89 | 59.00 | 53.33 | 49.33 |
| Aria | 50.78 | 65.67 | 46.67 | 40.00 |
| Gemini 1.5 Flash | 49.78 | 57.33 | 49.00 | 43.00 |
| LLaVA-Video-72B | 49.67 | 59.67 | 46.00 | 43.33 |
| LLaVA-OneVision-72B | 48.33 | 59.67 | 42.33 | 43.00 |
| MAmmoTH-VL-8B | 41.78 | 51.67 | 40.00 | 33.67 |
| InternVL2-8B | 37.44 | 47.33 | 33.33 | 31.67 |
| LLaVA-Video-7B | 36.11 | 41.67 | 33.33 | 33.33 |
| VILA1.5-40B | 34.00 | 38.67 | 30.67 | 32.67 |
| LLaVA-OneVision-7B | 33.89 | 40.00 | 31.00 | 30.67 |
| Llama-3.2-11B | 30.00 | 35.67 | 32.33 | 22.00 |
| LongVA-7B | 23.98 | 24.00 | 24.33 | 23.67 |
| VILA1.5-8B | 20.89 | 20.33 | 17.33 | 25.00 |
| Random Choice | 14.00 | 12.00 | 14.00 | 16.00 |

Knowledge Acquisition (Δknowledge)

The Δknowledge metric measures the normalized improvement in Adaptation accuracy after watching the video compared to before. It quantifies how much a model actually learns from the video.

| Model | Δknowledge (%) | Wrong→Right Rate (%) | Right→Wrong Rate (%) |
| --- | --- | --- | --- |
| Human Expert | 33.1 | 40.4 | 10.7 |
| GPT-4o | 15.6 | 28.0 | 13.3 |
| Claude-3.5-Sonnet | 11.4 | 28.8 | 19.5 |
| VILA-1.5-40B | 9.4 | 25.2 | 45.9 |
| Gemini-1.5-Pro | 8.7 | 29.5 | 24.6 |
| LLaVA-Video-72B | 7.1 | 22.0 | 24.6 |
| LLaVA-OneVision-72B | 6.6 | 20.9 | |

Key takeaways:

  • Human experts achieve a Δknowledge of 33.1%, demonstrating strong video-based learning with a high Wrong→Right rate (40.4%) and low Right→Wrong rate (10.7%)
  • The best model (GPT-4o) achieves only 15.6% — less than half the human knowledge gain
  • Models show a troubling pattern: moderate Wrong→Right rates but high Right→Wrong rates (e.g., VILA-1.5-40B: 45.9%), meaning they lose previously correct answers after watching videos (see the sketch below for how these rates are computed)
  • Performance declines steeply from Perception → Comprehension → Adaptation for nearly all models
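
These transition rates follow directly from paired pre-/post-video results on the Adaptation questions. A sketch of the computation (function and variable names are mine, not from the paper's code):

```python
# Sketch: Wrong→Right and Right→Wrong rates from paired results.
# pre[i] / post[i]: whether Adaptation question i was answered correctly
# before / after watching the video. Names are illustrative.
def transition_rates(pre: list[bool], post: list[bool]) -> tuple[float, float]:
    wrong_before = [i for i, ok in enumerate(pre) if not ok]
    right_before = [i for i, ok in enumerate(pre) if ok]
    # Of the questions missed before the video, how many are fixed after it?
    w2r = sum(post[i] for i in wrong_before) / max(len(wrong_before), 1)
    # Of the questions solved before the video, how many are lost after it?
    r2w = sum(not post[i] for i in right_before) / max(len(right_before), 1)
    return 100 * w2r, 100 * r2w
```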

For the latest leaderboard, visit the resources in the next section.

Where to Explore the Benchmark

Dashboards and Leaderboard

| Resource | Description | Link |
| --- | --- | --- |
| HuggingFace Leaderboard | Interactive leaderboard with model submissions | huggingface.co/spaces/Video-MMMU/Video-MMMU-Leaderboard |
| Project Website | Official website with examples and visualizations | videommmu.github.io |

Dataset and Code

| Resource | Description | Link |
| --- | --- | --- |
| HuggingFace Dataset | 300 videos and 900 annotated questions | huggingface.co/datasets/Video-MMMU/Video-MMMU |
| arXiv Paper | Full technical paper with methodology and analysis | arxiv.org/abs/2501.13826 |
| LMMs-Eval | Evaluation framework used for benchmarking | github.com/EvolvingLMMs-Lab/lmms-eval |

Understanding the Metrics

Micro-Averaged Accuracy

The primary metric. Models receive a video and a question as input. Responses are evaluated by an automated, rule-based pipeline that uses regular expressions to extract option letters and numerical values; responses lacking a valid answer are marked incorrect.
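
The exact rules live in the LMMs-Eval codebase; the snippet below is a simplified stand-in that shows the general idea of regex-based option extraction (the patterns are mine, not the benchmark's own).

```python
import re

# Simplified stand-in for the rule-based answer-extraction step.
# Video-MMMU MCQs have up to 10 options, hence the A-J letter range.
def extract_option(response: str) -> str | None:
    # Patterns like "Answer: C" or "the answer is (C)" ...
    m = re.search(r"[Aa]nswer\s*(?:is|:)?\s*\(?([A-J])\)?", response)
    if m:
        return m.group(1)
    # ... or a bare option letter such as "C" or "(C)."
    m = re.fullmatch(r"\s*\(?([A-J])\)?\s*\.?\s*", response)
    return m.group(1) if m else None  # None is scored as incorrect

print(extract_option("The answer is (C)."))  # -> C
print(extract_option("I am not sure."))      # -> None (marked incorrect)
```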

Δknowledge (Knowledge Gain)

The signature metric of Video-MMMU. It measures how much a model’s performance on Adaptation questions improves after watching the video:

$$
\Delta_{\text{knowledge}} = \frac{Acc_{\text{post}} - Acc_{\text{pre}}}{100\% - Acc_{\text{pre}}} \times 100\%
$$

where $Acc_{\text{pre}}$ and $Acc_{\text{post}}$ are the Adaptation accuracies before and after watching the video. This normalized metric accounts for baseline difficulty: improving from 90% to 95% (Δknowledge = 50%) indicates more substantial learning than improving from 0% to 5% (Δknowledge = 5%).
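
As a sanity check on those numbers, the formula is a one-liner (my implementation, not the paper's code):

```python
# Normalized knowledge gain: the fraction of remaining headroom
# (100% minus pre-video accuracy) that is closed after watching the video.
def delta_knowledge(acc_pre: float, acc_post: float) -> float:
    return (acc_post - acc_pre) / (100.0 - acc_pre) * 100.0

print(delta_knowledge(90.0, 95.0))  # 50.0: half the headroom is closed
print(delta_knowledge(0.0, 5.0))    # 5.0: only 5% of the headroom
```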

Error Analysis

An analysis of 100 randomly sampled Claude-3.5-Sonnet errors in the Adaptation track reveals that most failures stem from inability to adapt learned methods, not from misunderstanding:

| Error Type | Proportion |
| --- | --- |
| Method Adaptation Error | 64% |
| Question Misreading Error | 15% |
| Method Selection Error | 8% |
| Answer Extraction Error | 5% |
| Refuse to Answer | 4% |
| Annotation Error | 4% |

Why Video-MMMU Matters

```mermaid
graph LR
    A["Existing benchmarks treat<br/>video as visual scene"] --> B["No evaluation of<br/>learning from video"]
    B --> C["Video-MMMU fills<br/>the gap"]
    C --> D["Measures knowledge<br/>acquisition capability"]

    A2["Models score well on<br/>perception tasks"] --> B2["Performance collapses<br/>on adaptation"]
    B2 --> C
    C --> D2["Drives research toward<br/>better video learning"]

    style A fill:#e74c3c,color:#fff,stroke:#333
    style A2 fill:#e74c3c,color:#fff,stroke:#333
    style C fill:#27ae60,color:#fff,stroke:#333
    style D fill:#3498db,color:#fff,stroke:#333
    style D2 fill:#3498db,color:#fff,stroke:#333
```

  1. First knowledge-acquisition benchmark for video — uniquely treats video as an educational medium rather than a visual scene
  2. Grounded in cognitive science — three stages aligned with Bloom’s taxonomy provide structured, interpretable evaluation
  3. Δknowledge metric — quantifies actual learning gain, revealing a 2× gap between humans and the best models
  4. Exposes a fundamental limitation — 64% of Adaptation errors are Method Adaptation Errors, showing models understand but cannot apply knowledge
  5. Multi-disciplinary rigor — 30 subjects across 6 disciplines with expert-curated, college-level content

Conclusion

Video-MMMU reveals a fundamental gap in how current AI models learn from video:

  • 300 expert-level videos and 900 questions across 6 disciplines and 30 subjects provide rigorous, multi-disciplinary evaluation
  • Performance of the best model (Claude-3.5-Sonnet) drops steeply from Perception (72.00) → Comprehension (69.67) → Adaptation (55.67), exposing the challenge of deeper cognitive processing
  • Humans achieve Δknowledge of 33.1% while the best model (GPT-4o) reaches only 15.6% — less than half the human learning gain
  • Models exhibit a troubling Right→Wrong pattern: watching videos causes them to forget previously correct answers, unlike humans who retain prior knowledge
  • 64% of errors in the Adaptation track are Method Adaptation Errors — models can recall knowledge from the video but fail to flexibly apply it to new problems

As LMMs increasingly need to learn from real-world video content — lectures, tutorials, demonstrations — Video-MMMU provides the first systematic measure of this critical capability. Closing the human-model gap on knowledge acquisition from video remains a significant open challenge.

References

Hu, Kairui, Penghao Wu, Fanyi Pu, Wang Xiao, Yuanhan Zhang, Xiang Yue, Bo Li, and Ziwei Liu. "Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos." arXiv preprint arXiv:2501.13826 (2025).