```mermaid
graph LR
A["Existing Video Benchmarks<br/>(Video-MME, MVBench, etc.)<br/>Scene understanding"] --> B["Videos treated as<br/>visual scenes"]
B --> C["Video-MMMU<br/>300 videos · 900 questions<br/>Videos as learning medium"]
C --> D["Evaluates knowledge<br/>acquisition capability"]
style A fill:#e74c3c,stroke:#333,color:#fff
style B fill:#f39c12,stroke:#333,color:#fff
style C fill:#27ae60,stroke:#333,color:#fff
style D fill:#3498db,stroke:#333,color:#fff
```
Video-MMMU
A multi-discipline benchmark evaluating how LMMs acquire knowledge from professional videos through three cognitive stages: Perception, Comprehension, and Adaptation
Keywords: Video-MMMU, video understanding, knowledge acquisition, multi-modal benchmark, LLM evaluation, Bloom’s taxonomy, Perception, Comprehension, Adaptation, Δknowledge, educational video, multi-discipline, NTU, S-Lab

Introduction
Large Multimodal Models (LMMs) can now analyze images, generate code, and reason through complex problems. But can they actually learn from watching a video — the way a student absorbs a lecture, grasps the underlying concepts, and applies that knowledge to solve a new exam problem?
Video-MMMU (Massive Multi-discipline Multimodal Understanding) is a benchmark designed to answer exactly that question. It evaluates LMMs’ ability to acquire and utilize knowledge from educational videos across six professional disciplines — going far beyond simple scene description or action recognition. Grounded in Bloom’s taxonomy, it tests three progressive cognitive stages: perceiving information, comprehending knowledge, and adapting knowledge to novel scenarios.
“Humans acquire knowledge through three cognitive stages: perceiving information, comprehending knowledge, and adapting knowledge to solve novel problems. Videos serve as an effective medium for this learning process.” — Hu et al., arXiv:2501.13826
What Is Video-MMMU?
Video-MMMU is a multi-modal, multi-disciplinary benchmark consisting of 300 expert-level educational videos and 900 human-annotated questions spanning 6 disciplines and 30 subjects. Each video is paired with three questions — one for each cognitive stage — creating a systematic framework for evaluating whether models can truly learn from video content.
The benchmark features two types of videos:
- Concept-Introduction Videos — comprehensive explanations of factual knowledge, fundamental concepts, and theories
- Problem-Solving Videos — step-by-step demonstrations of solutions, particularly in STEM disciplines
Key Characteristics
| Feature | Details |
|---|---|
| Total videos | 300 college-level educational videos |
| Total questions | 900 (3 per video, one per cognitive stage) |
| Disciplines | 6 (Art, Business, Science, Medicine, Humanities, Engineering) |
| Subjects | 30 distributed across disciplines |
| Avg. video duration | 506.2 seconds (~8.4 minutes) |
| Avg. question length | 75.7 words |
| MCQ options | 10 per question |
| Knowledge metric | Δknowledge (normalized performance gain after video viewing) |
| Evaluation framework | LMMs-Eval |
The Three Cognitive Stages
Aligned with Bloom’s taxonomy, Video-MMMU evaluates knowledge acquisition through three progressively harder stages:
```mermaid
graph TD
V["Educational Video<br/>300 expert-level videos"] --> P["Stage 1: Perception<br/>Identify key information<br/>(OCR, ASR)"]
P --> C["Stage 2: Comprehension<br/>Understand underlying concepts<br/>(Concept & Problem-solving)"]
C --> A["Stage 3: Adaptation<br/>Apply knowledge to novel scenarios<br/>(Case Study & Strategy)"]
A --> DK["Δknowledge Metric<br/>Quantifies learning gain"]
style V fill:#8e44ad,color:#fff,stroke:#333
style P fill:#27ae60,color:#fff,stroke:#333
style C fill:#f39c12,color:#fff,stroke:#333
style A fill:#e74c3c,color:#fff,stroke:#333
style DK fill:#3498db,color:#fff,stroke:#333
```
| Stage | What It Tests | Question Types |
|---|---|---|
| Perception | Identifying key information from video | OCR (formulas, charts, handwritten notes), ASR (speech transcription) |
| Comprehension | Understanding presented knowledge | Concept Comprehension (MAMC format), Problem-solving Strategy Comprehension |
| Adaptation | Applying knowledge to new scenarios | Case Study Analysis (novel real-world scenarios), Problem-solving Strategy Adaptation |
Who Built It?
Video-MMMU was developed by researchers at S-Lab, Nanyang Technological University (NTU) and Carnegie Mellon University (CMU):
- Kairui Hu, Penghao Wu, Fanyi Pu, Wang Xiao, Yuanhan Zhang, Bo Li, Ziwei Liu — S-Lab, NTU
- Xiang Yue — Carnegie Mellon University
The project builds on MMMU and MMMU-Pro, established multi-modal benchmarks from an overlapping set of authors, extending evaluation from static images to video-based knowledge acquisition.
| Resource | Link |
|---|---|
| Project Website | videommmu.github.io |
| arXiv Paper | arxiv.org/abs/2501.13826 |
| HuggingFace Dataset | huggingface.co/datasets/Video-MMMU/Video-MMMU |
| HuggingFace Leaderboard | huggingface.co/spaces/Video-MMMU/Video-MMMU-Leaderboard |
What Skills Does It Test?
Video-MMMU covers 30 subjects across 6 professional disciplines, each requiring domain-specific expertise:
```mermaid
graph TD
VM["Video-MMMU<br/>300 videos · 900 questions"] --> Art["Art<br/>Art History, Art Theory,<br/>Design, Music"]
VM --> Biz["Business<br/>Accounting, Economics,<br/>Finance, Management, Marketing"]
VM --> Sci["Science<br/>Biology, Chemistry,<br/>Geography, Math, Physics"]
VM --> Med["Medicine<br/>Basic Medical Science,<br/>Clinical Medicine, Pharmacy"]
VM --> Hum["Humanities<br/>History, Literature,<br/>Psychology, Sociology"]
VM --> Eng["Engineering<br/>Computer Science, Electronics,<br/>Architecture, Materials"]
style VM fill:#e74c3c,color:#fff,stroke:#333
style Art fill:#9b59b6,color:#fff,stroke:#333
style Biz fill:#3498db,color:#fff,stroke:#333
style Sci fill:#27ae60,color:#fff,stroke:#333
style Med fill:#e67e22,color:#fff,stroke:#333
style Hum fill:#f39c12,color:#fff,stroke:#333
style Eng fill:#1abc9c,color:#fff,stroke:#333
```
| Capability | What Video-MMMU Tests |
|---|---|
| Visual perception | Extracting formulas, charts, diagrams, and handwritten notes from video frames |
| Speech understanding | Transcribing and interpreting spoken lecture content |
| Concept comprehension | Understanding theories and concepts presented in video |
| Quantitative reasoning | Following step-by-step calculations and adapting them to new inputs |
| Knowledge transfer | Applying learned concepts to novel real-world scenarios and exam problems |
| Multi-modal integration | Combining visual, textual, and spoken information from educational videos |
Current Leaderboard
The table below shows model accuracy (%) on Video-MMMU across the three cognitive tracks and six disciplines.
Source: Hu, K. et al. “Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos.” arXiv:2501.13826 (January 2025). Evaluation uses micro-averaged accuracy via LMMs-Eval.
Overall Results by Track
| Model | Overall | Perception | Comprehension | Adaptation |
|---|---|---|---|---|
| Human Expert | 74.44 | 84.33 | 78.67 | 60.33 |
| Claude-3.5-Sonnet | 65.78 | 72.00 | 69.67 | 55.67 |
| GPT-4o | 61.22 | 66.00 | 62.00 | 55.67 |
| Gemini 1.5 Pro | 53.89 | 59.00 | 53.33 | 49.33 |
| Aria | 50.78 | 65.67 | 46.67 | 40.00 |
| Gemini 1.5 Flash | 49.78 | 57.33 | 49.00 | 43.00 |
| LLaVA-Video-72B | 49.67 | 59.67 | 46.00 | 43.33 |
| LLaVA-OneVision-72B | 48.33 | 59.67 | 42.33 | 43.00 |
| MAmmoTH-VL-8B | 41.78 | 51.67 | 40.00 | 33.67 |
| InternVL2-8B | 37.44 | 47.33 | 33.33 | 31.67 |
| LLaVA-Video-7B | 36.11 | 41.67 | 33.33 | 33.33 |
| VILA1.5-40B | 34.00 | 38.67 | 30.67 | 32.67 |
| LLaVA-OneVision-7B | 33.89 | 40.00 | 31.00 | 30.67 |
| Llama-3.2-11B | 30.00 | 35.67 | 32.33 | 22.00 |
| LongVA-7B | 23.98 | 24.00 | 24.33 | 23.67 |
| VILA1.5-8B | 20.89 | 20.33 | 17.33 | 25.00 |
| Random Choice | 14.00 | 12.00 | 14.00 | 16.00 |
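Because each cognitive track contains exactly 300 questions (one per video), the micro-averaged Overall score coincides with the simple mean of the three track accuracies. A quick sanity check of the table, sketched in Python:

```python
def overall_accuracy(perception: float, comprehension: float, adaptation: float) -> float:
    """Overall micro-averaged accuracy when all tracks have the same
    number of questions (300 each in Video-MMMU), i.e. the plain mean."""
    return round((perception + comprehension + adaptation) / 3, 2)

# Reproduce the Overall column for two rows of the leaderboard
print(overall_accuracy(72.00, 69.67, 55.67))  # Claude-3.5-Sonnet -> 65.78
print(overall_accuracy(84.33, 78.67, 60.33))  # Human Expert -> 74.44
```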
Knowledge Acquisition (Δknowledge)
The Δknowledge metric measures the normalized improvement in Adaptation accuracy after watching the video compared to before. It quantifies how much a model actually learns from the video.
| Model | Δknowledge (%) | Wrong→Right Rate (%) | Right→Wrong Rate (%) |
|---|---|---|---|
| Human Expert | 33.1 | 40.4 | 10.7 |
| GPT-4o | 15.6 | 28.0 | 13.3 |
| Claude-3.5-Sonnet | 11.4 | 28.8 | 19.5 |
| VILA-1.5-40B | 9.4 | 25.2 | 45.9 |
| Gemini-1.5-Pro | 8.7 | 29.5 | 24.6 |
| LLaVA-Video-72B | 7.1 | 22.0 | 24.6 |
| LLaVA-OneVision-72B | 6.6 | 20.9 | — |
Key takeaways:
- Human experts achieve a Δknowledge of 33.1%, demonstrating strong video-based learning with a high Wrong→Right rate (40.4%) and low Right→Wrong rate (10.7%)
- The best model (GPT-4o) achieves only 15.6% — less than half the human knowledge gain
- Models show a troubling pattern: moderate Wrong→Right rates but high Right→Wrong rates (e.g., VILA-1.5-40B: 45.9%), meaning they forget correct answers after watching videos
- Performance declines steeply from Perception to Comprehension to Adaptation across all models, mirroring the increasing cognitive demand of each stage
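The Wrong→Right and Right→Wrong rates above can be derived from paired pre-/post-video correctness on the Adaptation questions. A minimal sketch (the boolean-list representation is an illustrative assumption, not the paper's code):

```python
def transition_rates(pre_correct: list[bool], post_correct: list[bool]) -> tuple[float, float]:
    """Return (wrong_to_right, right_to_wrong) rates in percent.

    pre_correct / post_correct are parallel per-question flags: was the
    Adaptation question answered correctly before / after watching the video.
    """
    wrong_before = [i for i, ok in enumerate(pre_correct) if not ok]
    right_before = [i for i, ok in enumerate(pre_correct) if ok]
    w2r = 100.0 * sum(post_correct[i] for i in wrong_before) / len(wrong_before) if wrong_before else 0.0
    r2w = 100.0 * sum(not post_correct[i] for i in right_before) / len(right_before) if right_before else 0.0
    return w2r, r2w

# Two of four initially-wrong answers fixed, one of two initially-right lost
w2r, r2w = transition_rates(
    [False, False, False, False, True, True],
    [True,  True,  False, False, True, False],
)
print(w2r, r2w)  # 50.0 50.0
```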
For the latest leaderboard, visit the resources in the next section.
Where to Explore the Benchmark
Dashboards and Leaderboard
| Resource | Description | Link |
|---|---|---|
| HuggingFace Leaderboard | Interactive leaderboard with model submissions | huggingface.co/spaces/Video-MMMU/Video-MMMU-Leaderboard |
| Project Website | Official website with examples and visualizations | videommmu.github.io |
Dataset and Code
| Resource | Description | Link |
|---|---|---|
| HuggingFace Dataset | 300 videos and 900 annotated questions | huggingface.co/datasets/Video-MMMU/Video-MMMU |
| arXiv Paper | Full technical paper with methodology and analysis | arxiv.org/abs/2501.13826 |
| LMMs-Eval | Evaluation framework used for benchmarking | github.com/EvolvingLMMs-Lab/lmms-eval |
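Benchmarking runs through the LMMs-Eval framework. A hypothetical invocation is shown below; the task name `videommmu`, the model identifier, and the checkpoint path are assumptions that should be verified against the LMMs-Eval documentation before use:

```shell
# Illustrative only -- confirm the task name and model args in the
# LMMs-Eval docs; they may differ from this sketch.
python -m lmms_eval \
    --model llava_onevision \
    --model_args pretrained=lmms-lab/llava-onevision-qwen2-7b-ov \
    --tasks videommmu \
    --batch_size 1 \
    --log_samples \
    --output_path ./logs/
```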
Understanding the Metrics
Micro-Averaged Accuracy
The primary metric. Models receive a video and a question as input. Responses are evaluated by an automated, rule-based pipeline that uses regular expressions to extract option letters and numerical values; responses lacking a valid answer are marked incorrect.
Δknowledge (Knowledge Gain)
The signature metric of Video-MMMU. It measures how much a model’s performance on Adaptation questions improves after watching the video:
$$\Delta_{\text{knowledge}} = \frac{Acc_{\text{post}} - Acc_{\text{pre}}}{100\% - Acc_{\text{pre}}} \times 100\%$$

where $Acc_{\text{pre}}$ and $Acc_{\text{post}}$ are the Adaptation accuracy before and after watching the video. This normalized metric accounts for baseline difficulty — improving from 90% to 95% (Δknowledge = 50%) indicates more substantial learning than improving from 0% to 5% (Δknowledge = 5%).
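The definition translates directly into code; a minimal sketch (the zero-headroom guard is an added assumption, not specified in the paper):

```python
def delta_knowledge(acc_pre: float, acc_post: float) -> float:
    """Normalized knowledge gain (%) on Adaptation questions:
    improvement measured relative to the remaining headroom."""
    if acc_pre >= 100.0:
        return 0.0  # no headroom left; guard against division by zero
    return (acc_post - acc_pre) / (100.0 - acc_pre) * 100.0

print(delta_knowledge(90.0, 95.0))  # 50.0
print(delta_knowledge(0.0, 5.0))    # 5.0
```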
Error Analysis
An analysis of 100 randomly sampled Claude-3.5-Sonnet errors on the Adaptation track reveals that most failures stem from an inability to adapt learned methods rather than from misunderstanding the material:
| Error Type | Proportion |
|---|---|
| Method Adaptation Error | 64% |
| Question Misreading Error | 15% |
| Method Selection Error | 8% |
| Answer Extraction Error | 5% |
| Refuse to Answer | 4% |
| Annotation Error | 4% |
Why Video-MMMU Matters
```mermaid
graph LR
A["Existing benchmarks treat<br/>video as visual scene"] --> B["No evaluation of<br/>learning from video"]
B --> C["Video-MMMU fills<br/>the gap"]
C --> D["Measures knowledge<br/>acquisition capability"]
A2["Models score well on<br/>perception tasks"] --> B2["Performance collapses<br/>on adaptation"]
B2 --> C
C --> D2["Drives research toward<br/>better video learning"]
style A fill:#e74c3c,color:#fff,stroke:#333
style A2 fill:#e74c3c,color:#fff,stroke:#333
style C fill:#27ae60,color:#fff,stroke:#333
style D fill:#3498db,color:#fff,stroke:#333
style D2 fill:#3498db,color:#fff,stroke:#333
```
- First knowledge-acquisition benchmark for video — uniquely treats video as an educational medium rather than a visual scene
- Grounded in cognitive science — three stages aligned with Bloom’s taxonomy provide structured, interpretable evaluation
- Δknowledge metric — quantifies actual learning gain, revealing a 2× gap between humans and the best models
- Exposes a fundamental limitation — 64% of Adaptation errors are Method Adaptation Errors, showing models understand but cannot apply knowledge
- Multi-disciplinary rigor — 30 subjects across 6 disciplines with expert-curated, college-level content
Conclusion
Video-MMMU reveals a fundamental gap in how current AI models learn from video:
- 300 expert-level videos and 900 questions across 6 disciplines and 30 subjects provide rigorous, multi-disciplinary evaluation
- Performance drops steeply across stages even for the best model, Claude-3.5-Sonnet: Perception (72.00%) → Comprehension (69.67%) → Adaptation (55.67%), exposing the challenge of deeper cognitive processing
- Humans achieve Δknowledge of 33.1% while the best model (GPT-4o) reaches only 15.6% — less than half the human learning gain
- Models exhibit a troubling Right→Wrong pattern: watching videos causes them to forget previously correct answers, unlike humans who retain prior knowledge
- 64% of errors in the Adaptation track are Method Adaptation Errors — models can recall knowledge from the video but fail to flexibly apply it to new problems
As LMMs increasingly need to learn from real-world video content — lectures, tutorials, demonstrations — Video-MMMU provides the first systematic measure of this critical capability. Closing the human-model gap on knowledge acquisition from video remains a significant open challenge.
References
- Hu, K., Wu, P., Pu, F., Xiao, W., Zhang, Y., Yue, X., Li, B., & Liu, Z. “Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos.” arXiv preprint arXiv:2501.13826 (2025). arxiv.org/abs/2501.13826
- Video-MMMU Project Website. videommmu.github.io
- Video-MMMU Dataset. HuggingFace. huggingface.co/datasets/Video-MMMU/Video-MMMU
- Video-MMMU Leaderboard. HuggingFace Spaces. huggingface.co/spaces/Video-MMMU/Video-MMMU-Leaderboard
- LMMs-Eval. github.com/EvolvingLMMs-Lab/lmms-eval
- Yue, X. et al. “MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI.” CVPR (2024). arxiv.org/abs/2311.16502
Read More
- Explore the image-based predecessor — see MMMU-Pro
- Track model costs when running evaluations — see FinOps Best Practices for LLM Applications
- Deploy models for running your own evaluations — see Deploying and Serving LLM with vLLM
- Scale your evaluation infrastructure — see Scaling LLM Serving for Enterprise Production