Video-MMMU

A multi-discipline benchmark evaluating how large multimodal models (LMMs) acquire knowledge from professional videos through three cognitive stages: Perception, Comprehension, and Adaptation

Published

September 16, 2025

Keywords: Video-MMMU, video understanding, knowledge acquisition, multi-modal benchmark, LMM evaluation, Bloom’s taxonomy, Perception, Comprehension, Adaptation, Δknowledge, educational video, multi-discipline, NTU, S-Lab

Introduction

Large Multimodal Models (LMMs) can now analyze images, generate code, and reason through complex problems. But can they actually learn from watching a video — the way a student absorbs a lecture, grasps the underlying concepts, and applies that knowledge to solve a new exam problem?

Video-MMMU (Massive Multi-discipline Multimodal Understanding) is a benchmark designed to answer exactly that question. It evaluates LMMs’ ability to acquire and utilize knowledge from educational videos across six professional disciplines — going far beyond simple scene description or action recognition. Grounded in Bloom’s taxonomy, it tests three progressive cognitive stages: perceiving information, comprehending knowledge, and adapting knowledge to novel scenarios.

“Humans acquire knowledge through three cognitive stages: perceiving information, comprehending knowledge, and adapting knowledge to solve novel problems. Videos serve as an effective medium for this learning process.” — Hu et al., arXiv:2501.13826

```mermaid
graph LR
    A["Existing Video Benchmarks<br/>(Video-MME, MVBench, etc.)<br/>Scene understanding"] --> B["Videos treated as<br/>visual scenes"]
    B --> C["Video-MMMU<br/>300 videos × 900 questions<br/>Videos as learning medium"]
    C --> D["Evaluates knowledge<br/>acquisition capability"]

    style A fill:#e74c3c,stroke:#333,color:#fff
    style B fill:#f39c12,stroke:#333,color:#fff
    style C fill:#27ae60,stroke:#333,color:#fff
    style D fill:#3498db,stroke:#333,color:#fff
```

What Is Video-MMMU?

Video-MMMU is a multi-modal, multi-disciplinary benchmark consisting of 300 expert-level educational videos and 900 human-annotated questions spanning 6 disciplines and 30 subjects. Each video is paired with three questions — one for each cognitive stage — creating a systematic framework for evaluating whether models can truly learn from video content.

The benchmark features two types of videos:

  • Concept-Introduction Videos — comprehensive explanations of factual knowledge, fundamental concepts, and theories
  • Problem-Solving Videos — step-by-step demonstrations of solutions, particularly in STEM disciplines

Key Characteristics

| Feature | Details |
| --- | --- |
| Total videos | 300 college-level educational videos |
| Total questions | 900 (3 per video, one per cognitive stage) |
| Disciplines | 6 (Art, Business, Science, Medicine, Humanities, Engineering) |
| Subjects | 30, distributed across the disciplines |
| Avg. video duration | 506.2 seconds (~8.4 minutes) |
| Avg. question length | 75.7 words |
| MCQ options | 10 per question |
| Knowledge metric | Δknowledge (normalized performance gain after video viewing) |
| Evaluation framework | LMMs-Eval |
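
For a quick look at the data, the questions can be pulled from the HuggingFace Hub with the `datasets` library. The sketch below is illustrative only: the split name and record fields (`question`, `options`) are assumptions based on the benchmark description above, so check the dataset card for the actual schema.

```python
# Minimal sketch: browse Video-MMMU records from the HuggingFace Hub.
# Assumptions (verify against the dataset card): the split is named "test"
# and records carry fields like "question" and "options".
from datasets import load_dataset

ds = load_dataset("Video-MMMU/Video-MMMU", split="test")
print(len(ds))                 # expected: 900 questions (3 per video)

record = ds[0]
print(record.get("question"))  # question text (avg. 75.7 words)
print(record.get("options"))   # MCQ options (10 per question)
```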

The Three Cognitive Stages

Aligned with Bloom’s taxonomy, Video-MMMU evaluates knowledge acquisition through three progressively harder stages:

```mermaid
graph TD
    V["Educational Video<br/>300 expert-level videos"] --> P["Stage 1: Perception<br/>Identify key information<br/>(OCR, ASR)"]
    P --> C["Stage 2: Comprehension<br/>Understand underlying concepts<br/>(Concept & Problem-solving)"]
    C --> A["Stage 3: Adaptation<br/>Apply knowledge to novel scenarios<br/>(Case Study & Strategy)"]
    A --> DK["Δknowledge Metric<br/>Quantifies learning gain"]

    style V fill:#8e44ad,color:#fff,stroke:#333
    style P fill:#27ae60,color:#fff,stroke:#333
    style C fill:#f39c12,color:#fff,stroke:#333
    style A fill:#e74c3c,color:#fff,stroke:#333
    style DK fill:#3498db,color:#fff,stroke:#333
```

| Stage | What It Tests | Question Types |
| --- | --- | --- |
| Perception | Identifying key information from the video | OCR (formulas, charts, handwritten notes), ASR (speech transcription) |
| Comprehension | Understanding the presented knowledge | Concept Comprehension (MAMC format), Problem-solving Strategy Comprehension |
| Adaptation | Applying knowledge to new scenarios | Case Study Analysis (novel real-world scenarios), Problem-solving Strategy Adaptation |
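
Put differently, one benchmark item bundles a single video with exactly three questions, one per stage. A hypothetical record shape (the field names are illustrative, not the dataset's actual schema):

```python
from dataclasses import dataclass

# Illustrative shape of one Video-MMMU item: a single video paired with
# three questions, one per cognitive stage. Field names are hypothetical.
@dataclass
class VideoMMMUItem:
    video_id: str
    discipline: str        # one of the 6 disciplines
    subject: str           # one of the 30 subjects
    perception_q: dict     # OCR- or ASR-style question
    comprehension_q: dict  # concept or strategy comprehension
    adaptation_q: dict     # case study or strategy adaptation
```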

Who Built It?

Video-MMMU was developed by researchers at S-Lab, Nanyang Technological University (NTU) and Carnegie Mellon University (CMU):

  • Kairui Hu, Penghao Wu, Fanyi Pu, Wang Xiao, Yuanhan Zhang, Bo Li, Ziwei Liu — S-Lab, NTU
  • Xiang Yue — Carnegie Mellon University

The project builds on MMMU and MMMU-Pro, established multi-modal benchmarks from an overlapping author team, and extends their evaluation from static images to video-based knowledge acquisition.

| Resource | Link |
| --- | --- |
| Project Website | videommmu.github.io |
| arXiv Paper | arxiv.org/abs/2501.13826 |
| HuggingFace Dataset | huggingface.co/datasets/Video-MMMU/Video-MMMU |
| HuggingFace Leaderboard | huggingface.co/spaces/Video-MMMU/Video-MMMU-Leaderboard |

What Skills Does It Test?

Video-MMMU covers 30 subjects across 6 professional disciplines, each requiring domain-specific expertise:

```mermaid
graph TD
    VM["Video-MMMU<br/>300 videos · 900 questions"] --> Art["Art<br/>Art History, Art Theory,<br/>Design, Music"]
    VM --> Biz["Business<br/>Accounting, Economics,<br/>Finance, Management, Marketing"]
    VM --> Sci["Science<br/>Biology, Chemistry,<br/>Geography, Math, Physics"]
    VM --> Med["Medicine<br/>Basic Medical Science,<br/>Clinical Medicine, Pharmacy"]
    VM --> Hum["Humanities<br/>History, Literature,<br/>Psychology, Sociology"]
    VM --> Eng["Engineering<br/>Computer Science, Electronics,<br/>Architecture, Materials"]

    style VM fill:#e74c3c,color:#fff,stroke:#333
    style Art fill:#9b59b6,color:#fff,stroke:#333
    style Biz fill:#3498db,color:#fff,stroke:#333
    style Sci fill:#27ae60,color:#fff,stroke:#333
    style Med fill:#e67e22,color:#fff,stroke:#333
    style Hum fill:#f39c12,color:#fff,stroke:#333
    style Eng fill:#1abc9c,color:#fff,stroke:#333
```

| Capability | What Video-MMMU Tests |
| --- | --- |
| Visual perception | Extracting formulas, charts, diagrams, and handwritten notes from video frames |
| Speech understanding | Transcribing and interpreting spoken lecture content |
| Concept comprehension | Understanding theories and concepts presented in video |
| Quantitative reasoning | Following step-by-step calculations and adapting them to new inputs |
| Knowledge transfer | Applying learned concepts to novel real-world scenarios and exam problems |
| Multi-modal integration | Combining visual, textual, and spoken information from educational videos |

Current Leaderboard

The table below shows model accuracy (%) on Video-MMMU across the three cognitive tracks and six disciplines.

Source: Hu, K. et al. “Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos.” arXiv:2501.13826 (January 2025). Evaluation uses micro-averaged accuracy via LMMs-Eval.

Overall Results by Track

| Model | Overall | Perception | Comprehension | Adaptation |
| --- | --- | --- | --- | --- |
| Human Expert | 74.44 | 84.33 | 78.67 | 60.33 |
| Claude-3.5-Sonnet | 65.78 | 72.00 | 69.67 | 55.67 |
| GPT-4o | 61.22 | 66.00 | 62.00 | 55.67 |
| Gemini 1.5 Pro | 53.89 | 59.00 | 53.33 | 49.33 |
| Aria | 50.78 | 65.67 | 46.67 | 40.00 |
| Gemini 1.5 Flash | 49.78 | 57.33 | 49.00 | 43.00 |
| LLaVA-Video-72B | 49.67 | 59.67 | 46.00 | 43.33 |
| LLaVA-OneVision-72B | 48.33 | 59.67 | 42.33 | 43.00 |
| MAmmoTH-VL-8B | 41.78 | 51.67 | 40.00 | 33.67 |
| InternVL2-8B | 37.44 | 47.33 | 33.33 | 31.67 |
| LLaVA-Video-7B | 36.11 | 41.67 | 33.33 | 33.33 |
| VILA1.5-40B | 34.00 | 38.67 | 30.67 | 32.67 |
| LLaVA-OneVision-7B | 33.89 | 40.00 | 31.00 | 30.67 |
| Llama-3.2-11B | 30.00 | 35.67 | 32.33 | 22.00 |
| LongVA-7B | 23.98 | 24.00 | 24.33 | 23.67 |
| VILA1.5-8B | 20.89 | 20.33 | 17.33 | 25.00 |
| Random Choice | 14.00 | 12.00 | 14.00 | 16.00 |

Knowledge Acquisition (Δknowledge)

The Δknowledge metric measures the normalized improvement in Adaptation accuracy after watching the video compared to before. It quantifies how much a model actually learns from the video.

| Model | Δknowledge (%) | Wrong→Right Rate (%) | Right→Wrong Rate (%) |
| --- | --- | --- | --- |
| Human Expert | 33.1 | 40.4 | 10.7 |
| GPT-4o | 15.6 | 28.0 | 13.3 |
| Claude-3.5-Sonnet | 11.4 | 28.8 | 19.5 |
| VILA-1.5-40B | 9.4 | 25.2 | 45.9 |
| Gemini-1.5-Pro | 8.7 | 29.5 | 24.6 |
| LLaVA-Video-72B | 7.1 | 22.0 | 24.6 |
| LLaVA-OneVision-72B | 6.6 | 20.9 | |

Key takeaways:

  • Human experts achieve a Δknowledge of 33.1%, demonstrating strong video-based learning with a high Wrong→Right rate (40.4%) and low Right→Wrong rate (10.7%)
  • The best model (GPT-4o) achieves only 15.6% — less than half the human knowledge gain
  • Models show a troubling pattern: moderate Wrong→Right rates but high Right→Wrong rates (e.g., VILA-1.5-40B: 45.9%), meaning they lose previously correct answers after watching videos (see the sketch below for how these rates are computed)
  • Performance declines steeply from Perception → Comprehension → Adaptation for nearly all models
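
These transition rates follow directly from paired pre-/post-video results on the Adaptation questions. A sketch of the computation (function and variable names are mine, not from the paper's code):

```python
# Sketch: Wrong→Right and Right→Wrong rates from paired results.
# pre[i] / post[i]: whether Adaptation question i was answered correctly
# before / after watching the video. Names are illustrative.
def transition_rates(pre: list[bool], post: list[bool]) -> tuple[float, float]:
    wrong_before = [i for i, ok in enumerate(pre) if not ok]
    right_before = [i for i, ok in enumerate(pre) if ok]
    # Of the questions missed before the video, how many are fixed after it?
    w2r = sum(post[i] for i in wrong_before) / max(len(wrong_before), 1)
    # Of the questions solved before the video, how many are lost after it?
    r2w = sum(not post[i] for i in right_before) / max(len(right_before), 1)
    return 100 * w2r, 100 * r2w
```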

For the latest leaderboard, visit the resources in the next section.

Where to Explore the Benchmark

Dashboards and Leaderboard

| Resource | Description | Link |
| --- | --- | --- |
| HuggingFace Leaderboard | Interactive leaderboard with model submissions | huggingface.co/spaces/Video-MMMU/Video-MMMU-Leaderboard |
| Project Website | Official website with examples and visualizations | videommmu.github.io |

Dataset and Code

| Resource | Description | Link |
| --- | --- | --- |
| HuggingFace Dataset | 300 videos and 900 annotated questions | huggingface.co/datasets/Video-MMMU/Video-MMMU |
| arXiv Paper | Full technical paper with methodology and analysis | arxiv.org/abs/2501.13826 |
| LMMs-Eval | Evaluation framework used for benchmarking | github.com/EvolvingLMMs-Lab/lmms-eval |

Understanding the Metrics

Micro-Averaged Accuracy

The primary metric. Models receive a video and a question as input. Responses are evaluated by an automated, rule-based pipeline that uses regular expressions to extract option letters and numerical values; responses lacking a valid answer are marked incorrect.
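
The exact rules live in the LMMs-Eval codebase; the snippet below is a simplified stand-in that shows the general idea of regex-based option extraction (the patterns are mine, not the benchmark's own).

```python
import re

# Simplified stand-in for the rule-based answer-extraction step.
# Video-MMMU MCQs have up to 10 options, hence the A-J letter range.
def extract_option(response: str) -> str | None:
    # Patterns like "Answer: C" or "the answer is (C)" ...
    m = re.search(r"[Aa]nswer\s*(?:is|:)?\s*\(?([A-J])\)?", response)
    if m:
        return m.group(1)
    # ... or a bare option letter such as "C" or "(C)."
    m = re.fullmatch(r"\s*\(?([A-J])\)?\s*\.?\s*", response)
    return m.group(1) if m else None  # None is scored as incorrect

print(extract_option("The answer is (C)."))  # -> C
print(extract_option("I am not sure."))      # -> None (marked incorrect)
```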

Δknowledge (Knowledge Gain)

The signature metric of Video-MMMU. It measures how much a model’s performance on Adaptation questions improves after watching the video:

$$
\Delta_{\text{knowledge}} = \frac{Acc_{\text{post}} - Acc_{\text{pre}}}{100\% - Acc_{\text{pre}}} \times 100\%
$$

where $Acc_{\text{pre}}$ and $Acc_{\text{post}}$ are the Adaptation accuracies before and after watching the video. This normalized metric accounts for baseline difficulty: improving from 90% to 95% (Δknowledge = 50%) indicates more substantial learning than improving from 0% to 5% (Δknowledge = 5%).
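
As a sanity check on those numbers, the formula is a one-liner (my implementation, not the paper's code):

```python
# Normalized knowledge gain: the fraction of remaining headroom
# (100% minus pre-video accuracy) that is closed after watching the video.
def delta_knowledge(acc_pre: float, acc_post: float) -> float:
    return (acc_post - acc_pre) / (100.0 - acc_pre) * 100.0

print(delta_knowledge(90.0, 95.0))  # 50.0: half the headroom is closed
print(delta_knowledge(0.0, 5.0))    # 5.0: only 5% of the headroom
```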

Error Analysis

An analysis of 100 randomly sampled Claude-3.5-Sonnet errors in the Adaptation track reveals that most failures stem from inability to adapt learned methods, not from misunderstanding:

| Error Type | Proportion |
| --- | --- |
| Method Adaptation Error | 64% |
| Question Misreading Error | 15% |
| Method Selection Error | 8% |
| Answer Extraction Error | 5% |
| Refuse to Answer | 4% |
| Annotation Error | 4% |

Why Video-MMMU Matters

```mermaid
graph LR
    A["Existing benchmarks treat<br/>video as visual scene"] --> B["No evaluation of<br/>learning from video"]
    B --> C["Video-MMMU fills<br/>the gap"]
    C --> D["Measures knowledge<br/>acquisition capability"]

    A2["Models score well on<br/>perception tasks"] --> B2["Performance collapses<br/>on adaptation"]
    B2 --> C
    C --> D2["Drives research toward<br/>better video learning"]

    style A fill:#e74c3c,color:#fff,stroke:#333
    style A2 fill:#e74c3c,color:#fff,stroke:#333
    style C fill:#27ae60,color:#fff,stroke:#333
    style D fill:#3498db,color:#fff,stroke:#333
    style D2 fill:#3498db,color:#fff,stroke:#333
```

  1. First knowledge-acquisition benchmark for video — uniquely treats video as an educational medium rather than a visual scene
  2. Grounded in cognitive science — three stages aligned with Bloom’s taxonomy provide structured, interpretable evaluation
  3. Δknowledge metric — quantifies actual learning gain, revealing a 2× gap between humans and the best models
  4. Exposes a fundamental limitation — 64% of Adaptation errors are Method Adaptation Errors, showing models understand but cannot apply knowledge
  5. Multi-disciplinary rigor — 30 subjects across 6 disciplines with expert-curated, college-level content

Conclusion

Video-MMMU reveals a fundamental gap in how current AI models learn from video:

  • 300 expert-level videos and 900 questions across 6 disciplines and 30 subjects provide rigorous, multi-disciplinary evaluation
  • Performance of the best model (Claude-3.5-Sonnet) drops steeply from Perception (72.00) → Comprehension (69.67) → Adaptation (55.67), exposing the challenge of deeper cognitive processing
  • Humans achieve Δknowledge of 33.1% while the best model (GPT-4o) reaches only 15.6% — less than half the human learning gain
  • Models exhibit a troubling Right→Wrong pattern: watching videos causes them to forget previously correct answers, unlike humans who retain prior knowledge
  • 64% of errors in the Adaptation track are Method Adaptation Errors — models can recall knowledge from the video but fail to flexibly apply it to new problems

As LMMs increasingly need to learn from real-world video content — lectures, tutorials, demonstrations — Video-MMMU provides the first systematic measure of this critical capability. Closing the human-model gap on knowledge acquisition from video remains a significant open challenge.

References

Hu, Kairui, Penghao Wu, Fanyi Pu, Wang Xiao, Yuanhan Zhang, Xiang Yue, Bo Li, and Ziwei Liu. "Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos." arXiv preprint arXiv:2501.13826 (2025).