```mermaid
graph LR
A["MMLU<br/>(English only)<br/>57 subjects"] --> B["Limited to<br/>English-speaking<br/>evaluation"]
B --> C["MMMLU<br/>14 languages<br/>Human-translated"]
C --> D["True multilingual<br/>knowledge<br/>assessment"]
style A fill:#e74c3c,stroke:#333,color:#fff
style B fill:#f39c12,stroke:#333,color:#fff
style C fill:#27ae60,stroke:#333,color:#fff
style D fill:#3498db,stroke:#333,color:#fff
```
MMMLU (Multilingual MMLU)
Testing LLMs across 14 languages and 57 subjects — OpenAI’s professionally human-translated benchmark for multilingual knowledge and reasoning
Keywords: MMMLU, Multilingual MMLU, multilingual benchmark, LLM evaluation, MMLU, OpenAI, multilingual reasoning, low-resource languages, professional translation, cross-lingual evaluation, knowledge assessment, Yoruba, Swahili, Arabic, Bengali

Introduction
Most AI benchmarks evaluate LLMs in English only — but billions of people around the world interact with AI in their native language. How do we know if a model that scores 90% on English knowledge tests can perform equally well in Arabic, Bengali, Swahili, or Yoruba?
MMMLU (Multilingual Massive Multitask Language Understanding) answers this question directly. Created by OpenAI, it takes the widely used MMLU benchmark — 57 subjects spanning elementary to professional-level knowledge — and translates the entire test set into 14 languages using professional human translators. The result is a rigorous, high-quality multilingual evaluation that exposes dramatic performance gaps between high-resource and low-resource languages.
“Relying on human translators for this evaluation increases confidence in the accuracy of the translations, especially for low-resource languages like Yoruba.” — OpenAI MMMLU Dataset
What Is MMMLU?
MMMLU is a multilingual extension of the MMLU (Massive Multitask Language Understanding) benchmark. It contains the complete MMLU test set — covering 57 subjects from elementary mathematics to advanced professional topics like law, medicine, and computer science — professionally translated into 14 languages by human translators.
The benchmark ensures that translations are accurate and culturally appropriate, particularly for low-resource languages where machine translation quality is unreliable. This makes MMMLU the gold standard for evaluating whether LLMs can reason and recall knowledge across linguistic boundaries.
Languages Covered
| Language | Locale | Resource Level |
|---|---|---|
| Arabic | AR_XY | Medium |
| Bengali | BN_BD | Low |
| Chinese (Simplified) | ZH_CN | High |
| French | FR_FR | High |
| German | DE_DE | High |
| Hindi | HI_IN | Medium |
| Indonesian | ID_ID | Medium |
| Italian | IT_IT | High |
| Japanese | JA_JP | High |
| Korean | KO_KR | High |
| Portuguese (Brazil) | PT_BR | High |
| Spanish | ES_LA | High |
| Swahili | SW_KE | Low |
| Yoruba | YO_NG | Low |
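The locale codes above also serve as natural keys for iterating over the benchmark's languages. A minimal sketch (the resource-level grouping simply mirrors the table; it is not an official dataset field):

```python
# Locale codes and language names copied from the table above.
LOCALES = {
    "AR_XY": "Arabic", "BN_BD": "Bengali", "ZH_CN": "Chinese (Simplified)",
    "FR_FR": "French", "DE_DE": "German", "HI_IN": "Hindi",
    "ID_ID": "Indonesian", "IT_IT": "Italian", "JA_JP": "Japanese",
    "KO_KR": "Korean", "PT_BR": "Portuguese (Brazil)", "ES_LA": "Spanish",
    "SW_KE": "Swahili", "YO_NG": "Yoruba",
}

# Low-resource languages, per the table's "Resource Level" column.
LOW_RESOURCE = {"BN_BD", "SW_KE", "YO_NG"}

for code, name in sorted(LOCALES.items()):
    tier = "low-resource" if code in LOW_RESOURCE else "higher-resource"
    print(f"{code}: {name} ({tier})")
```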
Key Characteristics
| Feature | Details |
|---|---|
| Base benchmark | MMLU (57 subjects, ~14,000 test questions) |
| Languages | 14 (professional human translations) |
| Total questions | ~197,000 (14 × ~14,000) |
| Question format | Multiple-choice (4 options) |
| Evaluation | Zero-shot, chain-of-thought |
| Subjects | Elementary math to professional law, medicine, CS |
| Translation quality | Professional human translators (not machine translation) |
| License | MIT |
Who Built It?
MMMLU was created by OpenAI as part of their commitment to improving multilingual AI capabilities. The translations were commissioned using professional human translators — a deliberate choice over machine translation to ensure accuracy, especially for low-resource languages like Yoruba and Swahili.
The original MMLU benchmark that MMMLU builds upon was created by:
- Dan Hendrycks — UC Berkeley (now Center for AI Safety)
- Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika — UC Berkeley
- Dawn Song, Jacob Steinhardt — UC Berkeley
MMLU was published at ICLR 2021 and quickly became one of the most widely used benchmarks in AI evaluation.
Publication and Resources
| Resource | Link |
|---|---|
| MMMLU Dataset | huggingface.co/datasets/openai/MMMLU |
| Evaluation code | github.com/openai/simple-evals |
| Original MMLU paper | arxiv.org/abs/2009.03300 |
| Community Leaderboard | Multilingual MMLU Benchmark Leaderboard |
What Skills Does It Test?
MMMLU tests the same broad spectrum of knowledge and reasoning as MMLU — but crucially, it measures whether models can perform these tasks in non-English languages. This reveals both knowledge depth and cross-lingual transfer capabilities.
```mermaid
graph TD
MMMLU["MMMLU<br/>Multilingual Knowledge"] --> A["STEM<br/>Math, Physics,<br/>Computer Science"]
MMMLU --> B["Humanities<br/>History, Philosophy,<br/>Literature"]
MMMLU --> C["Social Sciences<br/>Economics, Law,<br/>Psychology"]
MMMLU --> D["Professional<br/>Medicine, Law,<br/>Engineering"]
MMMLU --> E["Cross-Lingual<br/>Transfer<br/>Same knowledge,<br/>14 languages"]
MMMLU --> F["Low-Resource<br/>Language<br/>Understanding<br/>Yoruba, Swahili,<br/>Bengali"]
style MMMLU fill:#e74c3c,color:#fff,stroke:#333
style A fill:#3498db,color:#fff,stroke:#333
style B fill:#27ae60,color:#fff,stroke:#333
style C fill:#f39c12,color:#fff,stroke:#333
style D fill:#8e44ad,color:#fff,stroke:#333
style E fill:#e67e22,color:#fff,stroke:#333
style F fill:#6cc3d5,color:#fff,stroke:#333
```
| Capability | What MMMLU Tests |
|---|---|
| Multilingual knowledge recall | Can the model access the same factual knowledge in Arabic as in English? |
| Cross-lingual reasoning | Can the model solve multi-step problems when the question is in Japanese or Hindi? |
| Low-resource language fluency | How much performance degrades for Yoruba, Swahili, and Bengali vs. high-resource languages |
| Subject breadth | 57 subjects from elementary to professional level — same questions across all languages |
| Translation robustness | Whether subtle linguistic differences affect model accuracy |
| Inclusive AI assessment | Real-world readiness for global deployment across diverse language communities |
The 57 MMLU Subject Categories (Grouped)
| Domain | Example Subjects |
|---|---|
| STEM | Abstract Algebra, Astronomy, College Mathematics, Computer Science, Electrical Engineering, Physics |
| Humanities | Formal Logic, Jurisprudence, Moral Disputes, Philosophy, Prehistory, World Religions |
| Social Sciences | Econometrics, Human Sexuality, Marketing, Public Relations, Sociology, US Foreign Policy |
| Professional | Clinical Knowledge, Medical Genetics, Professional Accounting, Professional Law, Professional Medicine |
| Other | Global Facts, Miscellaneous, Nutrition, Virology |
Current Leaderboard
The table below shows the average accuracy across all 14 languages for each model, as published in the official OpenAI MMMLU benchmark results.
Source: OpenAI Simple Evals — MMMLU Results (consulted March 28, 2026). Evaluation uses zero-shot chain-of-thought prompting.
Average Accuracy Across 14 Languages
| Rank | Model | Avg. Accuracy (%) |
|---|---|---|
| 1 | o3 (high) | 88.8 |
| 2 | o1 | 87.7 |
| 3 | o4-mini (high) | 85.2 |
| 4 | GPT-4.5 Preview | 85.1 |
| 5 | GPT-4.1 | 83.7 |
| 6 | GPT-4o (Nov 2024) | 81.4 |
| 7 | o3-mini (high) | 80.7 |
| 8 | GPT-4.1 Mini | 78.5 |
| 9 | GPT-4o Mini | 70.5 |
| 10 | GPT-4.1 Nano | 66.9 |
Performance by Language (Selected Models)
The table below reveals the language performance gap: even for the best model, o3 (high), accuracy ranges from 91.2% (Italian) down to 78.0% (Yoruba).
| Language | o3 (high) | o1 | GPT-4.1 | GPT-4.1 Nano |
|---|---|---|---|---|
| Italian | 91.2% | 89.7% | 86.9% | 73.4% |
| Spanish | 91.1% | 89.9% | 87.6% | 74.8% |
| Portuguese (Brazil) | 91.0% | 89.5% | 87.0% | 74.1% |
| French | 90.6% | 89.3% | 87.0% | 73.9% |
| German | 90.5% | 89.0% | 85.5% | 72.2% |
| Arabic | 90.4% | 89.0% | 84.4% | 65.9% |
| Hindi | 89.8% | 88.3% | 84.2% | 62.9% |
| Indonesian | 89.8% | 88.6% | 85.9% | 71.4% |
| Chinese (Simplified) | 89.3% | 88.9% | 86.1% | 71.0% |
| Korean | 89.3% | 88.2% | 84.9% | 67.9% |
| Japanese | 89.0% | 88.9% | 85.6% | 69.0% |
| Bengali | 87.8% | 87.3% | 82.7% | 58.3% |
| Swahili | 86.0% | 85.4% | 79.5% | 56.6% |
| Yoruba | 78.0% | 75.4% | 64.7% | 45.5% |
Key takeaways:
- Massive gap between high-resource and low-resource languages — o3 (high) scores 91.2% on Italian but only 78.0% on Yoruba, a 13+ point gap
- Yoruba is the hardest language for all models — GPT-4.1 Nano drops to 45.5%, closer to random chance (25%) than to its own high-resource scores
- Reasoning models (o-series) lead the rankings — o3 (high) at 88.8% average outperforms the best non-reasoning model GPT-4.5 Preview at 85.1%
- Smaller models suffer disproportionately on low-resource languages — GPT-4.1 Nano drops 28 points from Italian (73.4%) to Yoruba (45.5%)
- European languages cluster together — Italian, Spanish, Portuguese, French, and German all score within ~1% of each other
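Both the Italian-to-Yoruba gap and the leaderboard's 88.8% average can be recomputed directly from the table's numbers:

```python
# Per-language accuracy (%) for o3 (high), copied from the table above.
o3_high = {
    "Italian": 91.2, "Spanish": 91.1, "Portuguese (Brazil)": 91.0,
    "French": 90.6, "German": 90.5, "Arabic": 90.4, "Hindi": 89.8,
    "Indonesian": 89.8, "Chinese (Simplified)": 89.3, "Korean": 89.3,
    "Japanese": 89.0, "Bengali": 87.8, "Swahili": 86.0, "Yoruba": 78.0,
}

# Gap between the strongest and weakest language.
gap = max(o3_high.values()) - min(o3_high.values())
# Leaderboard score is the unweighted mean across the 14 languages.
avg = sum(o3_high.values()) / len(o3_high)

print(f"High/low gap: {gap:.1f} points")  # 13.2-point Italian-to-Yoruba gap
print(f"Macro-average: {avg:.1f}%")       # matches the 88.8% leaderboard entry
```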
Where to Explore the Benchmark
Dataset and Evaluation
| Resource | Description | Link |
|---|---|---|
| MMMLU Dataset | Full 197K-question dataset across 14 languages on Hugging Face | huggingface.co/datasets/openai/MMMLU |
| Official Results | Benchmark results with scores for all models and languages | github.com/openai/simple-evals |
| Community Leaderboard | Interactive HuggingFace Space for exploring multilingual results | Multilingual MMLU Leaderboard |
Load the Dataset
```python
from datasets import load_dataset

# Load all languages
dataset = load_dataset("openai/MMMLU", split="test")

# Load a specific language
dataset_fr = load_dataset("openai/MMMLU", "FR_FR", split="test")
```
Understanding the Metric
Accuracy (Zero-Shot Chain-of-Thought)
Models are evaluated using zero-shot chain-of-thought prompting — no few-shot examples, no role-playing prompts. The model receives a multiple-choice question in the target language and must select the correct answer (A, B, C, or D).
| Approach | Description |
|---|---|
| Zero-shot | No examples provided — tests raw capability |
| Chain-of-thought | Model can reason step by step before answering |
| Per-language scoring | Accuracy computed separately for each of the 14 languages |
| Average score | Mean accuracy across all 14 languages |
Why Professional Human Translation Matters
```mermaid
graph LR
A["Machine Translation<br/>Errors in low-resource<br/>languages"] --> B["Unreliable<br/>benchmark<br/>scores"]
C["Professional Human<br/>Translation<br/>(MMMLU approach)"] --> D["Accurate, culturally<br/>appropriate<br/>evaluation"]
style A fill:#e74c3c,color:#fff,stroke:#333
style B fill:#f39c12,color:#fff,stroke:#333
style C fill:#27ae60,color:#fff,stroke:#333
style D fill:#3498db,color:#fff,stroke:#333
```
Machine-translated benchmarks often contain errors that disproportionately affect low-resource languages, making it unclear whether poor performance reflects model weakness or translation quality. By using professional human translators, MMMLU isolates the variable being tested: the model’s actual multilingual capability.
Why MMMLU Matters
```mermaid
graph LR
A["English-only<br/>benchmarks"] --> C["MMMLU<br/>14 languages<br/>human-translated"]
B["Machine-translated<br/>benchmarks<br/>(unreliable)"] --> C
C --> D["True multilingual<br/>AI performance<br/>measurement"]
C --> E["Exposes gaps<br/>for underserved<br/>language communities"]
style A fill:#e74c3c,color:#fff,stroke:#333
style B fill:#e74c3c,color:#fff,stroke:#333
style C fill:#27ae60,color:#fff,stroke:#333
style D fill:#3498db,color:#fff,stroke:#333
style E fill:#3498db,color:#fff,stroke:#333
```
- Exposes the multilingual gap — Even the best model drops 13+ points between its strongest and weakest language, revealing that “multilingual” models are far from language-equitable
- High-quality human translations — Professional translators ensure the benchmark tests the model, not the translation quality
- Low-resource language visibility — Yoruba, Swahili, and Bengali scores expose the real-world readiness (or lack thereof) of LLMs for billions of speakers
- 57-subject breadth — Tests knowledge and reasoning across the full academic spectrum, not just narrow domains
- Practical deployment signal — Organizations deploying AI globally need to know exactly how much performance they lose in each language
Video: MMMLU Explained
Please subscribe to the Vectoring AI YouTube channel for more video tutorials 🚀
Conclusion
MMMLU sets the standard for multilingual AI evaluation:
- 14 languages, 57 subjects, ~197,000 questions — the most comprehensive professionally translated multilingual knowledge benchmark
- Professional human translators ensure accuracy, especially for low-resource languages where machine translation fails
- The best model (o3 high) averages 88.8% — but drops to just 78.0% on Yoruba, exposing a 13+ point multilingual gap
- Smaller models suffer disproportionately — GPT-4.1 Nano scores 73.4% on Italian but just 45.5% on Yoruba, a 28-point drop
- Low-resource languages need urgent attention — Yoruba and Swahili trail high-resource languages for every model, by about 5 points for frontier models and up to 28 points for the smallest
As AI goes global, MMMLU provides the essential reality check: how well does your model actually work for the world’s diverse language communities? For most languages, the answer is “significantly worse than English” — and for low-resource languages, the gap is alarming.
References
- OpenAI. “Multilingual Massive Multitask Language Understanding (MMMLU).” Hugging Face Dataset. huggingface.co/datasets/openai/MMMLU
- OpenAI. “Simple Evals — MMMLU Benchmark Results.” GitHub. github.com/openai/simple-evals
- Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., & Steinhardt, J. “Measuring Massive Multitask Language Understanding.” ICLR 2021. arXiv:2009.03300 (2020). arxiv.org/abs/2009.03300
Read More
- Explore the hardest AI benchmark ever built — see Humanity’s Last Exam (HLE)
- Test graduate-level science reasoning — see GPQA Diamond
- Measure abstract reasoning and fluid intelligence — see ARC-AGI-2
- Evaluate mathematical reasoning across competitions — see MathArena
- Assess competitive programming skills — see LiveCodeBench Pro
- Evaluate chart and figure understanding — see CharXiv Reasoning