graph LR
A["Existing benchmarks<br/>translated from English<br/>culturally biased"] --> B["No way to measure<br/>everyday knowledge<br/>across 100+ languages"]
B --> C["Global PIQA<br/>116 languages, 65 countries<br/>culturally-specific examples"]
C --> D["Reveals multilingual<br/>performance gaps<br/>in everyday reasoning"]
style A fill:#e74c3c,stroke:#333,color:#fff
style B fill:#f39c12,stroke:#333,color:#fff
style C fill:#27ae60,stroke:#333,color:#fff
style D fill:#3498db,stroke:#333,color:#fff
Global PIQA
A participatory physical commonsense reasoning benchmark spanning 116 languages and 65 countries, revealing how LLMs struggle with everyday knowledge across the world’s cultures
Keywords: Global PIQA, physical commonsense reasoning, multilingual benchmark, culturally-specific evaluation, LLM evaluation, low-resource languages, multilingual NLP, PIQA, MRL workshop, EleutherAI, UC San Diego

Introduction
When you ask someone how to keep a loaf of bread fresh, the answer depends on where they live. In Finland, the answer might involve a leivänjäähdytysrauta (bread cooling rack). In Nigeria, it could involve wrapping food in banana leaves. In Japan, it might involve nori storage techniques. Physical commonsense — the everyday knowledge of how objects and materials behave — is deeply tied to culture, language, and geography.
Yet nearly all LLM benchmarks are built in English, translated from English, or designed around Western cultural contexts. This means we have almost no way to measure whether LLMs understand everyday physical knowledge for the vast majority of the world’s languages and cultures.
Global PIQA fills this gap. It is a participatory benchmark built by hand by 335 researchers from 65 countries, covering 116 language varieties across five continents, 14 language families, and 23 writing systems. Unlike translated benchmarks, Global PIQA examples are written natively in each language, with nearly 60% referencing local foods, customs, traditions, or culturally-specific elements.
The results are striking: while the best model scores 91.7% overall, performance drops to as low as 60% for some languages — and Sub-Saharan African languages see a 15-point accuracy gap compared to Western European languages.
What Is Global PIQA?
Global PIQA is a culturally-specific physical commonsense reasoning benchmark for over 100 languages. It follows the format of the original English PIQA (Bisk et al., 2020): each example consists of a prompt (goal or question) and two candidate solutions — one correct and one incorrect. The correct solution requires physical commonsense reasoning to identify.
What makes Global PIQA unique is that examples are not translated from English. Instead, native speakers from around the world created examples directly in their own languages, referencing local foods, customs, objects, traditions, and everyday activities that would not appear in a translated benchmark.
Key Characteristics
| Feature | Details |
|---|---|
| Total examples | 11,600 (official split: 100 per language × 116 languages) |
| Full unsampled dataset | 27,000+ examples across all languages |
| Languages | 116 language varieties (101 unique ISO 639-3 codes) |
| Coverage | 5 continents, 14 language families, 23 writing systems |
| Contributors | 335 researchers from 65 countries, 173 affiliations |
| Culturally-specific | 59.9% of examples reference local foods, customs, traditions |
| Human-written | 96.5% of examples created without LLM assistance |
| Validation | All examples verified by at least 1 native speaker; 72.9% by multiple |
| Format | Prompt + 2 candidate solutions (binary choice, 50% chance) |
| Evaluation | Accuracy (prompted format for instruction-tuned models; completion-style scoring for pretrained models; see the sketch below) |
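To make the completion-style protocol concrete, here is a minimal sketch of scoring one example with a Hugging Face causal LM: the model's prediction is whichever candidate solution receives the higher log-likelihood given the prompt. The model name, example text, and field names (prompt, solution0, solution1, label) are illustrative assumptions, not the official dataset schema or evaluation code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model; any causal LM from the Hugging Face Hub works the same way.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def solution_logprob(prompt: str, solution: str) -> float:
    """Sum of log-probabilities of the solution tokens, conditioned on the prompt."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + " " + solution, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)   # predictions for tokens 1..N-1
    targets = full_ids[:, 1:]
    token_logprobs = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Approximation: assumes the prompt tokenization is a prefix of the full tokenization.
    n_prompt = prompt_ids.shape[1]
    return token_logprobs[0, n_prompt - 1:].sum().item()    # keep only the solution tokens

# Hypothetical example in the PIQA format: one prompt, two candidate solutions.
example = {
    "prompt": "To keep freshly baked rye bread from going soggy,",
    "solution0": "let it cool on a rack before putting it in a bread box.",
    "solution1": "seal it in a plastic bag while it is still hot.",
    "label": 0,
}
scores = [solution_logprob(example["prompt"], example[f"solution{i}"]) for i in (0, 1)]
prediction = int(scores[1] > scores[0])
print("correct" if prediction == example["label"] else "incorrect")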
Example Questions
Global PIQA examples span extraordinary cultural breadth:
- Finnish: How to properly heat a traditional sauna
- Yoruba: How to prepare a specific local dish or preserve food using traditional methods
- Hawaiian: Knowledge about native plants and their traditional uses
- Korean: Techniques for specific food preparation involving kimchi or jjigae
- Turkish: Everyday practices tied to local customs and household routines
graph TD
GPIQA["Global PIQA<br/>11,600 examples"] --> NP["Non-Parallel Split<br/>(culturally-specific)"]
NP --> L1["116 languages<br/>100 examples each"]
NP --> CS["59.9% culturally-specific<br/>Local foods, customs, traditions"]
NP --> HW["96.5% human-written<br/>No LLM generation"]
GPIQA --> PS["Parallel Split<br/>(in development)"]
PS --> L2["Translated from English<br/>For cross-lingual comparison"]
style GPIQA fill:#e74c3c,color:#fff,stroke:#333
style NP fill:#3498db,color:#fff,stroke:#333
style PS fill:#8e44ad,color:#fff,stroke:#333
style CS fill:#27ae60,color:#fff,stroke:#333
style HW fill:#f39c12,color:#fff,stroke:#333
Who Built It?
Global PIQA was organized as a shared task at the 5th Multilingual Representation Learning (MRL) Workshop at EMNLP 2025, and built through a participatory approach — the researchers who contributed datasets are co-authors of the paper.
Co-Leads
- Tyler A. Chang — UC San Diego
- Catherine Arnett — EleutherAI
Scale of Participation
| Metric | Count |
|---|---|
| Contributors | 335 researchers |
| Countries | 65 |
| Affiliations | 173 universities and companies |
| Dataset groups | 132 independent research groups |
| Language varieties | 116 |
Contributors ranged from undergraduate researchers to professors at major global universities. Each group was offered co-authorship, reflecting the intellectual significance of their contributions. The evaluation infrastructure was supported by EleutherAI (Baber Abbasi and Stella Biderman).
Timeline
| Date | Milestone |
|---|---|
| 2025 | MRL Workshop shared task opens for dataset contributions |
| September 15, 2025 | Data submission deadline |
| October 28, 2025 | Global PIQA v0.1 preprint released (arXiv:2510.24081) |
| November 9, 2025 | MRL Workshop at EMNLP 2025 |
| Ongoing | Accepting new language contributions for v1 |
What Skills Does It Test?
Global PIQA tests physical commonsense reasoning — the broad ability to understand how objects, materials, and physical processes behave in everyday life. Because everyday tools, foods, and practices differ around the world, this knowledge varies significantly across cultures.
graph TD
GPIQA["Global PIQA<br/>Skills Tested"] --> PP["Physical Properties<br/>of objects & materials"]
GPIQA --> AF["Affordances<br/>What actions can be<br/>performed with objects"]
GPIQA --> TR["Temporal & Physical<br/>Relations"]
GPIQA --> CK["Cultural Knowledge<br/>Local customs, foods,<br/>traditions"]
GPIQA --> ML["Multilingual<br/>Understanding<br/>23 writing systems"]
style GPIQA fill:#e74c3c,color:#fff,stroke:#333
style PP fill:#3498db,color:#fff,stroke:#333
style AF fill:#27ae60,color:#fff,stroke:#333
style TR fill:#f39c12,color:#fff,stroke:#333
style CK fill:#8e44ad,color:#fff,stroke:#333
style ML fill:#e67e22,color:#fff,stroke:#333
| Capability | What It Tests |
|---|---|
| Physical properties | Knowledge of how materials, substances, and objects behave |
| Affordances | Understanding what actions can be performed with specific objects |
| Temporal relations | Sequencing of physical processes (cooking, building, cleaning) |
| Cultural commonsense | Local foods, customs, tools, clothing, and traditions |
| Multilingual reasoning | Ability to reason in 116 language varieties and 23 writing systems |
| Low-resource language understanding | Performance on languages with minimal training data |
What Makes It Different from Other Benchmarks?
| Dimension | Traditional Multilingual Benchmarks | Global PIQA |
|---|---|---|
| Source | Translated from English | Written natively in each language |
| Content | Generic/universal | ~60% culturally-specific |
| Knowledge type | Expert/academic | Everyday commonsense |
| Cultural bias | Anglocentric | Community-authored |
| Creation | Machine-translated or crowdsourced | Hand-crafted by NLP researchers |
Current Leaderboard
Results are from the Global PIQA paper (v0.1), evaluated on the official non-parallel split (100 examples per language). All models below are instruction-tuned and evaluated using the prompted format. Accuracy is averaged across all 116 languages (chance = 50%).
Source: Global PIQA paper, arXiv:2510.24081 (October 28, 2025). Results tables in Section 5.3 and Appendix F.
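Note that the headline numbers are macro-averages: accuracy is computed separately for each language, and the 116 per-language scores are then averaged, so every language counts equally regardless of how many speakers or training tokens it has. A trivial sketch of that aggregation, with placeholder values:
# Macro-average over languages: each language contributes equally to the headline score.
# The per-language accuracies below are placeholders for illustration, not real results.
per_language_accuracy = {
    "fin_latn": 0.96,
    "yor_latn": 0.78,
    "ekp_latn": 0.60,
    # ... one entry per language variety, 116 in total
}
macro_average = sum(per_language_accuracy.values()) / len(per_language_accuracy)
print(f"Average accuracy: {macro_average:.1%}")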
Closed (Proprietary) Models
| Rank | Model | Avg. Accuracy | W. Europe | E. Europe | Sub-Saharan Africa |
|---|---|---|---|---|---|
| 1 | Gemini 2.5 Pro | 91.7% | 95.6% | 95.2% | 80.2% |
| 2 | Gemini 2.5 Flash | 89.8% | 94.1% | 93.7% | 76.3% |
| 3 | Claude Sonnet 4.5 | 89.5% | 94.6% | 93.7% | 74.7% |
| 4 | GPT-5 | 88.3% | 94.7% | 93.9% | 70.4% |
| 5 | GPT-5 mini | 87.4% | 93.6% | 92.8% | 72.4% |
| 6 | Gemini 2.5 Flash-Lite | 86.4% | 91.9% | 90.4% | 69.5% |
| 7 | GPT-5 nano | 75.4% | 82.6% | 81.1% | 52.7% |
Open-Weight Models (Top per Weight Class)
| Rank | Model | Avg. Accuracy | W. Europe | E. Europe | Sub-Saharan Africa |
|---|---|---|---|---|---|
| 1 | Gemma 3 (27B) | 82.4% | 86.1% | 86.5% | 67.2% |
| 2 | Qwen 2.5 (72B) | 80.6% | 88.7% | 84.6% | 61.5% |
| 3 | Gemma 3 (12B) | 79.5% | 83.6% | 82.6% | 65.5% |
| 4 | Llama 3.1 (70B) | 79.2% | 83.7% | 82.1% | 66.2% |
| 5 | GPT-oss (20B) | 79.1% | 84.6% | 81.0% | 65.9% |
| 6 | Qwen 3 (14B) | 78.5% | 84.0% | 83.2% | 57.6% |
| 7 | Qwen 3 (8B) | 75.1% | 80.6% | 79.1% | 56.3% |
Lowest-Performing Languages
For seven languages, no LLM achieves above 80% accuracy — despite human accuracy estimated at ~95%:
| Language | Best Model | Best Accuracy |
|---|---|---|
| Ekpeye (ekp_latn) | GPT-5 mini | 60% |
| Manipuri (mni_mtei) | Gemma 3 27B | 63% |
| Urhobo (urh_latn) | Gemma 3 12B | 64% |
| Burushaski (bsk_arab) | Phi-4 | 66% |
| Lingala (lin_latn) | Gemini 2.5 Pro | 68% |
| Idoma (idu_latn) | Gemma SEA-LION 9B | 71% |
| Chakavian (ckm_latn) | Gemini 2.5 Pro | 74% |
Key Observations
graph LR
A["Western Europe<br/>Best: 95.6%"] --> C["15.4% accuracy gap<br/>between regions"]
B["Sub-Saharan Africa<br/>Best: 80.2%"] --> C
D["Open vs Closed<br/>9.3% gap<br/>(82.4% vs 91.7%)"] --> E["Everyday knowledge<br/>remains unsolved<br/>for many languages"]
C --> E
style A fill:#27ae60,color:#fff,stroke:#333
style B fill:#e74c3c,color:#fff,stroke:#333
style D fill:#f39c12,color:#fff,stroke:#333
style E fill:#3498db,color:#fff,stroke:#333
- Massive regional disparities — The best model scores 95.6% on Western European languages but only 80.2% on Sub-Saharan African languages (a 15.4-point gap)
- Open vs proprietary gap — The best open-weight model (Gemma 3 27B, 82.4%) trails the best proprietary model (Gemini 2.5 Pro, 91.7%) by 9.3 points
- 7 languages below 80% — For Ekpeye, Manipuri, Urhobo, Burushaski, Lingala, Idoma, and Chakavian, even the best LLMs struggle with basic everyday knowledge
- 146 models evaluated — Including 7 closed and 139 open-weight models across many size classes
- Human accuracy ~95% — Ad hoc human evaluations estimate accuracy at ~95.1%, well above the best models for many languages
Where to Explore the Benchmark
Project Resources
| Resource | Description | Link |
|---|---|---|
| Project Website | Official Global PIQA website with overview and contribution form | mrlbenchmarks.github.io |
| arXiv Paper | Full technical paper with methodology, results, and appendices | arxiv.org/abs/2510.24081 |
| GitHub | Code, evaluation scripts, and contribution guidelines | github.com/mrlbenchmarks/global-piqa |
| Hugging Face Dataset | Official non-parallel split (11,600 examples) | huggingface.co/datasets/mrlbenchmarks/global-piqa-nonparallel |
Load the Dataset
from datasets import load_dataset
# Load the official non-parallel split (100 examples × 116 languages)
dataset = load_dataset("mrlbenchmarks/global-piqa-nonparallel")
Contribute a New Language
Global PIQA is actively accepting contributions for languages not yet represented, especially low-resource languages and non-prestige varieties. Register your interest through the contribution form.
Evaluation with lm-evaluation-harness
Results in the paper were generated using the Language Model Evaluation Harness (lm-eval) for open-weight models, and API calls for proprietary models.
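As a rough illustration (not the paper's exact configuration), the harness exposes a Python entry point, lm_eval.simple_evaluate, that can run a Hugging Face model on a registered task. The task name "global_piqa" and the model choice below are assumptions; check the project's GitHub repository or the harness's task list for the exact registered name and any per-language subtasks.
# Hedged sketch of evaluating an open-weight model with lm-evaluation-harness.
# The task name "global_piqa" is an assumption; confirm it against the registered tasks.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                     # Hugging Face backend
    model_args="pretrained=google/gemma-3-12b-it",  # any open-weight model from the leaderboard
    tasks=["global_piqa"],                          # assumed task name
    batch_size=8,
)

# Per-task metrics are collected under results["results"] (key names follow lm-eval conventions).
for task_name, metrics in results["results"].items():
    print(task_name, metrics.get("acc,none"))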
Why Global PIQA Matters
graph LR
A["Benchmarks translated<br/>from English"] --> B["Anglocentric bias<br/>in evaluation"]
B --> C["Global PIQA<br/>fills the gap"]
C --> D["Equitable AI<br/>for 100+ languages"]
A2["Expert knowledge<br/>benchmarks only"] --> B2["Everyday reasoning<br/>not measured"]
B2 --> C
C --> D2["Close the gap<br/>between open & closed"]
style A fill:#e74c3c,color:#fff,stroke:#333
style A2 fill:#e74c3c,color:#fff,stroke:#333
style C fill:#27ae60,color:#fff,stroke:#333
style D fill:#3498db,color:#fff,stroke:#333
style D2 fill:#3498db,color:#fff,stroke:#333
- Culturally native — Examples written directly in each language by native speakers, not translated from English
- Reveals hidden gaps — Everyday knowledge that LLMs fail at in low-resource languages, invisible to English-only benchmarks
- Massive scale — 116 languages, 65 countries, 335 contributors — the largest culturally-specific commonsense benchmark
- Participatory design — Researchers from each language community decide what to test, ensuring cultural authenticity
- Open and extensible — Dataset, code, and contribution pipeline are all publicly available
- Bridges research and communities — Contributors are co-authors, giving ownership back to language communities
Conclusion
Global PIQA is the first benchmark to systematically evaluate physical commonsense reasoning across 100+ languages and cultures:
- 116 language varieties covering 5 continents, 14 language families, and 23 writing systems
- Built by 335 researchers from 65 countries — a truly global, participatory effort
- 59.9% culturally-specific examples that cannot be created by translation
- The best model (Gemini 2.5 Pro) scores 91.7% overall, but drops to 80.2% for Sub-Saharan African languages and as low as 60% for individual languages
- Open-weight models trail proprietary models by 9.3 points, highlighting the need for continued investment in multilingual capabilities
- Human accuracy is ~95% — the gap with AI is still significant for many languages
Global PIQA demonstrates that everyday knowledge — knowing how to cook, clean, build, and navigate daily life — remains a challenge for LLMs in much of the world. As the authors note:
“In many languages and cultures, everyday knowledge remains an area for improvement, alongside more widely-discussed capabilities such as complex reasoning and expert knowledge.”
References
- Chang, T. A., Arnett, C. et al. “Global PIQA: Evaluating Physical Commonsense Reasoning Across 100+ Languages and Cultures.” arXiv preprint arXiv:2510.24081 (2025). arxiv.org/abs/2510.24081
- Bisk, Y., Zellers, R., Le Bras, R., Gao, J., and Choi, Y. “PIQA: Reasoning about Physical Commonsense in Natural Language.” Proceedings of AAAI 34(05):7432–7439 (2020). arxiv.org/abs/1911.11641
- MRL Benchmarks. “Global PIQA — Project Website.” mrlbenchmarks.github.io
- MRL Benchmarks. “Global PIQA Dataset.” Hugging Face. huggingface.co/datasets/mrlbenchmarks/global-piqa-nonparallel
- MRL Benchmarks. “Global PIQA — GitHub Repository.” github.com/mrlbenchmarks/global-piqa
Read More
- Compare with the hardest academic benchmark — see Humanity’s Last Exam (HLE)
- Compare with the AGI fluid intelligence benchmark — see ARC-AGI-2
- Compare with the chart understanding benchmark — see CharXiv Reasoning
- Compare with factuality evaluation — see FACTS Benchmark Suite
- Deploy models for running your own evaluations — see Deploying and Serving LLM with vLLM
- Scale your evaluation infrastructure — see Scaling LLM Serving for Enterprise Production