Global PIQA

A participatory physical commonsense reasoning benchmark spanning 116 languages and 65 countries, revealing how LLMs struggle with everyday knowledge across the world’s cultures

Published

September 5, 2025

Keywords: Global PIQA, physical commonsense reasoning, multilingual benchmark, culturally-specific evaluation, LLM evaluation, low-resource languages, multilingual NLP, PIQA, MRL workshop, EleutherAI, UC San Diego

Introduction

When you ask someone how to keep a loaf of bread fresh, the answer depends on where they live. In Finland, the answer might involve a leivänjäähdytysrauta (bread cooling rack). In Nigeria, it could involve wrapping food in banana leaves. In Japan, it might involve nori storage techniques. Physical commonsense — the everyday knowledge of how objects and materials behave — is deeply tied to culture, language, and geography.

Yet nearly all LLM benchmarks are built in English, translated from English, or designed around Western cultural contexts. This means we have almost no way to measure whether LLMs understand everyday physical knowledge for the vast majority of the world’s languages and cultures.

Global PIQA fills this gap. It is a participatory benchmark built by hand by 335 researchers from 65 countries, covering 116 language varieties across five continents, 14 language families, and 23 writing systems. Unlike translated benchmarks, Global PIQA examples are written natively in each language, with nearly 60% referencing local foods, customs, traditions, or culturally-specific elements.

The results are striking: the best model scores 91.7% overall, yet accuracy falls as low as 60% on some languages, and Sub-Saharan African languages trail Western European languages by roughly 15 points.

graph LR
    A["Existing benchmarks<br/>translated from English<br/>culturally biased"] --> B["No way to measure<br/>everyday knowledge<br/>across 100+ languages"]
    B --> C["Global PIQA<br/>116 languages, 65 countries<br/>culturally-specific examples"]
    C --> D["Reveals multilingual<br/>performance gaps<br/>in everyday reasoning"]

    style A fill:#e74c3c,stroke:#333,color:#fff
    style B fill:#f39c12,stroke:#333,color:#fff
    style C fill:#27ae60,stroke:#333,color:#fff
    style D fill:#3498db,stroke:#333,color:#fff

What Is Global PIQA?

Global PIQA is a culturally-specific physical commonsense reasoning benchmark for over 100 languages. It follows the format of the original English PIQA (Bisk et al., 2020): each example consists of a prompt (goal or question) and two candidate solutions — one correct and one incorrect. The correct solution requires physical commonsense reasoning to identify.

What makes Global PIQA unique is that examples are not translated from English. Instead, native speakers from around the world created examples directly in their own languages, referencing local foods, customs, objects, traditions, and everyday activities that would not appear in a translated benchmark.
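The binary-choice format above can be sketched in a few lines of Python. This is a minimal illustration, not the official evaluation code: the example record and the scoring function are placeholders, and the field names (`prompt`, `solutions`, `label`) are assumptions rather than the actual Global PIQA schema.

```python
# Minimal sketch of PIQA-style binary-choice evaluation: score both
# candidate solutions, pick the higher-scored one, and measure accuracy.
# The record layout and scorer here are illustrative placeholders.

def pick_solution(score_fn, prompt, solutions):
    """Return the index of the candidate solution the model prefers.

    score_fn(prompt, solution) -> float, e.g. a length-normalized
    log-likelihood of the solution given the prompt.
    """
    scores = [score_fn(prompt, s) for s in solutions]
    return max(range(len(scores)), key=scores.__getitem__)

def accuracy(score_fn, examples):
    """Fraction of examples where the preferred solution matches the label."""
    correct = sum(
        pick_solution(score_fn, ex["prompt"], ex["solutions"]) == ex["label"]
        for ex in examples
    )
    return correct / len(examples)

if __name__ == "__main__":
    # Toy scorer that prefers the shorter solution (stand-in for a real model).
    toy_score = lambda prompt, sol: -len(sol)
    examples = [
        {"prompt": "How do you cool a hot pan?",
         "solutions": ["Run cold water over it",
                       "Wrap it in a wool blanket to trap heat"],
         "label": 0},
    ]
    print(accuracy(toy_score, examples))  # 1.0 for this toy example
```

Because every example is a two-way choice, random guessing yields 50%, which is why the paper reports chance as 50%.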

Key Characteristics

Feature Details
Total examples 11,600 (official split: 100 per language × 116 languages)
Full unsampled dataset 27,000+ examples across all languages
Languages 116 language varieties (101 unique ISO 639-3 codes)
Coverage 5 continents, 14 language families, 23 writing systems
Contributors 335 researchers from 65 countries, 173 affiliations
Culturally-specific 59.9% of examples reference local foods, customs, traditions
Human-written 96.5% of examples created without LLM assistance
Validation All examples verified by at least 1 native speaker; 72.9% by multiple
Format Prompt + 2 candidate solutions (binary choice; 50% chance baseline)
Evaluation Accuracy (prompted format for instruction-tuned models; completion format for pretrained models)

Example Questions

Global PIQA examples span extraordinary cultural breadth:

  • Finnish: How to properly heat a traditional sauna
  • Yoruba: How to prepare a specific local dish or preserve food using traditional methods
  • Hawaiian: Knowledge about native plants and their traditional uses
  • Korean: Techniques for specific food preparation involving kimchi or jjigae
  • Turkish: Everyday practices tied to local customs and household routines

graph TD
    GPIQA["Global PIQA<br/>11,600 examples"] --> NP["Non-Parallel Split<br/>(culturally-specific)"]
    NP --> L1["116 languages<br/>100 examples each"]
    NP --> CS["59.9% culturally-specific<br/>Local foods, customs, traditions"]
    NP --> HW["96.5% human-written<br/>No LLM generation"]
    
    GPIQA --> PS["Parallel Split<br/>(in development)"]
    PS --> L2["Translated from English<br/>For cross-lingual comparison"]

    style GPIQA fill:#e74c3c,color:#fff,stroke:#333
    style NP fill:#3498db,color:#fff,stroke:#333
    style PS fill:#8e44ad,color:#fff,stroke:#333
    style CS fill:#27ae60,color:#fff,stroke:#333
    style HW fill:#f39c12,color:#fff,stroke:#333

Who Built It?

Global PIQA was organized as a shared task at the 5th Multilingual Representation Learning (MRL) Workshop at EMNLP 2025, and built through a participatory approach — the researchers who contributed datasets are co-authors of the paper.

Co-Leads

  • Tyler A. Chang — UC San Diego
  • Catherine Arnett — EleutherAI

Scale of Participation

Metric Count
Contributors 335 researchers
Countries 65
Affiliations 173 universities and companies
Dataset groups 132 independent research groups
Language varieties 116

Contributors ranged from undergraduate researchers to professors at major global universities. Each group was offered co-authorship, reflecting the intellectual significance of their contributions. The evaluation infrastructure was supported by EleutherAI (Baber Abbasi and Stella Biderman).

Timeline

Date Milestone
2025 MRL Workshop shared task opens for dataset contributions
September 15, 2025 Data submission deadline
October 28, 2025 Global PIQA v0.1 preprint released (arXiv:2510.24081)
November 9, 2025 MRL Workshop at EMNLP 2025
Ongoing Accepting new language contributions for v1

What Skills Does It Test?

Global PIQA tests physical commonsense reasoning: the broad ability to understand how objects, materials, and physical processes work in everyday life. This knowledge is acquired through daily experience, and it varies significantly across cultures.

graph TD
    GPIQA["Global PIQA<br/>Skills Tested"] --> PP["Physical Properties<br/>of objects & materials"]
    GPIQA --> AF["Affordances<br/>What actions can be<br/>performed with objects"]
    GPIQA --> TR["Temporal & Physical<br/>Relations"]
    GPIQA --> CK["Cultural Knowledge<br/>Local customs, foods,<br/>traditions"]
    GPIQA --> ML["Multilingual<br/>Understanding<br/>23 writing systems"]

    style GPIQA fill:#e74c3c,color:#fff,stroke:#333
    style PP fill:#3498db,color:#fff,stroke:#333
    style AF fill:#27ae60,color:#fff,stroke:#333
    style TR fill:#f39c12,color:#fff,stroke:#333
    style CK fill:#8e44ad,color:#fff,stroke:#333
    style ML fill:#e67e22,color:#fff,stroke:#333

Capability What It Tests
Physical properties Knowledge of how materials, substances, and objects behave
Affordances Understanding what actions can be performed with specific objects
Temporal relations Sequencing of physical processes (cooking, building, cleaning)
Cultural commonsense Local foods, customs, tools, clothing, and traditions
Multilingual reasoning Ability to reason in 116 language varieties and 23 writing systems
Low-resource language understanding Performance on languages with minimal training data

What Makes It Different from Other Benchmarks?

Dimension Traditional Multilingual Benchmarks Global PIQA
Source Translated from English Written natively in each language
Content Generic/universal 60% culturally-specific
Knowledge type Expert/academic Everyday commonsense
Cultural bias Anglocentric Community-authored
Creation Machine-translated or crowdsourced Hand-crafted by native-speaker researchers

Current Leaderboard

Results are from the Global PIQA paper (v0.1), evaluated on the official non-parallel split (100 examples per language). All models below are instruction-tuned and evaluated using the prompted format. Accuracy is averaged across all 116 languages (chance = 50%).

Source: Global PIQA paper, arXiv:2510.24081 (October 28, 2025). Results tables in Section 5.3 and Appendix F.
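The headline numbers below are macro-averages: each language contributes equally regardless of how many speakers it has, and regional figures average over the languages in that region. A small sketch of that roll-up, with invented per-language numbers purely for illustration:

```python
# Sketch of how per-language accuracies roll up into macro-averaged
# headline and regional numbers. The language codes, regions, and
# accuracy values below are illustrative, not the paper's actual data.

from collections import defaultdict

def macro_average(per_language_acc):
    """Average accuracy with each language weighted equally."""
    return sum(per_language_acc.values()) / len(per_language_acc)

def regional_averages(per_language_acc, language_to_region):
    """Macro-average within each region, exposing regional gaps."""
    buckets = defaultdict(list)
    for lang, acc in per_language_acc.items():
        buckets[language_to_region[lang]].append(acc)
    return {region: sum(accs) / len(accs) for region, accs in buckets.items()}

if __name__ == "__main__":
    acc = {"fin": 0.95, "deu": 0.96, "yor": 0.78, "lin": 0.68}
    region = {"fin": "W. Europe", "deu": "W. Europe",
              "yor": "Sub-Saharan Africa", "lin": "Sub-Saharan Africa"}
    print(macro_average(acc))             # ~0.84 (macro-average)
    print(regional_averages(acc, region))
```

Averaging per language rather than per example is what makes low-resource languages visible in the overall score: a model cannot mask a weak language behind strong performance on high-resource ones.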

Closed (Proprietary) Models

Rank Model Avg. Accuracy W. Europe E. Europe Sub-Saharan Africa
1 Gemini 2.5 Pro 91.7% 95.6% 95.2% 80.2%
2 Gemini 2.5 Flash 89.8% 94.1% 93.7% 76.3%
3 Claude Sonnet 4.5 89.5% 94.6% 93.7% 74.7%
4 GPT-5 88.3% 94.7% 93.9% 70.4%
5 GPT-5 mini 87.4% 93.6% 92.8% 72.4%
6 Gemini 2.5 Flash-Lite 86.4% 91.9% 90.4% 69.5%
7 GPT-5 nano 75.4% 82.6% 81.1% 52.7%

Open-Weight Models (Top per Weight Class)

Rank Model Avg. Accuracy W. Europe E. Europe Sub-Saharan Africa
1 Gemma 3 (27B) 82.4% 86.1% 86.5% 67.2%
2 Qwen 2.5 (72B) 80.6% 88.7% 84.6% 61.5%
3 Gemma 3 (12B) 79.5% 83.6% 82.6% 65.5%
4 Llama 3.1 (70B) 79.2% 83.7% 82.1% 66.2%
5 GPT-oss (20B) 79.1% 84.6% 81.0% 65.9%
6 Qwen 3 (14B) 78.5% 84.0% 83.2% 57.6%
7 Qwen 3 (8B) 75.1% 80.6% 79.1% 56.3%

Lowest-Performing Languages

For seven languages, no LLM achieves above 80% accuracy — despite human accuracy estimated at ~95%:

Language Best Model Best Accuracy
Ekpeye (ekp_latn) GPT-5 mini 60%
Manipuri (mni_mtei) Gemma 3 27B 63%
Urhobo (urh_latn) Gemma 3 12B 64%
Burushaski (bsk_arab) Phi-4 66%
Lingala (lin_latn) Gemini 2.5 Pro 68%
Idoma (idu_latn) Gemma SEA-LION 9B 71%
Chakavian (ckm_latn) Gemini 2.5 Pro 74%

Key Observations

graph LR
    A["Western Europe<br/>Best: 95.6%"] --> C["15.4-point accuracy gap<br/>between regions"]
    B["Sub-Saharan Africa<br/>Best: 80.2%"] --> C
    D["Open vs Closed<br/>9.3-point gap<br/>(82.4% vs 91.7%)"] --> E["Everyday knowledge<br/>remains unsolved<br/>for many languages"]
    C --> E

    style A fill:#27ae60,color:#fff,stroke:#333
    style B fill:#e74c3c,color:#fff,stroke:#333
    style D fill:#f39c12,color:#fff,stroke:#333
    style E fill:#3498db,color:#fff,stroke:#333

  • Massive regional disparities — The best model scores 95.6% on Western European languages but only 80.2% on Sub-Saharan African languages (a 15.4-point gap)
  • Open vs proprietary gap — The best open-weight model (Gemma 3 27B, 82.4%) trails the best proprietary model (Gemini 2.5 Pro, 91.7%) by 9.3 points
  • 7 languages below 80% — For Ekpeye, Manipuri, Urhobo, Burushaski, Lingala, Idoma, and Chakavian, even the best LLMs struggle with basic everyday knowledge
  • 146 models evaluated — Including 7 closed and 139 open-weight models across many size classes
  • Human accuracy ~95% — Small-scale human evaluations estimate ~95.1% accuracy, well above the best models on many languages

Where to Explore the Benchmark

Project Resources

Resource Description Link
Project Website Official Global PIQA website with overview and contribution form mrlbenchmarks.github.io
arXiv Paper Full technical paper with methodology, results, and appendices arxiv.org/abs/2510.24081
GitHub Code, evaluation scripts, and contribution guidelines github.com/mrlbenchmarks/global-piqa
Hugging Face Dataset Official non-parallel split (11,600 examples) huggingface.co/datasets/mrlbenchmarks/global-piqa-nonparallel

Load the Dataset

from datasets import load_dataset

# Load the official non-parallel split (100 examples × 116 languages)
dataset = load_dataset("mrlbenchmarks/global-piqa-nonparallel")

Contribute a New Language

Global PIQA is actively accepting contributions for languages not yet represented, especially low-resource languages and non-prestige varieties. Register your interest through the contribution form.

Evaluation with lm-evaluation-harness

Results in the paper were generated using the Language Model Evaluation Harness (lm-eval) for open-weight models, and API calls for proprietary models.
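For open-weight models, an evaluation run looks roughly like the following. This is an illustrative invocation, not the paper's exact command: the task name global_piqa and the model checkpoint are assumptions, so check the harness's registered task list and substitute your own model.

```shell
# Install the Language Model Evaluation Harness.
pip install lm-eval

# Illustrative run; "global_piqa" is an assumed task name -- verify it
# against the tasks registered in your lm-eval installation.
lm_eval \
  --model hf \
  --model_args pretrained=google/gemma-3-12b-it \
  --tasks global_piqa \
  --batch_size auto \
  --output_path results/
```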

Why Global PIQA Matters

graph LR
    A["Benchmarks translated<br/>from English"] --> B["Anglocentric bias<br/>in evaluation"]
    B --> C["Global PIQA<br/>fills the gap"]
    C --> D["Equitable AI<br/>for 100+ languages"]

    A2["Expert knowledge<br/>benchmarks only"] --> B2["Everyday reasoning<br/>not measured"]
    B2 --> C
    C --> D2["Close the gap<br/>between open & closed"]

    style A fill:#e74c3c,color:#fff,stroke:#333
    style A2 fill:#e74c3c,color:#fff,stroke:#333
    style C fill:#27ae60,color:#fff,stroke:#333
    style D fill:#3498db,color:#fff,stroke:#333
    style D2 fill:#3498db,color:#fff,stroke:#333

  1. Culturally native — Examples written directly in each language by native speakers, not translated from English
  2. Reveals hidden gaps — Everyday knowledge that LLMs fail at in low-resource languages, invisible to English-only benchmarks
  3. Massive scale — 116 languages, 65 countries, 335 contributors — the largest culturally-specific commonsense benchmark
  4. Participatory design — Researchers from each language community decide what to test, ensuring cultural authenticity
  5. Open and extensible — Dataset, code, and contribution pipeline are all publicly available
  6. Bridges research and communities — Contributors are co-authors, giving ownership back to language communities


Conclusion

Global PIQA is the first benchmark to systematically evaluate physical commonsense reasoning across 100+ languages and cultures:

  • 116 language varieties covering 5 continents, 14 language families, and 23 writing systems
  • Built by 335 researchers from 65 countries — a truly global, participatory effort
  • 59.9% culturally-specific examples that cannot be created by translation
  • The best model (Gemini 2.5 Pro) scores 91.7% overall, but drops to 80.2% for Sub-Saharan African languages and as low as 60% for individual languages
  • Open-weight models trail proprietary models by 9.3 points, highlighting the need for continued investment in multilingual capabilities
  • Human accuracy is ~95% — the gap with AI is still significant for many languages

Global PIQA demonstrates that everyday knowledge — knowing how to cook, clean, build, and navigate daily life — remains a challenge for LLMs in much of the world. As the authors note:

“In many languages and cultures, everyday knowledge remains an area for improvement, alongside more widely-discussed capabilities such as complex reasoning and expert knowledge.”

References

  • Bisk, Y., Zellers, R., Le Bras, R., Gao, J., & Choi, Y. (2020). PIQA: Reasoning about Physical Commonsense in Natural Language. AAAI 2020.
  • Chang, T. A., Arnett, C., et al. (2025). Global PIQA. arXiv:2510.24081.