Global PIQA

A participatory physical commonsense reasoning benchmark spanning 116 languages and 65 countries, revealing how LLMs struggle with everyday knowledge across the world’s cultures

Published

September 5, 2025

Keywords: Global PIQA, physical commonsense reasoning, multilingual benchmark, culturally-specific evaluation, LLM evaluation, low-resource languages, multilingual NLP, PIQA, MRL workshop, EleutherAI, UC San Diego

Introduction

When you ask someone how to keep a loaf of bread fresh, the answer depends on where they live. In Finland, the answer might involve a leivänjäähdytysrauta (bread cooling rack). In Nigeria, it could involve wrapping food in banana leaves. In Japan, it might involve nori storage techniques. Physical commonsense — the everyday knowledge of how objects and materials behave — is deeply tied to culture, language, and geography.

Yet nearly all LLM benchmarks are built in English, translated from English, or designed around Western cultural contexts. This means we have almost no way to measure whether LLMs understand everyday physical knowledge for the vast majority of the world’s languages and cultures.

Global PIQA fills this gap. It is a participatory benchmark built by hand by 335 researchers from 65 countries, covering 116 language varieties across five continents, 14 language families, and 23 writing systems. Unlike translated benchmarks, Global PIQA examples are written natively in each language, with nearly 60% referencing local foods, customs, traditions, or culturally-specific elements.

The results are striking: the best model scores 91.7% overall, yet accuracy falls as low as 60% on some languages, and Sub-Saharan African languages trail Western European languages by roughly 15 points.

graph LR
    A["Existing benchmarks<br/>translated from English<br/>culturally biased"] --> B["No way to measure<br/>everyday knowledge<br/>across 100+ languages"]
    B --> C["Global PIQA<br/>116 languages, 65 countries<br/>culturally-specific examples"]
    C --> D["Reveals multilingual<br/>performance gaps<br/>in everyday reasoning"]

    style A fill:#e74c3c,stroke:#333,color:#fff
    style B fill:#f39c12,stroke:#333,color:#fff
    style C fill:#27ae60,stroke:#333,color:#fff
    style D fill:#3498db,stroke:#333,color:#fff

What Is Global PIQA?

Global PIQA is a culturally-specific physical commonsense reasoning benchmark for over 100 languages. It follows the format of the original English PIQA (Bisk et al., 2020): each example consists of a prompt (goal or question) and two candidate solutions — one correct and one incorrect. The correct solution requires physical commonsense reasoning to identify.

What makes Global PIQA unique is that examples are not translated from English. Instead, native speakers from around the world created examples directly in their own languages, referencing local foods, customs, objects, traditions, and everyday activities that would not appear in a translated benchmark.
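The binary-choice format above can be sketched in a few lines of Python. This is a minimal illustration, not the official evaluation code: the example record and the scoring function are placeholders, and the field names (`prompt`, `solutions`, `label`) are assumptions rather than the actual Global PIQA schema.

```python
# Minimal sketch of PIQA-style binary-choice evaluation: score both
# candidate solutions, pick the higher-scored one, and measure accuracy.
# The record layout and scorer here are illustrative placeholders.

def pick_solution(score_fn, prompt, solutions):
    """Return the index of the candidate solution the model prefers.

    score_fn(prompt, solution) -> float, e.g. a length-normalized
    log-likelihood of the solution given the prompt.
    """
    scores = [score_fn(prompt, s) for s in solutions]
    return max(range(len(scores)), key=scores.__getitem__)

def accuracy(score_fn, examples):
    """Fraction of examples where the preferred solution matches the label."""
    correct = sum(
        pick_solution(score_fn, ex["prompt"], ex["solutions"]) == ex["label"]
        for ex in examples
    )
    return correct / len(examples)

if __name__ == "__main__":
    # Toy scorer that prefers the shorter solution (stand-in for a real model).
    toy_score = lambda prompt, sol: -len(sol)
    examples = [
        {"prompt": "How do you cool a hot pan?",
         "solutions": ["Run cold water over it",
                       "Wrap it in a wool blanket to trap heat"],
         "label": 0},
    ]
    print(accuracy(toy_score, examples))  # 1.0 for this toy example
```

Because every example is a two-way choice, random guessing yields 50%, which is why the paper reports chance as 50%.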

Key Characteristics

Feature Details
Total examples 11,600 (official split: 100 per language × 116 languages)
Full unsampled dataset 27,000+ examples across all languages
Languages 116 language varieties (101 unique ISO 639-3 codes)
Coverage 5 continents, 14 language families, 23 writing systems
Contributors 335 researchers from 65 countries, 173 affiliations
Culturally-specific 59.9% of examples reference local foods, customs, traditions
Human-written 96.5% of examples created without LLM assistance
Validation All examples verified by at least 1 native speaker; 72.9% by multiple
Format Prompt + 2 candidate solutions (binary choice; 50% chance baseline)
Evaluation Accuracy (prompted format for instruction-tuned models; completion format for pretrained models)

Example Questions

Global PIQA examples span extraordinary cultural breadth:

  • Finnish: How to properly heat a traditional sauna
  • Yoruba: How to prepare a specific local dish or preserve food using traditional methods
  • Hawaiian: Knowledge about native plants and their traditional uses
  • Korean: Techniques for specific food preparation involving kimchi or jjigae
  • Turkish: Everyday practices tied to local customs and household routines

graph TD
    GPIQA["Global PIQA<br/>11,600 examples"] --> NP["Non-Parallel Split<br/>(culturally-specific)"]
    NP --> L1["116 languages<br/>100 examples each"]
    NP --> CS["59.9% culturally-specific<br/>Local foods, customs, traditions"]
    NP --> HW["96.5% human-written<br/>No LLM generation"]
    
    GPIQA --> PS["Parallel Split<br/>(in development)"]
    PS --> L2["Translated from English<br/>For cross-lingual comparison"]

    style GPIQA fill:#e74c3c,color:#fff,stroke:#333
    style NP fill:#3498db,color:#fff,stroke:#333
    style PS fill:#8e44ad,color:#fff,stroke:#333
    style CS fill:#27ae60,color:#fff,stroke:#333
    style HW fill:#f39c12,color:#fff,stroke:#333

Who Built It?

Global PIQA was organized as a shared task at the 5th Multilingual Representation Learning (MRL) Workshop at EMNLP 2025, and built through a participatory approach — the researchers who contributed datasets are co-authors of the paper.

Co-Leads

  • Tyler A. Chang — UC San Diego
  • Catherine Arnett — EleutherAI

Scale of Participation

Metric Count
Contributors 335 researchers
Countries 65
Affiliations 173 universities and companies
Dataset groups 132 independent research groups
Language varieties 116

Contributors ranged from undergraduate researchers to professors at major global universities. Each group was offered co-authorship, reflecting the intellectual significance of their contributions. The evaluation infrastructure was supported by EleutherAI (Baber Abbasi and Stella Biderman).

Timeline

Date Milestone
2025 MRL Workshop shared task opens for dataset contributions
September 15, 2025 Data submission deadline
October 28, 2025 Global PIQA v0.1 preprint released (arXiv:2510.24081)
November 9, 2025 MRL Workshop at EMNLP 2025
Ongoing Accepting new language contributions for v1

What Skills Does It Test?

Global PIQA tests physical commonsense reasoning: the broad ability to understand how objects, materials, and physical processes work in everyday life. This knowledge is acquired through daily experience, and it varies significantly across cultures.

graph TD
    GPIQA["Global PIQA<br/>Skills Tested"] --> PP["Physical Properties<br/>of objects & materials"]
    GPIQA --> AF["Affordances<br/>What actions can be<br/>performed with objects"]
    GPIQA --> TR["Temporal & Physical<br/>Relations"]
    GPIQA --> CK["Cultural Knowledge<br/>Local customs, foods,<br/>traditions"]
    GPIQA --> ML["Multilingual<br/>Understanding<br/>23 writing systems"]

    style GPIQA fill:#e74c3c,color:#fff,stroke:#333
    style PP fill:#3498db,color:#fff,stroke:#333
    style AF fill:#27ae60,color:#fff,stroke:#333
    style TR fill:#f39c12,color:#fff,stroke:#333
    style CK fill:#8e44ad,color:#fff,stroke:#333
    style ML fill:#e67e22,color:#fff,stroke:#333

Capability What It Tests
Physical properties Knowledge of how materials, substances, and objects behave
Affordances Understanding what actions can be performed with specific objects
Temporal relations Sequencing of physical processes (cooking, building, cleaning)
Cultural commonsense Local foods, customs, tools, clothing, and traditions
Multilingual reasoning Ability to reason in 116 language varieties and 23 writing systems
Low-resource language understanding Performance on languages with minimal training data

What Makes It Different from Other Benchmarks?

Dimension Traditional Multilingual Benchmarks Global PIQA
Source Translated from English Written natively in each language
Content Generic/universal 60% culturally-specific
Knowledge type Expert/academic Everyday commonsense
Cultural bias Anglocentric Community-authored
Creation Machine-translated or crowdsourced Hand-crafted by native-speaker researchers

Current Leaderboard

Results are from the Global PIQA paper (v0.1), evaluated on the official non-parallel split (100 examples per language). All models below are instruction-tuned and evaluated using the prompted format. Accuracy is averaged across all 116 languages (chance = 50%).

Source: Global PIQA paper, arXiv:2510.24081 (October 28, 2025). Results tables in Section 5.3 and Appendix F.
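The headline numbers below are macro-averages: each language contributes equally regardless of how many speakers it has, and regional figures average over the languages in that region. A small sketch of that roll-up, with invented per-language numbers purely for illustration:

```python
# Sketch of how per-language accuracies roll up into macro-averaged
# headline and regional numbers. The language codes, regions, and
# accuracy values below are illustrative, not the paper's actual data.

from collections import defaultdict

def macro_average(per_language_acc):
    """Average accuracy with each language weighted equally."""
    return sum(per_language_acc.values()) / len(per_language_acc)

def regional_averages(per_language_acc, language_to_region):
    """Macro-average within each region, exposing regional gaps."""
    buckets = defaultdict(list)
    for lang, acc in per_language_acc.items():
        buckets[language_to_region[lang]].append(acc)
    return {region: sum(accs) / len(accs) for region, accs in buckets.items()}

if __name__ == "__main__":
    acc = {"fin": 0.95, "deu": 0.96, "yor": 0.78, "lin": 0.68}
    region = {"fin": "W. Europe", "deu": "W. Europe",
              "yor": "Sub-Saharan Africa", "lin": "Sub-Saharan Africa"}
    print(macro_average(acc))             # ~0.84 (macro-average)
    print(regional_averages(acc, region))
```

Averaging per language rather than per example is what makes low-resource languages visible in the overall score: a model cannot mask a weak language behind strong performance on high-resource ones.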

Closed (Proprietary) Models

Rank Model Avg. Accuracy W. Europe E. Europe Sub-Saharan Africa
1 Gemini 2.5 Pro 91.7% 95.6% 95.2% 80.2%
2 Gemini 2.5 Flash 89.8% 94.1% 93.7% 76.3%
3 Claude Sonnet 4.5 89.5% 94.6% 93.7% 74.7%
4 GPT-5 88.3% 94.7% 93.9% 70.4%
5 GPT-5 mini 87.4% 93.6% 92.8% 72.4%
6 Gemini 2.5 Flash-Lite 86.4% 91.9% 90.4% 69.5%
7 GPT-5 nano 75.4% 82.6% 81.1% 52.7%

Open-Weight Models (Top per Weight Class)

Rank Model Avg. Accuracy W. Europe E. Europe Sub-Saharan Africa
1 Gemma 3 (27B) 82.4% 86.1% 86.5% 67.2%
2 Qwen 2.5 (72B) 80.6% 88.7% 84.6% 61.5%
3 Gemma 3 (12B) 79.5% 83.6% 82.6% 65.5%
4 Llama 3.1 (70B) 79.2% 83.7% 82.1% 66.2%
5 GPT-oss (20B) 79.1% 84.6% 81.0% 65.9%
6 Qwen 3 (14B) 78.5% 84.0% 83.2% 57.6%
7 Qwen 3 (8B) 75.1% 80.6% 79.1% 56.3%

Lowest-Performing Languages

For seven languages, no LLM achieves above 80% accuracy — despite human accuracy estimated at ~95%:

Language Best Model Best Accuracy
Ekpeye (ekp_latn) GPT-5 mini 60%
Manipuri (mni_mtei) Gemma 3 27B 63%
Urhobo (urh_latn) Gemma 3 12B 64%
Burushaski (bsk_arab) Phi-4 66%
Lingala (lin_latn) Gemini 2.5 Pro 68%
Idoma (idu_latn) Gemma SEA-LION 9B 71%
Chakavian (ckm_latn) Gemini 2.5 Pro 74%

Key Observations

graph LR
    A["Western Europe<br/>Best: 95.6%"] --> C["15.4-point accuracy gap<br/>between regions"]
    B["Sub-Saharan Africa<br/>Best: 80.2%"] --> C
    D["Open vs Closed<br/>9.3-point gap<br/>(82.4% vs 91.7%)"] --> E["Everyday knowledge<br/>remains unsolved<br/>for many languages"]
    C --> E

    style A fill:#27ae60,color:#fff,stroke:#333
    style B fill:#e74c3c,color:#fff,stroke:#333
    style D fill:#f39c12,color:#fff,stroke:#333
    style E fill:#3498db,color:#fff,stroke:#333

  • Massive regional disparities — The best model scores 95.6% on Western European languages but only 80.2% on Sub-Saharan African languages (a 15.4-point gap)
  • Open vs proprietary gap — The best open-weight model (Gemma 3 27B, 82.4%) trails the best proprietary model (Gemini 2.5 Pro, 91.7%) by 9.3 points
  • 7 languages below 80% — For Ekpeye, Manipuri, Urhobo, Burushaski, Lingala, Idoma, and Chakavian, even the best LLMs struggle with basic everyday knowledge
  • 146 models evaluated — Including 7 closed and 139 open-weight models across many size classes
  • Human accuracy ~95% — Small-scale human evaluations estimate ~95.1% accuracy, well above the best models on many languages

Where to Explore the Benchmark

Project Resources

Resource Description Link
Project Website Official Global PIQA website with overview and contribution form mrlbenchmarks.github.io
arXiv Paper Full technical paper with methodology, results, and appendices arxiv.org/abs/2510.24081
GitHub Code, evaluation scripts, and contribution guidelines github.com/mrlbenchmarks/global-piqa
Hugging Face Dataset Official non-parallel split (11,600 examples) huggingface.co/datasets/mrlbenchmarks/global-piqa-nonparallel

Load the Dataset

from datasets import load_dataset

# Load the official non-parallel split (100 examples × 116 languages)
dataset = load_dataset("mrlbenchmarks/global-piqa-nonparallel")

Contribute a New Language

Global PIQA is actively accepting contributions for languages not yet represented, especially low-resource languages and non-prestige varieties. Register your interest through the contribution form.

Evaluation with lm-evaluation-harness

Results in the paper were generated using the Language Model Evaluation Harness (lm-eval) for open-weight models, and API calls for proprietary models.
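For open-weight models, an evaluation run looks roughly like the following. This is an illustrative invocation, not the paper's exact command: the task name global_piqa and the model checkpoint are assumptions, so check the harness's registered task list and substitute your own model.

```shell
# Install the Language Model Evaluation Harness.
pip install lm-eval

# Illustrative run; "global_piqa" is an assumed task name -- verify it
# against the tasks registered in your lm-eval installation.
lm_eval \
  --model hf \
  --model_args pretrained=google/gemma-3-12b-it \
  --tasks global_piqa \
  --batch_size auto \
  --output_path results/
```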

Why Global PIQA Matters

graph LR
    A["Benchmarks translated<br/>from English"] --> B["Anglocentric bias<br/>in evaluation"]
    B --> C["Global PIQA<br/>fills the gap"]
    C --> D["Equitable AI<br/>for 100+ languages"]

    A2["Expert knowledge<br/>benchmarks only"] --> B2["Everyday reasoning<br/>not measured"]
    B2 --> C
    C --> D2["Close the gap<br/>between open & closed"]

    style A fill:#e74c3c,color:#fff,stroke:#333
    style A2 fill:#e74c3c,color:#fff,stroke:#333
    style C fill:#27ae60,color:#fff,stroke:#333
    style D fill:#3498db,color:#fff,stroke:#333
    style D2 fill:#3498db,color:#fff,stroke:#333

  1. Culturally native — Examples written directly in each language by native speakers, not translated from English
  2. Reveals hidden gaps — Everyday knowledge that LLMs fail at in low-resource languages, invisible to English-only benchmarks
  3. Massive scale — 116 languages, 65 countries, 335 contributors — the largest culturally-specific commonsense benchmark
  4. Participatory design — Researchers from each language community decide what to test, ensuring cultural authenticity
  5. Open and extensible — Dataset, code, and contribution pipeline are all publicly available
  6. Bridges research and communities — Contributors are co-authors, giving ownership back to language communities


Conclusion

Global PIQA is the first benchmark to systematically evaluate physical commonsense reasoning across 100+ languages and cultures:

  • 116 language varieties covering 5 continents, 14 language families, and 23 writing systems
  • Built by 335 researchers from 65 countries — a truly global, participatory effort
  • 59.9% culturally-specific examples that cannot be created by translation
  • The best model (Gemini 2.5 Pro) scores 91.7% overall, but drops to 80.2% for Sub-Saharan African languages and as low as 60% for individual languages
  • Open-weight models trail proprietary models by 9.3 points, highlighting the need for continued investment in multilingual capabilities
  • Human accuracy is ~95% — the gap with AI is still significant for many languages

Global PIQA demonstrates that everyday knowledge — knowing how to cook, clean, build, and navigate daily life — remains a challenge for LLMs in much of the world. As the authors note:

“In many languages and cultures, everyday knowledge remains an area for improvement, alongside more widely-discussed capabilities such as complex reasoning and expert knowledge.”

References

  • Bisk, Y., Zellers, R., Le Bras, R., Gao, J., & Choi, Y. (2020). PIQA: Reasoning about Physical Commonsense in Natural Language. AAAI 2020.
  • Chang, T. A., Arnett, C., et al. (2025). Global PIQA. arXiv:2510.24081.