FACTS Benchmark Suite

A comprehensive benchmark suite from Google DeepMind for systematically evaluating the factuality of LLMs across grounding, parametric knowledge, search, and multimodal tasks

Published: September 5, 2025

Keywords: FACTS Benchmark Suite, FACTS Grounding, factuality, hallucination, LLM evaluation, Google DeepMind, Google Research, Kaggle, grounding accuracy, parametric knowledge, search benchmark, multimodal factuality

Introduction

LLMs are increasingly becoming a primary source for information delivery — from answering questions to summarizing documents to analyzing images. But their grip on factual accuracy remains imperfect. They “hallucinate” false information, particularly when given complex inputs, eroding trust and limiting real-world applications.

Most benchmarks test knowledge or reasoning in isolation. But factuality failures happen in many different ways: a model might hallucinate when answering from memory, fail to ground its response in a provided document, retrieve the wrong information from the web, or misinterpret an image. Testing only one dimension gives an incomplete picture.

The FACTS Benchmark Suite addresses this by systematically evaluating LLM factuality across four distinct dimensions: grounding, parametric knowledge, search, and multimodal understanding. No model scores above 68% on the overall suite, revealing substantial room for improvement.

graph LR
    A["Single-dimension<br/>factuality tests"] --> B["Incomplete picture<br/>of model accuracy"]
    B --> C["FACTS Benchmark Suite<br/>4 benchmarks, 3,513 examples<br/>Best model < 68%"]
    C --> D["Systematic measure<br/>of LLM factuality"]

    style A fill:#e74c3c,stroke:#333,color:#fff
    style B fill:#f39c12,stroke:#333,color:#fff
    style C fill:#27ae60,stroke:#333,color:#fff
    style D fill:#3498db,stroke:#333,color:#fff

What Is the FACTS Benchmark Suite?

The FACTS Benchmark Suite is a collection of four complementary benchmarks designed to evaluate the factual accuracy of LLMs across different use cases. Each benchmark tests a distinct factuality capability, and the FACTS Score is the average accuracy across all four.
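
Since the FACTS Score is defined as a plain average, it can be reproduced directly from per-benchmark numbers. A minimal sketch in Python (the figures are Gemini 2.5 Pro's results from the leaderboard later in this article):

```python
# The overall FACTS Score is the unweighted mean of the four per-benchmark
# accuracies (all values in percent).
def facts_score(grounding, parametric, search, multimodal):
    return (grounding + parametric + search + multimodal) / 4

# Reproducing Gemini 2.5 Pro's reported 62.1% from its per-benchmark scores:
overall = facts_score(grounding=74.3, parametric=63.2, search=63.9, multimodal=46.9)
print(f"{overall:.1f}%")  # → 62.1%
```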

graph TD
    FACTS["FACTS Benchmark Suite<br/>3,513+ examples"] --> G["Grounding<br/>1,719 examples"]
    FACTS --> P["Parametric<br/>2,104 examples"]
    FACTS --> S["Search<br/>1,884 examples"]
    FACTS --> M["Multimodal<br/>1,522 examples"]

    G --> G1["Ground responses<br/>in provided documents<br/>Up to 32K tokens"]
    P --> P1["Answer factoid questions<br/>from internal knowledge<br/>No external tools"]
    S --> S1["Use web search<br/>to retrieve and synthesize<br/>Multi-hop queries"]
    M --> M1["Answer questions<br/>about input images<br/>Visual + world knowledge"]

    style FACTS fill:#e74c3c,color:#fff,stroke:#333
    style G fill:#3498db,color:#fff,stroke:#333
    style P fill:#27ae60,color:#fff,stroke:#333
    style S fill:#f39c12,color:#fff,stroke:#333
    style M fill:#8e44ad,color:#fff,stroke:#333

The Four Benchmarks

1. FACTS Grounding (v2)

Tests whether LLMs can generate factually accurate responses grounded in provided long-form documents (up to 32K tokens). Each example includes a system instruction, user request, and context document requiring a long-form response. Responses must be both comprehensive (addressing the user’s request) and fully grounded (no hallucinated claims).

  • 1,719 examples (860 public + 859 private)
  • Domains: finance, technology, retail, medicine, law
  • Tasks: summarization, Q&A generation, rewriting
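
To make the task format concrete, here is a sketch of a single Grounding example. The field names are illustrative assumptions, not the dataset's published schema; consult the public Kaggle dataset for the real format:

```python
# Hypothetical shape of one FACTS Grounding example (field names are
# placeholders, not the actual dataset schema).
example = {
    "system_instruction": "Answer using only the provided document.",
    "user_request": "Summarize the key findings of the attached report.",
    "context_document": "(long-form source text, up to ~32K tokens)",
}

REQUIRED_FIELDS = ("system_instruction", "user_request", "context_document")

def is_complete(ex):
    """Check that an example carries all three components the benchmark expects."""
    return all(ex.get(field) for field in REQUIRED_FIELDS)

print(is_complete(example))  # → True
```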

2. FACTS Parametric

Tests a model’s ability to accurately recall facts from its internal (parametric) knowledge when answering factoid questions, without the aid of external tools such as web search. Questions are trivia-style, driven by user interest, and answerable via Wikipedia.

  • 2,104 examples (1,052 public + 1,052 private)
  • Diverse domains and answer types
  • Example: “Who played harmonica on ‘The Rockford Files’ theme song?”

3. FACTS Search

Tests a model’s ability to use web search tools to retrieve and synthesize factual information correctly, including multi-hop queries that require combining evidence from multiple sources.

  • 1,884 examples
  • Multi-hop queries requiring retrieval and synthesis

4. FACTS Multimodal

Tests a model’s ability to answer questions about input images in a factually correct manner. Requires integrating visual grounding (accurately interpreting visual input) with internal world knowledge.

  • 1,522 examples (711 public + 811 private)
  • Diverse image types and question categories
  • Example: An image of a moth with the prompt “What genus does this animal belong to?”

Key Characteristics

| Feature | Details |
|---|---|
| Total examples | 3,513+ across 4 benchmarks (public + private) |
| Benchmarks | Grounding, Parametric, Search, Multimodal |
| Document length | Up to 32K tokens (Grounding) |
| Evaluation | Ensemble of 3 frontier LLM judges |
| Anti-gaming | Quality filtering disqualifies evasive responses |
| Anti-contamination | Private held-out sets for each benchmark |
| FACTS Score | Average accuracy across all 4 benchmarks |
| Hosted by | Kaggle (independent reproduction) |

Who Built It?

The FACTS Benchmark Suite was developed by Google DeepMind and Google Research, in partnership with Kaggle for hosting and independent result reproduction.

Lead Contributors

The FACTS team includes:

  • Alon Jacovi, Andrew Wang, Chris Alberti, Connie Tao, Jon Lipovetz, Kate Olszewska, Lukas Haas, Michelle Liu, Nate Keating, Dipanjan Das — Core FACTS team

With support from senior leadership including Avinatan Hassidim, D. Sculley, Fernando Pereira, Koray Kavukcuoglu, Slav Petrov, Ya Xu, and Yossi Matias.

Evolution

| Date | Milestone |
|---|---|
| December 2024 | FACTS Grounding v1 launched with leaderboard on Kaggle |
| January 2025 | Technical report published (arXiv:2501.03200) |
| December 2025 | FACTS Benchmark Suite launched (4 benchmarks); Grounding updated to v2 |
| Ongoing | Leaderboard actively maintained and updated by Kaggle |

Resources

| Resource | Link |
|---|---|
| Google DeepMind Blog | deepmind.google/blog/facts-benchmark-suite… |
| FACTS Grounding Paper (v1) | arxiv.org/abs/2501.03200 |
| FACTS Grounding Paper (v2) | arxiv.org/abs/2512.10791 |
| FACTS Benchmark Suite Paper | PDF (Google DeepMind) |

What Skills Does It Test?

The FACTS Benchmark Suite tests the complete factuality pipeline of LLMs — from internal knowledge recall to document grounding to web-based retrieval to visual understanding. This multi-dimensional approach reveals that models can excel in one dimension while failing in another.

graph TD
    FACTS["FACTS Suite<br/>Factuality Skills"] --> IK["Internal Knowledge<br/>(Parametric)"]
    FACTS --> DG["Document Grounding<br/>(Grounding)"]
    FACTS --> WR["Web Retrieval<br/>(Search)"]
    FACTS --> VU["Visual Understanding<br/>(Multimodal)"]
    FACTS --> QF["Quality & Completeness<br/>(All benchmarks)"]
    FACTS --> AH["Anti-Hallucination<br/>(All benchmarks)"]

    style FACTS fill:#e74c3c,color:#fff,stroke:#333
    style IK fill:#3498db,color:#fff,stroke:#333
    style DG fill:#27ae60,color:#fff,stroke:#333
    style WR fill:#f39c12,color:#fff,stroke:#333
    style VU fill:#8e44ad,color:#fff,stroke:#333
    style QF fill:#e67e22,color:#fff,stroke:#333
    style AH fill:#6cc3d5,color:#fff,stroke:#333

| Capability | Benchmark | What It Tests |
|---|---|---|
| Internal knowledge | Parametric | Accurate recall of factual information from training data |
| Document grounding | Grounding | Generating responses fully supported by provided context |
| Information retrieval | Search | Using web search to find and synthesize facts correctly |
| Visual reasoning | Multimodal | Answering factual questions about images |
| Response quality | All | Providing comprehensive, useful responses (not evasive) |
| Anti-hallucination | All | Avoiding fabricated claims not supported by evidence |

Evaluation Methodology

FACTS uses an ensemble of 3 frontier LLM judges (originally Gemini 1.5 Pro, GPT-4o, and Claude 3.5 Sonnet) to evaluate responses. This multi-judge approach mitigates evaluation bias. Each response is evaluated in two phases:

  1. Eligibility — Is the response a comprehensive answer to the user’s request? (Disqualified only if all 3 judges agree it’s “ineligible”)
  2. Factuality — Is the response fully grounded / factually correct? (Average of all 3 judges’ scores)
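
The two-phase protocol can be sketched as follows. This is a simplified illustration, not the official implementation: the actual judge prompts, score scales, and aggregation details are specified in the technical reports.

```python
from dataclasses import dataclass

@dataclass
class JudgeVerdict:
    eligible: bool     # does the response comprehensively address the request?
    factuality: float  # judge's factuality/groundedness score in [0, 1]

def score_response(verdicts: list[JudgeVerdict]) -> float:
    """Combine an ensemble of judge verdicts using the two-phase protocol.

    Phase 1 (eligibility): the response is disqualified only if ALL judges
    agree it is ineligible, which penalizes evasive non-answers.
    Phase 2 (factuality): otherwise, the final score is the mean of the
    judges' factuality scores.
    """
    if all(not v.eligible for v in verdicts):
        return 0.0  # unanimously ineligible -> disqualified
    return sum(v.factuality for v in verdicts) / len(verdicts)

# Three judges disagree on eligibility; since not all deem it ineligible,
# the response survives phase 1 and its factuality scores are averaged.
verdicts = [JudgeVerdict(True, 0.9), JudgeVerdict(True, 0.8), JudgeVerdict(False, 0.7)]
print(f"{score_response(verdicts):.2f}")  # → 0.80
```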

Current Leaderboard

FACTS Benchmark Suite (Overall)

The overall FACTS Score is the average accuracy across all four benchmarks. Results are independently reproduced by Kaggle.

Source: FACTS Benchmark Suite Leaderboard on Kaggle (consulted March 28, 2026). Last updated March 25, 2026.

| Rank | Model | FACTS Score | Grounding | Multimodal | Search | Parametric |
|---|---|---|---|---|---|---|
| 1 | Gemini 3.1 Pro Preview | 67.7% | 65.0% | 41.3% | 85.6% | 78.9% |
| 2 | Gemini 2.5 Pro | 62.1% | 74.3% | 46.9% | 63.9% | 63.2% |
| 3 | GPT-5 | 61.8% | 69.6% | 44.1% | 77.7% | 55.8% |
| 4 | Gemini 3 Flash Preview | 60.4% | 59.0% | 41.3% | 81.0% | |
| 5 | Gemini 3.1 Flash-Lite Preview | 57.6% | 66.5% | 39.4% | 66.8% | |
| 6 | GPT-5.2 | 54.4% | 76.2% | 39.7% | 72.2% | 29.7% |
| 7 | Grok 4 | 53.6% | 54.7% | 25.7% | 75.3% | 58.6% |
| 8 | o3 | 52.0% | 36.2% | 39.9% | 74.8% | 57.1% |
| 9 | Claude Opus 4.5 | 51.3% | 62.1% | 39.2% | 73.2% | 30.6% |

FACTS Grounding (Standalone)

The Grounding benchmark has the largest number of evaluated models (66+). Top performers:

Source: FACTS Grounding Leaderboard on Kaggle (consulted March 28, 2026).

| Rank | Model | Score | Public | Private |
|---|---|---|---|---|
| 1 | GPT-5.2 | 76.2% ± 2.0 | 77.3% | 75.1% |
| 2 | Gemini 2.5 Pro | 74.3% ± 2.1 | 74.3% | 74.3% |
| 3 | Llama 3 – Grounded LM | 71.8% ± 2.1 | 72.0% | 71.5% |
| 4 | Gemini 2.5 Flash | 70.0% ± 2.2 | 70.5% | 69.5% |
| 5 | GPT-5 | 69.6% ± 2.2 | 69.3% | 70.0% |
| 6 | Gemini 3.1 Flash-Lite | 66.5% ± 2.2 | 67.4% | 65.7% |
| 7 | Gemini 3.1 Pro Preview | 65.0% ± 2.3 | 65.9% | 65.5% |
| 8 | Claude Opus 4.5 | 62.1% ± 2.3 | 64.4% | 59.8% |
| 9 | Claude Sonnet 4.5 (thinking) | 61.8% ± 2.3 | 64.5% | 59.1% |

Key Observations

graph LR
    A["Multimodal is hardest<br/>Best: 46.9%<br/>(Gemini 2.5 Pro)"] --> C["Factuality remains<br/>an unsolved problem"]
    B["Search is strongest<br/>Best: 85.6%<br/>(Gemini 3.1 Pro)"] --> C
    D["No model > 68%<br/>overall FACTS Score"] --> C
    C --> E["Multi-dimensional<br/>evaluation is essential"]

    style A fill:#e74c3c,color:#fff,stroke:#333
    style B fill:#27ae60,color:#fff,stroke:#333
    style D fill:#f39c12,color:#fff,stroke:#333
    style E fill:#3498db,color:#fff,stroke:#333

  • No model exceeds 68% overall — Even the best model (Gemini 3.1 Pro Preview at 67.7%) leaves substantial room for improvement
  • Multimodal is the hardest dimension — The best multimodal score is just 46.9%, far below other dimensions
  • Search is the strongest dimension — Models score up to 85.6% when given web search tools
  • Grounding and Parametric vary widely — Some models excel at grounding (GPT-5.2 at 76.2%) but struggle with parametric knowledge (29.7%)
  • Specialization vs generalization — Models that top one benchmark often trail in others, highlighting the need for multi-dimensional evaluation

Where to Explore the Benchmark

Leaderboards on Kaggle

| Resource | Description | Link |
|---|---|---|
| FACTS Suite Leaderboard | Overall ranking across all 4 benchmarks | kaggle.com/benchmarks/google/facts |
| FACTS Grounding | Standalone grounding leaderboard (66+ models) | kaggle.com/benchmarks/google/facts-grounding |
| FACTS Parametric | Standalone parametric knowledge leaderboard | kaggle.com/benchmarks/google/facts-parametric |
| FACTS Search | Standalone search-augmented leaderboard | kaggle.com/benchmarks/google/facts-search |
| FACTS Multimodal | Standalone multimodal factuality leaderboard | kaggle.com/benchmarks/google/facts-multimodal |

Dataset and Code

| Resource | Description | Link |
|---|---|---|
| FACTS Grounding Public Dataset | 860 public examples for self-evaluation | kaggle.com/datasets/deepmind/facts-grounding-examples |
| Starter Notebook | Kaggle notebook for running FACTS Grounding v2 | kaggle.com/code/prathameshbang/facts-grounding-v2-benchmark-starter |
| FACTS Grounding Paper (v1) | Technical report with methodology | arxiv.org/abs/2501.03200 |
| FACTS Grounding Paper (v2) | Updated judges and methodology | arxiv.org/abs/2512.10791 |
| FACTS Suite Paper | Full technical report for all 4 benchmarks | PDF |
| DeepMind Blog (Grounding) | Blog post introducing FACTS Grounding | deepmind.google/blog/facts-grounding… |
| DeepMind Blog (Suite) | Blog post introducing the full FACTS Suite | deepmind.google/blog/facts-benchmark-suite… |

Submit Your Model

To request evaluation of a new model on the full FACTS leaderboard (including private held-out sets), fill out the submission form. Official results are run by the Kaggle team to ensure integrity.

Why FACTS Matters

graph LR
    A["Single-dimension<br/>factuality tests"] --> B["Models appear<br/>more accurate<br/>than they are"]
    B --> C["FACTS Suite<br/>exposes blind spots"]
    C --> D["Trustworthy<br/>LLM deployment"]

    A2["Hallucination<br/>is multi-faceted"] --> B2["Grounding ≠ Knowledge<br/>≠ Search ≠ Vision"]
    B2 --> C
    C --> D2["Targeted research<br/>on each dimension"]

    style A fill:#e74c3c,color:#fff,stroke:#333
    style A2 fill:#e74c3c,color:#fff,stroke:#333
    style C fill:#27ae60,color:#fff,stroke:#333
    style D fill:#3498db,color:#fff,stroke:#333
    style D2 fill:#3498db,color:#fff,stroke:#333

  1. Multi-dimensional evaluation — Tests grounding, parametric knowledge, search, and multimodal factuality in one suite
  2. Reveals specialization gaps — Models excelling at grounding may fail at parametric knowledge, and vice versa
  3. Anti-gaming design — Quality filtering prevents evasive short responses; private held-out sets guard against contamination
  4. Multi-judge evaluation — Ensemble of 3 frontier LLM judges mitigates scoring bias
  5. Independently hosted — Kaggle independently reproduces all results, ensuring integrity
  6. Actively maintained — Leaderboard continuously updated with new models and benchmark improvements


Conclusion

The FACTS Benchmark Suite provides a comprehensive, multi-dimensional evaluation of LLM factuality:

  • 4 benchmarks covering grounding, parametric knowledge, search, and multimodal factuality
  • 3,513+ examples across diverse domains (finance, technology, medicine, law, retail)
  • Built by Google DeepMind and Google Research, hosted and independently verified by Kaggle
  • The best model scores 67.7% overall — substantial headroom for improvement remains
  • Multimodal factuality is the weakest dimension across all models (best: 46.9%)
  • Grounding and parametric knowledge show wide variance — models that ground well can fail at knowledge recall, and vice versa

As LLMs become primary information sources, the FACTS Benchmark Suite ensures we can measure not just whether models know facts, but how reliably they use them — whether from internal knowledge, provided documents, web search, or visual inputs.

“We hope this work encourages deeper research into LLM factuality, leading to better and more accurate models and products for the people that rely on them.” — Google DeepMind FACTS Team

References

  • Jacovi, A., Wang, A., Alberti, C., Tao, C., Lipovetz, J., Olszewska, K., Haas, L., Liu, M., Keating, N., Das, D. “The FACTS Grounding Leaderboard: Benchmarking LLMs’ Ability to Ground Responses to Long-Form Input.” arXiv preprint arXiv:2501.03200 (2025). arxiv.org/abs/2501.03200
  • Google DeepMind. “FACTS Grounding v2 Technical Report.” arXiv preprint arXiv:2512.10791 (2025). arxiv.org/abs/2512.10791
  • Google DeepMind. “FACTS Benchmark Suite Paper.” PDF
  • Google DeepMind. “FACTS Grounding: A new benchmark for evaluating the factuality of large language models.” deepmind.google/blog/facts-grounding… (December 2024)
  • Google DeepMind. “FACTS Benchmark Suite: Systematically evaluating the factuality of large language models.” deepmind.google/blog/facts-benchmark-suite… (December 2025)
  • Google DeepMind & Kaggle. “FACTS Benchmark Suite Leaderboard.” kaggle.com/benchmarks/google/facts (consulted March 28, 2026)
  • Google DeepMind & Kaggle. “FACTS Grounding Leaderboard.” kaggle.com/benchmarks/google/facts-grounding (consulted March 28, 2026)
