FACTS Benchmark Suite

A comprehensive benchmark suite from Google DeepMind for systematically evaluating the factuality of LLMs across grounding, parametric knowledge, search, and multimodal tasks

Published: September 5, 2025

Keywords: FACTS Benchmark Suite, FACTS Grounding, factuality, hallucination, LLM evaluation, Google DeepMind, Google Research, Kaggle, grounding accuracy, parametric knowledge, search benchmark, multimodal factuality

Introduction

LLMs are increasingly becoming a primary source for information delivery — from answering questions to summarizing documents to analyzing images. But their grip on factual accuracy remains imperfect. They “hallucinate” false information, particularly when given complex inputs, eroding trust and limiting real-world applications.

Most benchmarks test knowledge or reasoning in isolation. But factuality failures happen in many different ways: a model might hallucinate when answering from memory, fail to ground its response in a provided document, retrieve the wrong information from the web, or misinterpret an image. Testing only one dimension gives an incomplete picture.

The FACTS Benchmark Suite addresses this by systematically evaluating LLM factuality across four distinct dimensions: grounding, parametric knowledge, search, and multimodal understanding. No model scores above 68% on the overall suite, revealing substantial room for improvement.

graph LR
    A["Single-dimension<br/>factuality tests"] --> B["Incomplete picture<br/>of model accuracy"]
    B --> C["FACTS Benchmark Suite<br/>4 benchmarks, 3,513 examples<br/>Best model < 68%"]
    C --> D["Systematic measure<br/>of LLM factuality"]

    style A fill:#e74c3c,stroke:#333,color:#fff
    style B fill:#f39c12,stroke:#333,color:#fff
    style C fill:#27ae60,stroke:#333,color:#fff
    style D fill:#3498db,stroke:#333,color:#fff

What Is the FACTS Benchmark Suite?

The FACTS Benchmark Suite is a collection of four complementary benchmarks designed to evaluate the factual accuracy of LLMs across different use cases. Each benchmark tests a distinct factuality capability, and the FACTS Score is the average accuracy across all four.
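
Since the FACTS Score is defined as a plain average, it can be reproduced directly from per-benchmark numbers. A minimal sketch in Python (the figures are Gemini 2.5 Pro's results from the leaderboard later in this article):

```python
# The overall FACTS Score is the unweighted mean of the four per-benchmark
# accuracies (all values in percent).
def facts_score(grounding, parametric, search, multimodal):
    return (grounding + parametric + search + multimodal) / 4

# Reproducing Gemini 2.5 Pro's reported 62.1% from its per-benchmark scores:
overall = facts_score(grounding=74.3, parametric=63.2, search=63.9, multimodal=46.9)
print(f"{overall:.1f}%")  # → 62.1%
```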

graph TD
    FACTS["FACTS Benchmark Suite<br/>3,513+ examples"] --> G["Grounding<br/>1,719 examples"]
    FACTS --> P["Parametric<br/>2,104 examples"]
    FACTS --> S["Search<br/>1,884 examples"]
    FACTS --> M["Multimodal<br/>1,522 examples"]

    G --> G1["Ground responses<br/>in provided documents<br/>Up to 32K tokens"]
    P --> P1["Answer factoid questions<br/>from internal knowledge<br/>No external tools"]
    S --> S1["Use web search<br/>to retrieve and synthesize<br/>Multi-hop queries"]
    M --> M1["Answer questions<br/>about input images<br/>Visual + world knowledge"]

    style FACTS fill:#e74c3c,color:#fff,stroke:#333
    style G fill:#3498db,color:#fff,stroke:#333
    style P fill:#27ae60,color:#fff,stroke:#333
    style S fill:#f39c12,color:#fff,stroke:#333
    style M fill:#8e44ad,color:#fff,stroke:#333

The Four Benchmarks

1. FACTS Grounding (v2)

Tests whether LLMs can generate factually accurate responses grounded in provided long-form documents (up to 32K tokens). Each example includes a system instruction, user request, and context document requiring a long-form response. Responses must be both comprehensive (addressing the user’s request) and fully grounded (no hallucinated claims).

  • 1,719 examples (860 public + 859 private)
  • Domains: finance, technology, retail, medicine, law
  • Tasks: summarization, Q&A generation, rewriting
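
To make the task format concrete, here is a sketch of a single Grounding example. The field names are illustrative assumptions, not the dataset's published schema; consult the public Kaggle dataset for the real format:

```python
# Hypothetical shape of one FACTS Grounding example (field names are
# placeholders, not the actual dataset schema).
example = {
    "system_instruction": "Answer using only the provided document.",
    "user_request": "Summarize the key findings of the attached report.",
    "context_document": "(long-form source text, up to ~32K tokens)",
}

REQUIRED_FIELDS = ("system_instruction", "user_request", "context_document")

def is_complete(ex):
    """Check that an example carries all three components the benchmark expects."""
    return all(ex.get(field) for field in REQUIRED_FIELDS)

print(is_complete(example))  # → True
```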

2. FACTS Parametric

Tests a model’s ability to accurately recall facts from its internal (parametric) knowledge when answering factoid questions, without the aid of external tools such as web search. Questions are trivia-style, driven by user interest, and answerable via Wikipedia.

  • 2,104 examples (1,052 public + 1,052 private)
  • Diverse domains and answer types
  • Example: “Who played harmonica on ‘The Rockford Files’ theme song?”

3. FACTS Search

Tests a model’s ability to use web search tools to retrieve and synthesize factual information correctly, including multi-hop queries that require combining evidence from multiple sources.

  • 1,884 examples
  • Multi-hop queries requiring retrieval and synthesis

4. FACTS Multimodal

Tests a model’s ability to answer questions about input images in a factually correct manner. Requires integrating visual grounding (accurately interpreting visual input) with internal world knowledge.

  • 1,522 examples (711 public + 811 private)
  • Diverse image types and question categories
  • Example: An image of a moth with the prompt “What genus does this animal belong to?”

Key Characteristics

| Feature | Details |
|---|---|
| Total examples | 3,513+ across 4 benchmarks (public + private) |
| Benchmarks | Grounding, Parametric, Search, Multimodal |
| Document length | Up to 32K tokens (Grounding) |
| Evaluation | Ensemble of 3 frontier LLM judges |
| Anti-gaming | Quality filtering disqualifies evasive responses |
| Anti-contamination | Private held-out sets for each benchmark |
| FACTS Score | Average accuracy across all 4 benchmarks |
| Hosted by | Kaggle (independent reproduction) |

Who Built It?

The FACTS Benchmark Suite was developed by Google DeepMind and Google Research, in partnership with Kaggle for hosting and independent result reproduction.

Lead Contributors

The FACTS team includes:

  • Alon Jacovi, Andrew Wang, Chris Alberti, Connie Tao, Jon Lipovetz, Kate Olszewska, Lukas Haas, Michelle Liu, Nate Keating, Dipanjan Das — Core FACTS team

With support from senior leadership including Avinatan Hassidim, D. Sculley, Fernando Pereira, Koray Kavukcuoglu, Slav Petrov, Ya Xu, and Yossi Matias.

Evolution

| Date | Milestone |
|---|---|
| December 2024 | FACTS Grounding v1 launched with leaderboard on Kaggle |
| January 2025 | Technical report published (arXiv:2501.03200) |
| December 2025 | FACTS Benchmark Suite launched (4 benchmarks); Grounding updated to v2 |
| Ongoing | Leaderboard actively maintained and updated by Kaggle |

Resources

| Resource | Link |
|---|---|
| Google DeepMind Blog | deepmind.google/blog/facts-benchmark-suite… |
| FACTS Grounding Paper (v1) | arxiv.org/abs/2501.03200 |
| FACTS Grounding Paper (v2) | arxiv.org/abs/2512.10791 |
| FACTS Benchmark Suite Paper | PDF (Google DeepMind) |

What Skills Does It Test?

The FACTS Benchmark Suite tests the complete factuality pipeline of LLMs — from internal knowledge recall to document grounding to web-based retrieval to visual understanding. This multi-dimensional approach reveals that models can excel in one dimension while failing in another.

graph TD
    FACTS["FACTS Suite<br/>Factuality Skills"] --> IK["Internal Knowledge<br/>(Parametric)"]
    FACTS --> DG["Document Grounding<br/>(Grounding)"]
    FACTS --> WR["Web Retrieval<br/>(Search)"]
    FACTS --> VU["Visual Understanding<br/>(Multimodal)"]
    FACTS --> QF["Quality & Completeness<br/>(All benchmarks)"]
    FACTS --> AH["Anti-Hallucination<br/>(All benchmarks)"]

    style FACTS fill:#e74c3c,color:#fff,stroke:#333
    style IK fill:#3498db,color:#fff,stroke:#333
    style DG fill:#27ae60,color:#fff,stroke:#333
    style WR fill:#f39c12,color:#fff,stroke:#333
    style VU fill:#8e44ad,color:#fff,stroke:#333
    style QF fill:#e67e22,color:#fff,stroke:#333
    style AH fill:#6cc3d5,color:#fff,stroke:#333

| Capability | Benchmark | What It Tests |
|---|---|---|
| Internal knowledge | Parametric | Accurate recall of factual information from training data |
| Document grounding | Grounding | Generating responses fully supported by provided context |
| Information retrieval | Search | Using web search to find and synthesize facts correctly |
| Visual reasoning | Multimodal | Answering factual questions about images |
| Response quality | All | Providing comprehensive, useful responses (not evasive) |
| Anti-hallucination | All | Avoiding fabricated claims not supported by evidence |

Evaluation Methodology

FACTS uses an ensemble of 3 frontier LLM judges (originally Gemini 1.5 Pro, GPT-4o, and Claude 3.5 Sonnet) to evaluate responses. This multi-judge approach mitigates evaluation bias. Each response is evaluated in two phases:

  1. Eligibility — Is the response a comprehensive answer to the user’s request? (Disqualified only if all 3 judges agree it’s “ineligible”)
  2. Factuality — Is the response fully grounded / factually correct? (Average of all 3 judges’ scores)
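
The two-phase protocol can be sketched as follows. This is a simplified illustration, not the official implementation: the actual judge prompts, score scales, and aggregation details are specified in the technical reports.

```python
from dataclasses import dataclass

@dataclass
class JudgeVerdict:
    eligible: bool     # does the response comprehensively address the request?
    factuality: float  # judge's factuality/groundedness score in [0, 1]

def score_response(verdicts: list[JudgeVerdict]) -> float:
    """Combine an ensemble of judge verdicts using the two-phase protocol.

    Phase 1 (eligibility): the response is disqualified only if ALL judges
    agree it is ineligible, which penalizes evasive non-answers.
    Phase 2 (factuality): otherwise, the final score is the mean of the
    judges' factuality scores.
    """
    if all(not v.eligible for v in verdicts):
        return 0.0  # unanimously ineligible -> disqualified
    return sum(v.factuality for v in verdicts) / len(verdicts)

# Three judges disagree on eligibility; since not all deem it ineligible,
# the response survives phase 1 and its factuality scores are averaged.
verdicts = [JudgeVerdict(True, 0.9), JudgeVerdict(True, 0.8), JudgeVerdict(False, 0.7)]
print(f"{score_response(verdicts):.2f}")  # → 0.80
```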

Current Leaderboard

FACTS Benchmark Suite (Overall)

The overall FACTS Score is the average accuracy across all four benchmarks. Results are independently reproduced by Kaggle.

Source: FACTS Benchmark Suite Leaderboard on Kaggle (consulted March 28, 2026). Last updated March 25, 2026.

| Rank | Model | FACTS Score | Grounding | Multimodal | Search | Parametric |
|---|---|---|---|---|---|---|
| 1 | Gemini 3.1 Pro Preview | 67.7% | 65.0% | 41.3% | 85.6% | 78.9% |
| 2 | Gemini 2.5 Pro | 62.1% | 74.3% | 46.9% | 63.9% | 63.2% |
| 3 | GPT-5 | 61.8% | 69.6% | 44.1% | 77.7% | 55.8% |
| 4 | Gemini 3 Flash Preview | 60.4% | 59.0% | 41.3% | 81.0% | |
| 5 | Gemini 3.1 Flash-Lite Preview | 57.6% | 66.5% | 39.4% | 66.8% | |
| 6 | GPT-5.2 | 54.4% | 76.2% | 39.7% | 72.2% | 29.7% |
| 7 | Grok 4 | 53.6% | 54.7% | 25.7% | 75.3% | 58.6% |
| 8 | o3 | 52.0% | 36.2% | 39.9% | 74.8% | 57.1% |
| 9 | Claude Opus 4.5 | 51.3% | 62.1% | 39.2% | 73.2% | 30.6% |

FACTS Grounding (Standalone)

The Grounding benchmark has the largest number of evaluated models (66+). Top performers:

Source: FACTS Grounding Leaderboard on Kaggle (consulted March 28, 2026).

| Rank | Model | Score | Public | Private |
|---|---|---|---|---|
| 1 | GPT-5.2 | 76.2% ± 2.0 | 77.3% | 75.1% |
| 2 | Gemini 2.5 Pro | 74.3% ± 2.1 | 74.3% | 74.3% |
| 3 | Llama 3 – Grounded LM | 71.8% ± 2.1 | 72.0% | 71.5% |
| 4 | Gemini 2.5 Flash | 70.0% ± 2.2 | 70.5% | 69.5% |
| 5 | GPT-5 | 69.6% ± 2.2 | 69.3% | 70.0% |
| 6 | Gemini 3.1 Flash-Lite | 66.5% ± 2.2 | 67.4% | 65.7% |
| 7 | Gemini 3.1 Pro Preview | 65.0% ± 2.3 | 65.9% | 65.5% |
| 8 | Claude Opus 4.5 | 62.1% ± 2.3 | 64.4% | 59.8% |
| 9 | Claude Sonnet 4.5 (thinking) | 61.8% ± 2.3 | 64.5% | 59.1% |

Key Observations

graph LR
    A["Multimodal is hardest<br/>Best: 46.9%<br/>(Gemini 2.5 Pro)"] --> C["Factuality remains<br/>an unsolved problem"]
    B["Search is strongest<br/>Best: 85.6%<br/>(Gemini 3.1 Pro)"] --> C
    D["No model > 68%<br/>overall FACTS Score"] --> C
    C --> E["Multi-dimensional<br/>evaluation is essential"]

    style A fill:#e74c3c,color:#fff,stroke:#333
    style B fill:#27ae60,color:#fff,stroke:#333
    style D fill:#f39c12,color:#fff,stroke:#333
    style E fill:#3498db,color:#fff,stroke:#333

  • No model exceeds 68% overall — Even the best model (Gemini 3.1 Pro Preview at 67.7%) leaves substantial room for improvement
  • Multimodal is the hardest dimension — The best multimodal score is just 46.9%, far below other dimensions
  • Search is the strongest dimension — Models score up to 85.6% when given web search tools
  • Grounding and Parametric vary widely — Some models excel at grounding (GPT-5.2 at 76.2%) but struggle with parametric knowledge (29.7%)
  • Specialization vs generalization — Models that top one benchmark often trail in others, highlighting the need for multi-dimensional evaluation

Where to Explore the Benchmark

Leaderboards on Kaggle

| Resource | Description | Link |
|---|---|---|
| FACTS Suite Leaderboard | Overall ranking across all 4 benchmarks | kaggle.com/benchmarks/google/facts |
| FACTS Grounding | Standalone grounding leaderboard (66+ models) | kaggle.com/benchmarks/google/facts-grounding |
| FACTS Parametric | Standalone parametric knowledge leaderboard | kaggle.com/benchmarks/google/facts-parametric |
| FACTS Search | Standalone search-augmented leaderboard | kaggle.com/benchmarks/google/facts-search |
| FACTS Multimodal | Standalone multimodal factuality leaderboard | kaggle.com/benchmarks/google/facts-multimodal |

Dataset and Code

| Resource | Description | Link |
|---|---|---|
| FACTS Grounding Public Dataset | 860 public examples for self-evaluation | kaggle.com/datasets/deepmind/facts-grounding-examples |
| Starter Notebook | Kaggle notebook for running FACTS Grounding v2 | kaggle.com/code/prathameshbang/facts-grounding-v2-benchmark-starter |
| FACTS Grounding Paper (v1) | Technical report with methodology | arxiv.org/abs/2501.03200 |
| FACTS Grounding Paper (v2) | Updated judges and methodology | arxiv.org/abs/2512.10791 |
| FACTS Suite Paper | Full technical report for all 4 benchmarks | PDF |
| DeepMind Blog (Grounding) | Blog post introducing FACTS Grounding | deepmind.google/blog/facts-grounding… |
| DeepMind Blog (Suite) | Blog post introducing the full FACTS Suite | deepmind.google/blog/facts-benchmark-suite… |

Submit Your Model

To request evaluation of a new model on the full FACTS leaderboard (including private held-out sets), fill out the submission form. Official results are run by the Kaggle team to ensure integrity.

Why FACTS Matters

graph LR
    A["Single-dimension<br/>factuality tests"] --> B["Models appear<br/>more accurate<br/>than they are"]
    B --> C["FACTS Suite<br/>exposes blind spots"]
    C --> D["Trustworthy<br/>LLM deployment"]

    A2["Hallucination<br/>is multi-faceted"] --> B2["Grounding ≠ Knowledge<br/>≠ Search ≠ Vision"]
    B2 --> C
    C --> D2["Targeted research<br/>on each dimension"]

    style A fill:#e74c3c,color:#fff,stroke:#333
    style A2 fill:#e74c3c,color:#fff,stroke:#333
    style C fill:#27ae60,color:#fff,stroke:#333
    style D fill:#3498db,color:#fff,stroke:#333
    style D2 fill:#3498db,color:#fff,stroke:#333

  1. Multi-dimensional evaluation — Tests grounding, parametric knowledge, search, and multimodal factuality in one suite
  2. Reveals specialization gaps — Models excelling at grounding may fail at parametric knowledge, and vice versa
  3. Anti-gaming design — Quality filtering prevents evasive short responses; private held-out sets guard against contamination
  4. Multi-judge evaluation — Ensemble of 3 frontier LLM judges mitigates scoring bias
  5. Independently hosted — Kaggle independently reproduces all results, ensuring integrity
  6. Actively maintained — Leaderboard continuously updated with new models and benchmark improvements


Conclusion

The FACTS Benchmark Suite provides a comprehensive, multi-dimensional evaluation of LLM factuality:

  • 4 benchmarks covering grounding, parametric knowledge, search, and multimodal factuality
  • 3,513+ examples across diverse domains (finance, technology, medicine, law, retail)
  • Built by Google DeepMind and Google Research, hosted and independently verified by Kaggle
  • The best model scores 67.7% overall — substantial headroom for improvement remains
  • Multimodal factuality is the weakest dimension across all models (best: 46.9%)
  • Grounding and parametric knowledge show wide variance — models that ground well can fail at knowledge recall, and vice versa

As LLMs become primary information sources, the FACTS Benchmark Suite ensures we can measure not just whether models know facts, but how reliably they use them — whether from internal knowledge, provided documents, web search, or visual inputs.

“We hope this work encourages deeper research into LLM factuality, leading to better and more accurate models and products for the people that rely on them.” — Google DeepMind FACTS Team

References

  • Jacovi, A., Wang, A., Alberti, C., Tao, C., Lipovetz, J., Olszewska, K., Haas, L., Liu, M., Keating, N., Das, D. “The FACTS Grounding Leaderboard: Benchmarking LLMs’ Ability to Ground Responses to Long-Form Input.” arXiv preprint arXiv:2501.03200 (2025). arxiv.org/abs/2501.03200
  • Google DeepMind. “FACTS Grounding v2 Technical Report.” arXiv preprint arXiv:2512.10791 (2025). arxiv.org/abs/2512.10791
  • Google DeepMind. “FACTS Benchmark Suite Paper.” PDF
  • Google DeepMind. “FACTS Grounding: A new benchmark for evaluating the factuality of large language models.” deepmind.google/blog/facts-grounding… (December 2024)
  • Google DeepMind. “FACTS Benchmark Suite: Systematically evaluating the factuality of large language models.” deepmind.google/blog/facts-benchmark-suite… (December 2025)
  • Google DeepMind & Kaggle. “FACTS Benchmark Suite Leaderboard.” kaggle.com/benchmarks/google/facts (consulted March 28, 2026)
  • Google DeepMind & Kaggle. “FACTS Grounding Leaderboard.” kaggle.com/benchmarks/google/facts-grounding (consulted March 28, 2026)
