```mermaid
graph LR
    A["Single Needle<br/>in a Haystack<br/>~100% for frontier models"] --> B["Too easy<br/>for modern LLMs"]
    B --> C["OpenAI MRCR<br/>2, 4, or 8 needles<br/>up to 1M tokens"]
    C --> D["Tests genuine<br/>long-context<br/>comprehension"]
    style A fill:#e74c3c,stroke:#333,color:#fff
    style B fill:#f39c12,stroke:#333,color:#fff
    style C fill:#27ae60,stroke:#333,color:#fff
    style D fill:#3498db,stroke:#333,color:#fff
```
OpenAI MRCR (Multi-Round Coreference Resolution)
A long-context ‘multiple needles in a haystack’ benchmark that tests whether LLMs can find and disambiguate 2, 4, or 8 hidden needles across conversations up to 1 million tokens
Keywords: OpenAI MRCR, Multi-Round Coreference Resolution, long context benchmark, needle in a haystack, multiple needles, long-context retrieval, GPT-4.1, GPT-5, context window, 1M tokens, Michelangelo, Latent Structure Queries, LLM evaluation, coreference resolution

Introduction
The “needle in a haystack” test has become the default way to evaluate whether LLMs can use their long context windows. But finding one obvious needle is trivially easy for modern frontier models — GPT-4.1, Gemini, and Claude all score nearly 100% on single-needle retrieval across their full context windows.
OpenAI MRCR (Multi-Round Coreference Resolution) raises the bar dramatically. Instead of hiding one obvious piece of information, it embeds 2, 4, or 8 nearly identical requests throughout a long, realistic conversation — and asks the model to retrieve a specific instance (e.g., “return the 3rd poem about tapirs”). The needles are drawn from the same distribution as the distractors, making them almost impossible to distinguish without genuine long-context comprehension.
“The challenge arises from the similarity between these requests and the rest of the context — models can easily be misled by subtle differences, such as a short story about tapirs rather than a poem, or a poem about frogs instead of tapirs.” — OpenAI, GPT-4.1 Blog
What Is OpenAI MRCR?
OpenAI MRCR is a long-context benchmark that evaluates an LLM’s ability to find and disambiguate between multiple hidden needles in a realistic multi-turn conversation. The task goes far beyond simple retrieval — the model must track identity, order, and type across potentially thousands of conversation turns.
How It Works
- A long, multi-turn conversation is synthetically generated: the user asks for writings (poems, blog posts, social media posts, stories) about various topics
- Hidden among the conversation are 2, 4, or 8 identical requests (e.g., “write a poem about tapirs”) — each producing a distinct response
- The model is asked to return the i-th instance of a specific request (e.g., “return the 2nd poem about tapirs”)
- The model must prepend an alphanumeric hash to its answer — if the hash is missing, the score is 0
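To make the task format concrete, here is a toy, heavily shortened sketch of what an MRCR-style sample might look like. The field names, contents, and hash value are hypothetical illustrations, not the dataset's actual schema; real samples span up to 1M tokens.

```python
# Toy MRCR-style sample (hypothetical schema, drastically shortened).
sample = {
    "messages": [
        {"role": "user", "content": "write a poem about tapirs"},        # needle 1
        {"role": "assistant", "content": "Tapirs wander, soft and slow."},
        {"role": "user", "content": "write a story about frogs"},        # distractor
        {"role": "assistant", "content": "Once, a frog set out at dawn."},
        {"role": "user", "content": "write a poem about tapirs"},        # needle 2
        {"role": "assistant", "content": "Beneath the canopy, a tapir dreams."},
        {"role": "user", "content": "Prepend 'a3F9' to your answer, then "
                                    "return the 2nd poem about tapirs."},
    ],
    "answer": "a3F9Beneath the canopy, a tapir dreams.",
}

# A correct response repeats the reply to the 2nd needle, hash first.
expected = "a3F9" + sample["messages"][5]["content"]
print(expected == sample["answer"])  # True
```

Note that the ground-truth answer is the full text of the requested response with the hash prepended, which is exactly what the grader checks for.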
Key Characteristics
| Feature | Details |
|---|---|
| Needle variants | 2-needle, 4-needle, and 8-needle |
| Context length bins | 4K, 8K, 16K, 32K, 64K, 128K, 256K, 512K, 1M tokens |
| Samples per bin | 100 |
| Total dataset | 2,400 data points |
| Distinct entities | 438 topics |
| Writing formats | 10 different formats (poems, blog posts, stories, etc.) |
| Metric | SequenceMatcher ratio (Python difflib) |
| Anti-gaming | Alphanumeric hash must be prepended to the answer |
| License | MIT |
Why Is the 8-Needle Variant So Hard?
The 8-needle variant is the most challenging configuration:
- 8 identical requests are scattered among hundreds of distractor turns
- Each needle produces a distinct but stylistically similar response (all generated by GPT-4o)
- The model must identify the correct ordinal instance (e.g., the 5th poem about tapirs, not the 4th or 6th)
- At long context lengths (256K–1M tokens), even frontier models struggle to maintain accuracy
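The ordinal bookkeeping this demands can be sketched in a few lines. The turn list and helper below are purely illustrative, not benchmark code; the point is that only exact matches on both format and topic count toward the ordinal.

```python
# Toy conversation turns: (request, response) pairs, hypothetical content.
turns = [
    ("write a poem about tapirs", "poem-1"),
    ("write a story about tapirs", "story-1"),   # distractor: wrong format
    ("write a poem about frogs", "poem-frogs"),  # distractor: wrong topic
    ("write a poem about tapirs", "poem-2"),
    ("write a poem about tapirs", "poem-3"),
]

def nth_response(request: str, n: int) -> str:
    """Return the response to the n-th (1-indexed) exact occurrence of request."""
    matches = [resp for req, resp in turns if req == request]
    return matches[n - 1]

print(nth_response("write a poem about tapirs", 2))  # poem-2
```

A model that miscounts by one, or counts the tapir story as a tapir poem, retrieves an entirely wrong response and scores near 0.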
```mermaid
graph TD
    A["Multi-turn conversation<br/>thousands of turns"] --> B["User asks for<br/>poems, blog posts,<br/>stories, etc."]
    B --> C["8 identical requests<br/>hidden among<br/>distractors"]
    C --> D["Model must return<br/>the i-th instance<br/>of a specific ask"]
    D --> E["Score: string match<br/>ratio + hash check"]
    style A fill:#ecf0f1,color:#333,stroke:#bdc3c7
    style B fill:#f39c12,color:#fff,stroke:#333
    style C fill:#e74c3c,color:#fff,stroke:#333
    style D fill:#3498db,color:#fff,stroke:#333
    style E fill:#27ae60,color:#fff,stroke:#333
```
Who Built It?
OpenAI MRCR was created by OpenAI and released alongside the GPT-4.1 model family on April 14, 2025. The benchmark is inspired by the MRCR evaluation first introduced by Google DeepMind in the Michelangelo paper (Vodrahalli et al., arXiv:2409.12640, September 2024), which proposed the Latent Structure Queries (LSQ) framework for evaluating long-context reasoning.
OpenAI expanded upon Google’s concept by:
- Increasing difficulty with 2, 4, and 8 needle variants
- Scaling context lengths up to 1 million tokens
- Open-sourcing the full dataset on Hugging Face for reproducibility
- Providing evaluation code so anyone can benchmark their own models
Key People and Institutions
| Entity | Role |
|---|---|
| OpenAI | Created and open-sourced OpenAI MRCR |
| Google DeepMind | Introduced the original MRCR concept in the Michelangelo paper |
| Kiran Vodrahalli et al. | Authors of the Michelangelo paper (Google) |
Publication History
| Date | Milestone |
|---|---|
| September 2024 | Google publishes Michelangelo paper introducing MRCR concept |
| April 14, 2025 | OpenAI releases OpenAI MRCR alongside GPT-4.1 launch |
| August 7, 2025 | OpenAI publishes updated MRCR results with GPT-5 |
| December 5, 2025 | Bugfix: ~10% of data points corrected for needle count errors |
What Skills Does It Test?
OpenAI MRCR tests a focused but critical set of long-context comprehension capabilities that go far beyond simple information retrieval:
```mermaid
graph TD
    MRCR["OpenAI MRCR<br/>Long-Context Benchmark"] --> R["Multi-Needle<br/>Retrieval"]
    MRCR --> C["Coreference<br/>Resolution"]
    MRCR --> O["Ordinal<br/>Tracking"]
    MRCR --> D["Distractor<br/>Resistance"]
    MRCR --> S["Scale<br/>Invariance"]
    style MRCR fill:#e74c3c,color:#fff,stroke:#333
    style R fill:#3498db,color:#fff,stroke:#333
    style C fill:#27ae60,color:#fff,stroke:#333
    style O fill:#f39c12,color:#fff,stroke:#333
    style D fill:#8e44ad,color:#fff,stroke:#333
    style S fill:#e67e22,color:#fff,stroke:#333
```
| Capability | What MRCR Tests |
|---|---|
| Multi-needle retrieval | Finding multiple pieces of information that look nearly identical to surrounding context |
| Coreference resolution | Linking “the 3rd poem about tapirs” to the correct response among 8 similar poems |
| Ordinal tracking | Maintaining correct count of repeated requests across long contexts |
| Distractor resistance | Ignoring similar but non-matching content (a story vs. a poem, frogs vs. tapirs) |
| Scale invariance | Performing consistently as context length increases from 4K to 1M tokens |
Why It Matters for Real-World Applications
MRCR reflects real-world scenarios where models must:
- Navigate large codebases — finding the right function definition among many similar ones
- Process legal documents — identifying the correct clause among dozens of similar paragraphs
- Handle customer support logs — locating the right conversation thread in a long history
- Work with research papers — distinguishing between multiple experiments with similar setups
Current Leaderboard
The tables below show model accuracy on OpenAI MRCR (2-needle) at 128K and 256K context lengths, as published in OpenAI’s official blog posts.
Source: Introducing GPT-5 for developers (August 7, 2025) and Introducing GPT-4.1 in the API (April 14, 2025). Consulted March 28, 2026.
2-Needle at 128K Context
| Rank | Model | Match Ratio (%) |
|---|---|---|
| 1 | GPT-5 (high) | 95.2 |
| 2 | GPT-5 mini (high) | 84.3 |
| 3 | GPT-4.1 | 57.2 |
| 4 | o4-mini (high) | 56.4 |
| 5 | o3 (high) | 55.0 |
| 6 | GPT-4.1 mini | 47.2 |
| 7 | GPT-5 nano (high) | 43.2 |
| 8 | GPT-4.5 Preview | 38.5 |
| 9 | GPT-4.1 nano | 36.6 |
| 10 | GPT-4o | 31.9 |
| 11 | GPT-4o mini | 24.5 |
| 12 | o1 | 22.1 |
| 13 | o3-mini | 18.7 |
2-Needle at 256K Context
| Rank | Model | Match Ratio (%) |
|---|---|---|
| 1 | GPT-5 (high) | 86.8 |
| 2 | GPT-5 mini (high) | 58.8 |
| 3 | GPT-4.1 | 56.2 |
| 4 | GPT-4.1 mini | 45.5 |
| 5 | GPT-5 nano (high) | 34.9 |
| 6 | GPT-4.1 nano | 22.6 |
Key takeaways:
- GPT-5 dominates at 95.2% on 2-needle 128K — a massive 38-point jump over GPT-4.1 (57.2%)
- At 256K tokens, GPT-5 still leads at 86.8%, demonstrating robust long-context understanding
- Reasoning models (o3, o4-mini) do not automatically outperform non-reasoning models on this task — GPT-4.1 beats o3 at 128K
- The 8-needle variant at long context lengths remains extremely challenging for all models — performance drops significantly compared to 2-needle results
- The benchmark is far from saturated, with the 4-needle and 8-needle variants providing headroom for future models
Where to Explore the Benchmark
Dashboards and Leaderboards
| Resource | Description | Link |
|---|---|---|
| OpenAI GPT-4.1 Blog | Original MRCR results with interactive 2/4/8 needle charts | openai.com/index/gpt-4-1 |
| OpenAI GPT-5 Dev Blog | Updated MRCR results with GPT-5 family | openai.com/index/introducing-gpt-5-for-developers |
Dataset and Code
| Resource | Description | Link |
|---|---|---|
| Hugging Face Dataset | Full 2,400-sample dataset (2-needle, 4-needle, 8-needle) | huggingface.co/datasets/openai/mrcr |
| Michelangelo Paper | Original Google paper introducing MRCR concept | arxiv.org/abs/2409.12640 |
Load the Dataset
from huggingface_hub import hf_hub_download
import pandas as pd
# Load 2-needle data
dataset_2needle = pd.concat([
pd.read_parquet(hf_hub_download(
repo_id="openai/mrcr",
filename="2needle/2needle_0.parquet",
repo_type="dataset"
)),
pd.read_parquet(hf_hub_download(
repo_id="openai/mrcr",
filename="2needle/2needle_1.parquet",
repo_type="dataset"
))
])
# Load 8-needle data
dataset_8needle = pd.concat([
pd.read_parquet(hf_hub_download(
repo_id="openai/mrcr",
filename="8needle/8needle_0.parquet",
repo_type="dataset"
)),
pd.read_parquet(hf_hub_download(
repo_id="openai/mrcr",
filename="8needle/8needle_1.parquet",
repo_type="dataset"
))
])Run the Evaluation
```python
from difflib import SequenceMatcher

from openai import OpenAI

# Client used to query the model under evaluation (query loop not shown).
client = OpenAI()

def grade(response: str, answer: str, random_string_to_prepend: str) -> float:
    """Score a response against the ground truth: hash check, then string similarity."""
    # A missing hash prefix means an automatic score of 0.
    if not response.startswith(random_string_to_prepend):
        return 0.0
    # Strip the hash from both sides before comparing.
    response = response.removeprefix(random_string_to_prepend)
    answer = answer.removeprefix(random_string_to_prepend)
    return float(SequenceMatcher(None, response, answer).ratio())
```

Understanding the Metrics
Match Ratio (SequenceMatcher)
The primary metric is the SequenceMatcher ratio from Python’s difflib library. It measures the similarity between the model’s response and the ground-truth answer on a scale from 0 to 1 (reported as percentage).
| Component | Description |
|---|---|
| Hash verification | Model must prepend a specific alphanumeric hash; if missing, score = 0 |
| String matching | After hash removal, the stripped response is compared to stripped ground truth |
| Per-bin aggregation | Results are averaged within each context-length bin (100 samples per bin) |
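The per-bin aggregation step can be sketched as follows, assuming per-sample match ratios have already been computed; the bin labels and score values here are made up for illustration.

```python
from statistics import mean

# (context_bin, match_ratio) pairs; the real benchmark has 100 samples per bin.
scores = [
    ("128k", 1.0), ("128k", 0.5),
    ("256k", 0.25), ("256k", 0.75),
]

# Group scores by context-length bin.
by_bin: dict[str, list[float]] = {}
for bin_label, score in scores:
    by_bin.setdefault(bin_label, []).append(score)

# Average within each bin to get the reported per-bin match ratio.
per_bin_mean = {label: mean(vals) for label, vals in by_bin.items()}
print(per_bin_mean)  # {'128k': 0.75, '256k': 0.5}
```

Reporting one mean per bin is what makes the characteristic degradation curve visible as context length grows.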
Why SequenceMatcher?
Unlike exact match, SequenceMatcher provides a gradient signal — a model that retrieves most of the correct content but makes a small error still gets partial credit, while a model that retrieves the wrong needle entirely scores near 0.
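This gradient behavior is easy to see directly with Python's difflib; the answer string below is an illustrative stand-in, not a real needle from the dataset.

```python
from difflib import SequenceMatcher

answer = "Tapirs roam the riverbank"

def ratio(response: str) -> float:
    """Similarity between a candidate response and the ground-truth answer."""
    return SequenceMatcher(None, response, answer).ratio()

# Exact retrieval earns full credit.
print(ratio("Tapirs roam the riverbank"))  # 1.0

# A response missing the last word still earns high partial credit (~0.91).
print(ratio("Tapirs roam the river"))

# Retrieving unrelated content scores far lower.
print(ratio("Frogs sing in the marsh"))
```

In MRCR the same comparison runs after the hash prefix is stripped, so a model that retrieves most of the right needle is rewarded over one that confidently returns the wrong one.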
Why OpenAI MRCR Matters
```mermaid
graph LR
    A["Single-needle<br/>benchmarks<br/>saturated"] --> B["Cannot distinguish<br/>frontier models'<br/>long-context ability"]
    B --> C["OpenAI MRCR<br/>fills the gap"]
    C --> D["Better long-context<br/>evaluation"]
    A2["Real-world tasks<br/>need multi-hop<br/>retrieval"] --> B2["Simple retrieval<br/>evals miss<br/>this signal"]
    B2 --> C
    C --> D2["Drives progress<br/>in long-context<br/>models"]
    style A fill:#e74c3c,color:#fff,stroke:#333
    style A2 fill:#e74c3c,color:#fff,stroke:#333
    style C fill:#27ae60,color:#fff,stroke:#333
    style D fill:#3498db,color:#fff,stroke:#333
    style D2 fill:#3498db,color:#fff,stroke:#333
```
- Goes beyond trivial retrieval — needles are drawn from the same distribution as distractors, making the task genuinely hard
- Scales with context length — bins from 4K to 1M tokens reveal how models degrade at extreme lengths
- Multiple difficulty levels — 2, 4, and 8 needle variants provide unsaturated headroom
- Open and reproducible — full dataset on Hugging Face with MIT license and evaluation code
- Reflects real-world needs — legal, coding, customer support, and research tasks all require multi-needle long-context retrieval
Video: OpenAI MRCR Explained
Please subscribe to the Vectoring AI YouTube channel for more video tutorials 🚀
Conclusion
OpenAI MRCR represents a much-needed evolution in long-context evaluation:
- 2,400 data points testing 2, 4, and 8 needle retrieval across context lengths from 4K to 1M tokens
- Created by OpenAI, inspired by Google DeepMind’s Michelangelo paper
- Even the best model (GPT-5) drops from 95.2% at 128K to 86.8% at 256K on just the 2-needle variant — the 8-needle variant is even harder
- Open-sourced with MIT license, full dataset, and evaluation code
- Exposes a critical capability gap between models that can merely find one needle and models that can genuinely comprehend long contexts
As context windows grow to millions of tokens, OpenAI MRCR provides a meaningful benchmark for measuring whether models can actually use all that context — not just claim it.
References
- OpenAI. “Introducing GPT-4.1 in the API.” OpenAI Blog (April 14, 2025). openai.com/index/gpt-4-1
- OpenAI. “Introducing GPT-5 for developers.” OpenAI Blog (August 7, 2025). openai.com/index/introducing-gpt-5-for-developers
- OpenAI. “OpenAI MRCR Dataset.” Hugging Face. huggingface.co/datasets/openai/mrcr
- Vodrahalli, K., Ontanon, S., Tripuraneni, N. et al. “Michelangelo: Long Context Evaluations Beyond Haystacks via Latent Structure Queries.” arXiv preprint arXiv:2409.12640 (2024). arxiv.org/abs/2409.12640
Read More
- Explore other AI benchmarks — see Humanity’s Last Exam (HLE)
- Multimodal evaluation — see MMMU-Pro
- Graduate-level reasoning — see GPQA Diamond
- Multilingual evaluation — see MMMLU
- Track model costs when running evaluations — see FinOps Best Practices for LLM Applications
- Deploy models for running your own evaluations — see Deploying and Serving LLM with vLLM
- OpenAI MRCR Dataset on Hugging Face
- GPT-4.1 Blog with MRCR Interactive Charts