OpenAI MRCR (Multi-Round Coreference Resolution)

A long-context “multiple needles in a haystack” benchmark that tests whether LLMs can find and disambiguate 2, 4, or 8 hidden needles across conversations of up to 1 million tokens.

Published: September 9, 2025

Keywords: OpenAI MRCR, Multi-Round Coreference Resolution, long context benchmark, needle in a haystack, multiple needles, long-context retrieval, GPT-4.1, GPT-5, context window, 1M tokens, Michelangelo, Latent Structure Queries, LLM evaluation, coreference resolution

Introduction

The “needle in a haystack” test has become the default way to evaluate whether LLMs can use their long context windows. But finding one obvious needle is trivially easy for modern frontier models — GPT-4.1, Gemini, and Claude all score nearly 100% on single-needle retrieval across their full context windows.

OpenAI MRCR (Multi-Round Coreference Resolution) raises the bar dramatically. Instead of hiding one obvious piece of information, it embeds 2, 4, or 8 nearly identical requests throughout a long, realistic conversation — and asks the model to retrieve a specific instance (e.g., “return the 3rd poem about tapirs”). The needles are drawn from the same distribution as the distractors, making them almost impossible to distinguish without genuine long-context comprehension.

“The challenge arises from the similarity between these requests and the rest of the context — models can easily be misled by subtle differences, such as a short story about tapirs rather than a poem, or a poem about frogs instead of tapirs.” — OpenAI, GPT-4.1 Blog

graph LR
    A["Single Needle<br/>in a Haystack<br/>~100% for frontier models"] --> B["Too easy<br/>for modern LLMs"]
    B --> C["OpenAI MRCR<br/>2, 4, or 8 needles<br/>up to 1M tokens"]
    C --> D["Tests genuine<br/>long-context<br/>comprehension"]

    style A fill:#e74c3c,stroke:#333,color:#fff
    style B fill:#f39c12,stroke:#333,color:#fff
    style C fill:#27ae60,stroke:#333,color:#fff
    style D fill:#3498db,stroke:#333,color:#fff

What Is OpenAI MRCR?

OpenAI MRCR is a long-context benchmark that evaluates an LLM’s ability to find and disambiguate multiple hidden needles in a realistic multi-turn conversation. The task goes far beyond simple retrieval — the model must track identity, order, and type across potentially thousands of conversation turns.

How It Works

  1. A long, multi-turn conversation is synthetically generated: the user asks for writings (poems, blog posts, social media posts, stories) about various topics
  2. Hidden among the conversation are 2, 4, or 8 identical requests (e.g., “write a poem about tapirs”) — each producing a distinct response
  3. The model is asked to return the i-th instance of a specific request (e.g., “return the 2nd poem about tapirs”)
  4. The model must prepend an alphanumeric hash to its answer — if the hash is missing, the score is 0
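
The retrieval step in the list above can be sketched with a toy resolver. Everything here — the conversation contents, the `ith_instance` helper, and the hash — is illustrative, not taken from the real benchmark:

```python
# Toy sketch of the MRCR retrieval task; conversation contents,
# helper name, and hash are invented for illustration.
def ith_instance(conversation, request, i):
    """Return the assistant response to the i-th occurrence (1-indexed)
    of `request` among (user, assistant) turn pairs."""
    seen = 0
    for user_msg, assistant_msg in conversation:
        if user_msg == request:
            seen += 1
            if seen == i:
                return assistant_msg
    raise ValueError(f"fewer than {i} instances of {request!r}")

conversation = [
    ("write a poem about frogs", "frog poem"),       # distractor: wrong topic
    ("write a poem about tapirs", "tapir poem #1"),  # needle 1
    ("write a story about tapirs", "tapir story"),   # distractor: wrong format
    ("write a poem about tapirs", "tapir poem #2"),  # needle 2
]

# The benchmark would then ask: "prepend a1b2c3 to the 2nd poem about tapirs"
answer = "a1b2c3 " + ith_instance(conversation, "write a poem about tapirs", 2)
print(answer)  # → a1b2c3 tapir poem #2
```

Note the two distractor turns: a poem about the wrong topic and the right topic in the wrong format — exactly the near-misses the benchmark is built around.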

Key Characteristics

| Feature | Details |
|---|---|
| Needle variants | 2-needle, 4-needle, and 8-needle |
| Context length bins | 4K, 8K, 16K, 32K, 64K, 128K, 256K, 512K, and 1M tokens |
| Samples per bin | 100 |
| Total dataset | 2,400 data points |
| Distinct entities | 438 topics |
| Writing formats | 10 formats (poems, blog posts, stories, etc.) |
| Metric | SequenceMatcher ratio (Python difflib) |
| Anti-gaming | Alphanumeric hash must be prepended to the answer |
| License | MIT |

Why Is the 8-Needle Variant So Hard?

The 8-needle variant is the most challenging configuration:

  • 8 identical requests are scattered among hundreds of distractor turns
  • Each needle produces a distinct but stylistically similar response (all generated by GPT-4o)
  • The model must identify the correct ordinal instance (e.g., the 5th poem about tapirs, not the 4th or 6th)
  • At long context lengths (256K–1M tokens), even frontier models struggle to maintain accuracy

graph TD
    A["Multi-turn conversation<br/>thousands of turns"] --> B["User asks for<br/>poems, blog posts,<br/>stories, etc."]
    B --> C["8 identical requests<br/>hidden among<br/>distractors"]
    C --> D["Model must return<br/>the i-th instance<br/>of a specific ask"]
    D --> E["Score: string match<br/>ratio + hash check"]

    style A fill:#ecf0f1,color:#333,stroke:#bdc3c7
    style B fill:#f39c12,color:#fff,stroke:#333
    style C fill:#e74c3c,color:#fff,stroke:#333
    style D fill:#3498db,color:#fff,stroke:#333
    style E fill:#27ae60,color:#fff,stroke:#333

Who Built It?

OpenAI MRCR was created by OpenAI and released alongside the GPT-4.1 model family on April 14, 2025. The benchmark is inspired by the MRCR evaluation first introduced by Google DeepMind in the Michelangelo paper (Vodrahalli et al., arXiv:2409.12640, September 2024), which proposed the Latent Structure Queries (LSQ) framework for evaluating long-context reasoning.

OpenAI expanded upon Google’s concept by:

  • Increasing difficulty with 2, 4, and 8 needle variants
  • Scaling context lengths up to 1 million tokens
  • Open-sourcing the full dataset on Hugging Face for reproducibility
  • Providing evaluation code so anyone can benchmark their own models

Key People and Institutions

| Entity | Role |
|---|---|
| OpenAI | Created and open-sourced OpenAI MRCR |
| Google DeepMind | Introduced the original MRCR concept in the Michelangelo paper |
| Kiran Vodrahalli et al. | Authors of the Michelangelo paper (Google DeepMind) |

Publication History

| Date | Milestone |
|---|---|
| September 2024 | Google DeepMind publishes the Michelangelo paper introducing the MRCR concept |
| April 14, 2025 | OpenAI releases OpenAI MRCR alongside the GPT-4.1 launch |
| August 7, 2025 | OpenAI publishes updated MRCR results with GPT-5 |
| December 5, 2025 | Bugfix: ~10% of data points corrected for needle-count errors |

What Skills Does It Test?

OpenAI MRCR tests a focused but critical set of long-context comprehension capabilities that go far beyond simple information retrieval:

graph TD
    MRCR["OpenAI MRCR<br/>Long-Context Benchmark"] --> R["Multi-Needle<br/>Retrieval"]
    MRCR --> C["Coreference<br/>Resolution"]
    MRCR --> O["Ordinal<br/>Tracking"]
    MRCR --> D["Distractor<br/>Resistance"]
    MRCR --> S["Scale<br/>Invariance"]

    style MRCR fill:#e74c3c,color:#fff,stroke:#333
    style R fill:#3498db,color:#fff,stroke:#333
    style C fill:#27ae60,color:#fff,stroke:#333
    style O fill:#f39c12,color:#fff,stroke:#333
    style D fill:#8e44ad,color:#fff,stroke:#333
    style S fill:#e67e22,color:#fff,stroke:#333

| Capability | What MRCR Tests |
|---|---|
| Multi-needle retrieval | Finding multiple pieces of information that look nearly identical to the surrounding context |
| Coreference resolution | Linking “the 3rd poem about tapirs” to the correct response among 8 similar poems |
| Ordinal tracking | Maintaining a correct count of repeated requests across long contexts |
| Distractor resistance | Ignoring similar but non-matching content (a story vs. a poem, frogs vs. tapirs) |
| Scale invariance | Performing consistently as context length grows from 4K to 1M tokens |

Why It Matters for Real-World Applications

MRCR reflects real-world scenarios where models must:

  • Navigate large codebases — finding the right function definition among many similar ones
  • Process legal documents — identifying the correct clause among dozens of similar paragraphs
  • Handle customer support logs — locating the right conversation thread in a long history
  • Work with research papers — distinguishing between multiple experiments with similar setups

Current Leaderboard

The tables below show model accuracy on OpenAI MRCR (2-needle) at 128K and 256K context lengths, as published in OpenAI’s official blog posts.

Source: Introducing GPT-5 for developers (August 7, 2025) and Introducing GPT-4.1 in the API (April 14, 2025). Consulted March 28, 2026.

2-Needle at 128K Context

| Rank | Model | Match Ratio (%) |
|---|---|---|
| 1 | GPT-5 (high) | 95.2 |
| 2 | GPT-5 mini (high) | 84.3 |
| 3 | GPT-4.1 | 57.2 |
| 4 | o4-mini (high) | 56.4 |
| 5 | o3 (high) | 55.0 |
| 6 | GPT-4.1 mini | 47.2 |
| 7 | GPT-5 nano (high) | 43.2 |
| 8 | GPT-4.5 Preview | 38.5 |
| 9 | GPT-4.1 nano | 36.6 |
| 10 | GPT-4o | 31.9 |
| 11 | GPT-4o mini | 24.5 |
| 12 | o1 | 22.1 |
| 13 | o3-mini | 18.7 |

2-Needle at 256K Context

| Rank | Model | Match Ratio (%) |
|---|---|---|
| 1 | GPT-5 (high) | 86.8 |
| 2 | GPT-5 mini (high) | 58.8 |
| 3 | GPT-4.1 | 56.2 |
| 4 | GPT-4.1 mini | 45.5 |
| 5 | GPT-5 nano (high) | 34.9 |
| 6 | GPT-4.1 nano | 22.6 |

Key takeaways:

  • GPT-5 dominates at 95.2% on 2-needle 128K — a massive 38-point jump over GPT-4.1 (57.2%)
  • At 256K tokens, GPT-5 still leads at 86.8%, demonstrating robust long-context understanding
  • Reasoning models (o3, o4-mini) do not automatically outperform non-reasoning models on this task — GPT-4.1 beats o3 at 128K
  • The 8-needle variant at long context lengths remains extremely challenging for all models — performance drops significantly compared to 2-needle results
  • The benchmark is far from saturated, with the 4-needle and 8-needle variants providing headroom for future models

Where to Explore the Benchmark

Dashboards and Leaderboards

| Resource | Description | Link |
|---|---|---|
| OpenAI GPT-4.1 Blog | Original MRCR results with interactive 2/4/8-needle charts | openai.com/index/gpt-4-1 |
| OpenAI GPT-5 Dev Blog | Updated MRCR results for the GPT-5 family | openai.com/index/introducing-gpt-5-for-developers |

Dataset and Code

| Resource | Description | Link |
|---|---|---|
| Hugging Face Dataset | Full 2,400-sample dataset (2-needle, 4-needle, 8-needle) | huggingface.co/datasets/openai/mrcr |
| Michelangelo Paper | Original Google DeepMind paper introducing the MRCR concept | arxiv.org/abs/2409.12640 |

Load the Dataset

from huggingface_hub import hf_hub_download
import pandas as pd

def load_variant(n_needles: int) -> pd.DataFrame:
    """Download and concatenate the parquet shards for one needle variant."""
    return pd.concat(
        pd.read_parquet(hf_hub_download(
            repo_id="openai/mrcr",
            filename=f"{n_needles}needle/{n_needles}needle_{i}.parquet",
            repo_type="dataset",
        ))
        for i in range(2)  # each variant is split into two shards (_0, _1)
    )

dataset_2needle = load_variant(2)
dataset_8needle = load_variant(8)
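
If the column layout matches the dataset card — a `prompt` field holding a JSON-encoded message list, plus `answer` and `random_string_to_prepend` — a row can be decoded like this. The snippet uses a synthetic stand-in rather than a downloaded row, so verify the column names against the files you actually fetch:

```python
import json

# Synthetic stand-in for one dataset row; the column names are an
# assumption about the dataset layout, not guaranteed.
row = {
    "prompt": json.dumps([
        {"role": "user", "content": "write a poem about tapirs"},
        {"role": "assistant", "content": "A tapir walks at dusk..."},
        {"role": "user", "content": "Prepend a1b2c3 to the 1st poem about tapirs."},
    ]),
    "answer": "a1b2c3 A tapir walks at dusk...",
    "random_string_to_prepend": "a1b2c3",
}

messages = json.loads(row["prompt"])  # the stored conversation, ready to send
print(len(messages), messages[-1]["role"])
```

The final user turn carries both the ordinal request and the anti-gaming hash instruction, which is why the grader below checks the hash first.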

Run the Evaluation

from openai import OpenAI
from difflib import SequenceMatcher
import json

client = OpenAI()

def grade(response: str, answer: str, random_string_to_prepend: str) -> float:
    """Score a model response against the ground-truth answer."""
    # Anti-gaming check: the required hash must lead the response.
    if not response.startswith(random_string_to_prepend):
        return 0.0
    # Strip the hash from both sides, then compare what remains.
    response = response.removeprefix(random_string_to_prepend)
    answer = answer.removeprefix(random_string_to_prepend)
    return float(SequenceMatcher(None, response, answer).ratio())

Understanding the Metrics

Match Ratio (SequenceMatcher)

The primary metric is the SequenceMatcher ratio from Python’s difflib library. It measures the similarity between the model’s response and the ground-truth answer on a scale from 0 to 1 (reported as percentage).

| Component | Description |
|---|---|
| Hash verification | The model must prepend a specific alphanumeric hash; if it is missing, the score is 0 |
| String matching | After the hash is removed, the stripped response is compared to the stripped ground truth |
| Per-bin aggregation | Results are averaged within each context-length bin (100 samples per bin) |
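
The per-bin averaging described above amounts to a simple groupby; a minimal sketch on synthetic scores (the bin labels and column names here are illustrative):

```python
import pandas as pd

# Synthetic per-sample scores, tagged with their context-length bin.
scores = pd.DataFrame({
    "bin": ["4K", "4K", "8K", "8K"],
    "score": [1.0, 0.8, 0.6, 0.4],
})

# Reported number per bin = mean SequenceMatcher ratio within that bin.
per_bin = scores.groupby("bin")["score"].mean()
print(per_bin)
```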

Why SequenceMatcher?

Unlike exact match, SequenceMatcher provides a gradient signal — a model that retrieves most of the correct content but makes a small error still gets partial credit, while a model that retrieves the wrong needle entirely scores near 0.
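
That gradient is easy to see directly; the strings below are invented for illustration:

```python
from difflib import SequenceMatcher

answer = "A tapir walks at dusk, snout low in the ferns."

close = "A tapir walks at dusk, snout low in the fern."  # tiny wording slip
wrong = "A frog sings at noon beside the reedy pond."    # wrong needle entirely

# Ratio is 2*matches / total length: near 1 for a close answer,
# low for an unrelated one.
close_score = SequenceMatcher(None, close, answer).ratio()
wrong_score = SequenceMatcher(None, wrong, answer).ratio()
```

A one-character slip keeps almost all the credit, while retrieving the frog poem instead of the tapir poem loses most of it.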

Why OpenAI MRCR Matters

graph LR
    A["Single-needle<br/>benchmarks<br/>saturated"] --> B["Cannot distinguish<br/>frontier models'<br/>long-context ability"]
    B --> C["OpenAI MRCR<br/>fills the gap"]
    C --> D["Better long-context<br/>evaluation"]

    A2["Real-world tasks<br/>need multi-hop<br/>retrieval"] --> B2["Simple retrieval<br/>evals miss<br/>this signal"]
    B2 --> C
    C --> D2["Drives progress<br/>in long-context<br/>models"]

    style A fill:#e74c3c,color:#fff,stroke:#333
    style A2 fill:#e74c3c,color:#fff,stroke:#333
    style C fill:#27ae60,color:#fff,stroke:#333
    style D fill:#3498db,color:#fff,stroke:#333
    style D2 fill:#3498db,color:#fff,stroke:#333

  1. Goes beyond trivial retrieval — needles are drawn from the same distribution as distractors, making the task genuinely hard
  2. Scales with context length — bins from 4K to 1M tokens reveal how models degrade at extreme lengths
  3. Multiple difficulty levels — 2, 4, and 8 needle variants provide unsaturated headroom
  4. Open and reproducible — full dataset on Hugging Face with MIT license and evaluation code
  5. Reflects real-world needs — legal, coding, customer support, and research tasks all require multi-needle long-context retrieval


Conclusion

OpenAI MRCR represents a much-needed evolution in long-context evaluation:

  • 2,400 data points testing 2, 4, and 8 needle retrieval across context lengths from 4K to 1M tokens
  • Created by OpenAI, inspired by Google DeepMind’s Michelangelo paper
  • Even the best model (GPT-5) drops from 95.2% at 128K to 86.8% at 256K on just the 2-needle variant — the 8-needle variant is even harder
  • Open-sourced with MIT license, full dataset, and evaluation code
  • Exposes a critical capability gap between models that can merely find one needle and models that can genuinely comprehend long contexts

As context windows grow to millions of tokens, OpenAI MRCR provides a meaningful benchmark for measuring whether models can actually use all that context — not just claim it.
