```mermaid
graph LR
    A["Single Needle<br/>in a Haystack<br/>~100% for frontier models"] --> B["Too easy<br/>for modern LLMs"]
    B --> C["OpenAI MRCR<br/>2, 4, or 8 needles<br/>up to 1M tokens"]
    C --> D["Tests genuine<br/>long-context<br/>comprehension"]
    style A fill:#e74c3c,stroke:#333,color:#fff
    style B fill:#f39c12,stroke:#333,color:#fff
    style C fill:#27ae60,stroke:#333,color:#fff
    style D fill:#3498db,stroke:#333,color:#fff
```
OpenAI MRCR (Multi-Round Coreference Resolution)
A long-context ‘multiple needles in a haystack’ benchmark that tests whether LLMs can find and disambiguate 2, 4, or 8 hidden needles across conversations up to 1 million tokens
Keywords: OpenAI MRCR, Multi-Round Coreference Resolution, long context benchmark, needle in a haystack, multiple needles, long-context retrieval, GPT-4.1, GPT-5, context window, 1M tokens, Michelangelo, Latent Structure Queries, LLM evaluation, coreference resolution

Introduction
The “needle in a haystack” test has become the default way to evaluate whether LLMs can use their long context windows. But finding one obvious needle is trivially easy for modern frontier models — GPT-4.1, Gemini, and Claude all score nearly 100% on single-needle retrieval across their full context windows.
OpenAI MRCR (Multi-Round Coreference Resolution) raises the bar dramatically. Instead of hiding one obvious piece of information, it embeds 2, 4, or 8 nearly identical requests throughout a long, realistic conversation — and asks the model to retrieve a specific instance (e.g., “return the 3rd poem about tapirs”). The needles are drawn from the same distribution as the distractors, making them almost impossible to distinguish without genuine long-context comprehension.
“The challenge arises from the similarity between these requests and the rest of the context — models can easily be misled by subtle differences, such as a short story about tapirs rather than a poem, or a poem about frogs instead of tapirs.” — OpenAI, GPT-4.1 Blog
What Is OpenAI MRCR?
OpenAI MRCR is a long-context benchmark that evaluates an LLM’s ability to find and disambiguate between multiple hidden needles in a realistic multi-turn conversation. The task goes far beyond simple retrieval — the model must track identity, order, and type across potentially thousands of conversation turns.
How It Works
- A long, multi-turn conversation is synthetically generated: the user asks for writings (poems, blog posts, social media posts, stories) about various topics
- Hidden among the conversation are 2, 4, or 8 identical requests (e.g., “write a poem about tapirs”) — each producing a distinct response
- The model is asked to return the i-th instance of a specific request (e.g., “return the 2nd poem about tapirs”)
- The model must prepend an alphanumeric hash to its answer — if the hash is missing, the score is 0
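To make the task format concrete, here is a toy, heavily shortened sketch of what an MRCR-style sample might look like. The field names, contents, and hash value are hypothetical illustrations, not the dataset's actual schema; real samples span up to 1M tokens.

```python
# Toy MRCR-style sample (hypothetical schema, drastically shortened).
sample = {
    "messages": [
        {"role": "user", "content": "write a poem about tapirs"},        # needle 1
        {"role": "assistant", "content": "Tapirs wander, soft and slow."},
        {"role": "user", "content": "write a story about frogs"},        # distractor
        {"role": "assistant", "content": "Once, a frog set out at dawn."},
        {"role": "user", "content": "write a poem about tapirs"},        # needle 2
        {"role": "assistant", "content": "Beneath the canopy, a tapir dreams."},
        {"role": "user", "content": "Prepend 'a3F9' to your answer, then "
                                    "return the 2nd poem about tapirs."},
    ],
    "answer": "a3F9Beneath the canopy, a tapir dreams.",
}

# A correct response repeats the reply to the 2nd needle, hash first.
expected = "a3F9" + sample["messages"][5]["content"]
print(expected == sample["answer"])  # True
```

Note that the ground-truth answer is the full text of the requested response with the hash prepended, which is exactly what the grader checks for.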
Key Characteristics
| Feature | Details |
|---|---|
| Needle variants | 2-needle, 4-needle, and 8-needle |
| Context length bins | 4K, 8K, 16K, 32K, 64K, 128K, 256K, 512K, 1M tokens |
| Samples per bin | 100 |
| Total dataset | 2,400 data points |
| Distinct entities | 438 topics |
| Writing formats | 10 different formats (poems, blog posts, stories, etc.) |
| Metric | SequenceMatcher ratio (Python difflib) |
| Anti-gaming | Alphanumeric hash must be prepended to the answer |
| License | MIT |
Why Is the 8-Needle Variant So Hard?
The 8-needle variant is the most challenging configuration:
- 8 identical requests are scattered among hundreds of distractor turns
- Each needle produces a distinct but stylistically similar response (all generated by GPT-4o)
- The model must identify the correct ordinal instance (e.g., the 5th poem about tapirs, not the 4th or 6th)
- At long context lengths (256K–1M tokens), even frontier models struggle to maintain accuracy
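The ordinal bookkeeping this demands can be sketched in a few lines. The turn list and helper below are purely illustrative, not benchmark code; the point is that only exact matches on both format and topic count toward the ordinal.

```python
# Toy conversation turns: (request, response) pairs, hypothetical content.
turns = [
    ("write a poem about tapirs", "poem-1"),
    ("write a story about tapirs", "story-1"),   # distractor: wrong format
    ("write a poem about frogs", "poem-frogs"),  # distractor: wrong topic
    ("write a poem about tapirs", "poem-2"),
    ("write a poem about tapirs", "poem-3"),
]

def nth_response(request: str, n: int) -> str:
    """Return the response to the n-th (1-indexed) exact occurrence of request."""
    matches = [resp for req, resp in turns if req == request]
    return matches[n - 1]

print(nth_response("write a poem about tapirs", 2))  # poem-2
```

A model that miscounts by one, or counts the tapir story as a tapir poem, retrieves an entirely wrong response and scores near 0.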
```mermaid
graph TD
    A["Multi-turn conversation<br/>thousands of turns"] --> B["User asks for<br/>poems, blog posts,<br/>stories, etc."]
    B --> C["8 identical requests<br/>hidden among<br/>distractors"]
    C --> D["Model must return<br/>the i-th instance<br/>of a specific ask"]
    D --> E["Score: string match<br/>ratio + hash check"]
    style A fill:#ecf0f1,color:#333,stroke:#bdc3c7
    style B fill:#f39c12,color:#fff,stroke:#333
    style C fill:#e74c3c,color:#fff,stroke:#333
    style D fill:#3498db,color:#fff,stroke:#333
    style E fill:#27ae60,color:#fff,stroke:#333
```
Who Built It?
OpenAI MRCR was created by OpenAI and released alongside the GPT-4.1 model family on April 14, 2025. The benchmark is inspired by the MRCR evaluation first introduced by Google DeepMind in the Michelangelo paper (Vodrahalli et al., arXiv:2409.12640, September 2024), which proposed the Latent Structure Queries (LSQ) framework for evaluating long-context reasoning.
OpenAI expanded upon Google’s concept by:
- Increasing difficulty with 2, 4, and 8 needle variants
- Scaling context lengths up to 1 million tokens
- Open-sourcing the full dataset on Hugging Face for reproducibility
- Providing evaluation code so anyone can benchmark their own models
Key People and Institutions
| Entity | Role |
|---|---|
| OpenAI | Created and open-sourced OpenAI MRCR |
| Google DeepMind | Introduced the original MRCR concept in the Michelangelo paper |
| Kiran Vodrahalli et al. | Authors of the Michelangelo paper (Google) |
Publication History
| Date | Milestone |
|---|---|
| September 2024 | Google publishes Michelangelo paper introducing MRCR concept |
| April 14, 2025 | OpenAI releases OpenAI MRCR alongside GPT-4.1 launch |
| August 7, 2025 | OpenAI publishes updated MRCR results with GPT-5 |
| December 5, 2025 | Bugfix: ~10% of data points corrected for needle count errors |
What Skills Does It Test?
OpenAI MRCR tests a focused but critical set of long-context comprehension capabilities that go far beyond simple information retrieval:
```mermaid
graph TD
    MRCR["OpenAI MRCR<br/>Long-Context Benchmark"] --> R["Multi-Needle<br/>Retrieval"]
    MRCR --> C["Coreference<br/>Resolution"]
    MRCR --> O["Ordinal<br/>Tracking"]
    MRCR --> D["Distractor<br/>Resistance"]
    MRCR --> S["Scale<br/>Invariance"]
    style MRCR fill:#e74c3c,color:#fff,stroke:#333
    style R fill:#3498db,color:#fff,stroke:#333
    style C fill:#27ae60,color:#fff,stroke:#333
    style O fill:#f39c12,color:#fff,stroke:#333
    style D fill:#8e44ad,color:#fff,stroke:#333
    style S fill:#e67e22,color:#fff,stroke:#333
```
| Capability | What MRCR Tests |
|---|---|
| Multi-needle retrieval | Finding multiple pieces of information that look nearly identical to surrounding context |
| Coreference resolution | Linking “the 3rd poem about tapirs” to the correct response among 8 similar poems |
| Ordinal tracking | Maintaining correct count of repeated requests across long contexts |
| Distractor resistance | Ignoring similar but non-matching content (a story vs. a poem, frogs vs. tapirs) |
| Scale invariance | Performing consistently as context length increases from 4K to 1M tokens |
Why It Matters for Real-World Applications
MRCR reflects real-world scenarios where models must:
- Navigate large codebases — finding the right function definition among many similar ones
- Process legal documents — identifying the correct clause among dozens of similar paragraphs
- Handle customer support logs — locating the right conversation thread in a long history
- Work with research papers — distinguishing between multiple experiments with similar setups
Current Leaderboard
The tables below show model accuracy on OpenAI MRCR (2-needle) at 128K and 256K context lengths, as published in OpenAI’s official blog posts.
Source: Introducing GPT-5 for developers (August 7, 2025) and Introducing GPT-4.1 in the API (April 14, 2025). Consulted March 28, 2026.
2-Needle at 128K Context
| Rank | Model | Match Ratio (%) |
|---|---|---|
| 1 | GPT-5 (high) | 95.2 |
| 2 | GPT-5 mini (high) | 84.3 |
| 3 | GPT-4.1 | 57.2 |
| 4 | o4-mini (high) | 56.4 |
| 5 | o3 (high) | 55.0 |
| 6 | GPT-4.1 mini | 47.2 |
| 7 | GPT-5 nano (high) | 43.2 |
| 8 | GPT-4.5 Preview | 38.5 |
| 9 | GPT-4.1 nano | 36.6 |
| 10 | GPT-4o | 31.9 |
| 11 | GPT-4o mini | 24.5 |
| 12 | o1 | 22.1 |
| 13 | o3-mini | 18.7 |
2-Needle at 256K Context
| Rank | Model | Match Ratio (%) |
|---|---|---|
| 1 | GPT-5 (high) | 86.8 |
| 2 | GPT-5 mini (high) | 58.8 |
| 3 | GPT-4.1 | 56.2 |
| 4 | GPT-4.1 mini | 45.5 |
| 5 | GPT-5 nano (high) | 34.9 |
| 6 | GPT-4.1 nano | 22.6 |
Key takeaways:
- GPT-5 dominates at 95.2% on 2-needle 128K — a massive 38-point jump over GPT-4.1 (57.2%)
- At 256K tokens, GPT-5 still leads at 86.8%, demonstrating robust long-context understanding
- Reasoning models (o3, o4-mini) do not automatically outperform non-reasoning models on this task — GPT-4.1 beats o3 at 128K
- The 8-needle variant at long context lengths remains extremely challenging for all models — performance drops significantly compared to 2-needle results
- The benchmark is far from saturated, with the 4-needle and 8-needle variants providing headroom for future models
Where to Explore the Benchmark
Dashboards and Leaderboards
| Resource | Description | Link |
|---|---|---|
| OpenAI GPT-4.1 Blog | Original MRCR results with interactive 2/4/8 needle charts | openai.com/index/gpt-4-1 |
| OpenAI GPT-5 Dev Blog | Updated MRCR results with GPT-5 family | openai.com/index/introducing-gpt-5-for-developers |
Dataset and Code
| Resource | Description | Link |
|---|---|---|
| Hugging Face Dataset | Full 2,400-sample dataset (2-needle, 4-needle, 8-needle) | huggingface.co/datasets/openai/mrcr |
| Michelangelo Paper | Original Google paper introducing MRCR concept | arxiv.org/abs/2409.12640 |
Load the Dataset
from huggingface_hub import hf_hub_download
import pandas as pd
# Load 2-needle data
dataset_2needle = pd.concat([
pd.read_parquet(hf_hub_download(
repo_id="openai/mrcr",
filename="2needle/2needle_0.parquet",
repo_type="dataset"
)),
pd.read_parquet(hf_hub_download(
repo_id="openai/mrcr",
filename="2needle/2needle_1.parquet",
repo_type="dataset"
))
])
# Load 8-needle data
dataset_8needle = pd.concat([
pd.read_parquet(hf_hub_download(
repo_id="openai/mrcr",
filename="8needle/8needle_0.parquet",
repo_type="dataset"
)),
pd.read_parquet(hf_hub_download(
repo_id="openai/mrcr",
filename="8needle/8needle_1.parquet",
repo_type="dataset"
))
])Run the Evaluation
```python
from difflib import SequenceMatcher

from openai import OpenAI

# Client used to query the model under evaluation (query loop not shown).
client = OpenAI()

def grade(response: str, answer: str, random_string_to_prepend: str) -> float:
    """Score a response against the ground truth: hash check, then string similarity."""
    # A missing hash prefix means an automatic score of 0.
    if not response.startswith(random_string_to_prepend):
        return 0.0
    # Strip the hash from both sides before comparing.
    response = response.removeprefix(random_string_to_prepend)
    answer = answer.removeprefix(random_string_to_prepend)
    return float(SequenceMatcher(None, response, answer).ratio())
```

Understanding the Metrics
Match Ratio (SequenceMatcher)
The primary metric is the SequenceMatcher ratio from Python’s difflib library. It measures the similarity between the model’s response and the ground-truth answer on a scale from 0 to 1 (reported as percentage).
| Component | Description |
|---|---|
| Hash verification | Model must prepend a specific alphanumeric hash; if missing, score = 0 |
| String matching | After hash removal, the stripped response is compared to stripped ground truth |
| Per-bin aggregation | Results are averaged within each context-length bin (100 samples per bin) |
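The per-bin aggregation step can be sketched as follows, assuming per-sample match ratios have already been computed; the bin labels and score values here are made up for illustration.

```python
from statistics import mean

# (context_bin, match_ratio) pairs; the real benchmark has 100 samples per bin.
scores = [
    ("128k", 1.0), ("128k", 0.5),
    ("256k", 0.25), ("256k", 0.75),
]

# Group scores by context-length bin.
by_bin: dict[str, list[float]] = {}
for bin_label, score in scores:
    by_bin.setdefault(bin_label, []).append(score)

# Average within each bin to get the reported per-bin match ratio.
per_bin_mean = {label: mean(vals) for label, vals in by_bin.items()}
print(per_bin_mean)  # {'128k': 0.75, '256k': 0.5}
```

Reporting one mean per bin is what makes the characteristic degradation curve visible as context length grows.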
Why SequenceMatcher?
Unlike exact match, SequenceMatcher provides a gradient signal — a model that retrieves most of the correct content but makes a small error still gets partial credit, while a model that retrieves the wrong needle entirely scores near 0.
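This gradient behavior is easy to see directly with Python's difflib; the answer string below is an illustrative stand-in, not a real needle from the dataset.

```python
from difflib import SequenceMatcher

answer = "Tapirs roam the riverbank"

def ratio(response: str) -> float:
    """Similarity between a candidate response and the ground-truth answer."""
    return SequenceMatcher(None, response, answer).ratio()

# Exact retrieval earns full credit.
print(ratio("Tapirs roam the riverbank"))  # 1.0

# A response missing the last word still earns high partial credit (~0.91).
print(ratio("Tapirs roam the river"))

# Retrieving unrelated content scores far lower.
print(ratio("Frogs sing in the marsh"))
```

In MRCR the same comparison runs after the hash prefix is stripped, so a model that retrieves most of the right needle is rewarded over one that confidently returns the wrong one.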
Why OpenAI MRCR Matters
```mermaid
graph LR
    A["Single-needle<br/>benchmarks<br/>saturated"] --> B["Cannot distinguish<br/>frontier models'<br/>long-context ability"]
    B --> C["OpenAI MRCR<br/>fills the gap"]
    C --> D["Better long-context<br/>evaluation"]
    A2["Real-world tasks<br/>need multi-hop<br/>retrieval"] --> B2["Simple retrieval<br/>evals miss<br/>this signal"]
    B2 --> C
    C --> D2["Drives progress<br/>in long-context<br/>models"]
    style A fill:#e74c3c,color:#fff,stroke:#333
    style A2 fill:#e74c3c,color:#fff,stroke:#333
    style C fill:#27ae60,color:#fff,stroke:#333
    style D fill:#3498db,color:#fff,stroke:#333
    style D2 fill:#3498db,color:#fff,stroke:#333
```
- Goes beyond trivial retrieval — needles are drawn from the same distribution as distractors, making the task genuinely hard
- Scales with context length — bins from 4K to 1M tokens reveal how models degrade at extreme lengths
- Multiple difficulty levels — 2, 4, and 8 needle variants provide unsaturated headroom
- Open and reproducible — full dataset on Hugging Face with MIT license and evaluation code
- Reflects real-world needs — legal, coding, customer support, and research tasks all require multi-needle long-context retrieval
Video: OpenAI MRCR Explained
Please subscribe to the Vectoring AI YouTube channel for more video tutorials 🚀
Conclusion
OpenAI MRCR represents a much-needed evolution in long-context evaluation:
- 2,400 data points testing 2, 4, and 8 needle retrieval across context lengths from 4K to 1M tokens
- Created by OpenAI, inspired by Google DeepMind’s Michelangelo paper
- Even the best model (GPT-5) drops from 95.2% at 128K to 86.8% at 256K on just the 2-needle variant — the 8-needle variant is even harder
- Open-sourced with MIT license, full dataset, and evaluation code
- Exposes a critical capability gap between models that can merely find one needle and models that can genuinely comprehend long contexts
As context windows grow to millions of tokens, OpenAI MRCR provides a meaningful benchmark for measuring whether models can actually use all that context — not just claim it.
References
- OpenAI. “Introducing GPT-4.1 in the API.” OpenAI Blog (April 14, 2025). openai.com/index/gpt-4-1
- OpenAI. “Introducing GPT-5 for developers.” OpenAI Blog (August 7, 2025). openai.com/index/introducing-gpt-5-for-developers
- OpenAI. “OpenAI MRCR Dataset.” Hugging Face. huggingface.co/datasets/openai/mrcr
- Vodrahalli, K., Ontanon, S., Tripuraneni, N. et al. “Michelangelo: Long Context Evaluations Beyond Haystacks via Latent Structure Queries.” arXiv preprint arXiv:2409.12640 (2024). arxiv.org/abs/2409.12640
Read More
- Explore other AI benchmarks — see Humanity’s Last Exam (HLE)
- Multimodal evaluation — see MMMU-Pro
- Graduate-level reasoning — see GPQA Diamond
- Multilingual evaluation — see MMMLU
- Track model costs when running evaluations — see FinOps Best Practices for LLM Applications
- Deploy models for running your own evaluations — see Deploying and Serving LLM with vLLM
- OpenAI MRCR Dataset on Hugging Face
- GPT-4.1 Blog with MRCR Interactive Charts