OmniDocBench 1.5

A comprehensive benchmark for evaluating diverse PDF document parsing — covering text OCR, table recognition, formula extraction, and layout detection across 1,355 real-world pages

Published

September 10, 2025

Keywords: OmniDocBench, document parsing benchmark, PDF parsing, OCR evaluation, table recognition, formula recognition, layout detection, CVPR 2025, OpenDataLab, Shanghai AI Laboratory, document understanding, VLM evaluation, pipeline evaluation, TEDS, CDM, edit distance

Introduction

Large language models and RAG systems are only as good as the documents they can parse. Yet there has been no fair, comprehensive benchmark for measuring how accurately AI can extract text, tables, formulas, and layout from real-world PDFs — academic papers, financial reports, handwritten notes, newspapers.

OmniDocBench fills this gap. It is a rigorously annotated benchmark spanning 1,355 PDF pages across 9 document types, 4 layout types, and 3 languages, with over 20,000 block-level and 80,000 span-level annotations. Version 1.5 (September 2025) expanded the dataset with 374 new pages, balanced Chinese/English coverage, and introduced an improved evaluation methodology.

“Despite recent progress, current document parsing methods have not been fairly and comprehensively evaluated due to the narrow coverage of document types and the simplified, unrealistic evaluation procedures in existing benchmarks.” — OmniDocBench Paper

```mermaid
graph LR
    A["Existing Doc Benchmarks<br/>Limited doc types<br/>Simplified evaluation"] --> B["Unfair comparisons<br/>between models"]
    B --> C["OmniDocBench 1.5<br/>1,355 pages · 9 doc types<br/>Multi-level evaluation"]
    C --> D["Fair, fine-grained<br/>document parsing<br/>evaluation"]

    style A fill:#e74c3c,stroke:#333,color:#fff
    style B fill:#f39c12,stroke:#333,color:#fff
    style C fill:#27ae60,stroke:#333,color:#fff
    style D fill:#3498db,stroke:#333,color:#fff
```

What Is OmniDocBench?

OmniDocBench is a benchmark for evaluating diverse PDF document parsing in real-world scenarios. It assesses how well AI systems can convert complex PDF pages into structured, machine-readable output (typically Markdown), covering text extraction, table recognition, formula parsing, layout detection, and reading order.

Key Characteristics

| Feature | Details |
| --- | --- |
| Total pages | 1,355 PDF pages (v1.5) |
| Document types | 9 — academic papers, textbooks, financial reports, newspapers, handwritten notes, PPTs, magazines, test papers, books |
| Layout types | 4 — single-column, double-column, three-column, complex |
| Languages | 3 — English, Chinese, mixed |
| Block-level annotations | 15 categories (text paragraphs, headings, tables, etc.) — over 20,000 |
| Span-level annotations | 4 categories (text lines, inline formulas, subscripts, etc.) — over 80,000 |
| Table annotations | Both LaTeX and HTML formats |
| Formula annotations | LaTeX format with language attributes |
| Reading order | Full reading-order annotations for document components |
| Attribute labels | 5 page-level + 3 text-level + 6 table-level attribute tags |
| License | Apache-2.0 |
| Publication | Accepted at CVPR 2025 |

What Makes It Comprehensive?

Unlike narrow benchmarks that focus on a single document type or a single extraction task, OmniDocBench evaluates five distinct capabilities across diverse, real-world documents:

```mermaid
graph TD
    ODB["OmniDocBench 1.5<br/>1,355 PDF pages"] --> E2E["End-to-End<br/>Document Parsing"]
    ODB --> OCR["Text OCR<br/>Recognition"]
    ODB --> TAB["Table<br/>Recognition"]
    ODB --> FORM["Formula<br/>Recognition"]
    ODB --> LAY["Layout<br/>Detection"]

    E2E --> M1["Edit Distance · BLEU<br/>METEOR · TEDS · CDM"]
    OCR --> M2["Normalized<br/>Edit Distance"]
    TAB --> M3["TEDS<br/>(Tree Edit Distance)"]
    FORM --> M4["CDM<br/>(Character Detection Matching)"]
    LAY --> M5["COCODet<br/>(mAP, mAR)"]

    style ODB fill:#e74c3c,color:#fff,stroke:#333
    style E2E fill:#3498db,color:#fff,stroke:#333
    style OCR fill:#27ae60,color:#fff,stroke:#333
    style TAB fill:#f39c12,color:#fff,stroke:#333
    style FORM fill:#8e44ad,color:#fff,stroke:#333
    style LAY fill:#e67e22,color:#fff,stroke:#333
    style M1 fill:#ecf0f1,color:#333,stroke:#bdc3c7
    style M2 fill:#ecf0f1,color:#333,stroke:#bdc3c7
    style M3 fill:#ecf0f1,color:#333,stroke:#bdc3c7
    style M4 fill:#ecf0f1,color:#333,stroke:#bdc3c7
    style M5 fill:#ecf0f1,color:#333,stroke:#bdc3c7
```

Version 1.5 Updates (September 2025)

OmniDocBench v1.5 introduced several important improvements over v1.0:

  • +374 new pages — balanced Chinese/English page counts and increased formula-rich pages
  • Higher resolution — newspaper and note images upgraded from 72 DPI to 200 DPI
  • Improved matching algorithm — formulas and text can now be matched with each other, reducing score errors from Unicode formula outputs
  • Simplified Overall metric — now calculated as: \text{Overall} = \frac{(1 - \text{Text Edit Distance}) \times 100 + \text{Table TEDS} + \text{Formula CDM}}{3}
  • Language attributes for formulas — 68 Chinese + 982 English formulas
  • Inline formulas increased from 353 to 1,050

Who Built It?

OmniDocBench was developed by researchers at OpenDataLab, Shanghai AI Laboratory. The author list includes:

  • Linke Ouyang, Yuan Qu, Hongbin Zhou, Jiawei Zhu, Rui Zhang, Qunshu Lin, Bin Wang, Zhiyuan Zhao, Man Jiang, Xiaomeng Zhao, Jin Shi, Fan Wu, Pei Chu, Minghao Liu, Zhenxiang Li, Chao Xu, Bo Zhang, Botian Shi, Zhongying Tu, Conghui He

The project was published at CVPR 2025 (IEEE/CVF Conference on Computer Vision and Pattern Recognition), one of the top-tier venues in computer vision.

| Resource | Link |
| --- | --- |
| arXiv paper | arxiv.org/abs/2412.07626 |
| GitHub | github.com/opendatalab/OmniDocBench |
| Official site | opendatalab.com/omnidocbench |

What Skills Does It Test?

OmniDocBench evaluates the full spectrum of document understanding capabilities:

| Capability | What It Tests | Metric |
| --- | --- | --- |
| End-to-end parsing | Full-page PDF-to-Markdown conversion — text, tables, formulas, reading order combined | Overall (composite), Edit Distance |
| Text OCR | Accurate recognition of text paragraphs across languages, fonts, and layouts | Normalized Edit Distance |
| Table recognition | Structural and content extraction of tables (simple, complex, merged cells) | TEDS (Tree Edit Distance Similarity) |
| Formula recognition | Correct LaTeX transcription of display and inline formulas | CDM (Character Detection Matching) |
| Layout detection | Localization and classification of document components (text, tables, figures, etc.) | COCODet metrics (mAP, mAR) |
| Reading order | Correct sequencing of document elements for downstream processing | Edit Distance |

Three Categories of Models Evaluated

OmniDocBench evaluates three fundamentally different approaches to document parsing:

```mermaid
graph LR
    A["Specialized VLMs<br/>PaddleOCR-VL, MinerU,<br/>MonkeyOCR, Dolphin"] --> D["End-to-End<br/>Leaderboard"]
    B["General VLMs<br/>Qwen3-VL, Gemini,<br/>GPT-4o, InternVL"] --> D
    C["Pipeline Tools<br/>PP-StructureV3, Marker,<br/>MinerU-pipeline"] --> D

    style A fill:#3498db,color:#fff,stroke:#333
    style B fill:#27ae60,color:#fff,stroke:#333
    style C fill:#8e44ad,color:#fff,stroke:#333
    style D fill:#e74c3c,color:#fff,stroke:#333
```

Current Leaderboard

The leaderboard below shows the end-to-end document parsing results on OmniDocBench v1.5. The Overall score is the composite metric: \frac{(1 - \text{Edit Dist}) \times 100 + \text{TEDS} + \text{CDM}}{3}. Higher Overall is better; lower Edit Distance is better.

Source: OmniDocBench GitHub Repository (consulted March 29, 2026). Dataset version 1.5 (September 2025).

Specialized Document VLMs

| Rank | Model | Size | Overall ↑ | Edit Dist ↓ | Table TEDS ↑ | Formula CDM ↑ |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | PaddleOCR-VL | 0.9B | 92.86 | 0.035 | 90.89 | 94.76 |
| 2 | MinerU2.5 | 1.2B | 90.67 | 0.047 | 88.22 | 92.38 |
| 3 | OpenDoc-0.1B | 0.1B | 90.49 | 0.043 | 88.05 | 91.97 |
| 4 | MonkeyOCR-pro-3B | 3B | 88.85 | 0.075 | 86.78 | 90.63 |
| 5 | OCRVerse | 4B | 88.56 | 0.058 | 84.55 | 88.45 |
| 6 | dots.ocr | 3B | 88.41 | 0.048 | 86.78 | 90.62 |
| 7 | MonkeyOCR-3B | 3B | 87.13 | 0.075 | 81.39 | 85.92 |
| 8 | Deepseek-OCR | 3B | 87.01 | 0.073 | 84.97 | 88.80 |
| 9 | MonkeyOCR-pro-1.2B | 1.2B | 86.96 | 0.084 | 84.24 | 89.02 |
| 10 | Nanonets-OCR-s | 3B | 85.59 | 0.093 | 80.14 | 85.57 |
| 11 | MinerU2-VLM | 0.9B | 85.56 | 0.078 | 83.54 | 87.66 |
| 12 | Dolphin-1.5 | 0.3B | 83.21 | 0.092 | 78.06 | 84.10 |
| 13 | olmOCR | 7B | 81.79 | 0.096 | 68.92 | 74.77 |
| 14 | POINTS-Reader | 3B | 80.98 | 0.134 | 77.13 | 81.66 |
| 15 | Mistral OCR | — | 78.83 | 0.164 | 70.03 | 78.04 |
| 16 | OCRFlux | 3B | 74.82 | 0.193 | 75.75 | 80.23 |
| 17 | Dolphin | 0.3B | 74.67 | 0.125 | 68.70 | 77.77 |

General Vision-Language Models

| Rank | Model | Size | Overall ↑ | Edit Dist ↓ | Table TEDS ↑ | Formula CDM ↑ |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | Qwen3-VL-235B | 235B | 89.15 | 0.069 | 86.21 | 90.55 |
| 2 | Gemini-2.5 Pro | — | 88.03 | 0.075 | 85.71 | 90.29 |
| 3 | Qwen2.5-VL | 72B | 87.02 | 0.094 | 82.15 | 86.22 |
| 4 | InternVL3.5 | 241B | 82.67 | 0.142 | 75.00 | 81.28 |
| 5 | InternVL3 | 78B | 80.33 | 0.131 | 70.64 | 77.74 |
| 6 | GPT-4o | — | 75.02 | 0.217 | 67.07 | 76.09 |

Pipeline Tools

| Rank | Model | Overall ↑ | Edit Dist ↓ | Table TEDS ↑ | Formula CDM ↑ |
| --- | --- | --- | --- | --- | --- |
| 1 | PP-StructureV3 | 86.73 | 0.073 | 81.68 | 89.48 |
| 2 | MinerU2-pipeline | 75.51 | 0.209 | 70.90 | 79.11 |
| 3 | Marker-1.8.2 | 71.30 | 0.206 | 57.88 | 71.17 |

Key takeaways:

  • PaddleOCR-VL (0.9B) leads overall at 92.86, proving that specialized smaller models can outperform massive general VLMs on document parsing
  • Among general VLMs, Qwen3-VL-235B (89.15) and Gemini-2.5 Pro (88.03) compete closely with specialized models
  • GPT-4o scores only 75.02 — significantly behind purpose-built document parsers
  • Tiny models like OpenDoc-0.1B (90.49) and Dolphin-1.5 (0.3B, 83.21) demonstrate impressive efficiency for their size

Where to Explore the Benchmark

Dashboards and Resources

| Resource | Description | Link |
| --- | --- | --- |
| Official Site | OpenDataLab's OmniDocBench leaderboard and dataset portal | opendatalab.com/omnidocbench |
| GitHub Repository | Evaluation code, configs, inference scripts, and result tables | github.com/opendatalab/OmniDocBench |
| Hugging Face Dataset | Download the 1,355-page annotated dataset (1.25 GB) | huggingface.co/datasets/opendatalab/OmniDocBench |
| OpenDataLab Dataset | Alternative dataset download | opendatalab.com/OpenDataLab/OmniDocBench |
| arXiv Paper | Full technical paper with methodology and analysis | arxiv.org/abs/2412.07626 |

Load the Dataset

```python
from datasets import load_dataset

dataset = load_dataset("opendatalab/OmniDocBench", split="train")
print(f"Total pages: {len(dataset)}")
# The v1.5 release contains 1,355 annotated pages.
```

Run the Evaluation

```bash
# Setup
conda create -n omnidocbench python=3.10
conda activate omnidocbench
pip install -r requirements.txt

# Run evaluation with your model's markdown output
python pdf_validation.py --config configs/end2end.yaml
```

The evaluation framework supports flexible configuration files for each task: end2end, md2md, table recognition, formula recognition, OCR, and layout detection.

Understanding the Metrics

Overall Score

The primary end-to-end metric combines three component scores:

\text{Overall} = \frac{(1 - \text{Text Edit Distance}) \times 100 + \text{Table TEDS} + \text{Formula CDM}}{3}

This gives equal weight to text, table, and formula extraction quality.
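As a sanity check, the composite can be transcribed directly into Python. This is a minimal sketch of the formula above, with a hypothetical function name; the evaluation code in the GitHub repository is the authoritative implementation.

```python
def overall_score(text_edit_distance: float, table_teds: float, formula_cdm: float) -> float:
    """OmniDocBench v1.5 composite Overall score.

    text_edit_distance is in [0, 1] (lower is better);
    table_teds and formula_cdm are percentages in [0, 100] (higher is better).
    """
    return ((1.0 - text_edit_distance) * 100.0 + table_teds + formula_cdm) / 3.0

# A model with text edit distance 0.10, TEDS 80.0, and CDM 90.0
# scores (90.0 + 80.0 + 90.0) / 3 ≈ 86.67 overall.
print(round(overall_score(0.10, 80.0, 90.0), 2))
```

Because each component is rescaled to a 0–100 range before averaging, no single task can dominate the composite.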

Component Metrics

Metric Range What It Measures Used For
Edit Distance 0–1 ↓ Character-level differences between predicted and ground-truth text Text OCR, reading order
TEDS 0–100 ↑ Tree Edit Distance Similarity — structural + content accuracy of tables Table recognition
CDM 0–100 ↑ Character Detection Matching — precision of formula LaTeX transcription Formula recognition
BLEU / METEOR 0–1 ↑ Standard NLP metrics for text similarity Alternative text quality measures
mAP / mAR 0–1 ↑ COCO detection metrics for bounding box localization Layout detection
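For intuition on the text metric, normalized edit distance can be sketched as Levenshtein distance divided by the length of the longer string. This is a simplified illustration, not the benchmark's exact normalization, which lives in its evaluation code.

```python
def normalized_edit_distance(pred: str, ref: str) -> float:
    """Levenshtein distance between pred and ref, divided by the longer length.

    0.0 means an exact match; 1.0 means nothing matches.
    """
    m, n = len(pred), len(ref)
    if max(m, n) == 0:
        return 0.0
    # Classic dynamic-programming rows: prev[j] is the edit distance
    # between pred[:i-1] and ref[:j]; curr[j] is for pred[:i].
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if pred[i - 1] == ref[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n] / max(m, n)

print(normalized_edit_distance("kitten", "sitting"))  # 3 edits / 7 chars ≈ 0.4286
```

The same distance, applied to the sequence of page elements rather than characters, also scores reading order.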

Document Types and Attribute Analysis

OmniDocBench goes beyond aggregate scores by providing fine-grained, attribute-level results. You can break down performance by:

  • Document type — academic paper, textbook, financial report, newspaper, handwritten note, PPT, magazine, test paper, book
  • Layout complexity — single-column, double-column, three-column, complex
  • Language — Chinese, English, mixed
  • Table attributes — simple vs. complex, with/without merged cells, colored backgrounds
  • Text attributes — font size, orientation, special characters
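A sketch of what such a breakdown looks like in practice, using made-up per-page scores; the field names here are illustrative, not the dataset's actual schema.

```python
from collections import defaultdict
from statistics import mean

# Toy per-page results; in a real run these would come from the
# evaluation output, keyed by each page's attribute annotations.
results = [
    {"doc_type": "academic_paper", "layout": "double_column", "edit_dist": 0.03},
    {"doc_type": "academic_paper", "layout": "single_column", "edit_dist": 0.05},
    {"doc_type": "newspaper",      "layout": "complex",       "edit_dist": 0.18},
    {"doc_type": "handwritten",    "layout": "single_column", "edit_dist": 0.25},
]

# Group scores by document type and report the mean per group.
by_type = defaultdict(list)
for page in results:
    by_type[page["doc_type"]].append(page["edit_dist"])

for doc_type, scores in sorted(by_type.items()):
    print(f"{doc_type}: mean edit distance {mean(scores):.3f}")
```

The same grouping works for any attribute axis (layout, language, table properties), which is how per-attribute weaknesses like poor handwriting OCR become visible.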

Why OmniDocBench Matters

```mermaid
graph LR
    A["LLMs & RAG need<br/>accurate document<br/>parsing"] --> B["Existing benchmarks<br/>too narrow"]
    B --> C["OmniDocBench<br/>fills the gap"]
    C --> D["Better document AI<br/>for real-world use"]

    A2["Models compared<br/>unfairly"] --> B2["Different eval<br/>methodologies"]
    B2 --> C
    C --> D2["Standardized,<br/>reproducible<br/>benchmarking"]

    style A fill:#e74c3c,color:#fff,stroke:#333
    style A2 fill:#e74c3c,color:#fff,stroke:#333
    style C fill:#27ae60,color:#fff,stroke:#333
    style D fill:#3498db,color:#fff,stroke:#333
    style D2 fill:#3498db,color:#fff,stroke:#333
```

  1. Diverse and realistic — 9 document types covering the full range of real-world PDFs, not just academic papers
  2. Multi-level evaluation — end-to-end, task-specific, and attribute-level analysis to pinpoint model weaknesses
  3. Fair comparison — standardized evaluation code ensures reproducible, apples-to-apples comparisons
  4. Covers the full pipeline — text, tables, formulas, layout, and reading order in a single benchmark
  5. Active community — 1.6k GitHub stars, 14 contributors, regular model additions (Docker support added November 2025)


Conclusion

OmniDocBench 1.5 sets the standard for document parsing evaluation:

  • 1,355 PDF pages across 9 document types, 4 layouts, and 3 languages — far broader than any predecessor
  • 100,000+ annotations at block and span levels, with reading order and multi-format table/formula ground truth
  • Five evaluation dimensions — end-to-end, text OCR, table, formula, and layout detection
  • The best specialized model (PaddleOCR-VL) achieves 92.86 Overall — but general VLMs like GPT-4o still score only 75.02, revealing a significant gap
  • Accepted at CVPR 2025 and actively maintained with regular model updates

As document AI becomes critical infrastructure for LLMs and RAG systems, OmniDocBench provides the rigorous, multi-dimensional evaluation needed to drive real progress — not just on cherry-picked academic papers, but across the messy diversity of real-world documents.
