This repository provides a benchmark for evaluating the accuracy of PDF-to-Markdown extraction tools. The main goal is to measure how well a tool can convert a complex, 2D PDF document into a 1D (text/markdown) format that preserves the logical reading flow of the content.
Large Language Models (LLMs) operate on a 1D sequence of tokens. They cannot natively understand the 2D spatial layout of a PDF. This "2D-to-1D" gap is a major bottleneck.
Existing benchmarks are often inadequate:
- They focus on layout detection: Datasets like DocLayNet are excellent for identifying bounding boxes (e.g., "this is a paragraph"), but not for connecting text blocks that form a single, logical flow (e.g., "this paragraph continues in the next column").
- They assume a "total order": Some benchmarks incorrectly assume a single, linear reading path for an entire document. In reality, complex documents have a partial order. For example, a main article and a sidebar can be read independently; neither logically precedes the other.
This benchmark is designed to measure a tool's ability to extract and correctly sequence these logically coherent "threads" of text.
The benchmark uses 127 PDF documents sampled from the DocLayNet dataset, ensuring a diverse mix of challenging, real-world layouts. The documents are sourced from six distinct categories:
- Financial Reports
- Scientific Articles
- Laws & Regulations
- Government Tenders
- Manuals
- Patents
To create a ground truth for evaluating text flow, we manually copied several randomly chosen passages from each document, preserving their correct logical reading order, to form "ground truth snippets" for each PDF. An evaluation metric can then check whether a tool's output contains these snippets, in order, without text from other columns or sections jumbled in. This directly tests the preservation of reading flow; a naive version of the check is sketched below.
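As a minimal illustration of the in-order containment idea (the actual metric, described next, uses fuzzy matching rather than exact substring search, since extracted text is rarely verbatim):

```python
def snippets_in_order(snippets: list[str], output: str) -> bool:
    """Naive check: do all ground-truth snippets appear verbatim in
    `output`, in the same order? Illustrative only; the real metric
    tolerates small character-level differences."""
    pos = 0
    for snippet in snippets:
        idx = output.find(snippet, pos)
        if idx == -1:
            return False  # snippet missing, or found out of order
        pos = idx + len(snippet)
    return True
```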
We evaluate tools using a Flow-Aware Text Accuracy (FATA) Score.
- For each ground truth text snippet (`truth_i`), we search the tool's entire markdown output for the substring that is its "best match" (`best_match_i`).
- This "best match" is scored using normalized Levenshtein distance (a measure of character-level similarity).
- The final FATA score is a weighted average of the similarity scores for all snippets.
- A high FATA score (maximum 1.0) indicates the tool extracted the text snippets with their internal order intact. A low score indicates the text was mangled (e.g., columns interleaved, characters garbled), making it impossible to find a clean match for the ground truth snippets. Scores are reported as percentages, with 100% being perfect (see the sketch below).
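The benchmark's actual scoring lives in `prod_benchmark.py`; the following is a minimal sketch of the idea. Normalized Levenshtein similarity of strings `a` and `b` is `1 - lev(a, b) / max(len(a), len(b))`; here, rapidfuzz's `partial_ratio` stands in for the best-match substring search, and weighting by snippet length is an assumed choice for the weighted average, not necessarily the benchmark's exact weighting:

```python
# Sketch of FATA scoring for one document (illustrative; the real
# implementation in prod_benchmark.py may differ in details).
from rapidfuzz import fuzz

def fata_score(snippets: list[str], markdown: str) -> float:
    """Flow-Aware Text Accuracy in [0, 1] for one document."""
    weighted_sum = 0.0
    total_weight = 0.0
    for snippet in snippets:
        # partial_ratio scores the best-aligned substring of `markdown`
        # against this snippet, 0-100, by character-level similarity.
        similarity = fuzz.partial_ratio(snippet, markdown) / 100.0
        weight = len(snippet)  # assumed: longer snippets count more
        weighted_sum += weight * similarity
        total_weight += weight
    return weighted_sum / total_weight if total_weight else 0.0
```

A score of 1.0 means every snippet was found essentially verbatim in the output; interleaved columns or garbled characters pull individual similarities, and hence the weighted average, down.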
This benchmark was used to generate a comparative analysis of modern PDF extraction tools that produce markdown directly. The initial set of tools evaluated includes:
- LlamaParse
- Docling
- Marker
- Reducto
- PyMuPDF4LLM
- Pymupdf-Layout
- Google Gemini (multimodal)
- ChatGPT-5 (multimodal)
To set up the environment and run the benchmark:

```bash
uv sync
uv run prod_benchmark.py
```