A comprehensive list of document parsers, covering PDF-to-text conversion and layout extraction. Each parser is tested for support of tables, equations, handwriting, two-column layouts, and multi-column layouts.

GiftMungmeeprued/document-parsers-list

📃 Extensive List of Document Parsers

  • 🚧 THIS IS A WORK IN PROGRESS! More will be added soon!
  • Feel free to contribute by submitting a pull request 🙏
  • Cells marked with ✅ or ❌ have been independently tested. Blank cells indicate that the feature has not yet been independently tested.
  • See the results folder for the outputs from each model.

PDF-to-Text Converters

These tools usually output raw text or Markdown.

Machine-generated Documents only

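| Models | Source | Output | Needs prompt? | Table | Equation | Figure | Handwriting | Two columns | Multiple columns |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PyMuPDF | GitHub | Raw text | N |  |  |  |  |  |  |
| PDFPlumber | GitHub | Raw text | N | ✅ (separate from text) |  |  |  |  |  |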

Machine-generated and Scanned Documents

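| Models | Source | Output | Needs prompt? | Table | Equation | Handwriting | Two columns | Multiple columns |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Marker | GitHub | Markdown | N | ✅ (markdown) |  |  |  |  |
| MonkeyOCR | GitHub, Hugging Face | Markdown | Y | ✅ (html) |  |  |  |  |
| Nougat | GitHub | Markdown | N |  |  |  |  |  |
| MinerU | GitHub | Markdown | N | ✅ (html) |  |  |  |  |
| LlamaParse (balanced mode) | - | Markdown | Y | ✅ (markdown) |  |  |  |  |
| LlamaParse (premium mode) | - | Markdown | Y | ✅ (markdown) |  |  |  |  |
| Docling | GitHub | Markdown | N | ✅ (markdown) |  |  |  |  |
| RolmOCR | Hugging Face | Markdown | Y | ✅ (markdown) |  |  |  |  |
| olmOCR | GitHub | Markdown | Y | ✅ (markdown) |  |  |  |  |
| Unstructured | GitHub | Raw text | N |  |  |  |  |  |
| Pytesseract | GitHub | Raw text | N |  |  |  |  |  |
| MarkItDown | GitHub | Markdown | N |  |  |  |  |  |
| Amazon Textract | - |  |  |  |  |  |  |  |
| Azure AI Document Intelligence | - |  |  |  |  |  |  |  |
| Google Cloud OCR | - |  |  |  |  |  |  |  |
| Mathpix | - |  |  |  |  |  |  |  |
| MistralOCR | - |  |  |  |  |  |  |  |
| Upstage | - |  |  |  |  |  |  |  |
| OmniAI | - |  |  |  |  |  |  |  |
| ChatDoc PDF parser | - |  |  |  |  |  |  |  |
| Reducto | - |  |  |  |  |  |  |  |
| OCRFlux | GitHub |  |  |  |  |  |  |  |
| Nanonets | Hugging Face |  |  |  |  |  |  |  |
| PaddleOCR | GitHub |  |  |  |  |  |  |  |
| ClovaOCR | - |  |  |  |  |  |  |  |
| ParseExtract | - |  |  |  |  |  |  |  |
| Tensorlake | - |  |  |  |  |  |  |  |
| Vectorize | - |  |  |  |  |  |  |  |
| MassivePix | - |  |  |  |  |  |  |  |
| Dolphin | GitHub |  |  |  |  |  |  |  |
| GOT | GitHub |  |  |  |  |  |  |  |
| Manga OCR | GitHub |  |  |  |  |  |  |  |
| EasyOCR | GitHub |  |  |  |  |  |  |  |
| PDFeditify | - |  |  |  |  |  |  |  |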

† Process took too long
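
The "(markdown)" and "(html)" notes in the Table column appear to describe the format in which a parser emits extracted tables. As a rough illustration of the difference, the sketch below renders the same small table both ways. The function names and the sample data are hypothetical and are not taken from any of the listed tools, whose actual output will differ in detail.

```python
from html import escape
from typing import List


def to_markdown_table(header: List[str], rows: List[List[str]]) -> str:
    """Render a table as a Markdown pipe table (illustrative helper)."""
    def fmt(cells: List[str]) -> str:
        # A literal pipe inside a cell must be escaped in pipe tables.
        return "| " + " | ".join(c.replace("|", "\\|") for c in cells) + " |"

    lines = [fmt(header), "| " + " | ".join("---" for _ in header) + " |"]
    lines.extend(fmt(row) for row in rows)
    return "\n".join(lines)


def to_html_table(header: List[str], rows: List[List[str]]) -> str:
    """Render the same table as HTML (illustrative helper)."""
    parts = ["<table>"]
    parts.append("<tr>" + "".join(f"<th>{escape(h)}</th>" for h in header) + "</tr>")
    for row in rows:
        parts.append("<tr>" + "".join(f"<td>{escape(c)}</td>" for c in row) + "</tr>")
    parts.append("</table>")
    return "\n".join(parts)


if __name__ == "__main__":
    header = ["Item", "Qty"]       # made-up sample data
    rows = [["Bolt", "4"]]
    print(to_markdown_table(header, rows))
    print(to_html_table(header, rows))
```

One practical difference: HTML can express merged cells through `rowspan` and `colspan`, while Markdown pipe tables cannot, so HTML output may preserve more structure for complex tables.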

Layout Parsers

These tools usually output JSON containing bounding-box coordinates, content (as raw text or Markdown), and sometimes a block type (header, figure, paragraph, etc.).
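
To make that output shape concrete, here is a minimal sketch of how such JSON might be loaded and sorted. The field names (`page`, `bbox`, `content`, `type`), the sample data, and the top-left coordinate origin are all assumptions for illustration; each layout parser defines its own schema.

```python
import json
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class LayoutBlock:
    """One detected region. Field names are hypothetical, not any tool's schema."""
    page: int
    bbox: Tuple[float, float, float, float]  # (x0, y0, x1, y1), origin assumed top-left
    content: str                             # raw text or Markdown
    type: Optional[str] = None               # e.g. "header", "figure", "paragraph"


def parse_layout(raw: str) -> List[LayoutBlock]:
    """Convert a JSON array of block objects into LayoutBlock instances."""
    return [
        LayoutBlock(
            page=item["page"],
            bbox=tuple(item["bbox"]),
            content=item["content"],
            type=item.get("type"),  # the type field is optional
        )
        for item in json.loads(raw)
    ]


def naive_reading_order(blocks: List[LayoutBlock]) -> List[LayoutBlock]:
    """Sort by page, then top edge, then left edge.

    This only suits single-column pages; two-column and multi-column
    layouts need column detection before sorting.
    """
    return sorted(blocks, key=lambda b: (b.page, b.bbox[1], b.bbox[0]))


# Made-up sample output, deliberately out of reading order.
SAMPLE = json.dumps([
    {"page": 1, "bbox": [72, 150, 540, 300], "content": "Body paragraph.", "type": "paragraph"},
    {"page": 1, "bbox": [72, 72, 540, 110], "content": "# Title", "type": "header"},
    {"page": 1, "bbox": [72, 320, 540, 500], "content": "| a | b |"},
])

if __name__ == "__main__":
    for block in naive_reading_order(parse_layout(SAMPLE)):
        print(block.type, block.bbox, block.content)
```

The naive sort also hints at why the two-column and multiple-column checks are listed separately: a parser that orders blocks purely by vertical position will interleave text from adjacent columns.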

🚧 WORK IN PROGRESS

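| Models | Source | Output | Table | Equation | Handwriting | Two columns | Multiple columns |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Chunkr | GitHub |  |  |  |  |  |  |
| GroundX | - |  |  |  |  |  |  |
| ChatDOC | - |  |  |  |  |  |  |
| Unstract | GitHub |  |  |  |  |  |  |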

Contributing

If you would like to contribute in any way, please read CONTRIBUTING.md first. Thank you!
