📃 Extensive List of Document Parsers

🚧 THIS IS A WORK IN PROGRESS! More will be added soon!
Feel free to contribute by submitting a pull request 🙏
Cells marked with ✅ or ❌ have been independently tested. Blank cells indicate that the feature has not yet been independently tested.
See the results folder for the outputs from each model.

PDF-to-Text Converters

These tools usually output raw text or Markdown.

Models	Source	Output	Needs prompt?	Table	Equation	Figure	Handwriting	Two columns	Multiple columns
PyMuPDF		Raw text	N	❌	❌	❌	❌	✅	❌
PDFPlumber		Raw text	N	✅ (separate from text)	❌	❌	❌	❌	❌
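
As a rough illustration of the two rows above, the sketch below pulls raw text with PyMuPDF and then text plus tables with PDFPlumber. PDFPlumber returns tables separately from the page text, as lists of rows, so a small helper renders one of them as Markdown. The function names (`extract_text_pymupdf`, `extract_pdfplumber`, `table_to_markdown`) are hypothetical helpers written for this example and are not part of either library. The sketch also assumes the first row of each table is its header, which will not hold for every PDF.

```python
from typing import List, Optional, Tuple

# pdfplumber represents a table as rows of cells; empty cells come back as None.
Table = List[List[Optional[str]]]


def extract_text_pymupdf(path: str) -> str:
    """Return the raw text of every page, using PyMuPDF."""
    import fitz  # PyMuPDF; imported lazily so table_to_markdown works without it

    with fitz.open(path) as doc:
        return "\n".join(page.get_text() for page in doc)


def extract_pdfplumber(path: str) -> Tuple[str, List[Table]]:
    """Return (text, tables) using pdfplumber; tables are kept apart from text."""
    import pdfplumber  # imported lazily for the same reason as above

    texts: List[str] = []
    tables: List[Table] = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            texts.append(page.extract_text() or "")
            tables.extend(page.extract_tables())
    return "\n".join(texts), tables


def table_to_markdown(table: Table) -> str:
    """Render one extracted table as Markdown (assumes row 0 is the header)."""
    if not table:
        return ""

    def clean(cell: Optional[str]) -> str:
        # Flatten multi-line cells and escape pipes so the row stays intact.
        return (cell or "").replace("\n", " ").replace("|", "\\|").strip()

    def render(row: List[Optional[str]]) -> str:
        return "| " + " | ".join(clean(c) for c in row) + " |"

    header, *body = table
    lines = [render(header), "| " + " | ".join("---" for _ in header) + " |"]
    lines += [render(row) for row in body]
    return "\n".join(lines)
```

With both libraries installed, `extract_pdfplumber("paper.pdf")` (the file name is a placeholder) gives the page text and a list of tables, and each table can be passed to `table_to_markdown` if Markdown output is wanted.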

Models	Source	Output	Needs prompt?	Table	Equation	Handwriting	Two columns	Multiple columns
Marker		Markdown	N	✅ (markdown)	✅	✅	✅	❌
MonkeyOCR		Markdown	Y	✅ (html)	✅	✅	✅	✅
Nougat		Markdown	N	❌	✅	✅	✅	❌
MinerU		Markdown	N	✅ (html)	✅	❌	✅	❌
LlamaParse (balanced mode)	-	Markdown	Y	✅ (markdown)	❌	❌	✅	❌
LlamaParse (premium mode)	-	Markdown	Y	✅ (markdown)	❌	❌	✅	❌
Docling		Markdown	N	✅ (markdown)	❌	❌	✅	✅
RolmOCR		Markdown	Y	✅ (markdown)	✅	✅	✅	†
olmOCR		Markdown	Y	✅ (markdown)	✅	✅	✅	†
Unstructured		Raw text	N	❌	❌	❌	❌	✅
Pytesseract		Raw text	N	❌	❌	❌	✅	✅
MarkItDown		Markdown	N	❌	❌	❌	✅	✅
Amazon Textract	-
Azure AI Document Intelligence	-
Google Cloud OCR	-
Mathpix	-
MistralOCR	-
Upstage	-
OmniAI	-
ChatDoc PDF parser	-
Reducto	-
OCRFlux
Nanonets
PaddleOCR
ClovaOCR	-
ParseExtract	-
Tensorlake	-
Vectorize	-
MassivePix	-
Dolphin
GOT
Manga OCR
EasyOCR
PDFeditify	-