- 🚧 THIS IS A WORK IN PROGRESS! More will be added soon!
- Feel free to contribute by submitting a pull request 🙏
- Cells marked with ✅ or ❌ have been independently tested. Blank cells indicate that the feature has not yet been independently tested.
- See the results folder for the outputs from each model.
Usually outputs as raw text or markdown
Models | Source | Output | Needs prompt? | Table | Equation | Figure | Handwriting | Two columns | Multiple columns |
---|---|---|---|---|---|---|---|---|---|
PyMuPDF | | Raw text | N | ❌ | ❌ | ❌ | ❌ | ✅ | ❌ |
PDFPlumber | | Raw text | N | ✅ (separate from text) | ❌ | ❌ | ❌ | ❌ | ❌ |
Models | Source | Output | Needs prompt? | Table | Equation | Handwriting | Two columns | Multiple columns |
---|---|---|---|---|---|---|---|---|
Marker | | Markdown | N | ✅ (markdown) | ✅ | ✅ | ✅ | ❌ |
MonkeyOCR | | Markdown | Y | ✅ (html) | ✅ | ✅ | ✅ | ✅ |
Nougat | | Markdown | N | ❌ | ✅ | ✅ | ✅ | ❌ |
MinerU | | Markdown | N | ✅ (html) | ✅ | ❌ | ✅ | ❌ |
LlamaParse (balanced mode) | - | Markdown | Y | ✅ (markdown) | ❌ | ❌ | ✅ | ❌ |
LlamaParse (premium mode) | - | Markdown | Y | ✅ (markdown) | ❌ | ❌ | ✅ | ❌ |
Docling | | Markdown | N | ✅ (markdown) | ❌ | ❌ | ✅ | ✅ |
RolmOCR | | Markdown | Y | ✅ (markdown) | ✅ | ✅ | ✅ | † |
olmOCR | | Markdown | Y | ✅ (markdown) | ✅ | ✅ | ✅ | † |
Unstructured | | Raw text | N | ❌ | ❌ | ❌ | ❌ | ✅ |
Pytesseract | | Raw text | N | ❌ | ❌ | ❌ | ✅ | ✅ |
MarkItDown | | Markdown | N | ❌ | ❌ | ❌ | ✅ | ✅ |
Amazon Textract | - | | | | | | | |
Azure AI Document Intelligence | - | | | | | | | |
Google Cloud OCR | - | | | | | | | |
Mathpix | - | | | | | | | |
MistralOCR | - | | | | | | | |
Upstage | - | | | | | | | |
OmniAI | - | | | | | | | |
ChatDoc PDF parser | - | | | | | | | |
Reducto | - | | | | | | | |
OCRFlux | | | | | | | | |
Nanonets | | | | | | | | |
PaddleOCR | | | | | | | | |
ClovaOCR | - | | | | | | | |
ParseExtract | - | | | | | | | |
Tensorlake | - | | | | | | | |
Vectorize | - | | | | | | | |
MassivePix | - | | | | | | | |
Dolphin | | | | | | | | |
GOT | | | | | | | | |
Manga OCR | | | | | | | | |
EasyOCR | | | | | | | | |
PDFeditify | - | | | | | | | |
† Process took too long
Usually outputs as JSON containing bounding box coordinates, content (as raw text or markdown), and sometimes type (header, figure, paragraph, etc.)
🚧 WORK IN PROGRESS
Models | Source | Output | Table | Equation | Handwriting | Two columns | Multiple columns |
---|---|---|---|---|---|---|---|
Chunkr | | | | | | | |
GroundX | - | | | | | | |
ChatDOC | - | | | | | | |
Unstract | | | | | | | |
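
To make the JSON output described for this group more concrete, here is a minimal sketch of how such a response might be turned into markdown. The field names (`type`, `bbox`, `content`), the `[x0, y0, x1, y1]` box layout with a top-left origin, and the sample data are all assumptions for illustration; each tool in the table defines its own schema, so check its documentation before relying on any of these names. The two-column ordering is a naive heuristic, not what any listed tool actually does.

```python
import json

# Hypothetical response; real tools use their own field names and units.
SAMPLE_RESPONSE = """
[
  {"type": "header", "bbox": [72, 40, 540, 70], "content": "# Results"},
  {"type": "paragraph", "bbox": [310, 90, 540, 300], "content": "Right column text."},
  {"type": "paragraph", "bbox": [72, 90, 300, 300], "content": "Left column text."}
]
"""


def blocks_to_markdown(raw, page_width=612.0):
    """Join block contents in a rough two-column reading order.

    Assumes bbox is [x0, y0, x1, y1] in points with the origin at the
    top-left of the page (an assumption, not a documented format).
    """
    blocks = json.loads(raw)
    midpoint = page_width / 2

    def sort_key(block):
        x0, y0, _x1, _y1 = block["bbox"]
        # Blocks starting left of the page midpoint are read first,
        # then each column is read top to bottom.
        column = 0 if x0 < midpoint else 1
        return (column, y0)

    ordered = sorted(blocks, key=sort_key)
    return "\n\n".join(block["content"] for block in ordered)


if __name__ == "__main__":
    print(blocks_to_markdown(SAMPLE_RESPONSE))
```

Sorting on the bounding box is what lets a JSON-emitting parser recover reading order on two-column pages, which is one of the columns scored in the tables above.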
If you would like to contribute in any way, please read CONTRIBUTING.md before submitting a pull request. Thank you!