📝 Docs Parsing Techniques

A curated collection of Jupyter notebooks for experimenting with state-of-the-art OCR, document parsing, table extraction, and chart understanding techniques. The notebooks make it easy to benchmark and try out recent open-source and cloud-based tools for document image processing.

🚀 Notebooks Overview

Notebook	Description
bytedance-dolphin-image-parsing.ipynb	Document page parsing with Dolphin by ByteDance
Llama-3.1-Nemotron-Nano-VL-8B-V1_parsing_documents.ipynb	Testing the performance of document parsing with Llama-3.1-Nemotron-Nano-VL-8B-V1
docling-documents-parsing-and-tables-extraction.ipynb	Parsing and table extraction with Docling
typhoon-ocr-7b-docs-pages-parser.ipynb	Evaluating Typhoon_ocr_7b document parsing capabilities across various use cases
florence-2-large-ocr-documents-pages.ipynb	OCR of document pages using Florence 2 Large
florence-2-large-ocr-images-real-life-scenarios.ipynb	Real-life scenario OCR with Florence 2 Large
got-ocr2-0-docs-parsing.ipynb	Document page parsing with GOT-OCR2.0 and Gemini 2.5 Flash
marker-docs-parsing.ipynb	Marker-based document parsing experiments
mistralocr-docs-parsing.ipynb	Document parsing using MistralOCR
monkeyocr-docs-pages-parsing.ipynb	Document parsing with MonkeyOCR
Nanonets-OCR-s_docs_parsing.ipynb	Advanced document parsing using Nanonets-OCR-s
ollama-llama3-2-vision-usage.ipynb	Using Llama 3.2 Vision (via Ollama) for document parsing
paddleocr-3-0-docs-parsing.ipynb	Parsing with PaddleOCR 3.0 PP-StructureV3
pix2text-docs-pages-parsing.ipynb	Document parsing using Pix2Text
smoldocling-documents-understanding.ipynb	Document understanding with SmolDocling
zerox-pdf-parsing.ipynb	PDF parsing experiments with Zerox
qwen2-vl-2b-docs-parsing.ipynb	Document page parsing with Qwen2-VL-2B
OCRFlux_3B_Docs_Parsing.ipynb	Document parsing with OCRFlux-3B on Lightning AI

📑📊 Tables and Charts Recognition

This section includes notebooks focused on table and chart detection, structure recognition, and extraction from documents. It covers various open-source approaches and benchmarks for understanding table and chart layouts and content.

Notebook	Description
unitable-testing-for-table-structure-recognition.ipynb	Testing table detection and structure recognition with UniTable
deepdoctection-tables-recognition.ipynb	Evaluating Deepdoctection for table extraction across varied structures
gemini-2-5-pro-on-chart-and-table-extraction.ipynb	Chart/table extraction using Gemini 2.5 Pro
deplot-plots-to-tables-converter.ipynb	Converting charts into tables with DePlot

📑🔍 Structured Data Extraction

This section covers the structured data extraction phase, detailing methods to extract specific data from documents or images. It includes steps like OCR preprocessing, table extraction, named entity recognition (NER), and conversion to structured formats.

Notebook	Description
NuExtract-2-8b-structured-data-extraction.ipynb	Structured data extraction with NuExtract-2.0-8B

📖 Project Goals

Benchmark different OCR/document parsing models on real documents.
Demonstrate table, chart, and text extraction workflows.
Compare open-source and commercial solutions.
Provide ready-to-use code snippets for rapid prototyping.
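
As a rough illustration of the benchmarking goal, the sketch below times a set of parser functions on one document and scores each output against a reference transcription. It is not taken from the notebooks: `benchmark_parsers`, the two stand-in parsers, and the file name `page_1.png` are all hypothetical, and the similarity score uses Python's standard `difflib` rather than any metric the notebooks may report.

```python
import difflib
import time
from typing import Callable, Dict


def benchmark_parsers(
    parsers: Dict[str, Callable[[str], str]],
    document_path: str,
    reference_text: str,
) -> Dict[str, Dict[str, float]]:
    """Run each parser on one document and score its output against a reference."""
    results = {}
    for name, parse in parsers.items():
        start = time.perf_counter()
        predicted = parse(document_path)
        elapsed = time.perf_counter() - start
        # ratio() is 1.0 for identical strings and decreases as they diverge.
        similarity = difflib.SequenceMatcher(None, reference_text, predicted).ratio()
        results[name] = {"seconds": elapsed, "similarity": similarity}
    return results


# Hypothetical stand-ins; replace them with calls into a real model or API.
def perfect_parser(path: str) -> str:
    return "Total revenue: 42"


def noisy_parser(path: str) -> str:
    return "Tota1 revenue: 4Z"


scores = benchmark_parsers(
    {"perfect": perfect_parser, "noisy": noisy_parser},
    document_path="page_1.png",  # hypothetical input file
    reference_text="Total revenue: 42",
)
for name, metrics in scores.items():
    print(name, metrics)
```

Wrapping each model behind a plain `str -> str` function keeps the comparison loop independent of how a given notebook loads its model.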

🛠️ Usage

Clone the repository:

git clone https://github.com/AdemBoukhris457/Docs_Parsing_Techniques.git

Install dependencies as needed for each notebook (see the first cells of each .ipynb for requirements).
Launch Jupyter Notebook or JupyterLab and open any notebook of interest.
Run the cells and adapt the code for your documents.
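
Since each notebook lists its own requirements in its first cells, a small helper can collect the install commands without opening Jupyter. The sketch below is an assumption about how those cells look: it treats any code-cell line starting with `!pip install` or `%pip install` as an install command, and `find_install_commands` is a hypothetical name, not part of this repository.

```python
import json
from pathlib import Path
from typing import List


def find_install_commands(notebook_path: str) -> List[str]:
    """Return the pip install lines found in a notebook's code cells."""
    notebook = json.loads(Path(notebook_path).read_text(encoding="utf-8"))
    commands = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        # Notebook JSON stores source either as one string or as a list of lines.
        text = "".join(source) if isinstance(source, list) else source
        for line in text.splitlines():
            stripped = line.strip()
            if stripped.startswith(("!pip install", "%pip install")):
                commands.append(stripped)
    return commands
```

From the repository root, calling `find_install_commands(str(path))` for each `path` in `Path(".").glob("*.ipynb")` gives a quick overview of what every notebook expects to install.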

📌 Notes

Some notebooks require model weights or API keys; check the comments in each notebook for details.
Results, insights, and sample outputs are provided inline.
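
For notebooks that call a hosted API, one way to fail early is to check for the key before running any cells. The snippet below is a minimal sketch that assumes keys are supplied through environment variables; the variable names in `REQUIRED` are hypothetical, so use whatever names the notebook in question actually reads.

```python
import os
from typing import Iterable, List


def missing_env_vars(names: Iterable[str]) -> List[str]:
    """Return the names that are unset or empty in the current environment."""
    return [name for name in names if not os.environ.get(name)]


# Hypothetical variable names; check each notebook for the ones it really uses.
REQUIRED = ["MISTRAL_API_KEY", "GEMINI_API_KEY"]

missing = missing_env_vars(REQUIRED)
if missing:
    print("Set these environment variables before running:", ", ".join(missing))
```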

🔗 Related Resources

📂 You can find more notebooks, experiments, and datasets related to document parsing and OCR on my Kaggle profile: 👉 https://www.kaggle.com/ademboukhris/code
