A Rust-based PDF content extraction tool that combines PDF rendering with deep learning-based document layout detection and OCR text recognition. This project is inspired by and builds upon the excellent work of ferrules.
- PDF to Image Conversion: Convert PDF pages to high-quality images using PDFium with configurable rendering options
- Layout Analysis: Detect and classify document elements using a YOLOv12 model trained on DocLayNet, with advanced sorting by page index
- Text Detection & OCR: Extract text content using PaddleOCR with PaddlePaddle models, including layout-aware detection and multi-language support
- Bounding Box Visualization: Draw detected layout elements and text regions with color-coded bounding boxes, including detection and OCR results
- Reading Order Sorting: Automatically sort detected elements in natural reading order, optimized for multi-column layouts
- Multi-column Support: Handle both single and multi-column document layouts with intelligent merging of text lines
- Text Line Merging: Intelligently merge overlapping text detections into coherent text lines, with configurable thresholds
- Debug Mode: Save intermediate results such as layout bounding boxes and text detection visualizations for debugging purposes
The tool can detect and classify the following document elements:
- Caption
- Footnote
- Formula
- List-Item
- Page-Footer
- Page-Header
- Picture
- Section-Header
- Table
- Text
- Title
- PaddleOCR Detection: Uses PaddlePaddle's text detection model to locate text regions
- Layout-Aware Detection: Detects text within specific layout regions (e.g., text blocks, tables)
- Confidence Scoring: Each text detection includes a confidence probability
- Bounding Box Extraction: Precise text region coordinates with proper scaling
- PaddleOCR Recognition: Uses PaddlePaddle's text recognition model for OCR
- Multi-language Support: Supports various languages and character sets
- High Accuracy: State-of-the-art recognition accuracy on printed text
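
The "proper scaling" mentioned for bounding box extraction can be illustrated with a minimal sketch. It assumes the page image is stretched to the model input size (for example 1024x1024) without letterbox padding, which may not match what the tool actually does; the `BBox` type and `scale_to_source` function are hypothetical names used only for this example.

```rust
/// Axis-aligned box in pixel coordinates (hypothetical type for this sketch).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct BBox {
    pub x_min: f32,
    pub y_min: f32,
    pub x_max: f32,
    pub y_max: f32,
}

/// Map a box from model-input space back to the source image.
/// Assumption: the image was resized by plain stretching, so each axis
/// has its own independent scale factor and there is no padding offset.
/// `model` and `source` are (width, height) pairs.
pub fn scale_to_source(b: BBox, model: (f32, f32), source: (f32, f32)) -> BBox {
    let sx = source.0 / model.0;
    let sy = source.1 / model.1;
    BBox {
        // Clamp so rounding or slight overshoot never leaves the image.
        x_min: (b.x_min * sx).clamp(0.0, source.0),
        y_min: (b.y_min * sy).clamp(0.0, source.1),
        x_max: (b.x_max * sx).clamp(0.0, source.0),
        y_max: (b.y_max * sy).clamp(0.0, source.1),
    }
}
```

If the preprocessing preserves aspect ratio with padding instead, the padding offset would have to be subtracted before scaling.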
- PDFium Library: The tool requires PDFium binaries. Set the environment variable:
  export PDFIUM_DYNAMIC_LIB_PATH=/path/to/pdfium/lib
- Rust: Install Rust from rustup.rs
- ONNX Runtime: Required for model inference (automatically handled by dependencies)
- Arch Linux CUDA Requirements: If enabling CUDA features on Arch Linux, ensure the following packages are installed:
  - onnxruntime
  - cuda
  - cudnn
  - nccl
cd ferrpdf/ferrpdf-core
cargo build --release

The analyze tool combines PDF rendering, layout analysis, and OCR text extraction in a single command:
# Analyze first page of a PDF with OCR
cargo run --bin analyze -- input.pdf
# Analyze specific page (0-based indexing)
cargo run --bin analyze -- input.pdf --page 2
# Specify output directory and enable debug mode
cargo run --bin analyze -- input.pdf --output results --debug
# Full example
cargo run --bin analyze -- "document.pdf" --page 0 --output analysis_results --debug

- <INPUT>: Input PDF file path (required)
- -p, --page <PAGE>: Page number to analyze (0-based, default: 0)
- -o, --output <OUTPUT>: Output directory (default: "images")

- analysis-{page}.jpg: Layout analysis result with bounding boxes and labels
- detection-{page}-{N}.jpg: Text detection visualization with bounding boxes and confidence scores

cargo run --bin pdf2img
cargo run --bin layout

The following constants can be adjusted in src/consts.rs:
- PROBA_THRESHOLD: Minimum confidence threshold (default: 0.2)
- NMS_IOU_THRESHOLD: IoU threshold for Non-Maximum Suppression (default: 0.45)
- REQUIRED_WIDTH/HEIGHT: Model input dimensions (1024x1024)
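
As a rough illustration of how a confidence threshold such as PROBA_THRESHOLD is typically applied, the sketch below mirrors the documented defaults and filters raw detections before any further processing. The constant types, the `Detection` struct, and `filter_by_confidence` are assumptions made for this example; the actual contents of src/consts.rs may be declared differently.

```rust
// Values mirror the defaults listed above; the types are assumed.
pub const PROBA_THRESHOLD: f32 = 0.2;
pub const NMS_IOU_THRESHOLD: f32 = 0.45;
pub const REQUIRED_WIDTH: u32 = 1024;
pub const REQUIRED_HEIGHT: u32 = 1024;

/// Hypothetical detection record: a class label plus model confidence.
#[derive(Debug, Clone, PartialEq)]
pub struct Detection {
    pub label: &'static str,
    pub proba: f32,
}

/// Keep only detections whose confidence is at least `threshold`.
pub fn filter_by_confidence(dets: Vec<Detection>, threshold: f32) -> Vec<Detection> {
    dets.into_iter().filter(|d| d.proba >= threshold).collect()
}
```

Raising the threshold trades recall for precision: fewer spurious boxes, but faint elements such as footnotes are more likely to be dropped.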
PaddleOCR detection and recognition models can be configured with custom parameters:
- Detection Thresholds: Adjust text detection confidence and IoU thresholds
- Line Merging: Configure text line merging parameters for better text extraction
- Post-processing: Customize DB (Differentiable Binarization) post-processing parameters
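
To make the line-merging parameters more concrete, here is a minimal sketch of one way such merging could work: two boxes are treated as part of the same line when they overlap enough vertically and the horizontal gap between them is small. The `MergeConfig` fields, `BBox`, and `merge_lines` are hypothetical names; the tool's real parameters and algorithm may differ.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct BBox {
    pub x_min: f32,
    pub y_min: f32,
    pub x_max: f32,
    pub y_max: f32,
}

/// Hypothetical merge parameters for this sketch.
pub struct MergeConfig {
    /// Minimum vertical overlap, as a fraction of the shorter box's height.
    pub min_vertical_overlap: f32,
    /// Maximum horizontal gap in pixels between boxes on one line.
    pub max_horizontal_gap: f32,
}

fn same_line(a: &BBox, b: &BBox, cfg: &MergeConfig) -> bool {
    let overlap = (a.y_max.min(b.y_max) - a.y_min.max(b.y_min)).max(0.0);
    let min_h = (a.y_max - a.y_min).min(b.y_max - b.y_min);
    if min_h <= 0.0 {
        return false;
    }
    // Gap is zero when the boxes overlap horizontally.
    let gap = (a.x_min.max(b.x_min) - a.x_max.min(b.x_max)).max(0.0);
    overlap / min_h >= cfg.min_vertical_overlap && gap <= cfg.max_horizontal_gap
}

fn union(a: &BBox, b: &BBox) -> BBox {
    BBox {
        x_min: a.x_min.min(b.x_min),
        y_min: a.y_min.min(b.y_min),
        x_max: a.x_max.max(b.x_max),
        y_max: a.y_max.max(b.y_max),
    }
}

/// Greedily merge word-level boxes into line-level boxes.
pub fn merge_lines(mut boxes: Vec<BBox>, cfg: &MergeConfig) -> Vec<BBox> {
    // Process left to right so each line grows by its nearest neighbour.
    boxes.sort_by(|a, b| a.x_min.total_cmp(&b.x_min));
    let mut lines: Vec<BBox> = Vec::new();
    for b in boxes {
        match lines.iter().position(|l| same_line(l, &b, cfg)) {
            Some(i) => lines[i] = union(&lines[i], &b),
            None => lines.push(b),
        }
    }
    lines
}
```

A larger `max_horizontal_gap` joins widely spaced words but risks bridging the gutter between columns, so it is usually safer to merge within a single layout region at a time.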
The tool uses an advanced NMS algorithm that:
- Merges overlapping bounding boxes instead of removing them
- Uses both IoU and overlap ratio for better detection of nested elements
- Handles bounding boxes of different sizes effectively
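
The behaviour described above can be sketched as follows. This is an illustrative, single-pass version under stated assumptions, not the project's actual implementation: the `BBox` and `Detection` types, the `merge_nms` function, and the separate `overlap_thr` parameter are hypothetical, and class labels are ignored for brevity.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct BBox {
    pub x_min: f32,
    pub y_min: f32,
    pub x_max: f32,
    pub y_max: f32,
}

impl BBox {
    pub fn area(&self) -> f32 {
        (self.x_max - self.x_min).max(0.0) * (self.y_max - self.y_min).max(0.0)
    }

    pub fn intersection(&self, o: &BBox) -> f32 {
        let w = (self.x_max.min(o.x_max) - self.x_min.max(o.x_min)).max(0.0);
        let h = (self.y_max.min(o.y_max) - self.y_min.max(o.y_min)).max(0.0);
        w * h
    }

    /// Intersection over union.
    pub fn iou(&self, o: &BBox) -> f32 {
        let i = self.intersection(o);
        let u = self.area() + o.area() - i;
        if u > 0.0 { i / u } else { 0.0 }
    }

    /// Intersection over the smaller box's area. Approaches 1.0 when one
    /// box is nested in the other, even if the IoU is low.
    pub fn overlap_ratio(&self, o: &BBox) -> f32 {
        let m = self.area().min(o.area());
        if m > 0.0 { self.intersection(o) / m } else { 0.0 }
    }

    pub fn union(&self, o: &BBox) -> BBox {
        BBox {
            x_min: self.x_min.min(o.x_min),
            y_min: self.y_min.min(o.y_min),
            x_max: self.x_max.max(o.x_max),
            y_max: self.y_max.max(o.y_max),
        }
    }
}

#[derive(Debug, Clone, Copy)]
pub struct Detection {
    pub bbox: BBox,
    pub proba: f32,
}

/// Merge-style NMS: an overlapping lower-confidence box is absorbed into
/// the kept box (as a union) instead of being discarded.
pub fn merge_nms(mut dets: Vec<Detection>, iou_thr: f32, overlap_thr: f32) -> Vec<Detection> {
    // Highest confidence first, so kept boxes retain the best score.
    dets.sort_by(|a, b| b.proba.total_cmp(&a.proba));
    let mut kept: Vec<Detection> = Vec::new();
    for d in dets {
        let hit = kept.iter().position(|k| {
            k.bbox.iou(&d.bbox) > iou_thr || k.bbox.overlap_ratio(&d.bbox) > overlap_thr
        });
        match hit {
            Some(i) => kept[i].bbox = kept[i].bbox.union(&d.bbox),
            None => kept.push(d),
        }
    }
    kept
}
```

The overlap ratio is what catches nested boxes: a small box fully inside a large one can have an IoU well below 0.45 while its overlap ratio is 1.0. Note that this sketch does not re-check kept boxes against each other after they grow.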
# Analyze a research paper with OCR
cargo run --bin analyze -- "research_paper.pdf"
# Analyze the page at index 3 (the fourth page, since indexing is 0-based)
cargo run --bin analyze -- "document.pdf" --page 3
# Save results to custom directory
cargo run --bin analyze -- "document.pdf" --output my_analysis

- analysis/bbox.rs: Bounding box operations and geometric calculations
- analysis/labels.rs: Document element classification labels
- inference/session.rs: ONNX Runtime integration and model inference
- layout/element.rs: Layout element data structures
- inference/paddle/: PaddleOCR detection and recognition models
- inference/yolov12/: YOLOv12 layout detection model

- Image Preprocessing: Resize and normalize images for model input
- YOLO Inference: Run YOLOv12 model for layout element detection
- Layout-Aware Text Detection: Use PaddleOCR to detect text within layout regions
- Text Recognition: Extract text content using PaddleOCR recognition model
- Text Line Merging: Intelligently merge overlapping text detections
- Advanced NMS: Merge overlapping detections with union bounding boxes
- Reading Order Sorting: Sort elements for natural document flow
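
The reading order step for multi-column pages can be approximated with a simple column-grouping heuristic, sketched below. It is only an illustration under assumptions: `BBox` and `sort_reading_order` are hypothetical names, and the heuristic does not handle full-width elements such as titles that span several columns, which a real implementation would need to treat separately.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct BBox {
    pub x_min: f32,
    pub y_min: f32,
    pub x_max: f32,
    pub y_max: f32,
}

/// Order boxes column by column (left to right), then top to bottom
/// inside each column.
pub fn sort_reading_order(mut boxes: Vec<BBox>) -> Vec<BBox> {
    // Sort by left edge so columns are discovered from left to right.
    boxes.sort_by(|a, b| a.x_min.total_cmp(&b.x_min));

    let mut columns: Vec<Vec<BBox>> = Vec::new();
    for b in boxes {
        let center = (b.x_min + b.x_max) / 2.0;
        // A box joins the first column whose horizontal span contains
        // the box's horizontal centre; otherwise it starts a new column.
        let idx = columns.iter().position(|col| {
            let left = col.iter().map(|c| c.x_min).fold(f32::INFINITY, f32::min);
            let right = col.iter().map(|c| c.x_max).fold(f32::NEG_INFINITY, f32::max);
            center >= left && center <= right
        });
        match idx {
            Some(i) => columns[i].push(b),
            None => columns.push(vec![b]),
        }
    }

    // Within each column, read from top to bottom.
    for col in &mut columns {
        col.sort_by(|a, b| a.y_min.total_cmp(&b.y_min));
    }
    columns.into_iter().flatten().collect()
}
```

For a single-column page every box lands in one column, so the result reduces to a plain top-to-bottom sort.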
PDF Page → PDFium Rendering → Configurable Image Dimensions → YOLOv12 Layout Detection
↓
Layout Regions → PaddleOCR Text Detection
↓
Text Regions → PaddleOCR Text Recognition
↓
Extracted Text + Coordinates
↓
Debug Visualization (Optional)
cargo test

Enable debug mode to save intermediate results such as layout bounding boxes and text detection visualizations:
cargo run --bin analyze -- --page 1 input.pdf --debug

cargo clippy

cargo doc --open

This project is inspired by and builds upon the excellent work of ferrules, a comprehensive document analysis framework. We extend our gratitude to the ferrules project for providing the foundation and inspiration for this PDF content extraction tool.
This project is licensed under the MIT License - see the LICENSE file for details.
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests for new functionality
- Ensure all tests pass
- Submit a pull request
- PDFium Library Not Found
  Solution: Set PDFIUM_DYNAMIC_LIB_PATH environment variable
- PDF Not Found Error
  Solution: Check file path and ensure PDF file exists
- Page Index Out of Range
  Solution: Use 0-based page indexing (first page = 0)
- Low Detection Accuracy
  Solution: Adjust PROBA_THRESHOLD or try different NMS_IOU_THRESHOLD
- OCR Text Recognition Issues
  Solution: Ensure text is clear and properly rendered, adjust detection thresholds
- Debug Output Not Generated
  Solution: Ensure debug mode is enabled with the --debug flag and specify a valid output directory

- Use release builds for better performance:
  cargo build --release
- Consider batch processing for multiple PDFs
- Adjust model thresholds based on document type
- For large documents, process pages individually to manage memory usage