This solution extracts structured outlines (Title + H1/H2/H3 headings) from PDF documents for the Adobe India Hackathon Round 1A.
Our solution uses a multi-layered approach to achieve high accuracy:
- Font Size Analysis: Statistical analysis to identify text blocks with above-average font sizes
- Font Style Detection: Identifies bold, italic, and other styled text that typically indicates headings
- Pattern Recognition: Uses regex patterns to detect numbered sections, chapter titles, and other common heading formats
- Position Analysis: Considers vertical spacing and alignment to identify potential headings
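The pattern-recognition layer above can be illustrated with a minimal sketch. The regexes and the names `HEADING_PATTERNS`, `match_heading_pattern`, and `numbering_depth` are hypothetical; this README does not show the project's actual patterns.

```python
import re

# Hypothetical patterns for illustration; the project's real regexes may differ.
HEADING_PATTERNS = [
    # "1 Intro", "1. Intro", "2.3.1 Details"
    ("numbered", re.compile(r"^\d+(\.\d+)*\.?\s+\S")),
    # "Chapter 4 Results", "Appendix A"
    ("chapter", re.compile(r"^(chapter|section|appendix)\s+\w+", re.IGNORECASE)),
]


def match_heading_pattern(text):
    """Return the name of the first pattern that matches, or None."""
    stripped = text.strip()
    for name, pattern in HEADING_PATTERNS:
        if pattern.match(stripped):
            return name
    return None


def numbering_depth(text):
    """Depth implied by a dotted number prefix: '2.3.1 Foo' gives 3, no prefix gives 0."""
    m = re.match(r"^(\d+(?:\.\d+)*)\.?\s", text.strip())
    return m.group(1).count(".") + 1 if m else 0
```

A pattern match on its own is a weak signal (a sentence such as "2024 was a good year" also matches the numbered pattern), which is presumably why it is combined with the font and position cues.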
- Calculates mean, median, and standard deviation of font sizes across the document
- Identifies the most common font family to distinguish headings from body text
- Uses these statistics to set adaptive thresholds for different document types
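As a rough sketch of the statistics step above, the following computes the size statistics and the dominant font, then derives a size threshold. The span dictionaries mimic the `size` and `font` keys that PyMuPDF reports for text spans; the function names and the threshold rule (body size plus `k` standard deviations) are assumptions made for illustration, not the project's actual formula.

```python
import statistics
from collections import Counter


def summarize_fonts(spans):
    """Collect size statistics and the dominant font from a list of span dicts."""
    sizes = [s["size"] for s in spans]
    return {
        "mean": statistics.mean(sizes),
        "median": statistics.median(sizes),
        "stdev": statistics.pstdev(sizes),
        # The most frequent size and font are taken to be the body text.
        "body_size": Counter(round(x, 1) for x in sizes).most_common(1)[0][0],
        "body_font": Counter(s["font"] for s in spans).most_common(1)[0][0],
    }


def heading_threshold(stats, k=1.0, min_margin=0.5):
    """Assumed rule: a span is a heading candidate if it is larger than this size."""
    return stats["body_size"] + max(k * stats["stdev"], min_margin)
```

Because the threshold is derived from each document's own distribution, a report set in 9 pt and a slide deck set in 18 pt get different cut-offs.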
- Uses K-Means clustering to automatically determine heading levels based on font sizes
- Assigns H1, H2, H3 levels based on relative font sizes within the document
- Handles documents with varying font size distributions
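The level assignment described above can be sketched as follows. To keep the example dependency-free and deterministic, it uses a small one-dimensional k-means (Lloyd's algorithm) written in plain Python as a stand-in for scikit-learn's `KMeans`; the function names and the initialization scheme are assumptions, not the project's code.

```python
def kmeans_1d(values, k, iterations=50):
    """Lloyd's algorithm in one dimension with evenly spread initial centers."""
    distinct = sorted(set(values))
    k = min(k, len(distinct))  # never ask for more clusters than distinct sizes
    if k == 1:
        centers = [distinct[0]]
    else:
        centers = [distinct[round(i * (len(distinct) - 1) / (k - 1))] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        updated = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
        if updated == centers:  # converged
            break
        centers = updated
    return centers


def assign_levels(heading_sizes, max_levels=3):
    """Map each heading font size to 'H1', 'H2' or 'H3'; the largest cluster is H1."""
    ranked = sorted(kmeans_1d(heading_sizes, max_levels), reverse=True)
    levels = []
    for size in heading_sizes:
        nearest = min(range(len(ranked)), key=lambda i: abs(size - ranked[i]))
        levels.append(f"H{nearest + 1}")
    return levels
```

Capping `k` at the number of distinct sizes is one way to handle documents that use fewer than three heading sizes: such a document simply gets fewer levels.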
- First checks PDF metadata for embedded title information
- Falls back to finding the largest text on the first page
- Uses multiple heuristics to filter out page numbers and other non-title content
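The title fallback chain above might look like the sketch below. It assumes the metadata is a dictionary with a `title` key (as PyMuPDF's `doc.metadata` provides) and that first-page spans are dictionaries with `text` and `size`; `looks_like_noise` and `pick_title` are hypothetical names, and the noise filter is only an example of the kind of heuristic meant.

```python
import re


def looks_like_noise(text):
    """Reject very short strings and page-number-like text ('3', 'Page 3 of 10')."""
    t = text.strip()
    if len(t) < 3:
        return True
    return re.fullmatch(r"(page\s+)?\d+(\s*(/|of)\s*\d+)?", t, re.IGNORECASE) is not None


def pick_title(metadata, first_page_spans):
    """Prefer the embedded title; otherwise use the largest text on the first page."""
    meta_title = (metadata.get("title") or "").strip()
    if meta_title and not looks_like_noise(meta_title):
        return meta_title
    candidates = [s for s in first_page_spans if not looks_like_noise(s["text"])]
    if not candidates:
        return ""
    largest = max(s["size"] for s in candidates)
    # Join every span at the largest size so multi-line titles stay whole.
    return " ".join(s["text"].strip() for s in candidates if s["size"] == largest)
```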
- Each detected heading receives a confidence score based on multiple factors
- A heading detected by more than one method receives a boosted confidence score
- Only headings above a confidence threshold are included in the final output
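One way to picture the confidence step above: merge detections of the same heading, raise the score when a second method agrees, then filter. The `boost` and `threshold` values, the dictionary keys, and the name `merge_detections` are illustrative assumptions; the README does not state the actual scoring factors.

```python
def merge_detections(detections, boost=0.15, threshold=0.6):
    """Merge per-method detections and keep headings at or above the threshold."""
    merged = {}
    for d in detections:
        key = (d["page"], d["text"].strip().lower())
        entry = merged.get(key)
        if entry is None:
            merged[key] = {
                "text": d["text"].strip(),
                "page": d["page"],
                "confidence": d["confidence"],
                "methods": {d["method"]},
            }
        elif d["method"] not in entry["methods"]:
            # A different method agrees: keep the best score and boost it, capped at 1.0.
            entry["methods"].add(d["method"])
            entry["confidence"] = min(1.0, max(entry["confidence"], d["confidence"]) + boost)
    return [e for e in merged.values() if e["confidence"] >= threshold]
```

With these example values, a heading scored 0.55 by font size and 0.50 by pattern matching ends up at about 0.70 and is kept, while a lone 0.40 detection is dropped.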
- PyMuPDF (fitz): High-performance PDF text extraction with detailed font information
- scikit-learn: K-Means clustering for heading level assignment
- NumPy: Numerical operations for statistical analysis
- Built-in Python libraries: re, statistics, collections for text processing
- No external model dependencies: Uses an algorithmic approach based on statistical analysis
- Fast processing: Optimized for the 10-second constraint on 50-page PDFs
- Memory efficient: Processes documents incrementally
- Multilingual support: Works with any Unicode text, including non-Latin scripts such as Japanese
- Robust error handling: Gracefully handles malformed PDFs and edge cases
docker build --platform linux/amd64 -t pdf-outline-extractor:latest .
docker run --rm -v $(pwd)/input:/app/input -v $(pwd)/output:/app/output --network none pdf-outline-extractor:latest
pip install -r requirements.txt
python extract_outline.py input_directory output_directory
- Speed: Processes typical documents in 1-3 seconds
- Memory: Uses approximately 50-100 MB of RAM for large documents
- Accuracy: Achieves high precision and recall across diverse document types
- Size: Total container size is approximately 150 MB
The solution is split into the following components:
- PDFHeadingExtractor Class: Main extraction engine
- Feature Extraction Module: Statistical analysis of document properties
- Detection Modules: Separate methods for different heading detection strategies
- Level Assignment Module: Clustering-based level determination
- Output Formatting: Clean JSON output matching required specifications
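As an illustration of the output-formatting component, the sketch below serializes a title and a list of headings to JSON. The schema (a `title` string plus an `outline` list of `level`, `text`, and `page` entries) is an assumption, since this README does not reproduce the required specification, and `format_outline` is a hypothetical name.

```python
import json


def format_outline(title, headings):
    """Serialize the outline; the schema here is assumed, not taken from the spec."""
    outline = [
        {"level": h["level"], "text": h["text"], "page": h["page"]}
        # sorted() is stable, so headings on the same page keep their order.
        for h in sorted(headings, key=lambda h: h["page"])
    ]
    # ensure_ascii=False keeps non-Latin text (e.g. Japanese) readable in the file.
    return json.dumps({"title": title, "outline": outline}, ensure_ascii=False, indent=2)
```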
This modular design allows for easy extension and integration with the Round 1B persona-driven document intelligence system.