Mahatva777/autopdf1a

PDF Outline Extractor

This solution extracts structured outlines (Title + H1/H2/H3 headings) from PDF documents for the Adobe India Hackathon Round 1A.

Approach

Our solution uses a multi-layered approach to achieve high accuracy:

1. Multi-Method Heading Detection

  • Font Size Analysis: Statistical analysis to identify text blocks with above-average font sizes
  • Font Style Detection: Identifies bold, italic, and other styled text that typically indicates headings
  • Pattern Recognition: Uses regex patterns to detect numbered sections, chapter titles, and other common heading formats
  • Position Analysis: Considers vertical spacing and alignment to identify potential headings
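
As a rough illustration of how three of these signals (pattern, font size, bold style) could be checked for a single text span, here is a minimal sketch. The regexes, the function name `looks_like_heading`, and the "larger than the mean" rule are assumptions for illustration; the repository's actual rules are not shown in this README. The bold check relies on PyMuPDF setting bit 4 of a span's `flags` value for bold text.

```python
import re

# Hypothetical patterns; the project's real regexes are not shown in the README.
HEADING_PATTERNS = [
    re.compile(r"^\d+(\.\d+)*\.?\s+\S"),  # "1 Intro", "2.1 Scope", "3.2.1. Details"
    re.compile(r"^(chapter|section|appendix)\s+\w+", re.IGNORECASE),
]

BOLD_FLAG = 1 << 4  # PyMuPDF marks bold spans with bit 4 of "flags"


def looks_like_heading(text, size, flags, mean_size):
    """Return the names of the heading signals that fire for one span."""
    signals = []
    stripped = text.strip()
    if any(p.match(stripped) for p in HEADING_PATTERNS):
        signals.append("pattern")
    if size > mean_size:  # assumed rule: larger than the document mean
        signals.append("font_size")
    if flags & BOLD_FLAG:
        signals.append("bold")
    return signals
```

A span that triggers several signals at once is a stronger heading candidate than one that triggers only a single signal.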

2. Statistical Feature Extraction

  • Calculates mean, median, and standard deviation of font sizes across the document
  • Identifies the most common font family to distinguish headings from body text
  • Uses these statistics to set adaptive thresholds for different document types
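
The statistics above could be computed along these lines. This is a sketch only: the span record shape (`size`, `font`) and the "mean plus one standard deviation" threshold are assumptions, not the project's confirmed rule.

```python
import statistics
from collections import Counter


def font_statistics(spans):
    """Summarise font sizes and the dominant font for a list of span records."""
    sizes = [s["size"] for s in spans]
    mean = statistics.mean(sizes)
    stdev = statistics.pstdev(sizes)  # population standard deviation
    return {
        "mean": mean,
        "median": statistics.median(sizes),
        "stdev": stdev,
        # The most frequent font is assumed to be the body-text font.
        "common_font": Counter(s["font"] for s in spans).most_common(1)[0][0],
        # Assumed adaptive threshold: one standard deviation above the mean.
        "heading_threshold": mean + stdev,
    }
```

Because the threshold is derived from each document's own distribution, a report set in 9 pt body text and a slide deck set in 18 pt body text each get a cutoff suited to their scale.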

3. Machine Learning-Inspired Clustering

  • Uses K-Means clustering to automatically determine heading levels based on font sizes
  • Assigns H1, H2, H3 levels based on relative font sizes within the document
  • Handles documents with varying font size distributions
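
To show the idea of clustering font sizes into heading levels, the sketch below uses a tiny, dependency-free 1-D k-means (Lloyd's algorithm) as a stand-in for scikit-learn's `KMeans`, which the project itself uses. The function names and the deterministic seeding are assumptions made for illustration.

```python
def kmeans_1d(values, k, iterations=50):
    """Tiny 1-D Lloyd's algorithm; a stand-in for sklearn.cluster.KMeans."""
    distinct = sorted(set(values))
    k = min(k, len(distinct))  # cannot have more clusters than distinct sizes
    if k == 1:
        centroids = [distinct[0]]
    else:
        # Seed centroids evenly across the sorted distinct values (deterministic).
        centroids = [distinct[round(i * (len(distinct) - 1) / (k - 1))] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        updated = [sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if updated == centroids:  # converged
            break
        centroids = updated
    return centroids


def assign_levels(heading_sizes):
    """Map each font size to H1/H2/H3, with the largest cluster centre as H1."""
    centroids = sorted(kmeans_1d(heading_sizes, 3), reverse=True)
    levels = {}
    for size in set(heading_sizes):
        nearest = min(range(len(centroids)), key=lambda i: abs(size - centroids[i]))
        levels[size] = f"H{nearest + 1}"
    return levels
```

Capping `k` at the number of distinct sizes means a document with only two heading sizes yields H1 and H2 rather than an empty third cluster.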

4. Robust Title Extraction

  • First checks PDF metadata for embedded title information
  • Falls back to finding the largest text on the first page
  • Uses multiple heuristics to filter out page numbers and other non-title content
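
The fallback chain above might look like the following sketch. It takes the metadata title (with PyMuPDF this would typically come from `doc.metadata.get("title")`) and the first page's span records. The function name `pick_title` and the specific filters are assumptions; the README does not list the actual heuristics.

```python
import re


def pick_title(metadata_title, first_page_spans):
    """Prefer the metadata title; otherwise use the largest plausible text on page 1."""
    if metadata_title and metadata_title.strip():
        return metadata_title.strip()

    def plausible(text):
        text = text.strip()
        # Assumed filters: drop tiny fragments, bare numbers, and "Page 3 of 10" labels.
        if len(text) < 3:
            return False
        return not re.fullmatch(r"(page\s+)?\d+(\s*(of|/)\s*\d+)?", text, re.IGNORECASE)

    candidates = [s for s in first_page_spans if plausible(s["text"])]
    if not candidates:
        return ""
    return max(candidates, key=lambda s: s["size"])["text"].strip()
```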

5. Confidence Scoring and Deduplication

  • Each detected heading receives a confidence score based on multiple factors
  • Duplicate headings detected by multiple methods receive boosted confidence scores
  • Only headings above a confidence threshold are included in the final output
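
One way to picture the scoring and deduplication step is sketched below. The per-method weights, the threshold of 0.6, and the additive boost are all hypothetical values chosen for illustration; the README does not state the actual numbers or formula.

```python
# Hypothetical weights and threshold; the project's real values are not documented.
METHOD_WEIGHTS = {"font_size": 0.4, "bold": 0.3, "pattern": 0.5, "position": 0.2}
CONFIDENCE_THRESHOLD = 0.6


def merge_detections(detections):
    """Merge (text, page, method) detections; agreement between methods raises the score."""
    merged = {}
    for text, page, method in detections:
        # Normalise whitespace and case so the same heading found twice shares one key.
        key = (" ".join(text.split()).lower(), page)
        entry = merged.setdefault(key, {"text": text.strip(), "page": page, "methods": set()})
        entry["methods"].add(method)

    results = []
    for entry in merged.values():
        score = min(1.0, sum(METHOD_WEIGHTS.get(m, 0.0) for m in entry["methods"]))
        if score >= CONFIDENCE_THRESHOLD:
            results.append(
                {"text": entry["text"], "page": entry["page"], "confidence": round(score, 2)}
            )
    return results
```

In this sketch a heading found only because it is bold scores 0.3 and is dropped, while one matched by both a numbering pattern and a large font scores 0.9 and is kept.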

Models and Libraries Used

  • PyMuPDF (fitz): High-performance PDF text extraction with detailed font information
  • scikit-learn: K-Means clustering for heading level assignment
  • NumPy: Numerical operations for statistical analysis
  • Built-in Python libraries: re, statistics, collections for text processing
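
For context on the "detailed font information" PyMuPDF provides: `page.get_text("dict")` returns nested blocks, lines, and spans, where each span carries `text`, `size`, `font`, and a `flags` bit field. The helper below flattens that structure into simple records. The function name `flatten_spans`, the record shape, and the sample path in the usage comment are assumptions for illustration.

```python
def flatten_spans(page_dict, page_number):
    """Flatten the dict returned by PyMuPDF's page.get_text("dict") into span records."""
    spans = []
    for block in page_dict.get("blocks", []):
        for line in block.get("lines", []):  # image blocks carry no "lines" key
            for span in line.get("spans", []):
                text = span["text"].strip()
                if not text:
                    continue  # skip whitespace-only spans
                spans.append({
                    "text": text,
                    "size": span["size"],
                    "font": span["font"],
                    "bold": bool(span["flags"] & 16),   # bit 4 = bold
                    "italic": bool(span["flags"] & 2),  # bit 1 = italic
                    "page": page_number,
                })
    return spans


# Usage with PyMuPDF installed (hypothetical path, not executed here):
#   import fitz
#   with fitz.open("input/sample.pdf") as doc:
#       all_spans = [s for i, page in enumerate(doc, start=1)
#                    for s in flatten_spans(page.get_text("dict"), i)]
```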

Key Features

  • No external model dependencies: Uses algorithmic approach with statistical analysis
  • Fast processing: Optimized for the 10-second constraint on 50-page PDFs
  • Memory efficient: Processes documents incrementally
  • Multilingual support: Works with any Unicode text, including Japanese and other languages
  • Robust error handling: Gracefully handles malformed PDFs and edge cases

Build and Run Instructions

Docker Build

docker build --platform linux/amd64 -t pdf-outline-extractor:latest .

Docker Run

docker run --rm -v "$(pwd)/input:/app/input" -v "$(pwd)/output:/app/output" --network none pdf-outline-extractor:latest

Local Development

pip install -r requirements.txt
python extract_outline.py input_directory output_directory

Performance Characteristics

  • Speed: Processes typical documents in 1-3 seconds
  • Memory: Uses approximately 50-100MB RAM for large documents
  • Accuracy: Achieves high precision and recall across diverse document types
  • Size: Total container size is approximately 150MB

Architecture Details

The solution is designed with modularity in mind for easy extension to Round 1B:

  1. PDFHeadingExtractor Class: Main extraction engine
  2. Feature Extraction Module: Statistical analysis of document properties
  3. Detection Modules: Separate methods for different heading detection strategies
  4. Level Assignment Module: Intelligent clustering-based level determination
  5. Output Formatting: Clean JSON output matching required specifications

This modular design allows for easy extension and integration with the Round 1B persona-driven document intelligence system.
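
As an illustration of the output formatting stage, the sketch below serialises a title and a list of headings to JSON. The schema (a `title` string plus an `outline` list of `level`, `text`, `page` entries) is an assumption about the required specification, which this README does not reproduce; the function names are likewise hypothetical.

```python
import json


def build_outline(title, headings):
    """Assemble the output structure, keeping only the assumed schema fields."""
    ordered = sorted(headings, key=lambda h: h["page"])  # stable sort keeps in-page order
    return {
        "title": title,
        "outline": [
            {"level": h["level"], "text": h["text"], "page": h["page"]} for h in ordered
        ],
    }


def to_json(outline):
    """Serialise without escaping non-ASCII text, so Japanese headings stay readable."""
    return json.dumps(outline, ensure_ascii=False, indent=2)
```

Passing `ensure_ascii=False` keeps Unicode headings as literal characters in the file instead of `\uXXXX` escapes.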
