PDF Outline Extractor

This solution extracts structured outlines (Title + H1/H2/H3 headings) from PDF documents for the Adobe India Hackathon Round 1A.

Approach

Our solution uses a multi-layered approach to achieve high accuracy:

1. Multi-Method Heading Detection

Font Size Analysis: Statistical analysis to identify text blocks with above-average font sizes
Font Style Detection: Identifies bold, italic, and other styled text that typically indicates headings
Pattern Recognition: Uses regex patterns to detect numbered sections, chapter titles, and other common heading formats
Position Analysis: Considers vertical spacing and alignment to identify potential headings
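
As a rough illustration of the pattern-recognition step, the sketch below matches numbered sections and keyword-led titles with regular expressions. The pattern list, the 120-character cutoff, and the function names are assumptions made for this example, not the patterns used in extract_outline.py.

```python
import re

# Hypothetical patterns for common heading formats (assumed, not the project's own).
HEADING_PATTERNS = [
    # Numeric prefixes such as "1 Intro", "1. Intro", "2.3.1 Results"
    (re.compile(r"^\d+(?:\.\d+)*\.?\s+\S"), "numbered"),
    # Keyword-led titles such as "Chapter 4", "Appendix B"
    (re.compile(r"^(?:chapter|section|part|appendix)\s+\w+", re.IGNORECASE), "keyword"),
]


def match_heading_pattern(text):
    """Return the name of the first matching pattern, or None."""
    stripped = text.strip()
    # Very long lines are unlikely to be headings (assumed cutoff).
    if not stripped or len(stripped) > 120:
        return None
    for pattern, name in HEADING_PATTERNS:
        if pattern.match(stripped):
            return name
    return None


def numbering_depth(text):
    """Depth implied by a numeric prefix, e.g. '2.3.1 Results' gives 3; 0 if none."""
    m = re.match(r"^(\d+(?:\.\d+)*)\.?\s", text.strip())
    return m.group(1).count(".") + 1 if m else 0
```

A regex match alone is weak evidence (a body sentence can also start with a number), which is why it makes sense to combine it with the font and position cues listed above.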

2. Statistical Feature Extraction

Calculates mean, median, and standard deviation of font sizes across the document
Identifies the most common font family to distinguish headings from body text
Uses these statistics to set adaptive thresholds for different document types
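
The statistics above could be computed along these lines. This is a minimal sketch: the span dictionaries (with 'size' and 'font' keys) mirror the kind of per-span data PyMuPDF reports, and the "median plus k standard deviations" rule is an assumed example of an adaptive threshold, not necessarily the rule the extractor applies.

```python
import statistics
from collections import Counter


def font_statistics(spans):
    """Summarise font usage for a non-empty list of spans.

    Each span is a dict with 'size' and 'font' keys (assumed shape).
    """
    sizes = [s["size"] for s in spans]
    return {
        "mean": statistics.mean(sizes),
        "median": statistics.median(sizes),
        "stdev": statistics.pstdev(sizes),  # population standard deviation
        # The most frequent font family is treated as the body font.
        "body_font": Counter(s["font"] for s in spans).most_common(1)[0][0],
    }


def heading_size_threshold(stats, k=1.0):
    """Adaptive cutoff: median plus k standard deviations (k is an assumed knob)."""
    return stats["median"] + k * stats["stdev"]
```

Because the cutoff is derived from each document's own distribution, a report set in 9 pt body text and a slide deck set in 18 pt text get different thresholds without manual tuning.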

3. Machine Learning-Inspired Clustering

Uses K-Means clustering to automatically determine heading levels based on font sizes
Assigns H1, H2, H3 levels based on relative font sizes within the document
Handles documents with varying font size distributions
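
To show the idea without pulling in dependencies, the sketch below clusters heading font sizes with a small hand-written 1-D K-Means (Lloyd's algorithm) and maps the clusters to H1, H2, H3 from largest to smallest. The project itself uses scikit-learn's K-Means; this pure-Python stand-in and its function names are assumptions made for illustration only.

```python
def kmeans_1d(values, k, iterations=50):
    """Tiny 1-D Lloyd's algorithm; returns the cluster centres in ascending order."""
    distinct = sorted(set(values))
    k = min(k, len(distinct))  # never ask for more clusters than distinct sizes
    if k <= 1:
        centres = distinct[:1]
    else:
        # Spread the initial centres evenly across the distinct sizes.
        step = (len(distinct) - 1) / (k - 1)
        centres = [distinct[round(i * step)] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in centres]
        for v in values:
            nearest = min(range(len(centres)), key=lambda i: abs(v - centres[i]))
            groups[nearest].append(v)
        # Recompute each centre as the mean of its group; keep empty groups in place.
        updated = [sum(g) / len(g) if g else c for g, c in zip(groups, centres)]
        if updated == centres:
            break
        centres = updated
    return sorted(centres)


def assign_levels(heading_sizes, max_levels=3):
    """Map each font size to 'H1'..'H3'; the largest cluster becomes H1."""
    centres = kmeans_1d(heading_sizes, max_levels)[::-1]  # largest first
    levels = {}
    for size in set(heading_sizes):
        idx = min(range(len(centres)), key=lambda i: abs(size - centres[i]))
        levels[size] = f"H{idx + 1}"
    return levels
```

Capping the number of clusters at the number of distinct sizes keeps the sketch well-behaved on documents that use only one or two heading sizes.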

4. Robust Title Extraction

First checks PDF metadata for embedded title information
Falls back to finding the largest text on the first page
Uses multiple heuristics to filter out page numbers and other non-title content
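
The fallback chain can be sketched as follows. The function takes plain data so it runs without a PDF: a metadata dict (PyMuPDF exposes one as doc.metadata, with a "title" key) and a list of first-page spans with 'text' and 'size' keys (available via page.get_text("dict")). The placeholder-title list and the page-number filter are assumptions for this example.

```python
import re

# Titles that authoring tools often leave behind (assumed list).
PLACEHOLDER_TITLES = {"untitled", "document"}


def pick_title(metadata, first_page_spans):
    """Prefer the metadata title; otherwise use the largest text on page one."""
    title = (metadata.get("title") or "").strip()
    if title and title.lower() not in PLACEHOLDER_TITLES:
        return title

    candidates = []
    for span in first_page_spans:
        text = span["text"].strip()
        if len(text) < 3:
            continue  # skip stray glyphs and bullets
        if re.fullmatch(r"(?:page\s*)?\d+", text, re.IGNORECASE):
            continue  # skip page numbers such as "12" or "Page 12"
        candidates.append({"text": text, "size": span["size"]})
    if not candidates:
        return ""

    largest = max(c["size"] for c in candidates)
    # Join every span at the largest size so multi-line titles stay whole.
    return " ".join(c["text"] for c in candidates if c["size"] == largest)
```

Joining all spans that share the largest size is one way to keep a title that wraps across two lines from being truncated to its first line.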

5. Confidence Scoring and Deduplication

Each detected heading receives a confidence score based on multiple factors
A heading detected by more than one method is merged into a single entry and receives a boosted confidence score
Only headings above a confidence threshold are included in the final output
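
A minimal sketch of the merge-boost-filter step is shown below. The candidate fields ('text', 'page', 'method', 'confidence') and the boost and threshold values are illustrative assumptions; the actual scoring factors are not spelled out in this README.

```python
def merge_candidates(candidates, boost=0.15, threshold=0.5):
    """Merge duplicate detections, boost agreement, and drop weak candidates.

    candidates: list of dicts with 'text', 'page', 'method', 'confidence'.
    boost and threshold are illustrative values, not the project's.
    """
    merged = {}
    for c in candidates:
        # Same page and same text (ignoring case and spacing) counts as a duplicate.
        key = (c["page"], " ".join(c["text"].lower().split()))
        entry = merged.get(key)
        if entry is None:
            merged[key] = {
                "text": c["text"].strip(),
                "page": c["page"],
                "confidence": c["confidence"],
                "methods": {c["method"]},
            }
        else:
            entry["confidence"] = max(entry["confidence"], c["confidence"])
            entry["methods"].add(c["method"])

    results = []
    for entry in merged.values():
        # Each additional agreeing method adds a fixed boost, capped at 1.0.
        extra = boost * (len(entry["methods"]) - 1)
        entry["confidence"] = min(1.0, entry["confidence"] + extra)
        if entry["confidence"] >= threshold:
            results.append(entry)
    return results
```

With these example values, a heading seen by two methods at 0.45 and 0.40 ends up at 0.60 and is kept, while a single-method candidate at 0.30 is dropped.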

Models and Libraries Used

PyMuPDF (fitz): High-performance PDF text extraction with detailed font information
scikit-learn: K-Means clustering for heading level assignment
NumPy: Numerical operations for statistical analysis
Built-in Python libraries: re, statistics, collections for text processing

Key Features

No external model dependencies: Uses an algorithmic approach based on statistical analysis
Fast processing: Optimized for the 10-second constraint on 50-page PDFs
Memory efficient: Processes documents incrementally
Multilingual support: Works with any Unicode text, including Japanese and other languages
Robust error handling: Gracefully handles malformed PDFs and edge cases

Build and Run Instructions

Docker Build

docker build --platform linux/amd64 -t pdf-outline-extractor:latest .

Docker Run

docker run --rm -v "$(pwd)/input:/app/input" -v "$(pwd)/output:/app/output" --network none pdf-outline-extractor:latest

Local Development

pip install -r requirements.txt
python extract_outline.py input_directory output_directory

Performance Characteristics

Speed: Processes typical documents in 1-3 seconds
Memory: Uses approximately 50-100MB RAM for large documents
Accuracy: Achieves high precision and recall across diverse document types
Size: Total container size is approximately 150MB

Architecture Details

The solution is designed with modularity in mind for easy extension to Round 1B:

PDFHeadingExtractor Class: Main extraction engine
Feature Extraction Module: Statistical analysis of document properties
Detection Modules: Separate methods for different heading detection strategies
Level Assignment Module: Intelligent clustering-based level determination
Output Formatting: Clean JSON output matching required specifications

This modular design allows for easy extension and integration with the Round 1B persona-driven document intelligence system.
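
As an example of how the output-formatting step might look, the sketch below serialises a title and a list of headings to JSON. The schema (a "title" string plus an "outline" list of level/text/page entries) and the function name are assumptions for illustration; the challenge specification is the authority on the required format.

```python
import json


def format_outline(title, headings):
    """Serialise the outline to a JSON string.

    headings: list of dicts with 'level', 'text', and 'page' keys (assumed shape).
    """
    outline = [
        {"level": h["level"], "text": h["text"], "page": h["page"]}
        # sorted() is stable, so headings on the same page keep their order.
        for h in sorted(headings, key=lambda h: h["page"])
    ]
    # ensure_ascii=False keeps non-Latin text (e.g. Japanese) readable in the file.
    return json.dumps({"title": title, "outline": outline}, ensure_ascii=False, indent=2)
```

Keeping serialisation in its own function means the detection and level-assignment steps can be reused unchanged if a later round needs a different output shape.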

Mahatva777/autopdf1a
