This repository contains solutions for the Adobe India Hackathon "Connecting the Dots" challenge: PDF outline extraction (Challenge 1A) and persona-driven analysis of PDF collections (Challenge 1B).
Location: Challenge_1a/
- Extracts structured outlines (Title, H1, H2, H3 headings) from PDF documents
- Processes up to 50 pages per PDF in under 10 seconds
- Outputs hierarchical JSON format with page numbers
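As a rough illustration of the idea, the sketch below classifies text spans into H1/H2/H3 by font size and emits an outline with page numbers. It is not the logic in `process_pdfs.py`: the function name `build_outline`, the "most common font size is body text" heuristic, and the JSON keys (`title`, `outline`, `level`, `text`, `page`) are all assumptions made for this example. The `(text, font_size, page)` tuples could be collected, for example, from the spans returned by PyMuPDF's `page.get_text("dict")`.

```python
import json
from collections import Counter


def build_outline(spans, title=""):
    """Classify spans into H1/H2/H3 by font size (hypothetical heuristic).

    spans: list of (text, font_size, page) tuples.
    The most frequent font size is treated as body text; the three
    largest sizes above it map to H1, H2, H3 in descending order.
    """
    if not spans:
        return {"title": title, "outline": []}

    size_counts = Counter(size for _, size, _ in spans)
    body_size = size_counts.most_common(1)[0][0]

    # Keep at most three distinct sizes larger than the body size.
    heading_sizes = sorted(
        (s for s in size_counts if s > body_size), reverse=True
    )[:3]
    level_for = {size: f"H{i + 1}" for i, size in enumerate(heading_sizes)}

    outline = [
        {"level": level_for[size], "text": text.strip(), "page": page}
        for text, size, page in spans
        if size in level_for
    ]
    return {"title": title, "outline": outline}


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample = [
        ("Introduction", 18.0, 1),
        ("Body text on page one.", 11.0, 1),
        ("Background", 14.0, 1),
        ("More body text.", 11.0, 2),
        ("Prior Work", 12.0, 2),
        ("Even more body text.", 11.0, 2),
        ("Methods", 18.0, 3),
    ]
    print(json.dumps(build_outline(sample, title="Sample"), indent=2))
```

A size-only heuristic like this tends to produce false positives on real documents (captions, pull quotes, bold body text), so it would need to be combined with other signals in practice.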
Location: Challenge_1b/
- Analyzes collections of 3-10 related PDFs based on specific personas and job requirements
- Extracts and ranks relevant sections and subsections
- Processes document collections in under 60 seconds
- Docker installed on your system
- Git for version control
# Build the Docker image for Challenge 1A
docker build --platform linux/amd64 -t challenge1a:latest Challenge_1a/
# Run the container (paths are quoted so directories containing spaces work)
docker run --rm -v "$(pwd)/input:/app/input" -v "$(pwd)/output:/app/output" --network none challenge1a:latest
# Build the Docker image for Challenge 1B
docker build --platform linux/amd64 -t challenge1b:latest Challenge_1b/
# Run the container (paths are quoted so directories containing spaces work)
docker run --rm -v "$(pwd)/input:/app/input" -v "$(pwd)/output:/app/output" --network none challenge1b:latest
- Architecture: AMD64 (x86_64)
- CPU and memory: 8 cores, 16 GB RAM
- Storage: Sufficient space for Docker images and temporary files
- Network: No internet access required during execution
- Challenge 1A: ≤10 seconds for 50-page PDFs, ≤200MB model size
- Challenge 1B: ≤60 seconds for 3-5 documents, ≤1GB model size
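One way to check the time limits above during local development is to wrap the processing entry point in a small timer. This is a minimal sketch: `run_within_budget` and `TIME_BUDGETS` are hypothetical names introduced here, not part of either challenge's code, and local timings are only indicative unless the machine matches the stated 8-core, 16GB configuration.

```python
import time

# Budgets in seconds, taken from the performance constraints listed above.
TIME_BUDGETS = {"1a": 10.0, "1b": 60.0}


def run_within_budget(func, challenge, *args, **kwargs):
    """Run func and report whether it finished inside the challenge budget.

    Returns a (result, elapsed_seconds, within_budget) tuple.
    """
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed, elapsed <= TIME_BUDGETS[challenge]


if __name__ == "__main__":
    # Stand-in workload; replace with the real processing function.
    _, elapsed, ok = run_within_budget(sum, "1a", range(1_000_000))
    print(f"elapsed={elapsed:.3f}s within_budget={ok}")
```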
├── README.md # This file - Main project overview
├── Challenge_1a/ # PDF Outline Extraction
│ ├── Dockerfile # Docker configuration
│ ├── process_pdfs.py # Main processing script
│ ├── requirements.txt # Python dependencies
│ ├── README.md # Detailed documentation
│ └── sample_dataset/ # Test data and outputs
├── Challenge_1b/ # Persona-Driven Analysis
│ ├── Dockerfile # Docker configuration
│ ├── main.py # Main processing script
│ ├── requirements.txt # Python dependencies
│ ├── README.md # Detailed documentation
│ ├── approach_explanation.md # Methodology explanation
│ └── Collection_*/ # Test collections
└── .gitignore # Git ignore file
- Robust PDF parsing using PyMuPDF
- Intelligent heading detection with multiple strategies
- Conservative pattern matching to avoid false positives
- Hierarchical outline generation with page numbers
- Fully offline processing
- Semantic similarity analysis using sentence transformers
- Persona-driven content relevance scoring
- Multi-document collection processing
- Intelligent section ranking and extraction
- Comprehensive metadata tracking
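The persona-driven ranking idea can be sketched as scoring each section against a query built from the persona and the job to be done, then sorting by score. To stay dependency-free, the example below swaps in a simple bag-of-words cosine similarity as a stand-in for sentence-transformer embeddings. The function `rank_sections` and the field names (`document`, `section_title`, `text`, `importance_rank`, `score`) are assumptions for illustration, not the schema used by `main.py`.

```python
import math
import re
from collections import Counter


def _vectorize(text):
    """Lowercased word counts; a crude stand-in for a sentence embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_sections(persona, job, sections):
    """Rank sections by similarity to the combined persona and job text."""
    query = _vectorize(f"{persona} {job}")
    scored = [(_cosine(query, _vectorize(s["text"])), s) for s in sections]
    # Stable sort: sections with equal scores keep their input order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [
        {
            "document": s["document"],
            "section_title": s["section_title"],
            "importance_rank": rank,
            "score": round(score, 4),
        }
        for rank, (score, s) in enumerate(scored, start=1)
    ]
```

With real embeddings, only `_vectorize` and `_cosine` would change; the scoring and ranking loop stays the same.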
- ✅ AMD64 Docker compatibility
- ✅ No internet access required
- ✅ CPU-only processing (no GPU dependencies)
- ✅ Model size within limits (≤200MB for Challenge 1A, ≤1GB for Challenge 1B)
- ✅ Runtime within limits (≤10 seconds for Challenge 1A, ≤60 seconds for Challenge 1B)
- ✅ Modular, reusable code architecture
Each challenge directory contains detailed documentation:
- Challenge_1a/README.md: Complete implementation guide
- Challenge_1b/README.md: Comprehensive usage instructions
- Challenge_1b/approach_explanation.md: Methodology explanation
Both challenges include comprehensive test datasets:
- Challenge_1a: 6 sample PDFs with expected outputs
- Challenge_1b: 3 diverse collections (Travel, Adobe Learning, Recipes)
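A generated outline can be compared against an expected output with a small helper like the one below. This is a sketch under assumptions: it supposes each output is a JSON object with an `outline` list whose entries carry `level`, `text`, and `page` keys, and `outline_match_ratio` is a hypothetical name, not a script shipped in the sample datasets.

```python
def outline_match_ratio(expected, actual):
    """Fraction of expected headings found in actual.

    A heading matches when level, stripped text, and page are all equal.
    Returns 1.0 when the expected outline is empty.
    """
    def key(entry):
        return (entry["level"], entry["text"].strip(), entry["page"])

    expected_keys = [key(e) for e in expected.get("outline", [])]
    if not expected_keys:
        return 1.0
    actual_keys = {key(e) for e in actual.get("outline", [])}
    return sum(k in actual_keys for k in expected_keys) / len(expected_keys)
```

In practice the two dictionaries would be loaded with `json.load` from an expected file and the corresponding file written to the output directory.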
For questions or issues, please refer to the individual challenge documentation or contact the project maintainer.
Note: This repository should remain private until the competition deadline as per hackathon guidelines.