Adobe India Hackathon 2025 - "Connecting the Dots"

Project Overview

This repository contains solutions for the two parts of the Adobe India Hackathon "Connecting the Dots" challenge: structured outline extraction from single PDFs (Challenge 1A) and persona-driven analysis of PDF collections (Challenge 1B).

Challenge Structure

Challenge 1A: PDF Outline Extraction

Location: Challenge_1a/

Extracts structured outlines (Title, H1, H2, H3 headings) from PDF documents
Processes up to 50 pages per PDF in under 10 seconds
Outputs hierarchical JSON format with page numbers
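
To make the output described above concrete, here is a minimal sketch of serializing a title and a list of detected headings to JSON. The field names (`title`, `outline`, `level`, `text`, `page`) and the helper `build_outline` are assumptions made for illustration; the schema the project actually emits may differ, and Challenge_1a/README.md is the place to check.

```python
import json

# Heading levels the outline is expected to contain (from the description above).
ALLOWED_LEVELS = ("H1", "H2", "H3")


def build_outline(title, headings):
    """Return a JSON string holding the title and an ordered heading list.

    `headings` is an iterable of (level, text, page) tuples. Entries whose
    level is not H1, H2, or H3 are skipped. All names are hypothetical.
    """
    outline = [
        {"level": level, "text": text.strip(), "page": page}
        for level, text, page in headings
        if level in ALLOWED_LEVELS
    ]
    return json.dumps({"title": title, "outline": outline}, indent=2, ensure_ascii=False)


if __name__ == "__main__":
    print(build_outline("Sample Document", [("H1", "Introduction", 1), ("H2", "Scope", 2)]))
```

The hierarchy is carried by the `level` field of each entry rather than by nesting, which keeps the document order of headings explicit.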

Challenge 1B: Persona-Driven Document Intelligence

Location: Challenge_1b/

Analyzes collections of 3-10 related PDFs based on specific personas and job requirements
Extracts and ranks relevant sections and subsections
Processes document collections in under 60 seconds

Quick Start

Prerequisites

Docker installed on your system
Git for version control

Building and Running

Challenge 1A

# Build the Docker image
docker build --platform linux/amd64 -t challenge1a:latest Challenge_1a/

# Run the container
docker run --rm -v "$(pwd)/input:/app/input" -v "$(pwd)/output:/app/output" --network none challenge1a:latest

Challenge 1B

# Build the Docker image
docker build --platform linux/amd64 -t challenge1b:latest Challenge_1b/

# Run the container
docker run --rm -v "$(pwd)/input:/app/input" -v "$(pwd)/output:/app/output" --network none challenge1b:latest

Technical Specifications

System Requirements

Architecture: AMD64 (x86_64)
CPU and memory: 8 cores, 16 GB RAM
Storage: Sufficient space for Docker images and temporary files
Network: No internet access required during execution

Performance Constraints

Challenge 1A: ≤10 seconds for 50-page PDFs, ≤200MB model size
Challenge 1B: ≤60 seconds for 3-5 documents, ≤1GB model size
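
One simple way to check a run against these time limits locally is to time the processing call and compare it to the budget. The sketch below is an assumption about how such a check could look; `run_within_budget` is a hypothetical helper, not part of the repository.

```python
import time


def run_within_budget(task, budget_seconds):
    """Run `task()` and report whether it finished within `budget_seconds`.

    Returns (result, elapsed_seconds, within_budget). Hypothetical helper.
    """
    start = time.perf_counter()
    result = task()
    elapsed = time.perf_counter() - start
    return result, elapsed, elapsed <= budget_seconds


if __name__ == "__main__":
    # Stand-in workload; replace with a call that processes one PDF.
    _, elapsed, ok = run_within_budget(lambda: sum(range(1_000_000)), budget_seconds=10.0)
    print(f"elapsed={elapsed:.3f}s within_budget={ok}")
```

Timings measured this way depend on the host machine, so they are only indicative unless taken on hardware matching the system requirements above.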

Project Structure

├── README.md                    # This file - Main project overview
├── Challenge_1a/               # PDF Outline Extraction
│   ├── Dockerfile              # Docker configuration
│   ├── process_pdfs.py         # Main processing script
│   ├── requirements.txt        # Python dependencies
│   ├── README.md              # Detailed documentation
│   └── sample_dataset/        # Test data and outputs
├── Challenge_1b/               # Persona-Driven Analysis
│   ├── Dockerfile              # Docker configuration
│   ├── main.py                 # Main processing script
│   ├── requirements.txt        # Python dependencies
│   ├── README.md              # Detailed documentation
│   ├── approach_explanation.md # Methodology explanation
│   └── Collection_*/          # Test collections
└── .gitignore                 # Git ignore file

Key Features

Challenge 1A Features

Robust PDF parsing using PyMuPDF
Intelligent heading detection with multiple strategies
Conservative pattern matching to avoid false positives
Hierarchical outline generation with page numbers
Fully offline processing
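
As an illustration of one possible heading-detection strategy, the sketch below assigns heading levels by font size relative to the body text. This is an assumed approach, not necessarily the one used in process_pdfs.py. It works on plain dictionaries so it runs without a PDF; in PyMuPDF, comparable spans come from `page.get_text("dict")`, whose span entries carry `text` and `size` fields. The `page` key and the function name `classify_headings` are hypothetical.

```python
from collections import Counter


def classify_headings(spans, max_levels=3):
    """Label spans larger than the body text as H1..H3, by descending font size.

    `spans` is a list of dicts with keys "text", "size", "page" (hypothetical
    layout). The body size is the font size covering the most characters.
    """
    if not spans:
        return []

    # Weight each font size by how much text is set in it.
    size_weight = Counter()
    for span in spans:
        size_weight[round(span["size"], 1)] += len(span["text"])
    body_size = size_weight.most_common(1)[0][0]

    # Distinct sizes above the body size, largest first, capped at max_levels.
    heading_sizes = sorted((s for s in size_weight if s > body_size), reverse=True)[:max_levels]
    level_for_size = {size: f"H{i + 1}" for i, size in enumerate(heading_sizes)}

    headings = []
    for span in spans:
        level = level_for_size.get(round(span["size"], 1))
        text = span["text"].strip()
        if level and text:
            headings.append({"level": level, "text": text, "page": span["page"]})
    return headings
```

Font size alone misclassifies bold same-size headings and large decorative text, which is presumably why combining several strategies with conservative pattern matching is useful.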

Challenge 1B Features

Semantic similarity analysis using sentence transformers
Persona-driven content relevance scoring
Multi-document collection processing
Intelligent section ranking and extraction
Comprehensive metadata tracking
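
The relevance-scoring idea can be sketched as follows: build a query from the persona and the job, score each section against it, and sort. To stay dependency-free, this sketch swaps in a bag-of-words cosine similarity in place of sentence-transformer embeddings, so it is a stand-in and not the project's method. The function `rank_sections` and the keys `document`, `title`, `text`, `importance_rank`, and `score` are assumptions for illustration.

```python
import math
import re
from collections import Counter


def _vectorize(text):
    """Lowercase word counts; a crude substitute for a sentence embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rank_sections(persona, job, sections):
    """Rank sections by similarity to the persona and job description.

    `sections` is a list of dicts with hypothetical keys "document", "title", "text".
    """
    query = _vectorize(f"{persona} {job}")
    scored = [(_cosine(query, _vectorize(s["text"])), s) for s in sections]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [
        {
            "document": s["document"],
            "title": s["title"],
            "importance_rank": rank,
            "score": round(score, 4),
        }
        for rank, (score, s) in enumerate(scored, start=1)
    ]
```

With real embeddings, only `_vectorize` and `_cosine` would change; the query construction and ranking step would stay the same.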

Compliance

✅ AMD64 Docker compatibility
✅ No internet access required
✅ CPU-only processing (no GPU dependencies)
✅ Model size constraints met
✅ Runtime performance requirements satisfied
✅ Modular, reusable code architecture

Documentation

Each challenge directory contains detailed documentation:

Challenge_1a/README.md: Complete implementation guide
Challenge_1b/README.md: Comprehensive usage instructions
Challenge_1b/approach_explanation.md: Methodology explanation

Testing

Both challenges include comprehensive test datasets:

Challenge_1a: 6 sample PDFs with expected outputs
Challenge_1b: 3 diverse collections (Travel, Adobe Learning, Recipes)

Contact

For questions or issues, please refer to the individual challenge documentation or contact the project maintainer.

Note: This repository should remain private until the competition deadline as per hackathon guidelines.
