pradyumna0104/Adobe-India-Hackathon25

Adobe India Hackathon 2025 - "Connecting the Dots"

Project Overview

This repository contains solutions for the Adobe India Hackathon "Connecting the Dots" challenge: PDF outline extraction (Challenge 1A) and persona-driven analysis of PDF collections (Challenge 1B).

Challenge Structure

Challenge 1A: PDF Outline Extraction

Location: Challenge_1a/

  • Extracts structured outlines (Title, H1, H2, H3 headings) from PDF documents
  • Processes up to 50 pages per PDF in under 10 seconds
  • Outputs hierarchical JSON format with page numbers
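
As a rough illustration of the output described above, the following sketch assembles an outline and serializes it to JSON. The field names (`title`, `outline`, `level`, `text`, `page`) and the flat list with a `level` tag are assumptions made for this example; the actual schema is defined in Challenge_1a/README.md and the sample outputs.

```python
import json


def build_outline(title, headings):
    """Assemble an outline dict from (level, text, page) tuples.

    The key names used here are assumed for illustration only.
    """
    allowed = {"H1", "H2", "H3"}
    outline = [
        {"level": level, "text": text.strip(), "page": page}
        for level, text, page in headings
        if level in allowed  # silently drop anything that is not H1-H3
    ]
    return {"title": title.strip(), "outline": outline}


# Hypothetical headings, not taken from the sample dataset.
sample = build_outline(
    "Understanding AI",
    [
        ("H1", "Introduction", 1),
        ("H2", "What is AI?", 2),
        ("H3", "History of AI", 3),
    ],
)
outline_json = json.dumps(sample, indent=2)
print(outline_json)
```

In this sketch the hierarchy is carried by the `level` field rather than by nesting, which keeps the output easy to diff against expected results.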

Challenge 1B: Persona-Driven Document Intelligence

Location: Challenge_1b/

  • Analyzes collections of 3-10 related PDFs based on specific personas and job requirements
  • Extracts and ranks relevant sections and subsections
  • Processes document collections in under 60 seconds

Quick Start

Prerequisites

  • Docker installed on your system
  • Git for version control

Building and Running

Challenge 1A

# Build the Docker image
docker build --platform linux/amd64 -t challenge1a:latest Challenge_1a/

# Run the container
docker run --rm -v "$(pwd)/input:/app/input" -v "$(pwd)/output:/app/output" --network none challenge1a:latest

Challenge 1B

# Build the Docker image
docker build --platform linux/amd64 -t challenge1b:latest Challenge_1b/

# Run the container
docker run --rm -v "$(pwd)/input:/app/input" -v "$(pwd)/output:/app/output" --network none challenge1b:latest

Technical Specifications

System Requirements

  • Architecture: AMD64 (x86_64)
  • CPU: 8 cores
  • Memory: 16 GB RAM
  • Storage: Sufficient space for Docker images and temporary files
  • Network: No internet access required during execution

Performance Constraints

  • Challenge 1A: ≤10 seconds for 50-page PDFs, ≤200MB model size
  • Challenge 1B: ≤60 seconds for 3-5 documents, ≤1GB model size
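
One way to check a run against these time limits locally is to wrap the processing call in a timer. This is a minimal sketch, not part of the repository: `run_within_budget` and `fake_process` are hypothetical names, and the stand-in function does no real PDF work.

```python
import time


def run_within_budget(func, budget_seconds, *args, **kwargs):
    """Call func and report whether it finished within budget_seconds."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed, elapsed <= budget_seconds


def fake_process(pages):
    """Hypothetical stand-in for the real per-PDF processing step."""
    return [f"page-{i}" for i in range(pages)]


# 10-second budget for a 50-page document, matching the Challenge 1A limit.
result, elapsed, ok = run_within_budget(fake_process, 10.0, 50)
print(f"processed {len(result)} pages in {elapsed:.4f}s, within budget: {ok}")
```

The timer only measures the wrapped call, so container start-up and model loading would need to be included separately when measuring an end-to-end run.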

Project Structure

├── README.md                    # This file - Main project overview
├── Challenge_1a/               # PDF Outline Extraction
│   ├── Dockerfile              # Docker configuration
│   ├── process_pdfs.py         # Main processing script
│   ├── requirements.txt        # Python dependencies
│   ├── README.md              # Detailed documentation
│   └── sample_dataset/        # Test data and outputs
├── Challenge_1b/               # Persona-Driven Analysis
│   ├── Dockerfile              # Docker configuration
│   ├── main.py                 # Main processing script
│   ├── requirements.txt        # Python dependencies
│   ├── README.md              # Detailed documentation
│   ├── approach_explanation.md # Methodology explanation
│   └── Collection_*/          # Test collections
└── .gitignore                 # Git ignore file

Key Features

Challenge 1A Features

  • Robust PDF parsing using PyMuPDF
  • Intelligent heading detection with multiple strategies
  • Conservative pattern matching to avoid false positives
  • Hierarchical outline generation with page numbers
  • Fully offline processing
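
To illustrate what conservative pattern matching for headings can look like, here is a small sketch that only accepts numbered headings such as "2.1 Background". It is an assumed example of the idea, not the logic in process_pdfs.py; the regular expression, the length cap, and the `classify_heading` helper are all hypothetical.

```python
import re

# Up to three numeric components, an optional trailing dot, then a
# capitalized title of at most 79 characters (assumed limits).
NUMBERED_HEADING = re.compile(r"^(\d+(?:\.\d+){0,2})\.?\s+([A-Z][^\n]{0,78})$")


def classify_heading(line):
    """Return (level, text) for a numbered heading, or None if unsure."""
    text = line.strip()
    match = NUMBERED_HEADING.match(text)
    if not match:
        return None
    # Conservative rule: lines ending in sentence punctuation are
    # treated as body text, even if they start with a number.
    if text.endswith((".", ",", ";", ":")):
        return None
    depth = match.group(1).count(".") + 1  # "2.1" has depth 2
    return (f"H{depth}", match.group(2).strip())


for candidate in [
    "1. Introduction",
    "2.1 Background",
    "2.1.3 Data sources",
    "3 apples were eaten.",
    "1.2.3.4 Too deep",
]:
    print(repr(candidate), "->", classify_heading(candidate))
```

Rejecting ambiguous lines trades some recall for fewer false positives. A font-size or layout-based strategy would be needed to catch unnumbered headings.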

Challenge 1B Features

  • Semantic similarity analysis using sentence transformers
  • Persona-driven content relevance scoring
  • Multi-document collection processing
  • Intelligent section ranking and extraction
  • Comprehensive metadata tracking
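
The ranking idea can be sketched without any model download. The example below scores sections against a persona and job description using cosine similarity over plain word counts, which is a deliberately simple stand-in for sentence-transformer embeddings and not the method in main.py. The function names and the keys `document`, `title`, `text`, `importance_rank`, and `score` are assumptions for illustration.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy 'embedding': lowercase word counts (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank_sections(persona, job, sections):
    """Rank sections by similarity to the combined persona and job text."""
    query = embed(f"{persona} {job}")
    scored = [(cosine(query, embed(s["text"])), s) for s in sections]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [
        {
            "document": s["document"],
            "title": s["title"],
            "importance_rank": rank,
            "score": round(score, 3),
        }
        for rank, (score, s) in enumerate(scored, start=1)
    ]


# Hypothetical sections, not taken from the test collections.
sections = [
    {"document": "recipes.pdf", "title": "Baking bread",
     "text": "Knead the dough and bake the bread"},
    {"document": "travel.pdf", "title": "Coastal trip",
     "text": "Plan a four day trip for a group of friends along the coast"},
]
ranked = rank_sections(
    "Travel planner", "Plan a trip of four days for a group of friends", sections
)
for entry in ranked:
    print(entry)
```

Swapping `embed` for a dense embedding model would leave the ranking loop unchanged, since only the vector representation and similarity function differ.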

Compliance

  • ✅ AMD64 Docker compatibility
  • ✅ No internet access required
  • ✅ CPU-only processing (no GPU dependencies)
  • ✅ Model size constraints met
  • ✅ Runtime performance requirements satisfied
  • ✅ Modular, reusable code architecture

Documentation

Each challenge directory contains detailed documentation:

  • Challenge_1a/README.md: Complete implementation guide
  • Challenge_1b/README.md: Comprehensive usage instructions
  • Challenge_1b/approach_explanation.md: Methodology explanation

Testing

Both challenges include sample datasets for testing:

  • Challenge_1a: 6 sample PDFs with expected outputs
  • Challenge_1b: 3 diverse collections (Travel, Adobe Learning, Recipes)

Contact

For questions or issues, please refer to the individual challenge documentation or contact the project maintainer.


Note: This repository should remain private until the competition deadline as per hackathon guidelines.
