kdt523/adobe-hackathon-1b
Persona-Driven Document Intelligence & Analysis

Solution for the Adobe India Hackathon 2025: Round 1B

A sophisticated document analysis system designed to intelligently extract and prioritize information from a collection of PDFs based on a user's specific persona and job-to-be-done.


🚀 Key Features

  • Dynamic Query Generation: Automatically creates a highly specific, persona-centric query to guide the analysis, including explicit priorities and constraints.
  • Intelligent Section Parsing: Robustly identifies and extracts structured sections (headings and content) from diverse and complex PDF layouts.
  • Goal-Oriented Relevance Ranking: Implements a novel weighted scoring model that heavily prioritizes section titles to ensure results are not just relevant but also highly actionable and intentional.
  • Built-in Diversification: Ensures the final output is a well-rounded and balanced plan by selecting the best information from a variety of source documents.
  • Fully Containerized: Packaged with Docker for a 100% reproducible and consistent execution environment, eliminating the "it works on my machine" problem.
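The section-parsing logic lives in src/processing.py; as one illustration of the idea, a common PyMuPDF approach treats text spans whose font size exceeds the body size as headings. A minimal sketch, assuming hypothetical `(font_size, text)` pairs (in the real pipeline these would come from `page.get_text("dict")` in PyMuPDF; the helper name and threshold are illustrative):

```python
def group_spans_into_sections(spans, body_size=11.0):
    """Group (font_size, text) pairs into {title, content} sections.

    Spans larger than body_size are treated as headings; text that
    follows a heading accumulates as that section's content.
    """
    sections = []
    current = None
    for size, text in spans:
        if size > body_size:            # heading candidate
            if current:
                sections.append(current)
            current = {"title": text.strip(), "content": ""}
        elif current:                   # body text under the last heading
            current["content"] += text + " "
    if current:
        sections.append(current)
    return sections


spans = [
    (16.0, "Nightlife and Entertainment"),
    (11.0, "The city offers numerous bars and clubs."),
    (16.0, "Family-Friendly Hotels"),
    (11.0, "Several hotels cater to families."),
]
print(group_spans_into_sections(spans))
```

A size threshold alone is fragile on complex layouts, which is why the actual parser combines multiple cues; this sketch shows only the core grouping step.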

📂 Project Structure

The repository is organized with a clean and scalable structure to separate source code, data, and configuration.

.
├── src/
│   ├── __init__.py           # Makes 'src' a Python package
│   ├── main.py               # Main entry point of the application
│   └── processing.py         # Core logic for parsing and analysis
│
├── input/
│   ├── input.json            # Defines the persona, job, and document list
│   └── *.pdf                 # All PDF documents to be analyzed
│
├── output/
│   └── output.json           # The generated JSON result file
│
├── .gitignore                # Specifies files to be excluded from Git
├── Dockerfile                # The recipe to build the containerized application
├── README.md                 # This file
└── requirements.txt          # Lists all Python dependencies

⚙️ Core Methodology

This solution addresses the challenge by moving beyond simple semantic search to a more robust framework that mirrors human intent. Our system, Goal-Oriented Relevance Fusion, operates on the principle that a section's title is the most powerful indicator of its purpose.

The process involves two core stages:

  1. Persona-Centric Prompting: The system begins by dynamically generating a detailed query prompt directly from the input.json. This prompt commands the AI model to "act as" the specified persona and provides a clear mission brief based on their job_to_be_done. It explicitly outlines high-priority topics to find and low-priority topics to avoid, ensuring the model's "mindset" is perfectly aligned with the user's context.

  2. Weighted Relevance Scoring: For every section extracted from the PDFs, we calculate two separate relevance scores against the persona-centric query:

    • Title Score: The semantic similarity between the query and the section's title.
    • Content Score: The semantic similarity between the query and the section's body text.

    These are fused into a final score with a heavy weighting towards the title (Score = 0.7 * TitleScore + 0.3 * ContentScore). This technique intelligently penalizes sections with generic titles like "Introduction," forcing the system to favor sections that are clearly and intentionally focused on the user's specific goal. This allows the system to make discerning choices, such as prioritizing "Nightlife and Entertainment" over "Family-Friendly Hotels" for a group of college students.
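The fusion step can be sketched with plain NumPy. In the real system the vectors are all-MiniLM-L6-v2 embeddings of the query, title, and content; here toy 3-d vectors stand in for them, and the `fuse` helper is illustrative rather than the project's actual API:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def fuse(query_vec, title_vec, content_vec, w_title=0.7, w_content=0.3):
    """Weighted fusion: Score = 0.7 * TitleScore + 0.3 * ContentScore."""
    return (w_title * cosine(query_vec, title_vec)
            + w_content * cosine(query_vec, content_vec))

# Toy embeddings: a title aligned with the query dominates the score.
q = np.array([1.0, 0.0, 0.0])
aligned_title = np.array([1.0, 0.1, 0.0])   # e.g. "Nightlife and Entertainment"
generic_title = np.array([0.1, 1.0, 0.0])   # e.g. "Introduction"
content = np.array([0.7, 0.7, 0.0])         # same body text in both cases

print(fuse(q, aligned_title, content) > fuse(q, generic_title, content))  # True
```

With identical content, the section whose title matches the query wins by a wide margin, which is exactly the penalty on generic titles described above.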

This hybrid approach, combined with a diversification step that selects the best section from each of the most relevant documents, ensures the final output is contextually aware, highly actionable, and directly addresses the user's stated needs.
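The diversification step can be sketched in a few lines of Python. The `diversify` helper and the input shape below are hypothetical, assuming each scored section is a dict with `document`, `section_title`, and `score` keys:

```python
def diversify(scored_sections, top_n=5):
    """Keep only the best-scoring section per document, then rank the
    survivors by score so no single document dominates the output."""
    best_per_doc = {}
    for sec in scored_sections:
        doc = sec["document"]
        if doc not in best_per_doc or sec["score"] > best_per_doc[doc]["score"]:
            best_per_doc[doc] = sec
    ranked = sorted(best_per_doc.values(), key=lambda s: s["score"], reverse=True)
    return [dict(s, importance_rank=i + 1) for i, s in enumerate(ranked[:top_n])]


sections = [
    {"document": "doc1.pdf", "section_title": "A", "score": 0.9},
    {"document": "doc1.pdf", "section_title": "B", "score": 0.8},
    {"document": "doc2.pdf", "section_title": "C", "score": 0.85},
]
print(diversify(sections))
```

Even though doc1 holds the two highest raw scores, only its single best section survives, so doc2 is still represented in the final plan.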

🛠️ Tech Stack

Component         | Technology / Library
------------------|------------------------------------------
PDF Processing    | PyMuPDF
Language Model    | sentence-transformers (all-MiniLM-L6-v2)
ML Backend        | PyTorch
Containerization  | Docker
Utilities         | NumPy, tqdm

⚡ Setup & Execution

The entire application is containerized, so setup and execution take just two commands: one build and one run.

Prerequisites

  • Docker Desktop installed and running.
  • Input files placed correctly in the input/ directory as per the structure above.

📊 Data Format

Input (input/input.json)

The application is driven by an input.json file that specifies the context for the analysis.

{
    "documents": [
        { "filename": "doc1.pdf", "title": "Document Title 1" },
        { "filename": "doc2.pdf", "title": "Document Title 2" }
    ],
    "persona": {
        "role": "HR professional"
    },
    "job_to_be_done": {
        "task": "Create and manage fillable forms for onboarding and compliance."
    }
}
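The persona-centric query described in the methodology is assembled directly from these fields. A minimal sketch, assuming a hypothetical `build_query` helper; the template wording is illustrative, not the project's exact prompt:

```python
import json

def build_query(config):
    """Turn input.json fields into a persona-centric query prompt."""
    role = config["persona"]["role"]
    task = config["job_to_be_done"]["task"]
    return (
        f"Act as a {role}. Your mission: {task} "
        "Prioritize sections that directly help accomplish this task; "
        "deprioritize generic or unrelated material."
    )


config = json.loads("""{
    "persona": {"role": "HR professional"},
    "job_to_be_done": {"task": "Create and manage fillable forms."}
}""")
print(build_query(config))
```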

Output (output/output.json)

The application generates a single output.json file containing the complete analysis.

{
    "metadata": {
        "input_documents": ["doc1.pdf", "doc2.pdf"],
        "persona": "HR professional",
        "job_to_be_done": "...",
        "processing_timestamp": "..."
    },
    "extracted_sections": [
        {
            "document": "doc1.pdf",
            "section_title": "A Highly Relevant Section Title",
            "importance_rank": 1,
            "page_number": 12
        }
    ],
    "subsection_analysis": [
        {
            "document": "doc1.pdf",
            "refined_text": "The most relevant and actionable paragraph from the extracted section...",
            "page_number": 12
        }
    ]
}

1. Build the Docker Image

Navigate to the project's root directory in your terminal and run the build command.

docker build -t adobe-1b-solution .

2. Run the Analysis

Execute the run command. This will start the container, process the files, and save the results.

  • For Windows (Command Prompt):

    docker run --rm -v "%cd%/input":/app/input -v "%cd%/output":/app/output adobe-1b-solution

    (In PowerShell, %cd% does not expand; substitute ${PWD} for %cd%.)
  • For macOS or Linux:

    docker run --rm -v "$(pwd)/input":/app/input -v "$(pwd)/output":/app/output adobe-1b-solution

The final analysis will be saved as output.json in the output directory.
