A sophisticated document analysis system designed to intelligently extract and prioritize information from a collection of PDFs based on a user's specific persona and job-to-be-done.
- Dynamic Query Generation: Automatically creates a highly specific, persona-centric query to guide the analysis, including explicit priorities and constraints.
- Intelligent Section Parsing: Robustly identifies and extracts structured sections (headings and content) from diverse and complex PDF layouts.
- Goal-Oriented Relevance Ranking: Implements a novel weighted scoring model that heavily prioritizes section titles to ensure results are not just relevant but also highly actionable and intentional.
- Built-in Diversification: Ensures the final output is a well-rounded and balanced plan by selecting the best information from a variety of source documents.
- Fully Containerized: Packaged with Docker for a 100% reproducible and consistent execution environment, eliminating the "it works on my machine" problem.
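As a rough illustration of the section-parsing idea, a heading can be distinguished from body text by font size and boldness. The sketch below operates on span dictionaries shaped like the output of PyMuPDF's `page.get_text("dict")` (the `size`, `flags`, and `text` fields match PyMuPDF; the 1.15 size threshold and the grouping logic are illustrative assumptions, not necessarily what `src/processing.py` ships):

```python
BOLD_FLAG = 1 << 4  # bold bit in PyMuPDF span flags

def is_heading(span: dict, body_size: float) -> bool:
    """A span is treated as a heading if it is noticeably larger than
    the body text, or bold at (or above) body size."""
    larger = span["size"] >= body_size * 1.15
    bold = bool(span["flags"] & BOLD_FLAG)
    return larger or (bold and span["size"] >= body_size)

def split_sections(spans: list[dict], body_size: float) -> list[dict]:
    """Group consecutive spans into {'title', 'content'} sections,
    starting a new section at every detected heading."""
    sections, current = [], None
    for span in spans:
        if is_heading(span, body_size):
            current = {"title": span["text"], "content": ""}
            sections.append(current)
        elif current is not None:
            current["content"] += span["text"] + " "
    return sections
```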
The repository is organized with a clean and scalable structure to separate source code, data, and configuration.
```
.
├── src/
│   ├── __init__.py       # Makes 'src' a Python package
│   ├── main.py           # Main entry point of the application
│   └── processing.py     # Core logic for parsing and analysis
│
├── input/
│   ├── input.json        # Defines the persona, job, and document list
│   └── *.pdf             # All PDF documents to be analyzed
│
├── output/
│   └── output.json       # The generated JSON result file
│
├── .gitignore            # Specifies files to be excluded from Git
├── Dockerfile            # The recipe to build the containerized application
├── README.md             # This file
└── requirements.txt      # Lists all Python dependencies
```
This solution addresses the challenge by moving beyond simple semantic search to a more robust framework that mirrors human intent. Our system, Goal-Oriented Relevance Fusion, operates on the principle that a section's title is the most powerful indicator of its purpose.
The process involves two core stages:
1. Persona-Centric Prompting: The system begins by dynamically generating a detailed query prompt directly from `input.json`. This prompt commands the AI model to "act as" the specified persona and provides a clear mission brief based on their `job_to_be_done`. It explicitly outlines high-priority topics to find and low-priority topics to avoid, ensuring the model's "mindset" is aligned with the user's context.
2. Weighted Relevance Scoring: For every section extracted from the PDFs, we calculate two separate relevance scores against the persona-centric query:
   - Title Score: The semantic similarity between the query and the section's title.
   - Content Score: The semantic similarity between the query and the section's body text.

   These are fused into a final score with a heavy weighting towards the title (`Score = 0.7 * TitleScore + 0.3 * ContentScore`). This technique penalizes sections with generic titles like "Introduction," forcing the system to favor sections that are clearly and intentionally focused on the user's specific goal. It allows the system to make discerning choices, such as prioritizing "Nightlife and Entertainment" over "Family-Friendly Hotels" for a group of college students.
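The fusion step itself is simple to express. A minimal sketch, assuming `title_score` and `content_score` are cosine similarities already produced by the embedding model (the weights mirror the formula above; the helper names are illustrative, not the shipped API):

```python
def fused_score(title_score: float, content_score: float,
                title_weight: float = 0.7) -> float:
    """Fuse title and content similarities, weighting the title heavily
    so sections with generic titles ("Introduction") score lower."""
    return title_weight * title_score + (1 - title_weight) * content_score

def rank_sections(sections: list[dict]) -> list[dict]:
    """Sort sections best-first by fused score; each section carries
    precomputed 'title_score' and 'content_score' fields."""
    return sorted(
        sections,
        key=lambda s: fused_score(s["title_score"], s["content_score"]),
        reverse=True,
    )
```

Note how a section with strong body text but a generic title (high content score, low title score) still loses to a section whose title squarely matches the query.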
This hybrid approach, combined with a diversification step that selects the best section from each of the most relevant documents, ensures the final output is contextually aware, highly actionable, and directly addresses the user's stated needs.
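The diversification step mentioned above can be sketched as a single pass over the ranked list that keeps only the top section per source document (a minimal sketch under the assumption that each section dict carries a `document` field; the actual selection logic may differ):

```python
from collections import OrderedDict

def diversify(ranked_sections: list[dict], top_k: int = 5) -> list[dict]:
    """Keep only the highest-ranked section from each source document,
    preserving rank order, then truncate to the top_k results.
    Assumes the input is already sorted best-first."""
    best_per_doc = OrderedDict()
    for section in ranked_sections:
        # setdefault keeps the first (i.e. best-ranked) hit per document
        best_per_doc.setdefault(section["document"], section)
    return list(best_per_doc.values())[:top_k]
```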
| Component | Technology / Library |
|---|---|
| PDF Processing | PyMuPDF |
| Language Model | sentence-transformers (`all-MiniLM-L6-v2`) |
| ML Backend | PyTorch |
| Containerization | Docker |
| Utilities | NumPy, tqdm |
The entire application is containerized, allowing for a simple, one-command setup and execution. Prerequisites:
- Docker Desktop installed and running.
- Input files placed correctly in the `input/` directory as per the structure above.
The application is driven by an `input.json` file that specifies the context for the analysis.
```json
{
  "documents": [
    { "filename": "doc1.pdf", "title": "Document Title 1" },
    { "filename": "doc2.pdf", "title": "Document Title 2" }
  ],
  "persona": {
    "role": "HR professional"
  },
  "job_to_be_done": {
    "task": "Create and manage fillable forms for onboarding and compliance."
  }
}
```
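Turning this file into the persona-centric query described earlier amounts to reading the `persona` and `job_to_be_done` fields and composing a prompt. A minimal sketch (the prompt wording here is illustrative, not the shipped template in `src/main.py`):

```python
import json

def build_query(config: dict) -> str:
    """Compose a persona-centric query string from a parsed input.json.
    Field names ('persona.role', 'job_to_be_done.task') match the
    schema above; the phrasing is an illustrative assumption."""
    role = config["persona"]["role"]
    task = config["job_to_be_done"]["task"]
    return f"As a {role}, find sections that help with: {task}"

# In the real pipeline the config would come from the mounted input dir:
# config = json.load(open("input/input.json"))
```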
The application generates a single `output.json` file containing the complete analysis.
```json
{
  "metadata": {
    "input_documents": ["doc1.pdf", "doc2.pdf"],
    "persona": "HR professional",
    "job_to_be_done": "...",
    "processing_timestamp": "..."
  },
  "extracted_sections": [
    {
      "document": "doc1.pdf",
      "section_title": "A Highly Relevant Section Title",
      "importance_rank": 1,
      "page_number": 12
    }
  ],
  "subsection_analysis": [
    {
      "document": "doc1.pdf",
      "refined_text": "The most relevant and actionable paragraph from the extracted section...",
      "page_number": 12
    }
  ]
}
```
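The `metadata` block can be assembled directly from the parsed `input.json`. A sketch whose field names mirror the schema above (the function name is illustrative):

```python
from datetime import datetime, timezone

def build_metadata(config: dict) -> dict:
    """Assemble the 'metadata' block of output.json from a parsed
    input.json; keys mirror the output schema shown above."""
    return {
        "input_documents": [d["filename"] for d in config["documents"]],
        "persona": config["persona"]["role"],
        "job_to_be_done": config["job_to_be_done"]["task"],
        "processing_timestamp": datetime.now(timezone.utc).isoformat(),
    }
```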
Navigate to the project's root directory in your terminal and run the build command:

```bash
docker build -t adobe-1b-solution .
```
Execute the run command. This will start the container, process the files, and save the results.

- For Windows Command Prompt:

  ```bash
  docker run --rm -v "%cd%/input":/app/input -v "%cd%/output":/app/output adobe-1b-solution
  ```

- For Windows PowerShell (which does not expand `%cd%`):

  ```bash
  docker run --rm -v "${PWD}/input":/app/input -v "${PWD}/output":/app/output adobe-1b-solution
  ```

- For macOS or Linux:

  ```bash
  docker run --rm -v "$(pwd)/input":/app/input -v "$(pwd)/output":/app/output adobe-1b-solution
  ```
The final analysis will be saved as `output.json` in the `output/` directory.