A sophisticated document analysis system designed to intelligently extract and prioritize information from a collection of PDFs based on a user's specific persona and job-to-be-done.
- Dynamic Query Generation: Automatically creates a highly specific, persona-centric query to guide the analysis, including explicit priorities and constraints.
- Intelligent Section Parsing: Robustly identifies and extracts structured sections (headings and content) from diverse and complex PDF layouts.
- Goal-Oriented Relevance Ranking: Implements a novel weighted scoring model that heavily prioritizes section titles to ensure results are not just relevant but also highly actionable and intentional.
- Built-in Diversification: Ensures the final output is a well-rounded and balanced plan by selecting the best information from a variety of source documents.
- Fully Containerized: Packaged with Docker for a 100% reproducible and consistent execution environment, eliminating the "it works on my machine" problem.
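As a rough illustration of the section-parsing idea, a heading can be distinguished from body text by font size and boldness. The sketch below operates on span dictionaries shaped like the output of PyMuPDF's `page.get_text("dict")` (the `size`, `flags`, and `text` fields match PyMuPDF; the 1.15 size threshold and the grouping logic are illustrative assumptions, not necessarily what `src/processing.py` ships):

```python
BOLD_FLAG = 1 << 4  # bold bit in PyMuPDF span flags

def is_heading(span: dict, body_size: float) -> bool:
    """A span is treated as a heading if it is noticeably larger than
    the body text, or bold at (or above) body size."""
    larger = span["size"] >= body_size * 1.15
    bold = bool(span["flags"] & BOLD_FLAG)
    return larger or (bold and span["size"] >= body_size)

def split_sections(spans: list[dict], body_size: float) -> list[dict]:
    """Group consecutive spans into {'title', 'content'} sections,
    starting a new section at every detected heading."""
    sections, current = [], None
    for span in spans:
        if is_heading(span, body_size):
            current = {"title": span["text"], "content": ""}
            sections.append(current)
        elif current is not None:
            current["content"] += span["text"] + " "
    return sections
```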
The repository is organized with a clean and scalable structure to separate source code, data, and configuration.
```
.
├── src/
│   ├── __init__.py       # Makes 'src' a Python package
│   ├── main.py           # Main entry point of the application
│   └── processing.py     # Core logic for parsing and analysis
│
├── input/
│   ├── input.json        # Defines the persona, job, and document list
│   └── *.pdf             # All PDF documents to be analyzed
│
├── output/
│   └── output.json       # The generated JSON result file
│
├── .gitignore            # Specifies files to be excluded from Git
├── Dockerfile            # The recipe to build the containerized application
├── README.md             # This file
└── requirements.txt      # Lists all Python dependencies
```
This solution addresses the challenge by moving beyond simple semantic search to a more robust framework that mirrors human intent. Our system, Goal-Oriented Relevance Fusion, operates on the principle that a section's title is the most powerful indicator of its purpose.
The process involves two core stages:
1. Persona-Centric Prompting: The system begins by dynamically generating a detailed query prompt directly from `input.json`. This prompt commands the AI model to "act as" the specified persona and provides a clear mission brief based on their `job_to_be_done`. It explicitly outlines high-priority topics to find and low-priority topics to avoid, ensuring the model's "mindset" is aligned with the user's context.
2. Weighted Relevance Scoring: For every section extracted from the PDFs, we calculate two separate relevance scores against the persona-centric query:
   - Title Score: The semantic similarity between the query and the section's title.
   - Content Score: The semantic similarity between the query and the section's body text.

   These are fused into a final score with a heavy weighting towards the title (`Score = 0.7 * TitleScore + 0.3 * ContentScore`). This technique penalizes sections with generic titles like "Introduction," forcing the system to favor sections that are clearly and intentionally focused on the user's specific goal. It allows the system to make discerning choices, such as prioritizing "Nightlife and Entertainment" over "Family-Friendly Hotels" for a group of college students.
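The fusion step itself is simple to express. A minimal sketch, assuming `title_score` and `content_score` are cosine similarities already produced by the embedding model (the weights mirror the formula above; the helper names are illustrative, not the shipped API):

```python
def fused_score(title_score: float, content_score: float,
                title_weight: float = 0.7) -> float:
    """Fuse title and content similarities, weighting the title heavily
    so sections with generic titles ("Introduction") score lower."""
    return title_weight * title_score + (1 - title_weight) * content_score

def rank_sections(sections: list[dict]) -> list[dict]:
    """Sort sections best-first by fused score; each section carries
    precomputed 'title_score' and 'content_score' fields."""
    return sorted(
        sections,
        key=lambda s: fused_score(s["title_score"], s["content_score"]),
        reverse=True,
    )
```

Note how a section with strong body text but a generic title (high content score, low title score) still loses to a section whose title squarely matches the query.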
This hybrid approach, combined with a diversification step that selects the best section from each of the most relevant documents, ensures the final output is contextually aware, highly actionable, and directly addresses the user's stated needs.
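The diversification step mentioned above can be sketched as a single pass over the ranked list that keeps only the top section per source document (a minimal sketch under the assumption that each section dict carries a `document` field; the actual selection logic may differ):

```python
from collections import OrderedDict

def diversify(ranked_sections: list[dict], top_k: int = 5) -> list[dict]:
    """Keep only the highest-ranked section from each source document,
    preserving rank order, then truncate to the top_k results.
    Assumes the input is already sorted best-first."""
    best_per_doc = OrderedDict()
    for section in ranked_sections:
        # setdefault keeps the first (i.e. best-ranked) hit per document
        best_per_doc.setdefault(section["document"], section)
    return list(best_per_doc.values())[:top_k]
```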
| Component | Technology / Library |
|---|---|
| PDF Processing | PyMuPDF |
| Language Model | sentence-transformers (`all-MiniLM-L6-v2`) |
| ML Backend | PyTorch |
| Containerization | Docker |
| Utilities | NumPy, tqdm |
The entire application is containerized, allowing for a simple, one-command setup and execution. Prerequisites:
- Docker Desktop installed and running.
- Input files placed correctly in the `input/` directory as per the structure above.
The application is driven by an `input.json` file that specifies the context for the analysis.
```json
{
  "documents": [
    { "filename": "doc1.pdf", "title": "Document Title 1" },
    { "filename": "doc2.pdf", "title": "Document Title 2" }
  ],
  "persona": {
    "role": "HR professional"
  },
  "job_to_be_done": {
    "task": "Create and manage fillable forms for onboarding and compliance."
  }
}
```
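Turning this file into the persona-centric query described earlier amounts to reading the `persona` and `job_to_be_done` fields and composing a prompt. A minimal sketch (the prompt wording here is illustrative, not the shipped template in `src/main.py`):

```python
import json

def build_query(config: dict) -> str:
    """Compose a persona-centric query string from a parsed input.json.
    Field names ('persona.role', 'job_to_be_done.task') match the
    schema above; the phrasing is an illustrative assumption."""
    role = config["persona"]["role"]
    task = config["job_to_be_done"]["task"]
    return f"As a {role}, find sections that help with: {task}"

# In the real pipeline the config would come from the mounted input dir:
# config = json.load(open("input/input.json"))
```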
The application generates a single `output.json` file containing the complete analysis.
```json
{
  "metadata": {
    "input_documents": ["doc1.pdf", "doc2.pdf"],
    "persona": "HR professional",
    "job_to_be_done": "...",
    "processing_timestamp": "..."
  },
  "extracted_sections": [
    {
      "document": "doc1.pdf",
      "section_title": "A Highly Relevant Section Title",
      "importance_rank": 1,
      "page_number": 12
    }
  ],
  "subsection_analysis": [
    {
      "document": "doc1.pdf",
      "refined_text": "The most relevant and actionable paragraph from the extracted section...",
      "page_number": 12
    }
  ]
}
```
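The `metadata` block can be assembled directly from the parsed `input.json`. A sketch whose field names mirror the schema above (the function name is illustrative):

```python
from datetime import datetime, timezone

def build_metadata(config: dict) -> dict:
    """Assemble the 'metadata' block of output.json from a parsed
    input.json; keys mirror the output schema shown above."""
    return {
        "input_documents": [d["filename"] for d in config["documents"]],
        "persona": config["persona"]["role"],
        "job_to_be_done": config["job_to_be_done"]["task"],
        "processing_timestamp": datetime.now(timezone.utc).isoformat(),
    }
```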
Navigate to the project's root directory in your terminal and run the build command:

```bash
docker build -t adobe-1b-solution .
```
Execute the run command. This will start the container, process the files, and save the results.

- For Windows Command Prompt:

  ```bash
  docker run --rm -v "%cd%/input":/app/input -v "%cd%/output":/app/output adobe-1b-solution
  ```

- For Windows PowerShell (which does not expand `%cd%`):

  ```bash
  docker run --rm -v "${PWD}/input":/app/input -v "${PWD}/output":/app/output adobe-1b-solution
  ```

- For macOS or Linux:

  ```bash
  docker run --rm -v "$(pwd)/input":/app/input -v "$(pwd)/output":/app/output adobe-1b-solution
  ```
The final analysis will be saved as `output.json` in the `output/` directory.