This repository and the RAG system implemented herein are developed strictly for educational and portfolio purposes. The project is intended to demonstrate technical capabilities in Retrieval-Augmented Generation (RAG) and the integration of local LLMs with authoritative guidelines.
This system is NOT intended for, nor should it be used for, clinical decision-making, medical diagnosis, treatment, or any form of patient care. Clinical decisions must always be made by qualified healthcare professionals based on their expertise, patient-specific information, and current, officially published medical guidelines.
The information provided by this system is for informational and demonstrative purposes only and should not be considered medical advice.
This project implements a Retrieval-Augmented Generation (RAG) system designed to provide clinicians with advice on patient management based on the United States Preventive Services Task Force (USPSTF) guidelines. It leverages local Large Language Models (LLMs) and embedding models via Ollama, ensuring data privacy and control.
- Document Ingestion: Processes PDF documents (USPSTF guidelines) into a searchable format.
- Local Embeddings: Uses `all-minilm` via Ollama for generating document embeddings.
- Local LLM: Utilizes `phi3.5:latest` via Ollama for generating responses, keeping all processing local.
- Vector Store: Employs ChromaDB for efficient storage and retrieval of document chunks and their embeddings.
- FastAPI Interface: Exposes the RAG system as a web API for easy integration and interaction.
- Logging & Performance Metrics: Integrates Python's `logging` module and basic timing measurements for better observability.
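A minimal sketch of how these pieces typically fit together using LangChain's community Ollama and Chroma wrappers. This is illustrative only; the actual wiring in this repo's `src/` modules may differ, and the retriever depth `k=4` is an assumed value:

```python
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.llms import Ollama
from langchain_community.vectorstores import Chroma

# Local models served by Ollama: no data leaves the machine
embeddings = OllamaEmbeddings(model="all-minilm")
llm = Ollama(model="phi3.5:latest")

# ChromaDB holds the chunk embeddings persisted by the ingestion step
vectorstore = Chroma(persist_directory="vectorstore", embedding_function=embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

def answer(question: str) -> str:
    # Retrieve the most similar guideline chunks, then ground the LLM on them
    context = "\n\n".join(doc.page_content for doc in retriever.invoke(question))
    prompt = f"Answer using only this context:\n\n{context}\n\nQuestion: {question}"
    return llm.invoke(prompt)
```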
- Python 3.11+
- FastAPI: For building the web API.
- Uvicorn: ASGI server to run the FastAPI application.
- LangChain: Framework for developing LLM applications.
- Ollama: For running local LLMs and embedding models (`all-minilm`, `phi3.5:latest`).
- ChromaDB: Lightweight vector database.
- `python-dotenv`: For managing environment variables.
- `unstructured`: For extracting text from various document formats (e.g., PDFs).
- `uv`: For dependency management and virtual environments.
This project was developed using an iterative, AI-assisted approach. Leveraging `gemini-cli`, I guided the development process step-by-step, making architectural decisions, defining requirements, and ensuring adherence to best practices. This methodology allowed for rapid prototyping, exploration of various technical solutions, and a deeper understanding of complex concepts by actively directing the AI assistant's code generation and refactoring efforts. This approach highlights the ability to effectively utilize advanced AI tools as a force multiplier in software development.
Follow these steps to set up and run the project locally.
```bash
git clone <repository_url>
cd simple-rag-system
```
`uv` is a fast Python package installer and resolver. If you don't have it, install it:

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```
```bash
uv venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
```
Install all required Python packages using `uv`:

```bash
uv sync
```
Download and install Ollama from the official website: https://ollama.com
Once installed, pull the necessary models:
```bash
ollama pull all-minilm
ollama pull phi3.5:latest
```
Ensure the Ollama server is running in the background (e.g., via `ollama serve` if it was not installed as a service).
Place your USPSTF guideline PDF files into the `data/raw/` directory. If the directory does not exist, it will be created during the ingestion process.
Before querying, you need to ingest the documents into the vector database. This process extracts text, splits it into chunks, generates embeddings, and stores them in ChromaDB.
Run the ingestion script from the project root:
```bash
uv run python -m src.ingest
```
This will create a `vectorstore/` directory containing your indexed data.
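Conceptually, the ingestion step follows the standard LangChain load/split/embed/store pattern. The following is an illustrative sketch using the community loaders, not a copy of `src/ingest.py`; the chunk size and overlap shown are assumed values:

```python
from langchain_community.document_loaders import DirectoryLoader, UnstructuredPDFLoader
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load every PDF in data/raw/ and extract its text
loader = DirectoryLoader("data/raw", glob="*.pdf", loader_cls=UnstructuredPDFLoader)
documents = loader.load()

# Split documents into overlapping chunks sized for the embedding model
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(documents)

# Embed each chunk locally and persist the index to vectorstore/
Chroma.from_documents(
    chunks,
    embedding=OllamaEmbeddings(model="all-minilm"),
    persist_directory="vectorstore",
)
```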
Start the FastAPI server from the project root:
```bash
uv run python -m uvicorn main:app --host 0.0.0.0 --port 8000 --reload
```
The `--reload` flag is useful for development, as it restarts the server on code changes.
Once the server is running, open your web browser and navigate to: http://localhost:8000/docs
This will open the interactive Swagger UI documentation, where you can test the API.
Use the `/query` endpoint to send questions to the RAG system.
- Endpoint: `/query`
- Method: `POST`
- Request Body (JSON):

```json
{
  "question": "What is the recommendation for colorectal cancer screening?"
}
```

- Response Body (JSON):

```json
{
  "question": "What is the recommendation for colorectal cancer screening?",
  "answer": "The USPSTF recommends..."
}
```
To evolve this into a truly clinical RAG system, key considerations include:
- Authoritative Data: Strict curation and versioning of highly authoritative, evidence-based sources.
- Domain-Specific Models: Utilizing embedding and language models (LLMs) specifically trained or fine-tuned on clinical texts for enhanced understanding.
- Enhanced Retrieval: Leveraging metadata, hybrid search, and re-ranking for precise, context-aware information retrieval.
- Clinical Prompt Engineering: Crafting prompts with guardrails to ensure factual, safe, and actionable responses, avoiding direct medical advice, and citing sources (see the sketch after this list).
- Rigorous Validation: Comprehensive evaluation by clinical experts to ensure accuracy, safety, and clinical utility.
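To make the guardrail idea concrete, here is a hypothetical prompt template in the spirit of that item. The wording is illustrative only, not the prompt this project ships:

```python
from langchain_core.prompts import PromptTemplate

GUARDRAIL_TEMPLATE = """You are an assistant that summarizes USPSTF preventive care guidelines.

Rules:
- Answer ONLY from the provided context; if it is insufficient, say so explicitly.
- Do not give direct medical advice; describe what the guideline recommends.
- Name the guideline or section supporting each statement.

Context:
{context}

Question: {question}
"""

# PromptTemplate infers {context} and {question} as input variables
prompt = PromptTemplate.from_template(GUARDRAIL_TEMPLATE)
```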
This project serves as a foundational RAG system. Current limitations and areas for future improvement include:
- Document Parsing: Relies on basic PDF text extraction. Does not robustly handle complex document structures like tables, figures, or scanned documents, which can lead to information loss.
- Chunking Strategy: Uses a simple recursive character splitting method. This might not always preserve semantic coherence perfectly, especially for clinical guidelines with intricate structures.
- Retrieval Sophistication: Employs basic similarity search. Lacks advanced retrieval techniques such as re-ranking retrieved documents, hybrid search (combining keyword and semantic search), or leveraging document metadata for more precise filtering.
- LLM Hallucination/Generality: While RAG reduces hallucination, the LLM might still generate less precise or overly general answers if the retrieved context is insufficient or ambiguous.
- User Interface: Currently, interaction is limited to the FastAPI Swagger UI. A dedicated, user-friendly web interface is absent.
- Evaluation Framework: Lacks a robust, automated evaluation pipeline to measure the RAG system's performance (e.g., relevance, faithfulness, latency) against a defined dataset.
- Error Handling: Basic error handling is in place, but more granular and user-friendly error messages could be implemented.
- Advanced Document Processing: Implement more sophisticated parsing techniques (e.g., using `unstructured.io`'s advanced features, or dedicated table/image extraction) to better handle complex PDF layouts and extract structured information.
- Smarter Chunking: Explore and implement advanced chunking strategies (e.g., semantic chunking, hierarchical chunking based on document structure) to create more meaningful context units.
- Enhanced Retrieval Techniques: Integrate re-ranking models (e.g., Cohere Rerank), implement HyDE (Hypothetical Document Embeddings), or RAG-Fusion for improved retrieval accuracy. Leverage document metadata (e.g., guideline year, disease, population) for filtered retrieval.
- Domain-Specific Model Adaptation: Investigate fine-tuning embedding models (e.g., on clinical notes, medical literature) and potentially LLMs (if resources permit) to enhance domain-specific understanding and generation quality.
- Interactive User Interface: Develop a simple web-based front-end (e.g., using Streamlit, Gradio, or a React/Vue app) for a more intuitive user experience.
- Comprehensive Evaluation: Build an automated evaluation pipeline to continuously monitor and improve the RAG system's performance, including metrics for retrieval quality, answer faithfulness, and latency.
- Guideline Versioning: Implement a system to manage and query specific versions of guidelines, ensuring answers are based on the most current or relevant iteration.
- Citation Generation: Enhance the system to provide direct citations (e.g., page numbers, section references) from the source documents for generated answers, increasing trustworthiness.
- Streaming Responses: Implement streaming for LLM responses to provide a more responsive user experience (see the sketch below).
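As a sketch of the streaming item, a FastAPI endpoint could forward tokens as the Ollama-backed LLM produces them. The route name and wiring here are hypothetical, not part of the current `main.py`:

```python
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from langchain_community.llms import Ollama
from pydantic import BaseModel

app = FastAPI()
llm = Ollama(model="phi3.5:latest")

class Query(BaseModel):
    question: str

@app.post("/query/stream")
def query_stream(query: Query):
    # llm.stream() yields tokens as Ollama generates them,
    # so the client can render the answer incrementally
    def token_generator():
        for chunk in llm.stream(query.question):
            yield chunk
    return StreamingResponse(token_generator(), media_type="text/plain")
```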