# Self-Corrective RAG for Intelligent Document Processing

A Retrieval-Augmented Generation (RAG) framework for Intelligent Document Processing (IDP) that combines ChromaDB for vector storage with Groq's Llama3-70b-8192 model for reasoning. This project enables efficient document retrieval, embedding, and question answering with self-corrective mechanisms.
## Table of Contents

- Project Overview
- Features
- Folder Structure
- Setup Instructions
- Usage
- How It Works
- Contributing
- License
## Project Overview

This project implements a self-corrective RAG framework for processing and querying documents stored in ChromaDB. The system uses:
- ChromaDB as a persistent vector database for storing embeddings.
- Groq's Llama3-70b-8192 model for reasoning and language understanding.
- LangChain for managing embeddings, vector stores, and retrieval workflows.
The system supports:
- Persistent storage of document embeddings.
- Retrieval of relevant documents based on user queries.
- A grading mechanism to assess the relevance of retrieved documents.
## Features

- ChromaDB Integration: Persistent vector database to store and retrieve embeddings.
- Groq's Llama3 Model: Advanced reasoning capabilities with Llama3-70b-8192.
- Self-Corrective Mechanism: Grading retrieved documents for relevance.
- JSON Data Handling: Fetches and processes JSON responses from APIs.
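The retrieval flow behind these features can be sketched with a toy in-memory vector store. This is a stand-in for ChromaDB only: the character-frequency "embedding" and cosine similarity below are simplified placeholders, not the project's real `BAAI/bge-base-en-v1.5` embeddings or Chroma persistence.

```python
import math

def embed(text):
    # Placeholder "embedding": character-frequency vector over a-z.
    # The real project uses HuggingFace BGE embeddings instead.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class ToyVectorStore:
    """In-memory stand-in for a persistent Chroma collection."""

    def __init__(self):
        self.docs = []

    def add(self, text):
        self.docs.append((text, embed(text)))

    def retrieve(self, query, k=2):
        # Rank stored documents by similarity to the query embedding.
        qv = embed(query)
        ranked = sorted(self.docs, key=lambda d: cosine(qv, d[1]), reverse=True)
        return [text for text, _ in ranked[:k]]

store = ToyVectorStore()
store.add("hospital admissions report")
store.add("quarterly financial summary")
store.add("hospital staffing levels")
results = store.retrieve("hospitals", k=2)
print(results)
```

The real system swaps both pieces out: embeddings come from the BGE model and storage/retrieval is delegated to ChromaDB, but the add-then-rank-by-similarity shape is the same.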
## Folder Structure

```
self-corrective-rag-idp/
├── app.py                  # Main application script
├── chromadb/               # Persistent ChromaDB storage
│   ├── 1b89f3d0-a622-4e93-936d-fb0c25382900/
│   │   ├── data_level0.bin
│   │   ├── header.bin
│   │   ├── length.bin
│   │   └── link_lists.bin
│   ├── 076d46b2-8442-4792-b0fc-d748b47c9e25/
│   │   ├── data_level0.bin
│   │   ├── header.bin
│   │   ├── length.bin
│   │   └── link_lists.bin
│   └── chroma.sqlite3      # ChromaDB SQLite database file
├── create_chromadb.py      # Script to populate ChromaDB with embeddings
├── LICENSE                 # License file for the project
├── metadata_store.json     # Metadata for ChromaDB collections
├── poetry.lock             # Poetry lock file for dependencies
├── pyproject.toml          # Poetry project configuration file
├── README.md               # Project documentation (this file)
└── utils/                  # Utility scripts and modules
    ├── build_rag.py        # RAG implementation (vector DB management)
    ├── data_gatherer.py    # Fetches JSON data for embedding and processing
    └── llm.py              # LLM initialization (Groq's Llama3 model)
```
## Setup Instructions

- Python 3.8 or higher.
- Install dependencies using Poetry:

  ```bash
  poetry install
  ```

- Set up environment variables in a `.env` file:

  ```
  VECTOR_STORE=./chromadb/
  GROQ_API_KEY=your_groq_api_key_here
  ```

- Ensure you have the necessary libraries installed:

  ```bash
  pip install chromadb langchain langchain_chroma langchain_community python-dotenv
  ```
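At startup the application reads these `.env` values; `python-dotenv` (installed above) does this in the project, and the parsing it performs amounts to a few lines of standard-library Python. The sketch below uses the example values from this README and a hand-rolled `load_env` helper rather than the library's `load_dotenv`:

```python
import os

def load_env(text):
    # Minimal stand-in for python-dotenv's load_dotenv(): parse KEY=VALUE
    # lines and export them into the process environment.
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and malformed lines
        key, _, value = line.partition("=")
        os.environ.setdefault(key.strip(), value.strip())

env_file = """\
VECTOR_STORE=./chromadb/
GROQ_API_KEY=your_groq_api_key_here
"""
load_env(env_file)
print(os.environ["VECTOR_STORE"])  # ./chromadb/
```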
## Usage

Run the `create_chromadb.py` script to fetch data and populate the vector database:

```bash
python create_chromadb.py
```

Start the main application to query documents:

```bash
python app.py
```

In `app.py`, modify the `question` variable to query specific topics:

```python
question = "hospitals"
```

The system will retrieve relevant documents and grade their relevance.
## How It Works

1. Data Gathering:
   - The `data_gatherer.py` script fetches JSON responses from APIs.
   - These responses are processed into LangChain-compatible document objects.
2. Embedding and Storage:
   - The `build_rag.py` script uses HuggingFace BGE embeddings (`BAAI/bge-base-en-v1.5`) to create document embeddings.
   - Embeddings are stored persistently in ChromaDB.
3. Document Retrieval:
   - A retriever is created using LangChain's Chroma integration.
   - Documents are retrieved based on semantic similarity to user queries.
4. Grading Mechanism:
   - The Groq LLM grades retrieved documents for relevance using a structured prompt.
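The grading step above can be sketched as a filter over the retrieved documents. In the real project the grader is Groq's Llama3-70b-8192 answering a structured relevance prompt; here a simple keyword stub stands in for the LLM so the control flow can run without an API key.

```python
def grade_document(question, document):
    # Hypothetical stand-in for the LLM grader: answers "yes" if any query
    # term appears in the document, mirroring the binary relevance grade
    # the structured prompt asks the model for.
    terms = question.lower().split()
    return "yes" if any(t in document.lower() for t in terms) else "no"

def filter_relevant(question, retrieved_docs):
    # Keep only documents the grader marks relevant; the discarded ones are
    # what a self-corrective loop would re-retrieve or rephrase the query for.
    return [d for d in retrieved_docs if grade_document(question, d) == "yes"]

retrieved = [
    "hospitals in the region reported higher admissions",
    "quarterly revenue grew by 4 percent",
]
relevant = filter_relevant("hospitals", retrieved)
print(relevant)
```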
## Contributing

Contributions are welcome! Please follow these steps:

- Fork this repository.
- Create a new branch: `git checkout -b feature-name`.
- Commit your changes: `git commit -m 'Add feature-name'`.
- Push to your branch: `git push origin feature-name`.
- Submit a pull request.
## License

This project is licensed under the MIT License. See the `LICENSE` file for details.