# Self-Corrective RAG for Intelligent Document Processing

A Retrieval-Augmented Generation (RAG) framework for Intelligent Document Processing (IDP) that combines ChromaDB for vector storage with Groq's Llama3-70b-8192 model for reasoning. This project enables efficient document retrieval, embedding, and question answering with self-corrective mechanisms.
## Table of Contents

- Project Overview
- Features
- Folder Structure
- Setup Instructions
- Usage
- How It Works
- Contributing
- License
## Project Overview

This project implements a self-corrective RAG framework for processing and querying documents stored in ChromaDB. The system uses:
- ChromaDB as a persistent vector database for storing embeddings.
- Groq's Llama3-70b-8192 model for reasoning and language understanding.
- LangChain for managing embeddings, vector stores, and retrieval workflows.
The system supports:
- Persistent storage of document embeddings.
- Retrieval of relevant documents based on user queries.
- A grading mechanism to assess the relevance of retrieved documents.
## Features

- ChromaDB Integration: Persistent vector database to store and retrieve embeddings.
- Groq's Llama3 Model: Advanced reasoning capabilities with Llama3-70b-8192.
- Self-Corrective Mechanism: Grading retrieved documents for relevance.
- JSON Data Handling: Fetches and processes JSON responses from APIs.
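The retrieval flow behind these features can be sketched with a toy in-memory vector store. This is a stand-in for ChromaDB only: the character-frequency "embedding" and cosine similarity below are simplified placeholders, not the project's real `BAAI/bge-base-en-v1.5` embeddings or Chroma persistence.

```python
import math

def embed(text):
    # Placeholder "embedding": character-frequency vector over a-z.
    # The real project uses HuggingFace BGE embeddings instead.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class ToyVectorStore:
    """In-memory stand-in for a persistent Chroma collection."""

    def __init__(self):
        self.docs = []

    def add(self, text):
        self.docs.append((text, embed(text)))

    def retrieve(self, query, k=2):
        # Rank stored documents by similarity to the query embedding.
        qv = embed(query)
        ranked = sorted(self.docs, key=lambda d: cosine(qv, d[1]), reverse=True)
        return [text for text, _ in ranked[:k]]

store = ToyVectorStore()
store.add("hospital admissions report")
store.add("quarterly financial summary")
store.add("hospital staffing levels")
results = store.retrieve("hospitals", k=2)
print(results)
```

The real system swaps both pieces out: embeddings come from the BGE model and storage/retrieval is delegated to ChromaDB, but the add-then-rank-by-similarity shape is the same.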
## Folder Structure

```
self-corrective-rag-idp/
├── app.py                  # Main application script
├── chromadb/               # Persistent ChromaDB storage
│   ├── 1b89f3d0-a622-4e93-936d-fb0c25382900/
│   │   ├── data_level0.bin
│   │   ├── header.bin
│   │   ├── length.bin
│   │   └── link_lists.bin
│   ├── 076d46b2-8442-4792-b0fc-d748b47c9e25/
│   │   ├── data_level0.bin
│   │   ├── header.bin
│   │   ├── length.bin
│   │   └── link_lists.bin
│   └── chroma.sqlite3      # ChromaDB SQLite database file
├── create_chromadb.py      # Script to populate ChromaDB with embeddings
├── LICENSE                 # License file for the project
├── metadata_store.json     # Metadata for ChromaDB collections
├── poetry.lock             # Poetry lock file for dependencies
├── pyproject.toml          # Poetry project configuration file
├── README.md               # Project documentation (this file)
└── utils/                  # Utility scripts and modules
    ├── build_rag.py        # RAG implementation (vector DB management)
    ├── data_gatherer.py    # Fetches JSON data for embedding and processing
    └── llm.py              # LLM initialization (Groq's Llama3 model)
```
## Setup Instructions

- Python 3.8 or higher.
- Install dependencies using Poetry:

  ```bash
  poetry install
  ```

- Set up environment variables in a `.env` file:

  ```
  VECTOR_STORE=./chromadb/
  GROQ_API_KEY=your_groq_api_key_here
  ```

- Ensure you have the necessary libraries installed:

  ```bash
  pip install chromadb langchain langchain_chroma langchain_community python-dotenv
  ```
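At startup the application reads these `.env` values; `python-dotenv` (installed above) does this in the project, and the parsing it performs amounts to a few lines of standard-library Python. The sketch below uses the example values from this README and a hand-rolled `load_env` helper rather than the library's `load_dotenv`:

```python
import os

def load_env(text):
    # Minimal stand-in for python-dotenv's load_dotenv(): parse KEY=VALUE
    # lines and export them into the process environment.
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and malformed lines
        key, _, value = line.partition("=")
        os.environ.setdefault(key.strip(), value.strip())

env_file = """\
VECTOR_STORE=./chromadb/
GROQ_API_KEY=your_groq_api_key_here
"""
load_env(env_file)
print(os.environ["VECTOR_STORE"])  # ./chromadb/
```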
## Usage

Run the `create_chromadb.py` script to fetch data and populate the vector database:

```bash
python create_chromadb.py
```

Start the main application to query documents:

```bash
python app.py
```

In `app.py`, modify the `question` variable to query specific topics:

```python
question = "hospitals"
```

The system will retrieve relevant documents and grade their relevance.
## How It Works

1. Data Gathering:
   - The `data_gatherer.py` script fetches JSON responses from APIs.
   - These responses are processed into LangChain-compatible document objects.
2. Embedding and Storage:
   - The `build_rag.py` script uses HuggingFace BGE embeddings (`BAAI/bge-base-en-v1.5`) to create document embeddings.
   - Embeddings are stored persistently in ChromaDB.
3. Document Retrieval:
   - A retriever is created using LangChain's Chroma integration.
   - Documents are retrieved based on semantic similarity to user queries.
4. Grading Mechanism:
   - The Groq LLM grades retrieved documents for relevance using a structured prompt.
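The grading step above can be sketched as a filter over the retrieved documents. In the real project the grader is Groq's Llama3-70b-8192 answering a structured relevance prompt; here a simple keyword stub stands in for the LLM so the control flow can run without an API key.

```python
def grade_document(question, document):
    # Hypothetical stand-in for the LLM grader: answers "yes" if any query
    # term appears in the document, mirroring the binary relevance grade
    # the structured prompt asks the model for.
    terms = question.lower().split()
    return "yes" if any(t in document.lower() for t in terms) else "no"

def filter_relevant(question, retrieved_docs):
    # Keep only documents the grader marks relevant; the discarded ones are
    # what a self-corrective loop would re-retrieve or rephrase the query for.
    return [d for d in retrieved_docs if grade_document(question, d) == "yes"]

retrieved = [
    "hospitals in the region reported higher admissions",
    "quarterly revenue grew by 4 percent",
]
relevant = filter_relevant("hospitals", retrieved)
print(relevant)
```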
## Contributing

Contributions are welcome! Please follow these steps:

- Fork this repository.
- Create a new branch: `git checkout -b feature-name`.
- Commit your changes: `git commit -m 'Add feature-name'`.
- Push to your branch: `git push origin feature-name`.
- Submit a pull request.
## License

This project is licensed under the MIT License. See the `LICENSE` file for details.