Skip to content

MahajanSamiksha/RAG-KM

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

14 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

# Retrieval-Augmented Generation (RAG) Pipeline Implementation

## Project Overview

This project demonstrates a **Retrieval-Augmented Generation (RAG)** pipeline to retrieve and generate answers from a local repository of documents.

### Objectives

- Efficiently load and preprocess textual data
- Convert text into embeddings using state-of-the-art models
- Enable semantic retrieval of relevant content
- Generate precise and context-aware answers using a language model

---

## Tasks & Deliverables

### Task 1: Data Creation

1.1 Document Loading & Text Extraction
- Load documents (PDF, DOCX, TXT) from local storage
- Extract text content using LangChain loaders

1.2 Data Cleaning & Preprocessing
- Tokenization, punctuation removal, and spell correction
- Tools: `gensim.simple_preprocess`, `TextBlob`

1.3 Text Chunking
- Split documents into smaller chunks for efficient embedding and retrieval
- Tool: `LangChain`'s `RecursiveCharacterTextSplitter`

### Task 2: Vector Embedding

2.1 Generate Embeddings
- Convert text chunks into embeddings
- Tool: `OpenAI Embeddings API` (`text-embedding-ada-002`)

2.2 Embedding Storage
- Store embeddings in a vector database
- Tool: `FAISS`

### Task 3: Retrieval & Answer Generation

- Retrieve semantically relevant chunks using FAISS
- Generate answers using `OpenAI GPT-4`

### Task 4: User Interface

- Provide a minimal UI to enter queries and display generated responses
- Tool: `Streamlit`

---

## Technology Stack

| Task                         | Tools & Libraries                                         |
|-----------------------------|-----------------------------------------------------------|
| Document Loading & Extraction | LangChain                                                |
| Text Preprocessing           | Gensim, TextBlob                                          |
| Embedding Generation         | OpenAI Embeddings API (`text-embedding-ada-002`)         |
| Embedding Storage            | FAISS                                                     |
| Query & Retrieval            | FAISS, LangChain                                          |
| Answer Generation            | OpenAI GPT-4                                              |
| User Interface               | Streamlit                                                 |

---

## Assumptions

- Necessary APIs (e.g., OpenAI API key) are available and properly configured.

---

## Installation Instructions

Install all required dependencies:

```bash
pip install gensim textblob langchain-community pdfplumber python-docx pandas openpyxl python-pptx faiss-cpu llama-index
pip install llama-index-vector-stores-faiss
pip install -U langchain-openai
pip install streamlit

Initialize the TextBlob spell correction model:

python -m textblob.download_corpora

Library Versions

Library Version
openai 1.67.0
gensim 4.3.3
textblob 0.19.0
langchain-community 0.3.20
pdfplumber 0.11.5
python-docx 1.1.2
pandas 2.2.3
openpyxl 3.1.5
python-pptx 1.0.2
faiss-cpu 1.10.0
llama-index 0.12.24
llama-index-vector-stores-faiss 0.3.0
langchain-openai 0.3.9
langchain 0.3.21
streamlit 1.41.1
langGraph 0.2.67

About

RAG pipeline for Knowledge management

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages