GitHub - MahajanSamiksha/RAG-KM: RAG pipeline for Knowledge management

# Retrieval-Augmented Generation (RAG) Pipeline Implementation

## Project Overview

This project demonstrates a **Retrieval-Augmented Generation (RAG)** pipeline to retrieve and generate answers from a local repository of documents.

### Objectives

- Efficiently load and preprocess textual data
- Convert text into embeddings using state-of-the-art models
- Enable semantic retrieval of relevant content
- Generate precise and context-aware answers using a language model

---

## Tasks & Deliverables

### Task 1: Data Creation

1.1 Document Loading & Text Extraction
- Load documents (PDF, DOCX, TXT) from local storage
- Extract text content using LangChain loaders

1.2 Data Cleaning & Preprocessing
- Tokenization, punctuation removal, and spell correction
- Tools: `gensim.simple_preprocess`, `TextBlob`

1.3 Text Chunking
- Split documents into smaller chunks for efficient embedding and retrieval
- Tool: `LangChain`'s `RecursiveCharacterTextSplitter`

### Task 2: Vector Embedding

2.1 Generate Embeddings
- Convert text chunks into embeddings
- Tool: `OpenAI Embeddings API` (`text-embedding-ada-002`)

2.2 Embedding Storage
- Store embeddings in a vector database
- Tool: `FAISS`

### Task 3: Retrieval & Answer Generation

- Retrieve semantically relevant chunks using FAISS
- Generate answers using `OpenAI GPT-4`

### Task 4: User Interface

- Provide a minimal UI to enter queries and display generated responses
- Tool: `Streamlit`

---

## Technology Stack

| Task                         | Tools & Libraries                                         |
|-----------------------------|-----------------------------------------------------------|
| Document Loading & Extraction | LangChain                                                |
| Text Preprocessing           | Gensim, TextBlob                                          |
| Embedding Generation         | OpenAI Embeddings API (`text-embedding-ada-002`)         |
| Embedding Storage            | FAISS                                                     |
| Query & Retrieval            | FAISS, LangChain                                          |
| Answer Generation            | OpenAI GPT-4                                              |
| User Interface               | Streamlit                                                 |

---

## Assumptions

- Necessary APIs (e.g., OpenAI API key) are available and properly configured.

---

## Installation Instructions

Install all required dependencies:

```bash
pip install gensim textblob langchain-community pdfplumber python-docx pandas openpyxl python-pptx faiss-cpu llama-index
pip install llama-index-vector-stores-faiss
pip install -U langchain-openai
pip install streamlit

Initialize the TextBlob spell correction model:

python -m textblob.download_corpora

Library Versions

Library	Version
openai	1.67.0
gensim	4.3.3
textblob	0.19.0
langchain-community	0.3.20
pdfplumber	0.11.5
python-docx	1.1.2
pandas	2.2.3
openpyxl	3.1.5
python-pptx	1.0.2
faiss-cpu	1.10.0
llama-index	0.12.24
llama-index-vector-stores-faiss	0.3.0
langchain-openai	0.3.9
langchain	0.3.21
streamlit	1.41.1
langGraph	0.2.67

Name		Name	Last commit message	Last commit date
Latest commit History 14 Commits
code		code
data		data
RAG_Conversation_Memory_Roadmap.md		RAG_Conversation_Memory_Roadmap.md
README.md		README.md
buildspec.yml		buildspec.yml
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Library Versions

About

Uh oh!

Releases

Packages

Languages

MahajanSamiksha/RAG-KM

Folders and files

Latest commit

History

Repository files navigation

Library Versions

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages