# Retrieval-Augmented Generation (RAG) Pipeline Implementation
## Project Overview
This project demonstrates a **Retrieval-Augmented Generation (RAG)** pipeline to retrieve and generate answers from a local repository of documents.
### Objectives
- Efficiently load and preprocess textual data
- Convert text into embeddings using state-of-the-art models
- Enable semantic retrieval of relevant content
- Generate precise and context-aware answers using a language model
---
## Tasks & Deliverables
### Task 1: Data Creation
1.1 Document Loading & Text Extraction
- Load documents (PDF, DOCX, TXT) from local storage
- Extract text content using LangChain loaders
1.2 Data Cleaning & Preprocessing
- Tokenization, punctuation removal, and spell correction
- Tools: `gensim.simple_preprocess`, `TextBlob`
1.3 Text Chunking
- Split documents into smaller chunks for efficient embedding and retrieval
- Tool: `LangChain`'s `RecursiveCharacterTextSplitter`
### Task 2: Vector Embedding
2.1 Generate Embeddings
- Convert text chunks into embeddings
- Tool: `OpenAI Embeddings API` (`text-embedding-ada-002`)
2.2 Embedding Storage
- Store embeddings in a vector database
- Tool: `FAISS`
### Task 3: Retrieval & Answer Generation
- Retrieve semantically relevant chunks using FAISS
- Generate answers using `OpenAI GPT-4`
### Task 4: User Interface
- Provide a minimal UI to enter queries and display generated responses
- Tool: `Streamlit`
---
## Technology Stack
| Task | Tools & Libraries |
|-----------------------------|-----------------------------------------------------------|
| Document Loading & Extraction | LangChain |
| Text Preprocessing | Gensim, TextBlob |
| Embedding Generation | OpenAI Embeddings API (`text-embedding-ada-002`) |
| Embedding Storage | FAISS |
| Query & Retrieval | FAISS, LangChain |
| Answer Generation | OpenAI GPT-4 |
| User Interface | Streamlit |
---
## Assumptions
- Necessary APIs (e.g., OpenAI API key) are available and properly configured.
---
## Installation Instructions
Install all required dependencies:
```bash
pip install gensim textblob langchain-community pdfplumber python-docx pandas openpyxl python-pptx faiss-cpu llama-index
pip install llama-index-vector-stores-faiss
pip install -U langchain-openai
pip install streamlit
Initialize the TextBlob spell correction model:
python -m textblob.download_corpora
Library | Version |
---|---|
openai | 1.67.0 |
gensim | 4.3.3 |
textblob | 0.19.0 |
langchain-community | 0.3.20 |
pdfplumber | 0.11.5 |
python-docx | 1.1.2 |
pandas | 2.2.3 |
openpyxl | 3.1.5 |
python-pptx | 1.0.2 |
faiss-cpu | 1.10.0 |
llama-index | 0.12.24 |
llama-index-vector-stores-faiss | 0.3.0 |
langchain-openai | 0.3.9 |
langchain | 0.3.21 |
streamlit | 1.41.1 |
langGraph | 0.2.67 |