A comprehensive command-line RAG (Retrieval-Augmented Generation) system for document-based question answering, built on LangChain and GROQ with support for multiple vector databases (ChromaDB, Weaviate, and Qdrant).
- Web Interface: Modern Streamlit-based web UI for all operations
- Document Ingestion: Support for PDF, DOCX, TXT, and JSON files
- Vector Search: Support for ChromaDB, Weaviate, and Qdrant vector databases
- AI Integration: GROQ API for text generation
- Interactive Queries: Web chat interface and CLI modes
- Statistics: Real-time database analytics and monitoring
- Dual Interface: Both web UI and comprehensive CLI tools
- Configurable: YAML-based configuration system
- Timestamped Logs: Detailed logging with unique timestamps
- Python 3.8+
- GROQ API Key
1. Clone the repository:

   ```bash
   git clone <repository-url>
   cd rag-pipeline
   ```

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Set up environment variables:

   ```bash
   export GROQ_API_KEY="your_groq_api_key"
   # Optional: For Weaviate Cloud
   export WEAVIATE_API_KEY="your_weaviate_api_key"
   ```

4. Optional: install as a package:

   ```bash
   pip install -e .
   ```
This project uses a `.env` file to manage secrets and API keys securely. A template file named `.env.example` is provided in the project root.

Setup instructions:

1. Copy `.env.example` to `.env` in the project root:

   ```bash
   cp .env.example .env
   ```

2. Open `.env` and fill in your API keys and any other required secrets. For example:

   ```bash
   GROQ_API_KEY=your_actual_groq_api_key_here
   # Add other keys as needed
   ```

3. Do not commit your `.env` file to version control.

The application automatically loads environment variables from `.env` at startup.
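A minimal sketch of how this kind of startup loading typically works, assuming the `python-dotenv` package (the project's actual mechanism may differ):

```python
import os

from dotenv import load_dotenv  # pip install python-dotenv

# Read key=value pairs from .env into the process environment.
load_dotenv()

groq_key = os.getenv("GROQ_API_KEY")
if not groq_key:
    raise RuntimeError("GROQ_API_KEY is not set; copy .env.example to .env and fill it in.")
```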
The RAG Pipeline supports multiple vector database providers. You can choose between ChromaDB, Weaviate, and Qdrant by editing the `config/config.yaml` file.
ChromaDB is the default vector database and requires no additional setup:
```yaml
vector_db:
  provider: "chromadb"
  path: "./data/vectors"
  collection_name: "documents"
  distance_metric: "cosine"
```
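For reference, this is roughly what those settings translate to in the `chromadb` Python client (an illustrative sketch, not the project's actual wiring):

```python
import chromadb

# Mirrors the config above: a persistent store on disk with cosine distance.
client = chromadb.PersistentClient(path="./data/vectors")
collection = client.get_or_create_collection(
    name="documents",
    metadata={"hnsw:space": "cosine"},  # distance_metric: "cosine"
)
print(collection.count())
```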
For Weaviate, you have two options:

Option 1: Local instance via Docker

1. Start Weaviate using Docker Compose:

   ```bash
   docker-compose -f docker-compose.weaviate.yml up -d
   ```

2. Update your configuration:

   ```yaml
   vector_db:
     provider: "weaviate"
     url: "http://localhost:8080"
     class_name: "Document"
   ```

Option 2: Weaviate Cloud

1. Sign up for Weaviate Cloud Services
2. Get your API key and cluster URL
3. Update your configuration:

   ```yaml
   vector_db:
     provider: "weaviate"
     url: "https://your-cluster-url.weaviate.network"
     api_key: "your-api-key"
     class_name: "Document"
   ```
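As a rough illustration of what these settings correspond to in the `weaviate-client` library (v3-style API shown; v4 changed the client interface, and the pipeline's internal usage may differ):

```python
import os

import weaviate  # pip install weaviate-client

# Local instance: no auth needed.
local_client = weaviate.Client(url="http://localhost:8080")

# Weaviate Cloud: pass the API key from the environment.
cloud_client = weaviate.Client(
    url="https://your-cluster-url.weaviate.network",
    auth_client_secret=weaviate.AuthApiKey(api_key=os.environ["WEAVIATE_API_KEY"]),
)
print(local_client.is_ready())
```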
For Qdrant, you have two options:

Option 1: Local instance via Docker

1. Start Qdrant using Docker Compose:

   ```bash
   docker-compose -f docker-compose.qdrant.yml up -d
   ```

2. Update your configuration:

   ```yaml
   vector_db:
     provider: "qdrant"
     url: "http://localhost:6333"
     collection_name: "documents"
     vector_size: 384
   ```

Option 2: Qdrant Cloud

1. Sign up for Qdrant Cloud Services
2. Get your API key and cluster URL
3. Update your configuration:

   ```yaml
   vector_db:
     provider: "qdrant"
     url: "https://your-cluster-url.qdrant.tech"
     api_key: "your-api-key"
     collection_name: "documents"
     vector_size: 384
   ```
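Correspondingly, a sketch of the same settings expressed with the `qdrant-client` library (illustrative only; the cosine distance metric is an assumption, since the config above does not state one):

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

# Local instance started via docker-compose above.
client = QdrantClient(url="http://localhost:6333")

# Matches collection_name and vector_size from the config
# (384 is the output dimension of many sentence-transformer embedding models).
if not client.collection_exists("documents"):
    client.create_collection(
        collection_name="documents",
        vectors_config=VectorParams(size=384, distance=Distance.COSINE),
    )
```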
If you have existing data and want to migrate between vector databases:
```bash
# Migrate from ChromaDB to Weaviate
python scripts/migrate_to_weaviate.py --backup --validate

# Migrate from ChromaDB to Qdrant
python scripts/migrate_to_qdrant.py --backup --validate

# Dry run to see what would be migrated
python scripts/migrate_to_weaviate.py --dry-run
python scripts/migrate_to_qdrant.py --dry-run
```
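Conceptually, such a migration moves records between stores along these lines (a simplified sketch, not the actual scripts, which also handle backup and validation):

```python
import chromadb
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

source = chromadb.PersistentClient(path="./data/vectors").get_collection("documents")
target = QdrantClient(url="http://localhost:6333")

# Pull embeddings, text, and metadata out of ChromaDB...
batch = source.get(include=["embeddings", "documents", "metadatas"])

# ...and upsert them into Qdrant under the same collection name.
# Qdrant point ids must be ints or UUIDs, so records are renumbered here.
points = [
    PointStruct(id=i, vector=list(vec), payload={"text": doc, **(meta or {})})
    for i, (vec, doc, meta) in enumerate(
        zip(batch["embeddings"], batch["documents"], batch["metadatas"])
    )
]
target.upsert(collection_name="documents", points=points)
```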
```bash
# Initialize and test the system
python main.py init

# Show help
python main.py --help

# Show available commands
python main.py list

# Ingest documents from a directory
python main.py ingest -d ./docs

# Ingest a single file
python main.py ingest -f document.pdf

# Ingest with verbose output
python main.py ingest -d ./docs --verbose

# Ask a question
python main.py query "What is machine learning?"

# Query with verbose output (shows source details)
python main.py query "Explain neural networks" --verbose

# Interactive mode
python main.py interactive

# Show database statistics
python main.py stats

# Clear all documents (with confirmation)
python main.py clear

# Clear without confirmation
python main.py clear --confirm

# Use custom config file
python main.py init --config /path/to/config.yaml

# Skip test query during initialization
python main.py init --no-test
```
Launch the comprehensive web-based interface for a user-friendly experience:
```bash
# Start the Streamlit web app
streamlit run app.py

# Or with custom port
streamlit run app.py --server.port 8502
```
- Dashboard: System overview and quick actions
- Initialize: Web-based system initialization
- Ingest Documents:
  - Upload files directly through the browser
  - Specify directory paths
  - Drag-and-drop support for multiple files
- Chat Interface: Interactive conversational AI with chat history
- Single Query: Detailed query interface with source analysis
- Statistics: Real-time database analytics and visualizations
- Clear Database: Safe database clearing with confirmations
- System Info: Configuration and system status overview
The web interface provides all CLI functionality through an intuitive, modern UI accessible at `http://localhost:8501`.
Start CLI interactive mode for conversational queries:
```bash
python main.py interactive
```
Interactive commands:
- `/stats` - Show database statistics
- `/help` - Show help
- `/quit` - Exit interactive mode
```bash
# 1. Initialize the system
python main.py init

# 2. Add documents
python main.py ingest -d ./data/raw

# 3. Query the system
python main.py query "What are the main topics in the documents?"

# 4. Check statistics
python main.py stats
```
```bash
# Verbose ingestion with timing
python main.py ingest -d ./research_papers --verbose

# Query with source details
python main.py query "Explain the methodology" --verbose --max-results 10

# Interactive session
python main.py interactive
```
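The workflow above can also be driven from a short script. A sketch using only the CLI calls documented in this README:

```python
import subprocess

# Run the documented CLI steps in order, stopping on the first failure.
for args in (
    ["python", "main.py", "init"],
    ["python", "main.py", "ingest", "-d", "./data/raw"],
    ["python", "main.py", "query", "What are the main topics in the documents?"],
    ["python", "main.py", "stats"],
):
    subprocess.run(args, check=True)
```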
Edit `config/config.yaml` to customize:
```yaml
# Logging Configuration
logging:
  level: "INFO"
  format: "%(asctime)s - %(levelname)s - %(name)s:%(lineno)d - %(message)s"
  path: "./logs"

# LLM Configuration
llm:
  model: "llama-3.1-8b-instant"
  temperature: 0.7
  max_tokens: 1000

# Vector Database Configuration
vector_db:
  path: "./data/vectors"
  collection_name: "documents"
```
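Reading this file from Python is a one-liner with PyYAML; a sketch of what `config_loader.py` presumably does (the module's real interface may differ):

```python
import yaml

with open("config/config.yaml", encoding="utf-8") as f:
    config = yaml.safe_load(f)

# Nested keys map directly onto the YAML structure above.
print(config["llm"]["model"])        # "llama-3.1-8b-instant"
print(config["vector_db"]["path"])   # "./data/vectors"
```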
```text
rag-pipeline/
├── main.py                  # Enhanced CLI entry point
├── app.py                   # Streamlit web interface
├── setup.py                 # Package setup (legacy)
├── pyproject.toml           # Modern package configuration
├── requirements.txt         # Dependencies
├── rag.bat                  # Windows CLI launcher
├── rag.sh                   # Linux/Mac CLI launcher
├── src/
│   ├── utils/
│   │   ├── init_manager.py      # Logging initialization
│   │   ├── log_manager.py       # Log management utilities
│   │   └── config_loader.py     # Configuration management
│   ├── ingestion/
│   │   └── document_loader.py   # Document processing
│   └── rag_pipeline.py          # Core RAG functionality
├── config/
│   └── config.yaml              # System configuration
├── data/
│   ├── raw/                     # Input documents
│   ├── processed/               # Processed documents
│   └── vectors/                 # Vector database
├── logs/                        # Timestamped log files
└── docs/                        # Documentation
```
- PDF (.pdf) - Extracted using PyPDF2
- Word (.docx) - Processed with python-docx
- Text (.txt) - Plain text files
- JSON (.json) - Structured data files
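A condensed sketch of how these formats can be read with the libraries named above (PyPDF2, python-docx, and the standard library); the project's `document_loader.py` likely does something similar:

```python
import json
from pathlib import Path

from PyPDF2 import PdfReader
from docx import Document


def load_text(path: str) -> str:
    """Extract plain text from a supported file based on its extension."""
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        return "\n".join(page.extract_text() or "" for page in PdfReader(path).pages)
    if suffix == ".docx":
        return "\n".join(p.text for p in Document(path).paragraphs)
    if suffix == ".json":
        return json.dumps(json.loads(Path(path).read_text(encoding="utf-8")), indent=2)
    if suffix == ".txt":
        return Path(path).read_text(encoding="utf-8")
    raise ValueError(f"Unsupported file type: {suffix}")
```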
The system creates timestamped log files in the format `log_YYMMDD_HHMM.log` (e.g., `log_250723_1430.log`).

- A new log file is created for each session
- Configurable via `config.yaml`
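The timestamped filename can be reproduced with `datetime.strftime`; a sketch of the naming scheme, assuming it mirrors what `init_manager.py` does:

```python
import logging
from datetime import datetime
from pathlib import Path

# log_YYMMDD_HHMM.log, e.g. log_250723_1430.log
log_file = Path("./logs") / f"log_{datetime.now():%y%m%d_%H%M}.log"
log_file.parent.mkdir(parents=True, exist_ok=True)

logging.basicConfig(
    filename=log_file,
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(name)s:%(lineno)d - %(message)s",
)
```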
```bash
# Run with test data
python main.py init

# Test individual components
python main.py ingest -f ./data/raw/sample.pdf
python main.py query "Test question"
python main.py stats
```
1. Missing GROQ API key:

   ```bash
   export GROQ_API_KEY="your_api_key_here"
   ```

2. Dependencies not installed:

   ```bash
   pip install -r requirements.txt
   ```

3. No documents found:

   ```bash
   python main.py ingest -d ./your_documents_directory
   ```

4. Permission errors: ensure write permissions for the `logs/` and `data/` directories.
```bash
# Enable verbose output
python main.py --verbose <command>

# Check logs
tail -f logs/log_*.log
```
MIT License - see LICENSE file for details.
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests
- Submit a pull request
- Documentation: Check the `docs/` directory
- Issues: Report bugs via GitHub issues
- Questions: Use GitHub discussions