bsoberoi/RAG-Pipeline

🧠 RAG Pipeline - Retrieval-Augmented Generation System

A Retrieval-Augmented Generation (RAG) system for document-based question answering, with a command-line interface and a Streamlit web UI. It supports multiple vector databases (ChromaDB, Weaviate, and Qdrant) and is built on LangChain and the GROQ API.

🚀 Features

  • 🌐 Web Interface: Modern Streamlit-based web UI for all operations
  • 📚 Document Ingestion: Support for PDF, DOCX, TXT, and JSON files
  • 🔍 Vector Search: Support for ChromaDB, Weaviate, and Qdrant vector databases
  • 🤖 AI Integration: GROQ API for text generation
  • 💬 Interactive Queries: Web chat interface and CLI modes
  • 📊 Statistics: Real-time database analytics and monitoring
  • 🎯 Dual Interface: Both web UI and comprehensive CLI tools
  • ⚙️ Configurable: YAML-based configuration system
  • 📝 Timestamped Logs: Detailed logging with unique timestamps

📋 Prerequisites

  • Python 3.8+
  • GROQ API Key

🛠️ Installation

  1. Clone the repository

    git clone <repository-url>
    cd rag-pipeline
  2. Install dependencies

    pip install -r requirements.txt
  3. Set up environment variables

    export GROQ_API_KEY="your_groq_api_key"
    # Optional: For Weaviate Cloud
    export WEAVIATE_API_KEY="your_weaviate_api_key"
  4. Optional: Install as package

    pip install -e .

Environment Variables and API Keys

This project uses a .env file to manage secrets and API keys securely. A template file named .env.example is provided in the project root.

Setup Instructions:

  1. Copy .env.example to .env in the project root:
    cp .env.example .env
  2. Open .env and fill in your API keys and any other required secrets. For example:
    GROQ_API_KEY=your_actual_groq_api_key_here
    # Add other keys as needed
  3. Do not commit your .env file to version control.

The application will automatically load environment variables from .env at startup.
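For illustration, loading a .env file at startup amounts to roughly the following. This is a standard-library sketch, not the project's code: the project may well use a library such as python-dotenv instead, and the helper name `load_env_file` is hypothetical.

```python
import os
from pathlib import Path


def load_env_file(path=".env", override=False):
    """Read KEY=VALUE lines from a .env file into os.environ.

    Hypothetical helper: blank lines and '#' comments are skipped, and
    variables that are already set are kept unless override=True.
    Returns the variables that were actually set.
    """
    loaded = {}
    env_path = Path(path)
    if not env_path.is_file():
        return loaded  # a missing .env file is not treated as an error
    for raw in env_path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip().strip('"').strip("'")
        if override or key not in os.environ:
            os.environ[key] = value
            loaded[key] = value
    return loaded
```

Keeping already-exported variables by default means a key set with `export GROQ_API_KEY=...` in the shell takes precedence over the value in .env.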

Vector Database Configuration

The RAG Pipeline supports multiple vector database providers. You can choose between ChromaDB, Weaviate, and Qdrant by configuring the config/config.yaml file.

ChromaDB (Default)

ChromaDB is the default vector database and requires no additional setup:

vector_db:
  provider: "chromadb"
  path: "./data/vectors"
  collection_name: "documents"
  distance_metric: "cosine"
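With `distance_metric: "cosine"`, nearest neighbors are ranked by cosine distance, commonly defined as 1 minus the cosine similarity of two embedding vectors. The vector database computes this internally; the pure-Python sketch below only illustrates the arithmetic, and the function name is an assumption.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)
```

Vectors pointing in the same direction have a distance near 0, orthogonal vectors a distance of 1, and opposite vectors a distance of 2, regardless of their lengths.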

Weaviate

For Weaviate, you have two options:

Option 1: Local Weaviate Instance

  1. Start Weaviate using Docker Compose:

    docker-compose -f docker-compose.weaviate.yml up -d
  2. Update your configuration:

    vector_db:
      provider: "weaviate"
      url: "http://localhost:8080"
      class_name: "Document"

Option 2: Weaviate Cloud

  1. Sign up for Weaviate Cloud Services
  2. Get your API key and cluster URL
  3. Update your configuration:
    vector_db:
      provider: "weaviate"
      url: "https://your-cluster-url.weaviate.network"
      api_key: "your-api-key"
      class_name: "Document"

Qdrant

For Qdrant, you have two options:

Option 1: Local Qdrant Instance

  1. Start Qdrant using Docker Compose:

    docker-compose -f docker-compose.qdrant.yml up -d
  2. Update your configuration:

    vector_db:
      provider: "qdrant"
      url: "http://localhost:6333"
      collection_name: "documents"
      vector_size: 384

Option 2: Qdrant Cloud

  1. Sign up for Qdrant Cloud Services
  2. Get your API key and cluster URL
  3. Update your configuration:
    vector_db:
      provider: "qdrant"
      url: "https://your-cluster-url.qdrant.tech"
      api_key: "your-api-key"
      collection_name: "documents"
      vector_size: 384
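The provider-specific keys shown in the configuration examples above can be checked before the pipeline starts. The sketch below assumes the `vector_db` section has already been parsed into a dictionary; `REQUIRED_KEYS` and `validate_vector_db` are hypothetical names, and the key sets are taken only from the examples in this README, not from the project's code.

```python
# Keys each provider's example configuration uses (assumed to be required).
REQUIRED_KEYS = {
    "chromadb": {"path", "collection_name"},
    "weaviate": {"url", "class_name"},
    "qdrant": {"url", "collection_name", "vector_size"},
}


def validate_vector_db(section):
    """Return the provider name, or raise ValueError on a bad section."""
    # ChromaDB is documented as the default provider.
    provider = section.get("provider", "chromadb")
    if provider not in REQUIRED_KEYS:
        known = ", ".join(sorted(REQUIRED_KEYS))
        raise ValueError(f"unknown provider {provider!r}; expected one of: {known}")
    missing = sorted(REQUIRED_KEYS[provider] - set(section))
    if missing:
        raise ValueError(f"{provider} config is missing: {', '.join(missing)}")
    return provider
```

Optional keys such as `api_key` are deliberately not required here, since the local Weaviate and Qdrant setups omit them.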

Migration Between Vector Databases

If you have existing data and want to migrate between vector databases:

# Migrate from ChromaDB to Weaviate
python scripts/migrate_to_weaviate.py --backup --validate

# Migrate from ChromaDB to Qdrant
python scripts/migrate_to_qdrant.py --backup --validate

# Dry run to see what would be migrated
python scripts/migrate_to_weaviate.py --dry-run
python scripts/migrate_to_qdrant.py --dry-run
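The difference between a dry run, a real migration, and a validated migration can be sketched as follows. This is not the logic of the migration scripts, whose internals are not shown here; it is a minimal in-memory illustration in which records are dictionaries with an `id` field and the target store is a plain dictionary. All names are hypothetical, and the backup step is omitted.

```python
def migrate(source_records, target, dry_run=False, validate=False):
    """Copy records into target (a dict keyed by record id).

    Returns a summary: how many records were planned, how many were
    written, and whether validation passed (None if it was not run).
    """
    summary = {"planned": len(source_records), "written": 0, "validated": None}
    if dry_run:
        return summary  # report only, write nothing
    for record in source_records:
        target[record["id"]] = record
        summary["written"] += 1
    if validate:
        # Every source record must be retrievable unchanged from the target.
        summary["validated"] = all(
            target.get(record["id"]) == record for record in source_records
        )
    return summary
```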

🎯 CLI Usage

Quick Start

# Initialize and test the system
python main.py init

# Show help
python main.py --help

# Show available commands
python main.py list

Document Ingestion

# Ingest documents from a directory
python main.py ingest -d ./docs

# Ingest a single file
python main.py ingest -f document.pdf

# Ingest with verbose output
python main.py ingest -d ./docs --verbose

Querying

# Ask a question
python main.py query "What is machine learning?"

# Query with verbose output (shows source details)
python main.py query "Explain neural networks" --verbose

# Interactive mode
python main.py interactive

Database Management

# Show database statistics
python main.py stats

# Clear all documents (with confirmation)
python main.py clear

# Clear without confirmation
python main.py clear --confirm

Configuration

# Use custom config file
python main.py init --config /path/to/config.yaml

# Skip test query during initialization
python main.py init --no-test

🌐 Web Interface (Streamlit)

Launch the Streamlit web interface:

# Start the Streamlit web app
streamlit run app.py

# Or with custom port
streamlit run app.py --server.port 8502

Web Interface Features

  • 🏠 Dashboard: System overview and quick actions
  • 🚀 Initialize: Web-based system initialization
  • 📚 Ingest Documents:
    • Upload files directly through the browser
    • Specify directory paths
    • Drag-and-drop support for multiple files
  • 💬 Chat Interface: Interactive conversational AI with chat history
  • ❓ Single Query: Detailed query interface with source analysis
  • 📊 Statistics: Real-time database analytics and visualizations
  • 🗑️ Clear Database: Safe database clearing with confirmations
  • 📋 System Info: Configuration and system status overview

The web interface exposes all CLI functionality and is served at http://localhost:8501 by default, or on the port given with --server.port.

🎮 Interactive Mode (CLI)

Start CLI interactive mode for conversational queries:

python main.py interactive

Interactive commands:

  • /stats - Show database statistics
  • /help - Show help
  • /quit - Exit interactive mode
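The slash commands above suggest a simple dispatch: lines starting with a known command are handled locally, and anything else is treated as a question. The sketch below shows one way to structure that; `handle_line` and its callback parameters are hypothetical and not taken from main.py.

```python
def handle_line(line, stats_fn, answer_fn):
    """Process one line of interactive input.

    Returns (output, keep_running). stats_fn() and answer_fn(question)
    are callbacks standing in for the real statistics and query calls.
    """
    text = line.strip()
    if text == "/quit":
        return "Goodbye.", False
    if text == "/help":
        return "Commands: /stats, /help, /quit. Anything else is sent as a question.", True
    if text == "/stats":
        return stats_fn(), True
    if not text:
        return "", True  # ignore empty input
    return answer_fn(text), True
```

A read loop would call `handle_line` on each input line, print the output, and stop once `keep_running` is False. Keeping the dispatch free of `input()` and `print()` makes it easy to test.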

📊 Examples

Basic Workflow

# 1. Initialize the system
python main.py init

# 2. Add documents
python main.py ingest -d ./data/raw

# 3. Query the system
python main.py query "What are the main topics in the documents?"

# 4. Check statistics
python main.py stats

Advanced Usage

# Verbose ingestion with timing
python main.py ingest -d ./research_papers --verbose

# Query with source details
python main.py query "Explain the methodology" --verbose --max-results 10

# Interactive session
python main.py interactive

🔧 Configuration

Edit config/config.yaml to customize:

# Logging Configuration
logging:
  level: "INFO"
  format: "%(asctime)s - %(levelname)s - %(name)s:%(lineno)d - %(message)s"
  path: "./logs"

# LLM Configuration
llm:
  model: "llama-3.1-8b-instant"
  temperature: 0.7
  max_tokens: 1000

# Vector Database Configuration
vector_db:
  path: "./data/vectors"
  collection_name: "documents"
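A configuration file like this is often combined with built-in defaults so that a user only needs to list the values they want to change. Whether the project's config loader does this is not stated here; the sketch below assumes it and uses hypothetical names (`DEFAULTS`, `merge_config`), with default values copied from the example above.

```python
import copy

# Defaults mirroring the example configuration above.
DEFAULTS = {
    "logging": {"level": "INFO", "path": "./logs"},
    "llm": {"model": "llama-3.1-8b-instant", "temperature": 0.7, "max_tokens": 1000},
    "vector_db": {"path": "./data/vectors", "collection_name": "documents"},
}


def merge_config(defaults, overrides):
    """Return defaults updated with overrides, merging nested sections."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged
```

For example, `merge_config(DEFAULTS, {"llm": {"temperature": 0.2}})` changes only the temperature and leaves the model and token limit at their defaults.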

📁 Project Structure

rag-pipeline/
├── main.py                 # Enhanced CLI entry point
├── app.py                  # Streamlit web interface
├── setup.py               # Package setup (legacy)
├── pyproject.toml         # Modern package configuration
├── requirements.txt       # Dependencies
├── rag.bat               # Windows CLI launcher
├── rag.sh                # Linux/Mac CLI launcher
├── src/
│   ├── utils/
│   │   ├── init_manager.py    # Logging initialization
│   │   ├── log_manager.py     # Log management utilities
│   │   └── config_loader.py   # Configuration management
│   ├── ingestion/
│   │   └── document_loader.py # Document processing
│   └── rag_pipeline.py        # Core RAG functionality
├── config/
│   └── config.yaml            # System configuration
├── data/
│   ├── raw/                   # Input documents
│   ├── processed/             # Processed documents
│   └── vectors/               # Vector database
├── logs/                      # Timestamped log files
└── docs/                      # Documentation

🔍 Supported File Formats

  • PDF (.pdf) - Extracted using PyPDF2
  • Word (.docx) - Processed with python-docx
  • Text (.txt) - Plain text files
  • JSON (.json) - Structured data files
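Directory ingestion needs to pick out the files with a supported extension. The sketch below shows one way to do that with the standard library; it is not the code in document_loader.py, the names `SUPPORTED_EXTENSIONS`, `is_supported`, and `find_documents` are hypothetical, and the actual parsing (PyPDF2 for PDF, python-docx for Word) is left out.

```python
from pathlib import Path

# Extensions listed in this README as supported.
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".txt", ".json"}


def is_supported(path):
    """True if the file extension is one of the supported formats."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS


def find_documents(directory):
    """Return supported files under directory, searched recursively."""
    return sorted(
        p for p in Path(directory).rglob("*") if p.is_file() and is_supported(p)
    )
```

Comparing the lowercased suffix means files such as `REPORT.PDF` are picked up as well.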

📈 Log Files

The system writes a new timestamped log file for each session:

  • File name format: log_YYMMDD_HHMM.log (e.g., log_250723_1430.log)
  • Log level, format, and directory are configurable in config/config.yaml
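The file name pattern maps directly onto a `strftime` format string. A minimal sketch, with a hypothetical function name:

```python
from datetime import datetime


def log_filename(now=None):
    """Build a log file name in the log_YYMMDD_HHMM.log pattern."""
    now = now or datetime.now()
    return now.strftime("log_%y%m%d_%H%M.log")


# Reproduces the example name given above.
print(log_filename(datetime(2025, 7, 23, 14, 30)))  # log_250723_1430.log
```

Because the name has minute resolution, two sessions started within the same minute would map to the same file name under this scheme.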

🧪 Testing

# Run with test data
python main.py init

# Test individual components
python main.py ingest -f ./data/raw/sample.pdf
python main.py query "Test question"
python main.py stats

🚦 Troubleshooting

Common Issues

  1. Missing GROQ API Key

    export GROQ_API_KEY="your_api_key_here"
  2. Dependencies not installed

    pip install -r requirements.txt
  3. No documents found

    python main.py ingest -d ./your_documents_directory
  4. Permission errors

    • Ensure write permissions for logs/ and data/ directories
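A quick way to check for the permission problem in item 4 is to test each directory for write access. This is a standalone diagnostic sketch, not part of the project; the function name is hypothetical and the directory names are taken from the project structure.

```python
import os


def unwritable_dirs(paths=("logs", "data")):
    """Return the directories that are missing or not writable."""
    return [
        p for p in paths
        if not (os.path.isdir(p) and os.access(p, os.W_OK))
    ]


if __name__ == "__main__":
    problems = unwritable_dirs()
    if problems:
        print("Missing or not writable:", ", ".join(problems))
    else:
        print("logs/ and data/ are writable")
```

Run it from the project root so that the relative paths resolve to the project's own directories.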

Debug Mode

# Enable verbose output
python main.py --verbose <command>

# Check logs
tail -f logs/log_*.log

📄 License

MIT License - see LICENSE file for details.

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests
  5. Submit a pull request

📞 Support

  • 📖 Documentation: Check the docs/ directory
  • 🐛 Issues: Report bugs via GitHub issues
  • 💬 Questions: Use GitHub discussions
