A comprehensive command-line RAG (Retrieval-Augmented Generation) system for document-based question answering, built on LangChain and GROQ with support for multiple vector databases (ChromaDB, Weaviate, and Qdrant).
- Web Interface: Modern Streamlit-based web UI for all operations
- Document Ingestion: Support for PDF, DOCX, TXT, and JSON files
- Vector Search: Support for ChromaDB, Weaviate, and Qdrant vector databases
- AI Integration: GROQ API for text generation
- Interactive Queries: Web chat interface and CLI modes
- Statistics: Real-time database analytics and monitoring
- Dual Interface: Both web UI and comprehensive CLI tools
- Configurable: YAML-based configuration system
- Timestamped Logs: Detailed logging with unique timestamps
- Python 3.8+
- GROQ API Key
1. Clone the repository:

   ```bash
   git clone <repository-url>
   cd rag-pipeline
   ```

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Set up environment variables:

   ```bash
   export GROQ_API_KEY="your_groq_api_key"
   # Optional: For Weaviate Cloud
   export WEAVIATE_API_KEY="your_weaviate_api_key"
   ```

4. Optional: install as a package:

   ```bash
   pip install -e .
   ```
This project uses a `.env` file to manage secrets and API keys securely. A template file named `.env.example` is provided in the project root.

Setup instructions:

1. Copy `.env.example` to `.env` in the project root:

   ```bash
   cp .env.example .env
   ```

2. Open `.env` and fill in your API keys and any other required secrets. For example:

   ```bash
   GROQ_API_KEY=your_actual_groq_api_key_here
   # Add other keys as needed
   ```

3. Do not commit your `.env` file to version control.

The application automatically loads environment variables from `.env` at startup.
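A minimal sketch of how this kind of startup loading typically works, assuming the `python-dotenv` package (the project's actual mechanism may differ):

```python
import os

from dotenv import load_dotenv  # pip install python-dotenv

# Read key=value pairs from .env into the process environment.
load_dotenv()

groq_key = os.getenv("GROQ_API_KEY")
if not groq_key:
    raise RuntimeError("GROQ_API_KEY is not set; copy .env.example to .env and fill it in.")
```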
The RAG Pipeline supports multiple vector database providers. You can choose between ChromaDB, Weaviate, and Qdrant by editing the `config/config.yaml` file.
ChromaDB is the default vector database and requires no additional setup:
```yaml
vector_db:
  provider: "chromadb"
  path: "./data/vectors"
  collection_name: "documents"
  distance_metric: "cosine"
```
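For reference, this is roughly what those settings translate to in the `chromadb` Python client (an illustrative sketch, not the project's actual wiring):

```python
import chromadb

# Mirrors the config above: a persistent store on disk with cosine distance.
client = chromadb.PersistentClient(path="./data/vectors")
collection = client.get_or_create_collection(
    name="documents",
    metadata={"hnsw:space": "cosine"},  # distance_metric: "cosine"
)
print(collection.count())
```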
For Weaviate, you have two options:

Option 1: Local instance via Docker

1. Start Weaviate using Docker Compose:

   ```bash
   docker-compose -f docker-compose.weaviate.yml up -d
   ```

2. Update your configuration:

   ```yaml
   vector_db:
     provider: "weaviate"
     url: "http://localhost:8080"
     class_name: "Document"
   ```

Option 2: Weaviate Cloud

1. Sign up for Weaviate Cloud Services
2. Get your API key and cluster URL
3. Update your configuration:

   ```yaml
   vector_db:
     provider: "weaviate"
     url: "https://your-cluster-url.weaviate.network"
     api_key: "your-api-key"
     class_name: "Document"
   ```
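As a rough illustration of what these settings correspond to in the `weaviate-client` library (v3-style API shown; v4 changed the client interface, and the pipeline's internal usage may differ):

```python
import os

import weaviate  # pip install weaviate-client

# Local instance: no auth needed.
local_client = weaviate.Client(url="http://localhost:8080")

# Weaviate Cloud: pass the API key from the environment.
cloud_client = weaviate.Client(
    url="https://your-cluster-url.weaviate.network",
    auth_client_secret=weaviate.AuthApiKey(api_key=os.environ["WEAVIATE_API_KEY"]),
)
print(local_client.is_ready())
```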
For Qdrant, you have two options:

Option 1: Local instance via Docker

1. Start Qdrant using Docker Compose:

   ```bash
   docker-compose -f docker-compose.qdrant.yml up -d
   ```

2. Update your configuration:

   ```yaml
   vector_db:
     provider: "qdrant"
     url: "http://localhost:6333"
     collection_name: "documents"
     vector_size: 384
   ```

Option 2: Qdrant Cloud

1. Sign up for Qdrant Cloud Services
2. Get your API key and cluster URL
3. Update your configuration:

   ```yaml
   vector_db:
     provider: "qdrant"
     url: "https://your-cluster-url.qdrant.tech"
     api_key: "your-api-key"
     collection_name: "documents"
     vector_size: 384
   ```
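Correspondingly, a sketch of the same settings expressed with the `qdrant-client` library (illustrative only; the cosine distance metric is an assumption, since the config above does not state one):

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

# Local instance started via docker-compose above.
client = QdrantClient(url="http://localhost:6333")

# Matches collection_name and vector_size from the config
# (384 is the output dimension of many sentence-transformer embedding models).
if not client.collection_exists("documents"):
    client.create_collection(
        collection_name="documents",
        vectors_config=VectorParams(size=384, distance=Distance.COSINE),
    )
```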
If you have existing data and want to migrate between vector databases:
```bash
# Migrate from ChromaDB to Weaviate
python scripts/migrate_to_weaviate.py --backup --validate

# Migrate from ChromaDB to Qdrant
python scripts/migrate_to_qdrant.py --backup --validate

# Dry run to see what would be migrated
python scripts/migrate_to_weaviate.py --dry-run
python scripts/migrate_to_qdrant.py --dry-run
```
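Conceptually, such a migration moves records between stores along these lines (a simplified sketch, not the actual scripts, which also handle backup and validation):

```python
import chromadb
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

source = chromadb.PersistentClient(path="./data/vectors").get_collection("documents")
target = QdrantClient(url="http://localhost:6333")

# Pull embeddings, text, and metadata out of ChromaDB...
batch = source.get(include=["embeddings", "documents", "metadatas"])

# ...and upsert them into Qdrant under the same collection name.
# Qdrant point ids must be ints or UUIDs, so records are renumbered here.
points = [
    PointStruct(id=i, vector=list(vec), payload={"text": doc, **(meta or {})})
    for i, (vec, doc, meta) in enumerate(
        zip(batch["embeddings"], batch["documents"], batch["metadatas"])
    )
]
target.upsert(collection_name="documents", points=points)
```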
```bash
# Initialize and test the system
python main.py init

# Show help
python main.py --help

# Show available commands
python main.py list

# Ingest documents from a directory
python main.py ingest -d ./docs

# Ingest a single file
python main.py ingest -f document.pdf

# Ingest with verbose output
python main.py ingest -d ./docs --verbose

# Ask a question
python main.py query "What is machine learning?"

# Query with verbose output (shows source details)
python main.py query "Explain neural networks" --verbose

# Interactive mode
python main.py interactive

# Show database statistics
python main.py stats

# Clear all documents (with confirmation)
python main.py clear

# Clear without confirmation
python main.py clear --confirm

# Use custom config file
python main.py init --config /path/to/config.yaml

# Skip test query during initialization
python main.py init --no-test
```
Launch the comprehensive web-based interface for a user-friendly experience:
```bash
# Start the Streamlit web app
streamlit run app.py

# Or with custom port
streamlit run app.py --server.port 8502
```
- Dashboard: System overview and quick actions
- Initialize: Web-based system initialization
- Ingest Documents:
  - Upload files directly through the browser
  - Specify directory paths
  - Drag-and-drop support for multiple files
- Chat Interface: Interactive conversational AI with chat history
- Single Query: Detailed query interface with source analysis
- Statistics: Real-time database analytics and visualizations
- Clear Database: Safe database clearing with confirmations
- System Info: Configuration and system status overview
The web interface provides all CLI functionality through an intuitive, modern UI accessible at `http://localhost:8501`.
Start CLI interactive mode for conversational queries:
```bash
python main.py interactive
```
Interactive commands:
- `/stats` - Show database statistics
- `/help` - Show help
- `/quit` - Exit interactive mode
```bash
# 1. Initialize the system
python main.py init

# 2. Add documents
python main.py ingest -d ./data/raw

# 3. Query the system
python main.py query "What are the main topics in the documents?"

# 4. Check statistics
python main.py stats
```
```bash
# Verbose ingestion with timing
python main.py ingest -d ./research_papers --verbose

# Query with source details
python main.py query "Explain the methodology" --verbose --max-results 10

# Interactive session
python main.py interactive
```
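The workflow above can also be driven from a short script. A sketch using only the CLI calls documented in this README:

```python
import subprocess

# Run the documented CLI steps in order, stopping on the first failure.
for args in (
    ["python", "main.py", "init"],
    ["python", "main.py", "ingest", "-d", "./data/raw"],
    ["python", "main.py", "query", "What are the main topics in the documents?"],
    ["python", "main.py", "stats"],
):
    subprocess.run(args, check=True)
```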
Edit `config/config.yaml` to customize:
```yaml
# Logging Configuration
logging:
  level: "INFO"
  format: "%(asctime)s - %(levelname)s - %(name)s:%(lineno)d - %(message)s"
  path: "./logs"

# LLM Configuration
llm:
  model: "llama-3.1-8b-instant"
  temperature: 0.7
  max_tokens: 1000

# Vector Database Configuration
vector_db:
  path: "./data/vectors"
  collection_name: "documents"
```
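Reading this file from Python is a one-liner with PyYAML; a sketch of what `config_loader.py` presumably does (the module's real interface may differ):

```python
import yaml

with open("config/config.yaml", encoding="utf-8") as f:
    config = yaml.safe_load(f)

# Nested keys map directly onto the YAML structure above.
print(config["llm"]["model"])        # "llama-3.1-8b-instant"
print(config["vector_db"]["path"])   # "./data/vectors"
```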
```text
rag-pipeline/
├── main.py                  # Enhanced CLI entry point
├── app.py                   # Streamlit web interface
├── setup.py                 # Package setup (legacy)
├── pyproject.toml           # Modern package configuration
├── requirements.txt         # Dependencies
├── rag.bat                  # Windows CLI launcher
├── rag.sh                   # Linux/Mac CLI launcher
├── src/
│   ├── utils/
│   │   ├── init_manager.py      # Logging initialization
│   │   ├── log_manager.py       # Log management utilities
│   │   └── config_loader.py     # Configuration management
│   ├── ingestion/
│   │   └── document_loader.py   # Document processing
│   └── rag_pipeline.py          # Core RAG functionality
├── config/
│   └── config.yaml              # System configuration
├── data/
│   ├── raw/                     # Input documents
│   ├── processed/               # Processed documents
│   └── vectors/                 # Vector database
├── logs/                        # Timestamped log files
└── docs/                        # Documentation
```
- PDF (.pdf) - Extracted using PyPDF2
- Word (.docx) - Processed with python-docx
- Text (.txt) - Plain text files
- JSON (.json) - Structured data files
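A condensed sketch of how these formats can be read with the libraries named above (PyPDF2, python-docx, and the standard library); the project's `document_loader.py` likely does something similar:

```python
import json
from pathlib import Path

from PyPDF2 import PdfReader
from docx import Document


def load_text(path: str) -> str:
    """Extract plain text from a supported file based on its extension."""
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        return "\n".join(page.extract_text() or "" for page in PdfReader(path).pages)
    if suffix == ".docx":
        return "\n".join(p.text for p in Document(path).paragraphs)
    if suffix == ".json":
        return json.dumps(json.loads(Path(path).read_text(encoding="utf-8")), indent=2)
    if suffix == ".txt":
        return Path(path).read_text(encoding="utf-8")
    raise ValueError(f"Unsupported file type: {suffix}")
```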
The system creates timestamped log files in the format `log_YYMMDD_HHMM.log` (e.g., `log_250723_1430.log`).

- A new log file is created for each session
- Configurable via `config.yaml`
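The timestamped filename can be reproduced with `datetime.strftime`; a sketch of the naming scheme, assuming it mirrors what `init_manager.py` does:

```python
import logging
from datetime import datetime
from pathlib import Path

# log_YYMMDD_HHMM.log, e.g. log_250723_1430.log
log_file = Path("./logs") / f"log_{datetime.now():%y%m%d_%H%M}.log"
log_file.parent.mkdir(parents=True, exist_ok=True)

logging.basicConfig(
    filename=log_file,
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(name)s:%(lineno)d - %(message)s",
)
```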
```bash
# Run with test data
python main.py init

# Test individual components
python main.py ingest -f ./data/raw/sample.pdf
python main.py query "Test question"
python main.py stats
```
1. Missing GROQ API key:

   ```bash
   export GROQ_API_KEY="your_api_key_here"
   ```

2. Dependencies not installed:

   ```bash
   pip install -r requirements.txt
   ```

3. No documents found:

   ```bash
   python main.py ingest -d ./your_documents_directory
   ```

4. Permission errors: ensure write permissions for the `logs/` and `data/` directories.
```bash
# Enable verbose output
python main.py --verbose <command>

# Check logs
tail -f logs/log_*.log
```
MIT License - see LICENSE file for details.
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests
- Submit a pull request
- Documentation: Check the `docs/` directory
- Issues: Report bugs via GitHub issues
- Questions: Use GitHub discussions