Transform your documents into an intelligent Q&A system using LlamaIndex RAG capabilities and Streamlit's interactive interface. Upload PDFs, DOCX, or TXT files and get instant, contextual answers powered by advanced AI embeddings.
Demo video: `Demo.mp4`
RAGIndex is a Retrieval-Augmented Generation (RAG) application that leverages LlamaIndex for document processing and Streamlit for the user interface. It transforms static documents into an interactive knowledge base where you can ask questions and receive accurate, context-aware answers.
- LlamaIndex-Powered RAG: Advanced document indexing and retrieval using LlamaIndex's state-of-the-art RAG pipeline
- Streamlit Web Interface: Responsive UI built with Streamlit for a seamless user experience
- Advanced Document Processing: Multi-format support (PDF, DOCX, TXT) with intelligent text extraction and metadata preservation
- Intelligent PDF Ingestion: Sophisticated PDF processing with page-level tracking, automatic fallback mechanisms, and detailed metadata retention
- Smart OCR Pipeline: Automatic OCR processing for image-based PDFs using Tesseract with custom configuration and error handling
- Document Tracking & Deduplication: Advanced document store with Redis-backed tracking, duplicate detection, and ingestion state management
- High-Performance Vector Storage: Redis vector store with semantic search, metadata fields, and optimized retrieval
- Real-time Processing: Instant document processing with progress tracking and detailed ingestion statistics
- Modern UI: Clean, intuitive interface with chat-style interactions and comprehensive error feedback
- Production Ready: Fully containerized deployment with Docker Compose and a scalable architecture
- Embedding Model: `bge-base-finetune-v2` for high-quality text embeddings with semantic splitting
- Vector Store: Redis-backed vector storage with metadata fields for source tracking and page numbering
- Document Processing: Semantic-aware text chunking with intelligent overlap and page boundary preservation
- Document Store: Redis document store with duplicate detection and ingestion state tracking
- Query Engine: LlamaIndex conversation engine for contextual responses with source attribution
- Ingestion Pipeline: Advanced pipeline with caching, error handling, and automatic retry mechanisms
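A minimal sketch of how these components might be wired together, assuming a recent llama-index release (module paths and the Redis vector-store schema arguments vary between versions, so treat this as illustrative rather than the project's actual code):

```python
# Illustrative wiring of the architecture above; not the project's source.
from llama_index.core.ingestion import DocstoreStrategy, IngestionPipeline
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.storage.docstore.redis import RedisDocumentStore
from llama_index.vector_stores.redis import RedisVectorStore

# Embedding model, matching the [embed_model] section of config.toml.
embed_model = HuggingFaceEmbedding(
    model_name="Suva/bge-base-finetune-v2",
    embed_batch_size=1,
)

# Semantic splitting uses the embedding model itself to find topic
# boundaries instead of cutting at a fixed character count.
splitter = SemanticSplitterNodeParser(embed_model=embed_model)

# Redis-backed stores, matching the [redis] section of config.toml.
docstore = RedisDocumentStore.from_host_and_port(
    host="redis", port=6379, namespace="DocStore_v1"
)
vector_store = RedisVectorStore(redis_url="redis://redis:6379")

pipeline = IngestionPipeline(
    transformations=[splitter, embed_model],
    docstore=docstore,
    vector_store=vector_store,
    docstore_strategy=DocstoreStrategy.DUPLICATES_ONLY,
)
```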
- Interactive File Upload: Multi-file upload with progress tracking
- Real-time Chat: Chat-style interface for natural Q&A interactions
- Session Management: Persistent conversation state across interactions
- Responsive Design: Modern, mobile-friendly interface
- Docker and Docker Compose
- 4GB+ RAM recommended
- Internet connection for model downloads
```bash
git clone https://github.com/rigvedrs/RAGIndex.git
cd RAGIndex
```
Create a `.env` file with your OpenAI API key:

```bash
echo "OPENAI_API_KEY=your_openai_api_key_here" > .env
```
```bash
docker compose up --build
```
Open your browser and navigate to: http://localhost:8501
- Upload Documents: Use the sidebar to upload PDF, DOCX, or TXT files
- Process Documents: Click "Analyze" to process and index your documents
- Ask Questions: Type your questions in the chat interface
- Get Answers: Receive contextual answers based on your documents
RAGIndex implements a sophisticated PDF processing pipeline that goes far beyond basic text extraction:
- Page-Level Tracking: Each page is individually processed with embedded page numbers (`PAGE_NUM=1`, `PAGE_NUM=2`, etc.) for precise source attribution (see the sketch after this list)
- Metadata Preservation: Complete document metadata including source filename, page numbers, and processing timestamps
- Semantic Chunking: Uses LlamaIndex's `SemanticSplitterNodeParser` for intelligent content-aware splitting rather than naive character limits
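A minimal sketch of what page-level tracking can look like; the helper name and metadata keys below are illustrative assumptions, not the project's actual implementation:

```python
# Hypothetical page-aware PDF loader: injects PAGE_NUM markers so that
# chunks produced later still carry their page of origin.
from PyPDF2 import PdfReader
from llama_index.core import Document

def load_pdf_with_pages(path: str) -> list[Document]:
    reader = PdfReader(path)
    docs = []
    for page_no, page in enumerate(reader.pages, start=1):
        text = page.extract_text() or ""
        docs.append(
            Document(
                text=f"PAGE_NUM={page_no}\n{text}",
                metadata={"source": path, "page": page_no},
            )
        )
    return docs
```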
- Primary Extraction: PyPDF2-based text extraction for standard PDFs
- OCR Fallback: Automatic detection of image-based PDFs with Tesseract OCR processing
- Format Conversion: DOCX and TXT files automatically converted to PDF format for consistent processing
- Quality Validation: Empty document detection with automatic fallback to OCR processing
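Under those assumptions, the fallback logic could look roughly like this (`pdf2image` requires the Poppler utilities to be installed):

```python
# Sketch of the OCR fallback: if PyPDF2 yields no text, the PDF is
# treated as image-based, rasterised, and run through Tesseract.
import pytesseract
from pdf2image import convert_from_path
from PyPDF2 import PdfReader

def extract_or_ocr(path: str) -> list[str]:
    texts = [page.extract_text() or "" for page in PdfReader(path).pages]
    if any(t.strip() for t in texts):
        return texts  # standard PDF: direct extraction succeeded
    images = convert_from_path(path, dpi=300)  # one PIL image per page
    return [pytesseract.image_to_string(img) for img in images]
```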
- Automatic Retry Logic: Failed document ingestion automatically triggers cleanup and retry mechanisms
- Memory Management: OutOfMemoryError handling with graceful degradation
- Document Store Cleanup: Automatic removal of partially processed documents to maintain data integrity
- Progress Tracking: Real-time feedback with detailed statistics on node generation and ingestion success
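A hedged sketch of the retry-with-cleanup idea, reusing the `pipeline` and `docstore` names from the wiring sketch above (the exact exception types and cleanup steps in the app may differ):

```python
def ingest_with_retry(pipeline, docstore, documents, retries: int = 2):
    """Run ingestion, cleaning up partial state before each retry."""
    for attempt in range(retries + 1):
        try:
            return pipeline.run(documents=documents)
        except MemoryError:
            raise  # out of memory: degrade gracefully rather than loop
        except Exception:
            if attempt == retries:
                raise
            # Remove partially ingested documents so the retry starts clean.
            for doc in documents:
                docstore.delete_document(doc.doc_id, raise_error=False)
```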
- Duplicate Detection: `DocstoreStrategy.DUPLICATES_ONLY` prevents re-processing of identical documents (demonstrated below)
- Redis-Backed Storage: High-performance document storage with persistence and scalability
- Ingestion Caching: Intelligent caching system to speed up repeated operations
- Metadata Indexing: Searchable metadata fields including source attribution and page references
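Continuing the wiring sketch above, deduplication shows up as a second run that produces no new nodes (the counts here are illustrative):

```python
# With DocstoreStrategy.DUPLICATES_ONLY, document hashes stored in Redis
# let the pipeline skip inputs it has already ingested.
docs = load_pdf_with_pages("report.pdf")
first = pipeline.run(documents=docs)   # all pages chunked and embedded
second = pipeline.run(documents=docs)  # identical hashes: nothing to do
print(len(first), len(second))         # e.g. 120, 0
```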
When you ask questions, RAGIndex doesn't just provide answers; it tells you exactly which document and page the information came from, enabling:
- Citation Accuracy: Precise page-level source references
- Content Verification: Easy verification of AI responses against source documents
- Context Preservation: Maintains document structure and page relationships
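A sketch of page-level attribution at query time, again reusing `vector_store` and `embed_model` from the wiring sketch; the metadata keys follow the hypothetical loader above and may differ in the app:

```python
from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_vector_store(vector_store, embed_model=embed_model)
query_engine = index.as_query_engine()

response = query_engine.query("What does the report conclude?")
print(response)
for src in response.source_nodes:  # retrieved chunks with similarity scores
    meta = src.node.metadata
    print(meta.get("source"), "page", meta.get("page"), "score", src.score)
```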
- LlamaIndex: Advanced RAG framework for document indexing and retrieval
- Streamlit: Modern web app framework for data science and AI
- Redis: In-memory vector database for high-performance search
- HuggingFace Transformers: Pre-trained embedding models
- PyPDF2: Primary PDF text extraction with page-level tracking
- Tesseract OCR: Intelligent OCR fallback for image-based PDFs with custom configuration
- pdf2image: High-quality PDF to image conversion for OCR processing
- python-docx: DOCX file processing with automatic PDF conversion
- PyMuPDF: Advanced PDF processing and metadata extraction
- Document Tracking: Page number injection, source attribution, and metadata preservation
- Error Handling: Robust fallback mechanisms and automatic retry logic
- Deduplication: Document fingerprinting and duplicate prevention system
- OpenAI API: Large language model integration
- sentence-transformers: Text embedding generation
- NLTK: Natural language processing utilities
```toml
[embed_model]
model_name = "Suva/bge-base-finetune-v2"
cache_folder = "/RAGIndex/store/models"
embed_batch_size = 1

[transformations]
chunk_size = 1000
chunk_overlap = 100

[redis]
host_name = "redis"
port_no = 6379
doc_store_name = "DocStore_v1"
vector_index_name = "VecStore_v1"
```
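One plausible way these settings are consumed at startup (Python 3.11+ bundles `tomllib`; on older versions the third-party `toml` package offers the same load):

```python
import tomllib  # standard library on Python 3.11+

with open("config.toml", "rb") as f:
    cfg = tomllib.load(f)

print(cfg["embed_model"]["model_name"])                    # Suva/bge-base-finetune-v2
print(cfg["redis"]["host_name"], cfg["redis"]["port_no"])  # redis 6379
```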
Replace the embedding model in `config.toml`:

```toml
model_name = "your-custom-huggingface-model"
```
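Note that switching models usually changes the embedding dimensionality, which must match the Redis vector index, so the index generally needs rebuilding. A quick sanity check ("your-custom-huggingface-model" is the placeholder from the config above):

```python
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(model_name="your-custom-huggingface-model")
print(len(embed_model.get_text_embedding("dimension check")))  # vector size
```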
For production deployment:
```bash
docker compose up -d --scale RAGIndex=3
```
The application can be extended with REST API endpoints for programmatic access.
```bash
# Install dependencies
pip install -r requirements.txt

# Start Redis
docker run -d -p 6379:6379 redis/redis-stack-server:latest

# Run Streamlit app
streamlit run src/app.py
```
```
RAGIndex/
├── src/
│   ├── app.py                 # Main Streamlit application
│   └── RAGIndex/
│       ├── chat/              # LlamaIndex conversation engine
│       ├── pipeline/          # Document processing pipeline
│       ├── pdf_ingest/        # PDF processing utilities
│       └── stcomp/            # Streamlit components
├── config.toml                # Application configuration
├── requirements.txt           # Python dependencies
├── docker-compose.yml         # Docker deployment
└── Dockerfile                 # Container definition
```
We welcome contributions!
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
- PDF Text Extraction: ~0.5-1 seconds per page for standard PDFs
- OCR Processing: ~2-3 seconds per page for image-based PDFs
- Document Chunking: ~100-200 nodes per MB of text content
- Vector Embedding: ~1000 chunks per minute with batch processing
- Query Response Time: <500ms for most queries with Redis vector store
- Memory Usage: ~2GB base + 500MB per 100 processed documents
- Storage: Redis-based persistence with configurable retention
- Concurrent Users: Supports 10+ concurrent users with proper resource allocation
- Document Limits: Tested with 1000+ documents and 100,000+ text chunks
- Environment variable management for API keys
- Containerized deployment for isolation
- No persistent storage of sensitive data
- Input validation and sanitization
- Document store isolation with namespace management
This project is licensed under the MIT License - see the LICENSE file for details.
- Documentation: Full Documentation
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- LlamaIndex Team for the excellent RAG framework
- Streamlit Team for the amazing web app framework
- HuggingFace for pre-trained models and transformers
- Redis Labs for the vector database solution
Keywords: LlamaIndex, Streamlit, RAG, Document Q&A, AI, Machine Learning, Python, Vector Database, Redis, PDF Processing, OCR, Natural Language Processing, Document Intelligence, Retrieval Augmented Generation, Chatbot, Knowledge Base
Made with ❤️ using LlamaIndex and Streamlit