A production-ready semantic search system built with NVIDIA cuVS, featuring advanced vector quantization and optimized memory access patterns for sub-100ms query latency on 35M high-dimensional embeddings.
This project implements an end-to-end neural search pipeline for large-scale semantic search across 35 million 768-dimensional Wikipedia embeddings. The system achieves 16x storage compression (53.76 GB → 3.36 GB) and <100ms average query latency through IVF-PQ indexing and memory-layout optimization.
- Problem: 35M vectors × 768 dims × 2 bytes = 53.76 GB storage requirement
- Solution: Implemented IVF-PQ (Inverted File with Product Quantization) using NVIDIA cuVS
- Result: Compressed to 3.36 GB (96 bytes/vector) - 93.7% storage reduction
Technical Details:
- 32,768 IVF clusters for coarse quantization
- 96 PQ sub-quantizers (8 dimensions each, 8 bits per code)
- Trained on 2M representative vectors, populated with remaining 33M in 500K batches
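A minimal sketch of how this configuration maps onto the cuVS Python API (`cuvs.neighbors.ivf_pq`), assuming a recent cuVS release; parameter names can vary between versions, and the synthetic data below is a scaled-down stand-in for the real 2M-row training sample and 500K-row population batches:

```python
import cupy as cp
from cuvs.neighbors import ivf_pq

# Synthetic stand-in: the real pipeline trains on a 2M-row sample of the corpus.
n_train, dim = 200_000, 768
train_sample = cp.random.random((n_train, dim), dtype=cp.float32)
train_sample /= cp.linalg.norm(train_sample, axis=1, keepdims=True)  # normalize for inner product

params = ivf_pq.IndexParams(
    n_lists=32_768,          # IVF clusters for coarse quantization
    pq_dim=96,               # 96 sub-quantizers -> 96 bytes per vector
    pq_bits=8,               # 256 centroids per sub-quantizer
    metric="inner_product",  # normalized embeddings: inner product == cosine
)
index = ivf_pq.build(params, train_sample)

# Populate the rest of the corpus in fixed-size batches to bound GPU memory use.
batch = cp.random.random((50_000, dim), dtype=cp.float32)
batch /= cp.linalg.norm(batch, axis=1, keepdims=True)
ids = cp.arange(n_train, n_train + batch.shape[0], dtype=cp.int64)
index = ivf_pq.extend(index, batch, ids)
```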
- Problem: Direct PQ search yielded only 60% recall@10
- Solution: Implemented two-stage retrieval with oversampling + reranking
- Result: Achieved 89% recall@10 while maintaining performance
Implementation:
- Stage 1: Retrieve 1000 candidates using PQ approximation
- Stage 2: Exact cosine-similarity reranking on the original embeddings
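A sketch of the two-stage flow, assuming the cuVS `ivf_pq` search API, a normalized float32 CuPy query vector, and a memory-mapped fp16 store of the original embeddings; function and variable names are illustrative, not taken from `src/search_index.py`:

```python
import cupy as cp
import numpy as np
from cuvs.neighbors import ivf_pq

def two_stage_search(index, query: cp.ndarray, embeddings: np.memmap,
                     k: int = 10, n_candidates: int = 1000, n_probes: int = 40) -> np.ndarray:
    """Stage 1: oversampled approximate search on the PQ index; Stage 2: exact rerank."""
    # Stage 1: pull n_candidates approximate neighbours from the compressed index.
    search_params = ivf_pq.SearchParams(n_probes=n_probes)
    _, neighbors = ivf_pq.search(search_params, index, query.reshape(1, -1), n_candidates)
    candidate_ids = cp.asarray(neighbors).ravel().get()

    # Stage 2: exact scoring against the original embeddings (memory-mapped on disk).
    # Sorting the ids first gives the OS page cache a mostly-forward read pattern.
    order = np.sort(candidate_ids)
    candidates = cp.asarray(embeddings[order], dtype=cp.float32)
    scores = candidates @ query.astype(cp.float32).ravel()   # inner product == cosine here
    top = cp.argsort(-scores)[:k].get()
    return order[top]
```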
- Problem: Random memory access for 1000 vectors took 800ms (unacceptable)
- Solution: List-wise memory layout optimization leveraging IVF structure
- Result: 1-10ms average retrieval, 100ms worst-case (80x improvement)
Key Insight: With n_probes=40, each query's candidates come from at most 40 IVF lists. Arranging embeddings by list on disk turns the candidate fetch into sequential access within those lists (see the layout sketch below).
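A sketch of how such a layout can be built, assuming the original embeddings live in a memory-mapped array and per-vector IVF list assignments are available; the helper name and on-disk format are illustrative:

```python
import numpy as np

def build_listwise_layout(embeddings: np.memmap, list_assignments: np.ndarray, out_path: str):
    """Reorder embeddings on disk so vectors in the same IVF list are contiguous.

    `list_assignments[i]` is the IVF list (cluster) id of vector i; both inputs are
    assumptions about how the intermediate data is stored.
    """
    n, dim = embeddings.shape
    order = np.argsort(list_assignments, kind="stable")   # group rows by list id
    reordered = np.lib.format.open_memmap(out_path, mode="w+", dtype=embeddings.dtype, shape=(n, dim))
    for start in range(0, n, 500_000):                    # copy in batches to cap RAM use
        sel = order[start:start + 500_000]
        reordered[start:start + len(sel)] = embeddings[sel]
    reordered.flush()

    # Per-list offsets: list i occupies rows list_starts[i] : list_starts[i + 1].
    counts = np.bincount(list_assignments, minlength=int(list_assignments.max()) + 1)
    list_starts = np.concatenate(([0], np.cumsum(counts)))
    # `order` also maps new row -> original vector id, needed to translate search results.
    return reordered, list_starts, order
```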
- FastAPI-based REST endpoint with comprehensive error handling
- Async request processing with connection pooling
- Integrated monitoring and logging
- Memory-mapped file handling for efficient resource utilization
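A condensed sketch of what this serving layer can look like (the actual implementation lives in `server.py`); the endpoint path, request model, file paths, and the `load_gpu_index`, `embed_query`, and `run_two_stage_search` helpers are illustrative assumptions:

```python
from contextlib import asynccontextmanager

import lmdb
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel


class SearchRequest(BaseModel):
    query: str
    k: int = 10


@asynccontextmanager
async def lifespan(app: FastAPI):
    # Open long-lived resources once per process: the LMDB metadata store and the GPU index.
    app.state.metadata_env = lmdb.open("data/metadata.lmdb", readonly=True, lock=False)  # path illustrative
    app.state.index = load_gpu_index()  # hypothetical loader for the cuVS index
    yield
    app.state.metadata_env.close()


app = FastAPI(lifespan=lifespan)


@app.post("/search")
async def search(req: SearchRequest):
    try:
        query_vec = await embed_query(req.query)                          # hypothetical async Cohere call
        ids = run_two_stage_search(app.state.index, query_vec, k=req.k)   # hypothetical search wrapper
        with app.state.metadata_env.begin() as txn:
            docs = [txn.get(int(i).to_bytes(8, "big")) for i in sorted(ids)]
        return {"results": [d.decode("utf-8") for d in docs if d is not None]}
    except Exception as exc:
        raise HTTPException(status_code=500, detail=str(exc))
```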
| Metric | Value | Baseline Comparison |
|---|---|---|
| Storage Compression | 16x | 53.76 GB → 3.36 GB |
| Query Latency (avg) | <100ms | - |
| Recall@10 | 89% | 60% (direct PQ) |
| Memory Access Speed | 1-10ms | 800ms (random access) |
```
Query → Cohere API (~70ms) → GPU Index Search (~20ms) → Memory-Optimized Retrieval (1-10ms) → Metadata Lookup (1-5ms) → Results (<100ms total)
```
- **Index Training Pipeline** (`src/train_index.py`)
  - Efficient data loading with multiprocessing
  - GPU memory monitoring and optimization
  - Configurable training parameters
- **Optimized Search Engine** (`src/search_index.py`)
  - Two-stage retrieval implementation
  - List-wise memory access patterns
  - Batch processing for metadata retrieval
- **Production Server** (`server.py`)
  - FastAPI with async processing
  - LMDB for metadata storage
  - Comprehensive error handling and logging
- **Memory Management** (`src/utils/`)
  - Custom offset-based caching
  - GPU memory monitoring
  - Validation utilities
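As an illustration of the GPU memory monitoring utilities, here is a small helper of the kind `src/utils/` might contain, built on CuPy's memory pool and CUDA runtime introspection (the function name and log format are assumptions):

```python
import cupy as cp

def log_gpu_memory(tag: str = "") -> dict:
    """Snapshot GPU memory usage: device-wide free/total plus CuPy pool statistics."""
    free_b, total_b = cp.cuda.runtime.memGetInfo()   # device-wide figures, in bytes
    pool = cp.get_default_memory_pool()
    stats = {
        "tag": tag,
        "device_used_gb": (total_b - free_b) / 1e9,
        "device_total_gb": total_b / 1e9,
        "pool_used_gb": pool.used_bytes() / 1e9,
        "pool_held_gb": pool.total_bytes() / 1e9,    # held by the pool, incl. cached blocks
    }
    print(stats)
    return stats
```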
- Coarse Quantization: 32,768 clusters using k-means
- Fine Quantization: 96 sub-quantizers with 256 centroids each
- Distance Metric: Inner product (optimized for normalized vectors)
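A quick back-of-the-envelope check of the storage figures implied by this configuration (PQ codes only; real index files carry additional IVF overhead):

```python
n_vectors, dim = 35_000_000, 768
raw_bytes = dim * 2        # fp16: 1,536 bytes per vector
pq_bytes = 96 * 8 // 8     # 96 sub-quantizers x 8 bits = 96 bytes per vector

print(n_vectors * raw_bytes / 1e9)   # 53.76 (GB)
print(n_vectors * pq_bytes / 1e9)    # 3.36 (GB)
print(raw_bytes / pq_bytes)          # 16.0x compression (codes only)
```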
```python
# Original: random (scattered) access across all 35M rows of the memory-mapped array
candidates = embeddings[candidate_ids]                  # ~800ms for 1000 vectors

# Optimized: with n_probes=40, candidates come from at most 40 IVF lists; read each
# list's contiguous slice once and pick only the needed rows from it
candidates = []
for list_id in active_lists:
    start, end = list_starts[list_id], list_starts[list_id + 1]
    candidates.append(embeddings[start:end][local_offsets[list_id]])  # ~1-10ms total
```
- LMDB for fast key-value retrieval
- Sorted access patterns for cache efficiency
- Binary encoding for space optimization
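A sketch of a batched, sorted metadata lookup against LMDB; the 8-byte big-endian key encoding and the file path are assumptions made for illustration:

```python
from typing import Dict

import lmdb
import numpy as np


def fetch_metadata(env: lmdb.Environment, ids: np.ndarray) -> Dict[int, bytes]:
    """Look up metadata records for a batch of vector ids.

    Assumes keys were written as 8-byte big-endian integers, so numeric order matches
    LMDB's lexicographic key order and a sorted batch walks the B+-tree mostly in order.
    """
    results: Dict[int, bytes] = {}
    with env.begin() as txn:
        cur = txn.cursor()
        for i in np.sort(ids):
            key = int(i).to_bytes(8, "big")
            if cur.set_key(key):                 # positions the cursor; False if the key is absent
                results[int(i)] = bytes(cur.value())
    return results


# Usage (path is illustrative):
# env = lmdb.open("data/metadata.lmdb", readonly=True, lock=False)
# docs = fetch_metadata(env, candidate_ids)
```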
- Vector Processing: NVIDIA cuVS, CuPy, NumPy
- API Framework: FastAPI, Uvicorn
- Data Storage: LMDB, Memory-mapped files
- ML Pipeline: Hugging Face Datasets, Cohere API
- Monitoring: Custom GPU memory tracking, structured logging
- Horizontal Scaling: Stateless API design enables load balancing
- Memory Efficiency: Memory-mapped files reduce RAM requirements
- GPU Utilization: Batch processing maximizes throughput
- Cache Optimization: List-wise layout improves OS page cache hits
- NVIDIA GPU with CUDA 12.1+
- Python 3.8+
- 8GB+ GPU memory
```bash
# Clone repository
git clone <repository-url>
cd neural-search-engine

# Install dependencies
pip install -r requirements.txt

# Train the index (requires ~4GB GPU memory)
python src/train_index.py
```
## 🔍 Future Enhancements
1. **Multi-GPU Support**: Distribute index across multiple GPUs
2. **Dynamic Updates**: Implement incremental index updates
3. **Query Optimization**: Adaptive n_probes based on query characteristics
4. **Caching Layer**: Redis for frequently accessed results
## 📝 Key Learnings & Problem-Solving
1. **Memory Access Patterns Matter**: Achieved 80x speedup through data layout optimization
2. **Quantization Trade-offs**: Balanced compression vs. accuracy through two-stage retrieval
3. **System Integration**: Built complete pipeline from training to production deployment
4. **Performance Monitoring**: Implemented comprehensive logging for optimization insights