A high-performance medical literature search system that uses embedding-based search with reranking capabilities. The system processes medical research papers from PubMed, stores their embeddings, and provides a search API that finds relevant papers based on semantic similarity.
- FAISS Multi-GPU Search: Scalable similarity search across multiple GPUs using FAISS
- Fast Semantic Search: Utilizes Qwen3-Embedding-8B for generating high-quality document embeddings
- Advanced Reranking: Yes/no classification-based reranking with Qwen3-Reranker-8B for improved relevance
- GPU Load Balancing: Distributes FAISS index shards across all available GPUs to eliminate bottlenecks
- Built-in FAISS Quantization: IVFPQ index type provides automatic vector quantization (PQ32x8)
- Multi-Server Architecture: Separate embedding and reranking servers for better resource management
- Index Persistence: FAISS indexes are saved/loaded for fast startup times
- Incremental Index Building: Memory-efficient batch processing during index construction
- Deduplication: Returns one result per paper to avoid duplicates
- RESTful API: Simple and intuitive FastAPI interface with automatic documentation
- FlashInfer Optimization: Enhanced performance with FlashInfer-python for accelerated inference
- Customizable Results: Control number of results with top_k parameter
- Smart Previews: Generate relevant text previews showing the most pertinent passage chunks
- Fully Async Architecture: Non-blocking HTTP requests using httpx for optimal performance under load
The system consists of three main components:
- Main API Server (api.py) - FastAPI application serving on port 10000
- Embedding Server (embedding_server.py) - VLLM server for Qwen3-Embedding-8B on port 10001
- Reranker Server (reranker_server.py) - VLLM server for Qwen3-Reranker-8B on port 10002
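The main API depends on both model servers being up before it can answer queries. Below is a minimal startup-check sketch; the ports match the config.py defaults, and the assumption that each vLLM server exposes a GET /health route should be verified against embedding_server.py and reranker_server.py.

```python
import asyncio
import time

import httpx

# Ports assumed to match EMBEDDING_SERVER_PORT / RERANKER_SERVER_PORT in config.py.
MODEL_SERVERS = {
    "embedding": "http://localhost:10001/health",
    "reranker": "http://localhost:10002/health",
}

async def wait_for_servers(timeout_s: float = 600.0, poll_s: float = 5.0) -> None:
    """Poll each model server's /health route until it answers 200 or we time out."""
    deadline = time.monotonic() + timeout_s
    pending = dict(MODEL_SERVERS)
    async with httpx.AsyncClient(timeout=10.0) as client:
        while pending and time.monotonic() < deadline:
            for name, url in list(pending.items()):
                try:
                    if (await client.get(url)).status_code == 200:
                        pending.pop(name)
                except httpx.HTTPError:
                    pass  # server still starting up
            if pending:
                await asyncio.sleep(poll_s)
    if pending:
        raise RuntimeError(f"Model servers not ready: {sorted(pending)}")

if __name__ == "__main__":
    asyncio.run(wait_for_servers())
```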
- User sends search query to API
- API gets query embedding from embedding server (async)
- API performs FAISS multi-GPU similarity search on distributed index
- Optionally reranks results using reranker server (async)
- Returns deduplicated results (one result per paper)
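The final step can be implemented by walking the FAISS hits in score order and keeping only the first (best-scoring) chunk seen for each paper. The sketch below is illustrative; the helper and variable names are not the actual api.py implementation.

```python
import numpy as np

def deduplicate_hits(scores: np.ndarray, indices: np.ndarray,
                     paper_ids: list[str], top_k: int) -> list[tuple[str, float]]:
    """Keep only the best-scoring chunk per paper, preserving score order.

    `scores`/`indices` are one row of a FAISS search result (already sorted
    by similarity); `paper_ids[i]` maps a chunk index back to its paper.
    """
    seen: set[str] = set()
    results: list[tuple[str, float]] = []
    for score, idx in zip(scores, indices):
        if idx < 0:          # FAISS pads missing neighbours with -1
            continue
        pid = paper_ids[idx]
        if pid in seen:      # a better chunk of this paper was already kept
            continue
        seen.add(pid)
        results.append((pid, float(score)))
        if len(results) == top_k:
            break
    return results
```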
- Non-blocking HTTP: All HTTP requests use httpx.AsyncClient for concurrent processing
- Async Embedding: Query and batch embedding generation don't block the event loop
- Async Reranking: Reranker requests processed asynchronously for better throughput
- Async Health Checks: Server health monitoring without blocking other operations
- Concurrent Preview Generation: Smart preview generation runs concurrently with other operations
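A sketch of the non-blocking batch-embedding pattern these bullets describe. The /v1/embeddings route and OpenAI-style payload are assumptions about how the vLLM embedding server is exposed; the concurrency limit and batch size are illustrative.

```python
import asyncio

import httpx

# Assumed vLLM OpenAI-compatible embeddings route; the real server config may differ.
EMBED_URL = "http://localhost:10001/v1/embeddings"
MODEL = "Qwen/Qwen3-Embedding-8B"

async def embed_batch(client: httpx.AsyncClient, sem: asyncio.Semaphore,
                      texts: list[str]) -> list[list[float]]:
    """Embed one batch without blocking the event loop."""
    async with sem:  # bound the number of in-flight requests
        resp = await client.post(EMBED_URL, json={"model": MODEL, "input": texts})
        resp.raise_for_status()
        return [row["embedding"] for row in resp.json()["data"]]

async def embed_all(texts: list[str], batch_size: int = 32,
                    max_in_flight: int = 4) -> list[list[float]]:
    """Split texts into batches and embed them concurrently with asyncio.gather."""
    sem = asyncio.Semaphore(max_in_flight)
    async with httpx.AsyncClient(timeout=120.0) as client:
        batches = [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]
        results = await asyncio.gather(*(embed_batch(client, sem, b) for b in batches))
    return [vec for batch in results for vec in batch]
```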
- Index Distribution: FAISS index automatically sharded across all available GPUs
- IVFPQ Index: Inverted File with Product Quantization for memory-efficient search
- 32 subvectors with 8 bits each (PQ32x8) optimized for GPU shared memory
- Provides built-in compression without separate quantization step
- Cosine Similarity: Normalized vectors with Inner Product for optimal similarity search
- Index Persistence: Pre-built indexes saved to disk for fast startup
- Incremental Building: Batch-wise index construction to handle large datasets
- Auto-scaling: Automatically utilizes all available GPUs (configurable)
- FAISS GPU Management: Automatic GPU memory allocation and load balancing
- Built-in Quantization: IVFPQ index provides automatic vector compression
- Incremental Loading: Embeddings loaded and processed in batches to avoid OOM
- Index Caching: Pre-built FAISS indexes avoid rebuild time
- Memory Efficiency: Original embeddings cleared immediately after index construction
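A minimal sketch of this construction path, mirroring the config defaults (EMBEDDING_DIMENSION=4096, FAISS_NLIST=1024, PQ32x8). The actual logic lives in faiss_index_manager.py and may differ in details.

```python
import numpy as np
import faiss  # requires a GPU-enabled FAISS build

D, NLIST, M, NBITS = 4096, 1024, 32, 8   # EMBEDDING_DIMENSION / FAISS_NLIST / PQ32x8

def build_ivfpq(train_vecs: np.ndarray) -> faiss.Index:
    """Train a cosine-similarity IVFPQ index and shard it across all GPUs."""
    train_vecs = np.ascontiguousarray(train_vecs, dtype="float32")
    faiss.normalize_L2(train_vecs)        # cosine == inner product on unit-norm vectors
    quantizer = faiss.IndexFlatIP(D)
    index = faiss.IndexIVFPQ(quantizer, D, NLIST, M, NBITS, faiss.METRIC_INNER_PRODUCT)
    index.train(train_vecs)               # learn coarse centroids and PQ codebooks

    co = faiss.GpuMultipleClonerOptions()
    co.shard = True                       # split the index across GPUs instead of replicating it
    return faiss.index_cpu_to_all_gpus(index, co=co)

def add_in_batches(gpu_index, embedding_batches) -> None:
    """Incremental building: add normalized batches one at a time to bound host RAM."""
    for batch in embedding_batches:
        batch = np.ascontiguousarray(batch, dtype="float32")
        faiss.normalize_L2(batch)
        gpu_index.add(batch)
        del batch                         # free the host copy immediately
```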
# Clone the repository
git clone <repository-url>
cd medical_search_simulation
# Install dependencies
pip install -r requirements.txt
# Run the API server
python api.py
# In another terminal, run the stress test
python stress_test.py --ccu 64 --duration 60
- Install dependencies:
pip install -r requirements.txt
For optimal performance with multi-GPU support, it's recommended to build FAISS from source:
# Clone FAISS repository
git clone https://github.com/facebookresearch/faiss
cd faiss
# Install SWIG (required for Python bindings)
conda install -c conda-forge swig
# Configure and build FAISS with GPU support
cmake -B build -DFAISS_ENABLE_GPU=ON -DFAISS_ENABLE_PYTHON=ON -DCMAKE_BUILD_TYPE=Release -DCMAKE_CUDA_ARCHITECTURES="89" .
make -C build -j1 faiss
make -C build -j1 swigfaiss
# Install Python bindings
cd build/faiss/python && python setup.py install
Note:
- Adjust CMAKE_CUDA_ARCHITECTURES based on your GPU architecture (e.g., "70" for V100, "80" for A100, "89" for RTX 4090)
- The -j1 flag builds with a single thread to avoid memory issues; increase it if you have sufficient RAM
- Ensure the CUDA toolkit is installed and matches your GPU driver version
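After installation, a quick sanity check that the GPU build is active (get_num_gpus is only available in GPU-enabled builds):

```python
import faiss

print("FAISS version:", faiss.__version__)
print("GPUs visible to FAISS:", faiss.get_num_gpus())
```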
- Prepare embeddings:
  - Configure embedding path in the EMBEDDING_FOLDER setting
  - Configure number of embedding files in the MAX_EMBEDDING_FILES setting
- Configure environment (optional):
export CUDA_VISIBLE_DEVICES=0,1,2,3,4 # Set to your available GPUs
export HF_HOME=/path/to/huggingface/cache
- Run the API server:
python api.py
The main API server will start on http://0.0.0.0:10000 by default.
The system will automatically start the embedding server (port 10001) and reranker server (port 10002).
Once the server is running, you can test it with these curl commands:
- Check if the API is healthy:
curl http://localhost:10000/health
- Search for medical documents:
curl -X POST "http://localhost:10000/search" \
-H "Content-Type: application/json" \
-d '{"query": "diabetes treatment", "use_reranker": true, "top_k": 10, "preview_char": 300}'
- Visit a specific document (replace URL with one from search results):
curl -X POST "http://localhost:10000/visit" \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com/paper_123"}'
To run the server with detailed debug logging:
python api.py --debug
This will enable comprehensive logging including:
- FAISS index setup and multi-GPU distribution times
- FAISS incremental index building progress
- Model loading times
- Query embedding generation time
- FAISS similarity search performance
- Reranking preparation and scoring times
- Total API response time breakdowns
Additional command line options:
python api.py --debug --host 0.0.0.0 --port 10000
POST /search
Request body:
{
"query": "diabetes treatment guidelines",
"use_reranker": true,
"top_k": 10,
"preview_char": 512
}
Parameters:
- query (required): Search query text
- use_reranker (optional): Whether to apply reranking (default: false)
- top_k (optional): Number of search results to return (default: MAX_SEARCH_RESULTS from config, max: 20)
- preview_char (optional): Number of preview characters to return for each result (default: -1 to skip, min: MINIMUM_PREVIEW_CHAR)
Example curl command:
curl -X POST "http://localhost:10000/search" \
-H "Content-Type: application/json" \
-d '{
"query": "diabetes treatment guidelines",
"use_reranker": true,
"top_k": 10,
"preview_char": 512
}'
Example without reranking:
curl -X POST "http://localhost:10000/search" \
-H "Content-Type: application/json" \
-d '{
"query": "cardiovascular risk factors",
"use_reranker": false,
"top_k": 5
}'
Example with preview generation:
curl -X POST "http://localhost:10000/search" \
-H "Content-Type: application/json" \
-d '{
"query": "machine learning in healthcare",
"preview_char": 300
}'
Response:
{
"results": [
{
"url": "https://example.com/paper_123",
"metadata": {
"paper_id": "paper_123",
"paper_title": "Recent Advances in Diabetes Management",
"year": "2024",
"venue": "Nature Medicine",
"specialty": ["endocrinology", "internal medicine"]
},
"score": 0.95,
"preview": "Type 2 diabetes mellitus (T2DM) is a chronic metabolic disorder characterized by insulin resistance and relative insulin deficiency. Recent advances in diabetes management have focused on personalized treatment approaches, continuous glucose monitoring systems, and novel pharmacological interventions including GLP-1 receptor agonists..."
}
]
}
POST /visit
Request body:
{
"url": "https://example.com/paper_123"
}
Example curl command:
curl -X POST "http://localhost:10000/visit" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com/paper_123"
}'
Response:
{
"url": "https://example.com/paper_123",
"data": "# Recent Advances in Diabetes Management\n\nDiabetes mellitus is a chronic metabolic disorder...",
"status_code": 200
}
Status codes:
- 200: Success
- 404: Document not found
- 400: Invalid request
- 500: Server error
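Equivalent Python usage that chains the two endpoints, using the httpx client. This is a convenience sketch; the field names follow the response examples above.

```python
import httpx

API = "http://localhost:10000"

with httpx.Client(timeout=60.0) as client:
    # Search, then fetch the full text of the top hit via /visit.
    hits = client.post(
        f"{API}/search",
        json={"query": "diabetes treatment", "use_reranker": True,
              "top_k": 5, "preview_char": 300},
    ).json()["results"]

    if hits:
        top = hits[0]
        print(top["metadata"]["paper_title"], top["score"])
        doc = client.post(f"{API}/visit", json={"url": top["url"]}).json()
        print(doc["data"][:500])
```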
GET /health
Example curl command:
curl -X GET "http://localhost:10000/health"
Response:
{
"status": "healthy",
"models_loaded": {
"embedding_model": true,
"reranker_model": true,
"metadata_dataset": true,
"embeddings_matrix": true,
"embeddings_shape": [2300000, 4096]
}
}
Interactive API documentation is available at:
- Swagger UI: http://localhost:10000/docs
- ReDoc: http://localhost:10000/redoc
Edit config.py to customize:
EMBEDDING_MODEL_NAME = "Qwen/Qwen3-Embedding-8B"
RERANKER_MODEL_NAME = "Qwen/Qwen3-Reranker-8B"
METADATA_DATASET_NAME = "hoanganhpham/Miriad_Pubmed_metadata"
EMBEDDING_SERVER_PORT = 10001
RERANKER_SERVER_PORT = 10002
API_SERVER_PORT = 10000
EMBEDDING_GPU_DEVICES = "0" # Single GPU for embedding
RERANK_GPU_DEVICES = "1,2,3,4,5,6,7" # Multi-GPU for reranking
MAX_SEARCH_RESULTS = 20 # Maximum results to return
TOP_K_RERANK = 10 # Results to rerank
EMBEDDING_DIMENSION = 4096 # Embedding vector dimension
FAISS_SEARCH_K = 1000 # Initial k for FAISS search before reranking
MINIMUM_PREVIEW_CHAR = 100 # Minimum preview character length
FAISS_INDEX_TYPE = "IVFPQ" # Options: "Flat", "IVFFlat", "IVFPQ"
FAISS_NLIST = 1024 # Number of clusters for IVF indexes
FAISS_USE_COSINE = True # Use cosine similarity
FAISS_GPU_DEVICES = [0, 1, 2, 3, 4, 5, 6, 7] # GPU devices for FAISS
FAISS_INDEX_PATH = "/mnt/sharefs/tuenv/medical_search_cache/faiss_index.bin"
EMBEDDING_FOLDER = "/path/to/embeddings/"
MAX_EMBEDDING_FILES = 785 # Adjust based on your dataset
DEBUG_MODE = False # Set to True for detailed logging
LOG_LEVEL = "INFO" # Default log level
LOG_FORMAT = "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
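config.py ships with hard-coded values. If you prefer environment-based overrides, one pattern (not part of the repository, shown purely as a sketch) is to layer os.environ lookups on top of the defaults above:

```python
import os

# Hypothetical overrides layered on the config.py defaults shown above.
API_SERVER_PORT = int(os.environ.get("API_SERVER_PORT", 10000))
FAISS_GPU_DEVICES = [int(g) for g in
                     os.environ.get("FAISS_GPU_DEVICES", "0,1,2,3,4,5,6,7").split(",")]
EMBEDDING_FOLDER = os.environ.get("EMBEDDING_FOLDER", "/path/to/embeddings/")
DEBUG_MODE = os.environ.get("DEBUG_MODE", "false").lower() == "true"
```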
- RAM: Varies based on dataset size and FAISS index type
- GPU: 16GB+ VRAM recommended for both embedding and reranking models
- Storage: Disk space depends on embedding dataset size
- Use IVFPQ index: Provides automatic vector compression with good search quality
- Pre-build index: Save built index to disk to avoid rebuild on startup
- GPU allocation: Distribute FAISS index and model servers across multiple GPUs
- Disable reranking: Use use_reranker=false for faster but less accurate results
- Monitor startup time: Initial index building takes 5-10 minutes but can be cached
- Debug mode: Enable to identify performance bottlenecks
- Async architecture: The fully async implementation provides better performance under concurrent load
See the Testing section above for comprehensive testing options including unit tests and stress testing.
The project includes a comprehensive stress testing tool that simulates multiple concurrent users:
# Run stress test with 64 concurrent users for 60 seconds
python stress_test.py --ccu 64 --duration 60
# Run with custom API URL
python stress_test.py --url http://192.168.0.11:10000 --ccu 64
# Skip pre-flight checks
python stress_test.py --skip-preflight --ccu 100 --duration 120
The stress test will:
- Run pre-flight checks to ensure API health
- Test all endpoints (health, search, visit)
- Simulate realistic user behavior (70% search, 30% visit)
- Report comprehensive metrics including:
- Response times (min, max, mean, median, p95, p99)
- Success/failure rates
- Requests per second
- Error breakdown
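stress_test.py implements this in full; the sketch below only shows the core pattern (concurrent simulated users with a roughly 70/30 search/visit mix). Query strings, URLs, and function names are illustrative.

```python
import asyncio
import random
import statistics
import time

import httpx

API = "http://localhost:10000"
QUERIES = ["diabetes treatment", "cardiovascular risk factors", "sepsis management"]

async def user(client: httpx.AsyncClient, stop: float, latencies: list[float]) -> None:
    """One simulated user: ~70% searches, ~30% visits, until the deadline."""
    urls: list[str] = []
    while time.monotonic() < stop:
        start = time.monotonic()
        if urls and random.random() < 0.3:
            await client.post(f"{API}/visit", json={"url": urls.pop()})
        else:
            resp = await client.post(f"{API}/search",
                                     json={"query": random.choice(QUERIES), "top_k": 5})
            urls.extend(r["url"] for r in resp.json().get("results", []))
        latencies.append(time.monotonic() - start)

async def run(ccu: int = 8, duration: float = 30.0) -> None:
    latencies: list[float] = []
    stop = time.monotonic() + duration
    async with httpx.AsyncClient(timeout=60.0) as client:
        await asyncio.gather(*(user(client, stop, latencies) for _ in range(ccu)))
    print(f"{len(latencies)} requests, median {statistics.median(latencies):.3f}s")

if __name__ == "__main__":
    asyncio.run(run())
```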
- Out of GPU Memory
  - Reduce number of GPUs in FAISS_GPU_DEVICES
  - Use IVFPQ index type for built-in compression
  - Ensure no other processes are using the GPU
- Slow Startup
  - Initial FAISS index building takes 5-10 minutes
  - Pre-built indexes are loaded from disk on subsequent runs
  - Consider using fewer embedding files for testing
- Model Server Startup Failures
  - Check if ports 10001 and 10002 are available
  - Verify GPU assignments in EMBEDDING_GPU_DEVICES and RERANK_GPU_DEVICES
  - Ensure VLLM is properly installed
- Model Loading Errors
  - Ensure the Hugging Face cache is accessible
  - Check internet connection for model downloads
  - Verify CUDA is properly installed
medical_search_simulation/
├── api.py # Main FastAPI application
├── faiss_index_manager.py # FAISS multi-GPU index management
├── cache_utils.py # Caching utilities for startup
├── config.py # Configuration settings
├── preprocess.py # Data preprocessing utilities
├── vllm_generate_emb.py # VLLM embedding generation
├── stress_test.py # Comprehensive stress testing tool
├── requirements.txt # Python dependencies
- Modify data models in the Pydantic classes in api.py
- Update endpoints and business logic
- Add corresponding tests
- Update configuration in config.py if needed
- Consider impact on FAISS index and memory usage
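For example, a hypothetical /similar endpoint would follow the same pattern as the existing ones. The names below are illustrative only and are not part of the codebase; in practice you would extend the FastAPI app and shared state defined in api.py rather than creating a new app.

```python
from fastapi import FastAPI
from pydantic import BaseModel, Field

app = FastAPI()

class SimilarRequest(BaseModel):
    # Hypothetical request model; field names are illustrative.
    paper_id: str
    top_k: int = Field(default=10, ge=1, le=20)

@app.post("/similar")
async def similar_papers(req: SimilarRequest):
    # New business logic would reuse the existing FAISS index and metadata here.
    return {"paper_id": req.paper_id, "results": []}
```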
This project is for research and educational purposes.