This repository contains a production-ready, scalable API server that provides text embedding, similarity, and semantic search services using the `intfloat/multilingual-e5-base` model from Sentence Transformers.
The API is implemented with:
- FastAPI for HTTP endpoints
- Ray Serve for scalable model serving and orchestration
- Redis for caching to reduce redundant computations and improve latency
- Prometheus + Grafana for monitoring and observability
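To make this architecture concrete, here is a minimal sketch of how the pieces can fit together: Ray Serve wraps a FastAPI app, the model loads once per replica, and Redis short-circuits repeated requests. The class name, Redis connection details, and response shape are illustrative assumptions; the actual implementation lives in `app/serve_app.py`.

```python
import json

import redis
from fastapi import FastAPI
from ray import serve
from sentence_transformers import SentenceTransformer

app = FastAPI()

@serve.deployment(ray_actor_options={"num_gpus": 1})
@serve.ingress(app)
class EmbeddingServer:
    def __init__(self):
        # Model loads once per replica; Redis host/port are assumptions
        self.model = SentenceTransformer("intfloat/multilingual-e5-base")
        self.cache = redis.Redis(host="localhost", port=6379, db=0)

    @app.get("/embeddings")
    def embed(self, sentence: str) -> dict:
        cached = self.cache.get(sentence)  # skip the model for repeated inputs
        if cached is not None:
            return {"embedding": json.loads(cached)}
        # E5 models expect a "query: " prefix on raw input text
        vector = self.model.encode(f"query: {sentence}").tolist()
        self.cache.set(sentence, json.dumps(vector))
        return {"embedding": vector}

serve.run(EmbeddingServer.bind())  # deploys the app; keep the process alive
```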
Features:

- Single-sentence embeddings (`GET /embeddings?sentence=...`)
- Bulk embeddings (`POST /embeddings/bulk`)
- Cosine similarity between two sentences (`POST /embeddings/similarity`)
- Semantic search to find the most similar sentence (`POST /embeddings/search`)
- Built-in health check endpoint (`GET /health`)
- Prometheus metrics endpoint for monitoring (`GET /metrics`)
- Scales with Ray Serve; supports multi-GPU deployments
- Easy to extend with additional models or endpoints
This API server offers a powerful, scalable, and cost-effective alternative to relying on hosted text embedding APIs like OpenAI’s embedding service. Here are the key benefits:
- **Throughput:** Hosted APIs impose strict rate limits that can throttle throughput. This server uses Ray Serve autoscaling to handle high concurrency and large request volumes without external API restrictions, and embeddings are generated locally or in your own cloud environment, reducing latency.
- **Cost:** Hosted embedding APIs charge per request, which gets expensive at millions of calls. Running your own embedding model (e.g., `intfloat/multilingual-e5-base`) on GPUs reduces costs, and autoscaling helps optimize resource usage and spending.
- **Flexibility:** Choose and update your embedding model anytime without waiting for third-party API updates, customize preprocessing or postprocessing as needed, and integrate seamlessly with monitoring tools like Prometheus, Grafana, and the Ray Dashboard.
- **Privacy:** Keep sensitive or proprietary text in your own environment. Avoiding third-party services improves compliance and data protection.
By deploying this API server, you gain full control, scalability, and cost efficiency, making it ideal for production workloads that require high-throughput semantic embeddings.
## Table of Contents

- Installation
- Running the Server
- API Usage
- Testing
- Monitoring & Observability
- Docker & Deployment
- Repository Structure
- Troubleshooting
- License
## Installation

```bash
git clone https://github.com/vishukla/e5-embedding-ray-serve.git
cd e5-embedding-ray-serve
python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
```
## Running the Server

Start the Ray head node and launch the Serve application:

```bash
ray start --head --port=6379 --dashboard-host=0.0.0.0
python serve_app.py
```

The API will be available at `http://localhost:8000`, and the Ray dashboard at `http://localhost:8265`.
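Once the server is up, a quick way to smoke-test it from Python (the `embedding` response key is an assumption; check the actual response shape):

```python
import requests

resp = requests.get(
    "http://localhost:8000/embeddings",
    params={"sentence": "The quick brown fox"},
)
resp.raise_for_status()
print(resp.json())  # expected to contain the embedding vector
```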
## API Usage

### Single Embedding

```
GET /embeddings?sentence=your+text+here
```

Example:

```bash
curl "http://localhost:8000/embeddings?sentence=The+quick+brown+fox"
```
### Bulk Embeddings

```
POST /embeddings/bulk
Content-Type: application/json
```

Request body:

```json
{
  "sentences": ["sentence 1", "sentence 2", "..."]
}
```

Example:

```bash
curl -X POST "http://localhost:8000/embeddings/bulk" \
  -H "Content-Type: application/json" \
  -d @tests/payloads/bulk_payload.json
```
### Sentence Similarity

```
POST /embeddings/similarity
Content-Type: application/json
```

Request body:

```json
{
  "sentence_1": "text 1",
  "sentence_2": "text 2"
}
```

Example:

```bash
curl -X POST "http://localhost:8000/embeddings/similarity" \
  -H "Content-Type: application/json" \
  -d @tests/payloads/similarity_payload.json
```
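Under the hood, the similarity score is the cosine of the angle between the two embedding vectors. A self-contained sketch of that computation with Sentence Transformers (illustrative, not necessarily the exact server code):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("intfloat/multilingual-e5-base")

# E5 models expect a "query: " prefix; with normalized vectors,
# cosine similarity reduces to a plain dot product.
embeddings = model.encode(
    ["query: text 1", "query: text 2"],
    normalize_embeddings=True,
)
score = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"cosine similarity: {score:.4f}")
```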
### Semantic Search

```
POST /embeddings/search
Content-Type: application/json
```

Request body:

```json
{
  "query": "search query",
  "sentences": ["candidate 1", "candidate 2", "..."]
}
```

Example:

```bash
curl -X POST "http://localhost:8000/embeddings/search" \
  -H "Content-Type: application/json" \
  -d @tests/payloads/search_payload.json
```
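Semantic search ranks every candidate by cosine similarity to the query and returns the best match. A sketch of that ranking step (E5's asymmetric `query:`/`passage:` prefixes are a property of the model; the rest is illustrative):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("intfloat/multilingual-e5-base")

query = "search query"
candidates = ["candidate 1", "candidate 2", "candidate 3"]

# E5 uses asymmetric prefixes for retrieval-style tasks
query_emb = model.encode(f"query: {query}", normalize_embeddings=True)
cand_embs = model.encode(
    [f"passage: {c}" for c in candidates],
    normalize_embeddings=True,
)

scores = util.cos_sim(query_emb, cand_embs)[0]  # shape: (len(candidates),)
best = int(scores.argmax())
print(candidates[best], float(scores[best]))
```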
### Health Check

```
GET /health
```

Example:

```bash
curl "http://localhost:8000/health"
```

Returns:

```json
{"status": "ok"}
```
### Metrics

```
GET /metrics
```

Example:

```bash
curl "http://localhost:8000/metrics"
```

Exposes Prometheus metrics for scraping.
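Ray Serve exports its own metrics; application-level counters can be layered on with `prometheus_client`. A sketch of custom instrumentation (the metric names and `record` helper are hypothetical, not taken from this repo):

```python
from prometheus_client import Counter, Histogram

# Hypothetical metric names; the repo's actual metrics may differ
REQUESTS = Counter(
    "embedding_requests_total", "Total embedding requests", ["endpoint"]
)
LATENCY = Histogram(
    "embedding_request_latency_seconds", "Request latency", ["endpoint"]
)

def record(endpoint: str, seconds: float) -> None:
    # Increment the request count and record latency for one request
    REQUESTS.labels(endpoint=endpoint).inc()
    LATENCY.labels(endpoint=endpoint).observe(seconds)

record("/embeddings", 0.012)  # example usage
```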
## Testing

Sample JSON payload files for testing are located in `tests/payloads/`. Run the `curl` commands with these payloads as shown in the API Usage section.

You can also run the automated tests (if implemented):

```bash
pytest tests/test_api.py -v
```
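If `tests/test_api.py` is not present yet, a minimal version might look like this (assumes a server already running on `localhost:8000`; the `embedding` response key is an assumption):

```python
# tests/test_api.py (sketch)
import requests

BASE_URL = "http://localhost:8000"

def test_health():
    resp = requests.get(f"{BASE_URL}/health")
    assert resp.status_code == 200
    assert resp.json() == {"status": "ok"}

def test_single_embedding():
    resp = requests.get(f"{BASE_URL}/embeddings", params={"sentence": "hello"})
    assert resp.status_code == 200
    assert "embedding" in resp.json()  # assumed response key
```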
Run Locust for performance testing:

```bash
locust -f tests/stress/locust_test.py --host=http://localhost:8000
```
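A minimal `tests/stress/locust_test.py` could be shaped like this (task weights and wait times are arbitrary choices for illustration):

```python
# tests/stress/locust_test.py (sketch)
from locust import HttpUser, between, task

class EmbeddingUser(HttpUser):
    wait_time = between(0.1, 1.0)  # simulated think time between requests

    @task(3)
    def single_embedding(self):
        self.client.get("/embeddings", params={"sentence": "The quick brown fox"})

    @task(1)
    def similarity(self):
        self.client.post(
            "/embeddings/similarity",
            json={"sentence_1": "text 1", "sentence_2": "text 2"},
        )
```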
## Monitoring & Observability

- Prometheus metrics are available at `/metrics`.
- Grafana dashboards can be set up with the provided configs in `monitoring/grafana-provisioning/`.
- The Ray Dashboard is available on port `8265` for cluster health, resource utilization, and request tracing.

Dashboards:

- **Ray Serve Dashboard** (`http://localhost:8265`): track replica scaling, request queues, latency, errors, and resource utilization.
- **Prometheus Dashboard** (`http://localhost:9090`): explore raw metrics and PromQL queries.
- **Grafana Dashboard** (`http://localhost:3000`): visual dashboards for requests per minute (RPM), QPS, latency (P95, P99), error rates, and CPU/GPU utilization.
## Docker & Deployment

Build and run the image:

```bash
docker build -t ray-embedding-server .
docker run --gpus all -p 8000:8000 -p 8265:8265 ray-embedding-server
```

Use `docker-compose.yml` to launch together with Prometheus and Grafana:

```bash
docker-compose up -d
```
## Repository Structure

```
e5-embedding-ray-serve/
├── app/
│   └── serve_app.py              # API & Ray Serve deployment
├── monitoring/
│   ├── prometheus.yml            # Prometheus config
│   └── grafana-provisioning/     # Grafana dashboards & datasources
├── tests/
│   ├── test_api.py               # Automated tests (optional)
│   └── payloads/                 # Sample JSON payloads for curl testing
│       ├── bulk_payload.json
│       ├── similarity_payload.json
│       └── search_payload.json
├── Dockerfile
├── docker-compose.yml
├── requirements.txt
└── README.md
```
## Troubleshooting

Docker needs the NVIDIA Container Toolkit installed to use the GPU. You may also need to set the default runtime in `/etc/docker/daemon.json`:

```json
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "args": [],
      "path": "/usr/bin/nvidia-container-runtime"
    }
  }
}
```

Then restart the Docker service:

```bash
sudo systemctl restart docker
```
You can configure GPU allocation by adding or removing the following code segments.

In `docker-compose.yml`:

```yaml
services:
  ray-embed-server:
    ...
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
```

In `app/serve_app.py`, update `ray_actor_options={"num_gpus": 1}`.
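For reference, GPU allocation and replica autoscaling are both controlled on the Ray Serve deployment decorator. A sketch with illustrative values (the option is named `target_num_ongoing_requests_per_replica` on older Ray versions):

```python
from ray import serve

@serve.deployment(
    ray_actor_options={"num_gpus": 1},  # GPUs reserved per replica
    autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 4,
        "target_ongoing_requests": 16,  # illustrative target load per replica
    },
)
class EmbeddingServer:
    ...
```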
## License

This project is licensed under the MIT License.