E5 Embeddings API Server with Ray Serve

Overview

This repository contains a production-ready scalable API server that provides text embedding, similarity, and semantic search services using the intfloat/multilingual-e5-base model from Sentence Transformers.

The API is implemented with:

  • FastAPI for HTTP endpoints
  • Ray Serve for scalable model serving and orchestration
  • Redis for caching to reduce redundant computations and improve latency
  • Prometheus + Grafana for monitoring and observability
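
A minimal sketch of how these pieces fit together (illustrative only; the actual implementation lives in app/serve_app.py, and names like EmbeddingService are placeholders):

from fastapi import FastAPI
from ray import serve
from sentence_transformers import SentenceTransformer

app = FastAPI()

@serve.deployment(ray_actor_options={"num_gpus": 1})  # set 0 for CPU-only
@serve.ingress(app)
class EmbeddingService:
    def __init__(self):
        # The model is loaded once per replica and reused across requests.
        self.model = SentenceTransformer("intfloat/multilingual-e5-base")

    @app.get("/embeddings")
    def embed(self, sentence: str):
        return {"embedding": self.model.encode(sentence).tolist()}

serve.run(EmbeddingService.bind())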

Features

  • Get single sentence embeddings (GET /embeddings?sentence=...)
  • Get bulk embeddings (POST /embeddings/bulk)
  • Compute cosine similarity between two sentences (POST /embeddings/similarity)
  • Semantic search to find the most similar sentence (POST /embeddings/search)
  • Built-in health check endpoint (GET /health)
  • Prometheus metrics endpoint for monitoring (GET /metrics)
  • Scales with Ray Serve, supports multi-GPU deployments
  • Easy to extend with additional models or endpoints

Why Use This API Server?

This API server offers a powerful, scalable, and cost-effective alternative to relying on hosted text embedding APIs like OpenAI’s embedding service. Here are the key benefits:

1. Overcome Rate Limits and Latency Bottlenecks

  • Hosted APIs have strict rate limits that can throttle throughput.
  • This server uses Ray Serve autoscaling to handle high concurrency and large request volumes without external API restrictions.
  • Embeddings are generated locally or in your cloud environment, reducing latency.
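
Ray Serve's scaling policy is a per-deployment setting. A hypothetical autoscaling configuration (the values are illustrative, not the repo's defaults, and exact option names can vary across Ray versions):

from ray import serve

@serve.deployment(
    autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 4,  # scale out as request load grows
        "target_num_ongoing_requests_per_replica": 16,
    },
    ray_actor_options={"num_gpus": 1},  # one GPU per replica
)
class EmbeddingService:
    ...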

2. Cost-Effective at Scale

  • Hosted embedding APIs charge per request, which can get expensive with millions of calls.
  • Running your own embedding model (e.g., intfloat/multilingual-e5-base) on GPUs reduces costs.
  • Autoscaling helps optimize resource usage and spending.

3. Greater Control and Customization

  • Choose and update your embedding model anytime without waiting for third-party API updates.
  • Customize preprocessing or postprocessing as needed.
  • Integrate seamlessly with monitoring tools like Prometheus, Grafana, and Ray Dashboard.
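
For example, the E5 model family is trained with "query: " and "passage: " input prefixes, so one natural preprocessing customization is prepending them before encoding. A hypothetical hook (the function name is illustrative):

def add_e5_prefix(texts, role="passage"):
    # E5 models expect "query: " for queries and "passage: " for documents.
    return [f"{role}: {t}" for t in texts]

queries = add_e5_prefix(["how do embeddings work?"], role="query")
passages = add_e5_prefix(["Embeddings map text to vectors."], role="passage")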

4. Privacy and Data Security

  • Keep sensitive or proprietary text in your own environment.
  • Avoid sending data to third-party services, ensuring better compliance and data protection.

By deploying this API server, you gain full control, scalability, and cost efficiency—making it ideal for production workloads requiring high-throughput semantic embeddings.

Table of Contents

  • Overview
  • Features
  • Why Use This API Server?
  • Installation
  • Running the Server
  • API Usage
  • Testing
  • Monitoring & Observability
  • Docker & Deployment
  • Repository Structure
  • Troubleshooting
  • License

Installation

Requirements

  • Python 3 with pip
  • Redis (backing the caching layer)
  • Docker and Docker Compose (optional, for containerized deployment)
  • An NVIDIA GPU with the NVIDIA Container Toolkit (optional, for GPU serving; see Troubleshooting)

Install dependencies

git clone https://github.com/vishukla/e5-embedding-ray-serve.git
cd e5-embedding-ray-serve

python3 -m venv venv
source venv/bin/activate

pip install --upgrade pip
pip install -r requirements.txt

Running the Server

Locally (without Docker)

ray start --head --port=6379 --dashboard-host=0.0.0.0
python app/serve_app.py

The API will be available at: http://localhost:8000

The Ray dashboard will be at: http://localhost:8265

API Usage

1. Get single embedding (GET)

GET /embeddings?sentence=your+text+here

Example:

curl "http://localhost:8000/embeddings?sentence=The+quick+brown+fox"

2. Get bulk embeddings (POST)

POST /embeddings/bulk
Content-Type: application/json

{
  "sentences": ["sentence 1", "sentence 2", "..."]
}

Example:

curl -X POST "http://localhost:8000/embeddings/bulk" \
     -H "Content-Type: application/json" \
     -d @tests/payloads/bulk_payload.json
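
An equivalent request from Python, with the payload built inline instead of loaded from tests/payloads/bulk_payload.json:

import requests

payload = {"sentences": ["sentence 1", "sentence 2"]}
resp = requests.post("http://localhost:8000/embeddings/bulk", json=payload)
resp.raise_for_status()
print(resp.json())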

3. Compute similarity (POST)

POST /embeddings/similarity
Content-Type: application/json

{
  "sentence_1": "text 1",
  "sentence_2": "text 2"
}

Example:

curl -X POST "http://localhost:8000/embeddings/similarity" \
     -H "Content-Type: application/json" \
     -d @tests/payloads/similarity_payload.json
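
For reference, cosine similarity is the dot product of the two embedding vectors divided by the product of their norms. A standalone illustration of the computation with sentence-transformers (not the server's exact code):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("intfloat/multilingual-e5-base")
emb = model.encode(["query: text 1", "query: text 2"], convert_to_tensor=True)
score = util.cos_sim(emb[0], emb[1])
print(float(score))  # in [-1, 1]; closer to 1 means more similar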

4. Semantic search (POST)

POST /embeddings/search
Content-Type: application/json

{
  "query": "search query",
  "sentences": ["candidate 1", "candidate 2", "..."]
}

Example:

curl -X POST "http://localhost:8000/embeddings/search" \
     -H "Content-Type: application/json" \
     -d @tests/payloads/search_payload.json
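
Semantic search boils down to embedding the query and every candidate, then ranking candidates by cosine similarity. A standalone sketch of that logic (illustrative, not the server's exact implementation):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("intfloat/multilingual-e5-base")
query_emb = model.encode("query: search query", convert_to_tensor=True)
cand_embs = model.encode(
    ["passage: candidate 1", "passage: candidate 2"],
    convert_to_tensor=True,
)
hits = util.semantic_search(query_emb, cand_embs, top_k=1)
print(hits[0])  # e.g. [{'corpus_id': 0, 'score': ...}]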

5. Health Check

GET /health

Example:

curl -X GET "http://localhost:8000/health"

Returns:

{"status": "ok"}

6. Metrics for Monitoring

GET /metrics

Example:

curl -X GET "http://localhost:8000/metrics"

Exposes Prometheus metrics for scraping.

Testing

Sample JSON payload files for testing are located in:

tests/payloads/

Run curl commands with these payloads as shown in the API Usage section.

You can also run automated tests (if implemented):

pytest tests/test_api.py -v
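
A smoke test in that style might look like this (hypothetical; the real assertions live in tests/test_api.py, and the server must be running on localhost:8000):

import requests

BASE_URL = "http://localhost:8000"

def test_health():
    resp = requests.get(f"{BASE_URL}/health")
    assert resp.status_code == 200
    assert resp.json() == {"status": "ok"}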

Run Locust for performance testing:

locust -f tests/stress/locust_test.py --host=http://localhost:8000
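
A minimal locustfile in the spirit of tests/stress/locust_test.py (hypothetical; the real file may exercise more endpoints):

from locust import HttpUser, task, between

class EmbeddingUser(HttpUser):
    wait_time = between(0.1, 0.5)  # pause 100-500 ms between requests

    @task
    def single_embedding(self):
        self.client.get("/embeddings", params={"sentence": "load test"})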

Monitoring & Observability

  • Prometheus metrics available at /metrics
  • Grafana dashboards can be set up with provided configs in monitoring/grafana-provisioning/
  • Ray Dashboard available on port 8265 for cluster health, resource utilization, and request tracing

Dashboards

  • Ray Serve Dashboard: http://localhost:8265
    Track replica scaling, request queues, latency, errors, and resource utilization.

  • Prometheus Dashboard: http://localhost:9090
    Explore raw metrics and PromQL queries.

  • Grafana Dashboard: http://localhost:3000
    Visual dashboards for Requests Per Minute (RPM), QPS, latency (P95, P99), error rates, and CPU/GPU utilization.

Docker & Deployment

Build and run Docker container

docker build -t ray-embedding-server .
docker run --gpus all -p 8000:8000 -p 8265:8265 ray-embedding-server

Docker Compose

Use docker-compose.yml to launch with Prometheus and Grafana:

docker-compose up -d

Repository Structure

e5-embedding-ray-serve/
├── app/
│   └── serve_app.py               # API & Ray Serve deployment
├── monitoring/
│   ├── prometheus.yml             # Prometheus config
│   └── grafana-provisioning/      # Grafana dashboards & datasources
├── tests/
│   ├── test_api.py                # Automated tests (optional)
│   └── payloads/                  # Sample JSON payloads for curl testing
│       ├── bulk_payload.json
│       ├── similarity_payload.json
│       └── search_payload.json
├── Dockerfile
├── docker-compose.yml
├── requirements.txt
└── README.md

Troubleshooting

You need the NVIDIA Container Toolkit installed for Docker to use the GPU. You may also need to set the default runtime in /etc/docker/daemon.json:

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "/usr/bin/nvidia-container-runtime"
        }
    }
}

and then restart the Docker service:

sudo systemctl restart docker

You can configure GPU allocation by adding or removing the following configuration segments:

In docker-compose.yml:

services:
  ray-embed-server:
    ...
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]

In app/serve_app.py, update ray_actor_options={"num_gpus": 1} in the deployment definition.
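
Ray also accepts fractional GPU counts, which lets several replicas share one device. An illustrative variant (values are hypothetical):

from ray import serve

@serve.deployment(
    num_replicas=2,
    ray_actor_options={"num_gpus": 0.5},  # two replicas share one GPU
)
class EmbeddingService:
    ...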

License

This project is licensed under the MIT License.
