E5 Embeddings API Server with Ray Serve

Overview

This repository contains a production-ready scalable API server that provides text embedding, similarity, and semantic search services using the intfloat/multilingual-e5-base model from Sentence Transformers.

The API is implemented with:

  • FastAPI for HTTP endpoints
  • Ray Serve for scalable model serving and orchestration
  • Redis for caching to reduce redundant computations and improve latency
  • Prometheus + Grafana for monitoring and observability
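
A minimal sketch of how these pieces fit together (illustrative only; the actual implementation lives in app/serve_app.py, and names like EmbeddingService are placeholders):

from fastapi import FastAPI
from ray import serve
from sentence_transformers import SentenceTransformer

app = FastAPI()

@serve.deployment(ray_actor_options={"num_gpus": 1})  # set 0 for CPU-only
@serve.ingress(app)
class EmbeddingService:
    def __init__(self):
        # The model is loaded once per replica and reused across requests.
        self.model = SentenceTransformer("intfloat/multilingual-e5-base")

    @app.get("/embeddings")
    def embed(self, sentence: str):
        return {"embedding": self.model.encode(sentence).tolist()}

serve.run(EmbeddingService.bind())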

Features

  • Get single sentence embeddings (GET /embeddings?sentence=...)
  • Get bulk embeddings (POST /embeddings/bulk)
  • Compute cosine similarity between two sentences (POST /embeddings/similarity)
  • Semantic search to find the most similar sentence (POST /embeddings/search)
  • Built-in health check endpoint (GET /health)
  • Prometheus metrics endpoint for monitoring (GET /metrics)
  • Scales with Ray Serve, supports multi-GPU deployments
  • Easy to extend with additional models or endpoints

Why Use This API Server?

This API server offers a powerful, scalable, and cost-effective alternative to relying on hosted text embedding APIs like OpenAI’s embedding service. Here are the key benefits:

1. Overcome Rate Limits and Latency Bottlenecks

  • Hosted APIs have strict rate limits that can throttle throughput.
  • This server uses Ray Serve autoscaling to handle high concurrency and large request volumes without external API restrictions.
  • Embeddings are generated locally or in your cloud environment, reducing latency.
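
Ray Serve's scaling policy is a per-deployment setting. A hypothetical autoscaling configuration (the values are illustrative, not the repo's defaults, and exact option names can vary across Ray versions):

from ray import serve

@serve.deployment(
    autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 4,  # scale out as request load grows
        "target_num_ongoing_requests_per_replica": 16,
    },
    ray_actor_options={"num_gpus": 1},  # one GPU per replica
)
class EmbeddingService:
    ...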

2. Cost-Effective at Scale

  • Hosted embedding APIs charge per request, which can get expensive with millions of calls.
  • Running your own embedding model (e.g., intfloat/multilingual-e5-base) on GPUs reduces costs.
  • Autoscaling helps optimize resource usage and spending.

3. Greater Control and Customization

  • Choose and update your embedding model anytime without waiting for third-party API updates.
  • Customize preprocessing or postprocessing as needed.
  • Integrate seamlessly with monitoring tools like Prometheus, Grafana, and Ray Dashboard.
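
For example, the E5 model family is trained with "query: " and "passage: " input prefixes, so one natural preprocessing customization is prepending them before encoding. A hypothetical hook (the function name is illustrative):

def add_e5_prefix(texts, role="passage"):
    # E5 models expect "query: " for queries and "passage: " for documents.
    return [f"{role}: {t}" for t in texts]

queries = add_e5_prefix(["how do embeddings work?"], role="query")
passages = add_e5_prefix(["Embeddings map text to vectors."], role="passage")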

4. Privacy and Data Security

  • Keep sensitive or proprietary text in your own environment.
  • Avoid sending data to third-party services, ensuring better compliance and data protection.

By deploying this API server, you gain full control, scalability, and cost efficiency—making it ideal for production workloads requiring high-throughput semantic embeddings.

Table of Contents

  • Overview
  • Features
  • Why Use This API Server?
  • Installation
  • Running the Server
  • API Usage
  • Testing
  • Monitoring & Observability
  • Docker & Deployment
  • Repository Structure
  • Troubleshooting
  • License

Installation

Requirements

  • Python 3 with pip
  • Redis (backing the caching layer)
  • Docker and Docker Compose (optional, for containerized deployment)
  • An NVIDIA GPU with the NVIDIA Container Toolkit (optional, for GPU serving; see Troubleshooting)

Install dependencies

git clone https://github.com/vishukla/e5-embedding-ray-serve.git
cd e5-embedding-ray-serve

python3 -m venv venv
source venv/bin/activate

pip install --upgrade pip
pip install -r requirements.txt

Running the Server

Locally (without Docker)

ray start --head --port=6379 --dashboard-host=0.0.0.0
python app/serve_app.py

The API will be available at: http://localhost:8000

The Ray dashboard will be at: http://localhost:8265

API Usage

1. Get single embedding (GET)

GET /embeddings?sentence=your+text+here

Example:

curl "http://localhost:8000/embeddings?sentence=The+quick+brown+fox"

2. Get bulk embeddings (POST)

POST /embeddings/bulk
Content-Type: application/json

{
  "sentences": ["sentence 1", "sentence 2", "..."]
}

Example:

curl -X POST "http://localhost:8000/embeddings/bulk" \
     -H "Content-Type: application/json" \
     -d @tests/payloads/bulk_payload.json
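
An equivalent request from Python, with the payload built inline instead of loaded from tests/payloads/bulk_payload.json:

import requests

payload = {"sentences": ["sentence 1", "sentence 2"]}
resp = requests.post("http://localhost:8000/embeddings/bulk", json=payload)
resp.raise_for_status()
print(resp.json())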

3. Compute similarity (POST)

POST /embeddings/similarity
Content-Type: application/json

{
  "sentence_1": "text 1",
  "sentence_2": "text 2"
}

Example:

curl -X POST "http://localhost:8000/embeddings/similarity" \
     -H "Content-Type: application/json" \
     -d @tests/payloads/similarity_payload.json
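
For reference, cosine similarity is the dot product of the two embedding vectors divided by the product of their norms. A standalone illustration of the computation with sentence-transformers (not the server's exact code):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("intfloat/multilingual-e5-base")
emb = model.encode(["query: text 1", "query: text 2"], convert_to_tensor=True)
score = util.cos_sim(emb[0], emb[1])
print(float(score))  # in [-1, 1]; closer to 1 means more similar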

4. Semantic search (POST)

POST /embeddings/search
Content-Type: application/json

{
  "query": "search query",
  "sentences": ["candidate 1", "candidate 2", "..."]
}

Example:

curl -X POST "http://localhost:8000/embeddings/search" \
     -H "Content-Type: application/json" \
     -d @tests/payloads/search_payload.json
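
Semantic search boils down to embedding the query and every candidate, then ranking candidates by cosine similarity. A standalone sketch of that logic (illustrative, not the server's exact implementation):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("intfloat/multilingual-e5-base")
query_emb = model.encode("query: search query", convert_to_tensor=True)
cand_embs = model.encode(
    ["passage: candidate 1", "passage: candidate 2"],
    convert_to_tensor=True,
)
hits = util.semantic_search(query_emb, cand_embs, top_k=1)
print(hits[0])  # e.g. [{'corpus_id': 0, 'score': ...}]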

5. Health Check

GET /health

Example:

curl -X GET "http://localhost:8000/health"

Returns:

{"status": "ok"}

6. Metrics for Monitoring

GET /metrics

Example:

curl -X GET "http://localhost:8000/metrics"

Exposes Prometheus metrics for scraping.

Testing

Sample JSON payload files for testing are located in:

tests/payloads/

Run curl commands with these payloads as shown in the API Usage section.

You can also run automated tests (if implemented):

pytest tests/test_api.py -v
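
A smoke test in that style might look like this (hypothetical; the real assertions live in tests/test_api.py, and the server must be running on localhost:8000):

import requests

BASE_URL = "http://localhost:8000"

def test_health():
    resp = requests.get(f"{BASE_URL}/health")
    assert resp.status_code == 200
    assert resp.json() == {"status": "ok"}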

Run Locust for performance testing:

locust -f tests/stress/locust_test.py --host=http://localhost:8000
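
A minimal locustfile in the spirit of tests/stress/locust_test.py (hypothetical; the real file may exercise more endpoints):

from locust import HttpUser, task, between

class EmbeddingUser(HttpUser):
    wait_time = between(0.1, 0.5)  # pause 100-500 ms between requests

    @task
    def single_embedding(self):
        self.client.get("/embeddings", params={"sentence": "load test"})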

Monitoring & Observability

  • Prometheus metrics available at /metrics
  • Grafana dashboards can be set up with provided configs in monitoring/grafana-provisioning/
  • Ray Dashboard available on port 8265 for cluster health, resource utilization, and request tracing

Dashboards

  • Ray Serve Dashboard: http://localhost:8265
    Track replica scaling, request queues, latency, errors, and resource utilization.

  • Prometheus Dashboard: http://localhost:9090
    Explore raw metrics and PromQL queries.

  • Grafana Dashboard: http://localhost:3000
    Visual dashboards for Requests Per Minute (RPM), QPS, latency (P95, P99), error rates, and CPU/GPU utilization.

Docker & Deployment

Build and run Docker container

docker build -t ray-embedding-server .
docker run --gpus all -p 8000:8000 -p 8265:8265 ray-embedding-server

Docker Compose

Use docker-compose.yml to launch with Prometheus and Grafana:

docker-compose up -d

Repository Structure

e5-embedding-ray-serve/
├── app/
│   └── serve_app.py               # API & Ray Serve deployment
├── monitoring/
│   ├── prometheus.yml             # Prometheus config
│   └── grafana-provisioning/      # Grafana dashboards & datasources
├── tests/
│   ├── test_api.py                # Automated tests (optional)
│   └── payloads/                  # Sample JSON payloads for curl testing
│       ├── bulk_payload.json
│       ├── similarity_payload.json
│       └── search_payload.json
├── Dockerfile
├── docker-compose.yml
├── requirements.txt
└── README.md

Troubleshooting

You need the NVIDIA Container Toolkit installed for Docker to use the GPU. You may also need to set the default runtime in /etc/docker/daemon.json:

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "/usr/bin/nvidia-container-runtime"
        }
    }
}

and then restart the Docker service:

sudo systemctl restart docker

You can configure GPU allocation by adding or removing the following configuration segments:

In docker-compose.yml:

services:
  ray-embed-server:
    ...
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]

In app/serve_app.py, update ray_actor_options={"num_gpus": 1} in the deployment definition.
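
Ray also accepts fractional GPU counts, which lets several replicas share one device. An illustrative variant (values are hypothetical):

from ray import serve

@serve.deployment(
    num_replicas=2,
    ray_actor_options={"num_gpus": 0.5},  # two replicas share one GPU
)
class EmbeddingService:
    ...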

License

This project is licensed under the MIT License.
