Commit 4e1f5cd

feat: add support for classification models (#76)
1 parent d9153e1 commit 4e1f5cd

26 files changed (+1218 / -765 lines)


Cargo.lock

Lines changed: 0 additions & 1 deletion
Some generated files are not rendered by default.

Dockerfile

Lines changed: 1 addition & 3 deletions
@@ -34,7 +34,7 @@ RUN wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRO
     tee /etc/apt/sources.list.d/oneAPI.list

 RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
-    intel-oneapi-mkl-devel \
+    intel-oneapi-mkl-devel=2024.0.0-49656 \
     build-essential \
     && rm -rf /var/lib/apt/lists/*

@@ -74,10 +74,8 @@ COPY --from=builder /opt/intel/oneapi/mkl/latest/lib/intel64/libmkl_intel_thread
 COPY --from=builder /opt/intel/oneapi/mkl/latest/lib/intel64/libmkl_core.so.2 /usr/local/lib/libmkl_core.so.2
 COPY --from=builder /opt/intel/oneapi/mkl/latest/lib/intel64/libmkl_vml_def.so.2 /usr/local/lib/libmkl_vml_def.so.2
 COPY --from=builder /opt/intel/oneapi/mkl/latest/lib/intel64/libmkl_def.so.2 /usr/local/lib/libmkl_def.so.2
-COPY --from=builder /opt/intel/oneapi/mkl/latest/lib/intel64/libmkl_vml_avx.so.2 /usr/local/lib/libmkl_vml_avx.so.2
 COPY --from=builder /opt/intel/oneapi/mkl/latest/lib/intel64/libmkl_vml_avx2.so.2 /usr/local/lib/libmkl_vml_avx2.so.2
 COPY --from=builder /opt/intel/oneapi/mkl/latest/lib/intel64/libmkl_vml_avx512.so.2 /usr/local/lib/libmkl_vml_avx512.so.2
-COPY --from=builder /opt/intel/oneapi/mkl/latest/lib/intel64/libmkl_avx.so.2 /usr/local/lib/libmkl_avx.so.2
 COPY --from=builder /opt/intel/oneapi/mkl/latest/lib/intel64/libmkl_avx2.so.2 /usr/local/lib/libmkl_avx2.so.2
 COPY --from=builder /opt/intel/oneapi/mkl/latest/lib/intel64/libmkl_avx512.so.2 /usr/local/lib/libmkl_avx512.so.2
 COPY --from=builder /usr/src/libfakeintel.so /usr/local/libfakeintel.so

README.md

Lines changed: 87 additions & 27 deletions
@@ -9,9 +9,10 @@
   <img alt="Swagger API documentation" src="https://img.shields.io/badge/API-Swagger-informational">
 </a>

-A blazing fast inference solution for text embeddings models.
+A blazing fast inference solution for text embeddings models.

-Benchmark for [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) on an Nvidia A10 with a sequence length of 512 tokens:
+Benchmark for [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) on an Nvidia A10 with a sequence
+length of 512 tokens:

 <p>
   <img src="assets/bs1-lat.png" width="400" />
@@ -27,33 +28,37 @@ Benchmark for [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1
 ## Table of contents

 - [Get Started](#get-started)
-  - [Supported Models](#supported-models)
-  - [Docker](#docker)
-  - [Docker Images](#docker-images)
-  - [API Documentation](#api-documentation)
-  - [Using a private or gated model](#using-a-private-or-gated-model)
-  - [Distributed Tracing](#distributed-tracing)
+  - [Supported Models](#supported-models)
+  - [Docker](#docker)
+  - [Docker Images](#docker-images)
+  - [API Documentation](#api-documentation)
+  - [Using a private or gated model](#using-a-private-or-gated-model)
+  - [Using Sequence Classification models](#using-sequence-classification-models)
+  - [Distributed Tracing](#distributed-tracing)
 - [Local Install](#local-install)
 - [Docker Build](#docker-build)

-Text Embeddings Inference (TEI) is a toolkit for deploying and serving open source text embeddings models. TEI enables
-high-performance extraction for the most popular models, including FlagEmbedding, Ember, GTE and E5. TEI implements many features
-such as:
+Text Embeddings Inference (TEI) is a toolkit for deploying and serving open source text embeddings and sequence
+classification models. TEI enables high-performance extraction for the most popular models, including FlagEmbedding,
+Ember, GTE and E5. TEI implements many features such as:

 * No model graph compilation step
 * Small docker images and fast boot times. Get ready for true serverless!
 * Token based dynamic batching
 * Optimized transformers code for inference using [Flash Attention](https://github.com/HazyResearch/flash-attention),
-  [Candle](https://github.com/huggingface/candle) and [cuBLASLt](https://docs.nvidia.com/cuda/cublas/#using-the-cublaslt-api)
+  [Candle](https://github.com/huggingface/candle)
+  and [cuBLASLt](https://docs.nvidia.com/cuda/cublas/#using-the-cublaslt-api)
 * [Safetensors](https://github.com/huggingface/safetensors) weight loading
 * Production ready (distributed tracing with Open Telemetry, Prometheus metrics)

-
 ## Get Started

 ### Supported Models

-You can use any JinaBERT model with Alibi or absolute positions or any BERT, CamemBERT, RoBERTa, or XLM-RoBERTa model with absolute positions in `text-embeddings-inference`.
+#### Text Embeddings
+
+You can use any JinaBERT model with Alibi or absolute positions or any BERT, CamemBERT, RoBERTa, or XLM-RoBERTa model
+with absolute positions in `text-embeddings-inference`.

 **Support for other model types will be added in the future.**

@@ -73,8 +78,20 @@ Examples of supported models:
 | N/A | JinaBERT | [jinaai/jina-embeddings-v2-base-en](https://hf.co/jinaai/jina-embeddings-v2-base-en) |
 | N/A | JinaBERT | [jinaai/jina-embeddings-v2-small-en](https://hf.co/jinaai/jina-embeddings-v2-small-en) |

+You can explore the list of best performing text embeddings
+models [here](https://huggingface.co/spaces/mteb/leaderboard).
+
+#### Sequence Classification and Re-Ranking
+
+`text-embeddings-inference` v0.4.0 added support for CamemBERT, RoBERTa and XLM-RoBERTa Sequence Classification models.
+
+Example of supported sequence classification models:

-You can explore the list of best performing text embeddings models [here](https://huggingface.co/spaces/mteb/leaderboard).
+| Task               | Model Type  | Model ID                                                                                     | Revision    |
+|--------------------|-------------|----------------------------------------------------------------------------------------------|-------------|
+| Re-Ranking         | XLM-RoBERTa | [BAAI/bge-reranker-large](https://huggingface.co/BAAI/bge-reranker-large)                    | `refs/pr/4` |
+| Re-Ranking         | XLM-RoBERTa | [BAAI/bge-reranker-base](https://huggingface.co/BAAI/bge-reranker-base)                      | `refs/pr/5` |
+| Sentiment Analysis | RoBERTa     | [SamLowe/roberta-base-go_emotions](https://huggingface.co/SamLowe/roberta-base-go_emotions) |             |

 ### Docker

@@ -95,7 +112,8 @@ curl 127.0.0.1:8080/embed \
     -H 'Content-Type: application/json'
 ```

-**Note:** To use GPUs, you need to install the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html).
+**Note:** To use GPUs, you need to install
+the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html).
 We also recommend using NVIDIA drivers with CUDA version 12.0 or higher.

 To see all options to serve your models:
@@ -130,20 +148,18 @@ Options:

       --dtype <DTYPE>
           The dtype to be forced upon the model
-
-          If `dtype` is not set, it defaults to float32 on accelerate, and float16 for all other architectures

           [env: DTYPE=]
           [possible values: float16, float32]

       --pooling <POOLING>
-          Optionally control the pooling method.
-
-          If `pooling` is not set, the pooling configuration will be parsed from the model `1_Pooling/config.json`
-          configuration.
-
+          Optionally control the pooling method for embedding models.
+
+          If `pooling` is not set, the pooling configuration will be parsed from the model `1_Pooling/config.json`
+          configuration.
+
           If `pooling` is set, it will override the model pooling configuration
-
+
           [env: POOLING=]
           [possible values: cls, mean]

@@ -241,7 +257,8 @@ You can turn Flash Attention v1 ON by using the `USE_FLASH_ATTENTION=True` envir
 ### API documentation

 You can consult the OpenAPI documentation of the `text-embeddings-inference` REST API using the `/docs` route.
-The Swagger UI is also available at: [https://huggingface.github.io/text-embeddings-inference](https://huggingface.github.io/text-embeddings-inference).
+The Swagger UI is also available
+at: [https://huggingface.github.io/text-embeddings-inference](https://huggingface.github.io/text-embeddings-inference).

 ### Using a private or gated model

@@ -264,6 +281,48 @@ token=<your cli READ token>
 docker run --gpus all -e HUGGING_FACE_HUB_TOKEN=$token -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:0.3.0 --model-id $model
 ```

+### Using Sequence Classification models
+
+`text-embeddings-inference` v0.4.0 added support for CamemBERT, RoBERTa and XLM-RoBERTa Sequence Classification models.
+See [this blogpost](https://blog.llamaindex.ai/boosting-rag-picking-the-best-embedding-reranker-models-42d079022e83) by
+the LlamaIndex team to understand how you can use Sequence Classification models in your RAG pipeline to improve
+downstream performance.
+
+```shell
+model=BAAI/bge-reranker-large
+revision=refs/pr/4
+volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
+
+docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:0.3.0 --model-id $model --revision $revision
+```
+
+And then you can rank the similarity between a pair of inputs with:
+
+```bash
+curl 127.0.0.1:8080/predict \
+    -X POST \
+    -d '{"inputs":["What is Deep Learning?", "Deep learning is..."], "raw_scores": true}' \
+    -H 'Content-Type: application/json'
+```
+
+You can also use classic Sequence Classification models like `SamLowe/roberta-base-go_emotions`:
+
+```shell
+model=SamLowe/roberta-base-go_emotions
+volume=$PWD/data
+
+docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:0.3.0 --model-id $model
+```
+
+Once you have deployed the model you can use the `predict` endpoint to get the emotions most associated with an input:
+
+```bash
+curl 127.0.0.1:8080/predict \
+    -X POST \
+    -d '{"inputs":"I like you."}' \
+    -H 'Content-Type: application/json'
+```
+
 ### Distributed Tracing

 `text-embeddings-inference` is instrumented with distributed tracing using OpenTelemetry. You can use this feature
@@ -290,7 +349,7 @@ cargo install --path router -F candle -F mkl
 cargo install --path router -F candle -F accelerate
 ```

-You can now launch Text Embeddings Inference on CPU with:
+You can now launch Text Embeddings Inference on CPU with:

 ```shell
 model=BAAI/bge-large-en-v1.5
@@ -309,7 +368,8 @@ sudo apt-get install libssl-dev gcc -y

 GPUs with Cuda compute capabilities < 7.5 are not supported (V100, Titan V, GTX 1000 series, ...).

-Make sure you have Cuda and the nvidia drivers installed. We recommend using NVIDIA drivers with CUDA version 12.0 or higher.
+Make sure you have Cuda and the nvidia drivers installed. We recommend using NVIDIA drivers with CUDA version 12.0 or
+higher.
 You also need to add the nvidia binaries to your path:

 ```shell
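The README additions above exercise the new `/predict` route with `curl`. As a complementary illustration only (not part of this commit), here is a minimal Rust client sketch that sends the same re-ranking request; it assumes the `reqwest` crate with the `blocking` and `json` features plus `serde_json`, and it simply prints the raw JSON response rather than assuming a response schema.

```rust
// Hypothetical client sketch, not part of this commit. It mirrors the
// `curl 127.0.0.1:8080/predict` re-ranking example from the README diff above.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = reqwest::blocking::Client::new();

    // Same payload as the curl example: a pair of inputs plus `raw_scores: true`.
    let body = serde_json::json!({
        "inputs": ["What is Deep Learning?", "Deep learning is..."],
        "raw_scores": true
    });

    // POST to the /predict route of a locally running text-embeddings-inference server.
    let response = client
        .post("http://127.0.0.1:8080/predict")
        .json(&body)
        .send()?
        .text()?;

    println!("{response}");
    Ok(())
}
```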

backends/candle/src/alibi.rs

Lines changed: 1 addition & 1 deletion
@@ -16,7 +16,7 @@
 use candle::{DType, Device, Result, Tensor};

 fn get_slopes_power_of_2(n: usize) -> Vec<f64> {
-    let start: f64 = 2_f64.powf(-2_f64.powf(-((n as f64).log2() - 3_f64)));
+    let start: f64 = 2_f64.powf(-(2_f64.powf(-((n as f64).log2() - 3_f64))));

     (0..n).map(|i| start * start.powi(i as i32)).collect()
 }
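The `get_slopes_power_of_2` helper touched above computes ALiBi attention slopes: for a power-of-two head count `n`, the start value `2^(-(2^(-(log2(n) - 3))))` simplifies to `2^(-8/n)`, and the returned vector is the geometric sequence `start^(i+1)`. Below is a small standalone sanity-check sketch (not part of the commit) that copies the helper and verifies the expected slopes for `n = 8`.

```rust
// Standalone copy of the slope helper from the diff above, plus a quick check.
// For n = 8 heads the slopes should be 1/2, 1/4, ..., 1/256, matching the ALiBi paper.
fn get_slopes_power_of_2(n: usize) -> Vec<f64> {
    // start = 2^(-(2^(-(log2(n) - 3)))) = 2^(-8/n)
    let start: f64 = 2_f64.powf(-(2_f64.powf(-((n as f64).log2() - 3_f64))));
    (0..n).map(|i| start * start.powi(i as i32)).collect()
}

fn main() {
    let slopes = get_slopes_power_of_2(8);
    println!("{slopes:?}"); // [0.5, 0.25, 0.125, ..., 0.00390625]
    assert!((slopes[0] - 0.5).abs() < 1e-12);
    assert!((slopes[7] - 1.0 / 256.0).abs() < 1e-12);
}
```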
