@@ -101,7 +101,7 @@ model=BAAI/bge-large-en-v1.5
revision=refs/pr/5
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

- docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:0.4.0 --model-id $model --revision $revision
+ docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:0.5 --model-id $model --revision $revision
```
And then you can make requests like
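For reference, a minimal `curl` request against the container started above (this assumes the default `8080:80` port mapping and TEI's standard `/embed` route):

```
curl 127.0.0.1:8080/embed \
    -X POST \
    -d '{"inputs":"What is Deep Learning?"}' \
    -H 'Content-Type: application/json'
```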
@@ -242,15 +242,15 @@ Options:
Text Embeddings Inference ships with multiple Docker images that you can use to target a specific backend:
- | Architecture                        | Image                                                                      |
- | ----------------------------------- | -------------------------------------------------------------------------- |
- | CPU                                 | ghcr.io/huggingface/text-embeddings-inference:cpu-0.4.0                    |
- | Volta                               | NOT SUPPORTED                                                              |
- | Turing (T4, RTX 2000 series, ...)   | ghcr.io/huggingface/text-embeddings-inference:turing-0.4.0 (experimental)  |
- | Ampere 80 (A100, A30)               | ghcr.io/huggingface/text-embeddings-inference:0.4.0                        |
- | Ampere 86 (A10, A40, ...)           | ghcr.io/huggingface/text-embeddings-inference:86-0.4.0                     |
- | Ada Lovelace (RTX 4000 series, ...) | ghcr.io/huggingface/text-embeddings-inference:89-0.4.0                     |
- | Hopper (H100)                       | ghcr.io/huggingface/text-embeddings-inference:hopper-0.4.0 (experimental)  |
+ | Architecture                        | Image                                                                      |
+ | ----------------------------------- | -------------------------------------------------------------------------- |
+ | CPU                                 | ghcr.io/huggingface/text-embeddings-inference:cpu-0.5                      |
+ | Volta                               | NOT SUPPORTED                                                              |
+ | Turing (T4, RTX 2000 series, ...)   | ghcr.io/huggingface/text-embeddings-inference:turing-0.5 (experimental)    |
+ | Ampere 80 (A100, A30)               | ghcr.io/huggingface/text-embeddings-inference:0.5                          |
+ | Ampere 86 (A10, A40, ...)           | ghcr.io/huggingface/text-embeddings-inference:86-0.5                       |
+ | Ada Lovelace (RTX 4000 series, ...) | ghcr.io/huggingface/text-embeddings-inference:89-0.5                       |
+ | Hopper (H100)                       | ghcr.io/huggingface/text-embeddings-inference:hopper-0.5 (experimental)    |
**Warning**: Flash Attention is turned off by default for the Turing image as it suffers from precision issues.
You can turn Flash Attention v1 ON by using the `USE_FLASH_ATTENTION=True` environment variable.
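For example, a minimal sketch of opting in on a Turing card, reusing the `model` and `volume` variables from the quick start above:

```
docker run --gpus all -e USE_FLASH_ATTENTION=True -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:turing-0.5 --model-id $model
```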
@@ -279,7 +279,7 @@ model=<your private model>
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
token=<your cli READ token>

- docker run --gpus all -e HUGGING_FACE_HUB_TOKEN=$token -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:0.4.0 --model-id $model
+ docker run --gpus all -e HUGGING_FACE_HUB_TOKEN=$token -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:0.5 --model-id $model
```
### Using Re-ranker models
@@ -297,7 +297,7 @@ model=BAAI/bge-reranker-large
revision=refs/pr/4
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

- docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:0.4.0 --model-id $model --revision $revision
+ docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:0.5 --model-id $model --revision $revision
```
And then you can rank the similarity between a query and a list of passages with:
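For reference, a minimal `curl` sketch of such a request (assuming the default `8080:80` port mapping and TEI's `/rerank` route; the query and passages are illustrative):

```
curl 127.0.0.1:8080/rerank \
    -X POST \
    -d '{"query":"What is Deep Learning?", "texts": ["Deep Learning is not...", "Deep learning is..."]}' \
    -H 'Content-Type: application/json'
```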
@@ -317,7 +317,7 @@ You can also use classic Sequence Classification models like `SamLowe/roberta-ba
model=SamLowe/roberta-base-go_emotions
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

- docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:0.4.0 --model-id $model
+ docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:0.5 --model-id $model
```
Once you have deployed the model you can use the `predict` endpoint to get the emotions most associated with an input:
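For reference, a minimal `curl` sketch (assuming the default `8080:80` port mapping; the input string is illustrative):

```
curl 127.0.0.1:8080/predict \
    -X POST \
    -d '{"inputs":"I like you."}' \
    -H 'Content-Type: application/json'
```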