@@ -102,7 +102,7 @@ model=BAAI/bge-large-en-v1.5
revision=refs/pr/5
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

- docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:0.5 --model-id $model --revision $revision
+ docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:0.6 --model-id $model --revision $revision
```

And then you can make requests like
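The request example itself falls outside this hunk; as a minimal sketch, assuming the `/embed` route and the port mapping from the command above:

```shell
curl 127.0.0.1:8080/embed \
    -X POST \
    -d '{"inputs":"What is Deep Learning?"}' \
    -H 'Content-Type: application/json'
```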
@@ -245,13 +245,13 @@ Text Embeddings Inference ships with multiple Docker images that you can use to
| Architecture | Image |
| -------------------------------------| -------------------------------------------------------------------------|
- | CPU | ghcr.io/huggingface/text-embeddings-inference:cpu-0.5 |
+ | CPU | ghcr.io/huggingface/text-embeddings-inference:cpu-0.6 |
| Volta | NOT SUPPORTED |
- | Turing (T4, RTX 2000 series, ...) | ghcr.io/huggingface/text-embeddings-inference:turing-0.5 (experimental) |
- | Ampere 80 (A100, A30) | ghcr.io/huggingface/text-embeddings-inference:0.5 |
- | Ampere 86 (A10, A40, ...) | ghcr.io/huggingface/text-embeddings-inference:86-0.5 |
- | Ada Lovelace (RTX 4000 series, ...) | ghcr.io/huggingface/text-embeddings-inference:89-0.5 |
- | Hopper (H100) | ghcr.io/huggingface/text-embeddings-inference:hopper-0.5 (experimental) |
+ | Turing (T4, RTX 2000 series, ...) | ghcr.io/huggingface/text-embeddings-inference:turing-0.6 (experimental) |
+ | Ampere 80 (A100, A30) | ghcr.io/huggingface/text-embeddings-inference:0.6 |
+ | Ampere 86 (A10, A40, ...) | ghcr.io/huggingface/text-embeddings-inference:86-0.6 |
+ | Ada Lovelace (RTX 4000 series, ...) | ghcr.io/huggingface/text-embeddings-inference:89-0.6 |
+ | Hopper (H100) | ghcr.io/huggingface/text-embeddings-inference:hopper-0.6 (experimental) |

**Warning**: Flash Attention is turned off by default for the Turing image as it suffers from precision issues.
You can turn Flash Attention v1 ON by using the `USE_FLASH_ATTENTION=True` environment variable.
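For instance, a minimal sketch of passing that variable to the Turing image (reusing the `$model` and `$volume` variables defined in the snippets above):

```shell
docker run --gpus all -e USE_FLASH_ATTENTION=True -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:turing-0.6 --model-id $model
```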
@@ -280,7 +280,7 @@ model=<your private model>
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
token=<your cli READ token>

- docker run --gpus all -e HUGGING_FACE_HUB_TOKEN=$token -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:0.5 --model-id $model
+ docker run --gpus all -e HUGGING_FACE_HUB_TOKEN=$token -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:0.6 --model-id $model
```

### Using Re-ranker models
@@ -298,7 +298,7 @@ model=BAAI/bge-reranker-large
revision=refs/pr/4
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

- docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:0.5 --model-id $model --revision $revision
+ docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:0.6 --model-id $model --revision $revision
```

And then you can rank the similarity between a query and a list of texts with:
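The request itself sits outside this hunk; a minimal sketch of the `/rerank` call (payload shape assumed from the public HTTP API, texts truncated for brevity):

```shell
curl 127.0.0.1:8080/rerank \
    -X POST \
    -d '{"query":"What is Deep Learning?", "texts": ["Deep Learning is not...", "Deep learning is..."]}' \
    -H 'Content-Type: application/json'
```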
@@ -318,7 +318,7 @@ You can also use classic Sequence Classification models like `SamLowe/roberta-ba
model=SamLowe/roberta-base-go_emotions
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

- docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:0.5 --model-id $model
+ docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:0.6 --model-id $model
```

Once you have deployed the model you can use the `predict` endpoint to get the emotions most associated with an input:
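For instance, a sketch of that call (the input text is illustrative):

```shell
curl 127.0.0.1:8080/predict \
    -X POST \
    -d '{"inputs":"I like you."}' \
    -H 'Content-Type: application/json'
```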
@@ -340,14 +340,14 @@ by setting the address to an OTLP collector with the `--otlp-endpoint` argument.
`text-embeddings-inference` offers a gRPC API as an alternative to the default HTTP API for high performance
deployments. The API protobuf definition can be found [here](https://github.com/huggingface/text-embeddings-inference/blob/main/proto/tei.proto).

- You can use the gRPC API by adding the `+grpc` tag to any TEI Docker image. For example:
+ You can use the gRPC API by adding the `-grpc` tag to any TEI Docker image. For example:

```shell
model=BAAI/bge-large-en-v1.5
revision=refs/pr/5
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

- docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:0.5+grpc --model-id $model --revision $revision
+ docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:0.6-grpc --model-id $model --revision $revision
```

```shell
# a sketch of calling the gRPC Embed method with grpcurl (assumes grpcurl is
# installed; service and method names taken from proto/tei.proto)
grpcurl -d '{"inputs": "What is Deep Learning"}' -plaintext 0.0.0.0:8080 tei.v1.Embed/Embed
```