docs/source/en/quick_tour.md
<!--Copyright 2025 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->

# Quick Tour

## Set up

The easiest way to get started with TEI is to use one of the official Docker containers
(see [Supported models and hardware](supported_models) to choose the right container).

So you first need to install Docker following the official [installation instructions](https://docs.docker.com/get-docker/).

TEI supports inference on both GPU and CPU. If you plan on using a GPU, make sure your hardware is supported by checking [this table](https://github.com/huggingface/text-embeddings-inference?tab=readme-ov-file#docker-images).
Next, install the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html). The NVIDIA drivers on your device need to be compatible with CUDA version 12.2 or higher.

## Deploy

Next, it's time to deploy your model. Let's say you want to use [`BAAI/bge-large-en-v1.5`](https://huggingface.co/BAAI/bge-large-en-v1.5). Here's how you can do this:

```shell
model=BAAI/bge-large-en-v1.5
volume=$PWD/data
docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.7 --model-id $model
```

<Tip>

We also recommend sharing a volume with the Docker container (`volume=$PWD/data`) to avoid downloading weights every run.

</Tip>

## Inference

Inference can be performed in three ways: using cURL, or via the `InferenceClient` or `OpenAI` Python SDKs.

#### cURL

To send a POST request to the TEI endpoint using cURL, you can run the following command:

```bash
curl 127.0.0.1:8080/embed \
    -X POST \
    -d '{"inputs":"What is Deep Learning?"}' \
    -H 'Content-Type: application/json'
```

#### Python

To run inference using Python, you can either use the [`huggingface_hub`](https://huggingface.co/docs/huggingface_hub/en/index) Python SDK (recommended) or the `openai` Python SDK.

##### huggingface_hub

You can install it via pip as `pip install --upgrade --quiet huggingface_hub`, and then run:

```python
from huggingface_hub import InferenceClient

client = InferenceClient()

# Send the text to the local TEI container and get back its embedding
embedding = client.feature_extraction(
    "What is deep learning?",
    model="http://localhost:8080/embed",
)
print(len(embedding[0]))
```

##### OpenAI

You can install it via pip as `pip install --upgrade openai`.
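The snippet below is a minimal sketch of the call, assuming the container also exposes TEI's OpenAI-compatible `/v1/embeddings` route on the same port; the `model` name and `api_key` values are placeholders:

```python
from openai import OpenAI

# Point the OpenAI client at the local TEI container.
# The api_key is a dummy value and the model name is a placeholder:
# TEI serves whichever model the container was started with.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.embeddings.create(
    model="BAAI/bge-large-en-v1.5",
    input="What is deep learning?",
)
print(len(response.data[0].embedding))
```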

TEI also supports re-ranker and classic sequence classification models.

### Re-rankers

Rerankers, also called cross-encoders, are sequence classification models with a single class that score the similarity between a query and a text. See [this blogpost](https://blog.llamaindex.ai/boosting-rag-picking-the-best-embedding-reranker-models-42d079022e83) by
the LlamaIndex team to understand how you can use re-ranker models in your RAG pipeline to improve
downstream performance.

Let's say you want to use [`BAAI/bge-reranker-large`](https://huggingface.co/BAAI/bge-reranker-large). First, you can deploy it like so:

```shell
model=BAAI/bge-reranker-large
volume=$PWD/data
docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.7 --model-id $model
```

Once you have deployed a model, you can use the `rerank` endpoint to rank the similarity between a query and a list of texts. With cURL this can be done like so:

```bash
curl 127.0.0.1:8080/rerank \
    -X POST \
    -d '{"query": "What is Deep Learning?", "texts": ["Deep Learning is not...", "Deep learning is..."]}' \
    -H 'Content-Type: application/json'
```
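
The same request can also be sent from Python. The following is a minimal sketch using the `requests` library, assuming the container is listening on `localhost:8080` and the payload shape shown above:

```python
import requests

# Ask the TEI /rerank endpoint to score each text against the query.
response = requests.post(
    "http://localhost:8080/rerank",
    json={
        "query": "What is Deep Learning?",
        "texts": ["Deep Learning is not...", "Deep learning is..."],
    },
)
response.raise_for_status()
print(response.json())  # one similarity score per input text
```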

### Sequence classification models

You can also use classic Sequence Classification models like [`SamLowe/roberta-base-go_emotions`](https://huggingface.co/SamLowe/roberta-base-go_emotions):

```shell
model=SamLowe/roberta-base-go_emotions
volume=$PWD/data
docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.7 --model-id $model
```
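
Once the classifier is deployed, you can query its `predict` endpoint. The snippet below is a minimal sketch in Python, assuming the endpoint accepts the same `{"inputs": ...}` payload shape as `embed` and is reachable on `localhost:8080`:

```python
import requests

# Classify a sentence with the TEI /predict endpoint.
# The {"inputs": ...} payload shape mirrors the /embed examples above.
response = requests.post(
    "http://localhost:8080/predict",
    json={"inputs": "I like you. I love you"},
)
response.raise_for_status()
print(response.json())  # predicted emotion labels with their scores
```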

## Batching

You can send multiple inputs in a batch.
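For example, for embeddings, the `embed` endpoint accepts a JSON list under `inputs` instead of a single string. Here is a minimal sketch in Python, assuming the embedding container from the set up above is still running on `localhost:8080`:

```python
import requests

# Embed several inputs in one request by passing a list under "inputs".
response = requests.post(
    "http://localhost:8080/embed",
    json={"inputs": ["Today is a nice day", "I like you"]},
)
response.raise_for_status()
embeddings = response.json()
print(len(embeddings), len(embeddings[0]))  # number of inputs, embedding dimension
```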