Commit 944f5c7

docs: initial version of the TEI docs for the hf.co/docs/ (#60)

1 parent 463329a · commit 944f5c7

13 files changed: +605 −0 lines changed
Lines changed: 17 additions & 0 deletions
name: Build documentation

on:
  push:
    branches:
      - main

jobs:
  build:
    uses: huggingface/doc-builder/.github/workflows/build_main_documentation.yml@main
    with:
      commit_sha: ${{ github.sha }}
      package: text-embeddings-inference
      languages: en
    secrets:
      token: ${{ secrets.HUGGINGFACE_PUSH }}
      hf_token: ${{ secrets.HF_DOC_BUILD_PUSH }}
Lines changed: 17 additions & 0 deletions
name: Build PR Documentation

on:
  pull_request:

concurrency:
  group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}
  cancel-in-progress: true

jobs:
  build:
    uses: huggingface/doc-builder/.github/workflows/build_pr_documentation.yml@main
    with:
      commit_sha: ${{ github.event.pull_request.head.sha }}
      pr_number: ${{ github.event.number }}
      package: text-embeddings-inference
      languages: en
Lines changed: 12 additions & 0 deletions
name: Delete doc comment trigger

on:
  pull_request:
    types: [ closed ]

jobs:
  delete:
    uses: huggingface/doc-builder/.github/workflows/delete_doc_comment_trigger.yml@main
    with:
      pr_number: ${{ github.event.number }}
Lines changed: 16 additions & 0 deletions
name: Upload PR Documentation

on:
  workflow_run:
    workflows: ["Build PR Documentation"]
    types:
      - completed

jobs:
  build:
    uses: huggingface/doc-builder/.github/workflows/upload_pr_documentation.yml@main
    with:
      package_name: text-embeddings-inference
    secrets:
      hf_token: ${{ secrets.HF_DOC_BUILD_PUSH }}
      comment_bot_token: ${{ secrets.COMMENT_BOT_TOKEN }}

docs/source/en/_toctree.yml

Lines changed: 24 additions & 0 deletions
- sections:
  - local: index
    title: Text Embeddings Inference
  - local: quick_tour
    title: Quick Tour
  - local: supported_models
    title: Supported models and hardware
  title: Getting started
- sections:
  - local: local_cpu
    title: Using TEI locally with CPU
  - local: local_gpu
    title: Using TEI locally with GPU
  - local: private_models
    title: Serving private and gated models
  # - local: tei_cli
  #   title: Using TEI CLI
  - local: custom_cpu_container
    title: Build custom container for TEI
  title: Tutorials
- sections:
  - local: cli_arguments
    title: CLI arguments
  title: Reference
docs/source/en/cli_arguments.md

Lines changed: 140 additions & 0 deletions
<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# CLI arguments

To see all options to serve your models, run the following:

```shell
text-embeddings-router --help
```

```
Usage: text-embeddings-router [OPTIONS]

Options:
      --model-id <MODEL_ID>
          The name of the model to load. Can be a MODEL_ID as listed on <https://hf.co/models> like `thenlper/gte-base`.
          Or it can be a local directory containing the necessary files as saved by `save_pretrained(...)` methods of
          transformers

          [env: MODEL_ID=]
          [default: thenlper/gte-base]

      --revision <REVISION>
          The actual revision of the model if you're referring to a model on the hub. You can use a specific commit id
          or a branch like `refs/pr/2`

          [env: REVISION=]

      --tokenization-workers <TOKENIZATION_WORKERS>
          Optionally control the number of tokenizer workers used for payload tokenization, validation and truncation.
          Defaults to the number of CPU cores on the machine

          [env: TOKENIZATION_WORKERS=]

      --dtype <DTYPE>
          The dtype to be forced upon the model

          If `dtype` is not set, it defaults to float32 on accelerate, and float16 for all other architectures

          [env: DTYPE=]
          [possible values: float16]

      --pooling <POOLING>
          Optionally control the pooling method.

          If `pooling` is not set, the pooling configuration will be parsed from the model `1_Pooling/config.json`
          configuration.

          If `pooling` is set, it will override the model pooling configuration

          [env: POOLING=]
          [possible values: cls, mean]

      --max-concurrent-requests <MAX_CONCURRENT_REQUESTS>
          The maximum amount of concurrent requests for this particular deployment.
          Having a low limit will refuse client requests instead of having them wait for too long and is usually good
          to handle backpressure correctly

          [env: MAX_CONCURRENT_REQUESTS=]
          [default: 512]

      --max-batch-tokens <MAX_BATCH_TOKENS>
          **IMPORTANT** This is one critical control to allow maximum usage of the available hardware.

          This represents the total amount of potential tokens within a batch.

          For `max_batch_tokens=1000`, you could fit `10` queries of `total_tokens=100` or a single query of `1000` tokens.

          Overall this number should be the largest possible until the model is compute bound. Since the actual memory
          overhead depends on the model implementation, text-embeddings-inference cannot infer this number automatically.

          [env: MAX_BATCH_TOKENS=]
          [default: 16384]

      --max-batch-requests <MAX_BATCH_REQUESTS>
          Optionally control the maximum number of individual requests in a batch

          [env: MAX_BATCH_REQUESTS=]

      --max-client-batch-size <MAX_CLIENT_BATCH_SIZE>
          Control the maximum number of inputs that a client can send in a single request

          [env: MAX_CLIENT_BATCH_SIZE=]
          [default: 32]

      --hf-api-token <HF_API_TOKEN>
          Your HuggingFace hub token

          [env: HF_API_TOKEN=]

      --hostname <HOSTNAME>
          The IP address to listen on

          [env: HOSTNAME=]
          [default: 0.0.0.0]

  -p, --port <PORT>
          The port to listen on

          [env: PORT=]
          [default: 3000]

      --uds-path <UDS_PATH>
          The name of the unix socket some text-embeddings-inference backends will use as they communicate internally
          with gRPC

          [env: UDS_PATH=]
          [default: /tmp/text-embeddings-inference-server]

      --huggingface-hub-cache <HUGGINGFACE_HUB_CACHE>
          The location of the huggingface hub cache. Used to override the location if you want to provide a mounted disk
          for instance

          [env: HUGGINGFACE_HUB_CACHE=/data]

      --json-output
          Outputs the logs in JSON format (useful for telemetry)

          [env: JSON_OUTPUT=]

      --otlp-endpoint <OTLP_ENDPOINT>
          [env: OTLP_ENDPOINT=]

      --cors-allow-origin <CORS_ALLOW_ORIGIN>
          [env: CORS_ALLOW_ORIGIN=]
```
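
For example, a typical launch combining several of these options could look like the following — a sketch only, where `BAAI/bge-large-en-v1.5` and the limits shown are placeholders to adapt to your model and hardware:

```shell
# Hypothetical example: serve a Hub model on port 8080, allow up to 64 inputs
# per client request, and emit JSON-formatted logs.
text-embeddings-router \
    --model-id BAAI/bge-large-en-v1.5 \
    --port 8080 \
    --max-client-batch-size 64 \
    --json-output

# Every option can also be set through the environment variable listed in its
# `[env: ...]` entry above, for example:
MODEL_ID=BAAI/bge-large-en-v1.5 PORT=8080 text-embeddings-router
```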

docs/source/en/custom_container.md

Lines changed: 43 additions & 0 deletions
<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# Build a custom container for TEI

You can build your own CPU or CUDA TEI container using Docker. To build a CPU container, run the following command in the
directory containing your custom Dockerfile:

```shell
docker build .
```

To build a CUDA container, it is essential to determine the compute capability (compute cap) of the GPU that will be
used at runtime. This information is crucial for the proper configuration of the CUDA containers. The following are
examples of runtime compute capabilities for various GPU types:

- Turing (T4, RTX 2000 series, ...) - `runtime_compute_cap=75`
- A100 - `runtime_compute_cap=80`
- A10 - `runtime_compute_cap=86`
- Ada Lovelace (RTX 4000 series, ...) - `runtime_compute_cap=89`
- H100 - `runtime_compute_cap=90`

Once you have determined the compute capability, set it as the `runtime_compute_cap` variable and build
the container as shown in the example below:

```shell
runtime_compute_cap=80

docker build . -f Dockerfile-cuda --build-arg CUDA_COMPUTE_CAP=$runtime_compute_cap
```
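
If you plan to run the image you just built, it can help to give it a tag at build time. Below is a sketch under the assumptions that the NVIDIA Container Toolkit is available and that the container serves on port 80 (adjust the mapping if your Dockerfile differs); the `tei-cuda` tag and the example model are arbitrary:

```shell
# Hypothetical follow-up: tag the CUDA image while building, then run it.
docker build . -f Dockerfile-cuda \
    --build-arg CUDA_COMPUTE_CAP=$runtime_compute_cap \
    -t tei-cuda

# Mount a local directory as the hub cache (/data) so model downloads persist
# across container restarts.
docker run --gpus all -p 8080:80 -v $PWD/data:/data \
    tei-cuda --model-id BAAI/bge-large-en-v1.5
```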

docs/source/en/index.md

Lines changed: 48 additions & 0 deletions
<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# Text Embeddings Inference

Text Embeddings Inference (TEI) is a comprehensive toolkit designed for efficient deployment and serving of open source
text embeddings models. It enables high-performance extraction for the most popular models, including FlagEmbedding, Ember, GTE, and E5.

TEI offers multiple features tailored to optimize the deployment process and enhance overall performance.

**Key Features:**

* **Streamlined Deployment:** TEI eliminates the need for a model graph compilation step for a more efficient deployment process.
* **Efficient Resource Utilization:** Benefit from small Docker images and rapid boot times, allowing for true serverless capabilities.
* **Dynamic Batching:** TEI incorporates token-based dynamic batching, thus optimizing resource utilization during inference.
* **Optimized Inference:** TEI leverages [Flash Attention](https://github.com/HazyResearch/flash-attention), [Candle](https://github.com/huggingface/candle), and [cuBLASLt](https://docs.nvidia.com/cuda/cublas/#using-the-cublaslt-api) by using optimized transformers code for inference.
* **Safetensors weight loading:** TEI loads [Safetensors](https://github.com/huggingface/safetensors) weights to enable tensor parallelism.
* **Production-Ready:** TEI supports distributed tracing through Open Telemetry and Prometheus metrics.

**Benchmarks**

Benchmark for [BAAI/bge-base-en-v1.5](https://hf.co/BAAI/bge-large-en-v1.5) on an NVIDIA A10 with a sequence length of 512 tokens:

<p>
  <img src="assets/bs1-lat.png" width="400" />
  <img src="assets/bs1-tp.png" width="400" />
</p>
<p>
  <img src="assets/bs32-lat.png" width="400" />
  <img src="assets/bs32-tp.png" width="400" />
</p>

**Getting Started:**

To start using TEI, check the [Quick Tour](quick_tour) guide.
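
For instance, the quickest way to try TEI is usually the prebuilt Docker image. The snippet below is a sketch only; the image tag to use for your hardware and the example model are placeholders, and the exact options are covered in the Quick Tour:

```shell
# Hypothetical quick start; replace the image tag with the one matching your
# hardware and TEI release.
model=BAAI/bge-large-en-v1.5

docker run --gpus all -p 8080:80 -v $PWD/data:/data \
    ghcr.io/huggingface/text-embeddings-inference:latest \
    --model-id $model
```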

docs/source/en/local_cpu.md

Lines changed: 67 additions & 0 deletions
<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# Using TEI locally with CPU

You can install `text-embeddings-inference` locally to run it on your own machine. Here are the step-by-step instructions for installation:

## Step 1: Install Rust

[Install Rust](https://rustup.rs/) on your machine by running the following in your terminal, then following the instructions:

```shell
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
```

## Step 2: Install necessary packages

Depending on your machine's architecture, run one of the following commands:

### For x86 Machines

```shell
cargo install --path router -F candle -F mkl
```

### For M1 or M2 Machines

```shell
cargo install --path router -F candle -F accelerate
```

## Step 3: Launch Text Embeddings Inference

Once the installation is complete, you can launch Text Embeddings Inference on CPU with the following command:

```shell
model=BAAI/bge-large-en-v1.5
revision=refs/pr/5

text-embeddings-router --model-id $model --revision $revision --port 8080
```

<Tip>

In some cases, you might also need the OpenSSL libraries and gcc installed. On Linux machines, run the following command:

```shell
sudo apt-get install libssl-dev gcc -y
```

</Tip>
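
Once the server reports it is ready, you can sanity-check it from another terminal. A minimal sketch, assuming the router is listening on port 8080 as launched above and exposes the `/embed` route:

```shell
# Smoke test: request embeddings for a single input (hypothetical example).
curl 127.0.0.1:8080/embed \
    -X POST \
    -d '{"inputs": "What is Deep Learning?"}' \
    -H 'Content-Type: application/json'
```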
Now you are ready to use `text-embeddings-inference` locally on your machine.
If you want to run TEI locally with a GPU, check out the [Using TEI locally with GPU](local_gpu) page.
