Commit 944f5c7

docs: initial version of the TEI docs for the hf.co/docs/ (#60)

1 parent 463329a · commit 944f5c7

13 files changed: +605 −0 lines changed
Lines changed: 17 additions & 0 deletions
name: Build documentation

on:
  push:
    branches:
      - main

jobs:
  build:
    uses: huggingface/doc-builder/.github/workflows/build_main_documentation.yml@main
    with:
      commit_sha: ${{ github.sha }}
      package: text-embeddings-inference
      languages: en
    secrets:
      token: ${{ secrets.HUGGINGFACE_PUSH }}
      hf_token: ${{ secrets.HF_DOC_BUILD_PUSH }}
Lines changed: 17 additions & 0 deletions
name: Build PR Documentation

on:
  pull_request:

concurrency:
  group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}
  cancel-in-progress: true

jobs:
  build:
    uses: huggingface/doc-builder/.github/workflows/build_pr_documentation.yml@main
    with:
      commit_sha: ${{ github.event.pull_request.head.sha }}
      pr_number: ${{ github.event.number }}
      package: text-embeddings-inference
      languages: en
Lines changed: 12 additions & 0 deletions
name: Delete doc comment trigger

on:
  pull_request:
    types: [ closed ]

jobs:
  delete:
    uses: huggingface/doc-builder/.github/workflows/delete_doc_comment_trigger.yml@main
    with:
      pr_number: ${{ github.event.number }}
Lines changed: 16 additions & 0 deletions
name: Upload PR Documentation

on:
  workflow_run:
    workflows: ["Build PR Documentation"]
    types:
      - completed

jobs:
  build:
    uses: huggingface/doc-builder/.github/workflows/upload_pr_documentation.yml@main
    with:
      package_name: text-embeddings-inference
    secrets:
      hf_token: ${{ secrets.HF_DOC_BUILD_PUSH }}
      comment_bot_token: ${{ secrets.COMMENT_BOT_TOKEN }}

docs/source/en/_toctree.yml

Lines changed: 24 additions & 0 deletions
- sections:
  - local: index
    title: Text Embeddings Inference
  - local: quick_tour
    title: Quick Tour
  - local: supported_models
    title: Supported models and hardware
  title: Getting started
- sections:
  - local: local_cpu
    title: Using TEI locally with CPU
  - local: local_gpu
    title: Using TEI locally with GPU
  - local: private_models
    title: Serving private and gated models
  # - local: tei_cli
  #   title: Using TEI CLI
  - local: custom_cpu_container
    title: Build custom container for TEI
  title: Tutorials
- sections:
  - local: cli_arguments
    title: CLI arguments
  title: Reference
docs/source/en/cli_arguments.md

Lines changed: 140 additions & 0 deletions
<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# CLI arguments

To see all options to serve your models, run the following:

```shell
text-embeddings-router --help
```

```
Usage: text-embeddings-router [OPTIONS]

Options:
      --model-id <MODEL_ID>
          The name of the model to load. Can be a MODEL_ID as listed on <https://hf.co/models> like `thenlper/gte-base`.
          Or it can be a local directory containing the necessary files as saved by `save_pretrained(...)` methods of
          transformers

          [env: MODEL_ID=]
          [default: thenlper/gte-base]

      --revision <REVISION>
          The actual revision of the model if you're referring to a model on the hub. You can use a specific commit id
          or a branch like `refs/pr/2`

          [env: REVISION=]

      --tokenization-workers <TOKENIZATION_WORKERS>
          Optionally control the number of tokenizer workers used for payload tokenization, validation and truncation.
          Defaults to the number of CPU cores on the machine

          [env: TOKENIZATION_WORKERS=]

      --dtype <DTYPE>
          The dtype to be forced upon the model

          If `dtype` is not set, it defaults to float32 on accelerate, and float16 for all other architectures

          [env: DTYPE=]
          [possible values: float16]

      --pooling <POOLING>
          Optionally control the pooling method.

          If `pooling` is not set, the pooling configuration will be parsed from the model `1_Pooling/config.json`
          configuration.

          If `pooling` is set, it will override the model pooling configuration

          [env: POOLING=]
          [possible values: cls, mean]

      --max-concurrent-requests <MAX_CONCURRENT_REQUESTS>
          The maximum amount of concurrent requests for this particular deployment.
          Having a low limit will refuse client requests instead of having them wait for too long and is usually good
          to handle backpressure correctly

          [env: MAX_CONCURRENT_REQUESTS=]
          [default: 512]

      --max-batch-tokens <MAX_BATCH_TOKENS>
          **IMPORTANT** This is one critical control to allow maximum usage of the available hardware.

          This represents the total amount of potential tokens within a batch.

          For `max_batch_tokens=1000`, you could fit `10` queries of `total_tokens=100` or a single query of `1000` tokens.

          Overall this number should be the largest possible until the model is compute bound. Since the actual memory
          overhead depends on the model implementation, text-embeddings-inference cannot infer this number automatically.

          [env: MAX_BATCH_TOKENS=]
          [default: 16384]

      --max-batch-requests <MAX_BATCH_REQUESTS>
          Optionally control the maximum number of individual requests in a batch

          [env: MAX_BATCH_REQUESTS=]

      --max-client-batch-size <MAX_CLIENT_BATCH_SIZE>
          Control the maximum number of inputs that a client can send in a single request

          [env: MAX_CLIENT_BATCH_SIZE=]
          [default: 32]

      --hf-api-token <HF_API_TOKEN>
          Your HuggingFace hub token

          [env: HF_API_TOKEN=]

      --hostname <HOSTNAME>
          The IP address to listen on

          [env: HOSTNAME=]
          [default: 0.0.0.0]

  -p, --port <PORT>
          The port to listen on

          [env: PORT=]
          [default: 3000]

      --uds-path <UDS_PATH>
          The name of the unix socket some text-embeddings-inference backends will use as they communicate internally
          with gRPC

          [env: UDS_PATH=]
          [default: /tmp/text-embeddings-inference-server]

      --huggingface-hub-cache <HUGGINGFACE_HUB_CACHE>
          The location of the huggingface hub cache. Used to override the location if you want to provide a mounted disk
          for instance

          [env: HUGGINGFACE_HUB_CACHE=/data]

      --json-output
          Outputs the logs in JSON format (useful for telemetry)

          [env: JSON_OUTPUT=]

      --otlp-endpoint <OTLP_ENDPOINT>
          [env: OTLP_ENDPOINT=]

      --cors-allow-origin <CORS_ALLOW_ORIGIN>
          [env: CORS_ALLOW_ORIGIN=]
```
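
For example, a typical launch combining several of these options could look like the following — a sketch only, where `BAAI/bge-large-en-v1.5` and the limits shown are placeholders to adapt to your model and hardware:

```shell
# Hypothetical example: serve a Hub model on port 8080, allow up to 64 inputs
# per client request, and emit JSON-formatted logs.
text-embeddings-router \
    --model-id BAAI/bge-large-en-v1.5 \
    --port 8080 \
    --max-client-batch-size 64 \
    --json-output

# Every option can also be set through the environment variable listed in its
# `[env: ...]` entry above, for example:
MODEL_ID=BAAI/bge-large-en-v1.5 PORT=8080 text-embeddings-router
```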

docs/source/en/custom_container.md

Lines changed: 43 additions & 0 deletions
<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# Build a custom container for TEI

You can build your own CPU or CUDA TEI container using Docker. To build a CPU container, run the following command in the
directory containing your custom Dockerfile:

```shell
docker build .
```

To build a CUDA container, it is essential to determine the compute capability (compute cap) of the GPU that will be
used at runtime. This information is crucial for the proper configuration of the CUDA containers. The following are
examples of runtime compute capabilities for various GPU types:

- Turing (T4, RTX 2000 series, ...) - `runtime_compute_cap=75`
- A100 - `runtime_compute_cap=80`
- A10 - `runtime_compute_cap=86`
- Ada Lovelace (RTX 4000 series, ...) - `runtime_compute_cap=89`
- H100 - `runtime_compute_cap=90`

Once you have determined the compute capability, set it as the `runtime_compute_cap` variable and build
the container as shown in the example below:

```shell
runtime_compute_cap=80

docker build . -f Dockerfile-cuda --build-arg CUDA_COMPUTE_CAP=$runtime_compute_cap
```
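
If you plan to run the image you just built, it can help to give it a tag at build time. Below is a sketch under the assumptions that the NVIDIA Container Toolkit is available and that the container serves on port 80 (adjust the mapping if your Dockerfile differs); the `tei-cuda` tag and the example model are arbitrary:

```shell
# Hypothetical follow-up: tag the CUDA image while building, then run it.
docker build . -f Dockerfile-cuda \
    --build-arg CUDA_COMPUTE_CAP=$runtime_compute_cap \
    -t tei-cuda

# Mount a local directory as the hub cache (/data) so model downloads persist
# across container restarts.
docker run --gpus all -p 8080:80 -v $PWD/data:/data \
    tei-cuda --model-id BAAI/bge-large-en-v1.5
```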

docs/source/en/index.md

Lines changed: 48 additions & 0 deletions
<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# Text Embeddings Inference

Text Embeddings Inference (TEI) is a comprehensive toolkit designed for efficient deployment and serving of open source
text embeddings models. It enables high-performance extraction for the most popular models, including FlagEmbedding, Ember, GTE, and E5.

TEI offers multiple features tailored to optimize the deployment process and enhance overall performance.

**Key Features:**

* **Streamlined Deployment:** TEI eliminates the need for a model graph compilation step for a more efficient deployment process.
* **Efficient Resource Utilization:** Benefit from small Docker images and rapid boot times, allowing for true serverless capabilities.
* **Dynamic Batching:** TEI incorporates token-based dynamic batching, thus optimizing resource utilization during inference.
* **Optimized Inference:** TEI leverages [Flash Attention](https://github.com/HazyResearch/flash-attention), [Candle](https://github.com/huggingface/candle), and [cuBLASLt](https://docs.nvidia.com/cuda/cublas/#using-the-cublaslt-api) by using optimized transformers code for inference.
* **Safetensors weight loading:** TEI loads [Safetensors](https://github.com/huggingface/safetensors) weights to enable tensor parallelism.
* **Production-Ready:** TEI supports distributed tracing through Open Telemetry and Prometheus metrics.

**Benchmarks**

Benchmark for [BAAI/bge-base-en-v1.5](https://hf.co/BAAI/bge-large-en-v1.5) on an NVIDIA A10 with a sequence length of 512 tokens:

<p>
  <img src="assets/bs1-lat.png" width="400" />
  <img src="assets/bs1-tp.png" width="400" />
</p>
<p>
  <img src="assets/bs32-lat.png" width="400" />
  <img src="assets/bs32-tp.png" width="400" />
</p>

**Getting Started:**

To start using TEI, check the [Quick Tour](quick_tour) guide.
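
For instance, the quickest way to try TEI is usually the prebuilt Docker image. The snippet below is a sketch only; the image tag to use for your hardware and the example model are placeholders, and the exact options are covered in the Quick Tour:

```shell
# Hypothetical quick start; replace the image tag with the one matching your
# hardware and TEI release.
model=BAAI/bge-large-en-v1.5

docker run --gpus all -p 8080:80 -v $PWD/data:/data \
    ghcr.io/huggingface/text-embeddings-inference:latest \
    --model-id $model
```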

docs/source/en/local_cpu.md

Lines changed: 67 additions & 0 deletions
<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# Using TEI locally with CPU

You can install `text-embeddings-inference` locally to run it on your own machine. Here are the step-by-step instructions for installation:

## Step 1: Install Rust

[Install Rust](https://rustup.rs/) on your machine by running the following in your terminal, then following the instructions:

```shell
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
```

## Step 2: Install necessary packages

Depending on your machine's architecture, run one of the following commands:

### For x86 Machines

```shell
cargo install --path router -F candle -F mkl
```

### For M1 or M2 Machines

```shell
cargo install --path router -F candle -F accelerate
```

## Step 3: Launch Text Embeddings Inference

Once the installation is complete, you can launch Text Embeddings Inference on CPU with the following command:

```shell
model=BAAI/bge-large-en-v1.5
revision=refs/pr/5

text-embeddings-router --model-id $model --revision $revision --port 8080
```

<Tip>

In some cases, you might also need the OpenSSL libraries and gcc installed. On Linux machines, run the following command:

```shell
sudo apt-get install libssl-dev gcc -y
```

</Tip>
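
Once the server reports it is ready, you can sanity-check it from another terminal. A minimal sketch, assuming the router is listening on port 8080 as launched above and exposes the `/embed` route:

```shell
# Smoke test: request embeddings for a single input (hypothetical example).
curl 127.0.0.1:8080/embed \
    -X POST \
    -d '{"inputs": "What is Deep Learning?"}' \
    -H 'Content-Type: application/json'
```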
Now you are ready to use `text-embeddings-inference` locally on your machine.
If you want to run TEI locally with a GPU, check out the [Using TEI locally with GPU](local_gpu) page.
