ADD TGI docs (#43)

philschmid · eldarkurtic · web-flow · commit 328c0f5bc455 · 2024-09-09T12:17:02.000+02:00
Adds simple instructions, similar to the existing vLLM ones, for using TGI as the server to benchmark.

P.S. Great tool!

Co-authored-by: Eldar Kurtic <eldarkurtic314@gmail.com>
diff --git a/README.md b/README.md
@@ -48,15 +48,29 @@ For detailed installation instructions and requirements, see the [Installation G
 
 ### Quick Start
 
-#### 1. Start an OpenAI Compatible Server (vLLM)
+#### 1a. Start an OpenAI Compatible Server (vLLM)
 
 GuideLLM requires an OpenAI-compatible server to run evaluations. [vLLM](https://github.com/vllm-project/vllm) is recommended for this purpose. To start a vLLM server with a Llama 3.1 8B quantized model, run the following command:
 
 ```bash
 vllm serve "neuralmagic/Meta-Llama-3.1-8B-Instruct-quantized.w4a16"
 ```
 
-For more information on starting a vLLM server, see the [vLLM Documentation](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html).
+#### 1b. Start an OpenAI Compatible Server (Hugging Face TGI)
+
+GuideLLM requires an OpenAI-compatible server to run evaluations. [Text Generation Inference](https://github.com/huggingface/text-generation-inference) (TGI) can be used for this purpose. To start a TGI server with a Llama 3.1 8B model using Docker, run the following command:
+
+```bash
+docker run --gpus 1 -ti --shm-size 1g --ipc=host --rm -p 8080:80 \
+  -e MODEL_ID=llhf/Meta-Llama-3.1-8B-Instruct \
+  -e NUM_SHARD=1 \
+  -e MAX_INPUT_TOKENS=4096 \
+  -e MAX_TOTAL_TOKENS=6000 \
+  -e HF_TOKEN=$(cat ~/.cache/huggingface/token) \
+  ghcr.io/huggingface/text-generation-inference:2.2.0
+```
+
+For more information on starting a TGI server, see the [TGI Documentation](https://huggingface.co/docs/text-generation-inference/index).
 
 #### 2. Run a GuideLLM Evaluation