Commit b88c3e9

docs: Add support matrix for model parallelism in OpenAI Frontend (#7715)
1 parent 98edccd commit b88c3e9

File tree: 1 file changed, +41 −5 lines changed

python/openai/README.md

Lines changed: 41 additions & 5 deletions
@@ -70,7 +70,10 @@ pip install -r requirements.txt
 # NOTE: Adjust the --tokenizer based on the model being used
 python3 openai_frontend/main.py --model-repository tests/vllm_models --tokenizer meta-llama/Meta-Llama-3.1-8B-Instruct
 ```
-Once the server has successfully started, you should see something like this:
+
+<details>
+<summary>Example output</summary>
+
 ```
 ...
 +-----------------------+---------+--------+
@@ -87,6 +90,8 @@ INFO: Application startup complete.
 INFO: Uvicorn running on http://0.0.0.0:9000 (Press CTRL+C to quit) <- OpenAI Frontend Started Successfully
 ```

+</details>
+
 4. Send a `/v1/chat/completions` request:
 - Note the use of `jq` is optional, but provides a nicely formatted output for JSON responses.
 ```bash
@@ -96,7 +101,10 @@ curl -s http://localhost:9000/v1/chat/completions -H 'Content-Type: application/
   "messages": [{"role": "user", "content": "Say this is a test!"}]
 }' | jq
 ```
-which should provide output that looks like this:
+
+<details>
+<summary>Example output</summary>
+
 ```json
 {
   "id": "cmpl-6930b296-7ef8-11ef-bdd1-107c6149ca79",
@@ -122,6 +130,8 @@ which should provide output that looks like this:
 }
 ```

+</details>
+
 5. Send a `/v1/completions` request:
 - Note the use of `jq` is optional, but provides a nicely formatted output for JSON responses.
 ```bash
@@ -131,7 +141,10 @@ curl -s http://localhost:9000/v1/completions -H 'Content-Type: application/json'
   "prompt": "Machine learning is"
 }' | jq
 ```
-which should provide an output that looks like this:
+
+<details>
+<summary>Example output</summary>
+
 ```json
 {
   "id": "cmpl-d51df75c-7ef8-11ef-bdd1-107c6149ca79",
@@ -151,6 +164,8 @@ which should provide an output that looks like this:
 }
 ```

+</details>
+
 6. Benchmark with `genai-perf`:
 - To install genai-perf in this container, see the instructions [here](https://github.com/triton-inference-server/perf_analyzer/tree/main/genai-perf#install-perf-analyzer-ubuntu-python-38)
 - Or try using genai-perf from the [SDK container](https://github.com/triton-inference-server/perf_analyzer/tree/main/genai-perf#install-perf-analyzer-ubuntu-python-38)
@@ -166,7 +181,10 @@ genai-perf profile \
   --url localhost:9000 \
   --streaming
 ```
-which should provide an output that looks like:
+
+<details>
+<summary>Example output</summary>
+
 ```
 2024-10-14 22:43 [INFO] genai_perf.parser:82 - Profiling these models: llama-3.1-8b-instruct
 2024-10-14 22:43 [INFO] genai_perf.wrapper:163 - Running Perf Analyzer : 'perf_analyzer -m llama-3.1-8b-instruct --async --input-data artifacts/llama-3.1-8b-instruct-openai-chat-concurrency1/inputs.json -i http --concurrency-range 1 --endpoint v1/chat/completions --service-kind openai -u localhost:9000 --measurement-interval 10000 --stability-percentage 999 --profile-export-file artifacts/llama-3.1-8b-instruct-openai-chat-concurrency1/profile_export.json'
@@ -186,6 +204,8 @@ which should provide an output that looks like:
 2024-10-14 22:44 [INFO] genai_perf.export_data.csv_exporter:71 - Generating artifacts/llama-3.1-8b-instruct-openai-chat-concurrency1/profile_export_genai_perf.csv
 ```

+</details>
+
 7. Use the OpenAI python client directly:
 ```python
 from openai import OpenAI
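The hunk above only shows the first line of the README's OpenAI Python client example (step 7). As a rough sketch of how such a client is pointed at this frontend, assuming the server from step 3 is listening on `localhost:9000` and serves the model under the name `llama-3.1-8b-instruct` (the name used in the genai-perf example), and using a dummy API key since the curl examples above send no credentials:

```python
# Minimal sketch, not the README's full snippet; model name, port, and dummy key are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:9000/v1", api_key="EMPTY")  # dummy key; requests above are unauthenticated

completion = client.chat.completions.create(
    model="llama-3.1-8b-instruct",  # assumed to match the model name served by the frontend
    messages=[{"role": "user", "content": "Say this is a test!"}],
)
print(completion.choices[0].message.content)
```

The same client object can also be used for the `/v1/completions` flow via `client.completions.create(...)`.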
@@ -234,6 +254,7 @@ pytest -v tests/
 docker run -it --net=host --gpus all --rm \
   -v ${HOME}/.cache/huggingface:/root/.cache/huggingface \
   -e HF_TOKEN \
+  -e TRTLLM_ORCHESTRATOR=1 \
   nvcr.io/nvidia/tritonserver:24.08-trtllm-python-py3
 ```

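For illustration of what the added `-e TRTLLM_ORCHESTRATOR=1` line provides inside the container, here is a hedged sketch of launching the frontend with that variable set. It assumes the same entrypoint and flags as the vLLM example in the first hunk; the TensorRT-LLM model repository path is a placeholder, not taken from this commit:

```python
# Sketch only: propagate TRTLLM_ORCHESTRATOR=1 to the frontend process.
import os
import subprocess

env = os.environ.copy()
env["TRTLLM_ORCHESTRATOR"] = "1"  # same setting the -e flag above passes into the container

subprocess.run(
    [
        "python3", "openai_frontend/main.py",
        "--model-repository", "path/to/tensorrtllm_models",  # placeholder: your TRT-LLM model repository
        "--tokenizer", "meta-llama/Meta-Llama-3.1-8B-Instruct",
    ],
    env=env,
    check=True,
)
```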
@@ -265,7 +286,10 @@ curl -s http://localhost:9000/v1/chat/completions -H 'Content-Type: application/
   "messages": [{"role": "user", "content": "Say this is a test!"}]
 }' | jq
 ```
-which should provide an output that looks like this:
+
+<details>
+<summary>Example output</summary>
+
 ```json
 {
   "id": "cmpl-704c758c-8a84-11ef-b106-107c6149ca79",
@@ -290,6 +314,8 @@ which should provide an output that looks like this:
 }
 ```

+</details>
+
 The other examples should be the same as vLLM, except that you should set `MODEL="tensorrt_llm_bls"` or `MODEL="ensemble"`,
 everywhere applicable as seen in the example request above.

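To illustrate the sentence above about reusing the vLLM examples with `MODEL="tensorrt_llm_bls"` (or `MODEL="ensemble"`), a small sketch of the same chat request issued from Python; the payload and endpoint mirror the curl example, and the model name is whichever of the two your repository exposes:

```python
# Sketch of the curl example above, with the TRT-LLM model name substituted in.
import json
import requests

response = requests.post(
    "http://localhost:9000/v1/chat/completions",
    json={
        "model": "tensorrt_llm_bls",  # or "ensemble", depending on your model repository
        "messages": [{"role": "user", "content": "Say this is a test!"}],
    },
)
print(json.dumps(response.json(), indent=2))  # roughly what `| jq` shows for the curl version
```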
@@ -315,3 +341,13 @@ available arguments and default values.

 For more information on the `tritonfrontend` python bindings, see the docs
 [here](https://github.com/triton-inference-server/server/blob/main/docs/customization_guide/tritonfrontend.md).
+
+## Model Parallelism Support
+
+- [x] vLLM ([EngineArgs](https://github.com/triton-inference-server/vllm_backend/blob/main/README.md#using-the-vllm-backend))
+  - ex: Configure `tensor_parallel_size: 2` in the
+    [model.json](https://github.com/triton-inference-server/vllm_backend/blob/main/samples/model_repository/vllm_model/1/model.json)
+- [x] TensorRT-LLM ([Orchestrator Mode](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/README.md#orchestrator-mode))
+  - Set the following environment variable: `export TRTLLM_ORCHESTRATOR=1`
+- [ ] TensorRT-LLM ([Leader Mode](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/README.md#leader-mode))
+  - Not currently supported
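For the vLLM row of the matrix above, a hypothetical sketch of what the `tensor_parallel_size: 2` setting amounts to: `model.json` holds vLLM engine arguments as a JSON object, so adding that key shards the model across two GPUs. The file path and the `model` value below are illustrative placeholders, not taken from this commit:

```python
# Sketch: write a model.json with tensor parallelism enabled for the vLLM backend.
import json

engine_args = {
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",  # placeholder model id; match your repository
    "tensor_parallel_size": 2,                         # shard the model across 2 GPUs
}

with open("tests/vllm_models/llama-3.1-8b-instruct/1/model.json", "w") as f:  # assumed repository layout
    json.dump(engine_args, f, indent=2)
```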
