@@ -70,7 +70,10 @@ pip install -r requirements.txt
# NOTE: Adjust the --tokenizer based on the model being used
python3 openai_frontend/main.py --model-repository tests/vllm_models --tokenizer meta-llama/Meta-Llama-3.1-8B-Instruct
```
- Once the server has successfully started, you should see something like this:
+
+ <details>
+ <summary>Example output</summary>
+
```
...
+-----------------------+---------+--------+
@@ -87,6 +90,8 @@ INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:9000 (Press CTRL+C to quit) <- OpenAI Frontend Started Successfully
```

+ </details>
+
4. Send a `/v1/chat/completions` request:
- Note that the use of `jq` is optional, but it provides nicely formatted output for JSON responses.
```bash
@@ -96,7 +101,10 @@ curl -s http://localhost:9000/v1/chat/completions -H 'Content-Type: application/
"messages": [{"role": "user", "content": "Say this is a test!"}]
}' | jq
```
- which should provide output that looks like this:
+
+ <details>
+ <summary>Example output</summary>
+
```json
{
"id": "cmpl-6930b296-7ef8-11ef-bdd1-107c6149ca79",
@@ -122,6 +130,8 @@ which should provide output that looks like this:
}
```

+ </details>
+
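The same endpoint can also stream tokens back incrementally. Below is a minimal sketch, assuming the frontend honors the standard OpenAI `"stream": true` field (streaming against this endpoint is exercised later by the `genai-perf --streaming` example); the response arrives as a series of `data:` chunks rather than a single JSON object:

```bash
# Hedged sketch: assumes the standard OpenAI "stream" field is accepted.
# -N disables curl's output buffering so chunks are printed as they arrive.
MODEL="llama-3.1-8b-instruct"
curl -s -N http://localhost:9000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "'${MODEL}'",
    "messages": [{"role": "user", "content": "Say this is a test!"}],
    "stream": true
  }'
```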
5. Send a `/v1/completions` request:
- Note that the use of `jq` is optional, but it provides nicely formatted output for JSON responses.
```bash
@@ -131,7 +141,10 @@ curl -s http://localhost:9000/v1/completions -H 'Content-Type: application/json'
"prompt": "Machine learning is"
}' | jq
```
- which should provide an output that looks like this:
+
+ <details>
+ <summary>Example output</summary>
+
```json
{
"id": "cmpl-d51df75c-7ef8-11ef-bdd1-107c6149ca79",
@@ -151,6 +164,8 @@ which should provide an output that looks like this:
}
```

+ </details>
+
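Additional OpenAI-style sampling fields can be included in the same request body. The following is a hedged sketch, assuming commonly used fields such as `max_tokens` and `temperature` are accepted by the frontend:

```bash
# Hedged sketch: "max_tokens" and "temperature" are assumed to be supported
# OpenAI-compatible sampling parameters; drop them if the server rejects them.
MODEL="llama-3.1-8b-instruct"
curl -s http://localhost:9000/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "'${MODEL}'",
    "prompt": "Machine learning is",
    "max_tokens": 64,
    "temperature": 0.7
  }' | jq
```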
6. Benchmark with `genai-perf`:
- To install genai-perf in this container, see the instructions [here](https://github.com/triton-inference-server/perf_analyzer/tree/main/genai-perf#install-perf-analyzer-ubuntu-python-38)
- Or try using genai-perf from the [SDK container](https://github.com/triton-inference-server/perf_analyzer/tree/main/genai-perf#install-perf-analyzer-ubuntu-python-38)
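As a rough sketch of the SDK-container route: the image tag below is an assumption and should be matched to your Triton release, and genai-perf is assumed to be preinstalled in that image:

```bash
# Hedged sketch: assumes the 24.08 SDK image with genai-perf preinstalled.
# --net=host lets the container reach the frontend running on localhost:9000.
docker run -it --net=host --rm \
  nvcr.io/nvidia/tritonserver:24.08-py3-sdk \
  genai-perf --help
```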
@@ -166,7 +181,10 @@ genai-perf profile \
--url localhost:9000 \
--streaming
```
- which should provide an output that looks like:
+
+ <details>
+ <summary>Example output</summary>
+
```
2024-10-14 22:43 [INFO] genai_perf.parser:82 - Profiling these models: llama-3.1-8b-instruct
2024-10-14 22:43 [INFO] genai_perf.wrapper:163 - Running Perf Analyzer : 'perf_analyzer -m llama-3.1-8b-instruct --async --input-data artifacts/llama-3.1-8b-instruct-openai-chat-concurrency1/inputs.json -i http --concurrency-range 1 --endpoint v1/chat/completions --service-kind openai -u localhost:9000 --measurement-interval 10000 --stability-percentage 999 --profile-export-file artifacts/llama-3.1-8b-instruct-openai-chat-concurrency1/profile_export.json'
@@ -186,6 +204,8 @@ which should provide an output that looks like:
2024-10-14 22:44 [INFO] genai_perf.export_data.csv_exporter:71 - Generating artifacts/llama-3.1-8b-instruct-openai-chat-concurrency1/profile_export_genai_perf.csv
```

+ </details>
+
7. Use the OpenAI python client directly:
```python
from openai import OpenAI
@@ -234,6 +254,7 @@ pytest -v tests/
docker run -it --net=host --gpus all --rm \
-v ${HOME}/.cache/huggingface:/root/.cache/huggingface \
-e HF_TOKEN \
+ -e TRTLLM_ORCHESTRATOR=1 \
nvcr.io/nvidia/tritonserver:24.08-trtllm-python-py3
```

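The added `-e TRTLLM_ORCHESTRATOR=1` enables TensorRT-LLM's orchestrator mode (see the Model Parallelism Support section below). Equivalently, the variable can be exported inside an already-running container before launching the frontend; a sketch, with a placeholder model repository path:

```bash
# Hedged sketch: the model repository path is a placeholder for wherever your
# TensorRT-LLM models live; the tokenizer should match the served model.
export TRTLLM_ORCHESTRATOR=1
python3 openai_frontend/main.py \
  --model-repository path/to/tensorrtllm_models \
  --tokenizer meta-llama/Meta-Llama-3.1-8B-Instruct
```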
@@ -265,7 +286,10 @@ curl -s http://localhost:9000/v1/chat/completions -H 'Content-Type: application/
"messages": [{"role": "user", "content": "Say this is a test!"}]
}' | jq
```
- which should provide an output that looks like this:
+
+ <details>
+ <summary>Example output</summary>
+
```json
{
"id": "cmpl-704c758c-8a84-11ef-b106-107c6149ca79",
@@ -290,6 +314,8 @@ which should provide an output that looks like this:
}
```

+ </details>
+
The other examples are the same as for vLLM, except that you should set `MODEL="tensorrt_llm_bls"` or `MODEL="ensemble"`
wherever applicable, as shown in the example request above.

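For instance, the `/v1/completions` request from earlier needs only the model name swapped; a minimal sketch:

```bash
# Same request as the earlier /v1/completions example, pointed at the
# TensorRT-LLM model instead of the vLLM one.
MODEL="tensorrt_llm_bls"   # or MODEL="ensemble"
curl -s http://localhost:9000/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "'${MODEL}'",
    "prompt": "Machine learning is"
  }' | jq
```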
@@ -315,3 +341,13 @@ available arguments and default values.

For more information on the `tritonfrontend` python bindings, see the docs
[here](https://github.com/triton-inference-server/server/blob/main/docs/customization_guide/tritonfrontend.md).
+
+ ## Model Parallelism Support
+
+ - [x] vLLM ([EngineArgs](https://github.com/triton-inference-server/vllm_backend/blob/main/README.md#using-the-vllm-backend))
+   - ex: Configure `tensor_parallel_size: 2` in the
+     [model.json](https://github.com/triton-inference-server/vllm_backend/blob/main/samples/model_repository/vllm_model/1/model.json), as in the sketch below
+ - [x] TensorRT-LLM ([Orchestrator Mode](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/README.md#orchestrator-mode))
+   - Set the following environment variable: `export TRTLLM_ORCHESTRATOR=1`
+ - [ ] TensorRT-LLM ([Leader Mode](https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/README.md#leader-mode))
+   - Not currently supported
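To make the vLLM case concrete, here is a minimal sketch of a `model.json` enabling 2-way tensor parallelism. The target path and the `model` value are placeholders for your own repository layout, and the field names follow vLLM's `EngineArgs`:

```bash
# Hedged sketch: writes a minimal model.json with 2-way tensor parallelism.
# The path and the "model" value are placeholders; adapt them to your repository.
cat > tests/vllm_models/llama-3.1-8b-instruct/1/model.json <<'EOF'
{
  "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
  "tensor_parallel_size": 2
}
EOF
```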