
Commit f6021f7

KrishnanPrash, GuanLuo, rmccorm4, and mc-nv authored
feat: Python Deployment of Triton Inference Server (#7501)
Co-authored-by: GuanLuo <41310872+GuanLuo@users.noreply.github.com>
Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>
Co-authored-by: Misha Chornyi <99709299+mc-nv@users.noreply.github.com>
1 parent 9ec820b commit f6021f7

37 files changed: +2315 −13 lines

Dockerfile.QA

Lines changed: 4 additions & 0 deletions
```diff
@@ -390,6 +390,10 @@ RUN rm -fr qa/L0_copyrights qa/L0_build_variants && \
 RUN find qa/pkgs/ -maxdepth 1 -type f -name \
     "tritonserver-*.whl" | xargs -I {} pip3 install --upgrade {}[all]
 
+# Install Triton Frontend Python API
+RUN find qa/pkgs/ -type f -name \
+    "tritonfrontend-*.whl" | xargs -I {} pip3 install --upgrade {}[all]
+
 ENV LD_LIBRARY_PATH /opt/tritonserver/qa/clients:${LD_LIBRARY_PATH}
 
 # DLIS-3631: Needed to run Perf Analyzer CI tests correctly
```
Lines changed: 113 additions & 0 deletions

### Triton Server (tritonfrontend) Bindings

The `tritonfrontend` python package is a set of bindings to Triton's existing frontends implemented in C++. Currently, `tritonfrontend` supports starting up `KServeHttp` and `KServeGrpc` frontends. Used in combination with Triton's Python in-process API ([`tritonserver`](https://github.com/triton-inference-server/core/tree/main/python/tritonserver)) and [`tritonclient`](https://github.com/triton-inference-server/client/tree/main/src/python/library), these bindings extend the ability to use Triton's full feature set with a few lines of Python.
Let us walk through a simple example:
1. First, we need to load the desired models and start the server with `tritonserver`.
```python
import tritonserver

# Constructing path to Model Repository
model_path = "server/src/python/examples/example_model_repository"

server_options = tritonserver.Options(
    server_id="ExampleServer",
    model_repository=model_path,
    log_error=True,
    log_warn=True,
    log_info=True,
)

server = tritonserver.Server(server_options).start(wait_until_ready=True)
```
Note: `model_path` may need to be edited depending on your setup.
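Before attaching any frontends, it can be worth confirming that the in-process server is ready and the expected models are loaded. A minimal sketch, assuming the `ready()` and `models()` accessors of the `tritonserver` in-process API:
```python
# Sanity check before starting frontends. `server` comes from step 1;
# ready() and models() are assumed accessors of the tritonserver
# in-process API.
assert server.ready(), "server did not reach a ready state"

for model in server.models():
    print("Loaded model:", model)
```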
2. Now, start up the respective frontend services with `tritonfrontend`:
```python
from tritonfrontend import KServeHttp, KServeGrpc

http_options = KServeHttp.Options(thread_count=5)
http_service = KServeHttp.Server(server, http_options)
http_service.start()

# Default options (if none provided)
grpc_service = KServeGrpc.Server(server)
grpc_service.start()
```
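Each service is configured through its `Options` object. As a hedged sketch of customization (the `address` and `port` field names are assumptions mirroring Triton's default bindings of HTTP on 8000 and gRPC on 8001; check `tritonfrontend`'s `Options` definitions for the exact schema), the HTTP frontend could be bound to a non-default port:
```python
# Hypothetical customization: the address/port field names are assumptions;
# consult tritonfrontend's Options definitions before relying on them.
custom_http_options = KServeHttp.Options(address="0.0.0.0", port=8005, thread_count=5)
custom_http_service = KServeHttp.Server(server, custom_http_options)
custom_http_service.start()
```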
3. Finally, with the services running, we can use `tritonclient` or simple `curl` commands to send requests to and receive responses from the frontends.

```python
import tritonclient.http as httpclient
import numpy as np  # Use numpy version < 2

model_name = "identity"  # output == input
url = "localhost:8000"

# Create a Triton client
client = httpclient.InferenceServerClient(url=url)

# Prepare input data
input_data = np.array([["Roger Roger"]], dtype=object)

# Create the input object
inputs = [httpclient.InferInput("INPUT0", input_data.shape, "BYTES")]

# Set the data for the input tensor
inputs[0].set_data_from_numpy(input_data)

# Perform inference
results = client.infer(model_name, inputs=inputs)

# Get the output data
output_data = results.as_numpy("OUTPUT0")

# Print results
print("[INFERENCE RESULTS]")
print("Output data:", output_data)

# Stop the respective services and the server
http_service.stop()
grpc_service.stop()
server.stop()
```
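Because the HTTP frontend serves the standard KServe v2 REST endpoints, plain HTTP requests (the `curl` route) work as well. A minimal sketch using only Python's standard library, assuming the `identity` model from above on the default port 8000:
```python
import json
import urllib.request

# Readiness probe: GET /v2/health/ready returns HTTP 200 once the server is ready
with urllib.request.urlopen("http://localhost:8000/v2/health/ready") as resp:
    print("Server ready:", resp.status == 200)

# Inference: POST /v2/models/<model_name>/infer with a KServe v2 JSON payload
payload = {
    "inputs": [
        {"name": "INPUT0", "shape": [1, 1], "datatype": "BYTES", "data": ["Roger Roger"]}
    ]
}
request = urllib.request.Request(
    "http://localhost:8000/v2/models/identity/infer",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as resp:
    print(json.loads(resp.read()))
```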
---

Additionally, `tritonfrontend` provides context manager support as well, so steps 2-3 could also be achieved through:
```python
from tritonfrontend import KServeHttp
import tritonclient.http as httpclient
import numpy as np  # Use numpy version < 2

with KServeHttp.Server(server) as http_service:
    # The identity model returns an exact duplicate of the input data as output
    model_name = "identity"
    url = "localhost:8000"
    # Create a Triton client
    with httpclient.InferenceServerClient(url=url) as client:
        # Prepare input data
        input_data = np.array(["Roger Roger"], dtype=object)
        # Create the input object
        inputs = [httpclient.InferInput("INPUT0", input_data.shape, "BYTES")]
        # Set the data for the input tensor
        inputs[0].set_data_from_numpy(input_data)
        # Perform inference
        results = client.infer(model_name, inputs=inputs)
        # Get the output data
        output_data = results.as_numpy("OUTPUT0")
        # Print results
        print("[INFERENCE RESULTS]")
        print("Output data:", output_data)

server.stop()
```
With this workflow, you can avoid having to stop each service after client requests have terminated.
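The same context manager pattern applies to the gRPC frontend together with `tritonclient`'s gRPC client. A sketch, assuming the gRPC service listens on Triton's default gRPC port 8001:
```python
from tritonfrontend import KServeGrpc
import tritonclient.grpc as grpcclient
import numpy as np  # Use numpy version < 2

with KServeGrpc.Server(server) as grpc_service:
    client = grpcclient.InferenceServerClient(url="localhost:8001")
    input_data = np.array([["Roger Roger"]], dtype=object)
    inputs = [grpcclient.InferInput("INPUT0", input_data.shape, "BYTES")]
    inputs[0].set_data_from_numpy(input_data)
    results = client.infer("identity", inputs=inputs)
    print("Output data:", results.as_numpy("OUTPUT0"))
    client.close()

server.stop()
```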
## Known Issues
- The following features are not currently supported when launching the Triton frontend services through the python bindings:
  - [Tracing](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/trace.md)
  - [Shared Memory](https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_shared_memory.md)
  - [Metrics](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/metrics.md)
  - [Restricted Protocols](https://github.com/triton-inference-server/server/blob/main/docs/customization_guide/inference_protocols.md#limit-endpoint-access-beta)
  - VertexAI
  - Sagemaker
- After a running server has been stopped, if a client sends an inference request, a Segmentation Fault will occur (a defensive pattern is sketched below).
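Until that segfault is addressed, one defensive pattern is to probe liveness before sending requests and to stop the frontend services before the server. A hedged sketch (the exact exception `tritonclient` raises once the endpoint is gone can vary, hence the broad handler):
```python
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Probe liveness first so no inference request is sent to a stopped server.
try:
    if client.is_server_live():
        print("Server is live; safe to send inference requests.")
    else:
        print("Server is not live; skipping request.")
except Exception as exc:  # e.g. connection refused once the server is gone
    print("Server unreachable; skipping request:", exc)
```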

qa/L0_python_api/test.sh

Lines changed: 10 additions & 1 deletion
```diff
@@ -1,5 +1,5 @@
 #!/bin/bash
-# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Copyright 2023-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -49,6 +49,15 @@ if [ $? -ne 0 ]; then
     RET=1
 fi
 
+
+FRONTEND_TEST_LOG="./python_kserve.log"
+python -m pytest --junitxml=test_kserve.xml test_kserve.py > $FRONTEND_TEST_LOG 2>&1
+if [ $? -ne 0 ]; then
+    cat $FRONTEND_TEST_LOG
+    echo -e "\n***\n*** Test Failed\n***"
+    RET=1
+fi
+
 set -e
 
 if [ $RET -eq 0 ]; then
```
