### Triton Server (tritonfrontend) Bindings

The `tritonfrontend` python package is a set of bindings to Triton's existing frontends implemented in C++. Currently, `tritonfrontend` supports starting up `KServeHttp` and `KServeGrpc` frontends. These bindings, used in combination with Triton's Python In-Process API ([`tritonserver`](https://github.com/triton-inference-server/core/tree/main/python/tritonserver)) and [`tritonclient`](https://github.com/triton-inference-server/client/tree/main/src/python/library), make Triton's full feature set available with just a few lines of Python.

Let us walk through a simple example:
1. First we need to load the desired models and start the server with `tritonserver`.
```python
import tritonserver

# Construct the path to the model repository
model_path = "server/src/python/examples/example_model_repository"

server_options = tritonserver.Options(
    server_id="ExampleServer",
    model_repository=model_path,
    log_error=True,
    log_warn=True,
    log_info=True,
)
server = tritonserver.Server(server_options).start(wait_until_ready=True)
```
Note: `model_path` may need to be edited depending on your setup.
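
At this point the in-process server is running, but no network endpoints are exposed yet. If you want to confirm the repository loaded correctly before wiring up any frontends, the in-process API can be queried directly. A minimal sketch, assuming the `tritonserver` package exposes `ready()` and `models()` as in its published examples:
```python
# Optional sanity check on the in-process server (no frontends involved).
# Assumes the tritonserver in-process API exposes ready() and models().
if not server.ready():
    raise RuntimeError("Triton in-process server failed to become ready")

# List the models loaded from the repository, e.g. the "identity" model
# used in the steps below.
for model in server.models():
    print("Loaded model:", model)
```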

2. Now, start up the respective frontend services with `tritonfrontend`:
```python
from tritonfrontend import KServeHttp, KServeGrpc

http_options = KServeHttp.Options(thread_count=5)
http_service = KServeHttp.Server(server, http_options)
http_service.start()

# Default options are used if none are provided.
grpc_service = KServeGrpc.Server(server)
grpc_service.start()
```
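
By default the KServe frontends listen on Triton's usual ports: 8000 for HTTP and 8001 for gRPC. As an optional check that the HTTP frontend is actually serving, you can hit the standard KServe v2 health endpoint with nothing but the Python standard library. A minimal sketch, assuming the default address and port:
```python
import urllib.request

# GET /v2/health/ready returns HTTP 200 once the server and its models
# are ready to accept inference requests.
with urllib.request.urlopen("http://localhost:8000/v2/health/ready") as resp:
    print("HTTP frontend ready:", resp.status == 200)
```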

3. Finally, with the services running, we can use `tritonclient` or simple `curl` commands to send requests to and receive responses from the frontends.

```python
import tritonclient.http as httpclient
import numpy as np  # tritonclient currently requires numpy < 2

model_name = "identity"  # output == input
url = "localhost:8000"

# Create a Triton client
client = httpclient.InferenceServerClient(url=url)

# Prepare input data
input_data = np.array([["Roger Roger"]], dtype=object)

# Create the input object
inputs = [httpclient.InferInput("INPUT0", input_data.shape, "BYTES")]

# Set the data for the input tensor
inputs[0].set_data_from_numpy(input_data)

# Perform inference
results = client.infer(model_name, inputs=inputs)

# Get the output data
output_data = results.as_numpy("OUTPUT0")

# Print results
print("[INFERENCE RESULTS]")
print("Output data:", output_data)

# Stop the respective services and the server.
http_service.stop()
grpc_service.stop()
server.stop()
```
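
Because the HTTP frontend speaks the standard KServe v2 protocol, the same request can also be sent without `tritonclient`, which is essentially what the `curl` route amounts to. A standard-library-only sketch of the equivalent raw request (send it while the services from step 2 are still running, i.e. before the `stop()` calls above):
```python
import json
import urllib.request

# KServe v2 inference request for the "identity" model: one BYTES tensor
# of shape [1, 1] whose contents are echoed back as OUTPUT0.
payload = {
    "inputs": [
        {
            "name": "INPUT0",
            "shape": [1, 1],
            "datatype": "BYTES",
            "data": ["Roger Roger"],
        }
    ]
}

request = urllib.request.Request(
    "http://localhost:8000/v2/models/identity/infer",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as resp:
    print(json.loads(resp.read()))
```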

---

Additionally, `tritonfrontend` provides context manager support, so steps 2-3 could also be achieved through:
```python
from tritonfrontend import KServeHttp
import tritonclient.http as httpclient
import numpy as np  # tritonclient currently requires numpy < 2

with KServeHttp.Server(server) as http_service:
    # The identity model returns an exact duplicate of the input data as output
    model_name = "identity"
    url = "localhost:8000"
    # Create a Triton client
    with httpclient.InferenceServerClient(url=url) as client:
        # Prepare input data
        input_data = np.array(["Roger Roger"], dtype=object)
        # Create the input object
        inputs = [httpclient.InferInput("INPUT0", input_data.shape, "BYTES")]
        # Set the data for the input tensor
        inputs[0].set_data_from_numpy(input_data)
        # Perform inference
        results = client.infer(model_name, inputs=inputs)
        # Get the output data
        output_data = results.as_numpy("OUTPUT0")
        # Print results
        print("[INFERENCE RESULTS]")
        print("Output data:", output_data)

server.stop()
```
With this workflow, the frontend service is stopped automatically when the `with` block exits, so you do not have to stop each service manually after client requests have finished.
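
If you want both frontends under the same scoped lifetime, `contextlib.ExitStack` from the standard library composes them without extra nesting. A sketch that assumes `KServeGrpc.Server` supports the context-manager protocol in the same way as the `KServeHttp.Server` shown above:
```python
import contextlib

from tritonfrontend import KServeHttp, KServeGrpc

with contextlib.ExitStack() as stack:
    # Both frontends are started on entry and stopped automatically,
    # in reverse order, when the block exits.
    http_service = stack.enter_context(KServeHttp.Server(server))
    grpc_service = stack.enter_context(KServeGrpc.Server(server))
    ...  # send requests with tritonclient or curl here

server.stop()
```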

## Known Issues
- The following features are not currently supported when launching the Triton frontend services through the python bindings:
  - [Tracing](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/trace.md)
  - [Shared Memory](https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_shared_memory.md)
  - [Metrics](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/metrics.md)
  - [Restricted Protocols](https://github.com/triton-inference-server/server/blob/main/docs/customization_guide/inference_protocols.md#limit-endpoint-access-beta)
  - VertexAI
  - Sagemaker
- If a client sends an inference request after the server has been stopped, a Segmentation Fault will occur; a defensive shutdown sketch follows this list.
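
A pattern that reduces the chance of hitting the last issue is to tie the shutdown order together with `try`/`finally`, so all client traffic finishes before the frontends go away and the frontends are stopped before the core server. This is a defensive sketch based on the shutdown order already used above, not a guaranteed fix, and `run_client_workload` is a hypothetical placeholder for your own client code:
```python
from tritonfrontend import KServeHttp, KServeGrpc

http_service = KServeHttp.Server(server)
grpc_service = KServeGrpc.Server(server)
http_service.start()
grpc_service.start()
try:
    run_client_workload()  # hypothetical helper: all client requests happen here
finally:
    # Stop the frontends first so no new requests can reach the server,
    # then stop the core server itself.
    http_service.stop()
    grpc_service.stop()
    server.stop()
```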