<!--
-# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
+# Copyright (c) 2024-2025, NVIDIA CORPORATION. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
...
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-->
-# [Triton Inference Server Release 25.01](https://docs.nvidia.com/deeplearning/triton-inference-server/release-notes/rel-25-01.html#rel-25-01)
+# [Triton Inference Server Release 25.02](https://docs.nvidia.com/deeplearning/triton-inference-server/release-notes/rel-25-02.html#rel-25-02)

-The Triton Inference Server container image, release 25.01, is available
+The Triton Inference Server container image, release 25.02, is available
on [NGC](https://ngc.nvidia.com/catalog/containers/nvidia:tritonserver) and
is open source
-on [GitHub](https://github.com/triton-inference-server/server).
-
-## Contents of the Triton Inference Server container
-
-The [Triton Inference
-Server](https://github.com/triton-inference-server/server) Docker image
-contains the inference server executable and related shared libraries
-in /opt/tritonserver.
-
-For a complete list of what the container includes, refer to [Deep
-Learning Frameworks Support
-Matrix](https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html).
-
-The container also includes the following:
-
-- [Ubuntu 24.04](https://releases.ubuntu.com/24.04/) including [Python
-  3.12](https://www.python.org/downloads/release/python-3120/)
-
-- [NVIDIA CUDA
-  12.8.0.038](https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html)
-
-- [NVIDIA cuBLAS
-  12.8.3.14](https://docs.nvidia.com/cuda/cublas/index.html)
-
-- [cuDNN
-  9.7.0.66](https://docs.nvidia.com/deeplearning/cudnn/release-notes/)
-
-- [NVIDIA NCCL
-  2.25.1](https://docs.nvidia.com/deeplearning/nccl/release-notes/) (optimized
-  for [NVIDIA NVLink](http://www.nvidia.com/object/nvlink.html)®)
-
-- [NVIDIA TensorRT™
-  10.8.0.43](https://docs.nvidia.com/deeplearning/tensorrt/release-notes/index.html)
-
-- OpenUCX 1.15.0
-
-- GDRCopy 2.4.1
-
-- NVIDIA HPC-X 2.21
-
-- OpenMPI 4.1.7
-
-- [FIL](https://github.com/triton-inference-server/fil_backend)
-
-- [NVIDIA DALI®
-  1.45](https://docs.nvidia.com/deeplearning/dali/release-notes/index.html)
-
-- [nvImageCodec
-  0.2.0.7](https://docs.nvidia.com/cuda/nvimagecodec/release_notes_v0.2.0.html)
-
-- ONNX Runtime 1.20.1
-
-- Intel [OpenVINO](https://github.com/openvinotoolkit/openvino/tree/2022.1.0) 2024.05.0
-
-- DCGM 3.3.6
-
-- [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM/) version [release/0.17.0](https://github.com/NVIDIA/TensorRT-LLM/tree/v0.17.0)
-
-- [vLLM](https://github.com/vllm-project/vllm) version 0.6.3.1
-
-## Driver Requirements
-
-Release 25.01 is based on [CUDA
-12.8.0.038](https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html) which
-requires [NVIDIA
-Driver](http://www.nvidia.com/Download/index.aspx?lang=en-us) release
-560 or later. However, if you are running on a data center GPU (for
-example, T4 or any other data center GPU), you can use NVIDIA driver
-release 470.57 (or later R470), 525.85 (or later R525), 535.86 (or later
-R535), or 545.23 (or later R545).
-
-The CUDA driver's compatibility package only supports particular
-drivers. Thus, users should upgrade from all R418, R440, R450, R460,
-R510, R520, R530, R545 and R555 drivers, which are not
-forward-compatible with CUDA 12.6. For a complete list of supported
-drivers, see the [CUDA Application
-Compatibility](https://docs.nvidia.com/deploy/cuda-compatibility/index.html#use-the-right-compat-package) topic.
-For more information, see [CUDA Compatibility and
-Upgrades](https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#cuda-compatibility-and-upgrades).
-
-## GPU Requirements
-
-Release 25.01 supports CUDA compute capability 7.5 and later. This
-corresponds to GPUs in the NVIDIA Turing™, NVIDIA Ampere architecture,
-NVIDIA Hopper™, NVIDIA Ada Lovelace, and NVIDIA Blackwell architecture
-families. For a list of GPUs to which this compute capability
-corresponds, see [CUDA GPUs](https://developer.nvidia.com/cuda-gpus).
-For additional support details, see [Deep Learning Frameworks Support
-Matrix](https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html).
-
-## Key Features and Enhancements
-
-This Inference Server release includes the following key features and
-enhancements.
-
-- Starting with the 25.01 release, Triton Inference Server supports
-  Blackwell GPU architectures.
-
-- Fixed a bug when passing the correlation ID of string type to
-  python_backend. Added datatype checks to correlation ID values.
-
-- vLLM backend can now take advantage of the [vLLM
-  v0.6](https://blog.vllm.ai/2024/09/05/perf-update.html) performance
-  improvement by communicating with the vLLM engine via ZMQ.
-
-- GenAI-Perf now provides the exact input sequence length requested
-  for synthetic text generation.
-
-- GenAI-Perf supports the creation of a prefix pool to emulate system
-  prompts via --num-system-prompts and --system-prompt-length.
-
-- GenAI-Perf improved error visibility by returning more detailed
-  errors when OpenAI frontends return an error or metric generation
-  fails.
-
-- GenAI-Perf reports time-to-second-token and request count in its
-  metrics.
-
-- GenAI-Perf allows the use of a custom tokenizer in its "compare"
-  subcommand for comparing multiple profiles.
-
-- GenAI-Perf natively supports --request-count for sending a specific
-  number of requests and --header for sending a list of headers with
-  every request.
-
-- Model Analyzer functionality has been migrated to GenAI-Perf via the
-  "analyze" subcommand, enabling the tool to sweep and find the
-  optimal model configuration.
-
-- A bytes-appending bug was fixed in GenAI-Perf, resulting in more
-  accurate output sequence lengths for Triton.
-
-
-## Known Issues
-
-- A segmentation fault related to DCGM and NSCQ may be encountered
-  during server shutdown on NVSwitch systems. A possible workaround
-  for this issue is to disable the collection of GPU metrics:
-  `tritonserver --allow-gpu-metrics false ...`
-
-- vLLM backend currently does not take advantage of the [vLLM
-  v0.6](https://blog.vllm.ai/2024/09/05/perf-update.html) performance
-  improvement when metrics are enabled.
-
-- Please note that the vLLM version provided in the 25.01 container is
-  0.6.3.post1. Due to some issues with vLLM library versioning,
-  `vllm.__version__` displays `0.6.3`.
-
-- Incorrect results are known to occur when using the TensorRT (TRT)
-  backend for inference using the int8 data type for I/O on the Blackwell
-  GPU architecture.
-
-- When running Torch-TRT models, the output may differ from running
-  the same model on a previous release.
-
-- When using TensorRT models, if auto-complete configuration is
-  disabled and `is_non_linear_format_io:true` for reformat-free tensors
-  is not provided in the model configuration, the model may not load
-  successfully.
-
-- When using Python models in [decoupled
-  mode](https://github.com/triton-inference-server/python_backend/tree/main?tab=readme-ov-file#decoupled-mode),
-  users need to ensure that the ResponseSender goes out of scope or is
-  properly cleaned up before unloading the model to guarantee that the
-  unloading process executes correctly.
-
-- Restart support was temporarily removed for Python models.
-
-- The Triton Inference Server with vLLM backend currently does not
-  support running vLLM models with tensor parallelism sizes greater
-  than 1 and the default "distributed_executor_backend" setting when
-  using explicit model control mode. In an attempt to load a vLLM model
-  (tp > 1) in explicit mode, users could potentially see a failure at
-  the `initialize` step: `could not acquire lock for
-  <_io.BufferedWriter name='<stdout>'> at interpreter shutdown,
-  possibly due to daemon threads`. For the default model control
-  mode, after server shutdown, vLLM-related sub-processes are not
-  killed. Related vLLM
-  issue: <https://github.com/vllm-project/vllm/issues/6766>. Please
-  specify `distributed_executor_backend:ray` in the model.json when
-  deploying vLLM models with tensor parallelism > 1.
-
-- When loading models with file override, multiple model configuration
-  files are not supported. Users must provide the model configuration
-  by setting the parameter `config: <JSON>` instead of a custom
-  configuration file in the following
-  format: `file:configs/<model-config-name>.pbtxt: <base64-encoded-file-content>`.
-
-- The TensorRT-LLM [backend](https://github.com/triton-inference-server/tensorrtllm_backend) provides
-  limited support of Triton extensions and features.
-
-- The TensorRT-LLM backend may core dump on server shutdown. This
-  impacts server teardown only and will not impact inferencing.
-
-- The Java CAPI is known to have intermittent segfaults.
-
-- Some systems which implement malloc() may not release memory back to
-  the operating system right away, causing a false memory leak. This
-  can be mitigated by using a different malloc implementation.
-  Tcmalloc and jemalloc are installed in the Triton container and can
-  be [used by specifying the library in
-  LD_PRELOAD](https://github.com/triton-inference-server/server/blob/r22.12/docs/user_guide/model_management.md).
-  NVIDIA recommends experimenting with both tcmalloc and jemalloc to
-  determine which one works better for your use case.
-
-- Auto-complete may cause an increase in server start time. To avoid a
-  start time increase, users can provide the full model configuration
-  and launch the server with --disable-auto-complete-config.
-
-- Auto-complete does not support PyTorch models due to lack of
-  metadata in the model. It can only verify that the number of inputs
-  and the input names match what is specified in the model
-  configuration. There is no model metadata about the number of
-  outputs and datatypes. Related PyTorch
-  bug: <https://github.com/pytorch/pytorch/issues/38273>.
-
-- Triton Client PIP wheels for ARM SBSA are not available from PyPI,
-  and pip will install an incorrect Jetson version of the Triton Client
-  library for Arm SBSA. The correct client wheel file can be pulled
-  directly from the Arm SBSA SDK image and manually installed.
-
-- Traced models in PyTorch seem to create overflows when int8 tensor
-  values are transformed to int32 on the GPU. Refer
-  to [pytorch/pytorch#66930](https://github.com/pytorch/pytorch/issues/66930) for
-  more information.
-
-- Triton cannot retrieve GPU metrics with [MIG-enabled GPU
-  devices](https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html#supported-gpus).
-
-- Triton metrics might not work if the host machine is running a
-  separate DCGM agent on bare metal or in a container.
-
-- When cloud storage (AWS, GCS, Azure) is used as a model repository
-  and a model has multiple versions, Triton creates an extra local
-  copy of the cloud model's folder in the temporary directory, which
-  is deleted upon the server's shutdown.
-
-- Python backend support for Windows is limited and does not currently
-  support the following features:
-
-  - GPU tensors
-
-  - CPU and GPU-related metrics
-
-  - Custom execution environments
-
-  - The model load/unload APIs
+on [GitHub](https://github.com/triton-inference-server/server). Release notes can
+be found on the [GitHub Release Page](https://github.com/triton-inference-server/server/releases)
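
Several of the 25.01 known issues and features quoted above name concrete flags, config fields, and workarounds; the short sketches below illustrate a few of them under stated, placeholder assumptions. First, the DCGM/NSCQ shutdown-segfault item suggests disabling GPU metric collection, and the auto-complete item suggests `--disable-auto-complete-config`. A minimal launch sketch combining the two flags, with `/models` as a placeholder model repository path:

```shell
# Placeholder repository path; both flags are taken from the known-issues text.
tritonserver \
  --model-repository=/models \
  --allow-gpu-metrics false \
  --disable-auto-complete-config
```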
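
The TensorRT item notes that, with auto-complete disabled, reformat-free tensors need `is_non_linear_format_io: true` in the model configuration. A minimal sketch of such an input entry; the model directory, tensor name, datatype, and dims are illustrative assumptions:

```shell
# Hypothetical config.pbtxt fragment for a TensorRT model whose int8 input
# uses a non-linear (reformat-free) format; names and dims are placeholders.
cat >> model_repository/trt_model/config.pbtxt <<'EOF'
input [
  {
    name: "INPUT0"
    data_type: TYPE_INT8
    dims: [ 3, 224, 224 ]
    is_non_linear_format_io: true
  }
]
EOF
```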
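
For the vLLM tensor-parallelism item, the stated workaround is to set `distributed_executor_backend` to `ray` in the backend's `model.json`. A sketch of such a file; only `distributed_executor_backend` comes from the known-issues text, and the model name and `tensor_parallel_size` value are illustrative assumptions:

```shell
# Hypothetical vLLM backend model layout; engine-argument values are examples.
cat > model_repository/vllm_model/1/model.json <<'EOF'
{
  "model": "facebook/opt-125m",
  "tensor_parallel_size": 2,
  "distributed_executor_backend": "ray"
}
EOF
```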
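
The file-override item says to pass the model configuration through the `config` parameter rather than as an extra `.pbtxt` file. Assuming Triton's HTTP model-repository load endpoint on `localhost:8000` and a placeholder model name, a request could look roughly like this:

```shell
# Sketch only: the inline "config" value is a JSON-escaped model configuration.
curl -X POST localhost:8000/v2/repository/models/my_model/load \
  -d '{
        "parameters": {
          "config": "{\"name\": \"my_model\", \"backend\": \"onnxruntime\", \"max_batch_size\": 8}"
        }
      }'
```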
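
For the malloc item, tcmalloc or jemalloc can be selected at startup through `LD_PRELOAD`. The library paths below are the usual Ubuntu package locations and are assumptions; verify them inside the container before relying on them:

```shell
# tcmalloc (path assumed from the libtcmalloc-minimal4 package):
LD_PRELOAD=/usr/lib/$(uname -m)-linux-gnu/libtcmalloc_minimal.so.4:${LD_PRELOAD} \
  tritonserver --model-repository=/models

# jemalloc (path assumed from the libjemalloc2 package):
LD_PRELOAD=/usr/lib/$(uname -m)-linux-gnu/libjemalloc.so.2:${LD_PRELOAD} \
  tritonserver --model-repository=/models
```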
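
Finally, the Key Features list mentions several new GenAI-Perf options. A rough sketch of a profiling run that exercises them; the model name, header value, and prompt-pool sizes are placeholders, and option spellings should be checked against `genai-perf profile --help` for your installed version:

```shell
genai-perf profile \
  -m my_model \
  --request-count 100 \
  --header "Authorization: Bearer $TOKEN" \
  --num-system-prompts 5 \
  --system-prompt-length 128
```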