
Commit ee14362

docs: Fix Triton release_notes.md (#8031)
1 parent a61df46 commit ee14362

2 files changed: +8 -251 lines changed


docs/introduction/compatibility.md

Lines changed: 3 additions & 0 deletions
@@ -37,6 +37,7 @@
 
 | Triton release version | NGC Tag | Python version | Torch version | TensorRT version | TensorRT-LLM version | CUDA version | CUDA Driver version | Size |
 | --- | --- | --- | --- | --- | --- | --- | --- | --- |
+| 25.02 | nvcr.io/nvidia/tritonserver:25.02-trtllm-python-py3 | Python 3.12.3 | 2.6.0a0%2Becf3bae40a.nv25.1 | 10.8.0.43 | 0.17.0.post1 | 12.8.0.038 | 570.86.10 | 28G |
 | 25.01 | nvcr.io/nvidia/tritonserver:25.01-trtllm-python-py3 | Python 3.12.3 | 2.6.0a0%2Becf3bae40a.nv25.1 | 10.8.0.43 | 0.17.0 | 12.8.0.038 | 570.86.10 | 30G |
 | 24.12 | nvcr.io/nvidia/tritonserver:24.12-trtllm-python-py3 | Python 3.12.3 | 2.6.0a0%2Bdf5bbc09d1.nv24.11 | 10.7.0 | 0.16.0 | 12.6.3 | 560.35.05 | 22G |
 | 24.11 | nvcr.io/nvidia/tritonserver:24.11-trtllm-python-py3 | Python 3.10.12 | 2.5.0a0%2Be000cf0ad9.nv24.10 | 10.6.0 | 0.15.0 | 12.6.3 | 555.42.06 | 24.8G |
@@ -52,6 +53,7 @@
 
 | Triton release version | NGC Tag | Python version | vLLM version | CUDA version | CUDA Driver version | Size |
 | --- | --- | --- | --- | --- | --- | --- |
+| 25.02 | nvcr.io/nvidia/tritonserver:25.02-vllm-python-py3 | Python 3.12.3 | 0.7.0+5e800e3d.nv25.2.cu128 | 12.8.0.038 | 570.86.10 | 22G |
 | 25.01 | nvcr.io/nvidia/tritonserver:25.01-vllm-python-py3 | Python 3.12.3 | 0.6.3.post1 | 12.8.0.038 | 570.86.10 | 23G |
 | 24.12 | nvcr.io/nvidia/tritonserver:24.12-vllm-python-py3 | Python 3.12.3 | 0.5.5 | 12.6.3.004 | 560.35.05 | 20G |
 | 24.11 | nvcr.io/nvidia/tritonserver:24.11-vllm-python-py3 | Python 3.12.3 | 0.5.5 | 12.6.3.001 | 560.35.05 | 22.1G |
@@ -67,6 +69,7 @@
 
 | Triton release version | ONNX Runtime |
 | --- | --- |
+| 25.02 | 1.20.1 |
 | 25.01 | 1.20.1 |
 | 24.12 | 1.20.1 |
 | 24.11 | 1.19.2 |
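
The NGC tags listed in these tables can be used directly with Docker. A minimal sketch using the 25.02 TRT-LLM tag added above; the model repository path, mount point, and port mappings are illustrative assumptions, not part of this commit:

    # Pull the 25.02 TensorRT-LLM image referenced in the table above.
    docker pull nvcr.io/nvidia/tritonserver:25.02-trtllm-python-py3

    # Start Triton against a local model repository. Ports 8000/8001/8002 are
    # the default HTTP, gRPC, and metrics endpoints; the repository path is a
    # placeholder for wherever your models actually live.
    docker run --gpus all --rm \
        -p 8000:8000 -p 8001:8001 -p 8002:8002 \
        -v "$(pwd)/model_repository:/models" \
        nvcr.io/nvidia/tritonserver:25.02-trtllm-python-py3 \
        tritonserver --model-repository=/models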

docs/introduction/release_notes.md

Lines changed: 5 additions & 251 deletions
@@ -1,5 +1,5 @@
 <!--
-# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
+# Copyright (c) 2024-2025, NVIDIA CORPORATION. All rights reserved.
 #
 # Redistribution and use in source and binary forms, with or without
 # modification, are permitted provided that the following conditions
@@ -25,256 +25,10 @@
 # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
 # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 -->
-# [Triton Inference Server Release 25.01](https://docs.nvidia.com/deeplearning/triton-inference-server/release-notes/rel-25-01.html#rel-25-01)
+# [Triton Inference Server Release 25.02](https://docs.nvidia.com/deeplearning/triton-inference-server/release-notes/rel-25-02.html#rel-25-02)
 
-The Triton Inference Server container image, release 25.01, is available
+The Triton Inference Server container image, release 25.02, is available
 on [NGC](https://ngc.nvidia.com/catalog/containers/nvidia:tritonserver) and
 is open source
-on [GitHub](https://github.com/triton-inference-server/server).
-
-## Contents of the Triton Inference Server container
-
-The [Triton Inference
-Server](https://github.com/triton-inference-server/server) Docker image
-contains the inference server executable and related shared libraries
-in /opt/tritonserver.
-
-For a complete list of what the container includes, refer to [Deep
-Learning Frameworks Support
-Matrix](https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html).
-
-The container also includes the following:
-
-- [Ubuntu 24.04](https://releases.ubuntu.com/24.04/) including [Python
-3.12](https://www.python.org/downloads/release/python-3120/)
-
-- [NVIDIA CUDA
-12.8.0.038](https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html)
-
-- [NVIDIA cuBLAS
-12.8.3.14](https://docs.nvidia.com/cuda/cublas/index.html)
-
-- [cuDNN
-9.7.0.66](https://docs.nvidia.com/deeplearning/cudnn/release-notes/)
-
-- [NVIDIA NCCL
-2.25.1](https://docs.nvidia.com/deeplearning/nccl/release-notes/) (optimized
-for [NVIDIA NVLink](http://www.nvidia.com/object/nvlink.html)®)
-
-- [NVIDIA TensorRT™
-10.8.0.43](https://docs.nvidia.com/deeplearning/tensorrt/release-notes/index.html)
-
-- OpenUCX 1.15.0
-
-- GDRCopy 2.4.1
-
-- NVIDIA HPC-X 2.21
-
-- OpenMPI 4.1.7
-
-- [FIL](https://github.com/triton-inference-server/fil_backend)
-
-- [NVIDIA DALI®
-1.45](https://docs.nvidia.com/deeplearning/dali/release-notes/index.html)
-
-- [nvImageCodec
-0.2.0.7](https://docs.nvidia.com/cuda/nvimagecodec/release_notes_v0.2.0.html)
-
-- ONNX Runtime 1.20.1
-
-- Intel[ OpenVINO ](https://github.com/openvinotoolkit/openvino/tree/2022.1.0)2024.05.0
-
-- DCGM 3.3.6
-
-- [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM/) version [release/0.17.0](https://github.com/NVIDIA/TensorRT-LLM/tree/v0.17.0)
-
-- [vLLM](https://github.com/vllm-project/vllm) version 0.6.3.1
-
-## Driver Requirements
-
-Release 25.01 is based on [CUDA
-12.8.0.038](https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html) which
-requires [NVIDIA
-Driver](http://www.nvidia.com/Download/index.aspx?lang=en-us) release
-560 or later. However, if you are running on a data center GPU (for
-example, T4 or any other data center GPU), you can use NVIDIA driver
-release 470.57 (or later R470), 525.85 (or later R525), 535.86 (or later
-R535), or 545.23 (or later R545).
-
-The CUDA driver\'s compatibility package only supports particular
-drivers. Thus, users should upgrade from all R418, R440, R450, R460,
-R510, R520, R530, R545 and R555 drivers, which are not
-forward-compatible with CUDA 12.6. For a complete list of supported
-drivers, see the [CUDA Application
-Compatibility](https://docs.nvidia.com/deploy/cuda-compatibility/index.html#use-the-right-compat-package) topic.
-For more information, see [CUDA Compatibility and
-Upgrades](https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#cuda-compatibility-and-upgrades).
-
-## GPU Requirements
-
-Release 25.01 supports CUDA compute capability 7.5 and later. This
-corresponds to GPUs in the NVIDIA Turing™, NVIDIA Ampere architecture,
-NVIDIA Hopper™, NVIDIA Ada Lovelace, and NVIDIA Blackwell architecture
-families. For a list of GPUs to which this compute capability
-corresponds, see [CUDA GPUs](https://developer.nvidia.com/cuda-gpus).
-For additional support details, see [Deep Learning Frameworks Support
-Matrix](https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html).
-
-## Key Features and Enhancements
-
-This Inference Server release includes the following key features and
-enhancements.
-
-- Starting with the 25.01 release, Triton Inference Server supports
-Blackwell GPU architectures.
-
-- Fixed a bug when passing the correlation ID of string type to
-python_backend. Added datatype checks to correlation ID values.
-
-- vLLM backend can now take advantage of the [vLLM
-v0.6](https://blog.vllm.ai/2024/09/05/perf-update.html) performance
-improvement by communicating with the vLLM engine via ZMQ.
-
-- GenAI-Perf now provides the exact input sequence length requested
-for synthetic text generation.
-
-- GenAI-Perf supports the creation of a prefix pool to emulate system
-prompts via \--num-system-prompts and \--system-prompt-length.
-
-- GenAI-Perf improved error visibility via returning more detailed
-errors when OpenAI frontends return an error or metric generation
-fails.
-
-- GenAI-Perf reports time-to-second-token and request count in its
-metrics.
-
-- GenAI-Perf allows the use of a custom tokenizer in its "compare"
-subcommand for comparing multiple profiles.
-
-- GenAI-Perf natively supports \--request-count for sending a specific
-number of requests and \--header for sending a list of headers with
-every request.
-
-- Model Analyzer functionality has been migrated to GenAI-Perf via the
-"analyze" subcommand, enabling the tool to sweep and find the
-optimal model configuration.
-
-- A bytes appending bug was fixed in GenAI-Perf, resulting in more
-accurate output sequence lengths for Triton.
-
-
-## Known Issues
-
-- A segmentation fault related to DCGM and NSCQ may be encountered
-during server shutdown on NVSwitch systems. A possible workaround
-for this issue is to disable the collection of GPU metrics
-\`tritonserver \--allow-gpu-metrics false \...\`
-
-- vLLM backend currently does not take advantage of the [vLLM
-v0.6](https://blog.vllm.ai/2024/09/05/perf-update.html) performance
-improvement when metrics are enabled.
-
-- Please note, that the vllm version provided in 25.01 container is
-0.6.3.post1. Due to some issues with vllm library versioning,
-\`vllm.\_\_version\_\_\` displays \`0.6.3\`.
-
-- Incorrect results are known to occur when using TensorRT (TRT)
-Backend for inference using int8 data type for I/O on the Blackwell
-GPU architecture.
-
-- When running Torch TRT models, the output may differ from running
-the same model on a previous release.
-
-- When using TensorRT models, if auto-complete configuration is
-disabled and is_non_linear_format_io:true for reformat-free tensors
-is not provided in the model configuration, the model may not load
-successfully.
-
-- When using Python models in[decoupled
-mode](https://github.com/triton-inference-server/python_backend/tree/main?tab=readme-ov-file#decoupled-mode),
-users need to ensure that the ResponseSender goes out of scope or is
-properly cleaned up before unloading the model to guarantee that the
-unloading process executes correctly.
-
-- Restart support was temporarily removed for Python models.
-
-- The Triton Inference Server with vLLM backend currently does not
-support running vLLM models with tensor parallelism sizes greater
-than 1 and default \"distributed_executor_backend\" setting when
-using explicit model control mode. In attempt to load a vllm model
-(tp \> 1) in explicit mode, users could potentially see failure at
-the \`initialize\` step: \`could not acquire lock for
-\<\_io.BufferedWriter name=\'\<stdout\>\'\> at interpreter shutdown,
-possibly due to daemon threads\`. For the default model control
-mode, after server shutdown, vllm related sub-processes are not
-killed. Related vllm
-issue: <https://github.com/vllm-project/vllm/issues/6766> . Please
-specify distributed_executor_backend:ray in the model.json when
-deploying vllm models with tensor parallelism \> 1.
-
-- When loading models with file override, multiple model configuration
-files are not supported. Users must provide the model configuration
-by setting parameter config : \<JSON\> instead of custom
-configuration file in the following
-format: file:configs/\<model-config-name\>.pbtxt :
-\<base64-encoded-file-content\>.
-
-- TensorRT-LLM [backend](https://github.com/triton-inference-server/tensorrtllm_backend) provides
-limited support of Triton extensions and features.
-
-- The TensorRT-LLM backend may core dump on server shutdown. This
-impacts server teardown only and will not impact inferencing.
-
-- The Java CAPI is known to have intermittent segfaults.
-
-- Some systems which implement malloc() may not release memory back to
-the operating system right away causing a false memory leak. This
-can be mitigated by using a different malloc implementation.
-Tcmalloc and jemalloc are installed in the Triton container and can
-be [used by specifying the library in
-LD_PRELOAD](https://github.com/triton-inference-server/server/blob/r22.12/docs/user_guide/model_management.md).
-NVIDIA recommends experimenting with both tcmalloc and jemalloc to
-determine which one works better for your use case.
-
-- Auto-complete may cause an increase in server start time. To avoid a
-start time increase, users can provide the full model configuration
-and launch the server with \--disable-auto-complete-config.
-
-- Auto-complete does not support PyTorch models due to lack of
-metadata in the model. It can only verify that the number of inputs
-and the input names matches what is specified in the model
-configuration. There is no model metadata about the number of
-outputs and datatypes. Related PyTorch
-bug:<https://github.com/pytorch/pytorch/issues/38273>.
-
-- Triton Client PIP wheels for ARM SBSA are not available from PyPI
-and pip will install an incorrect Jetson version of Triton Client
-library for Arm SBSA. The correct client wheel file can be pulled
-directly from the Arm SBSA SDK image and manually installed.
-
-- Traced models in PyTorch seem to create overflows when int8 tensor
-values are transformed to int32 on the GPU. Refer
-to [pytorch/pytorch#66930](https://github.com/pytorch/pytorch/issues/66930) for
-more information.
-
-- Triton cannot retrieve GPU metrics with [MIG-enabled GPU
-devices](https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html#supported-gpus).
-
-- Triton metrics might not work if the host machine is running a
-separate DCGM agent on bare-metal or in a container.
-
-- When cloud storage (AWS, GCS, AZURE) is used as a model repository
-and a model has multiple versions, Triton creates an extra local
-copy of the cloud model's folder in the temporary directory, which
-is deleted upon server's shutdown.
-
-- Python backend support for Windows is limited and does not currently
-support the following features:
-
-- GPU tensors
-
-- CPU and GPU-related metrics
-
-- Custom execution environments
-
-- The model load/unload APIs
+on [GitHub](https://github.com/triton-inference-server/server). Release notes can
+be found on the [GitHub Release Page](https://github.com/triton-inference-server/server/releases)
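
The known-issues list removed above mentions a workaround for serving vLLM models with tensor parallelism greater than 1: specify distributed_executor_backend:ray in the model's model.json. A minimal sketch of that layout, assuming a hypothetical model named llama_vllm; the model identifier and tensor_parallel_size value are placeholders, and only the distributed_executor_backend setting comes from the removed note:

    # Hypothetical vLLM backend layout: model_repository/llama_vllm/1/model.json
    mkdir -p model_repository/llama_vllm/1
    cat > model_repository/llama_vllm/1/model.json <<'EOF'
    {
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "tensor_parallel_size": 2,
        "distributed_executor_backend": "ray"
    }
    EOF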
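
Another removed entry suggests preloading tcmalloc or jemalloc when the default malloc does not promptly return memory to the OS. A sketch of the LD_PRELOAD approach; the library paths below are typical for the Ubuntu-based container but are assumptions that should be verified in your image:

    # Run Triton with tcmalloc preloaded (path assumed; check your container).
    LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4:${LD_PRELOAD} \
        tritonserver --model-repository=/models

    # Or try jemalloc instead (path likewise assumed) and compare memory behavior.
    LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2:${LD_PRELOAD} \
        tritonserver --model-repository=/models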
