
Commit 6d178ce

TensorRT 10.12 OSS Release (#4491)
Signed-off-by: Akhil Goel <akhilg@nvidia.com>
1 parent 9255eb3 commit 6d178ce

160 files changed: +5780 -1417 lines changed


.github/workflows/feedback-update.yml

Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
+name: Remove feedback label on comment
+
+on:
+  issue_comment:
+    types: [created]
+
+jobs:
+  remove_label:
+    runs-on: ubuntu-latest
+    if: github.event.issue.user.id == github.event.comment.user.id
+    steps:
+      - uses: actions/checkout@v2
+      - uses: actions-ecosystem/action-remove-labels@v1
+        with:
+          github_token: ${{ secrets.GITHUB_TOKEN }}
+          labels: "waiting for feedback"

.github/workflows/stale.yml

Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
+name: Label and close inactive issues
+on:
+  workflow_dispatch:
+  schedule:
+    - cron: "0 * * * *"
+
+jobs:
+  stale:
+    runs-on: ubuntu-latest
+    permissions:
+      issues: write
+      pull-requests: write
+
+    steps:
+      - uses: actions/stale@v9
+        with:
+          repo-token: ${{ secrets.GITHUB_TOKEN }}
+          stale-issue-message: 'Issue has not received an update in over 14 days. Adding stale label. Please note the issue will be closed in 14 days after being marked stale if there is no update.'
+          stale-pr-message: 'PR has not received an update in over 14 days. Adding stale label. Please note the PR will be closed in 14 days after being marked stale if there is no update.'
+          close-issue-message: 'This issue was closed because it has been 14 days without activity since it has been marked as stale.'
+          close-pr-message: 'This PR was closed because it has been 14 days without activity since it has been marked as stale.'
+          days-before-issue-stale: 14
+          days-before-close: 14
+          only-labels: 'waiting for feedback'
+          labels-to-add-when-unstale: 'investigating'
+          labels-to-remove-when-unstale: 'stale,waiting for feedback'
+          stale-issue-label: 'stale'
+          stale-pr-label: 'stale'

CHANGELOG.md

Lines changed: 40 additions & 2 deletions
@@ -1,17 +1,55 @@
 # TensorRT OSS Release Changelog
 
-## 10.11.0 GA - 2025-5-21
+## 10.12.0 GA - 2025-6-10
+- Plugin changes
+  - Migrated `IPluginV2`-descendent version 1 of `cropAndResizeDynamic`, to version 2, which implements `IPluginV3`.
+  - Note: The newer versions preserve the attributes and I/O of the corresponding older plugin version. The older plugin versions are deprecated and will be removed in a future release
+  - Deprecated the listed versions of the following plugins:
+    - `DecodeBbox3DPlugin` (version 1)
+    - `DetectionLayer_TRT` (version 1)
+    - `EfficientNMS_TRT` (version 1)
+    - `FlattenConcat_TRT` (version 1)
+    - `GenerateDetection_TRT` (version 1)
+    - `GridAnchor_TRT` (version 1)
+    - `GroupNormalizationPlugin` (version 1)
+    - `InstanceNormalization_TRT` (version 2)
+    - `ModulatedDeformConv2d` (version 1)
+    - `MultilevelCropAndResize_TRT` (version 1)
+    - `MultilevelProposeROI_TRT` (version 1)
+    - `RPROI_TRT` (version 1)
+    - `PillarScatterPlugin` (version 1)
+    - `PriorBox_TRT` (version 1)
+    - `ProposalLayer_TRT` (version 1)
+    - `ProposalDynamic` (version 1)
+    - `Region_TRT` (version 1)
+    - `Reorg_TRT` (version 2)
+    - `ResizeNearest_TRT` (version 1)
+    - `ScatterND` (version 1)
+    - `VoxelGeneratorPlugin` (version 1)
+- Demo changes
+  - Added [Image-to-Image](demo/Diffusion#generate-an-image-with-stable-diffusion-v35-large-with-controlnet-guided-by-an-image-and-a-text-prompt) support for Stable Diffusion v3.5-large ControlNet models.
+  - Enabled download of [pre-exported ONNX models](https://huggingface.co/stabilityai/stable-diffusion-3.5-large-tensorrt) for the Stable Diffusion v3.5-large pipeline.
+- Sample changes
+  - Added two refactored python samples [1_run_onnx_with_tensorrt](samples/python/refactored/1_run_onnx_with_tensorrt) and [2_construct_network_with_layer_apis](samples/python/refactored/2_construct_network_with_layer_apis)
+- Parser changes
+  - Added support for integer-typed base tensors for `Pow` operations
+  - Added support for custom `MXFP8` quantization operations
+  - Added support for ellipses, diagonal, and broadcasting in `Einsum` operations
+
+
+## 10.11.0 GA - 2025-5-16
 
 Key Features and Updates:
 
 - Plugin changes
-  - Migrated `IPluginV2`-descendent version 1 of `modulatedDeformConvPlugin`, to version 2, which implements `IPluginV3`.
+  - Migrated `IPluginV2`-descendent version 1 of `cropAndResizePluginDynamic`, to version 2, which implements `IPluginV3`.
   - Migrated `IPluginV2`-descendent version 1 of `DisentangledAttention_TRT`, to version 2, which implements `IPluginV3`.
   - Migrated `IPluginV2`-descendent version 1 of `MultiscaleDeformableAttnPlugin_TRT`, to version 2, which implements `IPluginV3`.
   - Note: The newer versions preserve the attributes and I/O of the corresponding older plugin version. The older plugin versions are deprecated and will be removed in a future release.
 - Demo changes
   - demoDiffusion
     - Added support for Stable Diffusion 3.5-medium and 3.5-large pipelines in BF16 and FP16 precisions.
+    - Added support for Stable Diffusion 3.5-large pipeline in FP8 precision.
 - Parser changes
   - Added `kENABLE_UINT8_AND_ASYMMETRIC_QUANTIZATION_DLA` parser flag to enable UINT8 asymmetric quantization on engines targeting DLA.
   - Removed restriction that inputs to `RandomNormalLike` and `RandomUniformLike` must be tensors.

CMakeLists.txt

Lines changed: 1 addition & 1 deletion
@@ -183,7 +183,7 @@ if (DEFINED GPU_ARCHS)
   message(STATUS "GPU_ARCHS defined as ${GPU_ARCHS}. Generating CUDA code for SM ${GPU_ARCHS}")
   separate_arguments(GPU_ARCHS)
   foreach(SM IN LISTS GPU_ARCHS)
-    list(APPEND CMAKE_CUDA_ARCHITECTURES SM)
+    list(APPEND CMAKE_CUDA_ARCHITECTURES "${SM}")
   endforeach()
 else()
   list(APPEND CMAKE_CUDA_ARCHITECTURES 72 75 80 86 87 89 90)

README.md

Lines changed: 9 additions & 9 deletions
@@ -32,7 +32,7 @@ To build the TensorRT-OSS components, you will first need the following software
 
 **TensorRT GA build**
 
-- TensorRT v10.11.0.33
+- TensorRT v10.12.0.36
   - Available from direct download links listed below
 
 **System Packages**
@@ -86,24 +86,24 @@ To build the TensorRT-OSS components, you will first need the following software
 
 Else download and extract the TensorRT GA build from [NVIDIA Developer Zone](https://developer.nvidia.com) with the direct links below:
 
-- [TensorRT 10.11.0.33 for CUDA 11.8, Linux x86_64](https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/10.11.0/tars/TensorRT-10.11.0.33.Linux.x86_64-gnu.cuda-11.8.tar.gz)
-- [TensorRT 10.11.0.33 for CUDA 12.9, Linux x86_64](https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/10.11.0/tars/TensorRT-10.11.0.33.Linux.x86_64-gnu.cuda-12.9.tar.gz)
-- [TensorRT 10.11.0.33 for CUDA 11.8, Windows x86_64](https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/10.11.0/zip/TensorRT-10.11.0.33.Windows.win10.cuda-11.8.zip)
-- [TensorRT 10.11.0.33 for CUDA 12.9, Windows x86_64](https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/10.11.0/zip/TensorRT-10.11.0.33.Windows.win10.cuda-12.9.zip)
+- [TensorRT 10.12.0.36 for CUDA 11.8, Linux x86_64](https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/10.12.0/tars/TensorRT-10.12.0.36.Linux.x86_64-gnu.cuda-11.8.tar.gz)
+- [TensorRT 10.12.0.36 for CUDA 12.9, Linux x86_64](https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/10.12.0/tars/TensorRT-10.12.0.36.Linux.x86_64-gnu.cuda-12.9.tar.gz)
+- [TensorRT 10.12.0.36 for CUDA 11.8, Windows x86_64](https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/10.12.0/zip/TensorRT-10.12.0.36.Windows.win10.cuda-11.8.zip)
+- [TensorRT 10.12.0.36 for CUDA 12.9, Windows x86_64](https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/10.12.0/zip/TensorRT-10.12.0.36.Windows.win10.cuda-12.9.zip)
 
 **Example: Ubuntu 20.04 on x86-64 with cuda-12.9**
 
 ```bash
 cd ~/Downloads
-tar -xvzf TensorRT-10.11.0.33.Linux.x86_64-gnu.cuda-12.9.tar.gz
-export TRT_LIBPATH=`pwd`/TensorRT-10.11.0.33
+tar -xvzf TensorRT-10.12.0.36.Linux.x86_64-gnu.cuda-12.9.tar.gz
+export TRT_LIBPATH=`pwd`/TensorRT-10.12.0.36
 ```
 
 **Example: Windows on x86-64 with cuda-12.9**
 
 ```powershell
-Expand-Archive -Path TensorRT-10.11.0.33.Windows.win10.cuda-12.9.zip
-$env:TRT_LIBPATH="$pwd\TensorRT-10.11.0.33\lib"
+Expand-Archive -Path TensorRT-10.12.0.36.Windows.win10.cuda-12.9.zip
+$env:TRT_LIBPATH="$pwd\TensorRT-10.12.0.36\lib"
 ```
 
 ## Setting Up The Build Environment
## Setting Up The Build Environment

VERSION

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-10.11.0.33
+10.12.0.36

cmake/modules/ShouldCompileKernel.cmake

Lines changed: 28 additions & 5 deletions
@@ -13,15 +13,38 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
+# \brief Converts a SM string (i.e. 86+abc) into the numeric SM version (i.e. 86).
+# \returns the sm in the name specified by OUT_VAR.
+function(get_numeric_sm SM OUT_VAR)
+    # Convert the SM string to a numeric value
+    if(${SM} MATCHES "^([0-9]+).*$")
+        set(${OUT_VAR} ${CMAKE_MATCH_1} PARENT_SCOPE)
+    else()
+        message(FATAL_ERROR "Invalid SM version: ${SM}")
+    endif()
+endfunction()
+
+# \brief Converts the CMAKE_CUDA_ARCHITECTURES list into a list of numeric SM values.
+# \returns the list in the name specified by OUT_VAR.
+function(get_all_numeric_sms OUT_VAR)
+    set(ALL_NUMERIC_SMS "")
+    foreach(SM IN LISTS CMAKE_CUDA_ARCHITECTURES)
+        get_numeric_sm(${SM} "SM")
+        list(APPEND ALL_NUMERIC_SMS ${SM})
+    endforeach()
+    set(${OUT_VAR} ${ALL_NUMERIC_SMS} PARENT_SCOPE)
+endfunction()
+
 # Certain cubins are binary compatible between different SM versions, so they are reused.
 # This function checks if a SM-named file should be compiled based on current SM enablement.
 # Specifically, the SM80 files are compiled if either 80, 86, or 89 are enabled.
 function(should_compile_kernel SM OUT_VAR)
-    # If the target SM is any of 80/86/89, we need to check if any of those are enabled in CMAKE_CUDA_ARCHITECTURES.
+    get_all_numeric_sms(__TRT_NUMERIC_CUDA_ARCHS)
+    # If the target SM is any of 80/86/89, we need to check if any of those are enabled in __TRT_NUMERIC_CUDA_ARCHS.
     if((${SM} EQUAL 80) OR (${SM} EQUAL 86) OR (${SM} EQUAL 89))
-        list(FIND CMAKE_CUDA_ARCHITECTURES 80 SM80_INDEX)
-        list(FIND CMAKE_CUDA_ARCHITECTURES 86 SM86_INDEX)
-        list(FIND CMAKE_CUDA_ARCHITECTURES 89 SM89_INDEX)
+        list(FIND __TRT_NUMERIC_CUDA_ARCHS 80 SM80_INDEX)
+        list(FIND __TRT_NUMERIC_CUDA_ARCHS 86 SM86_INDEX)
+        list(FIND __TRT_NUMERIC_CUDA_ARCHS 89 SM89_INDEX)
         if((NOT ${SM80_INDEX} EQUAL -1) OR
            (NOT ${SM86_INDEX} EQUAL -1) OR
            (NOT ${SM89_INDEX} EQUAL -1)
@@ -31,7 +54,7 @@ function(should_compile_kernel SM OUT_VAR)
             set(${OUT_VAR} FALSE PARENT_SCOPE)
         endif()
     else()
-        list(FIND CMAKE_CUDA_ARCHITECTURES ${SM} SM_INDEX)
+        list(FIND __TRT_NUMERIC_CUDA_ARCHS ${SM} SM_INDEX)
         if (NOT ${SM_INDEX} EQUAL -1)
             set(${OUT_VAR} TRUE PARENT_SCOPE)
         else()
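
For readers less familiar with CMake, the following is a minimal Python sketch of the lookup that the `ShouldCompileKernel.cmake` change implements: strip any suffix from each enabled architecture, then treat 80, 86 and 89 as one family. The names `numeric_sm`, `should_compile_kernel` and `SM80_FAMILY` are illustrative Python stand-ins, not part of the TensorRT build, and the suffixed entry `86-real` is only an assumed example of a non-numeric architecture string.

```python
import re
from typing import Iterable

# SM versions that share the SM80 kernel files, per the comment in the CMake module.
SM80_FAMILY = {80, 86, 89}


def numeric_sm(sm: str) -> int:
    """Return the leading digits of an SM string, e.g. "86+abc" -> 86."""
    match = re.match(r"^([0-9]+)", sm)
    if match is None:
        # Mirrors the FATAL_ERROR branch of get_numeric_sm.
        raise ValueError(f"Invalid SM version: {sm}")
    return int(match.group(1))


def should_compile_kernel(sm: int, cuda_architectures: Iterable[str]) -> bool:
    """Decide whether the kernel file named after `sm` needs to be compiled."""
    # Normalise every enabled architecture to its numeric part first.
    enabled = {numeric_sm(arch) for arch in cuda_architectures}
    if sm in SM80_FAMILY:
        # Any enabled member of the family triggers the shared files.
        return bool(enabled & SM80_FAMILY)
    return sm in enabled
```

With this sketch, `should_compile_kernel(80, ["86-real", "90"])` is true because 86 belongs to the shared family, while `should_compile_kernel(75, ["86-real", "90"])` is false. Without the normalisation step, a plain membership test against the raw strings would miss `86-real`.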

demo/BERT/builder.py

Lines changed: 24 additions & 3 deletions
@@ -70,6 +70,7 @@ def __init__(
         use_qat,
         use_sparsity,
         timing_cache,
+        distributive_independence = False,
         use_deprecated_plugins=False,
     ):
         with open(bert_config_path, "r") as f:
@@ -90,7 +91,7 @@ def __init__(
         self.use_sparsity = use_sparsity
         self.timing_cache = timing_cache
         self.use_deprecated_plugins = use_deprecated_plugins
-
+        self.distributive_independence = distributive_independence
 
 def set_tensor_name(tensor, prefix, name):
     tensor.name = prefix + name
@@ -338,7 +339,18 @@ def emb_layernorm(builder, network, config, weights_dict, builder_config, sequen
         # Specify profiles for the batch sizes we're interested in.
         # Make sure the profile also works for all sizes not covered by the previous profile.
 
-        if len(sequence_lengths) > 1 or len(batch_sizes) > 1:
+        # When distributive independence is enabled, only one profile can be used.
+        if config.distributive_independence:
+            max_batch_size = max(batch_sizes)
+            max_sequence_length = max(sequence_lengths)
+            profile = builder.create_optimization_profile()
+            min_shape = (1, max_sequence_length)
+            shape = (max_batch_size, max_sequence_length)
+            profile.set_shape("input_ids", min=min_shape, opt=shape, max=shape)
+            profile.set_shape("segment_ids", min=min_shape, opt=shape, max=shape)
+            profile.set_shape("input_mask", min=min_shape, opt=shape, max=shape)
+            builder_config.add_optimization_profile(profile)
+        elif len(sequence_lengths) > 1 or len(batch_sizes) > 1:
             for batch_size in sorted(batch_sizes):
                 if len(sequence_lengths) == 1:
                     profile = builder.create_optimization_profile()
@@ -420,7 +432,8 @@ def build_engine(batch_sizes, workspace_size, sequence_lengths, config, weights_
 
         if verbose:
             builder_config.profiling_verbosity = trt.ProfilingVerbosity.DETAILED
-
+        if config.distributive_independence:
+            builder_config.set_flag(trt.BuilderFlag.DISTRIBUTIVE_INDEPENDENCE)
         if config.use_sparsity:
             TRT_LOGGER.log(TRT_LOGGER.INFO, "Setting sparsity flag on builder_config.")
             builder_config.set_flag(trt.BuilderFlag.SPARSE_WEIGHTS)
@@ -626,6 +639,13 @@ def main():
         help="Path to tensorrt build timeing cache file, only available for tensorrt 8.0 and later (default: None)",
         required=False,
     )
+    parser.add_argument(
+        "--distributive_independence",
+        default=False,
+        action="store_true",
+        help="Enable TensorRT's distributive independence builder flag (default: false)",
+        required=False,
+    )
     parser.add_argument(
         "--verbose",
         action="store_true",
@@ -672,6 +692,7 @@ def main():
         args.int8 and args.onnx != None,
         args.sparse,
         args.timing_cache_file,
+        args.distributive_independence,
         args.use_deprecated_plugins,
     )
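
The single-profile branch added to `demo/BERT/builder.py` can be read in isolation as a small shape computation. The sketch below reproduces only that arithmetic in plain Python; it does not call TensorRT, and the helper name `single_profile_shapes` is hypothetical rather than something the demo defines.

```python
from typing import Dict, Iterable, Tuple

Shape = Tuple[int, int]

# Input tensor names taken from the profile.set_shape calls in the diff.
INPUT_NAMES = ("input_ids", "segment_ids", "input_mask")


def single_profile_shapes(
    batch_sizes: Iterable[int], sequence_lengths: Iterable[int]
) -> Dict[str, Tuple[Shape, Shape, Shape]]:
    """Return (min, opt, max) shapes per input for the one-profile case."""
    max_batch_size = max(batch_sizes)
    max_sequence_length = max(sequence_lengths)
    # The minimum keeps the largest sequence length and lets only the batch shrink to 1.
    min_shape = (1, max_sequence_length)
    # opt and max are the same shape: the largest requested batch and sequence length.
    shape = (max_batch_size, max_sequence_length)
    return {name: (min_shape, shape, shape) for name in INPUT_NAMES}
```

For example, `single_profile_shapes([1, 8], [128, 384])` gives every input the triple `((1, 384), (8, 384), (8, 384))`, so the shorter sequence length of 128 is not covered by the resulting profile.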

demo/BERT/builder_varseqlen.py

Lines changed: 1 addition & 1 deletion
@@ -572,7 +572,7 @@ def main():
     parser.add_argument(
         "-w",
         "--workspace-size",
-        help="Workspace size in MiB for building the BERT engine (default: unlimited)",
+        help="Workspace size in MiB for building the BERT engine (default: unlimited)",
         type=int,
     )
     parser.add_argument(
578578
parser.add_argument(

demo/Diffusion/README.md

Lines changed: 25 additions & 3 deletions
@@ -49,7 +49,7 @@ onnx 1.15.0
 onnx-graphsurgeon 0.5.2
 onnxruntime 1.16.3
 polygraphy 0.49.9
-tensorrt 10.11.0.33
+tensorrt 10.12.0.36
 tokenizers 0.13.3
 torch 2.2.0
 transformers 4.42.2
@@ -154,6 +154,8 @@ python3 demo_controlnet.py "A beautiful bird with rainbow colors" --controlnet-t
 
 > NOTE: Currently only `--controlnet-type canny` is supported. `--input-image` must be a pre-processed image corresponding to `--controlnet-type canny`. If unspecified, a sample image will be downloaded.
 
+> NOTE: FP8 quantization (`--fp8`) is supported.
+
 ### Generate an image guided by a text prompt, and using specified LoRA model weight updates
 
 ```bash
@@ -208,10 +210,13 @@ Run the command below to generate an image using Stable Diffusion 3 and Stable D
 python3 demo_txt2img_sd3.py "A vibrant street wall covered in colorful graffiti, the centerpiece spells \"SD3 MEDIUM\", in a storm of colors" --version sd3 --hf-token=$HF_TOKEN
 
 # Stable Diffusion 3.5-medium
-python3 demo_txt2img_sd35.py "a beautiful photograph of Mt. Fuji during cherry blossom" --version=3.5-medium --denoising-steps=30 --guidance-scale 3.5 --hf-token=$HF_TOKEN
+python3 demo_txt2img_sd35.py "a beautiful photograph of Mt. Fuji during cherry blossom" --version=3.5-medium --denoising-steps=30 --guidance-scale 3.5 --hf-token=$HF_TOKEN --bf16
 
 # Stable Diffusion 3.5-large
-python3 demo_txt2img_sd35.py "a beautiful photograph of Mt. Fuji during cherry blossom" --version=3.5-large --denoising-steps=30 --guidance-scale 3.5 --hf-token=$HF_TOKEN
+python3 demo_txt2img_sd35.py "a beautiful photograph of Mt. Fuji during cherry blossom" --version=3.5-large --denoising-steps=30 --guidance-scale 3.5 --hf-token=$HF_TOKEN --bf16 --download-onnx-models
+
+# Stable Diffusion 3.5-large FP8
+python3 demo_txt2img_sd35.py "a beautiful photograph of Mt. Fuji during cherry blossom" --version=3.5-large --denoising-steps=30 --guidance-scale 3.5 --hf-token=$HF_TOKEN --fp8 --download-onnx-models --onnx-dir onnx_35_fp8/ --engine-dir engine_35_fp8/
 ```
 
 You can also specify an input image conditioning as shown below
@@ -225,6 +230,19 @@ python3 demo_txt2img_sd3.py "dog wearing a sweater and a blue collar" --version
 
 Note that a denosing-percentage is applied to the number of denoising-steps when an input image conditioning is provided. Its default value is set to 0.6. This parameter can be updated using `--denoising-percentage`
 
+### Generate an image with Stable Diffusion v3.5-large with ControlNet guided by an image and a text prompt
+
+```bash
+# Depth
+python3 demo_controlnet_sd35.py "a photo of a man" --controlnet-type depth --hf-token=$HF_TOKEN --denoising-steps 40 --guidance-scale 4.5 --bf16
+
+# Canny
+python3 demo_controlnet_sd35.py "A Night time photo taken by Leica M11, portrait of a Japanese woman in a kimono, looking at the camera, Cherry blossoms" --controlnet-type canny --hf-token=$HF_TOKEN --denoising-steps 60 --guidance-scale 3.5 --bf16
+
+# Blur
+python3 demo_controlnet_sd35.py "generated ai art, a tiny, lost rubber ducky in an action shot close-up, surfing the humongous waves, inside the tube, in the style of Kelly Slater" --controlnet-type blur --hf-token=$HF_TOKEN --denoising-steps 60 --guidance-scale 3.5 --bf16
+```
+
 ### Generate a video guided by an initial image using Stable Video Diffusion
 
 Download the pre-exported ONNX model
@@ -442,3 +460,7 @@ Custom override paths to pre-built engine files can be provided using `--custom-
 - To accelerate engine building time use `--timing-cache <path to cache file>`. The cache file will be created if it does not already exist. Note that performance may degrade if cache files are used across multiple GPU targets. It is recommended to use timing caches only during development. To achieve the best perfromance in deployment, please build engines without timing cache.
 - Specify new directories for storing onnx and engine files when switching between versions, LoRAs, ControlNets, etc. This can be done using `--onnx-dir <new onnx dir>` and `--engine-dir <new engine dir>`.
 - Inference performance can be improved by enabling [CUDA graphs](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-graphs) using `--use-cuda-graph`. Enabling CUDA graphs requires fixed input shapes, so this flag must be combined with `--build-static-batch` and cannot be combined with `--build-dynamic-shape`.
+
+### Known Issues
+
+The Stable Diffusion XL pipeline is currently not supported on RTX 5090 due to memory constraints. This issue will be resolved in an upcoming release.
