NVIDIA
diff --git a/.github/workflows/feedback-update.yml b/.github/workflows/feedback-update.yml
Lines changed: 16 additions & 0 deletions
diff --git a/.github/workflows/stale.yml b/.github/workflows/stale.yml
Lines changed: 28 additions & 0 deletions
diff --git a/CHANGELOG.md b/CHANGELOG.md
Lines changed: 40 additions & 2 deletions
diff --git a/CMakeLists.txt b/CMakeLists.txt
Lines changed: 1 addition & 1 deletion
diff --git a/README.md b/README.md
Lines changed: 9 additions & 9 deletions
diff --git a/VERSION b/VERSION
Lines changed: 1 addition & 1 deletion
diff --git a/cmake/modules/ShouldCompileKernel.cmake b/cmake/modules/ShouldCompileKernel.cmake
Lines changed: 28 additions & 5 deletions
diff --git a/demo/BERT/builder.py b/demo/BERT/builder.py
Lines changed: 24 additions & 3 deletions
diff --git a/demo/BERT/builder_varseqlen.py b/demo/BERT/builder_varseqlen.py
Lines changed: 1 addition & 1 deletion
diff --git a/demo/Diffusion/README.md b/demo/Diffusion/README.md
Lines changed: 25 additions & 3 deletions
@@ -0,0 +1,16 @@
+name: Remove feedback label on comment
+
+on:
+  issue_comment:
+    types: [created]
+
+jobs:
+  remove_label:
+    runs-on: ubuntu-latest
+    if: github.event.issue.user.id == github.event.comment.user.id
+    steps:
+      - uses: actions/checkout@v2
+      - uses: actions-ecosystem/action-remove-labels@v1
+        with:
+          github_token: ${{ secrets.GITHUB_TOKEN }}
+          labels: "waiting for feedback"
@@ -0,0 +1,28 @@
+name: Label and close inactive issues
+on:
+  workflow_dispatch:
+  schedule:
+    - cron: "0 * * * *"
+
+jobs:
+  stale:
+    runs-on: ubuntu-latest
+    permissions:
+      issues: write
+      pull-requests: write
+
+    steps:
+      - uses: actions/stale@v9
+        with:
+          repo-token: ${{ secrets.GITHUB_TOKEN }}
+          stale-issue-message: 'Issue has not received an update in over 14 days. Adding stale label.  Please note the issue will be closed in 14 days after being marked stale if there is no update.'
+          stale-pr-message: 'PR has not received an update in over 14 days. Adding stale label.  Please note the PR will be closed in 14 days after being marked stale if there is no update.'
+          close-issue-message: 'This issue was closed because it has been 14 days without activity since it has been marked as stale.'
+          close-pr-message: 'This PR was closed because it has been 14 days without activity since it has been marked as stale.'
+          days-before-issue-stale: 14
+          days-before-close: 14
+          only-labels: 'waiting for feedback'
+          labels-to-add-when-unstale: 'investigating'
+          labels-to-remove-when-unstale: 'stale,waiting for feedback'
+          stale-issue-label: 'stale'
+          stale-pr-label: 'stale'
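
For reference, the issue timeline implied by `days-before-issue-stale: 14` and `days-before-close: 14` can be sketched as below. This is a minimal illustration, not part of the commit: the function name `issue_timeline` is hypothetical, and it ignores the hourly cron granularity and the `only-labels` filter.

```python
from datetime import date, timedelta


def issue_timeline(last_activity, days_before_stale=14, days_before_close=14):
    """Return (stale_on, close_on) for an issue with no further activity.

    Hypothetical helper: it only mirrors the two day counts configured
    in the workflow above.
    """
    # The issue is labelled stale after the first quiet period...
    stale_on = last_activity + timedelta(days=days_before_stale)
    # ...and closed after a second quiet period measured from that point.
    close_on = stale_on + timedelta(days=days_before_close)
    return stale_on, close_on


if __name__ == "__main__":
    stale_on, close_on = issue_timeline(date(2025, 6, 1))
    print("stale on:", stale_on, "closed on:", close_on)
```

Under these assumptions an issue goes 28 days from its last update to closure, and any comment by the issue author in between removes the `waiting for feedback` label through the other workflow.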
@@ -1,17 +1,55 @@
 # TensorRT OSS Release Changelog
 
-## 10.11.0 GA - 2025-5-21
+## 10.12.0 GA - 2025-6-10
+- Plugin changes
+  - Migrated `IPluginV2`-descendent version 1 of `cropAndResizeDynamic`, to version 2, which implements `IPluginV3`.
+  - Note: The newer versions preserve the attributes and I/O of the corresponding older plugin version. The older plugin versions are deprecated and will be removed in a future release
+  - Deprecated the listed versions of the following plugins:
+    - `DecodeBbox3DPlugin` (version 1)
+    - `DetectionLayer_TRT` (version 1)
+    - `EfficientNMS_TRT` (version 1)
+    - `FlattenConcat_TRT` (version 1)
+    - `GenerateDetection_TRT` (version 1)
+    - `GridAnchor_TRT` (version 1)
+    - `GroupNormalizationPlugin` (version 1)
+    - `InstanceNormalization_TRT` (version 2)
+    - `ModulatedDeformConv2d` (version 1)
+    - `MultilevelCropAndResize_TRT` (version 1)
+    - `MultilevelProposeROI_TRT` (version 1)
+    - `RPROI_TRT` (version 1)
+    - `PillarScatterPlugin` (version 1)
+    - `PriorBox_TRT` (version 1)
+    - `ProposalLayer_TRT` (version 1)
+    - `ProposalDynamic` (version 1)
+    - `Region_TRT` (version 1)
+    - `Reorg_TRT` (version 2)
+    - `ResizeNearest_TRT` (version 1)
+    - `ScatterND` (version 1)
+    - `VoxelGeneratorPlugin` (version 1)
+- Demo changes
+  - Added [Image-to-Image](demo/Diffusion#generate-an-image-with-stable-diffusion-v35-large-with-controlnet-guided-by-an-image-and-a-text-prompt) support for Stable Diffusion v3.5-large ControlNet models.
+  - Enabled download of [pre-exported ONNX models](https://huggingface.co/stabilityai/stable-diffusion-3.5-large-tensorrt) for the Stable Diffusion v3.5-large pipeline.
+- Sample changes
+  - Added two refactored python samples [1_run_onnx_with_tensorrt](samples/python/refactored/1_run_onnx_with_tensorrt) and [2_construct_network_with_layer_apis](samples/python/refactored/2_construct_network_with_layer_apis) 
+- Parser changes
+  - Added support for integer-typed base tensors for `Pow` operations
+  - Added support for custom `MXFP8` quantization operations
+  - Added support for ellipses, diagonal, and broadcasting in `Einsum` operations
+
+
+## 10.11.0 GA - 2025-5-16
 
 Key Features and Updates:
 
 - Plugin changes
-  - Migrated `IPluginV2`-descendent version 1 of `modulatedDeformConvPlugin`, to version 2, which implements `IPluginV3`.
+  - Migrated `IPluginV2`-descendent version 1 of `cropAndResizePluginDynamic`, to version 2, which implements `IPluginV3`.
   - Migrated `IPluginV2`-descendent version 1 of `DisentangledAttention_TRT`, to version 2, which implements `IPluginV3`.
   - Migrated `IPluginV2`-descendent version 1 of `MultiscaleDeformableAttnPlugin_TRT`, to version 2, which implements `IPluginV3`.
   - Note: The newer versions preserve the attributes and I/O of the corresponding older plugin version. The older plugin versions are deprecated and will be removed in a future release.
 - Demo changes
   - demoDiffusion
     - Added support for Stable Diffusion 3.5-medium and 3.5-large pipelines in BF16 and FP16 precisions.
+    - Added support for Stable Diffusion 3.5-large pipeline in FP8 precision.
 - Parser changes
   - Added `kENABLE_UINT8_AND_ASYMMETRIC_QUANTIZATION_DLA` parser flag to enable UINT8 asymmetric quantization on engines targeting DLA.
   - Removed restriction that inputs to `RandomNormalLike` and `RandomUniformLike` must be tensors.
 
@@ -183,7 +183,7 @@ if (DEFINED GPU_ARCHS)
   message(STATUS "GPU_ARCHS defined as ${GPU_ARCHS}. Generating CUDA code for SM ${GPU_ARCHS}")
   separate_arguments(GPU_ARCHS)
   foreach(SM IN LISTS GPU_ARCHS)
-    list(APPEND CMAKE_CUDA_ARCHITECTURES SM)
+    list(APPEND CMAKE_CUDA_ARCHITECTURES "${SM}")
   endforeach()
 else()
   list(APPEND CMAKE_CUDA_ARCHITECTURES 72 75 80 86 87 89 90)
 
@@ -32,7 +32,7 @@ To build the TensorRT-OSS components, you will first need the following software
 
 **TensorRT GA build**
 
-- TensorRT v10.11.0.33
+- TensorRT v10.12.0.36
   - Available from direct download links listed below
 
 **System Packages**
@@ -86,24 +86,24 @@ To build the TensorRT-OSS components, you will first need the following software
 
    Else download and extract the TensorRT GA build from [NVIDIA Developer Zone](https://developer.nvidia.com) with the direct links below:
 
-   - [TensorRT 10.11.0.33 for CUDA 11.8, Linux x86_64](https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/10.11.0/tars/TensorRT-10.11.0.33.Linux.x86_64-gnu.cuda-11.8.tar.gz)
-   - [TensorRT 10.11.0.33 for CUDA 12.9, Linux x86_64](https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/10.11.0/tars/TensorRT-10.11.0.33.Linux.x86_64-gnu.cuda-12.9.tar.gz)
-   - [TensorRT 10.11.0.33 for CUDA 11.8, Windows x86_64](https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/10.11.0/zip/TensorRT-10.11.0.33.Windows.win10.cuda-11.8.zip)
-   - [TensorRT 10.11.0.33 for CUDA 12.9, Windows x86_64](https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/10.11.0/zip/TensorRT-10.11.0.33.Windows.win10.cuda-12.9.zip)
+   - [TensorRT 10.12.0.36 for CUDA 11.8, Linux x86_64](https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/10.12.0/tars/TensorRT-10.12.0.36.Linux.x86_64-gnu.cuda-11.8.tar.gz)
+   - [TensorRT 10.12.0.36 for CUDA 12.9, Linux x86_64](https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/10.12.0/tars/TensorRT-10.12.0.36.Linux.x86_64-gnu.cuda-12.9.tar.gz)
+   - [TensorRT 10.12.0.36 for CUDA 11.8, Windows x86_64](https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/10.12.0/zip/TensorRT-10.12.0.36.Windows.win10.cuda-11.8.zip)
+   - [TensorRT 10.12.0.36 for CUDA 12.9, Windows x86_64](https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/10.12.0/zip/TensorRT-10.12.0.36.Windows.win10.cuda-12.9.zip)
 
    **Example: Ubuntu 20.04 on x86-64 with cuda-12.9**
 
    ```bash
    cd ~/Downloads
-   tar -xvzf TensorRT-10.11.0.33.Linux.x86_64-gnu.cuda-12.9.tar.gz
-   export TRT_LIBPATH=`pwd`/TensorRT-10.11.0.33
+   tar -xvzf TensorRT-10.12.0.36.Linux.x86_64-gnu.cuda-12.9.tar.gz
+   export TRT_LIBPATH=`pwd`/TensorRT-10.12.0.36
    ```
 
    **Example: Windows on x86-64 with cuda-12.9**
 
    ```powershell
-   Expand-Archive -Path TensorRT-10.11.0.33.Windows.win10.cuda-12.9.zip
-   $env:TRT_LIBPATH="$pwd\TensorRT-10.11.0.33\lib"
+   Expand-Archive -Path TensorRT-10.12.0.36.Windows.win10.cuda-12.9.zip
+   $env:TRT_LIBPATH="$pwd\TensorRT-10.12.0.36\lib"
    ```
 
 ## Setting Up The Build Environment
 
@@ -1 +1 @@
-10.11.0.33
+10.12.0.36
@@ -13,15 +13,38 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
+# \brief Converts an SM string (e.g. 86+abc) into the numeric SM version (e.g. 86).
+# \returns the sm in the name specified by OUT_VAR.
+function(get_numeric_sm SM OUT_VAR)
+    # Convert the SM string to a numeric value
+    if(${SM} MATCHES "^([0-9]+).*$")
+        set(${OUT_VAR} ${CMAKE_MATCH_1} PARENT_SCOPE)
+    else()
+        message(FATAL_ERROR "Invalid SM version: ${SM}")
+    endif()
+endfunction()
+
+# \brief Converts the CMAKE_CUDA_ARCHITECTURES list into a list of numeric SM values.
+# \returns the list in the name specified by OUT_VAR.
+function(get_all_numeric_sms OUT_VAR)
+    set(ALL_NUMERIC_SMS "")
+    foreach(SM IN LISTS CMAKE_CUDA_ARCHITECTURES)
+        get_numeric_sm(${SM} "SM")
+        list(APPEND ALL_NUMERIC_SMS ${SM})
+    endforeach()
+    set(${OUT_VAR} ${ALL_NUMERIC_SMS} PARENT_SCOPE)
+endfunction()
+
 # Certain cubins are binary compatible between different SM versions, so they are reused.
 # This function checks if a SM-named file should be compiled based on current SM enablement.
 # Specifically, the SM80 files are compiled if either 80, 86, or 89 are enabled.
 function(should_compile_kernel SM OUT_VAR)
-    # If the target SM is any of 80/86/89, we need to check if any of those are enabled in CMAKE_CUDA_ARCHITECTURES.
+    get_all_numeric_sms(__TRT_NUMERIC_CUDA_ARCHS)
+    # If the target SM is any of 80/86/89, we need to check if any of those are enabled in __TRT_NUMERIC_CUDA_ARCHS.
     if((${SM} EQUAL 80) OR (${SM} EQUAL 86) OR (${SM} EQUAL 89))
-        list(FIND CMAKE_CUDA_ARCHITECTURES 80 SM80_INDEX)
-        list(FIND CMAKE_CUDA_ARCHITECTURES 86 SM86_INDEX)
-        list(FIND CMAKE_CUDA_ARCHITECTURES 89 SM89_INDEX)
+        list(FIND __TRT_NUMERIC_CUDA_ARCHS 80 SM80_INDEX)
+        list(FIND __TRT_NUMERIC_CUDA_ARCHS 86 SM86_INDEX)
+        list(FIND __TRT_NUMERIC_CUDA_ARCHS 89 SM89_INDEX)
         if((NOT ${SM80_INDEX} EQUAL -1) OR
            (NOT ${SM86_INDEX} EQUAL -1) OR
            (NOT ${SM89_INDEX} EQUAL -1)
@@ -31,7 +54,7 @@ function(should_compile_kernel SM OUT_VAR)
             set(${OUT_VAR} FALSE PARENT_SCOPE)
         endif()
     else()
-        list(FIND CMAKE_CUDA_ARCHITECTURES ${SM} SM_INDEX)
+        list(FIND __TRT_NUMERIC_CUDA_ARCHS ${SM} SM_INDEX)
         if (NOT ${SM_INDEX} EQUAL -1)
             set(${OUT_VAR} TRUE PARENT_SCOPE)
         else()
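
The selection logic in this hunk can be restated in Python to make the intent easier to check. This is a sketch for illustration only: the names `numeric_sm` and `should_compile_kernel` mirror the CMake functions, but the Python signatures are assumptions, and the architecture strings in the usage example are made-up inputs.

```python
import re
from typing import Iterable

# Per the comment in the CMake module, SM80 cubins are reused for 86 and 89.
SM80_FAMILY = {80, 86, 89}


def numeric_sm(sm: str) -> int:
    """Strip any suffix from an SM string, e.g. '86+abc' -> 86."""
    match = re.match(r"^([0-9]+)", sm)
    if match is None:
        raise ValueError(f"Invalid SM version: {sm}")
    return int(match.group(1))


def should_compile_kernel(sm: int, cuda_architectures: Iterable[str]) -> bool:
    """Decide whether an SM-named kernel file should be compiled."""
    # Normalise every enabled architecture to its numeric part first.
    enabled = {numeric_sm(arch) for arch in cuda_architectures}
    if sm in SM80_FAMILY:
        # Any enabled member of the 80/86/89 family enables the SM80 files.
        return bool(enabled & SM80_FAMILY)
    return sm in enabled


if __name__ == "__main__":
    print(should_compile_kernel(80, ["89-real"]))
    print(should_compile_kernel(90, ["90a"]))
```

Before this change, an entry with a suffix would not match the plain numeric value in `list(FIND ...)`, which is the case the normalisation step addresses.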
 
@@ -70,6 +70,7 @@ def __init__(
         use_qat,
         use_sparsity,
         timing_cache,
+        distributive_independence = False,
         use_deprecated_plugins=False,
     ):
         with open(bert_config_path, "r") as f:
@@ -90,7 +91,7 @@ def __init__(
             self.use_sparsity = use_sparsity
             self.timing_cache = timing_cache
             self.use_deprecated_plugins = use_deprecated_plugins
-
+            self.distributive_independence = distributive_independence
 
 def set_tensor_name(tensor, prefix, name):
     tensor.name = prefix + name
@@ -338,7 +339,18 @@ def emb_layernorm(builder, network, config, weights_dict, builder_config, sequen
     # Specify profiles for the batch sizes we're interested in.
     # Make sure the profile also works for all sizes not covered by the previous profile.
 
-    if len(sequence_lengths) > 1 or len(batch_sizes) > 1:
+    # When distributive independence is enabled, only one profile can be used.
+    if config.distributive_independence:
+        max_batch_size = max(batch_sizes)
+        max_sequence_length = max(sequence_lengths)
+        profile = builder.create_optimization_profile()
+        min_shape = (1, max_sequence_length)
+        shape = (max_batch_size, max_sequence_length)
+        profile.set_shape("input_ids", min=min_shape, opt=shape, max=shape)
+        profile.set_shape("segment_ids", min=min_shape, opt=shape, max=shape)
+        profile.set_shape("input_mask", min=min_shape, opt=shape, max=shape)
+        builder_config.add_optimization_profile(profile)
+    elif len(sequence_lengths) > 1 or len(batch_sizes) > 1:
         for batch_size in sorted(batch_sizes):
             if len(sequence_lengths) == 1:
                 profile = builder.create_optimization_profile()
@@ -420,7 +432,8 @@ def build_engine(batch_sizes, workspace_size, sequence_lengths, config, weights_
 
         if verbose:
             builder_config.profiling_verbosity = trt.ProfilingVerbosity.DETAILED
-
+        if config.distributive_independence:
+            builder_config.set_flag(trt.BuilderFlag.DISTRIBUTIVE_INDEPENDENCE)
         if config.use_sparsity:
             TRT_LOGGER.log(TRT_LOGGER.INFO, "Setting sparsity flag on builder_config.")
             builder_config.set_flag(trt.BuilderFlag.SPARSE_WEIGHTS)
@@ -626,6 +639,13 @@ def main():
         help="Path to tensorrt build timing cache file, only available for tensorrt 8.0 and later (default: None)",
         required=False,
     )
+    parser.add_argument(
+        "--distributive_independence",
+        default=False,
+        action="store_true",
+        help="Enable TensorRT's distributive independence builder flag (default: false)",
+        required=False,
+    )
     parser.add_argument(
         "--verbose",
         action="store_true",
@@ -672,6 +692,7 @@ def main():
         args.int8 and args.onnx != None,
         args.sparse,
         args.timing_cache_file,
+        args.distributive_independence,
         args.use_deprecated_plugins,
     )
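
The single-profile branch added to `emb_layernorm` reduces to a small shape calculation, sketched here without TensorRT so it can be run anywhere. The helper name `single_profile_shapes` and the dictionary layout are assumptions made for illustration; the real code passes these tuples to `profile.set_shape`.

```python
def single_profile_shapes(batch_sizes, sequence_lengths):
    """Compute min/opt/max shapes for the one profile used when
    distributive independence is enabled (hypothetical helper)."""
    max_batch_size = max(batch_sizes)
    max_sequence_length = max(sequence_lengths)
    # Batch may vary from 1 to the maximum; sequence length is pinned to the maximum.
    min_shape = (1, max_sequence_length)
    shape = (max_batch_size, max_sequence_length)
    inputs = ("input_ids", "segment_ids", "input_mask")
    return {name: {"min": min_shape, "opt": shape, "max": shape} for name in inputs}


if __name__ == "__main__":
    for name, shapes in single_profile_shapes([1, 8], [128, 384]).items():
        print(name, shapes)
```

Note the trade-off this implies: shorter sequence lengths requested on the command line are not given their own profiles in this mode, so inputs would need to match the maximum sequence length.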
 
 
@@ -572,7 +572,7 @@ def main():
     parser.add_argument(
         "-w",
         "--workspace-size",
-	help="Workspace size in MiB for building the BERT engine (default: unlimited)",
+        help="Workspace size in MiB for building the BERT engine (default: unlimited)",
         type=int,
     )
     parser.add_argument(
 
@@ -49,7 +49,7 @@ onnx                1.15.0
 onnx-graphsurgeon   0.5.2
 onnxruntime         1.16.3
 polygraphy          0.49.9
-tensorrt            10.11.0.33
+tensorrt            10.12.0.36
 tokenizers          0.13.3
 torch               2.2.0
 transformers        4.42.2
@@ -154,6 +154,8 @@ python3 demo_controlnet.py "A beautiful bird with rainbow colors" --controlnet-t
 
 > NOTE: Currently only `--controlnet-type canny` is supported. `--input-image` must be a pre-processed image corresponding to `--controlnet-type canny`. If unspecified, a sample image will be downloaded.
 
+> NOTE: FP8 quantization (`--fp8`) is supported.
+
 ### Generate an image guided by a text prompt, and using specified LoRA model weight updates
 
 ```bash
@@ -208,10 +210,13 @@ Run the command below to generate an image using Stable Diffusion 3 and Stable D
 python3 demo_txt2img_sd3.py "A vibrant street wall covered in colorful graffiti, the centerpiece spells \"SD3 MEDIUM\", in a storm of colors" --version sd3 --hf-token=$HF_TOKEN
 
 # Stable Diffusion 3.5-medium
-python3 demo_txt2img_sd35.py "a beautiful photograph of Mt. Fuji during cherry blossom" --version=3.5-medium --denoising-steps=30 --guidance-scale 3.5 --hf-token=$HF_TOKEN
+python3 demo_txt2img_sd35.py "a beautiful photograph of Mt. Fuji during cherry blossom" --version=3.5-medium --denoising-steps=30 --guidance-scale 3.5 --hf-token=$HF_TOKEN --bf16
 
 # Stable Diffusion 3.5-large
-python3 demo_txt2img_sd35.py "a beautiful photograph of Mt. Fuji during cherry blossom" --version=3.5-large --denoising-steps=30 --guidance-scale 3.5 --hf-token=$HF_TOKEN
+python3 demo_txt2img_sd35.py "a beautiful photograph of Mt. Fuji during cherry blossom" --version=3.5-large --denoising-steps=30 --guidance-scale 3.5 --hf-token=$HF_TOKEN --bf16 --download-onnx-models
+
+# Stable Diffusion 3.5-large FP8
+python3 demo_txt2img_sd35.py "a beautiful photograph of Mt. Fuji during cherry blossom" --version=3.5-large --denoising-steps=30 --guidance-scale 3.5 --hf-token=$HF_TOKEN --fp8 --download-onnx-models --onnx-dir onnx_35_fp8/ --engine-dir engine_35_fp8/
 ```
 
 You can also specify an input image conditioning as shown below
@@ -225,6 +230,19 @@ python3 demo_txt2img_sd3.py "dog wearing a sweater and a blue collar" --version
 
 Note that a denoising percentage is applied to the number of denoising steps when an input image conditioning is provided. Its default value is 0.6. This parameter can be updated using `--denoising-percentage`.
 
+### Generate an image with Stable Diffusion v3.5-large with ControlNet guided by an image and a text prompt
+
+```bash
+# Depth
+python3 demo_controlnet_sd35.py "a photo of a man" --controlnet-type depth --hf-token=$HF_TOKEN --denoising-steps 40 --guidance-scale 4.5 --bf16
+
+# Canny
+python3 demo_controlnet_sd35.py "A Night time photo taken by Leica M11, portrait of a Japanese woman in a kimono, looking at the camera, Cherry blossoms" --controlnet-type canny --hf-token=$HF_TOKEN --denoising-steps 60 --guidance-scale 3.5 --bf16
+
+# Blur
+python3 demo_controlnet_sd35.py "generated ai art, a tiny, lost rubber ducky in an action shot close-up, surfing the humongous waves, inside the tube, in the style of Kelly Slater" --controlnet-type blur --hf-token=$HF_TOKEN --denoising-steps 60 --guidance-scale 3.5 --bf16
+```
+
 ### Generate a video guided by an initial image using Stable Video Diffusion
 
 Download the pre-exported ONNX model
@@ -442,3 +460,7 @@ Custom override paths to pre-built engine files can be provided using `--custom-
 - To accelerate engine building time use `--timing-cache <path to cache file>`. The cache file will be created if it does not already exist. Note that performance may degrade if cache files are used across multiple GPU targets. It is recommended to use timing caches only during development. To achieve the best performance in deployment, build engines without a timing cache.
 - Specify new directories for storing onnx and engine files when switching between versions, LoRAs, ControlNets, etc. This can be done using `--onnx-dir <new onnx dir>` and `--engine-dir <new engine dir>`.
 - Inference performance can be improved by enabling [CUDA graphs](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-graphs) using `--use-cuda-graph`. Enabling CUDA graphs requires fixed input shapes, so this flag must be combined with `--build-static-batch` and cannot be combined with `--build-dynamic-shape`.
+
+### Known Issues
+
+The Stable Diffusion XL pipeline is currently not supported on RTX 5090 due to memory constraints. This issue will be resolved in an upcoming release.