Releases: NVIDIA/cccl
python-0.3.3
These are the release notes for the cuda-cccl Python package version 0.3.3, dated October 21st, 2025. The previous release was v0.3.2.
cuda.cccl is in "experimental" status, meaning that its API and feature set can change quite rapidly.
Installation
Please refer to the install instructions here
Features and improvements
- This is the first release that features Windows wheels published to PyPI. You can now `pip install cuda-cccl[cu12]` or `pip install cuda-cccl[cu13]` on Windows for Python versions 3.10, 3.11, 3.12, and 3.13.
Bug Fixes
Breaking Changes
python-0.3.2
These are the release notes for the cuda-cccl Python package version 0.3.2, dated October 17th, 2025. The previous release was v0.3.1.
cuda.cccl is in "experimental" status, meaning that its API and feature set can change quite rapidly.
Installation
Please refer to the install instructions here
Features and improvements
- Allow passing in a device array or `None` as the initial value in scan.
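A minimal sketch of the two initial-value options, assuming CuPy for device arrays and the renamed `cuda.compute` package from 0.3.1; the exact scan signatures below are assumptions modeled on the `reduce_into` example shown later in these notes:

```python
import cupy as cp
import numpy as np
import cuda.compute as compute

def add_op(a, b):
    return a + b

d_input = cp.arange(8, dtype=np.int32)
d_output = cp.empty_like(d_input)

# The initial value may now be a device array ...
d_init = cp.zeros(1, dtype=np.int32)
compute.exclusive_scan(d_input, d_output, add_op, d_input.size, d_init)

# ... or None, in which case the scan starts from the first input element.
compute.inclusive_scan(d_input, d_output, add_op, d_input.size, None)
```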
Bug Fixes
Breaking Changes
v3.1.0
What's Changed
🚀 Thrust / CUB
- [Thrust] Perform asynchronous allocations by default for the `par_nosync` policy by @brycelelbach in #4204
- [Thrust] `reduce_into` by @brycelelbach in #4355
- Enable Catch2 tests in Thrust by @bernhardmgruber in #2669
- Add memcpy_async transform kernel for Ampere by @bernhardmgruber in #2394
- Allow default-initializing and skipping initialization of Thrust vectors by @bernhardmgruber in #4183
- Add thrust::strided_iterator and a step for thrust::counting_iterator by @bernhardmgruber in #4014
- Add new WarpReduce overloadings by @fbusato in #3884
- Optimize ThreadReduce by @fbusato in #3441
📚 Libcudacxx
- Enable device assertions in CUDA debug mode `nvcc -G` by @fbusato in #4444
- avoid EDG bug by moving diagnostic push & pop out of templates by @ericniebler in #4416
- Add host/device/managed mdspan and accessors by @fbusato in #3686
- Add cuda::ptx::elect.sync by @fbusato in #4445
- Add pointer utilities cuda::is_aligned, cuda::align_up, cuda::align_down, cuda::ptr_rebind by @fbusato in #5037
- Add cuda::ceil_ilog2 by @fbusato in #4485
- Add cuda::is_power_of_two, cuda::next_power_of_two, cuda::prev_power_of_two by @fbusato in #4627
- Add cuda::device::warp_match_all by @fbusato in #4746
- Add cuda::static_for by @fbusato in #4855
- Improve/cleanup cuda::annotated_ptr implementation by @fbusato in #4503
- Add cuda::fast_mod_div Fast Modulo Division by @fbusato in #5210
📝 Documentation
- Making extended API documentation slightly more uniform by @fbusato in #4965
- Add memory space note to `cuda::memory` documentation by @fbusato in #5151
- Better specify `lane_mask::all_active()` behavior by @fbusato in #5183
🔄 Other Changes
- [CUDAX] Add universal comparison across memory resources by @pciolkosz in #4168
- Implement `ranges::range_adaptor` by @miscco in #4066
- Avoiding looping over problem size in individual tests by @oleksandr-pavlyk in #4140
- Replace CUB `util_arch.cuh` macros with `inline constexpr` variables by @fbusato in #4165
- Improves test times for `DeviceSegmentedRadixSort` by @elstehle in #4156
- Simplify Thrust iterator functions by @bernhardmgruber in #4178
- Remove `_LIBCUDACXX_UNUSED_VAR` by @davebayer in #4174
- Remove `_CCCL_NO_IF_CONSTEXPR` by @davebayer in #4187
- Implement `__fp_native_type_t` by @davebayer in #4173
- Adds support for large number of segments and large number of items to `DeviceSegmentedRadixSort` by @elstehle in #3402
- Implement inclusive scan in cuda.parallel by @NaderAlAwar in #4147
- Remove `_CCCL_NO_NOEXCEPT_FUNCTION_TYPE` by @davebayer in #4190
- Fix `not_fn` by @miscco in #4186
- Remove `_CCCL_NTTP_AUTO` by @davebayer in #4191
- Avoid instantiating discard_iterator while parsing by @bernhardmgruber in #4180
- Host/Device accessors for `mdspan` by @fbusato in #3686
- Remove `_CCCL_NO_DEDUCTION_GUIDES` by @davebayer in #4188
- Set NO_CMAKE_FIND_ROOT_PATH for cudax. by @bdice in #4162
- Fix build breaking with setuptools by @miscco in #4212
- Replaces remaining uses of `thrust::{host,device}_vector` in our Catch2 tests by @elstehle in #4205
- Add check that CXX + CUDA_HOST compilers match when necessary. by @alliepiper in #4201
- Disable test on 12.0 CTK by @miscco in #4214
- Implement fp properties by @davebayer in #4213
- [CUDAX] Separate non-async pinned memory resource into legacy_pinned_memory_resource by @pciolkosz in #4179
- Avoid errors in `get_device_address` tests by @miscco in #4209
- Implement extended fp traits by @davebayer in #4211
- Remove `_CCCL_INLINE_VAR` by @davebayer in #4192
- Improve host/device mdspan documentation by @fbusato in #4220
- Drop `_LIBCUDACXX_BEGIN_NAMESPACE_RANGES_ABI` by @miscco in #4210
- Fix C++ version used in CONTRIBUTING.md by @bernhardmgruber in #4224
- Extend tuning documentation by @bernhardmgruber in #4184
- Drop tuning params for benchmarks with custom ops by @bernhardmgruber in #4176
- Make compiler version comparisons safer by @davebayer in #4185
- Document python packages for sol plot script by @bernhardmgruber in #4228
- Remove `_CCCL_NO_FOLD_EXPRESSIONS` by @davebayer in #4189
- Remove python/cuda_cooperative/setup.py by @rwgk in #4221
- Allow cuda::par*.on() to take cuda::stream_ref by @bernhardmgruber in #4225
- Drop `_CCCL_NO_VARIABLE_TEMPLATES` by @miscco in #4229
- Fix typos in cuda mdspan documentation by @fbusato in #4231
- Simplify Thrust assign_value by @bernhardmgruber in #4227
- Remove double underscore limit macros by @davebayer in #4194
- Document deprecations from #4165 by @bernhardmgruber in #4237
- Implement `__fp_is_subset` trait by @davebayer in #4230
- Extend tuning verification docs by @bernhardmgruber in #4236
- Use `[[maybe_unused]]` in whole cccl by @davebayer in #4207
- Move implementation of `cuda::std::array` to libcu++ by @davebayer in #4239
- Implement `__cccl_fp` class by @davebayer in #4238
- Add transform c parallel implementation by @shwina in #4048
- Drop duplicated system header blocks by @miscco in #4245
- Exclude sm101 from RDC testing. by @alliepiper in #4247
- Make `cuda::stream_ref` constructible on device by @miscco in #4243
- Fix logic in test_segmented_reduce by @oleksandr-pavlyk in #4198
- Add new `WarpReduce` overloadings by @fbusato in #3884
- Fix construction of host init value in test_reduce made incorrect after refactoring by @oleksandr-pavlyk in #4251
- Refac...
python-0.3.1
These are the release notes for the cuda-cccl Python package version 0.3.1, dated October 8th, 2025. The previous release was v0.3.0.
cuda.cccl is in "experimental" status, meaning that its API and feature set can change quite rapidly.
Installation
Please refer to the install instructions here
Features and improvements
- The `cuda.cccl.parallel.experimental` package has been renamed to `cuda.compute`.
- The `cuda.cccl.cooperative.experimental` package has been renamed to `cuda.coop`.
- The old imports will continue to work for now, but will be removed in a subsequent release.
- Documentation at https://nvidia.github.io/cccl/python/ has been updated to reflect these changes.
Bug Fixes
Breaking Changes
- If you were previously importing subpackages of `cuda.cccl.parallel.experimental` or `cuda.cccl.cooperative.experimental`, those imports may not work as expected. Please import from `cuda.compute` and `cuda.coop` respectively.
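For example, updating imports is a one-line change per package (the old spellings still work for now):

```python
# New package names (0.3.1 and later)
import cuda.compute as compute  # formerly cuda.cccl.parallel.experimental
import cuda.coop as coop        # formerly cuda.cccl.cooperative.experimental

# Deprecated imports, slated for removal in a subsequent release:
# import cuda.cccl.parallel.experimental as compute
# import cuda.cccl.cooperative.experimental as coop
```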
v3.0.3
What's Changed
🔄 Other Changes
- Backport #5442 to branch/3.0x by @shwina in #5469
- Backport to 3.0: Fix grid dependency sync in cub::DeviceMergeSort (#5456) by @bernhardmgruber in #5461
- Partial backport to 3.0: Fix SMEM alignment in DeviceTransform by @bernhardmgruber in #5463
- [Version] Update branch/3.0.x to v3.0.3 by @github-actions[bot] in #5502
- [Backport branch/3.0.x] NV_TARGET and cuda::ptx for CTK 13 by @fbusato in #5481
- [BACKPORT 3.0]: Update PTX ISA version for CUDA 13 (#5676) by @miscco in #5700
- Backport some MSVC test fixes to 3.0 by @miscco in #5819
- [Backport 3.0]: Work around `submdspan` compiler issue on MSVC (#5885) by @miscco in #5903
- Backport pin of llvmlite dependency to branch/3.0x by @shwina in #6000
- [Backport branch/3.0.x] Ensure that we are actually calling the cuda APIs ... (#4570) by @davebayer in #6098
- [Backport to 3.0] add a specialization of `__make_tuple_types` for `complex<T>` (#6102) by @davebayer in #6117
- [Backport 3.0.x] Use proper qualification in allocate.h (#4796) by @wmaxey in #6126
Full Changelog: v3.0.2...v3.0.3
CCCL Python Libraries (v0.3.0)
These are the release notes for the cuda-cccl Python package version 0.3.0, dated October 2nd, 2025. The previous release was v0.2.1.
cuda.cccl is in "experimental" status, meaning that its API and feature set can change quite rapidly.
Installation
Please refer to the install instructions here
Features and improvements
- **ARM64 wheel and conda package support**: Installation via `pip` and `conda` is now supported on ARM64 (aarch64) architecture.
- **New algorithm: three-way partitioning**: The `three_way_partition` algorithm enables partitioning an array (or iterator) into three partitions, given two selection operators; see the sketch after this list.
- **Improved scan performance**: The `inclusive_scan` and `exclusive_scan` APIs provide improved performance by automatically selecting the optimal tuning for the input data types and device architecture.
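A hedged sketch of three-way partitioning with CuPy arrays; the argument names and order below are assumptions modeled on the corresponding CUB `DevicePartition` interface, not a confirmed signature:

```python
import cupy as cp
import numpy as np
import cuda.cccl.parallel.experimental as parallel

def less_than_four(x):
    return x < 4

def at_least_seven(x):
    return x >= 7

d_in = cp.arange(10, dtype=np.int32)
d_first = cp.empty_like(d_in)        # items satisfying less_than_four
d_second = cp.empty_like(d_in)       # items satisfying at_least_seven
d_unselected = cp.empty_like(d_in)   # everything else
d_num_selected = cp.empty(2, dtype=np.int64)  # sizes of the first two partitions

parallel.three_way_partition(
    d_in, d_first, d_second, d_unselected, d_num_selected,
    less_than_four, at_least_seven, d_in.size,
)
```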
Bug Fixes
None.
Breaking Changes
None.
CCCL Python Libraries v0.1.3.2.0.dev128 (pre-release)
These are the changes in the cuda.cccl libraries introduced in the pre-release 0.1.3.2.0.dev128, dated August 14th, 2025.
cuda.cccl is in "experimental" status, meaning that its API and feature set can change quite rapidly.
Major API improvements
Single-call APIs in cuda.cccl.parallel
Previously, performing an operation like `reduce_into` required four API invocations to
(1) create a reducer object, (2) compute the amount of temporary storage required for the reduction,
(3) allocate the required amount of temporary memory, and (4) perform the reduction.
In this version, cuda.cccl.parallel introduces simpler, single-call APIs. For example, reduction looks like:
```python
# New API - single function call with automatic temp storage
parallel.reduce_into(d_input, d_output, add_op, num_items, h_init)
```

If you wish to have more control over temporary memory allocation,
the previous API still exists (and always will). It has been renamed from `reduce_into` to `make_reduce_into`:
```python
# Object API
reducer = parallel.make_reduce_into(d_input, d_output, add_op, h_init)
temp_storage_size = reducer(None, d_input, d_output, num_items, h_init)
temp_storage = cp.empty(temp_storage_size, dtype=np.uint8)
reducer(temp_storage, d_input, d_output, num_items, h_init)
```

New algorithms
Device-wide histogram
The `histogram_even` function provides Python exposure of the corresponding CUB C++ API `DeviceHistogram::HistogramEven`.
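A hedged sketch of binning samples into evenly spaced bins; the parameter order below is an assumption mirroring `DeviceHistogram::HistogramEven`:

```python
import cupy as cp
import numpy as np
import cuda.cccl.parallel.experimental as parallel

d_samples = cp.random.uniform(0.0, 10.0, size=1000).astype(np.float32)
d_histogram = cp.zeros(5, dtype=np.int32)  # counts for 5 evenly spaced bins

parallel.histogram_even(
    d_samples,       # input samples
    d_histogram,     # output bin counts
    6,               # num_levels: bin boundaries (number of bins + 1)
    0.0,             # lower bound of the first bin
    10.0,            # upper bound of the last bin
    d_samples.size,  # number of samples
)
```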
StripedToBlock exchange
cuda.cccl.cooperative adds a `block.exchange` API providing Python exposure of the corresponding CUB C++ API `BlockExchange`.
Currently, only the StripedToBlock exchange pattern is supported.
Infrastructure improvements
CuPy dependency replaced with cuda.core
Use of CuPy within the library has been replaced with the lighter-weight cuda.core
package. This means that installing cuda.cccl won't install CuPy as a dependency.
Support for CUDA 13 drivers
cuda.cccl can be used with CUDA 13-compatible drivers. However, the CUDA 13 toolkit (runtime and libraries) is not
yet supported, meaning you still need the CUDA 12 toolkit. Full support for the CUDA 13 toolkit is planned for the next
pre-release.
v3.0.2
What's Changed
🔄 Other Changes
- [Version] Update branch/3.0.x to v3.0.2 by @github-actions[bot] in #5348
- Backport to 3.0: Add a macro to disable PDL (#5316) by @bernhardmgruber in #5330
- [Backport branch/3.0x] Add gitlab devcontainers (#5325) by @wmaxey in #5352
Full Changelog: v3.0.1...v3.0.2
v3.0.1
What's Changed
🔄 Other Changes
- [Version] Update branch/3.0.x to v3.0.1 by @github-actions[bot] in #5256
- [Backport branch/3.0.x] Disable assertions for QNX, they do not provide the machinery with their libc by @github-actions[bot] in #5258
- [BACKPORT 3.0] Make sure that nested `tuple` and `pair` have the expected size (#5246) by @miscco in #5265
- [BACKPORT] Add missed specializations of the new aligned vector types to cub (#5264) by @miscco in #5271
- [BACKPORT 3.0] Backport diagnostic suppression machinery by @miscco in #5281
Full Changelog: v3.0.0...v3.0.1
v3.0.0
CCCL 3.0 Release
The 3.0 release of the CUDA Core Compute Libraries (CCCL) marks our first major version since unifying the Thrust, CUB, and libcudacxx libraries under a single repository. This release reflects over a year of work focused on cleanup, consolidation, and modernizing the codebase to support future growth.
While this release includes a number of breaking changes, many involve the consolidation of APIs—particularly in the thrust:: and cub:: namespaces—as well as cleanup of internal details that were never intended for public use. In many cases, redundant functionality from thrust:: or cub:: has been replaced with equivalent or improved abstractions from the cuda:: or cuda::std:: namespaces. Impact should be minimal for most users. For full details and recommended migration steps, please consult the CCCL 2.x to 3.0 Migration Guide.
Key Changes in CCCL 3.0
Requirements
- C++17 or newer is now required (support for C++11 and C++14 has been dropped #3255)
- CUDA Toolkit 12.0+ is now required (support for CTK 11.0+ has been dropped). For details on version compatibility, see the README.
- Compilers:
- Dropped support for
Header Directory Changes in CUDA Toolkit 13.0
CCCL 3.0 will be included with an upcoming CUDA Toolkit 13.0 release. In this release, the bundled CCCL headers have moved to new top-level directories under ${CTK_ROOT}/include/cccl/.
| Before CUDA 13.0 | After CUDA 13.0 |
|---|---|
| `${CTK_ROOT}/include/cuda/` | `${CTK_ROOT}/include/cccl/cuda/` |
| `${CTK_ROOT}/include/cub/` | `${CTK_ROOT}/include/cccl/cub/` |
| `${CTK_ROOT}/include/thrust/` | `${CTK_ROOT}/include/cccl/thrust/` |
These changes only affect the on-disk location of CCCL headers within the CUDA Toolkit installation.
What you need to know
- ❌ Do NOT write `#include <cccl/...>` — this will break.
- If using CCCL headers only in files compiled with nvcc:
  - ✅ No action needed. This is the default for most users.
- If using CCCL headers in files compiled exclusively with a host compiler (e.g., GCC, Clang, MSVC):
  - Using CMake and linking `CCCL::CCCL`:
    - ✅ No action needed. (This is the recommended path. See example)
  - Other build systems:
    - ⚠️ Add `${CTK_ROOT}/include/cccl` to your compiler’s include search path (e.g., with `-I`)
These changes prevent issues when mixing CCCL headers bundled with the CUDA Toolkit and those from external package managers. For more detail, see the CCCL 2.x to 3.0 Migration Guide.
Major API Changes
Hundreds of macros, internal types, and implementation details were removed or relocated to internal namespaces. This significantly reduces surface area and eliminates long-standing technical debt, improving both compile times and maintainability.
Removed Macros
Over 50 legacy macros have been removed in favor of modern C++ alternatives:
- `CUB_{MIN,MAX}`: use `cuda::std::{min,max}` instead #3821
- `THRUST_NODISCARD`: use `[[nodiscard]]` instead #3746
- `THRUST_INLINE_CONSTANT`: use `inline constexpr` instead #3746
- See CCCL 2.x to 3.0 Migration Guide for complete list
Removed Functions and Classes
- `thrust::optional`: use `cuda::std::optional` instead #4172
- `thrust::tuple`: use `cuda::std::tuple` instead #2395
- `thrust::pair`: use `cuda::std::pair` instead #2395
- `thrust::numeric_limits`: use `cuda::std::numeric_limits` instead #3366
- `cub::BFE`: use `cuda::bitfield_insert` and `cuda::bitfield_extract` instead #4031
- `cub::ConstantInputIterator`: use `thrust::constant_iterator` instead #3831
- `cub::CountingInputIterator`: use `thrust::counting_iterator` instead #3831
- `cub::GridBarrier`: use cooperative groups instead #3745
- `cub::DeviceSpmv`: use cuSPARSE instead #3320
- `cub::Mutex`: use `cuda::std::mutex` instead #3251
- See CCCL 2.x to 3.0 Migration Guide for complete list
New Features
C++
cuda::
- `cuda::std::numeric_limits` now supports `__float128` #4059
- `cuda::std::optional<T&>` implementation (P2988) #3631
- `cuda::std::numbers` header for mathematical constants #3355
- NVFP8/6/4 extended floating-point types support in `<cuda/std/cmath>` #3843
- `cuda::overflow_cast` for safe numeric conversions #4151
- `cuda::ilog2` and `cuda::ilog10` integer logarithms #4100
- `cuda::round_up` and `cuda::round_down` utilities #3234
cub::
- `cub::DeviceSegmentedReduce` now supports a large number of segments #3746
- `cub::DeviceCopy::Batched` now supports a large number of buffers #4129
- `cub::DeviceMemcpy::Batched` now supports a large number of buffers #4065
thrust::
- New `thrust::offset_iterator` iterator #4073
- Temporary storage allocations in parallel algorithms now respect `par_nosync` #4204
Python
CUDA Python Core Libraries are now available on PyPI through the cuda-cccl package.
```
pip install cuda-cccl
```
cuda.cccl.cooperative
- Block-level sorting now supports multi-dimensional thread blocks #4035, #4028
- Block-level data movement now supports multi-dimensional thread blocks #3161
- New block-level inclusive sum algorithm #3921
cuda.cccl.parallel
- New device-level segmented-reduce algorithm #3906
- New device-level unique-by-key algorithm #3947
- New device-level merge-sort algorithm #3763
What's Changed
🚀 Thrust / CUB
- Drop cub::Mutex by @bernhardmgruber in #3251
- Remove legacy macros from CUB util_arch.cuh by @bernhardmgruber in #3257
- Remove thrust::[unary|binary]_traits by @bernhardmgruber in #3260
- Drop thrust not1 and not2 by @bernhardmgruber in #3264
- Deprecate GridBarrier and GridBarrierLifetime by @bernhardmgruber in #3258
- Drop thrust::[unary|binary]_function by @bernhardmgruber in #3274
- Enable thrust::identity test for non-MSVC by @bernhardmgruber in #3281
- Enable PDL in triple chevron launch by @bernhardmgruber in #3282
- Drop Thrust legacy arch macros by @bernhardmgruber in #3298
- Drop Thrust's compiler_fence.h by @bernhardmgruber in #3300
- Drop CUB's util_compiler.cuh by @bernhardmgruber in #3302
- Drop Thrust's deprecated compiler macros by @bernhardmgruber in #3301
- Drop CUB_RUNTIME_ENABLED and THRUST_HAS_CUDART by @bernhardmgruber in #3305
- Require C++17 for compiling Thrust and CUB by @bernhardmgruber in #3255
- Deprecate Thrust's cpp_compatibility.h macros by @bernhardmgruber in #3299
- Deprecate cub::IterateThreadStore by @bernhardmgruber in #3337
- Drop CUB's BinaryFlip operator by @bernhardmgruber in #3332
- Deprecate cub::Swap by @bernhardmgruber in #3333
- Drop CUB APIs with a debug_synchronous parameter by ...