
RFC: Cythonize cuda.core while keeping it CUDA-agnostic #866

@leofang

Description


Today the majority of cuda.core is implemented in pure Python. As a result, over the past few months we've repeatedly run into microsecond-level overhead (ex: #739, #658). As much as I think this is premature optimization at this stage, I do hear the desire to keep performance competitive while staying productive.

This RFC outlines one such solution to address the performance concerns. Below are the critical requirements:

  1. cuda.core continues to support multiple CUDA major versions
  2. The installation UX (`pip install cuda-core`) stays unchanged
  3. Local development workflow is uninterrupted
  4. No user- or developer-visible breaking change is introduced (even though we're still in the experimental phase)

The critical question to answer is how we'll lower to Cython while having to build against both cuda.bindings 12.x & 13.x. Here are the steps, following the great work @dalcinl did for mpi4py v4.1.0 (to support both Open MPI and MPICH):

  1. We turn all Python modules from .py to .pyx and update the build system
    • We could consider having one mega .pyx with the others being .pxi files that are literal-included, similar to the mpi4py.MPI module
  2. We build cuda-core twice, once against CUDA & cuda-bindings 12.x, and once against 13.x
  3. We merge two generated wheels into a single one (script)
    • A runtime dispatching snippet should be injected into cuda/core/experimental/__init__.py to decide which extension module to load, based on the installed cuda-bindings major version

It is worth noting that Steps 2 and 3 only happen in the public CI, so as to meet Requirement 3 (for local development, neither internal nor external developers should need multiple CUDA versions installed).
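Conceptually, the wheel merge in Step 3 amounts to taking the union of the two built wheels' file trees: files shared between the two builds (the pure-Python layer) must be byte-identical, while the per-major extension modules simply coexist. A toy model of that invariant, with invented file paths and contents (a real merge script would also rewrite RECORD and wheel metadata):

```python
# Toy model of the wheel-merge invariant (Step 3): files present in both
# wheels must be identical; files unique to one wheel are carried over.
# Real wheel merging also rewrites RECORD/metadata, omitted here.
def merge_manifests(cu12: dict[str, bytes], cu13: dict[str, bytes]) -> dict[str, bytes]:
    merged = dict(cu12)
    for path, data in cu13.items():
        if path in merged and merged[path] != data:
            raise ValueError(f"conflicting contents for shared file: {path}")
        merged[path] = data
    return merged
```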

Another note: this RFC only applies to keeping our Python wheels variant-free (no -cu12/-cu13 suffix); for conda packages, it is trivial to build variant packages without changing the UX (`conda install cuda-core`), so no extra work is needed.

This RFC also mirrors our plan for cuda-cccl (NVIDIA/cccl#2555).

Metadata

Labels

P0 (High priority - Must do!), RFC (Plans and announcements), cuda.core (Everything related to the cuda.core module), enhancement (Any code-related improvements), packaging (Anything related to wheels or Conda packages)
