Today the majority of `cuda.core` is implemented in pure Python. As a result, we've been dealing with microsecond-level overheads over the past few months, if not weeks (ex: #739, #658). As much as I think it is premature optimization at this stage, I do hear the desire to keep the performance competitive while staying productive.
This RFC outlines one such solution to address the performance concerns. Below are the critical requirements:

1. `cuda.core` continues to support multiple CUDA major versions
2. The installation UX (`pip install cuda-core`) stays unchanged
3. The local development workflow is uninterrupted
4. No user- or developer-visible breaking change is introduced (even though we're still in the experimental phase)
The critical question to answer is how we'll lower to Cython while having to build against both `cuda.bindings` 12.x & 13.x. Here are the steps, following the great work @dalcinl did for mpi4py v4.1.0 (to support both Open MPI and MPICH):

1. We turn all Python modules from `.py` to `.pyx` and update the build system
   - We could consider having one mega `.pyx` with the others being `.pyi` files that are literal-included, similar to the `mpi4py.MPI` module
2. We build `cuda-core` twice, once against CUDA & `cuda-bindings` 12.x, and then against 13.x
3. We merge the two generated wheels into a single one (script)
4. A runtime dispatching snippet is injected into `cuda/core/experimental/__init__.py` to decide which extension module to load, based on the installed `cuda-bindings` major version
It is worth noting that Steps 2 and 3 only happen in the public CI, so as to meet Requirement 3 (for local development, neither internal nor external developers should need to have multiple CUDA versions installed).
Another note is that this RFC only applies to keeping our Python wheels variant-free (no `-cu12`/`-cu13` suffixes); for conda packages, it is trivial to build variant packages without changing the UX (`conda install cuda-core`), so no extra work is needed.
This RFC also mirrors our plan for `cuda-cccl` (NVIDIA/cccl#2555).