
RFC: Cythonize cuda.core while keeping it CUDA-agnostic #866

@leofang

Description


Today the majority of cuda.core is implemented in pure Python. As a result, over the past few months we've repeatedly run into microsecond-level overhead (ex: #739, #658). As much as I think this is premature optimization at this stage, I do hear the desire to keep performance competitive while staying productive.

This RFC outlines one such solution to address the performance concerns. Below are the critical requirements:

  1. cuda.core continues to support multiple CUDA major versions
  2. The installation UX (`pip install cuda-core`) stays unchanged
  3. Local development workflow is uninterrupted
  4. No user- or developer-visible breaking change is introduced (even though we're still in the experimental phase)

The critical question to answer is how we'll lower to Cython while having to build against both cuda.bindings 12.x & 13.x. Here are the steps, following the great work @dalcinl did for mpi4py v4.1.0 (to support both Open MPI and MPICH):

  1. We turn all Python modules from .py to .pyx and update the build system
    • We could consider having one mega .pyx with the others being .pxi files that are literal-included, similar to the mpi4py.MPI module
  2. We build cuda-core twice, once against CUDA & cuda-bindings 12.x, and once against 13.x
  3. We merge two generated wheels into a single one (script)
    • A runtime dispatching snippet should be injected into cuda/core/experimental/__init__.py to decide which extension module to load, based on the installed cuda-bindings major version

It is worth noting that Steps 2 and 3 only happen in the public CI, so as to meet Requirement 3 (for local development, neither internal nor external developers should need multiple CUDA versions installed).
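Conceptually, the wheel merge in Step 3 amounts to taking the union of the two built wheels' file trees: files shared between the two builds (the pure-Python layer) must be byte-identical, while the per-major extension modules simply coexist. A toy model of that invariant, with invented file paths and contents (a real merge script would also rewrite RECORD and wheel metadata):

```python
# Toy model of the wheel-merge invariant (Step 3): files present in both
# wheels must be identical; files unique to one wheel are carried over.
# Real wheel merging also rewrites RECORD/metadata, omitted here.
def merge_manifests(cu12: dict[str, bytes], cu13: dict[str, bytes]) -> dict[str, bytes]:
    merged = dict(cu12)
    for path, data in cu13.items():
        if path in merged and merged[path] != data:
            raise ValueError(f"conflicting contents for shared file: {path}")
        merged[path] = data
    return merged
```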

Another note: this RFC only applies to keeping our Python wheels variant-free (no -cu12/-cu13 suffix); for conda packages, it is trivial to build variant packages without changing the UX (`conda install cuda-core`), so no extra work is needed.

This RFC also mirrors our plan for cuda-cccl (NVIDIA/cccl#2555).

Metadata

Labels

P0 (High priority - Must do!), RFC (Plans and announcements), cuda.core (Everything related to the cuda.core module), enhancement (Any code-related improvements), packaging (Anything related to wheels or Conda packages)
