Conversation

@Linux-cpp-lisp

(Thanks to @springer13 for making your work on a PyTorch cuTENSOR wrapper public!)

Currently, the Python cuTENSOR wrapper always uses TensorFloat32 as the compute dtype for 32-bit float tensors, which is unsupported on non-Ampere GPUs. This PR uses the PyTorch and TensorFlow configuration options for TensorFloat32 (which autodetect Ampere) to set the compute dtype to standard 32-bit float when TF32 is not supported.
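
The build-time selection described above can be reduced to a small, testable helper; this is a minimal sketch, not the PR's actual code. `tf32_nvcc_defines` is a hypothetical name, and the `allow_tf32` flag stands in for `torch.backends.cuda.matmul.allow_tf32` or `tf.config.experimental.tensor_float_32_execution_enabled()` read at build time:

```python
def tf32_nvcc_defines(allow_tf32: bool) -> list:
    """Translate the framework-level TF32 setting into nvcc defines.

    `allow_tf32` would come from the PyTorch or TensorFlow TF32
    configuration option at build time; it is passed in directly here
    so the selection logic stays testable without either framework.
    """
    # CUTENSOR_USE_TF32 is the macro the wrapper's C++ sources check;
    # when it is absent, the compute dtype falls back to plain FP32.
    return ["-DCUTENSOR_USE_TF32"] if allow_tf32 else []
```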

@Linux-cpp-lisp Linux-cpp-lisp changed the title Automatically enable/disable TensorFloat32 [cuTENSOR] Automatically enable/disable TensorFloat32 Jul 12, 2021
@JanuszL JanuszL closed this Oct 20, 2025
@JanuszL JanuszL reopened this Oct 20, 2025

@greptile-apps greptile-apps bot left a comment


Greptile Overview

Greptile Summary

This PR adds a new Python package for cuTENSOR with PyTorch and TensorFlow bindings, implementing automatic TensorFloat32 (TF32) detection based on GPU architecture. The core improvement queries torch.backends.cuda.matmul.allow_tf32 and tf.config.experimental.tensor_float_32_execution_enabled() at build time to conditionally define the CUTENSOR_USE_TF32 macro, ensuring the wrapper falls back to standard FP32 on non-Ampere GPUs. The PR also updates several C++ cuTENSOR examples to use CUTENSOR_COMPUTE_32F instead of CUTENSOR_R_MIN_32F, removing hardcoded TF32 assumptions and making examples universally compatible. The Python package includes full build infrastructure (setup.py, MANIFEST.in, package metadata), comprehensive test suites for both frameworks, C++ extension kernels with shape inference, and utility functions for Einstein notation parsing. Additionally, build configuration is updated to support Windows and add the new einsum example target.

Changed Files
Filename Score Overview
.gitignore 5/5 Adds root-level .gitignore to exclude .vscode/ directory from version control
cuTENSOR/python/cutensor/__init__.py 4/5 New package init file importing metadata from package_info module
cuTENSOR/python/MANIFEST.in 5/5 Adds directive to include header files in source distribution
cuTENSOR/python/einsum.h 0/5 Empty file content provided - cannot review
cuTENSOR/elementwise_permute.cu 5/5 Removes unused argc/argv parameters from main function signature
cuTENSOR/python/cutensor/tensorflow/__init__.py 4/5 New TensorFlow wrapper init exposing einsum function; shebang has incorrect spacing
cuTENSOR/elementwise_trinary.cu 5/5 Removes unused command-line parameters from main function
cuTENSOR/elementwise_binary.cu 5/5 Removes unused argc/argv parameters from main function
cuTENSOR/python/cutensor/torch/__init__.py 4/5 New PyTorch wrapper init exposing einsum functionality
cuTENSOR/python/README.md 2/5 Adds documentation with critical syntax errors: malformed markdown link and missing colon in function definition
cuTENSOR/README.md 5/5 New README with Linux/Windows build instructions; minor formatting issues
cuTENSOR/Makefile 4/5 Adds einsum target and switches to CUTENSOR_ROOT env var; einsum missing from clean target
cuTENSOR/python/.gitignore 5/5 Comprehensive Python .gitignore covering build artifacts and development environments
cuTENSOR/python/cutensor/c_extensions.py 4/5 Defines extension build configuration for PyTorch and TensorFlow; shebang spacing issue
cuTENSOR/reduction.cu 5/5 Changes compute type from TF32 to standard FP32; removes unused main parameters
cuTENSOR/python/cutensor/package_info.py 4/5 Adds package metadata; download URL points to non-existent repository path
cuTENSOR/python/cutensor/tensorflow/einsum_module.cc 2/5 Creates Python C extension module; incorrectly calls Py_Initialize() in module init
cuTENSOR/contraction.cu 5/5 Switches from TF32 to FP32 compute type; removes unused main parameters
cuTENSOR/CMakeLists.txt 1/5 Adds Windows support and cublasLt dependency; critical bug with CUTENSOR_LIBRARY_DEF undefined on non-Windows
cuTENSOR/contraction_simple.cu 5/5 Changes compute type to FP32 and makes algorithm configurable; removes unused parameters
cuTENSOR/python/cutensor/torch/einsum.py 3/5 Implements PyTorch einsum with autograd; lacks validation for empty/malformed einsum paths
cuTENSOR/einsum.cu 4/5 New einsum example with templated implementation; minor memory allocation inefficiency
cuTENSOR/python/cutensor/torch/einsum_test.py 4/5 Comprehensive PyTorch tests; contains redundant self-assignments
cuTENSOR/python/cutensor/tensorflow/einsum_kernel.cc 4/5 TensorFlow kernel implementation; workspace size calculation may truncate
cuTENSOR/python/cutensor/tensorflow/einsum_test.py 4/5 Comprehensive TensorFlow tests; shebang syntax error
cuTENSOR/python/cutensor/common.py 4/5 Einsum subscript normalization utility; typo in exception message and missing input validation
cuTENSOR/python/cutensor/torch/einsum.cc 3/5 PyTorch C++ bindings; TF32 conditional logic for float32 not visible in this file
cuTENSOR/contraction_plan_cache.cu 4/5 New plan caching example; missing error checks on cudaMemcpy and malloc null check
cuTENSOR/python/cutensor/tensorflow/einsum.py 3/5 TensorFlow einsum wrapper; glob unpacking will crash if binding file count != 1
cuTENSOR/python/cutensor/c_extensions_utils.py 2/5 Build utilities with TF32 detection; missing nvcc null check and return statement in Tensorflow method
cuTENSOR/python/setup.py 2/5 Package setup script; imports from cutensor package before installation will cause ImportError
cuTENSOR/python/cutensor/tensorflow/einsum_ops.cc 2/5 TensorFlow op registration; uses assert() which causes undefined behavior in release builds

Confidence score: 2/5

  • This PR requires careful review and testing before merging due to multiple critical bugs that will cause runtime failures
  • Score reflects serious issues: CMake configuration bug on Linux (CUTENSOR_LIBRARY_DEF undefined), Python package installation failures (imports before setup), missing null checks causing crashes (nvcc path, malloc), and unsafe assertions in TensorFlow ops
  • Pay close attention to cuTENSOR/CMakeLists.txt (Linux build failure), cuTENSOR/python/setup.py (installation broken), cuTENSOR/python/cutensor/c_extensions_utils.py (nvcc crash), cuTENSOR/python/cutensor/tensorflow/einsum_ops.cc (release build undefined behavior), and cuTENSOR/python/cutensor/tensorflow/einsum_module.cc (Py_Initialize misuse)

Additional Comments (40)

  1. cuTENSOR/python/einsum.h, line 378-380 (link)

    logic: Memory leak: cachelines allocated with new[] but never freed. Consider using unique_ptr or documenting ownership transfer to cuTENSOR. Does cutensorHandleAttachPlanCachelines take ownership of the cachelines array?

  2. cuTENSOR/python/cutensor/tensorflow/__init__.py, line 1 (link)

    syntax: space before /usr/bin/python in shebang breaks the shebang functionality

  3. cuTENSOR/python/README.md, line 3 (link)

    syntax: broken markdown syntax: backtick before opening square bracket should be removed

  4. cuTENSOR/python/README.md, line 41 (link)

    syntax: missing colon at end of function definition

  5. cuTENSOR/README.md, line 7 (link)

    style: Trailing whitespace after 'Linux'

  6. cuTENSOR/README.md, line 36 (link)

    style: Add % wrapper around PATH

  7. cuTENSOR/Makefile, line 14 (link)

    logic: einsum binary is built but not removed by make clean

  8. cuTENSOR/python/cutensor/c_extensions.py, line 1 (link)

    syntax: Remove space after # in shebang (should be #!/usr/bin/python)

  9. cuTENSOR/python/cutensor/package_info.py, line 43 (link)

    logic: URL path appears incorrect - should be /tree/master/cuTENSOR/python rather than /tree/master/cuTENSOR/cutensor since the package is in the python directory, not a cutensor subdirectory

  10. cuTENSOR/python/cutensor/tensorflow/einsum_module.cc, line 45 (link)

    logic: Py_Initialize() should not be called in module initialization—Python is already initialized. This can cause crashes or double-initialization issues. Remove this call.

  11. cuTENSOR/CMakeLists.txt, line 71-76 (link)

    logic: CUTENSOR_LIBRARY_DEF is only set for WIN32 but used unconditionally on line 79

  12. cuTENSOR/python/cutensor/torch/einsum.py, line 109-136 (link)

    logic: Missing error handling if path is empty or tensors list becomes empty during iteration. Could cause UnboundLocalError if result is never assigned. What happens when np.einsum_path returns an empty path or when the initial tensors list is empty?
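
A guard against the failure mode this comment describes might look like the following sketch. `contract_pair` is a hypothetical stand-in for the cuTENSOR binary contraction; the point is only that `result` is always assigned before the loop runs, and an empty operand list fails loudly:

```python
def contract_all(tensors, contract_pair):
    """Fold a list of operands with a binary contraction.

    Guards the cases the review flags: an empty operand list (which
    would otherwise leave `result` unassigned, raising
    UnboundLocalError at return) and a single operand (no contraction
    to perform).
    """
    if not tensors:
        raise ValueError("einsum requires at least one operand")
    result = tensors[0]
    for t in tensors[1:]:
        result = contract_pair(result, t)
    return result
```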

  13. cuTENSOR/einsum.cu, line 258 (link)

    syntax: typo in comment: 'wor_raw' should be 'work_raw'

  14. cuTENSOR/einsum.cu, line 396-397 (link)

    style: allocating memory for B_raw even when B_shape is empty (for reduction-only operations). Should memory allocation be skipped when B_shape is empty?

  15. cuTENSOR/python/cutensor/torch/einsum_test.py, line 138 (link)

    style: redundant assignment that has no effect

  16. cuTENSOR/python/cutensor/torch/einsum_test.py, line 149 (link)

    style: redundant assignment that has no effect

  17. cuTENSOR/python/cutensor/torch/einsum_test.py, line 208 (link)

    style: redundant assignment that has no effect

  18. cuTENSOR/python/cutensor/torch/einsum_test.py, line 215 (link)

    style: redundant assignment that has no effect

  19. cuTENSOR/python/cutensor/tensorflow/einsum_kernel.cc, line 97 (link)

    logic: potential issue: work_tensor_size calculation assumes worksize is always divisible by sizeof(float). If worksize is not a multiple of 4, this will truncate and allocate insufficient workspace. Is worksize guaranteed to be a multiple of sizeof(float)?
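
The truncation concern can be avoided with a ceiling division instead of integer division; a minimal sketch of the corrected element count (names illustrative):

```python
def workspace_elements(worksize_bytes: int, elem_size: int = 4) -> int:
    """Number of elements of size `elem_size` needed to cover
    `worksize_bytes`. Rounds up rather than truncating, so a worksize
    that is not a multiple of sizeof(float) still gets enough space."""
    return (worksize_bytes + elem_size - 1) // elem_size
```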

  20. cuTENSOR/python/cutensor/tensorflow/einsum_kernel.cc, line 84-85 (link)

    style: Einsum object is constructed on every Compute() call. If construction is expensive and the equation is fixed, consider caching the Einsum instance as a member variable

  21. cuTENSOR/python/cutensor/tensorflow/einsum_test.py, line 1 (link)

    syntax: Space between shebang and Python path will prevent execution on some systems

  22. cuTENSOR/python/cutensor/common.py, line 1 (link)

    syntax: shebang has unnecessary space after #

  23. cuTENSOR/python/cutensor/common.py, line 41 (link)

    syntax: 'Elipsis' is misspelled - should be 'Ellipsis'

  24. cuTENSOR/python/cutensor/common.py, line 34-36 (link)

    logic: splitting on -> returns a 2-element list but doesn't validate - if subscript contains multiple ->, this will silently ignore elements after the second. Should this validate that subscript contains exactly one '->' delimiter?
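
A validated split might look like this sketch (`split_subscript` is an illustrative name, not from the PR):

```python
def split_subscript(subscript: str):
    """Split an einsum subscript on '->', rejecting subscripts with
    more than one delimiter instead of silently dropping the tail."""
    parts = subscript.split("->")
    if len(parts) > 2:
        raise ValueError(
            "subscript %r contains more than one '->'" % subscript)
    inputs = parts[0].split(",")
    output = parts[1] if len(parts) == 2 else None  # implicit output
    return inputs, output
```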

  25. cuTENSOR/python/cutensor/torch/einsum.cc, line 40-52 (link)

logic: The PR description says TensorFloat32 is enabled/disabled automatically based on the PyTorch configuration, but these type traits hardcode CUTENSOR_R_MIN_32F for half/bfloat16 types. For float32 types (not shown in these specializations), the compute type should conditionally check torch::allow_tf32_cublas() to select either the TF32 or the standard F32 compute type. Where is the TF32 auto-detection logic for float32 tensors? The PR description states it addresses unsupported TF32 on non-Ampere GPUs, yet this file shows no conditional compute-type selection.

  26. cuTENSOR/contraction_plan_cache.cu, line 371 (link)

    logic: Missing error check on cudaMemcpy return value - if copy fails, timing results will be invalid

  27. cuTENSOR/contraction_plan_cache.cu, line 192-193 (link)

    logic: Missing null check after malloc - if allocation fails, the program will crash on line 193

  28. cuTENSOR/python/cutensor/tensorflow/einsum.py, line 44 (link)

    logic: unpacking single value from glob result may fail if multiple binding files match pattern or none exist. consider error handling. Should this handle cases where multiple binding files exist or none are found?
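
A defensive version of the glob lookup could look like this sketch (`find_binding` is a hypothetical name; the real code unpacks the glob result directly):

```python
import glob

def find_binding(pattern: str) -> str:
    """Locate exactly one compiled binding file, raising an explicit
    error instead of failing the bare single-value unpack when zero
    or multiple files match."""
    matches = glob.glob(pattern)
    if not matches:
        raise ImportError(
            "no binding found matching %r; was the extension built?"
            % pattern)
    if len(matches) > 1:
        raise ImportError(
            "multiple bindings match %r: %s" % (pattern, matches))
    return matches[0]
```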

  29. cuTENSOR/python/cutensor/tensorflow/einsum.py, line 82 (link)

    syntax: tf.logging.warn is deprecated in TensorFlow 2.x

  30. cuTENSOR/python/cutensor/c_extensions_utils.py, line 43-47 (link)

    logic: find_executable('nvcc') can return None if nvcc is not in PATH. This would cause AttributeError on line 44 (os.path.dirname(None)) and unpack error on line 47 (.group(1) on None result).
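
A guarded variant of the lookup might look like the following sketch, using `shutil.which` (the maintained equivalent of `find_executable`) and failing with a clear message instead of an `AttributeError`:

```python
import os
import shutil

def cuda_home_from_nvcc(nvcc_name: str = "nvcc") -> str:
    """Derive the CUDA toolkit root from the nvcc found on PATH.

    Raises RuntimeError with an actionable message when nvcc is
    missing, instead of passing None into os.path.dirname.
    """
    nvcc = shutil.which(nvcc_name)
    if nvcc is None:
        raise RuntimeError(
            "%s not found on PATH; install the CUDA toolkit or set "
            "CUDA_HOME" % nvcc_name)
    # nvcc lives at <cuda_home>/bin/nvcc, so strip two components.
    return os.path.dirname(os.path.dirname(nvcc))
```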

  31. cuTENSOR/python/cutensor/c_extensions_utils.py, line 73 (link)

    syntax: Typo: 'seperate' should be 'separate'

  32. cuTENSOR/python/cutensor/c_extensions_utils.py, line 127 (link)

    logic: Missing return statement. The Tensorflow method should return ext (like Torch does on line 96).

  33. cuTENSOR/python/setup.py, line 34-41 (link)

    logic: Imports from cutensor package before it's installed. These will fail during initial setup unless cutensor is already in sys.path or imported dynamically at setup time. Consider moving package metadata to a separate file that doesn't depend on cutensor imports, or using importlib to load dynamically. How is the cutensor package available during setup? Is there a init.py that makes these imports work before installation?
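
One way to avoid importing the package at setup time is to execute the metadata file directly in an isolated namespace; a sketch, assuming the metadata consists of plain assignments in package_info.py:

```python
def load_package_info(path: str) -> dict:
    """Read metadata (e.g. __version__) from a package_info.py file
    without importing the enclosing package, which is not importable
    before installation. The file is executed in a fresh namespace."""
    ns = {}
    with open(path) as f:
        exec(compile(f.read(), path, "exec"), ns)
    return ns
```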

  34. cuTENSOR/python/setup.py, line 43-50 (link)

    style: Missing common setup() parameters like author, author_email, classifiers, python_requires, install_requires, and long_description. These help with package discovery and dependency management on PyPI.

  35. cuTENSOR/python/cutensor/tensorflow/einsum_ops.cc, line 54 (link)

    logic: asserts terminate in release builds with undefined behavior if condition is false; use TF_RET_CHECK or return proper error status instead

  36. cuTENSOR/python/cutensor/tensorflow/einsum_ops.cc, line 59 (link)

    logic: same issue: assert in shape inference can cause undefined behavior in release builds

  37. cuTENSOR/python/cutensor/tensorflow/einsum_ops.cc, line 65 (link)

    logic: assert will not provide error feedback to TensorFlow; should check condition and return error Status

  38. cuTENSOR/python/cutensor/tensorflow/einsum_ops.cc, line 70 (link)

    logic: replace assert with proper error handling that returns Status

  39. cuTENSOR/python/cutensor/tensorflow/einsum_ops.cc, line 66 (link)

    style: modeA.size() returns size_t (unsigned); comparing with signed int can cause issues if Rank returns negative values or for very large sizes

  40. cuTENSOR/python/cutensor/tensorflow/einsum_ops.cc, line 81-82 (link)

    logic: if modeC contains a character not in dim_map, this will insert a default-constructed (invalid) DimensionHandle; should validate that all modes in modeC exist in dim_map

32 files reviewed, 40 comments


@kvoronin
Copy link
Collaborator

Hi @Linux-cpp-lisp!

We have changed the license of the repository from BSD-3 to Apache 2.0 to allow accepting external contributions in the right way.

What this means is that

  • the new default branch is main (the old master branch remains, but neither master nor the extra bsd3_main branch should receive further changes)
  • all commits to the repository must be signed, see the new https://github.com/NVIDIA/CUDALibrarySamples/blob/main/CONTRIBUTING.md. Technically, it is as simple as doing git commit -s ... which would enforce DCO for the contribution (see more details in the CONTRIBUTING.md).

What this means for this PR:
Please change the target branch to main (instead of rebasing, it may be easier to cherry-pick the commits onto a new branch) and sign the commits.

Thanks!

