
uses current CUDAStream correctly #118

Merged

merged 3 commits into master on Jul 8, 2025

Conversation

soumith (Member) commented on Jul 8, 2025

Fixes pytorch/pytorch#157363

Thanks to @vlejd for finding, debugging, and reporting the issue.

soumith and others added 2 commits July 8, 2025 10:40
This commit fixes GitHub issue pytorch/pytorch#157363 where custom CUDA
kernels were not properly synchronized with PyTorch's CUDA stream when
used with torch.compile in reduce-overhead mode.

Changes:
- Add #include <ATen/cuda/CUDAContext.h> for getCurrentCUDAStream()
- Use at::cuda::getCurrentCUDAStream() to get PyTorch's current CUDA stream
- Launch all kernels with the correct stream parameter

The issue occurred because the custom kernels were launched on the default
CUDA stream, while PyTorch operations (such as nn.Linear) run on PyTorch's
managed stream. This created a race condition in which a custom kernel could
execute before the PyTorch operations it depends on had completed, producing
incorrect output values.
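
As a rough illustration of that failure mode (not code from this repository), here is a minimal Python sketch; `my_ext.my_op` is a hypothetical placeholder for the extension's custom CUDA op:

```python
import torch
import torch.nn as nn

import my_ext  # hypothetical module exposing the custom CUDA op


class Model(nn.Module):
    def __init__(self, n):
        super().__init__()
        self.linear = nn.Linear(n, n)

    def forward(self, x):
        # nn.Linear runs on PyTorch's current stream; before the fix, the
        # custom kernel launched on the default stream and could read the
        # linear output before it was ready.
        return my_ext.my_op(self.linear(x))


n = 10000  # the original report involved large inputs
model = Model(n).cuda()
compiled = torch.compile(model, mode="reduce-overhead")

x = torch.randn(1, n, device="cuda")
expected = model(x)
actual = compiled(x)  # could differ from `expected` before the fix
print(torch.allclose(expected, actual, atol=1e-4, rtol=1e-4))
```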

With this fix, all custom kernels are properly synchronized with PyTorch's
CUDA stream, ensuring correct execution order and preventing race conditions
when used with torch.compile.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Added comprehensive tests to verify the fix for GitHub issue pytorch/pytorch#157363:

1. test_compile_with_linear_layer:
   - Tests custom CUDA kernels with nn.Linear + torch.compile
   - Verifies correct behavior with various input sizes (1000, 5000, 10000)
   - Uses reduce-overhead mode to reproduce the original issue conditions

2. test_compile_custom_only:
   - Tests custom operations without linear layers
   - Ensures custom operations work correctly with torch.compile

These tests ensure that custom CUDA kernels properly synchronize with
PyTorch's CUDA stream when used with torch.compile, preventing race
conditions that previously caused incorrect outputs.
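
A condensed sketch of what such tests could look like; the class name and `my_ext.my_op` are hypothetical placeholders, not the repository's actual test code:

```python
import torch
import torch.nn as nn
from torch.testing._internal.common_utils import TestCase, run_tests

import my_ext  # hypothetical module exposing the custom CUDA op


class TestCompileStreamSync(TestCase):
    def test_compile_with_linear_layer(self):
        # nn.Linear feeding the custom op, compiled in reduce-overhead
        # mode, across the input sizes mentioned above.
        for n in (1000, 5000, 10000):
            linear = nn.Linear(n, n).cuda()

            def fn(x):
                return my_ext.my_op(linear(x))

            compiled = torch.compile(fn, mode="reduce-overhead")
            x = torch.randn(1, n, device="cuda")
            self.assertEqual(compiled(x), fn(x))

    def test_compile_custom_only(self):
        # The custom op alone, without any linear layer.
        compiled = torch.compile(my_ext.my_op, mode="reduce-overhead")
        x = torch.randn(1024, device="cuda")
        self.assertEqual(compiled(x), my_ext.my_op(x))


if __name__ == "__main__":
    run_tests()
```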

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
zou3519 (Contributor) left a comment

i see you have embraced claude code

Replace manual tolerance specification with self.assertEqual, which
automatically applies appropriate tolerances for tensor comparisons.
This makes the tests more concise and follows PyTorch testing conventions.
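
For illustration only, a minimal self-contained example of that convention; the "before" line is an assumed reconstruction of the manual-tolerance style, not the repository's actual test code:

```python
import torch
from torch.testing._internal.common_utils import TestCase, run_tests


class Example(TestCase):
    def test_tolerances(self):
        expected = torch.ones(4)
        actual = expected + 1e-8  # tiny float noise

        # Before: manually specified tolerances (assumed reconstruction)
        self.assertTrue(torch.allclose(actual, expected, atol=1e-4, rtol=1e-4))

        # After: TestCase.assertEqual applies dtype-appropriate default
        # tolerances to tensor comparisons automatically
        self.assertEqual(actual, expected)


if __name__ == "__main__":
    run_tests()
```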

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
soumith merged commit 0ec4969 into master on Jul 8, 2025
1 of 3 checks passed

Successfully merging this pull request may close these issues.

The op is not compatible with compile mode="reduce-overhead" and linear layers for large inputs.
4 participants