[BUG] Illegal CUDA shared memory access in SM90 GEMM TMA Warpspecialized at ClusterBarrier::init #1247

kadeng · 2023-12-06T16:31:57Z

Bug description

When running the provided code as a standalone executable, a CUDA illegal memory access is reported. Using compute-sanitizer, I could pinpoint this to an illegal shared memory access inside the SM90 pipeline initializer code.

The code is generated and is a SM90 GEMM with a custom EVT-based epilogue.

Details of the source code, compilation and reproduction instructions are here:
https://gist.github.com/kadeng/31df46a19d093bdfb36977892f578e1c

Steps/Code to reproduce bug**

See https://gist.github.com/kadeng/31df46a19d093bdfb36977892f578e1c
which contains source code, compilation instructions and an error trace.

Compilation instructions ( also part of the gist above )

Environment:

Linux x64, NVIDIA H100 GPU
CUDA 12.1
Cutlass v3.3.0 ( tagged release )

Command ( example ):

nvcc -t=0 -DCUTLASS_ENABLE_TENSOR_CORE_MMA=1 -w -gencode=arch=compute_90a,code=[sm_90a,compute_90a] -O1 -std=c++17 --expt-relaxed-constexpr -lineinfo -g -DCUTLASS_DEBUG_TRACE_LEVEL=1 -Xcompiler=-fPIC -Xcompiler=-fno-strict-aliasing -Xcompiler -fvisibility=hidden -Xcompiler=-Wconversion -I/home/klondenberg/github/pytorch/pytorch/third_party/cutlass/include -I/home/klondenberg/github/pytorch/pytorch/third_party/cutlass/tools/library/include -I/home/klondenberg/github/pytorch/pytorch/third_party/cutlass/tools/library/src -I/home/klondenberg/github/pytorch/pytorch/third_party/cutlass/tools/util/include -L/home/klondenberg/local/cuda121/lib64 -L/home/klondenberg/local/cuda121/lib64/stubs -lcuda -lcudart -DGENERATE_STANDALONE_RUNNER -o broken5 broken5.cu

Where

/home/klondenberg/github/pytorch/pytorch/third_party/cutlass is the Cutlass v3.3.0 check out directory
/home/klondenberg/local/cuda121 is the CUDA 12.1 Toolkit path
nvcc is from CUDA 12.1 toolkit

To obtain the error trace above, run the compiled executable under compute-sanitizer

** Error trace from compute-sanitizer: ( also part of the gist above )

Out-of-range shared or local address
========= at 0xbd0 in /home/klondenberg/github/pytorch/pytorch/third_party/cutlass/include/cutlass/arch/barrier.h:169:cutlass::arch::ClusterBarrier::init(const unsigned long *, unsigned int)
========= by thread (0,0,0) in block (0,1,0)
========= Device Frame:/home/klondenberg/github/pytorch/pytorch/third_party/cutlass/include/cutlass/arch/barrier.h:127:cutlass::arch::ClusterBarrier::init(unsigned int) const [0xb20]
========= Device Frame:/home/klondenberg/github/pytorch/pytorch/third_party/cutlass/include/cutlass/pipeline/sm90_pipeline.hpp:1073:cutlass::OrderedSequenceBarrier<(int)1, (int)2>::OrderedSequenceBarrier(cutlass::OrderedSequenceBarrier<(int)1, (int)2>::SharedStorage &, const cutlass::OrderedSequenceBarrier<(int)1, (int)2>::Params &) [0xb20]
========= Device Frame:/home/klondenberg/github/pytorch/pytorch/third_party/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:382:cutlass::gemm::kernel::GemmUniversal<cute::tuple<int, int, int, int>, cutlass::gemm::collective::CollectiveMma<cutlass::gemm::MainloopSm90TmaGmmaWarpSpecialized<(int)27, cute::tuple<cute::C<(int)1>, cute::C<(int)1>, cute::C<(int)1>>, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple<cute::C<(int)64>, cute::C<(int)64>, cute::C<(int)32>>, cutlass::half_t, cute::tuple<long, cute::C<(int)1>, long>, cutlass::half_t, cute::tuple<cute::C<(int)1>, long, long>, cute::TiledMMA<cute::MMA_Atom<cute::SM90_64x64x16_F16F16F16_SS<(cute::GMMA::Major)0, (cute::GMMA::Major)1, (cute::GMMA::ScaleIn)1, (cute::GMMA::ScaleIn)1>>, cute::Layout<cute::tuple<cute::C<(int)1>, cute::C<(int)1>, cute::C<(int)1>>, cute::tuple<cute::C<(int)0>, cute::C<(int)0>, cute::C<(int)0>>>, cute::Layout<cute::tuple<cute::C<(int)1>, cute::C<(int)1>, cute::C<(int)1>>, cute::tuple<cute::C<(int)0>, cute::C<(int)0>, cute::C<(int)0>>>, cute::tuple<cute::Underscore, cute::Underscore, cute::Underscore>>, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<(int)2, (int)4, (int)3>, cute::smem_ptr_flag_bits<(int)16>, cute::Layout<cute::tuple<cute::C<(int)8>, cute::C<(int)32>>, cute::tuple<cute::C<(int)32>, cute::C<(int)1>>>>, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<(int)3, (int)4, (int)3>, cute::smem_ptr_flag_bits<(int)16>, cute::Layout<cute::tuple<cute::C<(int)64>, cute::C<(int)8>>, cute::tuple<cute::C<(int)1>, cute::C<(int)64>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm90TmaWarpSpecialized<(int)2, (int)2, (int)16, (bool)0>, cute::tuple<cute::C<(int)64>, cute::C<(int)64>, cute::C<(int)32>>, cute::tuple<cute::C<(int)64>, cute::C<(int)32>>, void, cute::tuple<long, cute::C<(int)1>, long>, cutlass::half_t, cute::tuple<long, cute::C<(int)1>, long>, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<identity_op, cutlass::half_t, cutlass::half_t, (cutlass::FloatRoundStyle)2, void>, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::maximum, cutlass::half_t, cutlass::half_t, (cutlass::FloatRoundStyle)2, void>, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::plus, cutlass::half_t, cutlass::half_t, (cutlass::FloatRoundStyle)2, void>, cutlass::epilogue::fusion::Sm90AccFetch, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<identity_op, cutlass::half_t, cutlass::half_t, (cutlass::FloatRoundStyle)2, void>, cutlass::epilogue::fusion::Sm90RowBroadcast<(int)2, cute::tuple<cute::C<(int)64>, cute::C<(int)64>, cute::C<(int)32>>, cutlass::half_t, cute::tuple<cute::C<(int)0>, cute::C<(int)1>, cute::C<(int)0>>, (int)8, (bool)1>>>, cutlass::epilogue::fusion::Sm90ScalarBroadcast<cutlass::half_t, cute::tuple<cute::C<(int)0>, cute::C<(int)0>, cute::C<(int)0>>, (int)1, cutlass::multiplies>>>, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<(int)2, (int)4, (int)3>, cute::smem_ptr_flag_bits<(int)16>, cute::Layout<cute::tuple<cute::C<(int)8>, cute::C<(int)32>>, cute::tuple<cute::C<(int)32>, cute::C<(int)1>>>>, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE, cute::ComposedLayout<cute::Swizzle<(int)2, (int)4, (int)3>, cute::smem_ptr_flag_bits<(int)16>, cute::Layout<cute::tuple<cute::C<(int)8>, cute::C<(int)32>>, cute::tuple<cute::C<(int)32>, cute::C<(int)1>>>>, cute::SM90_U32x4_STSM_N>, cutlass::gemm::PersistentScheduler, void>::operator ()(const cutlass::gemm::kernel::GemmUniversal<cute::tuple<int, int, int, int>, cutlass::gemm::collective::CollectiveMma<cutlass::gemm::MainloopSm90TmaGmmaWarpSpecialized<(int)27, cute::tuple<cute::C<(int)1>, cute::C<(int)1>, cute::C<(int)1>>, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple<cute::C<(int)64>, cute::C<(int)64>, cute::C<(int)32>>, cutlass::half_t, cute::tuple<long, cute::C<(int)1>, long>, cutlass::half_t, cute::tuple<cute::C<(int)1>, long, long>, cute::TiledMMA<cute::MMA_Atom<cute::SM90_64x64x16_F16F16F16_SS<(cute::GMMA::Major)0, (cute::GMMA::Major)1, (cute::GMMA::ScaleIn)1, (cute::GMMA::ScaleIn)1>>, cute::Layout<cute::tuple<cute::C<(int)1>, cute::C<(int)1>, cute::C<(int)1>>, cute::tuple<cute::C<(int)0>, cute::C<(int)0>, cute::C<(int)0>>>, cute::Layout<cute::tuple<cute::C<(int)1>, cute::C<(int)1>, cute::C<(int)1>>, cute::tuple<cute::C<(int)0>, cute::C<(int)0>, cute::C<(int)0>>>, cute::tuple<cute::Underscore, cute::Underscore, cute::Underscore>>, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<(int)2, (int)4, (int)3>, cute::smem_ptr_flag_bits<(int)16>, cute::Layout<cute::tuple<cute::C<(int)8>, cute::C<(int)32>>, cute::tuple<cute::C<(int)32>, cute::C<(int)1>>>>, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<(int)3, (int)4, (int)3>, cute::smem_ptr_flag_bits<(int)16>, cute::Layout<cute::tuple<cute::C<(int)64>, cute::C<(int)8>>, cute::tuple<cute::C<(int)1>, cute::C<(int)64>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm90TmaWarpSpecialized<(int)2, (int)2, (int)16, (bool)0>, cute::tuple<cute::C<(int)64>, cute::C<(int)64>, cute::C<(int)32>>, cute::tuple<cute::C<(int)64>, cute::C<(int)32>>, void, cute::tuple<long, cute::C<(int)1>, long>, cutlass::half_t, cute::tuple<long, cute::C<(int)1>, long>, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<identity_op, cutlass::half_t, cutlass::half_t, (cutlass::FloatRoundStyle)2, void>, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::maximum, cutlass::half_t, cutlass::half_t, (cutlass::FloatRoundStyle)2, void>, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<cutlass::plus, cutlass::half_t, cutlass::half_t, (cutlass::FloatRoundStyle)2, void>, cutlass::epilogue::fusion::Sm90AccFetch, cutlass::epilogue::fusion::Sm90TreeVisitor<cutlass::epilogue::fusion::Sm90Compute<identity_op, cutlass::half_t, cutlass::half_t, (cutlass::FloatRoundStyle)2, void>, cutlass::epilogue::fusion::Sm90RowBroadcast<(int)2, cute::tuple<cute::C<(int)64>, cute::C<(int)64>, cute::C<(int)32>>, cutlass::half_t, cute::tuple<cute::C<(int)0>, cute::C<(int)1>, cute::C<(int)0>>, (int)8, (bool)1>>>, cutlass::epilogue::fusion::Sm90ScalarBroadcast<cutlass::half_t, cute::tuple<cute::C<(int)0>, cute::C<(int)0>, cute::C<(int)0>>, (int)1, cutlass::multiplies>>>, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<(int)2, (int)4, (int)3>, cute::smem_ptr_flag_bits<(int)16>, cute::Layout<cute::tuple<cute::C<(int)8>, cute::C<(int)32>>, cute::tuple<cute::C<(int)32>, cute::C<(int)1>>>>, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE, cute::ComposedLayout<cute::Swizzle<(int)2, (int)4, (int)3>, cute::smem_ptr_flag_bits<(int)16>, cute::Layout<cute::tuple<cute::C<(int)8>, cute::C<(int)32>>, cute::tuple<cute::C<(int)32>, cute::C<(int)1>>>>, cute::SM90_U32x4_STSM_N>, cutlass::gemm::PersistentScheduler, void>::Params &, char *) [0xad0]
========= Device Frame:/home/klondenberg/github/pytorch/pytorch/third_party/cutlass/include/cutlass/device_kernel.h:109:void cutlass::device_kernel<cutlass3x_sm90_tensorop_h64x64x16gemm_f16_f16_f16_void_f16_64x64x32_1x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma>(T1::Params) [0x20]

thakkarV · 2023-12-06T16:36:07Z

I see a -g flag in your nvcc command line. Does the issue occur if you remove the -g. Additionally, does this issue persist if you change the -O1 to -O2 instead?

kadeng · 2023-12-06T16:44:26Z

Tried that, both do not make a difference. On a sidenote, I had to update the linked gist, since I noticed that the code formatter I used directly before pasting it in there destroyed the source. Now it should compile, but the formatting is ugly..

github-actions · 2024-01-11T20:04:51Z

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

mnicely · 2024-02-22T15:04:17Z

@kadeng did you resolve your issue?

kadeng · 2024-02-22T15:11:39Z

No, but I did not try on the latest Cutlass version.

github-actions · 2024-03-23T16:05:03Z

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions · 2024-06-21T17:05:27Z

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

kadeng added ? - Needs Triage bug Something isn't working labels Dec 6, 2023

mnicely removed the ? - Needs Triage label Dec 6, 2023

thakkarV added the ? - Needs Triage label Dec 12, 2023

github-actions bot added the inactive-30d label Jan 11, 2024

github-actions bot removed the inactive-30d label Feb 22, 2024

github-actions bot added the inactive-30d label Mar 23, 2024

github-actions bot added the inactive-90d label Jun 21, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[BUG] Illegal CUDA shared memory access in SM90 GEMM TMA Warpspecialized at ClusterBarrier::init #1247

[BUG] Illegal CUDA shared memory access in SM90 GEMM TMA Warpspecialized at ClusterBarrier::init #1247

kadeng commented Dec 6, 2023 •

edited

Loading

thakkarV commented Dec 6, 2023

Uh oh!

kadeng commented Dec 6, 2023 •

edited

Loading

Uh oh!

github-actions bot commented Jan 11, 2024

Uh oh!

mnicely commented Feb 22, 2024

Uh oh!

kadeng commented Feb 22, 2024

Uh oh!

github-actions bot commented Mar 23, 2024

Uh oh!

github-actions bot commented Jun 21, 2024

Uh oh!

[BUG] Illegal CUDA shared memory access in SM90 GEMM TMA Warpspecialized at ClusterBarrier::init #1247

[BUG] Illegal CUDA shared memory access in SM90 GEMM TMA Warpspecialized at ClusterBarrier::init #1247

Comments

kadeng commented Dec 6, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Bug description

Steps/Code to reproduce bug**

Compilation instructions ( also part of the gist above )

thakkarV commented Dec 6, 2023

Uh oh!

kadeng commented Dec 6, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions bot commented Jan 11, 2024

Uh oh!

mnicely commented Feb 22, 2024

Uh oh!

kadeng commented Feb 22, 2024

Uh oh!

github-actions bot commented Mar 23, 2024

Uh oh!

github-actions bot commented Jun 21, 2024

Uh oh!

kadeng commented Dec 6, 2023 •

edited

Loading

kadeng commented Dec 6, 2023 •

edited

Loading