[deepseek][kernels][blackwell] Cutlass blackwell grouped gemm using cute dsl (forward,backward) #1276

Open
wants to merge 34 commits into base: main

Commits (34)
e686f60
Create cute_grouped_gemm.py
lessw2020 Jun 8, 2025
97bf42b
add benchmark showcasing pytorch integration
lessw2020 Jun 8, 2025
79b2a62
add groupGemm cute strategy
lessw2020 Jun 9, 2025
05d4506
use triton.do_bench for improved accuracy
lessw2020 Jun 9, 2025
a570007
add xlarge config to mimic deepseek (256 experts)
lessw2020 Jun 9, 2025
93db6c0
add gg driver to test out configs
lessw2020 Jun 10, 2025
86dae8b
start on 2cta and larger cluster
lessw2020 Jun 11, 2025
8523b72
refactor cutlass group gemm for 2UMMA support and streamlined code
lessw2020 Jun 11, 2025
2da79a8
all cluster sizes running nicely
lessw2020 Jun 11, 2025
565d014
minimize cpu-gpu synchs
lessw2020 Jun 12, 2025
aff4a06
initial backwards for cutlass (not working)
lessw2020 Jun 12, 2025
ad9d29b
initial backwards for cutlass, simple test working
lessw2020 Jun 13, 2025
9a02ac0
backwards, add initial numerics check (failing)
lessw2020 Jun 13, 2025
a4b35c3
backwards, add initial numerics check (failing)
lessw2020 Jun 13, 2025
0c5b84c
progress on backwards, still failing
lessw2020 Jun 13, 2025
74257fb
standalone gg for backwards debugging
lessw2020 Jun 14, 2025
d6f5a03
backwards working(!)
lessw2020 Jun 14, 2025
59cf49a
new benchmarks...gg back not fully working again
lessw2020 Jun 14, 2025
4661156
add _set_cuda_context, update simple backwards test
lessw2020 Jun 14, 2025
a288251
backwards K mismatch
lessw2020 Jun 15, 2025
f97cbf6
add pytorch_cute_converter
lessw2020 Jun 20, 2025
5a8cb9c
remove transpose warning - we handle via strides
lessw2020 Jun 20, 2025
d4d6314
ds inference all working again, blackwell group gemm and manual looping
lessw2020 Jun 21, 2025
f2a146a
standalone version for cutlass gg
lessw2020 Jun 22, 2025
a345508
standalone running, but values incorrect
lessw2020 Jun 22, 2025
47d614c
integrate cute kernel cache options
lessw2020 Jun 22, 2025
00b90c1
move working version to standlone
lessw2020 Jun 22, 2025
388de94
simpler standalone version
lessw2020 Jun 22, 2025
211e75a
standalone still not working...
lessw2020 Jun 22, 2025
20a0a92
pretranspose working
lessw2020 Jun 22, 2025
e833196
3 different versions...one working, 2 with issues
lessw2020 Jun 22, 2025
016ca47
remove torch.cuda.synchronize
lessw2020 Jun 22, 2025
c3ee57c
reasonable working version
lessw2020 Jun 22, 2025
e0a0647
improved converter class
lessw2020 Jun 22, 2025