Add SM80/89 blockwise scaling kernel, support FP8 block/groupwise on Ada, INT8 on Ampere #2328

solrex · 2025-05-24T02:44:08Z

Inspired by #1932 and #2037, this PR implements blockwise scaling kernels for platforms before SM90.

* FP8 blockwise/groupwise scaling kernel for Ada (L20, L40S, 4090). Requires the accumulator type to be float.
* INT8 blockwise/groupwise scaling kernel for Ampere (A100/800, A10, A30). Requires the accumulator type to be int.
* CUTLASS 3.x API

* FP8 blockwise/groupwise kernel for Ada (L20, L40S, 4090)
* INT8 blockwise/groupwise kernel for Ampere (A100/800)
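For readers new to block/groupwise scaling, here is a minimal CPU reference sketch of the arithmetic for the INT8 case. It is not the kernel added by this PR. The function name `ref_blockscaled_gemm`, the `gran_*` parameters, and the row-major layouts of the operands and scale-factor arrays are all assumptions made for illustration. Products inside one K block are accumulated as integers, then the partial sum is multiplied by the A and B scale factors for that block and added to a float result. An FP8 variant would presumably keep the inner partial sum in float instead.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CPU reference, not part of CUTLASS.
// A is M x K (row-major). B is stored as N x K (row-major).
// sfa has (M / gran_m) x (K / gran_k) entries, one per block of A.
// sfb has (N / gran_n) x (K / gran_k) entries, one per block of B.
std::vector<float> ref_blockscaled_gemm(
    const std::vector<int8_t>& A, const std::vector<int8_t>& B,
    const std::vector<float>& sfa, const std::vector<float>& sfb,
    int M, int N, int K, int gran_m, int gran_n, int gran_k) {
  // This sketch ignores residue blocks and requires exact divisibility.
  assert(M % gran_m == 0 && N % gran_n == 0 && K % gran_k == 0);
  const int k_blocks = K / gran_k;
  std::vector<float> D(static_cast<std::size_t>(M) * N, 0.0f);
  for (int m = 0; m < M; ++m) {
    for (int n = 0; n < N; ++n) {
      float acc = 0.0f;
      for (int kb = 0; kb < k_blocks; ++kb) {
        // Integer accumulation within one K block.
        int32_t partial = 0;
        for (int k = kb * gran_k; k < (kb + 1) * gran_k; ++k) {
          partial += static_cast<int32_t>(A[m * K + k]) *
                     static_cast<int32_t>(B[n * K + k]);
        }
        // Apply the per-block scale factors of A and B, then promote.
        const float scale = sfa[(m / gran_m) * k_blocks + kb] *
                            sfb[(n / gran_n) * k_blocks + kb];
        acc += scale * static_cast<float>(partial);
      }
      D[m * N + n] = acc;
    }
  }
  return D;
}
```

With `gran_m = 1` each row of A has its own scale per K block (groupwise), while `gran_m` equal to the tile extent gives one scale per tile (blockwise).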

hwu36 · 2025-05-28T02:05:16Z

@jackkosaian

include/cutlass/gemm/collective/sm80_mma_multistage_blockwise_scaling.hpp

solrex · 2025-05-28T08:01:28Z

Example benchmark results on an L40S with CUDA 12.4 and CUTLASS main:

FP8:

$ ./examples/85_ada_ampere_gemm_with_blockwise_scaling/85a_ada_fp8_gemm_with_groupwise_scaling_cute
  Problem Size: 1024x1024x1024x1
  Tile shape (M, N, K): _64, _128, _128
  ScaleGranularityM: 1 (ScaleMsPerTile: 64)
  ScaleGranularityN: 128 (ScaleNsPerTile: 1)
  Running... 
  Result MSE: 2.79446e-06, MRE: 12.0697, greatest error: 0.0196838
  Disposition: Passed
  Avg runtime: 0.00905421 ms
  GFLOPS: 237181

$ ./examples/85_ada_ampere_gemm_with_blockwise_scaling/85b_ada_fp8_gemm_with_blockwise_scaling_cute
  Problem Size: 1024x1024x1024x1
  Tile shape (M, N, K): _128, _128, _128
  ScaleGranularityM: 128 (ScaleMsPerTile: 1)
  ScaleGranularityN: 128 (ScaleNsPerTile: 1)
  Running... 
  Result MSE: 2.61817e-06, MRE: 11.7382, greatest error: 0.0210075
  Disposition: Passed
  Avg runtime: 0.0233175 ms
  GFLOPS: 92097.5
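The `ScaleMsPerTile` and `ScaleNsPerTile` values printed above are consistent with dividing the tile extent by the scale granularity. A small sketch of that relationship, where `scales_per_tile` is a hypothetical helper and not a CUTLASS function:

```cpp
// Hypothetical helper: number of scale factors that cover one tile dimension,
// assuming the tile extent is a multiple of the scale granularity.
constexpr int scales_per_tile(int tile_extent, int scale_granularity) {
  return tile_extent / scale_granularity;
}

// Groupwise example (85a): tile M = 64, ScaleGranularityM = 1.
static_assert(scales_per_tile(64, 1) == 64, "one scale per row of the tile");
// Blockwise example (85b): tile M = 128, ScaleGranularityM = 128.
static_assert(scales_per_tile(128, 128) == 1, "one scale per tile");
```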

INT8: 

$ ./examples/85_ada_ampere_gemm_with_blockwise_scaling/85c_ampere_int8_gemm_with_groupwise_scaling_cute
  Problem Size: 1024x1024x1024x1
  Tile shape (M, N, K): _64, _128, _128
  ScaleGranularityM: 1 (ScaleMsPerTile: 64)
  ScaleGranularityN: 128 (ScaleNsPerTile: 1)
  Running... 
  Result MSE: 0, MRE: 81.7363, greatest error: 0
  Disposition: Passed
  Avg runtime: 0.00911155 ms
  GFLOPS: 235688

$ ./examples/85_ada_ampere_gemm_with_blockwise_scaling/85d_ampere_int8_gemm_with_blockwise_scaling_cute
  Problem Size: 1024x1024x1024x1
  Tile shape (M, N, K): _128, _128, _128
  ScaleGranularityM: 128 (ScaleMsPerTile: 1)
  ScaleGranularityN: 128 (ScaleNsPerTile: 1)
  Running... 
  Result MSE: 0, MRE: 77.9124, greatest error: 0
  Disposition: Passed
  Avg runtime: 0.0239155 ms
  GFLOPS: 89794.6
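The GFLOPS figures above follow from the problem size and the average runtime if each inner-product term is counted as two operations (one multiply, one add). A minimal sketch, where `gemm_gflops` is a hypothetical helper and the 2·M·N·K·L operation count is the usual convention rather than something stated in this thread:

```cpp
#include <cstdint>

// Hypothetical helper: GFLOP/s of an M x N x K GEMM over L batches,
// given the average runtime in milliseconds.
double gemm_gflops(int64_t m, int64_t n, int64_t k, int64_t l,
                   double runtime_ms) {
  const double flop = 2.0 * static_cast<double>(m) * static_cast<double>(n) *
                      static_cast<double>(k) * static_cast<double>(l);
  const double seconds = runtime_ms / 1000.0;
  return flop / seconds / 1.0e9;
}
```

For example, `gemm_gflops(1024, 1024, 1024, 1, 0.00905421)` lands within rounding of the 237181 reported for the FP8 groupwise run.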

Add SM80/89 blockwise kernel, support:

8f69d8e

* FP8 blockwise/groupwise kernel for Ada (L20, L40S, 4090)
* INT8 blockwise/groupwise kernel for Ampere (A100/800)

solrex changed the title ~~Add SM80/89 blockwise scaling kernel, support FP8 block/groupwise on Ada, INT8 block/groupwise on Ampere~~ Add SM80/89 blockwise scaling kernel, support FP8 block/groupwise on Ada, INT8 on Ampere May 24, 2025

solrex added 3 commits May 26, 2025 10:52

Set the element types of EpilogueOp more clearly.

735f299

Add Traits for different block size.

886dc04

Avoid unnecessary copy in for loop.

5c58e77

solrex force-pushed the sm80-blockscale branch from 2b2a88b to 5c58e77 Compare May 26, 2025 18:03

solrex added 2 commits May 27, 2025 11:51

Fix scale factor residue calculation.

4b3e259

Avoid overflow calculation.

b194709

Rollback SFA/B copy thread num to 32, fix m*n (m>1,n>1) scale missing.

034e486

guyan364 reviewed May 28, 2025

include/cutlass/gemm/collective/sm80_mma_multistage_blockwise_scaling.hpp (outdated)

Use load_sf* flags to limit threads that perform clear sf*.

3521a01
