Commit 8325430

spcyppt authored and facebook-github-bot committed
Fix int_nbit int8 nobag CUDA kernel (#4421)
Summary: Pull Request resolved: #4421
X-link: facebookresearch/FBGEMM#1491

**TLDR;** Fix int8 nobag in the TBE inference CUDA kernel such that
- the output shape is `{total_L, D + kINT8QparamsBytes}`
- `kINT8QparamsBytes` = 4

**Detail**

For nobag int8, the output shape should be `{total_L, D + kINT8QparamsBytes}`, since the `total_L` dimension already includes `T`. The `T *` factor was unintentionally added in D36018114. `kINT8QparamsBytes` is 4 on CPU, since a half is used; however, 8 is used in CUDA. This diff removes `T *` from the output shape and changes `kINT8QparamsBytes` to 4 in the CUDA kernel implementation to match CPU and production. There has been no issue so far because neither of our int8 nobag CUDA kernels is currently used in production.

----

Note that the currently used meta function is [fbgemm_int_nbit_split_embedding_codegen_lookup_function_meta](https://www.internalfb.com/code/fbsource/[d4f61c30f747f0a8c2e6d806904bc8ef3ee5ea42]/fbcode/caffe2/torch/fb/model_transform/splitting/split_dispatcher.py?lines=231%2C423), which has different logic for the int8 and nobag cases. The discrepancy has not been an issue because:

- Nobag
  - split_dispatcher: D = average D
  - FBGEMM: D = max(max_D of each dtype)
  -> The embedding dimensions are the same, so average D = max D.
- Int8 pooled
  - split_dispatcher: [B, total_D]
  - FBGEMM: [B, total_D + T * 8]
  -> This is not being used in prod.

This will be a problem if embedding dimensions are mixed, or if int8 pooled is going to be used.

Reviewed By: q10

Differential Revision: D76488339

fbshipit-source-id: ae8ca9dcb9db01eec8aa25504d1a01202c7cd466
1 parent b2808c9 commit 8325430
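To make the shape change concrete, here is a minimal host-side sketch (not FBGEMM code; the helper names and the example values of `D`, `T`, and `total_L` are made up) of how the nobag int8 output column count is computed before and after this fix:

```cpp
// Minimal sketch, not FBGEMM code: illustrates the nobag int8 output shape
// described in the commit message. Helper names are hypothetical.
#include <cstdint>
#include <iostream>

// Before the fix: qparams bytes were scaled by T, even though the
// total_L row dimension already covers every table.
int64_t nobag_int8_cols_before(int64_t D, int64_t T) {
  const int kINT8QparamsBytes = 8;  // float scale + float bias
  return D + T * kINT8QparamsBytes;
}

// After the fix: one set of qparams per output row, so no T factor.
int64_t nobag_int8_cols_after(int64_t D) {
  const int kINT8QparamsBytes = 4;  // half scale + half bias, as on CPU
  return D + kINT8QparamsBytes;
}

int main() {
  const int64_t D = 128, T = 10, total_L = 1000;  // illustrative values only
  std::cout << "before: {" << total_L << ", " << nobag_int8_cols_before(D, T) << "}\n";
  std::cout << "after:  {" << total_L << ", " << nobag_int8_cols_after(D) << "}\n";
  return 0;
}
```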

File tree

1 file changed: +4 -4 lines changed

fbgemm_gpu/codegen/inference/embedding_forward_quantized_split_nbit_host_template.cu

Lines changed: 4 additions & 4 deletions
@@ -130,13 +130,12 @@ __global__ void {{ type_map[emb_weight_type].enum_name }}_split_embedding{{ "_no
 
   // Construct output tensor
   Tensor output;
-  const int kINT8QparamsBytes = 8;
 
   SparseType o_dtype = static_cast<SparseType>(output_dtype);
   TORCH_CHECK(o_dtype == SparseType::FP32 || o_dtype == SparseType::FP16 || o_dtype == SparseType::BF16 || o_dtype == SparseType::INT8);
 
   {%- if not nobag %}
-
+  const int kINT8QparamsBytes = 8;
   int64_t total_adjusted_D = total_D;
   if (o_dtype == SparseType::INT8) {
     total_adjusted_D += T * kINT8QparamsBytes;
@@ -149,10 +148,11 @@ __global__ void {{ type_map[emb_weight_type].enum_name }}_split_embedding{{ "_no
   }
 
   {%- else %}
-
+  // TODO: Change to use half to match CPU/Meta implementation
+  const int kINT8QparamsBytes = 8; // using float for scale and bias
   int64_t adjusted_D = D;
   if (o_dtype == SparseType::INT8) {
-    adjusted_D += T * kINT8QparamsBytes;
+    adjusted_D += kINT8QparamsBytes;
   }
 
   if (total_L == 0) {
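For context on the byte counts discussed in the summary, here is a hedged illustration (the struct names below are made up; FBGEMM does not define them) of why half-precision scale/bias gives `kINT8QparamsBytes` = 4 while float gives 8:

```cpp
// Illustration only, not FBGEMM code. Each int8 output row carries a scale
// and a bias at its tail; their storage precision determines the extra bytes.
#include <cstdint>
#include <cstdio>

struct HalfQparams  { uint16_t scale; uint16_t bias; };  // 2 + 2 = 4 bytes (CPU/Meta path)
struct FloatQparams { float    scale; float    bias; };  // 4 + 4 = 8 bytes (current CUDA path)

int main() {
  std::printf("half qparams:  %zu bytes\n", sizeof(HalfQparams));   // 4
  std::printf("float qparams: %zu bytes\n", sizeof(FloatQparams));  // 8
  return 0;
}
```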

0 commit comments