-
Notifications
You must be signed in to change notification settings - Fork 19
Description
Summary
Hello, while working on the assignment, I noticed that the kernel launch configuration for MatrixMultiplyKernelinsrc/combine.cu appears to have its grid dimensions for rows (m) and columns (p) swapped. This causes matrix multiplication tests to fail for non-square matrices.
Location of the Bug
File: src/combine.cu
Function: void MatrixMultiply(...)
Specific Line: The definition of dim3 gridDims(...)
Problem Description
The current kernel launch configuration is as follows:
C++
// Incorrect Configuration
dim3 gridDims((m + threadsPerBlock - 1) / threadsPerBlock, (p + threadsPerBlock - 1) / threadsPerBlock, batch);
In this setup, gridDims.x(the grid's x-dimension) is determined by the number of rowsm, while gridDims.yis determined by the number of columnsp.
However, based on the kernel's internal logic (row = blockIdx.y * ..., col = blockIdx.x * ...), the x-dimension of the grid should correspond to the columns (p), and the y-dimension should correspond to the rows (m).
The correct configuration should be:
C++
// Correct Configuration
dim3 gridDims((p + threadsPerBlock - 1) / threadsPerBlock, (m + threadsPerBlock - 1) / threadsPerBlock, batch);
Steps to Reproduce
This issue becomes apparent when performing multiplication on non-square matrices (where m != p).
- Use the official test suite provided with the assignment.
- Run the tests for CUDA matrix multiplication, specifically test_cuda_matmul_transpose.
- Due to the incorrect gridDims configuration, when computing a (50, 2) @ (2, 1) matrix multiplication (where m=50, p=1), the kernel is launched with a 2x1 grid instead of the correct 1x2 grid. This results in only one row of thread blocks being launched (blockIdx.y is always 0), which means only the first 32 rows of the output matrix get computed. This leads to an assertion error in the test.
Impact
This bug causes all CUDA multiplication operations involving non-square matrices to produce incorrect results and fail their corresponding unit tests.
Suggested Fix
I recommend changing the gridDimsdefinition in theMatrixMultiplyfunction withinsrc/combine.cu to the following:
C++
dim3 gridDims((p + threadsPerBlock - 1) / threadsPerBlock, (m + threadsPerBlock - 1) / threadsPerBlock, batch);
I had corrected this in my own code for HW2, which allowed all tests to pass. I noticed the issue persists in the new starter code for HW3 and wanted to report it to help improve the assignment materials for future students.
Thank you!