Skip to content

Bug: Configuration error of gridDims in MatrixMultiplyKernel #1

@1501454486

Description

@1501454486

Summary

Hello, while working on the assignment, I noticed that the kernel launch configuration for MatrixMultiplyKernelinsrc/combine.cu appears to have its grid dimensions for rows (m) and columns (p) swapped. This causes matrix multiplication tests to fail for non-square matrices.

Location of the Bug

File: src/combine.cu

Function: void MatrixMultiply(...)

Specific Line: The definition of dim3 gridDims(...)

Problem Description

The current kernel launch configuration is as follows:

C++

// Incorrect Configuration
dim3 gridDims((m + threadsPerBlock - 1) / threadsPerBlock, (p + threadsPerBlock - 1) / threadsPerBlock, batch);
In this setup, gridDims.x(the grid's x-dimension) is determined by the number of rowsm, while gridDims.yis determined by the number of columnsp.

However, based on the kernel's internal logic (row = blockIdx.y * ..., col = blockIdx.x * ...), the x-dimension of the grid should correspond to the columns (p), and the y-dimension should correspond to the rows (m).

The correct configuration should be:

C++

// Correct Configuration
dim3 gridDims((p + threadsPerBlock - 1) / threadsPerBlock, (m + threadsPerBlock - 1) / threadsPerBlock, batch);

Steps to Reproduce

This issue becomes apparent when performing multiplication on non-square matrices (where m != p).

  1. Use the official test suite provided with the assignment.
  2. Run the tests for CUDA matrix multiplication, specifically test_cuda_matmul_transpose.
  3. Due to the incorrect gridDims configuration, when computing a (50, 2) @ (2, 1) matrix multiplication (where m=50, p=1), the kernel is launched with a 2x1 grid instead of the correct 1x2 grid. This results in only one row of thread blocks being launched (blockIdx.y is always 0), which means only the first 32 rows of the output matrix get computed. This leads to an assertion error in the test.

Impact

This bug causes all CUDA multiplication operations involving non-square matrices to produce incorrect results and fail their corresponding unit tests.

Suggested Fix

I recommend changing the gridDimsdefinition in theMatrixMultiplyfunction withinsrc/combine.cu to the following:

C++

dim3 gridDims((p + threadsPerBlock - 1) / threadsPerBlock, (m + threadsPerBlock - 1) / threadsPerBlock, batch);
I had corrected this in my own code for HW2, which allowed all tests to pass. I noticed the issue persists in the new starter code for HW3 and wanted to report it to help improve the assignment materials for future students.

Thank you!

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions