This repository focuses on exploring Wave Matrix Multiply-Accumulate (WMMA) intrinsics through ROCm HIP programming on AMD GPUs. While these implementations were created primarily for learning purposes, they may serve as helpful references for others interested in understanding WMMA and related GPU intrinsics.
This repository aims to:
- Provide practical examples of WMMA programming using ROCm HIP
- Demonstrate optimization techniques for matrix operations using WMMA
- Serve as a learning resource for developers working with AMD GPU matrix acceleration
An exploration of the ROCm Wave Matrix Multiply-Accumulate (WMMA) intrinsic, demonstrating how to implement and optimize matrix multiplication using ROCm HIP. This project extends beyond basic examples to support arbitrary matrix dimensions and includes performance comparisons between different implementation approaches.
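To make the "arbitrary matrix dimensions" point concrete, here is a minimal CPU-only sketch of the usual edge-handling idea: work in fixed 16x16 tiles (the fragment size WMMA uses on RDNA3), zero-pad any tile that hangs over the matrix edge, and store back only the in-bounds elements. This is an illustration under those assumptions, not the kernels in this repository; `tiled_gemm` and `kTile` are hypothetical names, and the inner triple loop merely stands in for a single WMMA operation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical tile size for this sketch; WMMA on RDNA3 operates on 16x16x16 tiles.
constexpr std::size_t kTile = 16;

// Computes C (M x N) = A (M x K) * B (K x N), all row-major, one tile at a time.
std::vector<float> tiled_gemm(const std::vector<float>& A,
                              const std::vector<float>& B,
                              std::size_t M, std::size_t N, std::size_t K) {
    assert(A.size() == M * K && B.size() == K * N);
    std::vector<float> C(M * N, 0.0f);

    for (std::size_t tm = 0; tm < M; tm += kTile) {
        for (std::size_t tn = 0; tn < N; tn += kTile) {
            float acc[kTile][kTile] = {};  // accumulator tile, starts at zero

            for (std::size_t tk = 0; tk < K; tk += kTile) {
                float a[kTile][kTile] = {};  // zero-initialized, so out-of-range
                float b[kTile][kTile] = {};  // elements act as padding

                // Bounds-checked loads: only copy elements inside the matrices.
                for (std::size_t i = 0; i < kTile; ++i) {
                    for (std::size_t j = 0; j < kTile; ++j) {
                        if (tm + i < M && tk + j < K) a[i][j] = A[(tm + i) * K + (tk + j)];
                        if (tk + i < K && tn + j < N) b[i][j] = B[(tk + i) * N + (tn + j)];
                    }
                }

                // Full-tile multiply-accumulate; stands in for one WMMA operation.
                for (std::size_t i = 0; i < kTile; ++i)
                    for (std::size_t j = 0; j < kTile; ++j)
                        for (std::size_t k = 0; k < kTile; ++k)
                            acc[i][j] += a[i][k] * b[k][j];
            }

            // Bounds-checked store: write back only the valid part of the tile.
            for (std::size_t i = 0; i < kTile; ++i)
                for (std::size_t j = 0; j < kTile; ++j)
                    if (tm + i < M && tn + j < N) C[(tm + i) * N + (tn + j)] = acc[i][j];
        }
    }
    return C;
}
```

Because the padding is zero, the extra products contribute nothing to the accumulator, so sizes such as 17 x 33 x 18 give the same result as a plain triple loop. On the GPU the same checks would typically sit in the load and store paths around the WMMA call.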
A separate repository, rocm_wmma_gemm, builds on the fastest implementation from this project. It includes a tuner (the implementation has already been tuned for specific sizes) and supports different input and output layouts (row-major and column-major).
- AMD ROCm installed with HIP support
- CMake version 3.10 or higher
- AMD RDNA3/RDNA3.5/RDNA4 GPU (required for WMMA support)
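As a rough way to check the GPU requirement in code, the sketch below tests an architecture name against the `gfx11` (RDNA3 and RDNA3.5) and `gfx12` (RDNA4) prefixes. It assumes the name comes from something like the `gcnArchName` field of HIP's `hipDeviceProp_t` or the output of `rocminfo`; the helper `likely_supports_wmma` is a hypothetical name, and a prefix match is a heuristic rather than an authoritative capability query.

```cpp
#include <string>

// Hypothetical helper: returns true when the architecture name starts with
// "gfx11" (RDNA3 / RDNA3.5) or "gfx12" (RDNA4), the families listed above.
bool likely_supports_wmma(const std::string& arch_name) {
    // rfind(prefix, 0) == 0 is a prefix test that works before C++20.
    return arch_name.rfind("gfx11", 0) == 0 ||
           arch_name.rfind("gfx12", 0) == 0;
}
```

For example, `likely_supports_wmma("gfx1100")` returns `true`, while `likely_supports_wmma("gfx1030")` (an RDNA2 part) returns `false`.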
- Clone the repository:

      git clone https://github.com/adelj88/rocm_wmma_samples.git
      cd rocm_wmma_samples
- Build all projects:

      mkdir build
      cd build
      CXX=/opt/rocm/bin/hipcc cmake ..
      make
This repository will be expanded with more WMMA-focused examples and explorations. Planned additions include:
- Examples of WMMA usage in different computation patterns and kernel types
- Testing and validation on future RDNA4 hardware
- Performance comparisons across different GPU architectures
For project-specific plans and improvements, please see the individual project READMEs.
This project was inspired by:
This project is licensed under the MIT License - see the LICENSE file for details.