This folder contains various C++ and CUDA C++ implementations of square matrix multiplication, showcasing approaches ranging from basic CPU execution to optimized GPU computation using custom kernels and NVIDIA's CUBLAS library.
The primary goal is to explore and compare the performance characteristics of these methods, particularly for large matrices.
Description: This file provides a standard, generic matrix multiplication algorithm executed entirely on the CPU
Testing: It includes two test cases for a 10000x10000 matrix: one fills the matrices with sequential numbers, the other with floating-point values. The run took around ~120 minutes, which is extremely slow.
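The CPU baseline boils down to the classic triple loop. A minimal sketch (function and variable names here are illustrative, not necessarily those used in `mat_mul_cpu.cpp`):

```cpp
#include <vector>
#include <cstddef>

// Naive O(N^3) matrix multiplication on the CPU: C = A * B,
// with all matrices stored as flat row-major arrays of size N*N.
std::vector<float> mat_mul_cpu(const std::vector<float>& A,
                               const std::vector<float>& B,
                               std::size_t N) {
    std::vector<float> C(N * N, 0.0f);
    for (std::size_t i = 0; i < N; ++i) {
        for (std::size_t j = 0; j < N; ++j) {
            float sum = 0.0f;
            for (std::size_t k = 0; k < N; ++k)
                sum += A[i * N + k] * B[k * N + j];
            C[i * N + j] = sum;
        }
    }
    return C;
}
```

With N = 10000 this performs on the order of 2×10¹² floating-point operations on a single core, which is why the runtime is measured in hours.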
Description: This file contains a basic implementation of matrix multiplication accelerated on an NVIDIA GPU using CUDA
Approach (Naive):
- Memory Transfer (Host to Device): Input matrices are copied from CPU main memory (RAM) to GPU global memory (VRAM)
- GPU Computation: A simple CUDA kernel performs the matrix multiplication on the GPU
- Memory Transfer (Device to Host): The resulting matrix is copied back from GPU VRAM to CPU RAM
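The GPU computation step can be sketched as a naive kernel with one thread per output element (kernel and variable names are illustrative, not necessarily those in the file):

```cuda
// Naive kernel: each thread computes one element of C = A * B (row-major).
// Every thread reads a full row of A and a full column of B from global
// memory, so there is no data reuse between threads.
__global__ void mat_mul_naive(const float* A, const float* B, float* C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; ++k)
            sum += A[row * N + k] * B[k * N + col];
        C[row * N + col] = sum;
    }
}

// Host-side launch sketch (after cudaMemcpy of A and B to device memory):
//   dim3 block(16, 16);
//   dim3 grid((N + block.x - 1) / block.x, (N + block.y - 1) / block.y);
//   mat_mul_naive<<<grid, block>>>(d_A, d_B, d_C, N);
```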
Considerations: This version implements the fundamental steps of GPU computing but does not incorporate advanced optimization techniques. It primarily demonstrates the data flow and basic kernel execution, without deeply considering overheads like memory latency or thread divergence
Testing: Approximately 12 seconds, reaching about 3% of the GPU's theoretical peak GFLOPS.
This file presents two advanced approaches for GPU matrix multiplication: a custom-optimized CUDA kernel and the use of the NVIDIA CUBLAS library.
Description: This section of the file focuses on a CUDA kernel specifically optimized for better performance than the naive GPU version
Key Optimizations at Kernel Level:
Memory Coalescing: Data is read from global memory in a way that allows threads within a warp to access contiguous memory locations, maximizing memory transaction efficiency
Sequential Addressing: Threads are organized to access memory sequentially where possible, further improving global memory access patterns
Tiling (Shared Memory Usage): The matrices are divided into smaller blocks (tiles). These tiles are loaded into the GPU's fast on-chip shared memory, and computations are then performed on them, significantly reducing reliance on slower global memory and increasing effective memory bandwidth by reusing data within shared memory (tiles of 32 x 32 are used).
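The tiling idea above can be sketched as follows. This is an illustrative version assuming N is a multiple of the 32-wide tile size; the actual kernel may differ in details:

```cuda
#define TILE 32

// Tiled kernel: each 32x32 thread block computes a 32x32 tile of C,
// staging tiles of A and B through on-chip shared memory so each global
// load is reused TILE times instead of once.
__global__ void mat_mul_tiled(const float* A, const float* B, float* C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < N / TILE; ++t) {
        // Coalesced loads: consecutive threads in a warp (threadIdx.x)
        // read consecutive global-memory addresses.
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();  // tile fully loaded before use

        for (int k = 0; k < TILE; ++k)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // done with tile before overwriting it
    }
    C[row * N + col] = sum;
}
```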
Performance Note: Despite these optimizations, this custom kernel was observed to reach only a fraction (about 6%) of the GPU's theoretical maximum compute capability.
CUBLAS: CUBLAS is a highly optimized Basic Linear Algebra Subprograms (BLAS) library provided by NVIDIA. It contains routines for matrix and vector operations that are fine-tuned for NVIDIA GPU architectures (vendor lock-in applies).
Using the CUBLAS `cublasSgemm` function to perform the matrix multiplication, it was found that roughly 48-50% of the GPU's compute capability is utilized. Of course, this is production-grade code, but it is fairly impressive regardless.
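A sketch of the `cublasSgemm` call for this case (function and parameter names around it are illustrative). Note that cuBLAS assumes column-major storage; a common trick for row-major data is to swap the A and B operands, since the column-major result of B x A equals the row-major C = A x B:

```cuda
#include <cublas_v2.h>

// Multiply two N x N row-major matrices already resident in device memory.
// Swapping the operands (B first) makes the column-major result that
// cuBLAS writes come out as the row-major C = A * B.
void mat_mul_cublas(cublasHandle_t handle,
                    const float* d_A, const float* d_B, float* d_C, int N) {
    const float alpha = 1.0f;
    const float beta  = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                N, N, N,
                &alpha,
                d_B, N,   // note: B passed first (column-major trick)
                d_A, N,
                &beta,
                d_C, N);
}
```

The handle is created once with `cublasCreate` and destroyed with `cublasDestroy`; link with `-lcublas`.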
A common metric to evaluate the performance of high-performance computing tasks like matrix multiplication is GFLOPS (Giga Floating-point Operations Per Second).
For a standard multiplication of two $N \times N$ matrices:
- Each element of the resulting $N \times N$ matrix requires $N$ multiplications and $N-1$ additions.
- Total multiplications = $N^2 \times N = N^3$.
- Total additions = $N^2 \times (N-1) = N^3 - N^2$.
- Total operations = $N^3 + (N^3 - N^2) = 2N^3 - N^2$.
Formula to Calculate GFLOPS:

$$\text{GFLOPS} = \frac{\text{Total Operations}}{\text{Execution Time (s)} \times 10^9}$$

So, for matrix multiplication:

$$\text{GFLOPS} = \frac{2N^3 - N^2}{t \times 10^9}$$
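For example, plugging in the numbers from above (N = 10000, roughly 12 seconds for the naive GPU version) gives on the order of 167 GFLOPS. A small helper for the computation (names are illustrative):

```cpp
// GFLOPS achieved by an N x N matrix multiplication that took `seconds`.
// Uses the operation count derived above: 2N^3 - N^2.
double matmul_gflops(double N, double seconds) {
    double ops = 2.0 * N * N * N - N * N;
    return ops / (seconds * 1e9);
}

// e.g. matmul_gflops(10000.0, 12.0) is roughly 166.7 GFLOPS,
// matching the ~3% utilization figure for the naive GPU version
// on a GPU with a peak in the several-TFLOPS range.
```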
- Compile:
  - CPU code (`mat_mul_cpu.cpp`): Use a C++ compiler (e.g., g++).

    ```
    g++ mat_mul_cpu.cpp -o mat_mul_cpu -O3
    ```

  - GPU code (`.cu` files): Use the NVIDIA CUDA Compiler (`nvcc`). (Ensure the CUDA toolkit path is correctly set up in your environment variables.)

    ```
    # For naive GPU version
    nvcc mat_mul_gpu.cu -o mat_mul_gpu

    # For optimized GPU version (including CUBLAS)
    nvcc mat_mul_gpu_opti.cu -o mat_mul_gpu_opti -lcublas
    ```

- Run:

  ```
  ./mat_mul_cpu
  ./mat_mul_gpu
  ./mat_mul_gpu_opti
  ```