
GPU Programming 101 πŸš€

License: MIT CUDA ROCm Docker Examples CI

A comprehensive, hands-on educational project for mastering GPU programming with CUDA and HIP

From beginner fundamentals to production-ready optimization techniques

πŸ“‹ Project Overview

GPU Programming 101 is a complete educational resource for learning modern GPU programming. This project provides:

  • 9 comprehensive modules covering beginner to expert topics
  • 70+ working code examples in both CUDA and HIP
  • Cross-platform support for NVIDIA and AMD GPUs
  • Production-ready development environment with Docker
  • Professional tooling including profilers, debuggers, and CI/CD

Perfect for students, researchers, and developers looking to master GPU computing.

πŸ—οΈ GPU Programming Architecture

Understanding how GPU programming works from high-level code to hardware execution is crucial for effective GPU development. This section provides an overview of the CUDA and HIP/ROCm software-hardware stacks.

Architecture Overview Diagram

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                                APPLICATION LAYER                                    β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  High-Level Code (C++/CUDA/HIP)                                                   β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
β”‚  β”‚   CUDA C++ Code     β”‚    β”‚    HIP C++ Code     β”‚    β”‚   OpenCL/SYCL       β”‚    β”‚
β”‚  β”‚   (.cu files)       β”‚    β”‚   (.hip files)      β”‚    β”‚   (Cross-platform)   β”‚    β”‚
β”‚  β”‚                     β”‚    β”‚                     β”‚    β”‚                     β”‚    β”‚
β”‚  β”‚ __global__ kernels  β”‚    β”‚ __global__ kernels  β”‚    β”‚ kernel functions    β”‚    β”‚
β”‚  β”‚ cudaMalloc()        β”‚    β”‚ hipMalloc()         β”‚    β”‚ clCreateBuffer()    β”‚    β”‚
β”‚  β”‚ cudaMemcpy()        β”‚    β”‚ hipMemcpy()         β”‚    β”‚ clEnqueueNDRange()  β”‚    β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                        β”‚
                                        β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                              COMPILATION LAYER                                     β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  Compiler Frontend                                                                 β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
β”‚  β”‚      NVCC           β”‚    β”‚      HIP Clang      β”‚    β”‚    LLVM/Clang       β”‚    β”‚
β”‚  β”‚  (NVIDIA Compiler)  β”‚    β”‚   (AMD Compiler)    β”‚    β”‚   (Open Standard)   β”‚    β”‚
β”‚  β”‚                     β”‚    β”‚                     β”‚    β”‚                     β”‚    β”‚
β”‚  β”‚ β€’ Parse CUDA syntax β”‚    β”‚ β€’ Parse HIP syntax  β”‚    β”‚ β€’ Parse OpenCL/SYCL β”‚    β”‚
β”‚  β”‚ β€’ Host/Device split β”‚    β”‚ β€’ Host/Device split β”‚    β”‚ β€’ Generate SPIR-V   β”‚    β”‚
β”‚  β”‚ β€’ Generate PTX      β”‚    β”‚ β€’ Generate GCN ASM  β”‚    β”‚ β€’ Target backends   β”‚    β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                        β”‚
                                        β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                           INTERMEDIATE REPRESENTATION                               β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
β”‚  β”‚        PTX          β”‚    β”‚      GCN ASM        β”‚    β”‚      SPIR-V         β”‚    β”‚
β”‚  β”‚ (Parallel Thread    β”‚    β”‚  (Graphics Core     β”‚    β”‚  (Standard Portable β”‚    β”‚
β”‚  β”‚  Execution)         β”‚    β”‚   Next Assembly)    β”‚    β”‚   IR - Vulkan)      β”‚    β”‚
β”‚  β”‚                     β”‚    β”‚                     β”‚    β”‚                     β”‚    β”‚
β”‚  β”‚ β€’ Virtual ISA       β”‚    β”‚ β€’ AMD GPU ISA       β”‚    β”‚ β€’ Cross-platform    β”‚    β”‚
β”‚  β”‚ β€’ Device agnostic   β”‚    β”‚ β€’ RDNA/CDNA arch    β”‚    β”‚ β€’ Vendor neutral    β”‚    β”‚
β”‚  β”‚ β€’ JIT compilation   β”‚    β”‚ β€’ Direct execution  β”‚    β”‚ β€’ Multiple targets  β”‚    β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                        β”‚
                                        β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                               DRIVER LAYER                                         β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
β”‚  β”‚    CUDA Driver      β”‚    β”‚     ROCm Driver     β”‚    β”‚   OpenCL Driver     β”‚    β”‚
β”‚  β”‚                     β”‚    β”‚                     β”‚    β”‚                     β”‚    β”‚
β”‚  β”‚ β€’ PTX β†’ SASS JIT    β”‚    β”‚ β€’ GCN β†’ Machine     β”‚    β”‚ β€’ SPIR-V β†’ Native   β”‚    β”‚
β”‚  β”‚ β€’ Memory management β”‚    β”‚ β€’ Memory management β”‚    β”‚ β€’ Memory management β”‚    β”‚
β”‚  β”‚ β€’ Kernel launch     β”‚    β”‚ β€’ Kernel launch     β”‚    β”‚ β€’ Kernel launch     β”‚    β”‚
β”‚  β”‚ β€’ Context mgmt      β”‚    β”‚ β€’ Context mgmt      β”‚    β”‚ β€’ Context mgmt      β”‚    β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                        β”‚
                                        β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                              HARDWARE LAYER                                        β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                               β”‚
β”‚  β”‚    NVIDIA GPU       β”‚    β”‚      AMD GPU        β”‚                               β”‚
β”‚  β”‚                     β”‚    β”‚                     β”‚                               β”‚
β”‚  β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚    β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
β”‚  β”‚ β”‚   SM (Cores)    β”‚ β”‚    β”‚ β”‚   CU (Cores)    β”‚ β”‚    β”‚   Intel Xe Cores    β”‚    β”‚
β”‚  β”‚ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ β”‚    β”‚ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ β”‚    β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚    β”‚
β”‚  β”‚ β”‚ β”‚FP32 | INT32 β”‚ β”‚ β”‚    β”‚ β”‚ β”‚FP32 | INT32 β”‚ β”‚ β”‚    β”‚ β”‚  Vector Engines β”‚ β”‚    β”‚
β”‚  β”‚ β”‚ β”‚FP64 | BF16  β”‚ β”‚ β”‚    β”‚ β”‚ β”‚FP64 | BF16  β”‚ β”‚ β”‚    β”‚ β”‚  Matrix Engines β”‚ β”‚    β”‚
β”‚  β”‚ β”‚ β”‚Tensor Cores β”‚ β”‚ β”‚    β”‚ β”‚ β”‚Matrix Cores β”‚ β”‚ β”‚    β”‚ β”‚  Ray Tracing    β”‚ β”‚    β”‚
β”‚  β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚    β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚    β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚    β”‚
β”‚  β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚    β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
β”‚  β”‚                     β”‚    β”‚                     β”‚                               β”‚
β”‚  β”‚ Memory Hierarchy:   β”‚    β”‚ Memory Hierarchy:   β”‚    Memory Hierarchy:          β”‚
β”‚  β”‚ β€’ L1 Cache (KB)     β”‚    β”‚ β€’ L1 Cache (KB)     β”‚    β€’ L1 Cache                 β”‚
β”‚  β”‚ β€’ L2 Cache (MB)     β”‚    β”‚ β€’ L2 Cache (MB)     β”‚    β€’ L2 Cache                 β”‚
β”‚  β”‚ β€’ Global Mem (GB)   β”‚    β”‚ β€’ Global Mem (GB)   β”‚    β€’ Global Memory            β”‚
β”‚  β”‚ β€’ Shared Memory     β”‚    β”‚ β€’ LDS (Local Data   β”‚    β€’ Shared Local Memory      β”‚
β”‚  β”‚ β€’ Constant Memory   β”‚    β”‚   Store)            β”‚    β€’ Constant Memory          β”‚
β”‚  β”‚ β€’ Texture Memory    β”‚    β”‚ β€’ Constant Memory   β”‚                               β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                               β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Compilation Pipeline Deep Dive

1. Source Code β†’ Frontend Parsing

  • CUDA: NVCC separates host (CPU) and device (GPU) code, parses CUDA extensions
  • HIP: Clang-based compiler with HIP runtime API that maps to either CUDA or ROCm
  • OpenCL/SYCL: LLVM-based compilation with cross-platform intermediate representation

2. Frontend β†’ Intermediate Representation

High-Level Code                    Intermediate Form
─────────────────                 ───────────────────
__global__ void kernel()    β†’     PTX (NVIDIA)
{                                 GCN Assembly (AMD)  
    int id = threadIdx.x;         SPIR-V (OpenCL/Vulkan)
    output[id] = input[id] * 2;   LLVM IR (SYCL)
}

3. Runtime Compilation & Optimization

  • NVIDIA: PTX β†’ SASS (GPU-specific machine code) via JIT compilation
  • AMD: GCN Assembly β†’ GPU microcode via ROCm runtime
  • Optimizations: Register allocation, memory coalescing, instruction scheduling

4. Hardware Execution Model

Abstraction Level   NVIDIA Term         AMD Term                 Description
─────────────────   ───────────         ────────                 ───────────
Thread              Thread              Work-item                Single execution unit
Thread Group        Warp (32 threads)   Wavefront (64 threads)   SIMD execution group
Thread Block        Block               Work-group               Shared memory + synchronization
Grid                Grid                NDRange                  Collection of all thread blocks
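
In source code these abstractions surface as built-in index variables. The pattern below is identical in CUDA and HIP (hipcc accepts the same syntax):

__global__ void addOne(float *x, int n) {
    // global index = block offset + thread offset within the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)          // guard: the grid may cover more threads than elements
        x[i] += 1.0f;
}

// Launch with enough blocks to cover all n elements, e.g.:
// addOne<<<(n + 255) / 256, 256>>>(d_x, n);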

5. Memory Architecture Mapping

Programming Model              Hardware Implementation
─────────────────              ─────────────────────────
Global Memory        β†’         GPU DRAM (HBM/GDDR)
Shared Memory        β†’         On-chip SRAM (48-164KB per SM/CU)
Local Memory         β†’         GPU DRAM (spilled registers)
Constant Memory      β†’         Cached read-only GPU DRAM
Texture Memory       β†’         Cached GPU DRAM with interpolation
Registers            β†’         On-chip register file (32K-64K per SM/CU)
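
These spaces correspond to declaration qualifiers in CUDA/HIP source. A short sketch (illustrative, not from the modules):

__constant__ float coeffs[16];   // constant memory: cached, read-only in kernels
                                 // (filled from the host via cudaMemcpyToSymbol)

__global__ void scaleByCoeff(const float *in, float *out, int n) {
    __shared__ float tile[256];              // shared memory: on-chip, per-block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (i < n) ? in[i] : 0.0f;        // in/out: global memory; v: a register
    tile[threadIdx.x] = v;
    __syncthreads();                         // every thread reaches the barrier
    if (i < n) out[i] = tile[threadIdx.x] * coeffs[0];
}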

Performance Implications

Understanding this architecture helps optimize GPU code:

  1. Memory Coalescing: Access patterns that align with hardware memory buses (see the sketch after this list)
  2. Occupancy: Balancing registers, shared memory, and thread blocks per SM/CU
  3. Divergence: Minimizing different execution paths within warps/wavefronts
  4. Latency Hiding: Using enough threads to hide memory access latency
  5. Memory Hierarchy: Optimal use of each memory type based on access patterns
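
To make the coalescing point concrete, compare two copy kernels (a sketch; the actual bandwidth gap depends on the GPU):

// Coalesced: consecutive threads read consecutive addresses β†’ few wide transactions
__global__ void copyCoalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: consecutive threads read addresses `stride` elements apart β†’ many transactions
__global__ void copyStrided(const float *in, float *out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[(size_t)i * stride % n];   // scattered reads, for illustration
}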

This architectural knowledge is essential for writing efficient GPU code and is covered progressively throughout our modules.

✨ Key Features

Feature                  Description
───────                  ───────────
🎯 Complete Curriculum   9 progressive modules from basics to advanced topics
πŸ’» Cross-Platform        Full CUDA and HIP support for NVIDIA and AMD GPUs
🐳 Docker Ready          Complete containerized development environment
πŸ”§ Production Quality    Professional build systems, testing, and profiling
πŸ“Š Performance Focus     Optimization techniques and benchmarking throughout
🌐 Community Driven      Open source with comprehensive contribution guidelines

πŸš€ Quick Start

Option 1: Docker (Recommended)

Get started immediately without installing CUDA/ROCm on your host system:

# Clone the repository
git clone https://github.com/AIComputing101/gpu-programming-101.git
cd gpu-programming-101

# Auto-detect your GPU and start development environment
./docker/scripts/run.sh --auto

# Inside container: verify GPU access and start learning
/workspace/test-gpu.sh
cd modules/module1 && make && ./01_vector_addition_cuda

Option 2: Native Installation

For direct system installation:

# Prerequisites: CUDA 11.0+ or ROCm 5.0+, GCC 7+, Make

# Clone and build
git clone https://github.com/AIComputing101/gpu-programming-101.git
cd gpu-programming-101

# Verify your setup
make check-system

# Build and run first example
make module1
cd modules/module1/examples
./01_vector_addition_cuda

🎯 Learning Path

Choose your track based on your experience level:

πŸ‘Ά Beginner Track (Modules 1-3) - GPU fundamentals, memory management, first kernels
πŸ”₯ Intermediate Track (Modules 4-5) - Advanced programming, performance optimization
πŸš€ Advanced Track (Modules 6-9) - Parallel algorithms, domain applications, production deployment

Each track builds on the previous one, so start with the appropriate level for your background.

πŸ“š Modules

Our comprehensive curriculum progresses from fundamental concepts to production-ready optimization techniques:

Module     Level             Duration   Focus Area                Key Topics                            Examples
──────     ─────             ────────   ──────────                ──────────                            ────────
Module 1   πŸ‘Ά Beginner       4-6h       GPU Fundamentals          Architecture, Memory, First Kernels   13
Module 2   πŸ‘Άβ†’πŸ”₯             6-8h       Memory Optimization       Coalescing, Shared Memory, Texture    10
Module 3   πŸ”₯ Intermediate   6-8h       Execution Models          Warps, Occupancy, Synchronization     12
Module 4   πŸ”₯β†’πŸš€             8-10h      Advanced Programming      Streams, Multi-GPU, Unified Memory    9
Module 5   πŸš€ Advanced       6-8h       Performance Engineering   Profiling, Bottleneck Analysis        5
Module 6   πŸš€ Advanced       8-10h      Parallel Algorithms       Reduction, Scan, Convolution          10
Module 7   πŸš€ Expert         8-10h      Algorithmic Patterns      Sorting, Graph Algorithms             4
Module 8   πŸš€ Expert         10-12h     Domain Applications       ML, Scientific Computing              4
Module 9   πŸš€ Expert         6-8h       Production Deployment     Libraries, Integration, Scaling       4
πŸ“ˆ Progressive Learning Path: 70+ Examples β€’ 50+ Hours β€’ Beginner to Expert

Learning Progression

Module 1: Hello GPU World          Module 6: Parallel Algorithms
    ↓                                 ↓
Module 2: Memory Mastery          Module 7: Advanced Patterns  
    ↓                                 ↓
Module 3: Execution Deep Dive     Module 8: Real Applications
    ↓                                 ↓
Module 4: Advanced Features       Module 9: Production Ready
    ↓                             
Module 5: Performance Tuning     

πŸ“š View All Modules β†’

πŸ› οΈ Prerequisites

Hardware Requirements

NVIDIA GPU Systems

  • Minimum GPU: GTX 1060 6GB, GTX 1650, RTX 2060 or better
  • Recommended GPU: RTX 3070/4070 (12GB+), RTX 3080/4080 (16GB+)
  • Professional/Advanced: RTX 4090 (24GB), RTX A6000 (48GB), Tesla/Quadro series
  • Architecture Support: Maxwell, Pascal, Volta, Turing, Ampere, Ada Lovelace, Hopper
  • Compute Capability: 5.0+ (Maxwell architecture or newer)

AMD GPU Systems

  • Minimum GPU: RX 580 8GB, RX 6600, RX 7600 or better
  • Recommended GPU: RX 6700 XT/7700 XT (12GB+), RX 6800 XT/7800 XT (16GB+)
  • Professional/Advanced: RX 7900 XTX (24GB), Radeon PRO W7800 (48GB), Instinct MI series
  • Architecture Support: RDNA2, RDNA3, RDNA4, GCN 5.0+, CDNA series
  • ROCm Compatibility: Officially supported AMD GPUs only

System Memory & CPU

  • Minimum RAM: 16GB system RAM
  • Recommended RAM: 32GB+ for advanced modules and multi-GPU setups
  • Professional Setup: 64GB+ for large-scale scientific computing
  • CPU Requirements:
    • Intel: Haswell (2013) or newer for PCIe atomics support
    • AMD: Zen 1 (2017) or newer for PCIe atomics support
  • Storage: 20GB+ free space for Docker containers and examples

Software Requirements

Operating System Support

  • Linux (Recommended): Ubuntu 22.04 LTS, RHEL 8/9, SLES 15 SP5
  • Windows: Windows 10/11 with WSL2 recommended for optimal compatibility
  • macOS: macOS 12+ (Metal Performance Shaders for basic GPU compute)

GPU Computing Platforms

  • CUDA Toolkit: 12.0+ (Docker uses CUDA 12.9.1)
    • Driver Requirements:
      • Linux: 550.54.14+ for CUDA 12.4+
      • Windows: 551.61+ for CUDA 12.4+
  • ROCm Platform: 6.0+ (Docker uses ROCm 6.4.3)
    • Driver Requirements: Latest AMDGPU-PRO or open-source AMDGPU drivers
    • Kernel Support: Linux kernel 5.4+ recommended

Development Environment

  • Compilers:
    • GCC: 9.0+ (GCC 11+ recommended for C++17 features)
    • Clang: 10.0+ (Clang 14+ recommended)
    • MSVC: 2019+ (2022 17.10+ for CUDA 12.4+ support)
  • Build Tools: Make 4.0+, CMake 3.18+ (optional)
  • Docker: 20.10+ with GPU runtime support (nvidia-container-toolkit or ROCm containers)

Additional Tools (Included in Docker)

  • Profiling: Nsight Compute, Nsight Systems (NVIDIA), rocprof (AMD)
  • Debugging: cuda-gdb, rocgdb, compute-sanitizer
  • Libraries: cuBLAS, cuFFT, rocBLAS, rocFFT (for advanced modules)

Performance Expectations by Hardware Tier

Hardware Tier   Example GPUs               VRAM    Expected Performance     Suitable Modules
─────────────   ────────────               ────    ────────────────────     ────────────────
Entry Level     GTX 1060 6GB, RX 580 8GB   6-8GB   10-50x CPU speedup       Modules 1-3
Mid-Range       RTX 3060 Ti, RX 6700 XT    12GB    50-200x CPU speedup      Modules 1-6
High-End        RTX 4070 Ti, RX 7800 XT    16GB    100-500x CPU speedup     All modules
Professional    RTX 4090, RX 7900 XTX      24GB    200-1000x+ CPU speedup   All modules + research

Programming Knowledge

  • C/C++: Intermediate level (pointers, memory management, basic templates)
  • Parallel Programming: Basic understanding of threads and synchronization helpful
  • Command Line: Comfortable with terminal/shell operations
  • Mathematics: Linear algebra and calculus basics beneficial for advanced modules
  • Version Control: Basic Git knowledge for contributing

Network Requirements (Docker Setup)

  • Internet Connection: Required for initial Docker image downloads (~8GB total)
  • Bandwidth: 50+ Mbps recommended for efficient container downloads
  • Storage: Additional 20GB for Docker images and build cache

🐳 Docker Development

Experience the full development environment with zero setup:

# Build development containers
./docker/scripts/build.sh --all

# Start interactive development
./docker/scripts/run.sh cuda    # For NVIDIA GPUs
./docker/scripts/run.sh rocm    # For AMD GPUs
./docker/scripts/run.sh --auto  # Auto-detect GPU type

Docker Benefits:

  • 🎯 Zero host configuration required
  • πŸ”§ Complete development environment (compilers, debuggers, profilers)
  • 🌐 Cross-platform testing (test your code on both CUDA and HIP)
  • πŸ“¦ Isolated and reproducible builds
  • 🧹 Easy cleanup when done

πŸ“– Complete Docker Guide β†’

πŸ”§ Build System

Project-Wide Commands

make all           # Build all modules
make test          # Run comprehensive tests  
make clean         # Clean all artifacts
make check-system  # Verify GPU setup
make status        # Show module completion status

Module-Specific Commands

cd modules/module1/examples
make               # Build all examples in module
make test          # Run module tests
make profile       # Performance profiling
make debug         # Debug builds with extra checks

Performance Expectations

Module Level   Typical GPU Speedup   Memory Efficiency   Code Quality
────────────   ───────────────────   ─────────────────   ────────────
Beginner       10-100x               60-80%              Educational
Intermediate   50-500x               80-95%              Optimized
Advanced       100-1000x             85-95%              Production
Expert         500-5000x             95%+                Library-Quality

πŸ› Troubleshooting

Common Issues & Solutions

GPU Not Detected

# NVIDIA
nvidia-smi  # Should show your GPU
export PATH=/usr/local/cuda/bin:$PATH

# AMD  
rocm-smi   # Should show your GPU
export HIP_PLATFORM=amd

Compilation Errors

# Check CUDA installation
nvcc --version
make check-cuda

# Check HIP installation  
hipcc --version
make check-hip

Docker Issues

# Test Docker GPU access
./docker/scripts/test.sh

# Rebuild containers
./docker/scripts/build.sh --clean --all

πŸ“– Documentation

Document          Description
────────          ───────────
README.md         Main project documentation and getting started guide
CONTRIBUTING.md   How to contribute to the project
Docker Guide      Complete Docker setup and usage
Module READMEs    Individual module documentation

🀝 Contributing

We welcome contributions from the community! This project thrives on:

  • πŸ“ New Examples: Implementing additional GPU algorithms
  • πŸ› Bug Fixes: Improving existing code and documentation
  • πŸ“š Documentation: Enhancing explanations and tutorials
  • πŸ”§ Optimizations: Performance improvements and best practices
  • 🌐 Platform Support: Cross-platform compatibility improvements

πŸ“– Contributing Guidelines β†’ β€’ πŸ› Report Issues β†’ β€’ πŸ’‘ Request Features β†’

πŸ† Community & Support

  • 🌟 Star this project if you find it helpful!
  • πŸ› Report bugs using our issue templates
  • πŸ’¬ Join discussions in GitHub Discussions
  • πŸ“§ Get help from the community and maintainers

πŸ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.

TL;DR: βœ… Commercial use βœ… Modification βœ… Distribution βœ… Private use

πŸ“š Citation

If you use this project in your research, education, or publications, please cite it as:

BibTeX

@misc{gpu-programming-101,
  title={GPU Programming 101: A Comprehensive Educational Project for CUDA and HIP},
  author={Shao, Stephen},
  year={2025},
  howpublished={\url{https://github.com/AIComputing101/gpu-programming-101}},
  note={A complete GPU programming educational resource with 70+ production-ready examples covering fundamentals through advanced optimization techniques for NVIDIA CUDA and AMD HIP platforms}
}

IEEE Format

Stephen Shao, "GPU Programming 101: A Comprehensive Educational Project for CUDA and HIP," GitHub, 2025. [Online]. Available: https://github.com/AIComputing101/gpu-programming-101

πŸ™ Acknowledgments

  • 🎯 NVIDIA and AMD for excellent GPU computing ecosystems
  • πŸ“š GPU computing community for sharing knowledge and best practices
  • 🏫 Educational institutions advancing parallel computing education
  • πŸ‘₯ Contributors who make this project better every day

Ready to unlock the power of GPU computing?

πŸš€ Get Started Now β€’ πŸ“š View Modules β€’ 🐳 Try Docker


⭐ Star this project β€’ 🍴 Fork and contribute β€’ πŸ“’ Share with others

Built with ❀️ by the AI Computing 101 community
