A comprehensive, hands-on educational project for mastering GPU programming with CUDA and HIP
From beginner fundamentals to production-ready optimization techniques
- 📋 Project Overview
- 🏗️ GPU Programming Architecture
- ✨ Key Features
- 🚀 Quick Start
- 🎯 Learning Path
- 📚 Modules
- 🛠️ Prerequisites
- 🐳 Docker Development
- 🔧 Build System
- 📊 Performance Expectations
- 🔍 Troubleshooting
- 📖 Documentation
- 🤝 Contributing
- 📄 License
GPU Programming 101 is a complete educational resource for learning modern GPU programming. This project provides:
- 9 comprehensive modules covering beginner to expert topics
- 70+ working code examples in both CUDA and HIP
- Cross-platform support for NVIDIA and AMD GPUs
- Production-ready development environment with Docker
- Professional tooling including profilers, debuggers, and CI/CD
Perfect for students, researchers, and developers looking to master GPU computing.
Understanding how GPU programming works, from high-level code down to hardware execution, is crucial for effective GPU development. This section provides an overview of the CUDA and HIP/ROCm software-hardware stack.
APPLICATION LAYER: High-Level Code (C++/CUDA/HIP)

  CUDA C++ (.cu files)        HIP C++ (.hip files)        OpenCL/SYCL (cross-platform)
  __global__ kernels          __global__ kernels          kernel functions
  cudaMalloc()                hipMalloc()                 clCreateBuffer()
  cudaMemcpy()                hipMemcpy()                 clEnqueueNDRangeKernel()

                                   ▼

COMPILATION LAYER: Compiler Frontends

  NVCC (NVIDIA compiler)      HIP Clang (AMD compiler)    LLVM/Clang (open standard)
  • Parse CUDA syntax         • Parse HIP syntax          • Parse OpenCL/SYCL
  • Host/device split         • Host/device split         • Generate SPIR-V
  • Generate PTX              • Generate GCN ASM          • Target backends

                                   ▼

INTERMEDIATE REPRESENTATION

  PTX (Parallel Thread        GCN ASM (Graphics Core      SPIR-V (Standard Portable
  Execution)                  Next assembly)              IR, Vulkan)
  • Virtual ISA               • AMD GPU ISA               • Cross-platform
  • Device agnostic           • RDNA/CDNA architectures   • Vendor neutral
  • JIT compilation           • Direct execution          • Multiple targets

                                   ▼

DRIVER LAYER

  CUDA driver                 ROCm driver                 OpenCL driver
  • PTX → SASS JIT            • GCN → machine code        • SPIR-V → native
  • Memory management         • Memory management         • Memory management
  • Kernel launch             • Kernel launch             • Kernel launch
  • Context management        • Context management        • Context management

                                   ▼

HARDWARE LAYER

  NVIDIA GPU                  AMD GPU                     Intel GPU
  SMs (cores):                CUs (cores):                Xe cores:
  • FP32 | INT32              • FP32 | INT32              • Vector engines
  • FP64 | BF16               • FP64 | BF16               • Matrix engines
  • Tensor Cores              • Matrix Cores              • Ray tracing

  Memory hierarchy:           Memory hierarchy:           Memory hierarchy:
  • L1 cache (KB)             • L1 cache (KB)             • L1 cache
  • L2 cache (MB)             • L2 cache (MB)             • L2 cache
  • Global memory (GB)        • Global memory (GB)        • Global memory
  • Shared memory             • LDS (Local Data Store)    • Shared local memory
  • Constant memory           • Constant memory           • Constant memory
  • Texture memory
- CUDA: NVCC separates host (CPU) and device (GPU) code and parses CUDA extensions
- HIP: Clang-based compiler with HIP runtime API that maps to either CUDA or ROCm
- OpenCL/SYCL: LLVM-based compilation with cross-platform intermediate representation
High-Level Code                       Intermediate Form
───────────────                       ─────────────────
__global__ void kernel()         →    PTX (NVIDIA)
{                                     GCN Assembly (AMD)
    int id = threadIdx.x;             SPIR-V (OpenCL/Vulkan)
    output[id] = input[id] * 2;       LLVM IR (SYCL)
}
- NVIDIA: PTX → SASS (GPU-specific machine code) via JIT compilation
- AMD: GCN Assembly → GPU microcode via ROCm runtime
- Optimizations: Register allocation, memory coalescing, instruction scheduling
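To make the pipeline concrete, here is a minimal, self-contained sketch of the kind of program the stack above compiles: NVCC splits the file into host and device code, lowers the kernel to PTX, and the driver JIT-compiles that PTX to SASS for the installed GPU (hipcc follows the same shape, targeting GCN). Error handling is omitted and the names are illustrative.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Device code: compiled to PTX by NVCC, then JIT-compiled to SASS by the driver.
__global__ void doubleElements(const float *input, float *output, int n) {
    int id = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (id < n) output[id] = input[id] * 2.0f;       // guard against overrun
}

// Host code: compiled for the CPU; launches the kernel through the driver.
int main() {
    const int n = 1 << 20;
    float *input, *output;
    cudaMallocManaged(&input, n * sizeof(float));    // unified memory, for brevity
    cudaMallocManaged(&output, n * sizeof(float));
    for (int i = 0; i < n; ++i) input[i] = float(i);

    doubleElements<<<(n + 255) / 256, 256>>>(input, output, n);
    cudaDeviceSynchronize();                         // wait for the GPU to finish

    printf("output[2] = %.1f\n", output[2]);         // expect 4.0
    cudaFree(input);
    cudaFree(output);
    return 0;
}
```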
Abstraction Level | NVIDIA Term | AMD Term | Description |
---|---|---|---|
Thread | Thread | Work-item | Single execution unit |
Thread Group | Warp (32 threads) | Wavefront (64 threads) | SIMD execution group |
Thread Block | Block | Work-group | Shared memory + synchronization |
Grid | Grid | NDRange | Collection of all thread blocks |
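Inside a kernel, this hierarchy surfaces as built-in index variables. A minimal sketch (the same source compiles with hipcc, where a block is a work-group and the SIMD group is a 64-thread wavefront rather than a 32-thread warp):

```cuda
// Sketch: how the thread hierarchy maps to index math inside a kernel.
__global__ void hierarchyDemo(float *data, int n) {
    int localId  = threadIdx.x;                      // position within the block
    int blockId  = blockIdx.x;                       // position within the grid
    int globalId = blockId * blockDim.x + localId;   // unique index across the grid

    int lane = localId % 32;                         // lane within a 32-thread warp
    if (globalId < n) data[globalId] = (float)lane;  // e.g. record each lane id
}
```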
Programming Model           Hardware Implementation
─────────────────           ───────────────────────
Global Memory          →    GPU DRAM (HBM/GDDR)
Shared Memory          →    On-chip SRAM (48-164KB per SM/CU)
Local Memory           →    GPU DRAM (spilled registers)
Constant Memory        →    Cached read-only GPU DRAM
Texture Memory         →    Cached GPU DRAM with interpolation
Registers              →    On-chip register file (32K-64K per SM/CU)
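A short sketch of how these spaces appear in source code: local scalars land in registers, `__shared__` allocates on-chip per-block memory, `__constant__` lives in cached read-only DRAM, and pointer arguments address global memory. The host-side `cudaMemcpyToSymbol` call in the trailing comment is how the constant would typically be set.

```cuda
__constant__ float scale;                     // constant memory: cached, read-only

__global__ void scaleWithTile(const float *in, float *out, int n) {
    __shared__ float tile[256];               // shared memory: on-chip, per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // 'i' lives in a register

    if (i < n) tile[threadIdx.x] = in[i];     // global -> shared
    __syncthreads();                          // block-wide barrier

    if (i < n) out[i] = tile[threadIdx.x] * scale;  // shared -> global, scaled
}
// Host side (sketch): cudaMemcpyToSymbol(scale, &hostScale, sizeof(float));
```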
Understanding this architecture helps optimize GPU code:
- Memory Coalescing: Access patterns that align with hardware memory buses (illustrated in the sketch below)
- Occupancy: Balancing registers, shared memory, and thread blocks per SM/CU
- Divergence: Minimizing different execution paths within warps/wavefronts
- Latency Hiding: Using enough threads to hide memory access latency
- Memory Hierarchy: Optimal use of each memory type based on access patterns
This architectural knowledge is essential for writing efficient GPU code and is covered progressively throughout our modules.
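As an illustration of the coalescing point, the sketch below contrasts a copy kernel whose warps read contiguous spans of memory (a few wide transactions per warp) with a strided variant that scatters each warp's loads across many cache lines; the stride pattern is purely illustrative.

```cuda
// Coalesced: adjacent threads read adjacent floats, so each warp's loads
// merge into a few wide memory transactions.
__global__ void copyCoalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: adjacent threads read far-apart addresses, so each warp's loads
// touch many separate cache lines and waste bandwidth.
__global__ void copyStrided(const float *in, float *out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[(i * stride) % n];
}
```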
Feature | Description |
---|---|
🎯 Complete Curriculum | 9 progressive modules from basics to advanced topics |
💻 Cross-Platform | Full CUDA and HIP support for NVIDIA and AMD GPUs |
🐳 Docker Ready | Complete containerized development environment |
🔧 Production Quality | Professional build systems, testing, and profiling |
📊 Performance Focus | Optimization techniques and benchmarking throughout |
🌍 Community Driven | Open source with comprehensive contribution guidelines |
Get started immediately without installing CUDA/ROCm on your host system:
# Clone the repository
git clone https://github.com/AIComputing101/gpu-programming-101.git
cd gpu-programming-101
# Auto-detect your GPU and start development environment
./docker/scripts/run.sh --auto
# Inside container: verify GPU access and start learning
/workspace/test-gpu.sh
cd modules/module1 && make && ./01_vector_addition_cuda
For direct system installation:
# Prerequisites: CUDA 11.0+ or ROCm 5.0+, GCC 7+, Make
# Clone and build
git clone https://github.com/AIComputing101/gpu-programming-101.git
cd gpu-programming-101
# Verify your setup
make check-system
# Build and run first example
make module1
cd modules/module1/examples
./01_vector_addition_cuda
Choose your track based on your experience level:
👶 Beginner Track (Modules 1-3) - GPU fundamentals, memory management, first kernels
🔥 Intermediate Track (Modules 4-5) - Advanced programming, performance optimization
🚀 Advanced Track (Modules 6-9) - Parallel algorithms, domain applications, production deployment
Each track builds on the previous one, so start with the appropriate level for your background.
Our comprehensive curriculum progresses from fundamental concepts to production-ready optimization techniques:
Module | Level | Duration | Focus Area | Key Topics | Examples |
---|---|---|---|---|---|
Module 1 | 👶 Beginner | 4-6h | GPU Fundamentals | Architecture, Memory, First Kernels | 13 |
Module 2 | 👶→🔥 | 6-8h | Memory Optimization | Coalescing, Shared Memory, Texture | 10 |
Module 3 | 🔥 Intermediate | 6-8h | Execution Models | Warps, Occupancy, Synchronization | 12 |
Module 4 | 🔥→🚀 | 8-10h | Advanced Programming | Streams, Multi-GPU, Unified Memory | 9 |
Module 5 | 🚀 Advanced | 6-8h | Performance Engineering | Profiling, Bottleneck Analysis | 5 |
Module 6 | 🚀 Advanced | 8-10h | Parallel Algorithms | Reduction, Scan, Convolution | 10 |
Module 7 | 🎓 Expert | 8-10h | Algorithmic Patterns | Sorting, Graph Algorithms | 4 |
Module 8 | 🎓 Expert | 10-12h | Domain Applications | ML, Scientific Computing | 4 |
Module 9 | 🎓 Expert | 6-8h | Production Deployment | Libraries, Integration, Scaling | 4 |
🚀 Progressive Learning Path: 70+ Examples • 50+ Hours • Beginner to Expert
Module 1: Hello GPU World         Module 6: Parallel Algorithms
            ▼                                 ▼
Module 2: Memory Mastery          Module 7: Advanced Patterns
            ▼                                 ▼
Module 3: Execution Deep Dive     Module 8: Real Applications
            ▼                                 ▼
Module 4: Advanced Features       Module 9: Production Ready
            ▼
Module 5: Performance Tuning
NVIDIA GPUs:
- Minimum GPU: GTX 1060 6GB, GTX 1650, RTX 2060 or better
- Recommended GPU: RTX 3070/4070 (12GB+), RTX 3080/4080 (16GB+)
- Professional/Advanced: RTX 4090 (24GB), RTX A6000 (48GB), Tesla/Quadro series
- Architecture Support: Maxwell, Pascal, Volta, Turing, Ampere, Ada Lovelace, Hopper
- Compute Capability: 5.0+ (Maxwell architecture or newer)
AMD GPUs:
- Minimum GPU: RX 580 8GB, RX 6600, RX 7600 or better
- Recommended GPU: RX 6700 XT/7700 XT (12GB+), RX 6800 XT/7800 XT (16GB+)
- Professional/Advanced: RX 7900 XTX (24GB), Radeon PRO W7800 (48GB), Instinct MI series
- Architecture Support: RDNA2, RDNA3, RDNA4, GCN 5.0+, CDNA series
- ROCm Compatibility: Officially supported AMD GPUs only
System requirements:
- Minimum RAM: 16GB system RAM
- Recommended RAM: 32GB+ for advanced modules and multi-GPU setups
- Professional Setup: 64GB+ for large-scale scientific computing
- CPU Requirements:
  - Intel: Haswell (2013) or newer for PCIe atomics support
  - AMD: Zen 1 (2017) or newer for PCIe atomics support
- Storage: 20GB+ free space for Docker containers and examples
Operating systems:
- Linux (Recommended): Ubuntu 22.04 LTS, RHEL 8/9, SLES 15 SP5
- Windows: Windows 10/11 with WSL2 recommended for optimal compatibility
- macOS: macOS 12+ (Metal Performance Shaders for basic GPU compute)
NVIDIA software:
- CUDA Toolkit: 12.0+ (Docker uses CUDA 12.9.1)
- Driver Requirements:
  - Linux: 550.54.14+ for CUDA 12.4+
  - Windows: 551.61+ for CUDA 12.4+
AMD software:
- ROCm Platform: 6.0+ (Docker uses ROCm 6.4.3)
- Driver Requirements: Latest AMDGPU-PRO or open-source AMDGPU drivers
- Kernel Support: Linux kernel 5.4+ recommended
Development tools:
- Compilers:
  - GCC: 9.0+ (GCC 11+ recommended for C++17 features)
  - Clang: 10.0+ (Clang 14+ recommended)
  - MSVC: 2019+ (2022 17.10+ for CUDA 12.4+ support)
- Build Tools: Make 4.0+, CMake 3.18+ (optional)
- Docker: 20.10+ with GPU runtime support (nvidia-container-toolkit or ROCm containers)
- Profiling: Nsight Compute, Nsight Systems (NVIDIA), rocprof (AMD)
- Debugging: cuda-gdb, rocgdb, compute-sanitizer
- Libraries: cuBLAS, cuFFT, rocBLAS, rocFFT (for advanced modules)
Hardware Tier | Example GPUs | VRAM | Expected Performance | Suitable Modules |
---|---|---|---|---|
Entry Level | GTX 1060 6GB, RX 580 8GB | 6-8GB | 10-50x CPU speedup | Modules 1-3 |
Mid-Range | RTX 3060 Ti, RX 6700 XT | 12GB | 50-200x CPU speedup | Modules 1-6 |
High-End | RTX 4070 Ti, RX 7800 XT | 16GB | 100-500x CPU speedup | All modules |
Professional | RTX 4090, RX 7900 XTX | 24GB | 200-1000x+ CPU speedup | All modules + research |
Recommended background:
- C/C++: Intermediate level (pointers, memory management, basic templates)
- Parallel Programming: Basic understanding of threads and synchronization helpful
- Command Line: Comfortable with terminal/shell operations
- Mathematics: Linear algebra and calculus basics beneficial for advanced modules
- Version Control: Basic Git knowledge for contributing
Network requirements:
- Internet Connection: Required for initial Docker image downloads (~8GB total)
- Bandwidth: 50+ Mbps recommended for efficient container downloads
- Storage: Additional 20GB for Docker images and build cache
Experience the full development environment with zero setup:
# Build development containers
./docker/scripts/build.sh --all
# Start interactive development
./docker/scripts/run.sh cuda # For NVIDIA GPUs
./docker/scripts/run.sh rocm # For AMD GPUs
./docker/scripts/run.sh --auto # Auto-detect GPU type
Docker Benefits:
- 🎯 Zero host configuration required
- 🔧 Complete development environment (compilers, debuggers, profilers)
- 🔄 Cross-platform testing (test your code on both CUDA and HIP)
- 📦 Isolated and reproducible builds
- 🧹 Easy cleanup when done
📖 Complete Docker Guide →
make all # Build all modules
make test # Run comprehensive tests
make clean # Clean all artifacts
make check-system # Verify GPU setup
make status # Show module completion status
cd modules/module1/examples
make # Build all examples in module
make test # Run module tests
make profile # Performance profiling
make debug # Debug builds with extra checks
Module Level | Typical GPU Speedup | Memory Efficiency | Code Quality |
---|---|---|---|
Beginner | 10-100x | 60-80% | Educational |
Intermediate | 50-500x | 80-95% | Optimized |
Advanced | 100-1000x | 85-95% | Production |
Expert | 500-5000x | 95%+ | Library-Quality |
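Speedups in this range depend heavily on the workload, the data size, and the CPU baseline, so measure rather than assume. Below is a minimal sketch of the usual methodology, timing GPU work with CUDA events; the `launch` callback is a placeholder for whatever kernel is being benchmarked.

```cuda
#include <cuda_runtime.h>

// Minimal sketch: time enqueued GPU work with CUDA events.
float timeKernelMs(void (*launch)()) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);                    // mark start on the default stream
    launch();                                  // enqueue the kernel(s) to time
    cudaEventRecord(stop);                     // mark stop after the work
    cudaEventSynchronize(stop);                // block until the GPU reaches stop
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);    // GPU-side elapsed milliseconds
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```

The CPU baseline is timed with an ordinary wall clock around an equivalent serial loop, and the speedup is reported as the ratio of the two times.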
GPU Not Detected
# NVIDIA
nvidia-smi # Should show your GPU
export PATH=/usr/local/cuda/bin:$PATH
# AMD
rocm-smi # Should show your GPU
export HIP_PLATFORM=amd
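If the tools above print nothing useful, a minimal device query can confirm whether the runtime itself sees a GPU. A sketch (compile with `nvcc query.cu`; the equivalent works with `hipcc` after renaming the `cuda*` calls to `hip*`):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);  // does the runtime see a GPU?
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        printf("Device %d: %s (compute %d.%d)\n",
               d, prop.name, prop.major, prop.minor);
    }
    return 0;
}
```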
Compilation Errors
# Check CUDA installation
nvcc --version
make check-cuda
# Check HIP installation
hipcc --version
make check-hip
Docker Issues
# Test Docker GPU access
./docker/scripts/test.sh
# Rebuild containers
./docker/scripts/build.sh --clean --all
Document | Description |
---|---|
README.md | Main project documentation and getting started guide |
CONTRIBUTING.md | How to contribute to the project |
Docker Guide | Complete Docker setup and usage |
Module READMEs | Individual module documentation |
We welcome contributions from the community! This project thrives on:
- 🆕 New Examples: Implementing additional GPU algorithms
- 🐛 Bug Fixes: Improving existing code and documentation
- 📝 Documentation: Enhancing explanations and tutorials
- 🔧 Optimizations: Performance improvements and best practices
- 🌍 Platform Support: Cross-platform compatibility improvements
📖 Contributing Guidelines → • 🐛 Report Issues → • 💡 Request Features →
- 🌟 Star this project if you find it helpful!
- 🐛 Report bugs using our issue templates
- 💬 Join discussions in GitHub Discussions
- 📧 Get help from the community and maintainers
This project is licensed under the MIT License - see the LICENSE file for details.
TL;DR: ✅ Commercial use ✅ Modification ✅ Distribution ✅ Private use
If you use this project in your research, education, or publications, please cite it as:
@misc{gpu-programming-101,
title={GPU Programming 101: A Comprehensive Educational Project for CUDA and HIP},
author={{Stephen Shao}},
year={2025},
howpublished={\url{https://github.com/AIComputing101/gpu-programming-101}},
note={A complete GPU programming educational resource with 70+ production-ready examples covering fundamentals through advanced optimization techniques for NVIDIA CUDA and AMD HIP platforms}
}
Stephen Shao, "GPU Programming 101: A Comprehensive Educational Project for CUDA and HIP," GitHub, 2025. [Online]. Available: https://github.com/AIComputing101/gpu-programming-101
- 🎯 NVIDIA and AMD for excellent GPU computing ecosystems
- 🌍 GPU computing community for sharing knowledge and best practices
- 🏫 Educational institutions advancing parallel computing education
- 👥 Contributors who make this project better every day
Ready to unlock the power of GPU computing?
🚀 Get Started Now • 📚 View Modules • 🐳 Try Docker
⭐ Star this project • 🍴 Fork and contribute • 📢 Share with others
Built with ❤️ for the AI Computing 101 community