
GPU Programming 101 πŸš€

License: MIT CUDA ROCm Docker Examples CI

A comprehensive, hands-on educational project for mastering GPU programming with CUDA and HIP

From beginner fundamentals to production-ready optimization techniques

πŸ“‹ Project Overview

GPU Programming 101 is a complete educational resource for learning modern GPU programming. This project provides:

  • 9 comprehensive modules covering beginner to expert topics
  • 70+ working code examples in both CUDA and HIP
  • Cross-platform support for NVIDIA and AMD GPUs
  • Production-ready development environment with Docker
  • Professional tooling including profilers, debuggers, and CI/CD

Perfect for students, researchers, and developers looking to master GPU computing.

πŸ—οΈ GPU Programming Architecture

Understanding how GPU programming works from high-level code to hardware execution is crucial for effective GPU development. This section provides an overview of the CUDA and HIP/ROCm software-hardware stacks.

Architecture Overview Diagram

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                                APPLICATION LAYER                                    β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  High-Level Code (C++/CUDA/HIP)                                                   β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
β”‚  β”‚   CUDA C++ Code     β”‚    β”‚    HIP C++ Code     β”‚    β”‚   OpenCL/SYCL       β”‚    β”‚
β”‚  β”‚   (.cu files)       β”‚    β”‚   (.hip files)      β”‚    β”‚   (Cross-platform)   β”‚    β”‚
β”‚  β”‚                     β”‚    β”‚                     β”‚    β”‚                     β”‚    β”‚
β”‚  β”‚ __global__ kernels  β”‚    β”‚ __global__ kernels  β”‚    β”‚ kernel functions    β”‚    β”‚
β”‚  β”‚ cudaMalloc()        β”‚    β”‚ hipMalloc()         β”‚    β”‚ clCreateBuffer()    β”‚    β”‚
β”‚  β”‚ cudaMemcpy()        β”‚    β”‚ hipMemcpy()         β”‚    β”‚ clEnqueueNDRange()  β”‚    β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                        β”‚
                                        β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                              COMPILATION LAYER                                     β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  Compiler Frontend                                                                 β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
β”‚  β”‚      NVCC           β”‚    β”‚      HIP Clang      β”‚    β”‚    LLVM/Clang       β”‚    β”‚
β”‚  β”‚  (NVIDIA Compiler)  β”‚    β”‚   (AMD Compiler)    β”‚    β”‚   (Open Standard)   β”‚    β”‚
β”‚  β”‚                     β”‚    β”‚                     β”‚    β”‚                     β”‚    β”‚
β”‚  β”‚ β€’ Parse CUDA syntax β”‚    β”‚ β€’ Parse HIP syntax  β”‚    β”‚ β€’ Parse OpenCL/SYCL β”‚    β”‚
β”‚  β”‚ β€’ Host/Device split β”‚    β”‚ β€’ Host/Device split β”‚    β”‚ β€’ Generate SPIR-V   β”‚    β”‚
β”‚  β”‚ β€’ Generate PTX      β”‚    β”‚ β€’ Generate GCN ASM  β”‚    β”‚ β€’ Target backends   β”‚    β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                        β”‚
                                        β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                           INTERMEDIATE REPRESENTATION                               β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
β”‚  β”‚        PTX          β”‚    β”‚      GCN ASM        β”‚    β”‚      SPIR-V         β”‚    β”‚
β”‚  β”‚ (Parallel Thread    β”‚    β”‚  (Graphics Core     β”‚    β”‚  (Standard Portable β”‚    β”‚
β”‚  β”‚  Execution)         β”‚    β”‚   Next Assembly)    β”‚    β”‚   IR - Vulkan)      β”‚    β”‚
β”‚  β”‚                     β”‚    β”‚                     β”‚    β”‚                     β”‚    β”‚
β”‚  β”‚ β€’ Virtual ISA       β”‚    β”‚ β€’ AMD GPU ISA       β”‚    β”‚ β€’ Cross-platform    β”‚    β”‚
β”‚  β”‚ β€’ Device agnostic   β”‚    β”‚ β€’ RDNA/CDNA arch    β”‚    β”‚ β€’ Vendor neutral    β”‚    β”‚
β”‚  β”‚ β€’ JIT compilation   β”‚    β”‚ β€’ Direct execution  β”‚    β”‚ β€’ Multiple targets  β”‚    β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                        β”‚
                                        β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                               DRIVER LAYER                                         β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
β”‚  β”‚    CUDA Driver      β”‚    β”‚     ROCm Driver     β”‚    β”‚   OpenCL Driver     β”‚    β”‚
β”‚  β”‚                     β”‚    β”‚                     β”‚    β”‚                     β”‚    β”‚
β”‚  β”‚ β€’ PTX β†’ SASS JIT    β”‚    β”‚ β€’ GCN β†’ Machine     β”‚    β”‚ β€’ SPIR-V β†’ Native   β”‚    β”‚
β”‚  β”‚ β€’ Memory management β”‚    β”‚ β€’ Memory management β”‚    β”‚ β€’ Memory management β”‚    β”‚
β”‚  β”‚ β€’ Kernel launch     β”‚    β”‚ β€’ Kernel launch     β”‚    β”‚ β€’ Kernel launch     β”‚    β”‚
β”‚  β”‚ β€’ Context mgmt      β”‚    β”‚ β€’ Context mgmt      β”‚    β”‚ β€’ Context mgmt      β”‚    β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                        β”‚
                                        β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                              HARDWARE LAYER                                        β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                               β”‚
β”‚  β”‚    NVIDIA GPU       β”‚    β”‚      AMD GPU        β”‚                               β”‚
β”‚  β”‚                     β”‚    β”‚                     β”‚                               β”‚
β”‚  β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚    β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
β”‚  β”‚ β”‚   SM (Cores)    β”‚ β”‚    β”‚ β”‚   CU (Cores)    β”‚ β”‚    β”‚   Intel Xe Cores    β”‚    β”‚
β”‚  β”‚ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ β”‚    β”‚ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ β”‚    β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚    β”‚
β”‚  β”‚ β”‚ β”‚FP32 | INT32 β”‚ β”‚ β”‚    β”‚ β”‚ β”‚FP32 | INT32 β”‚ β”‚ β”‚    β”‚ β”‚  Vector Engines β”‚ β”‚    β”‚
β”‚  β”‚ β”‚ β”‚FP64 | BF16  β”‚ β”‚ β”‚    β”‚ β”‚ β”‚FP64 | BF16  β”‚ β”‚ β”‚    β”‚ β”‚  Matrix Engines β”‚ β”‚    β”‚
β”‚  β”‚ β”‚ β”‚Tensor Cores β”‚ β”‚ β”‚    β”‚ β”‚ β”‚Matrix Cores β”‚ β”‚ β”‚    β”‚ β”‚  Ray Tracing    β”‚ β”‚    β”‚
β”‚  β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚    β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚    β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚    β”‚
β”‚  β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚    β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
β”‚  β”‚                     β”‚    β”‚                     β”‚                               β”‚
β”‚  β”‚ Memory Hierarchy:   β”‚    β”‚ Memory Hierarchy:   β”‚    Memory Hierarchy:          β”‚
β”‚  β”‚ β€’ L1 Cache (KB)     β”‚    β”‚ β€’ L1 Cache (KB)     β”‚    β€’ L1 Cache                 β”‚
β”‚  β”‚ β€’ L2 Cache (MB)     β”‚    β”‚ β€’ L2 Cache (MB)     β”‚    β€’ L2 Cache                 β”‚
β”‚  β”‚ β€’ Global Mem (GB)   β”‚    β”‚ β€’ Global Mem (GB)   β”‚    β€’ Global Memory            β”‚
β”‚  β”‚ β€’ Shared Memory     β”‚    β”‚ β€’ LDS (Local Data   β”‚    β€’ Shared Local Memory      β”‚
β”‚  β”‚ β€’ Constant Memory   β”‚    β”‚   Store)            β”‚    β€’ Constant Memory          β”‚
β”‚  β”‚ β€’ Texture Memory    β”‚    β”‚ β€’ Constant Memory   β”‚                               β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                               β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Compilation Pipeline Deep Dive

1. Source Code β†’ Frontend Parsing

  • CUDA: NVCC separates host (CPU) and device (GPU) code, parses CUDA extensions
  • HIP: Clang-based compiler with HIP runtime API that maps to either CUDA or ROCm
  • OpenCL/SYCL: LLVM-based compilation with cross-platform intermediate representation

2. Frontend β†’ Intermediate Representation

High-Level Code                    Intermediate Form
─────────────────                 ───────────────────
__global__ void kernel()    β†’     PTX (NVIDIA)
{                                 GCN Assembly (AMD)  
    int id = threadIdx.x;         SPIR-V (OpenCL/Vulkan)
    output[id] = input[id] * 2;   LLVM IR (SYCL)
}

3. Runtime Compilation & Optimization

  • NVIDIA: PTX β†’ SASS (GPU-specific machine code) via JIT compilation
  • AMD: GCN Assembly β†’ GPU microcode via ROCm runtime
  • Optimizations: Register allocation, memory coalescing, instruction scheduling

4. Hardware Execution Model

Abstraction Level   NVIDIA Term         AMD Term                 Description
─────────────────   ───────────         ────────                 ───────────
Thread              Thread              Work-item                Single execution unit
Thread Group        Warp (32 threads)   Wavefront (64 threads)   SIMD execution group
Thread Block        Block               Work-group               Shared memory + synchronization
Grid                Grid                NDRange                  Collection of all thread blocks
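
In source code these abstractions surface as built-in index variables. The pattern below is identical in CUDA and HIP (hipcc accepts the same syntax):

__global__ void addOne(float *x, int n) {
    // global index = block offset + thread offset within the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)          // guard: the grid may cover more threads than elements
        x[i] += 1.0f;
}

// Launch with enough blocks to cover all n elements, e.g.:
// addOne<<<(n + 255) / 256, 256>>>(d_x, n);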

5. Memory Architecture Mapping

Programming Model              Hardware Implementation
─────────────────              ─────────────────────────
Global Memory        β†’         GPU DRAM (HBM/GDDR)
Shared Memory        β†’         On-chip SRAM (48-164KB per SM/CU)
Local Memory         β†’         GPU DRAM (spilled registers)
Constant Memory      β†’         Cached read-only GPU DRAM
Texture Memory       β†’         Cached GPU DRAM with interpolation
Registers            β†’         On-chip register file (32K-64K per SM/CU)
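
These spaces correspond to declaration qualifiers in CUDA/HIP source. A short sketch (illustrative, not from the modules):

__constant__ float coeffs[16];   // constant memory: cached, read-only in kernels
                                 // (filled from the host via cudaMemcpyToSymbol)

__global__ void scaleByCoeff(const float *in, float *out, int n) {
    __shared__ float tile[256];              // shared memory: on-chip, per-block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (i < n) ? in[i] : 0.0f;        // in/out: global memory; v: a register
    tile[threadIdx.x] = v;
    __syncthreads();                         // every thread reaches the barrier
    if (i < n) out[i] = tile[threadIdx.x] * coeffs[0];
}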

Performance Implications

Understanding this architecture helps optimize GPU code:

  1. Memory Coalescing: Access patterns that align with hardware memory buses (see the sketch after this list)
  2. Occupancy: Balancing registers, shared memory, and thread blocks per SM/CU
  3. Divergence: Minimizing different execution paths within warps/wavefronts
  4. Latency Hiding: Using enough threads to hide memory access latency
  5. Memory Hierarchy: Optimal use of each memory type based on access patterns
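
To make the coalescing point concrete, compare two copy kernels (a sketch; the actual bandwidth gap depends on the GPU):

// Coalesced: consecutive threads read consecutive addresses β†’ few wide transactions
__global__ void copyCoalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: consecutive threads read addresses `stride` elements apart β†’ many transactions
__global__ void copyStrided(const float *in, float *out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[(size_t)i * stride % n];   // scattered reads, for illustration
}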

This architectural knowledge is essential for writing efficient GPU code and is covered progressively throughout our modules.

✨ Key Features

Feature                  Description
───────                  ───────────
🎯 Complete Curriculum   9 progressive modules from basics to advanced topics
πŸ’» Cross-Platform        Full CUDA and HIP support for NVIDIA and AMD GPUs
🐳 Docker Ready          Complete containerized development environment
πŸ”§ Production Quality    Professional build systems, testing, and profiling
πŸ“Š Performance Focus     Optimization techniques and benchmarking throughout
🌐 Community Driven      Open source with comprehensive contribution guidelines

πŸš€ Quick Start

Option 1: Docker (Recommended)

Get started immediately without installing CUDA/ROCm on your host system:

# Clone the repository
git clone https://github.com/AIComputing101/gpu-programming-101.git
cd gpu-programming-101

# Auto-detect your GPU and start development environment
./docker/scripts/run.sh --auto

# Inside container: verify GPU access and start learning
/workspace/test-gpu.sh
cd modules/module1 && make && ./01_vector_addition_cuda

Option 2: Native Installation

For direct system installation:

# Prerequisites: CUDA 11.0+ or ROCm 5.0+, GCC 7+, Make

# Clone and build
git clone https://github.com/AIComputing101/gpu-programming-101.git
cd gpu-programming-101

# Verify your setup
make check-system

# Build and run first example
make module1
cd modules/module1/examples
./01_vector_addition_cuda

🎯 Learning Path

Choose your track based on your experience level:

πŸ‘Ά Beginner Track (Modules 1-3) - GPU fundamentals, memory management, first kernels
πŸ”₯ Intermediate Track (Modules 4-5) - Advanced programming, performance optimization
πŸš€ Advanced Track (Modules 6-9) - Parallel algorithms, domain applications, production deployment

Each track builds on the previous one, so start with the appropriate level for your background.

πŸ“š Modules

Our comprehensive curriculum progresses from fundamental concepts to production-ready optimization techniques:

Module     Level             Duration   Focus Area                Key Topics                            Examples
──────     ─────             ────────   ──────────                ──────────                            ────────
Module 1   πŸ‘Ά Beginner       4-6h       GPU Fundamentals          Architecture, Memory, First Kernels   13
Module 2   πŸ‘Άβ†’πŸ”₯             6-8h       Memory Optimization       Coalescing, Shared Memory, Texture    10
Module 3   πŸ”₯ Intermediate   6-8h       Execution Models          Warps, Occupancy, Synchronization     12
Module 4   πŸ”₯β†’πŸš€             8-10h      Advanced Programming      Streams, Multi-GPU, Unified Memory    9
Module 5   πŸš€ Advanced       6-8h       Performance Engineering   Profiling, Bottleneck Analysis        5
Module 6   πŸš€ Advanced       8-10h      Parallel Algorithms       Reduction, Scan, Convolution          10
Module 7   πŸš€ Expert         8-10h      Algorithmic Patterns      Sorting, Graph Algorithms             4
Module 8   πŸš€ Expert         10-12h     Domain Applications       ML, Scientific Computing              4
Module 9   πŸš€ Expert         6-8h       Production Deployment     Libraries, Integration, Scaling       4
πŸ“ˆ Progressive Learning Path: 70+ Examples β€’ 50+ Hours β€’ Beginner to Expert

Learning Progression

Module 1: Hello GPU World          Module 6: Parallel Algorithms
    ↓                                 ↓
Module 2: Memory Mastery          Module 7: Advanced Patterns  
    ↓                                 ↓
Module 3: Execution Deep Dive     Module 8: Real Applications
    ↓                                 ↓
Module 4: Advanced Features       Module 9: Production Ready
    ↓                             
Module 5: Performance Tuning     

πŸ“š View All Modules β†’

πŸ› οΈ Prerequisites

Hardware Requirements

NVIDIA GPU Systems

  • Minimum GPU: GTX 1060 6GB, GTX 1650, RTX 2060 or better
  • Recommended GPU: RTX 3070/4070 (12GB+), RTX 3080/4080 (16GB+)
  • Professional/Advanced: RTX 4090 (24GB), RTX A6000 (48GB), Tesla/Quadro series
  • Architecture Support: Maxwell, Pascal, Volta, Turing, Ampere, Ada Lovelace, Hopper
  • Compute Capability: 5.0+ (Maxwell architecture or newer)

AMD GPU Systems

  • Minimum GPU: RX 580 8GB, RX 6600, RX 7600 or better
  • Recommended GPU: RX 6700 XT/7700 XT (12GB+), RX 6800 XT/7800 XT (16GB+)
  • Professional/Advanced: RX 7900 XTX (24GB), Radeon PRO W7800 (48GB), Instinct MI series
  • Architecture Support: RDNA2, RDNA3, RDNA4, GCN 5.0+, CDNA series
  • ROCm Compatibility: Officially supported AMD GPUs only

System Memory & CPU

  • Minimum RAM: 16GB system RAM
  • Recommended RAM: 32GB+ for advanced modules and multi-GPU setups
  • Professional Setup: 64GB+ for large-scale scientific computing
  • CPU Requirements:
    • Intel: Haswell (2013) or newer for PCIe atomics support
    • AMD: Zen 1 (2017) or newer for PCIe atomics support
  • Storage: 20GB+ free space for Docker containers and examples

Software Requirements

Operating System Support

  • Linux (Recommended): Ubuntu 22.04 LTS, RHEL 8/9, SLES 15 SP5
  • Windows: Windows 10/11 with WSL2 recommended for optimal compatibility
  • macOS: macOS 12+ (Metal Performance Shaders for basic GPU compute)

GPU Computing Platforms

  • CUDA Toolkit: 12.0+ (Docker uses CUDA 12.9.1)
    • Driver Requirements:
      • Linux: 550.54.14+ for CUDA 12.4+
      • Windows: 551.61+ for CUDA 12.4+
  • ROCm Platform: 6.0+ (Docker uses ROCm 6.4.3)
    • Driver Requirements: Latest AMDGPU-PRO or open-source AMDGPU drivers
    • Kernel Support: Linux kernel 5.4+ recommended

Development Environment

  • Compilers:
    • GCC: 9.0+ (GCC 11+ recommended for C++17 features)
    • Clang: 10.0+ (Clang 14+ recommended)
    • MSVC: 2019+ (2022 17.10+ for CUDA 12.4+ support)
  • Build Tools: Make 4.0+, CMake 3.18+ (optional)
  • Docker: 20.10+ with GPU runtime support (nvidia-container-toolkit or ROCm containers)

Additional Tools (Included in Docker)

  • Profiling: Nsight Compute, Nsight Systems (NVIDIA), rocprof (AMD)
  • Debugging: cuda-gdb, rocgdb, compute-sanitizer
  • Libraries: cuBLAS, cuFFT, rocBLAS, rocFFT (for advanced modules)

Performance Expectations by Hardware Tier

Hardware Tier   Example GPUs               VRAM    Expected Performance     Suitable Modules
─────────────   ────────────               ────    ────────────────────     ────────────────
Entry Level     GTX 1060 6GB, RX 580 8GB   6-8GB   10-50x CPU speedup       Modules 1-3
Mid-Range       RTX 3060 Ti, RX 6700 XT    12GB    50-200x CPU speedup      Modules 1-6
High-End        RTX 4070 Ti, RX 7800 XT    16GB    100-500x CPU speedup     All modules
Professional    RTX 4090, RX 7900 XTX      24GB    200-1000x+ CPU speedup   All modules + research

Programming Knowledge

  • C/C++: Intermediate level (pointers, memory management, basic templates)
  • Parallel Programming: Basic understanding of threads and synchronization helpful
  • Command Line: Comfortable with terminal/shell operations
  • Mathematics: Linear algebra and calculus basics beneficial for advanced modules
  • Version Control: Basic Git knowledge for contributing

Network Requirements (Docker Setup)

  • Internet Connection: Required for initial Docker image downloads (~8GB total)
  • Bandwidth: 50+ Mbps recommended for efficient container downloads
  • Storage: Additional 20GB for Docker images and build cache

🐳 Docker Development

Experience the full development environment with zero setup:

# Build development containers
./docker/scripts/build.sh --all

# Start interactive development
./docker/scripts/run.sh cuda    # For NVIDIA GPUs
./docker/scripts/run.sh rocm    # For AMD GPUs
./docker/scripts/run.sh --auto  # Auto-detect GPU type

Docker Benefits:

  • 🎯 Zero host configuration required
  • πŸ”§ Complete development environment (compilers, debuggers, profilers)
  • 🌐 Cross-platform testing (test your code on both CUDA and HIP)
  • πŸ“¦ Isolated and reproducible builds
  • 🧹 Easy cleanup when done

πŸ“– Complete Docker Guide β†’

πŸ”§ Build System

Project-Wide Commands

make all           # Build all modules
make test          # Run comprehensive tests  
make clean         # Clean all artifacts
make check-system  # Verify GPU setup
make status        # Show module completion status

Module-Specific Commands

cd modules/module1/examples
make               # Build all examples in module
make test          # Run module tests
make profile       # Performance profiling
make debug         # Debug builds with extra checks

Performance Expectations

Module Level   Typical GPU Speedup   Memory Efficiency   Code Quality
────────────   ───────────────────   ─────────────────   ────────────
Beginner       10-100x               60-80%              Educational
Intermediate   50-500x               80-95%              Optimized
Advanced       100-1000x             85-95%              Production
Expert         500-5000x             95%+                Library-Quality

πŸ› Troubleshooting

Common Issues & Solutions

GPU Not Detected

# NVIDIA
nvidia-smi  # Should show your GPU
export PATH=/usr/local/cuda/bin:$PATH

# AMD  
rocm-smi   # Should show your GPU
export HIP_PLATFORM=amd

Compilation Errors

# Check CUDA installation
nvcc --version
make check-cuda

# Check HIP installation  
hipcc --version
make check-hip

Docker Issues

# Test Docker GPU access
./docker/scripts/test.sh

# Rebuild containers
./docker/scripts/build.sh --clean --all

πŸ“– Documentation

Document          Description
────────          ───────────
README.md         Main project documentation and getting started guide
CONTRIBUTING.md   How to contribute to the project
Docker Guide      Complete Docker setup and usage
Module READMEs    Individual module documentation

🀝 Contributing

We welcome contributions from the community! This project thrives on:

  • πŸ“ New Examples: Implementing additional GPU algorithms
  • πŸ› Bug Fixes: Improving existing code and documentation
  • πŸ“š Documentation: Enhancing explanations and tutorials
  • πŸ”§ Optimizations: Performance improvements and best practices
  • 🌐 Platform Support: Cross-platform compatibility improvements

πŸ“– Contributing Guidelines β†’ β€’ πŸ› Report Issues β†’ β€’ πŸ’‘ Request Features β†’

πŸ† Community & Support

  • 🌟 Star this project if you find it helpful!
  • πŸ› Report bugs using our issue templates
  • πŸ’¬ Join discussions in GitHub Discussions
  • πŸ“§ Get help from the community and maintainers

πŸ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.

TL;DR: βœ… Commercial use βœ… Modification βœ… Distribution βœ… Private use

πŸ“š Citation

If you use this project in your research, education, or publications, please cite it as:

BibTeX

@misc{gpu-programming-101,
  title={GPU Programming 101: A Comprehensive Educational Project for CUDA and HIP},
  author={Shao, Stephen},
  year={2025},
  howpublished={\url{https://github.com/AIComputing101/gpu-programming-101}},
  note={A complete GPU programming educational resource with 70+ production-ready examples covering fundamentals through advanced optimization techniques for NVIDIA CUDA and AMD HIP platforms}
}

IEEE Format

Stephen Shao, "GPU Programming 101: A Comprehensive Educational Project for CUDA and HIP," GitHub, 2025. [Online]. Available: https://github.com/AIComputing101/gpu-programming-101

πŸ™ Acknowledgments

  • 🎯 NVIDIA and AMD for excellent GPU computing ecosystems
  • πŸ“š GPU computing community for sharing knowledge and best practices
  • 🏫 Educational institutions advancing parallel computing education
  • πŸ‘₯ Contributors who make this project better every day

Ready to unlock the power of GPU computing?

πŸš€ Get Started Now β€’ πŸ“š View Modules β€’ 🐳 Try Docker


⭐ Star this project β€’ 🍴 Fork and contribute β€’ πŸ“’ Share with others

Built with ❀️ by the AI Computing 101 community
