
Optimizing GPU Utilization in Deep Learning Training

🚀 Project Overview

This repository provides a comprehensive exploration of GPU optimization techniques for PyTorch models, focusing on improving training efficiency and performance. By implementing and comparing various optimization strategies, the project offers practical insights into enhancing deep learning training workflows.

📋 Table of Contents

  1. Hardware Specifications
  2. Project Structure
  3. Installation
  4. Usage
  5. Optimizations Implemented
  6. Performance Considerations
  7. Experimental Results
  8. References
  9. License
  10. Contributing

💻 Hardware Specifications

Experimental Environment:

  • GPU: NVIDIA RTX 3050 Ti (4GB VRAM)
  • CPU: Intel Core i5-11400H
  • RAM: 16GB

🗂 Project Structure

Notebooks

  • Optimizing_GPU_Utilization.ipynb: contains a full explanation of each optimization implemented.

Key Scripts

  • no_optimization.py: Baseline implementation without optimizations
  • tensorFloat32.py: TensorFloat-32 (TF32) precision optimization
  • brainFloat16.py: BFloat16 precision optimization
  • torch_compile.py: Torch JIT compilation optimization
  • flash_attention.py: FlashAttention implementation
  • fused_optimizer.py: Fused optimizer optimization
  • 8-bit_optimizer.py: 8-bit Adam optimizer for reduced memory usage
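
The scripts themselves are not reproduced here; as a rough, hypothetical sketch (the toy model, random data, and CSV layout below are illustrative assumptions, not the repository's actual Utils API), each script follows a pattern roughly like this:

# Hypothetical skeleton of an experiment script such as no_optimization.py.
# The toy model, random data, and CSV layout are illustrative assumptions.
import argparse, csv, os, time
import torch

parser = argparse.ArgumentParser()
parser.add_argument("--steps", type=int, required=True)
parser.add_argument("--batch_size", type=int, required=True)
parser.add_argument("--prefix", type=str, required=True)
args = parser.parse_args()

device = "cuda"
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 10)
).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

x = torch.randn(args.batch_size, 1024, device=device)
y = torch.randint(0, 10, (args.batch_size,), device=device)

step_times = []
for step in range(args.steps):
    torch.cuda.synchronize()                 # start timing only after prior GPU work is done
    t0 = time.perf_counter()
    optimizer.zero_grad(set_to_none=True)
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step()
    torch.cuda.synchronize()                 # wait for this step's kernels before stopping the clock
    step_times.append((time.perf_counter() - t0) * 1000)

os.makedirs(args.prefix, exist_ok=True)
with open(os.path.join(args.prefix, "results.csv"), "a", newline="") as f:
    csv.writer(f).writerow(["no_optimization", sum(step_times) / len(step_times)])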

Utility Components

  • Utils/: Contains model and data setup utilities
  • Makefile: Automation script for running experiments
  • requirements.txt: Project dependencies

🔧 Installation

Prerequisites

  • Python 3.12+
  • CUDA-enabled GPU
  • pip package manager

Dependencies Installation

pip install -r requirements.txt

🚀 Usage

This project includes a Makefile that simplifies running experiments and generating comparisons.

Mandatory Parameters

When running experiments, you must specify three mandatory parameters:

  • STEPS=n: Number of training steps to perform
  • BATCH_SIZE=b: Size of each training batch
  • PREFIX=path: Output directory for results and plots

Running Individual Optimization Techniques

make baseline STEPS=50 BATCH_SIZE=256 PREFIX=./out         # No optimization
make tf32 STEPS=50 BATCH_SIZE=256 PREFIX=./out             # TensorFloat32
make bf16 STEPS=50 BATCH_SIZE=256 PREFIX=./out             # BrainFloat16
make torch_compile STEPS=50 BATCH_SIZE=256 PREFIX=./out    # Torch Compile
make flash STEPS=50 BATCH_SIZE=256 PREFIX=./out            # FlashAttention
make fused STEPS=50 BATCH_SIZE=256 PREFIX=./out            # Fused Optimizer
make 8bit STEPS=50 BATCH_SIZE=256 PREFIX=./out             # 8-bit Optimizer

Generate Comparison Plots

After running one or more experiments:

make plots STEPS=50 BATCH_SIZE=256 PREFIX=./out

Running All Optimizations and Generating Comparison Plots

make all STEPS=50 BATCH_SIZE=256 PREFIX=./out

Additional Commands

make help
make reset            # Reset results file and plots
make clean            # Remove generated files
make init_results     # Initialize the results.csv file at the path given by `RESULTS_FILE`

🔬 Optimizations Implemented

  1. No Optimization: Baseline implementation
  2. TensorFloat-32 (TF32):
    • Faster matrix multiplications on tensor cores at slightly reduced precision
    • Good balance of performance and accuracy
  3. BrainFloat16 (BF16):
    • Reduced memory usage
    • Faster training on supported hardware
  4. Torch Compile:
    • Just-in-time (JIT) compilation
    • Reduced overhead
  5. FlashAttention:
    • Efficient attention mechanism
    • Improved performance for transformer models
  6. Fused Optimizer:
    • Reduced GPU kernel launches
    • Enhanced training efficiency
  7. 8-bit Optimizer:
    • Reduced memory footprint
    • Potential training speed improvement
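
Each technique is explained in detail in the notebook; the snippet below is a minimal illustrative sketch (not taken from the repository's scripts) of how each one is typically enabled in PyTorch 2.x. The 8-bit optimizer line assumes the bitsandbytes package.

# Illustrative examples of enabling each technique in PyTorch 2.x
# (not copied from the repository's scripts).
import torch
import torch.nn.functional as F

model = torch.nn.Linear(1024, 1024).cuda()
x = torch.randn(256, 1024, device="cuda")

# 2. TensorFloat-32: route FP32 matmuls through TF32 tensor cores (Ampere or newer).
torch.set_float32_matmul_precision("high")

# 3. BrainFloat16: run forward/backward math in BF16 via autocast.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    out = model(x)

# 4. Torch Compile: JIT-compile the model into fused kernels (one-time compilation cost).
compiled_model = torch.compile(model)
out = compiled_model(x)                      # first call triggers compilation

# 5. FlashAttention: scaled_dot_product_attention dispatches to a flash kernel
#    when the hardware, dtype, and shapes support it.
q = k = v = torch.randn(8, 16, 128, 64, device="cuda", dtype=torch.bfloat16)
attn = F.scaled_dot_product_attention(q, k, v)

# 6. Fused optimizer: a single fused CUDA kernel performs the parameter update.
fused_opt = torch.optim.AdamW(model.parameters(), lr=3e-4, fused=True)

# 7. 8-bit optimizer: quantized optimizer states (assumes the bitsandbytes package).
# import bitsandbytes as bnb
# opt_8bit = bnb.optim.AdamW8bit(model.parameters(), lr=3e-4)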

📊 Performance Considerations

  • Choose optimization techniques based on your specific hardware and model architecture
  • Some techniques may have compilation overhead
  • Performance gains vary depending on model complexity and hardware
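
Before committing to a technique, it can help to check what the GPU actually supports; a small illustrative check (not part of the repository's scripts) might look like:

# Illustrative hardware-capability check (not part of the repository's scripts).
import torch

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability()
    print(f"Compute capability: {major}.{minor}")
    print("TF32 tensor cores (Ampere or newer):", major >= 8)
    print("BF16 supported:", torch.cuda.is_bf16_supported())
    print("Flash SDPA kernel enabled:", torch.backends.cuda.flash_sdp_enabled())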

📈 Experimental Results

Mean Relative Speedup Comparison

The following plot shows the mean relative speedup comparison for different optimization techniques compared to the baseline (no optimization). These results were generated using a batch size of 256 and 150 training steps. This plot helps in visualizing the performance gains achieved by each optimization method.

[Plot: mean relative speedup of each technique over the no-optimization baseline]

By combining BF16, Torch Compile, FlashAttention, and the Fused Optimizer, I was able to reduce the average iteration time from 472.88 ms (no optimization) to 159.66 ms, a roughly 3× speedup (excluding compilation steps).

📚 References

📄 License

This project is licensed under the MIT License. See LICENSE for details.

🤝 Contributing

Contributions are welcome! Please submit pull requests or open issues to discuss potential improvements.
