This repository provides a comprehensive exploration of GPU optimization techniques for PyTorch models, focusing on improving training efficiency and performance. By implementing and comparing various optimization strategies, the project offers practical insights into enhancing deep learning training workflows.
- Hardware Specifications
- Project Structure
- Installation
- Usage
- Optimization Techniques
- Performance Considerations
- Experimental Results
- References
- License
- Contributing
Experimental Environment:
- GPU: NVIDIA RTX 3050 Ti (4GB VRAM)
- CPU: Intel Core i5-11400H
- RAM: 16GB
- `Optimizing_GPU_Utilization.ipynb`: Contains a full explanation for each optimization implemented
- `no_optimization.py`: Baseline implementation without optimizations
- `tensorFloat32.py`: TensorFloat-32 (TF32) precision optimization
- `brainFloat16.py`: BFloat16 precision optimization
- `torch_compile.py`: Torch JIT compilation optimization
- `flash_attention.py`: FlashAttention implementation
- `fused_optimizer.py`: Fused optimizer optimization
- `8-bit_optimizer.py`: 8-bit Adam optimizer for reduced memory usage
- `Utils/`: Contains model and data setup utilities
- `Makefile`: Automation script for running experiments
- `requirements.txt`: Project dependencies
- Python 3.12+
- CUDA-enabled GPU
- pip package manager
pip install -r requirements.txt
This project includes a Makefile that simplifies running experiments and generating comparisons.
When running experiments, you must specify three mandatory parameters:
- `STEPS=n`: Number of training steps to perform
- `BATCH_SIZE=b`: Size of each training batch
- `PREFIX=path`: Output directory for results and plots
make baseline STEPS=50 BATCH_SIZE=256 PREFIX=./out # No optimization
make tf32 STEPS=50 BATCH_SIZE=256 PREFIX=./out # TensorFloat32
make bf16 STEPS=50 BATCH_SIZE=256 PREFIX=./out # BrainFloat16
make torch_compile STEPS=50 BATCH_SIZE=256 PREFIX=./out # Torch Compile
make flash STEPS=50 BATCH_SIZE=256 PREFIX=./out # FlashAttention
make fused STEPS=50 BATCH_SIZE=256 PREFIX=./out # Fused Optimizer
make 8bit STEPS=50 BATCH_SIZE=256 PREFIX=./out # 8-bit Optimizer
After running one or more experiments:
make plots STEPS=50 BATCH_SIZE=256 PREFIX=./out
make all STEPS=50 BATCH_SIZE=256 PREFIX=./out
make help
make reset # Reset results file and plots
make clean # Remove generated files
make init_results # Initialize the results.csv file at the path given by `RESULTS_FILE`
- No Optimization: Baseline implementation
- TensorFloat-32 (TF32):
  - Faster matrix multiplications on Ampere and newer GPUs
  - Small, usually negligible precision trade-off
- BrainFloat16 (BF16):
  - Reduced memory usage
  - Faster training on supported hardware
- Torch Compile:
  - Just-in-time (JIT) compilation
  - Reduced Python and kernel-launch overhead
- FlashAttention:
  - Efficient attention mechanism
  - Improved performance for transformer models
- Fused Optimizer:
  - Fewer GPU kernel launches
  - Enhanced training efficiency
- 8-bit Optimizer:
  - Reduced memory footprint for optimizer state
  - Potential training speed improvement (see the sketch after this list for how each technique is typically enabled)
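The snippet below is a minimal, illustrative sketch of how these techniques are usually switched on in plain PyTorch; it is not the code from this repository's scripts. The toy model, dimensions, and learning rate are placeholders, and the 8-bit optimizer line assumes the `bitsandbytes` package is installed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

device = "cuda"

# TensorFloat-32: allow TF32 tensor cores for float32 matmuls (Ampere+ GPUs).
torch.set_float32_matmul_precision("high")

class ToyAttentionModel(nn.Module):
    """Toy causal self-attention model, only here to illustrate the optimizations."""
    def __init__(self, vocab=1000, dim=256, heads=4):
        super().__init__()
        self.heads = heads
        self.embed = nn.Embedding(vocab, dim)
        self.qkv = nn.Linear(dim, 3 * dim)
        self.head = nn.Linear(dim, vocab)

    def forward(self, idx):
        x = self.embed(idx)                      # (B, T, C)
        b, t, c = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (z.view(b, t, self.heads, c // self.heads).transpose(1, 2)
                   for z in (q, k, v))
        # FlashAttention: the fused kernel is selected automatically when available.
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.head(y.transpose(1, 2).reshape(b, t, c))

model = ToyAttentionModel().to(device)

# Torch Compile: JIT-compile the model to cut Python and kernel-launch overhead.
model = torch.compile(model)

# Fused optimizer: performs the parameter update with fewer, larger CUDA kernels.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, fused=True)

# 8-bit optimizer (alternative; requires the bitsandbytes package):
# import bitsandbytes as bnb
# optimizer = bnb.optim.Adam8bit(model.parameters(), lr=3e-4)

# One training step under BFloat16 autocast.
idx = torch.randint(0, 1000, (8, 128), device=device)
targets = torch.randint(0, 1000, (8, 128), device=device)

optimizer.zero_grad(set_to_none=True)
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    logits = model(idx)
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
loss.backward()
optimizer.step()
```

These switches are largely independent of one another, which is why the scripts and Makefile targets in this repository exercise them one at a time before comparing the results.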
- Choose optimization techniques based on your specific hardware and model architecture
- Some techniques (such as Torch Compile) add compilation overhead on the first iterations
- Performance gains vary depending on model complexity and hardware
The following plot shows the mean relative speedup comparison for different optimization techniques compared to the baseline (no optimization). These results were generated using a batch size of 256 and 150 training steps. This plot helps in visualizing the performance gains achieved by each optimization method.
By combining BF16, Torch Compile, FlashAttention, and the Fused Optimizer, I was able to reduce the average iteration time from 472.88 ms (no optimization) to 159.66 ms, making it roughly 3× faster (excluding compilation steps).
- Andrej Karpathy: Let's reproduce GPT-2 (124M)
- NVIDIA Ampere Architecture Whitepaper
- PyTorch Documentation on set_float32_matmul_precision
- PyTorch Documentation on Automatic Mixed Precision
- PyTorch Documentation on torch.compile
- PyTorch Documentation on scaled_dot_product_attention
- FlashAttention paper
- Online Softmax paper
This project is licensed under the MIT License. See `LICENSE` for details.
Contributions are welcome! Please submit pull requests or open issues to discuss potential improvements.