Parallel_Transformer

Build and run the pure C++ baseline implementation:

g++ -o transformer transformer.cpp -std=c++11
./transformer

Overview

This project provides a highly optimized implementation of Multi-Head Self-Attention, a key component of Transformer models, for GPU architectures. Using CUDA, the project explores and evaluates two parallelization strategies: horizontal parallelization using native C++ threads and vertical parallelization using CUDA streams and events. The goal is to achieve substantial speedups for Transformer workloads, which are widely used in Natural Language Processing (NLP) and Computer Vision (CV) tasks.

Background

Transformer models, and their multi-head attention mechanisms, have emerged as a powerful class of architectures for tasks in NLP and beyond. While these models demonstrate remarkable performance and flexibility, their computational demands can be substantial.

This project focuses on the core computational kernel of the Transformer—the multi-head self-attention operation—and examines how different parallelization strategies affect performance. We begin with a naive CPU implementation that serves as a baseline, illustrating the computational steps of the attention kernel in a straightforward, single-threaded manner.

We then introduce an MPI + OpenMP version that leverages distributed memory parallelism across multiple processes as well as shared-memory parallelism within each process. Finally, we develop a CUDA-accelerated GPU version that uses efficient GPU kernels and streams to exploit the massive parallelism of modern graphics processors.

By comparing these three approaches, this project highlights the trade-offs in complexity, scalability, and speedup that emerge when moving from a simple CPU-only solution to more advanced parallelization strategies, ultimately guiding practitioners in selecting the most suitable optimization approach for their specific system and workload.

Figure 1

Approach

MHA involves four steps:

  1. Query-Key-Value (QKV) matrices: each input sequence is transformed into Query (Q), Key (K), and Value (V) matrices via matrix multiplications.
  2. Attention scores: scores are computed as Scores = Q × K^T and normalized with softmax.
  3. Attention output: the output is computed as Output = Scores × V for each attention head.
  4. Concatenation: the results from all heads are concatenated and passed through a linear layer.

Each attention head is independent of the others, which makes MHA well suited to task-level parallelism, so we use MPI to assign 8 heads to each process, as sketched below.
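A minimal MPI sketch of this head assignment (the total head count, the even split across ranks, and the compute_head placeholder are illustrative assumptions, not the repository's actual code):

#include <mpi.h>
#include <cstdio>

// Placeholder for the per-head work; in the real kernel this would run the
// single-head attention for head h.
static void compute_head(int h) { (void)h; }

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int num_heads = 16;                    // assumed total head count (divisible by size)
    const int heads_per_rank = num_heads / size; // e.g. 8 heads per process with 2 ranks
    const int first = rank * heads_per_rank;

    // Each rank computes only its own contiguous block of heads.
    for (int h = first; h < first + heads_per_rank; ++h) {
        compute_head(h);
        std::printf("rank %d computed head %d\n", rank, h);
    }

    // The per-head outputs would then be gathered (e.g. with MPI_Gather) and
    // concatenated on the root rank before the final linear layer.
    MPI_Finalize();
    return 0;
}

Compiled with mpicxx and run with, e.g., mpirun -np 2, this gives each process 8 heads.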

Each matrix operation (e.g., Q × K^T, softmax, Scores × V, matrix transpose) is a data-parallel computation, so we use OpenMP to parallelize the matrix operations, distributing matrix rows across CPU threads, as sketched below.
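A minimal OpenMP sketch of this row-wise distribution for the GEMM primitive (gemm_omp is an illustrative name, not the repository's API; compile with -fopenmp):

#include <vector>

// Naive row-major GEMM: C(m x n) = A(m x k) * B(k x n).
// The rows of the output matrix are distributed across CPU threads by OpenMP.
void gemm_omp(const std::vector<float>& A, const std::vector<float>& B,
              std::vector<float>& C, int m, int k, int n) {
    C.assign(m * n, 0.0f);
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < m; ++i) {
        for (int p = 0; p < k; ++p) {
            const float a = A[i * k + p];
            for (int j = 0; j < n; ++j)
                C[i * n + j] += a * B[p * n + j];
        }
    }
}

The same pattern applies to the softmax and transpose loops: the outer loop over rows carries no dependencies, so a single parallel-for directive is enough.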

  1. Baseline version (pure C++):
    A single-core, non-optimized reference implementation of the transformer block in plain C++.

  2. MPI + OpenMP parallelization:

    • MPI: distribute the workload (e.g., different input batches or heads) across multiple compute nodes/processes.
    • OpenMP: further parallelize the computations within each MPI process across multiple CPU threads.
  3. CUDA GPU acceleration:
    Implement and optimize the key operations on the GPU to exploit massive parallelism, and use CUDA streams to overlap the computation of multiple heads (a host-side stream sketch follows this list).
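A host-side CUDA sketch of the one-stream-per-head idea (the kernel body, head count, and buffer sizes are placeholders for illustration, not the project's actual kernels):

#include <cuda_runtime.h>

// Placeholder for the fused per-head attention work; the real project would
// launch its GEMM/softmax kernels here instead.
__global__ void attention_head_kernel(const float* x, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = x[i];  // stand-in for the head's computation
}

int main() {
    const int num_heads = 8;   // assumed head count
    const int n = 1 << 16;     // assumed per-head element count
    float *d_x = nullptr, *d_out = nullptr;
    cudaMalloc((void**)&d_x, num_heads * n * sizeof(float));
    cudaMalloc((void**)&d_out, num_heads * n * sizeof(float));

    cudaStream_t streams[num_heads];
    for (int h = 0; h < num_heads; ++h) cudaStreamCreate(&streams[h]);

    // One stream per head: the heads are independent, so their kernels can
    // overlap on the GPU.
    const int block = 256, grid = (n + block - 1) / block;
    for (int h = 0; h < num_heads; ++h)
        attention_head_kernel<<<grid, block, 0, streams[h]>>>(d_x + h * n, d_out + h * n, n);

    for (int h = 0; h < num_heads; ++h) {
        cudaStreamSynchronize(streams[h]);
        cudaStreamDestroy(streams[h]);
    }
    cudaFree(d_x);
    cudaFree(d_out);
    return 0;
}

Recording CUDA events on each stream would additionally allow per-head timing or cross-stream dependencies, as mentioned in the overview.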

The comparison is based on timing data collected by Python scripts and the diagrams generated from that data.

Methods

A single-head attention kernel can be described as:

$$ Softmax\left(\frac{Q \times K^T}{\sqrt{d_k}}\right) \times V $$

The multi-head attention kernel includes three core operations (a naive CPU sketch of these primitives follows the list):

  • General Matrix Multiplication (GEMM)
  • Softmax
  • Transpose
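A minimal, naive CPU sketch of these three primitives on row-major matrices (the names Mat, gemm, transpose, and softmax_rows are illustrative, not the repository's API); the softmax applies the 1/sqrt(d_k) scaling and subtracts each row's maximum for numerical stability:

#include <algorithm>
#include <cmath>
#include <vector>

using Mat = std::vector<float>;  // row-major storage

// C(m x n) = A(m x k) * B(k x n)
void gemm(const Mat& A, const Mat& B, Mat& C, int m, int k, int n) {
    C.assign(m * n, 0.0f);
    for (int i = 0; i < m; ++i)
        for (int p = 0; p < k; ++p)
            for (int j = 0; j < n; ++j)
                C[i * n + j] += A[i * k + p] * B[p * n + j];
}

// B(n x m) = A(m x n)^T
void transpose(const Mat& A, Mat& B, int m, int n) {
    B.assign(n * m, 0.0f);
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j)
            B[j * m + i] = A[i * n + j];
}

// Row-wise Softmax(S / sqrt(d_k)) over an (m x n) matrix, in place.
void softmax_rows(Mat& S, int m, int n, int d_k) {
    const float scale = 1.0f / std::sqrt(static_cast<float>(d_k));
    for (int i = 0; i < m; ++i) {
        float* row = &S[i * n];
        float mx = row[0] * scale;
        for (int j = 1; j < n; ++j) mx = std::max(mx, row[j] * scale);
        float sum = 0.0f;
        for (int j = 0; j < n; ++j) { row[j] = std::exp(row[j] * scale - mx); sum += row[j]; }
        for (int j = 0; j < n; ++j) row[j] /= sum;
    }
}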

The core of the multi-head attention kernel is the single-head attention kernel, whose operations run in the following order (a driver sketch follows the list):

  • $X \times W_Q \rightarrow Q$ (Q calculation)
  • $X \times W_K \rightarrow K$ (K calculation)
  • $X \times W_V \rightarrow V$ (V calculation)
  • $Q \times K^T$ ($K^T$ calculation and multiplication)
  • $Softmax\left(\frac{Q \times K^T}{\sqrt{d_k}}\right)$ (softmax calculation)
  • $Softmax\left(\frac{Q \times K^T}{\sqrt{d_k}}\right) \times V$ (Softmax × V calculation)
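Continuing the primitives sketch above (and reusing its hypothetical gemm, transpose, and softmax_rows helpers), these steps map onto a single-head driver roughly as follows, assuming X is seq_len × d_model and each projection weight is d_model × d_k:

#include <vector>

using Mat = std::vector<float>;  // row-major, as in the previous sketch

// Helpers defined in the primitives sketch above.
void gemm(const Mat& A, const Mat& B, Mat& C, int m, int k, int n);
void transpose(const Mat& A, Mat& B, int m, int n);
void softmax_rows(Mat& S, int m, int n, int d_k);

// One attention head: X (seq_len x d_model), W_Q/W_K/W_V (d_model x d_k).
Mat single_head_attention(const Mat& X, const Mat& Wq, const Mat& Wk, const Mat& Wv,
                          int seq_len, int d_model, int d_k) {
    Mat Q, K, V, Kt, S, O;
    gemm(X, Wq, Q, seq_len, d_model, d_k);   // X * W_Q -> Q
    gemm(X, Wk, K, seq_len, d_model, d_k);   // X * W_K -> K
    gemm(X, Wv, V, seq_len, d_model, d_k);   // X * W_V -> V
    transpose(K, Kt, seq_len, d_k);          // K -> K^T
    gemm(Q, Kt, S, seq_len, d_k, seq_len);   // Q * K^T
    softmax_rows(S, seq_len, seq_len, d_k);  // Softmax(Q * K^T / sqrt(d_k))
    gemm(S, V, O, seq_len, seq_len, d_k);    // Softmax(...) * V
    return O;
}

In the MPI version, each rank would run the equivalent of this function for its assigned heads; in the CUDA version, each step maps to one or more GPU kernels.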

Summary of Results

Baseline Performance

Figure: Baseline performance of the system without optimization.

OpenMP Performance

Figure: Performance improvement using OpenMP parallelization.

OpenMP + MPI Performance

Figure: Performance with combined OpenMP and MPI optimizations.

Speedup Comparison

Figure: Speedup comparison between the different implementations.
