- src: contains all the source code and experiment scripts for simulation
- communication: simulation code for communication part
- backend: source code and config files of BookSim2 simulator
- collective_communication: implementation of five collective communication patterns
- components: Python interface that connects BookSim2 and ScaleSim-v2 and builds the simulation framework
- computation: simulation code for computation part
- backend: source code and config files of ScaleSim-v2 simulator
- gpt2: model description and modeling of GPT-2 and MoE models
- User/Chimera: contains the experiment scripts for baselines and fused LLM hybrid parallelism, as well as the figure-plotting scripts
- utils: miscellaneous utility files
- Real_Perf: contains all the source code and experiment scripts for real-machine tests
- doc: figures used in this README
- For simulation experiments:
- One machine with Ubuntu 22.04 (other Ubuntu versions should work as well)
- For real-machine tests:
- One 8×RTX4080 GPU node with PCIe 4.0 interconnect (other machines also work, but results vary with the machine and fabric)
- For simulation experiments:
- Python 3.9
- ScaleSim-v2 and BookSim2 (both are already included in this repository)
- For real-machine tests:
- PyTorch 2.5.1
- CUDA 12.4
- NCCL 2.21.5
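To quickly check that the real-machine software stack matches the versions above, an optional sanity check (assuming PyTorch is installed in the active environment) is:

python -c "import torch; print(torch.__version__)"          # expect 2.5.1
python -c "import torch; print(torch.version.cuda)"         # expect 12.4
python -c "import torch; print(torch.cuda.nccl.version())"  # expect (2, 21, 5)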
In the ./src/User/Chimera/experiment directory, there are four ISCA25_* directories which correspond to Synthetic (Figure 10), Forward Pass (Figure 11), End-to-End results (Figure 11(a)), and End-to-End results with pipeline overlapping (Figure 11(b)), respectively. Each directory contains its corresponding configuration files (cfg), experiment scripts (py), and results (txt). After finishing all the experiments, the figure scripts in ./src/User/Chimera/experiment/Pictures use the results in the four txt directories to reproduce the figures.
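For reference, the layout described above can be inspected with standard shell commands (illustrative only; the exact set of result files depends on which experiments have been run):

ls -d ./src/User/Chimera/experiment/ISCA25_*      # the four experiment directories
ls ./src/User/Chimera/experiment/ISCA25_*/txt     # result files consumed by the figure scripts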
python update_cfg.py
Build a new conda environment:
conda create --name Chimera python=3.9
conda activate Chimera
pip install -r requirements.txt
conda install -c conda-forge flex bison
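Optionally, confirm that the build tools installed from conda-forge are visible in the environment (flex and bison are required to compile BookSim2 in the next step):

flex --version
bison --version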
Compile BookSim2:
cd ./src/communication/backend/booksim2/src
make libdbg
cd ../../../../../
Run all the simulation experiments:
cd ./src/User/Chimera/experiment
bash run-all.sh
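As a rough sketch of what the batch run amounts to (assuming each ISCA25_* directory keeps its experiment scripts under its py subdirectory, as described above; run-all.sh remains the authoritative entry point), the experiments could also be launched group by group:

# from ./src/User/Chimera/experiment
for d in ISCA25_*/; do
    (cd "$d" && for s in py/*.py; do python "$s"; done)   # run every experiment script in this group
done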
We use one 8×RTX4080 GPU node for the real-machine tests, which has the topology shown below (check with nvidia-smi topo -m). Different topologies may produce different results.
To reproduce the real-machine test results:
cd ./Real_Perf
bash run-all.sh
To reproduce the figures, please first reproduce all the results above, then run:
cd ./src/User/Chimera/experiment/Pictures
conda activate Chimera
bash plot-figure.sh
For machines with different topologies (e.g., an NVSwitch fully-connected system), the results can be rather different (even negative speedups with fusion). We have observed such results on an 8×H20 GPU node. One possible reason is NVSwitch bandwidth degradation as the number of GPU neighbors grows. On NVSwitch systems, bandwidth can vary significantly with the number of concurrent GPU neighbors (i.e., how many devices are using the NVSwitch simultaneously), especially during large data transfers [1]. As the number of active devices increases, the available bandwidth per device may drop to as low as half of that available to a single device. In the affected experiments, we fuse several local communication operators (e.g., All-to-All in PP+EP, All-Gather in PP+SP, each over 4 devices) into a global operator (e.g., an 8-device M2MS). Although this fusion reduces communication volume, it results in longer transmission time on NVSwitch machines, which explains the performance differences.
There are two ways to reproduce our results on such machines: (1) limit the number of GPUs to 4 to mitigate the bandwidth contention, or (2) disable NVLink to force communication over PCIe, as sketched below.
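A minimal sketch of the two workarounds, assuming the NCCL-based real-machine scripts respect the standard environment variables (CUDA_VISIBLE_DEVICES to restrict visible devices, NCCL_P2P_DISABLE to turn off direct GPU-to-GPU transport so traffic goes through PCIe/host memory):

# Method (1): restrict the run to 4 GPUs to reduce NVSwitch bandwidth contention (from ./Real_Perf)
CUDA_VISIBLE_DEVICES=0,1,2,3 bash run-all.sh
# Method (2): disable NVLink/P2P transport in NCCL so communication falls back to PCIe
NCCL_P2P_DISABLE=1 bash run-all.sh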
[1] TACCL: Guiding Collective Algorithm Synthesis using Communication Sketches. NSDI 2023. https://www.usenix.org/conference/nsdi23/presentation/shah