Artifact for the Paper "Colocating ML Inference and Training with Fast GPU Memory Handover"
- Project Structure
- Hardware Requirements
- Build and Install
- Run and Evaluate
Project Structure

$ tree --dirsfirst -L 2 .
├── client
├── cmake # CMake helper files
├── common # Common libraries for inference/training
├── environment # Docker and conda environment files
├── eval
│ ├── runner # Automatic evaluation runner
│ └── ... # Evaluation scripts for test cases
├── log # Running logs
├── proto # gRPC proto
├── pytorch # PyTorch plugin
├── scripts
├── server # Inference server
│ ├── models # Contains inference models
│ └── ...
├── train # PyTorch training scripts
├── third_party/mpool... # GPU memory pool
└── ...
Hardware Requirements

- 4 x NVIDIA V100 (16GB)
- 1 x NVIDIA A100 (80GB)
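To confirm the GPUs and driver are visible before building, a quick check with the standard nvidia-smi query flags:

nvidia-smi --query-gpu=name,memory.total --format=csv   # lists each GPU and its memory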
Build and Install

Option 1: Pull from Docker Hub

Pull the pre-built Docker images from Docker Hub. The script ./scripts/docker.sh is provided as a wrapper for Docker commands.
docker pull siriusinftra/sirius:latest
docker pull siriusinftra/triton-trt-um:latest # Triton TensorRT UM backend
bash ./scripts/docker.sh
The project is located at /gpu-col within the Docker container. TVM and Triton models are pre-installed in this image. Before running the system, activate the conda environment (e.g., conda activate colserve). To evaluate Sirius, refer to Run Benchmark and Artifact Evaluation for more details.
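For example, a typical session looks like the following (a sketch, assuming ./scripts/docker.sh drops you into a shell inside the container):

bash ./scripts/docker.sh    # enter the container via the wrapper script
cd /gpu-col                 # project root inside the image
conda activate colserve     # activate the pre-installed environment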
Option 2: Build from Dockerfile
- Clone the repository and build the Docker image. The build_docker.sh script will clone dependencies into inftra-docker-build, which serves as the Docker build context.
  [Optional] Copy TVM and Triton models to inftra-docker-build/tvm-models and inftra-docker-build/triton-models, respectively. These will be copied into the Docker image.
git clone --recurse-submodules git@github.com:SiriusInfTra/Sirius.git gpu-col
bash ./gpu-col/scripts/build_docker.sh
- Build the Triton TensorRT UM Docker image.
bash ./gpu-col/scripts/build_triton_trt_um_docker.sh
Software Requirements: cmake>=3.24, gcc>=9.4, nvcc>=11.6, ninja
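The standard version flags can be used to confirm the toolchain before building:

cmake --version   # expect >= 3.24
gcc --version     # expect >= 9.4
nvcc --version    # expect release >= 11.6
ninja --version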
Create Environment and Build System:
- Prepare a new conda environment, install Python packages, and then clone the repository.
conda create -n colserve python=3.12
conda activate colserve
conda install -y conda-forge::python-devtools nvitop conda-forge::c-ares
pip install -r environment/requirements.txt
export SIRIUS_HOME=/path/to/clone/repo
git clone --recurse-submodules git@github.com:SiriusInfTra/Sirius.git $SIRIUS_HOME
- Install Boost>=1.80 by compiling from source (Boost installed via apt/conda might require a higher GCC version).
export BOOST_HOME=/path/to/install/boost
$SIRIUS_HOME/scripts/install_boost.sh $BOOST_HOME
- Clone and build TVM for inference, and PyTorch and TorchVision for training. Ensure the CUDA backend is enabled. Pay attention to the PyTorch GLIBCXX_USE_CXX11_ABI flag, which can cause ABI issues. To accelerate the build, set the TORCH_CUDA_ARCH_LIST flag to your GPU's compute capability (e.g., TORCH_CUDA_ARCH_LIST=7.0 for V100). A quick post-build check is sketched below.
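After the build, a minimal sanity check (standard PyTorch APIs) verifies that CUDA is enabled and reports which C++ ABI PyTorch was compiled with, since Sirius must be built against the same ABI:

python -c "import torch; print(torch.version.cuda, torch.cuda.is_available())"
python -c "import torch; print(torch.compiled_with_cxx11_abi())"   # True => GLIBCXX_USE_CXX11_ABI=1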
- Set the TVM_HOME environment variable. Verify by running echo $TVM_HOME and echo $CONDA_PREFIX. Then, configure CMake.
export TVM_HOME=/path/to/tvm
export TORCH_HOME=/path/to/pytorch
export BOOST_HOME=/path/to/boost
$SIRIUS_HOME/scripts/build_sirius.sh $SIRIUS_HOME $TVM_HOME $TORCH_HOME $BOOST_HOME
- [Only required for Triton UM+MPS] Set up the Triton TensorRT backend with Unified Memory support: clone and build the Triton TensorRT UM backend.
export TRITON_TRT_UM_HOME=/path/to/triton_tensorrt_um
export TRITON_TRT_INSTALL_HOME=/path/to/triton_tensorrt_um_install # e.g., $SIRIUS_HOME/triton/tensorrt_um/install
bash $SIRIUS_HOME/scripts/build_triton_trt_um.sh $TRITON_TRT_UM_HOME $TRITON_TRT_INSTALL_HOME
- [Only required for LLM] Install vLLM by compiling from source: clone xFormers and vLLM, then run the build script.
export VLLM_HOME=/path/to/vllm
export XFORMER_HOME=/path/to/xformer
bash $SIRIUS_HOME/scripts/build_vllm.sh $VLLM_HOME $XFORMER_HOME
TVM Models

Compile models using TVM (refer to ./util/prepare_model_store). TVM models (i.e., mod.json, mod.params, and mod.so) are stored in server/models, as shown below.
server/models
├── densenet161-b1
├── distilbert_base-b1
├── distilgpt2-b1
├── efficientnet_v2_s-b1
├── efficientvit_b2-b1
└── resnet152-b1
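Each model directory should contain all three TVM artifacts; a minimal check (a sketch, based on the layout above) is:

for d in server/models/*-b1; do
  for f in mod.json mod.params mod.so; do
    [ -f "$d/$f" ] || echo "missing $f in $d"
  done
done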
Triton Models

Compile Triton models using TensorRT (refer to ./util/onnx). Triton models are stored in server/triton_models. Each model has a directory containing the Triton compiled model (model.plan and config.pbtxt), as shown below.
server/triton_models
├── densenet161
├── distilbert_base
├── distilgpt2
├── efficientnet_v2_s
├── efficientvit_b2
├── resnet152
│ ├── 1
│ │ └── model.plan
│ └── config.pbtxt
└── config.conf
config.conf is used to configure the memory usage (in MiB) for each model.
resnet152 = 345
distilgpt2 = 349
efficientvit_b2 = 143
efficientnet_v2_s = 114
densenet161 = 107
distilbert_base = 278
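As a quick sanity check, the per-model budgets can be totaled with a one-liner (a sketch, assuming the simple key = value format shown above):

awk -F'=' '/=/ { sum += $2 } END { printf "total: %d MiB\n", sum }' server/triton_models/config.conf
# prints "total: 1336 MiB" for the six models above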
LLM
Download the LLM weights (e.g., Llama2-13B and Qwen2-0.5B) from Hugging Face.
from transformers import AutoConfig, AutoModelForCausalLM

# Llama-2-13B (gated on Hugging Face; requires license acceptance and authentication)
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-13b-hf")

# Qwen2-0.5B
config = AutoConfig.from_pretrained('Qwen/Qwen2-0.5B')
model = AutoModelForCausalLM.from_pretrained('Qwen/Qwen2-0.5B', config=config)
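Because the Llama-2 checkpoints are gated, this assumes you have already accepted the license and logged in, e.g.:

huggingface-cli login   # paste a token that has access to meta-llama/Llama-2-13b-hf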
Run and Evaluate

The evaluation is fully automated by the script at ./eval/runner, which launches GPU MPS, Sirius's inference server, PyTorch training tasks, and inference workloads.
For example, to evaluate Sirius with the Light workload:
source ./scripts/set_cuda_device.sh 0
python eval/overall_v2.py --uniform-v2 --uniform-v2-wkld-types NormalLight \
--sirius --skip-set-mps-pct
The evaluation results will be saved in a directory like log/overall-uniform-v2-1gpu-YYYYMMDD-HHMM/colsys-NormalLight.
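For instance, to list the results of the run above (directory name following the pattern just described):

ls log/overall-uniform-v2-1gpu-*/colsys-NormalLight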
Please refer to ./artifact-evaluation/README.md for more details on the artifact evaluation process.