HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference

Official implementation of the DAC'25 paper "HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference".

arXiv: https://arxiv.org/abs/2504.05897

Introduction

HybriMoE is a hybrid CPU-GPU scheduling and cache management system that improves the efficiency of MoE inference. It addresses the high latency of on-demand expert loading and the unbalanced hardware utilization of hybrid execution through three key innovations:

  • Hybrid MoE CPU-GPU Scheduling: An efficient hybrid scheduling algorithm for MoE inference that dynamically balances expert workloads across GPUs and CPUs.
  • Impact-driven Prefetching: A prefetching mechanism that simulates the potential impact of preloading experts from subsequent layers and prioritizes those with the highest expected gains.
  • MoE-specialized Cache Management: An expert score-based caching strategy that prioritizes high-demand experts across layers to minimize cache misses (see the sketch after this list).
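
To make the cache-management idea concrete, the following is a minimal, illustrative Python sketch of an expert score-based cache. The class name, the decay factor, and the exact scoring rule are assumptions for exposition only; the actual implementation lives in the ktransformers sources in this repository.

    # Illustrative sketch only; names and the scoring rule are assumptions,
    # not the repository's actual implementation.
    class ExpertCache:
        def __init__(self, capacity: int):
            self.capacity = capacity  # max experts resident on the GPU
            self.scores = {}          # (layer, expert) -> accumulated demand score

        def access(self, key, router_score: float) -> bool:
            """Record an activation of expert `key`; return True on a cache hit."""
            hit = key in self.scores
            if not hit and len(self.scores) >= self.capacity:
                # Evict the expert with the lowest accumulated score,
                # whichever layer it belongs to.
                victim = min(self.scores, key=self.scores.get)
                del self.scores[victim]
            # Decay old scores and reinforce the current expert so that
            # consistently high-demand experts stay resident.
            self.scores = {k: 0.9 * v for k, v in self.scores.items()}
            self.scores[key] = self.scores.get(key, 0.0) + router_score
            return hit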

Installation

  • CUDA 12.1 or above; if you don't have it yet, you can install it from NVIDIA's CUDA downloads page.

    # Adding CUDA to PATH
    export PATH=/usr/local/cuda/bin:$PATH
    export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
    export CUDA_PATH=/usr/local/cuda
    conda create --name hybrimoe python=3.11
    conda activate hybrimoe # you may need to run 'conda init' and reopen your shell first
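    # Optional sanity check (not part of the original steps): confirm the
    # CUDA toolchain is visible before building
    nvcc --version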
  • Initialize the source code

    git clone https://github.com/PKU-SEC-Lab/HybriMoE
    cd HybriMoE
    git submodule init
    git submodule update
  • Compile and install (for Linux)

    bash install.sh
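
    After the build finishes, a quick smoke test is to import the bundled package. This assumes install.sh installs it under the name ktransformers, as the paths elsewhere in this README suggest:

    python -c "import ktransformers"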
    

Downloading Model Weights

mkdir DeepSeek-V2-Lite-Chat-GGUF
cd DeepSeek-V2-Lite-Chat-GGUF

wget https://huggingface.co/mzwing/DeepSeek-V2-Lite-Chat-GGUF/resolve/main/DeepSeek-V2-Lite-Chat.Q4_K_M.gguf -O DeepSeek-V2-Lite-Chat.Q4_K_M.gguf
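
# Return to the repository root so the relative paths in the next step resolve
cd ..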

Running Example

python ktransformers/local_chat.py --model_path deepseek-ai/DeepSeek-V2-Lite-Chat --gguf_path ./DeepSeek-V2-Lite-Chat-GGUF --cache_size 16 --prefetch_size 0 --optimize_rule_path ktransformers/optimize/optimize_rules/DeepSeek-V2-Chat.yaml
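
The --cache_size and --prefetch_size flags appear to correspond to the expert cache capacity and the amount of prefetching described above, with --prefetch_size 0 disabling prefetching. A variant that turns prefetching on might look like the following; the value 2 is purely illustrative, and the exact flag semantics are defined by the code:

python ktransformers/local_chat.py --model_path deepseek-ai/DeepSeek-V2-Lite-Chat --gguf_path ./DeepSeek-V2-Lite-Chat-GGUF --cache_size 16 --prefetch_size 2 --optimize_rule_path ktransformers/optimize/optimize_rules/DeepSeek-V2-Chat.yaml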

Citation

If our work assists your research, feel free to give us a star ⭐ or cite us using:

@article{zhong2025hybrimoe,
  title={HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference},
  author={Zhong, Shuzhang and Sun, Yanfan and Liang, Ling and Wang, Runsheng and Huang, Ru and Li, Meng},
  journal={arXiv preprint arXiv:2504.05897},
  year={2025}
}

Contact

If you have any questions, please open a GitHub issue or contact us by email at zsz@stu.pku.edu.cn.
