# 🎻 DAOP: Data-Aware Offloading and Predictive Pre-Calculation for Efficient MoE Inference [paper]
(This repository is a proof-of-concept and still under heavy construction)
DAOP is an on-device MoE inference engine that optimizes parallel GPU-CPU execution. It dynamically allocates experts between the CPU and GPU based on per-sequence activation patterns, and selectively pre-calculates predicted experts on the CPU to minimize transfer latency. This enables efficient resource utilization across various expert cache ratios while maintaining model accuracy through a novel graceful degradation mechanism. DAOP runs the unquantized Mixtral-8x7B model (>90 GB of parameters) at >4.5 tokens/s and the unquantized Phi-3.5-MoE model (>80 GB of parameters) at >8.2 tokens/s on a single A6000 GPU.
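For intuition, below is a minimal toy sketch of the two ideas described above. It is not the repository's implementation; every function, parameter, and value in it is hypothetical and only meant to illustrate data-aware expert placement and CPU pre-calculation on a cache miss.

```python
# Illustrative sketch only -- NOT the repository's actual code or API.
from collections import Counter
from typing import Iterable, List, Set


def place_experts(prefill_expert_ids: Iterable[int],
                  num_experts: int,
                  gpu_budget: int) -> Set[int]:
    """Data-aware offloading: keep the experts this sequence activated most
    often during prefill resident on the GPU, up to the cache budget."""
    counts = Counter(prefill_expert_ids)
    ranked = sorted(range(num_experts), key=lambda e: counts[e], reverse=True)
    return set(ranked[:gpu_budget])


def dispatch_experts(predicted_experts: List[int], gpu_experts: Set[int]) -> None:
    """Predictive pre-calculation: a predicted expert that misses the GPU cache
    is computed on the CPU instead of paying the cost of transferring its
    weights to the GPU."""
    for e in predicted_experts:
        if e in gpu_experts:
            print(f"expert {e}: run on GPU (weights already resident)")
        else:
            print(f"expert {e}: pre-calculate on CPU (avoid weight transfer)")


if __name__ == "__main__":
    # Toy example: 8 experts, GPU cache holds 2; prefill mostly hit experts 3 and 0.
    gpu_experts = place_experts([0, 3, 3, 5, 3, 0, 7], num_experts=8, gpu_budget=2)
    dispatch_experts([3, 6], gpu_experts)
```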
- [2024/12] We published an arXiv preprint.
- [2025/02] We released the repository.
To install dependencies and run the provided test script:

pip install -r requirements.txt
bash testing.sh
Please feel free to contact us if there are any issues or discrepancies with the experimental data.
If you use DAOP in your research, please cite the following paper:
@article{zhang2024daop,
  title={DAOP: Data-Aware Offloading and Predictive Pre-Calculation for Efficient MoE Inference},
  author={Zhang, Yujie and Aggarwal, Shivam and Mitra, Tulika},
  journal={arXiv preprint arXiv:2501.10375},
  year={2024}
}