LLMEngineOnWafer

Large Language Model Inference Engine on Wafer-Scale Chip

Setup

conda create -n splitwise-sim python=3.11
conda activate splitwise-sim
cd LLMEngineV3
pip install -r requirements.txt

Run command

python run.py trace.filename=AzureLLMInferenceTrace_code  
python run.py trace.filename=AzureLLMInferenceTrace_conv  
python run.py trace.filename=test_trace

DONE

TODO

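Add the llama2-7b, llama2-13b, opt-7b, opt-13b, and opt-66B models, and adjust the corresponding number of instances and TP (tensor parallelism) granularity.
Develop a corresponding baseline1 version that changes only the accelerator architecture and fully reuses the default Splitwise scheduling logic, to show that a direct port is not suitable!
Save energy consumption and its power breakdown to the Splitwise CSV output.
Initial placement of prefill and decode instances (currently assigned simply in order); compute KV cache transfer time from the hop count; resource swapping must take placement into account.
Resource-swapping algorithm design **(lowest priority: when running the code dataset, splitwise_aa can barely form batches, and no mixed pool appeared)**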
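The hop-count item above could be modeled roughly as follows. This is a minimal sketch, not the repository's implementation: it assumes the wafer is a 2D mesh of dies with dimension-order (XY) routing, so hops equal the Manhattan distance, and it assumes a pipelined (wormhole-style) link model where the payload is serialized once and each hop adds a fixed latency. `DiePos`, `hop_count`, `kv_cache_bytes`, `kv_transfer_time`, and all their parameters are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiePos:
    """Hypothetical (x, y) coordinate of a die on a 2D wafer mesh."""
    x: int
    y: int


def hop_count(src: DiePos, dst: DiePos) -> int:
    """Number of mesh hops under XY routing (Manhattan distance)."""
    return abs(src.x - dst.x) + abs(src.y - dst.y)


def kv_cache_bytes(num_tokens: int, num_layers: int, kv_dim: int,
                   bytes_per_elem: int = 2) -> int:
    """KV cache size: K and V each hold kv_dim values per token per layer.

    Assumes no KV compression; for multi-head attention kv_dim equals the
    hidden size, and bytes_per_elem=2 corresponds to fp16/bf16.
    """
    return 2 * num_tokens * num_layers * kv_dim * bytes_per_elem


def kv_transfer_time(src: DiePos, dst: DiePos, num_bytes: int,
                     link_bw_bytes_per_s: float,
                     per_hop_latency_s: float) -> float:
    """Assumed pipelined link model: one serialization plus per-hop latency.

    Returns 0.0 when prefill and decode share the same die (no transfer).
    """
    hops = hop_count(src, dst)
    if hops == 0:
        return 0.0
    return num_bytes / link_bw_bytes_per_s + hops * per_hop_latency_s
```

For example, a 1000-token prompt on llama2-7b (32 layers, hidden size 4096, fp16) gives `kv_cache_bytes(1000, 32, 4096) == 524_288_000` bytes; placing the decode instance farther from its prefill instance then only adds the per-hop term, which is why placement matters for resource swapping.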

V3

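Rewrote KVJSQScheduler: added a pre_sel_batch() function and rewrote schedule() to support our scheduling policy. To use it, set scheduler: kv_jsq in configs/applications/solo.yaml.
Rewrote ORCAInstance: added a max_batch_tokens limit and rewrote select_batch(); SplitwiseInstance now calls ORCAInstance.select_batch() to support our policy (no preemption). Running the code dataset currently triggers a bug (hacked around so the feature works); see the code comments.
Optimized the logic: instances are traversed in random order, and an instance with pending_tokens=0 is selected immediately.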
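The two policies described above (token-limited batch selection without preemption, and join-shortest-queue instance selection with random traversal and an early exit for idle instances) can be illustrated with the simplified sketch below. It is not the repository's KVJSQScheduler or ORCAInstance code: the `Request`, `Instance`, and `pick_instance` names are hypothetical stand-ins, and "queue length" is assumed to be the sum of pending prompt tokens.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Request:
    """Hypothetical request: only its prompt length matters here."""
    rid: int
    prompt_tokens: int


@dataclass
class Instance:
    """Hypothetical stand-in for an ORCA-style instance with a token budget."""
    name: str
    max_batch_tokens: int
    pending: list = field(default_factory=list)

    @property
    def pending_tokens(self) -> int:
        # Assumed load metric: total prompt tokens still waiting.
        return sum(r.prompt_tokens for r in self.pending)

    def select_batch(self) -> list:
        """FCFS batching capped at max_batch_tokens, no preemption.

        Stops at the first request that would exceed the budget, so
        arrival order is preserved. Assumes every single prompt fits
        within max_batch_tokens; an oversized request would block the queue.
        """
        batch, used = [], 0
        for req in self.pending:
            if used + req.prompt_tokens > self.max_batch_tokens:
                break
            batch.append(req)
            used += req.prompt_tokens
        return batch


def pick_instance(instances, rng=random):
    """Join-shortest-queue by pending tokens, visiting instances in random order.

    Random traversal spreads ties across instances; an idle instance
    (pending_tokens == 0) is returned immediately without scanning the rest.
    """
    order = list(instances)
    rng.shuffle(order)
    best = None
    for inst in order:
        if inst.pending_tokens == 0:
            return inst
        if best is None or inst.pending_tokens < best.pending_tokens:
            best = inst
    return best
```

The early return is safe because no instance can have fewer than zero pending tokens, so an idle instance is always a minimum; shuffling first keeps the scheduler from always favoring the lowest-indexed idle instance.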

PJLAB-CHIP/LLMEngineOnWafer

Folders and files

Latest commit

History

Repository files navigation

LLMEngineOnWafer

Setup

Run command

DONE

TODO

V3

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors 2

Uh oh!

Languages

Packages