[BugFix][WIP] Fix accuracy problems with deepseek in situation of ep=1, etp>1 #863

whx-sjtu · 2025-05-14T16:08:25Z

This PR tries to fix an accuracy problem with DeepSeek in the pure expert-tensor-parallel case (ep=1, etp>1). There are two problems in total:

Fix a bug that sets tp_rank incorrectly when ep_size=1. This code was introduced by @ganyi1996ppo, and I'm not sure whether I can delete it outright without affecting other functionality, especially in the data-parallel case. CC @ganyi1996ppo @yiz-liu
The other problem is related to torch_npu.npu_moe_finalize_routing in fused_experts, and I'm working on a fix.
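
To illustrate why a wrong tp_rank hurts accuracy, here is a minimal, self-contained sketch. It is not the vllm-ascend code: the helper names (`shard`, `partial_matvec`, `all_reduce_sum`) are hypothetical, and it assumes, purely for illustration, that the bug makes every rank behave as tp_rank 0. Under tensor parallelism each rank owns a distinct slice of the expert weight, and the partial results are summed; if two ranks pick the same slice, the sum is silently wrong.

```python
# Hypothetical illustration only; names and the "every rank is rank 0"
# failure mode are assumptions, not taken from the PR diff.

def shard(weight_rows, tp_rank, tp_size):
    """Return the contiguous block of rows owned by tp_rank and its offset."""
    per_rank = len(weight_rows) // tp_size
    start = tp_rank * per_rank
    return weight_rows[start:start + per_rank], start

def partial_matvec(x, weight_rows, tp_rank, tp_size):
    """Row-parallel partial product using only this rank's weight rows."""
    rows, start = shard(weight_rows, tp_rank, tp_size)
    out = [0.0] * len(weight_rows[0])
    for i, row in enumerate(rows):
        for j, w in enumerate(row):
            out[j] += x[start + i] * w
    return out

def all_reduce_sum(partials):
    """Stand-in for the collective: element-wise sum over ranks."""
    return [sum(vals) for vals in zip(*partials)]

weight = [[1, 2], [3, 4], [5, 6], [7, 8]]  # 4 input rows, 2 output columns
x = [1, 1, 1, 1]
tp_size = 2

# Unsharded reference result.
reference = partial_matvec(x, weight, tp_rank=0, tp_size=1)

# Each rank uses its own tp_rank: partial sums add up to the reference.
correct = all_reduce_sum(
    [partial_matvec(x, weight, r, tp_size) for r in range(tp_size)]
)

# Assumed failure mode: every rank is treated as tp_rank 0.
buggy = all_reduce_sum(
    [partial_matvec(x, weight, 0, tp_size) for _ in range(tp_size)]
)

print(reference, correct, buggy)  # [16.0, 20.0] [16.0, 20.0] [8.0, 12.0]
```

No error is raised in the buggy case, which is why this kind of issue shows up as an accuracy regression rather than a crash.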

Signed-off-by: whx-sjtu <2952154980@qq.com>

Signed-off-by: zzzzwwjj <1183291235@qq.com>


github-actions bot added module:ops module:quantization labels May 14, 2025

ApsarasX mentioned this pull request May 16, 2025

[Bugfix][Model] Fix fusedmoe and make modelrunner_v1 compatible with latest vllm #867

Merged

MengqingCao mentioned this pull request May 20, 2025

[Bugfix] Fix deepseek V0 precision issue and add acc ci for it #905

Open

whx-sjtu and others added 2 commits May 20, 2025 22:49

fix etp rank related accuracy problem

a990949

Signed-off-by: whx-sjtu <2952154980@qq.com>

fix: fix deepseek accuracy when ep_size=1

f05ee46

Signed-off-by: zzzzwwjj <1183291235@qq.com>

whx-sjtu force-pushed the fix_etp_acc branch from d1f650a to f05ee46 May 20, 2025 14:50

fix ci problems

b23c5fe

Signed-off-by: whx-sjtu <2952154980@qq.com>
