Thanks for such great work!
I'm currently trying to use DeepEP for multi-node MoE model deployment. Although I have read through the related issues and docs, a few points still need official confirmation to make sure I'm not misunderstanding anything.
- How to validate the DeepEP environment
For now, I am using the official test cases (intranode / internode / low latency), and I assume that if all of those tests pass, the DeepEP environment is good to go. Is that assumption correct? (The wrapper I use to run them is the first sketch below this list.)
- About NVSHMEM
According to the NVSHMEM install guide, installation involves two stages: 1. install the NVSHMEM binary, and 2. enable NVSHMEM IBGDA support. For the IBGDA support step, we choose either 2.1 Configure NVIDIA driver or 2.2 Install GDRCopy and load the gdrdrv kernel module.
- Is the second step, Enable NVSHMEM IBGDA support, actually necessary for multi-node deployment?
- For 2.1 Configure NVIDIA driver, we need to modify the driver config and reboot the host machine. But if we choose the 2.2 Install GDRCopy and load the gdrdrv kernel module approach:
  - Do we still need to reboot the host machine?
  - Can we wrap everything NVSHMEM / GDRCopy related into a Docker image, without having to do anything special on the host machine?
  - How can I verify the IBGDA setup? Also with the official test cases (intranode / internode / low latency)? (My current host-side check is the second sketch below this list.)
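Regarding the environment-validation question above, this is the kind of wrapper I use to run the bundled tests in one go. It is only a sketch under the assumption that the test files sit under tests/ in the repository checkout and can be launched directly with python; the rank counts, launcher, and environment variables obviously depend on the cluster, so please correct me if this is not how the tests are meant to be driven.

```python
# Minimal sketch: run the official DeepEP test scripts one after another and
# report which ones fail. Assumption: the tests/ paths below follow the repo
# layout and each script is runnable with plain `python` on a configured node.
import subprocess
import sys

TEST_SCRIPTS = [
    "tests/test_intranode.py",    # single-node (NVLink) dispatch/combine
    "tests/test_internode.py",    # multi-node (RDMA + NVLink) dispatch/combine
    "tests/test_low_latency.py",  # low-latency (IBGDA) kernels
]

def run_tests() -> bool:
    all_ok = True
    for script in TEST_SCRIPTS:
        print(f"==> running {script}")
        result = subprocess.run([sys.executable, script])
        if result.returncode != 0:
            print(f"    FAILED (exit code {result.returncode})")
            all_ok = False
        else:
            print("    passed")
    return all_ok

if __name__ == "__main__":
    sys.exit(0 if run_tests() else 1)
```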
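And for the IBGDA verification question, this is the rough host-side check I run before falling back to the test cases. The signals it looks for are my own assumptions: I take a PeerMappingOverride entry in /proc/driver/nvidia/params as evidence of the 2.1 driver-config approach, and a loaded gdrdrv module in /proc/modules as evidence of the 2.2 GDRCopy approach. Please correct me if these are not the right things to check, or if only the test cases can really confirm IBGDA.

```python
# Rough sketch of host-side IBGDA prerequisite checks; it only inspects /proc
# and does not prove that NVSHMEM actually selects IBGDA at runtime.
# Assumptions: "PeerMappingOverride=1" shows up in /proc/driver/nvidia/params
# when the driver was configured per approach 2.1, and the gdrdrv module from
# GDRCopy appears in /proc/modules when approach 2.2 was used.
from pathlib import Path

def driver_regkey_set() -> bool:
    """Approach 2.1: NVIDIA driver loaded with PeerMappingOverride=1 (assumed key name)."""
    params = Path("/proc/driver/nvidia/params")
    return params.exists() and "PeerMappingOverride=1" in params.read_text()

def gdrdrv_loaded() -> bool:
    """Approach 2.2: gdrdrv kernel module (from GDRCopy) is currently loaded."""
    modules = Path("/proc/modules")
    if not modules.exists():
        return False
    return any(line.split()[0] == "gdrdrv"
               for line in modules.read_text().splitlines() if line.strip())

if __name__ == "__main__":
    print(f"NVIDIA driver PeerMappingOverride set: {driver_regkey_set()}")
    print(f"gdrdrv kernel module loaded:           {gdrdrv_loaded()}")
```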
Thanks in advance; any help would be appreciated.