Thanks for such great work!
I'm currently trying to use DeepEP for multi-node MoE model deployment. Although I have read through the related issues and docs, a few points still need official confirmation to make sure I'm not misunderstanding anything.
- How to validate the DeepEP environment
For now, I am using the official test cases (intranode / internode / low latency), and I assume that if all of those tests pass, the DeepEP environment is good to go. Is that assumption correct? (The wrapper I use to run them is the first sketch below this list.)
- About NVSHMEM
According to the NVSHMEM install guide, installation involves two stages: 1. install the NVSHMEM binary, and 2. enable NVSHMEM IBGDA support. For the IBGDA support step, we choose either 2.1 Configure NVIDIA driver or 2.2 Install GDRCopy and load the gdrdrv kernel module.
- Is the second step, Enable NVSHMEM IBGDA support, actually necessary for multi-node deployment?
- For 2.1 Configure NVIDIA driver, we need to modify the driver config and reboot the host machine. But if we choose the 2.2 Install GDRCopy and load the gdrdrv kernel module approach:
  - Do we still need to reboot the host machine?
  - Can we wrap everything NVSHMEM / GDRCopy related into a Docker image, without having to do anything special on the host machine?
  - How can I verify the IBGDA setup? Also with the official test cases (intranode / internode / low latency)? (My current host-side check is the second sketch below this list.)
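Regarding the environment-validation question above, this is the kind of wrapper I use to run the bundled tests in one go. It is only a sketch under the assumption that the test files sit under tests/ in the repository checkout and can be launched directly with python; the rank counts, launcher, and environment variables obviously depend on the cluster, so please correct me if this is not how the tests are meant to be driven.

```python
# Minimal sketch: run the official DeepEP test scripts one after another and
# report which ones fail. Assumption: the tests/ paths below follow the repo
# layout and each script is runnable with plain `python` on a configured node.
import subprocess
import sys

TEST_SCRIPTS = [
    "tests/test_intranode.py",    # single-node (NVLink) dispatch/combine
    "tests/test_internode.py",    # multi-node (RDMA + NVLink) dispatch/combine
    "tests/test_low_latency.py",  # low-latency (IBGDA) kernels
]

def run_tests() -> bool:
    all_ok = True
    for script in TEST_SCRIPTS:
        print(f"==> running {script}")
        result = subprocess.run([sys.executable, script])
        if result.returncode != 0:
            print(f"    FAILED (exit code {result.returncode})")
            all_ok = False
        else:
            print("    passed")
    return all_ok

if __name__ == "__main__":
    sys.exit(0 if run_tests() else 1)
```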
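And for the IBGDA verification question, this is the rough host-side check I run before falling back to the test cases. The signals it looks for are my own assumptions: I take a PeerMappingOverride entry in /proc/driver/nvidia/params as evidence of the 2.1 driver-config approach, and a loaded gdrdrv module in /proc/modules as evidence of the 2.2 GDRCopy approach. Please correct me if these are not the right things to check, or if only the test cases can really confirm IBGDA.

```python
# Rough sketch of host-side IBGDA prerequisite checks; it only inspects /proc
# and does not prove that NVSHMEM actually selects IBGDA at runtime.
# Assumptions: "PeerMappingOverride=1" shows up in /proc/driver/nvidia/params
# when the driver was configured per approach 2.1, and the gdrdrv module from
# GDRCopy appears in /proc/modules when approach 2.2 was used.
from pathlib import Path

def driver_regkey_set() -> bool:
    """Approach 2.1: NVIDIA driver loaded with PeerMappingOverride=1 (assumed key name)."""
    params = Path("/proc/driver/nvidia/params")
    return params.exists() and "PeerMappingOverride=1" in params.read_text()

def gdrdrv_loaded() -> bool:
    """Approach 2.2: gdrdrv kernel module (from GDRCopy) is currently loaded."""
    modules = Path("/proc/modules")
    if not modules.exists():
        return False
    return any(line.split()[0] == "gdrdrv"
               for line in modules.read_text().splitlines() if line.strip())

if __name__ == "__main__":
    print(f"NVIDIA driver PeerMappingOverride set: {driver_regkey_set()}")
    print(f"gdrdrv kernel module loaded:           {gdrdrv_loaded()}")
```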
Thanks in advance; any help would be appreciated.