Questions about environment setups #486

@CUHKSZzxy

Thanks for such great work!

I'm currently trying to use DeepEP for multi-node MoE model deployment. I have read through the related issues and docs, but a few points are still unclear to me, and I would like official confirmation to make sure I have not misunderstood anything.

  1. How to validate the DeepEP environment

For now, I am using the official test cases (intra-node / inter-node / low-latency). I assume that if all of these tests pass, the DeepEP environment is good to go. Is that assumption correct?
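To make the validation procedure concrete, here is a minimal sketch of a wrapper that runs the three test scripts and reports pass/fail by exit code. It assumes the scripts live under `tests/` in a DeepEP checkout with the names listed in the README; `build_commands` and `run_all` are hypothetical helper names, and the inter-node and low-latency tests additionally need to be launched on every node with the appropriate rendezvous settings, which this sketch does not handle.

```python
import subprocess
import sys

# Test script names as listed in the DeepEP README (assumed unchanged).
TESTS = ["test_intranode.py", "test_internode.py", "test_low_latency.py"]


def build_commands(repo_root, python=sys.executable):
    """Return one command line per test script, without running anything."""
    return [[python, f"{repo_root}/tests/{name}"] for name in TESTS]


def run_all(repo_root):
    """Run each test script in turn; map script path to True if it exited with 0."""
    results = {}
    for cmd in build_commands(repo_root):
        results[cmd[-1]] = subprocess.run(cmd).returncode == 0
    return results


# Example (hypothetical checkout path):
#   for script, ok in run_all("/workspace/DeepEP").items():
#       print(("PASS" if ok else "FAIL"), script)
```

A zero exit code only shows that the script completed on this node, so whether that is sufficient evidence of a healthy environment is exactly the question above.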

  2. About NVSHMEM

According to the NVSHMEM install guide, installation involves two stages: (1) install the NVSHMEM binary, and (2) enable NVSHMEM IBGDA support. For the IBGDA support stage, we choose either 2.1 "Configure NVIDIA driver" or 2.2 "Install GDRCopy and load the gdrdrv kernel module".

  • Is the second stage, "Enable NVSHMEM IBGDA support", a necessary step for multi-node deployment?

For 2.1 "Configure NVIDIA driver", we need to modify the driver configuration and reboot the host machine. If we instead choose the 2.2 "Install GDRCopy and load the gdrdrv kernel module" approach:

  • Do we still need to reboot the host machine?

  • Can we wrap everything related to NVSHMEM / GDRCopy into a Docker image, without having to do anything special on the host machine?

  • How can I verify that IBGDA support is set up correctly? Is running the official test cases (intra-node / inter-node / low-latency) enough?
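As an illustration of the kind of host-side check I have in mind, here is a minimal sketch that looks for the prerequisites of the two options. It assumes that `/proc/modules` lists loaded kernel modules one per line with the module name first, and that `/proc/driver/nvidia/params` exposes driver options as `Key: value` lines; the option names `EnableStreamMemOPs` and `PeerMappingOverride` are taken from my reading of the install guide, and the helper names are hypothetical. It only inspects configuration and does not prove that IBGDA works end to end.

```python
from pathlib import Path


def gdrdrv_loaded(proc_modules_text):
    """Option 2.2: True if a module named exactly 'gdrdrv' is listed."""
    return any(
        line.split()[:1] == ["gdrdrv"]
        for line in proc_modules_text.splitlines()
    )


def driver_params_ok(params_text):
    """Option 2.1: True if both driver settings appear to be enabled.

    Assumes 'Key: value' lines, with RegistryDwords possibly quoted.
    """
    params = {}
    for line in params_text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines without a colon
            params[key.strip()] = value.strip().strip('"')
    return (
        params.get("EnableStreamMemOPs") == "1"
        and "PeerMappingOverride=1" in params.get("RegistryDwords", "")
    )


def read_or_empty(path):
    """Read a file, returning '' if it is missing or unreadable."""
    try:
        return Path(path).read_text()
    except OSError:
        return ""


# Example usage on the host (or in a container that can see the host's /proc):
#   print("gdrdrv loaded:", gdrdrv_loaded(read_or_empty("/proc/modules")))
#   print("driver params:", driver_params_ok(read_or_empty("/proc/driver/nvidia/params")))
```

Either check passing would suggest the corresponding option has been applied, but I am not sure whether that is the recommended way to confirm IBGDA is actually in use.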

Thanks in advance, any help would be appreciated.
