Closed
Description
#639 added new test cases for multi-node:
- all_gather
- alltoall
When running the NCCL tests on two P4 instances, the alltoall test failed with an out-of-memory (OOM) error:
        [1,6]<stdout>:multi-node-alltoall-perf-worker-0: Test NCCL failure alltoall.cu:62 'unhandled cuda error (run with NCCL_DEBUG=INFO for details) / '
        [1,6]<stdout>: .. multi-node-alltoall-perf-worker-0 pid 411: Test failure common.cu:381
        [1,6]<stdout>: .. multi-node-alltoall-perf-worker-0 pid 411: Test failure common.cu:590
        [1,6]<stdout>: .. multi-node-alltoall-perf-worker-0 pid 411: Test failure alltoall.cu:97
        [1,6]<stdout>: .. multi-node-alltoall-perf-worker-0 pid 411: Test failure common.cu:623
        [1,6]<stdout>: .. multi-node-alltoall-perf-worker-0 pid 411: Test failure common.cu:1073
        [1,6]<stdout>: .. multi-node-alltoall-perf-worker-0 pid 411: Test failure common.cu:886
        [1,0]<stdout>:multi-node-alltoall-perf-worker-0:402:458 [0] NCCL INFO [Service thread] Connection closed by localRank 6
        [1,0]<stdout>:
        [1,0]<stdout>:multi-node-alltoall-perf-worker-0:402:402 [0] enqueue.cc:1451 NCCL WARN Cuda failure 2 'out of memory'
The all_gather test succeeded. The alltoall test is more memory-intensive, and the current test parameters are too aggressive for running it on P4 instances.
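
One possible way to make the parameters less aggressive is to cap the maximum message size passed to the benchmark. The sketch below is only an illustration: it assumes the test is driven by the `alltoall_perf` binary from nccl-tests and that the maximum message size (`-e`) is what drives the memory footprint, neither of which is confirmed by the logs above. The helper name `build_alltoall_cmd` and the sizes shown are hypothetical, not the values used by this repository.

```python
import shlex


def build_alltoall_cmd(max_bytes="1G", min_bytes="8", step_factor=2, gpus_per_process=1):
    """Build an alltoall_perf command line (hypothetical helper).

    Uses the nccl-tests flags -b (min bytes), -e (max bytes),
    -f (size step factor) and -g (GPUs per process).
    """
    args = [
        "alltoall_perf",
        "-b", str(min_bytes),
        "-e", str(max_bytes),   # assumed to be the main knob for memory usage
        "-f", str(step_factor),
        "-g", str(gpus_per_process),
    ]
    return shlex.join(args)


# Example: a smaller cap for the all-to-all case (256M is an illustrative value only).
print(build_alltoall_cmd(max_bytes="256M"))
```

Keeping the cap as a per-test argument would let all_gather keep its current settings while alltoall runs with a smaller ceiling.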