Closed
Description
#639 added new test cases for multi-node:
- all_gather
- alltoall
When running the NCCL tests on two P4 instances, the alltoall test failed with an out-of-memory (OOM) error:
[1,6]<stdout>:multi-node-alltoall-perf-worker-0: Test NCCL failure alltoall.cu:62 'unhandled cuda error (run with NCCL_DEBUG=INFO for details) / '
[1,6]<stdout>: .. multi-node-alltoall-perf-worker-0 pid 411: Test failure common.cu:381
[1,6]<stdout>: .. multi-node-alltoall-perf-worker-0 pid 411: Test failure common.cu:590
[1,6]<stdout>: .. multi-node-alltoall-perf-worker-0 pid 411: Test failure alltoall.cu:97
[1,6]<stdout>: .. multi-node-alltoall-perf-worker-0 pid 411: Test failure common.cu:623
[1,6]<stdout>: .. multi-node-alltoall-perf-worker-0 pid 411: Test failure common.cu:1073
[1,6]<stdout>: .. multi-node-alltoall-perf-worker-0 pid 411: Test failure common.cu:886
[1,0]<stdout>:multi-node-alltoall-perf-worker-0:402:458 [0] NCCL INFO [Service thread] Connection closed by localRank 6
[1,0]<stdout>:
[1,0]<stdout>:multi-node-alltoall-perf-worker-0:402:402 [0] enqueue.cc:1451 NCCL WARN Cuda failure 2 'out of memory'
The all_gather test succeeded. The issue is that all-to-all is more memory-intensive, and the current test parameters are too aggressive for running it on P4 instances.
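For triage, the root cause in a log like the one above is easy to miss among the cascading `Test failure` lines. Here is a minimal sketch of a filter that pulls out only the `NCCL WARN` entries. It assumes the `[1,N]<stdout>:` prefix is the launcher's tagged-output format with the rank as the second number; the helper name `find_warnings` is hypothetical and not part of the test suite.

```python
import re

# Sample lines copied verbatim from the failure log in this issue.
LOG = """\
[1,6]<stdout>:multi-node-alltoall-perf-worker-0: Test NCCL failure alltoall.cu:62 'unhandled cuda error (run with NCCL_DEBUG=INFO for details) / '
[1,6]<stdout>: .. multi-node-alltoall-perf-worker-0 pid 411: Test failure common.cu:381
[1,0]<stdout>:multi-node-alltoall-perf-worker-0:402:458 [0] NCCL INFO [Service thread] Connection closed by localRank 6
[1,0]<stdout>:multi-node-alltoall-perf-worker-0:402:402 [0] enqueue.cc:1451 NCCL WARN Cuda failure 2 'out of memory'
"""

# Assumption: each line is prefixed with "[job,rank]<stream>:".
TAG = re.compile(r"^\[(\d+),(\d+)\]<(stdout|stderr)>:(.*)$")
# Captures the "file:line" source location and the warning message.
WARN = re.compile(r"(\S+:\d+) NCCL WARN (.*)$")


def find_warnings(log_text):
    """Return (rank, source_location, message) for every NCCL WARN line."""
    results = []
    for line in log_text.splitlines():
        tagged = TAG.match(line)
        if not tagged:
            continue  # skip lines without the launcher prefix
        rank = int(tagged.group(2))
        warn = WARN.search(tagged.group(4))
        if warn:
            results.append((rank, warn.group(1), warn.group(2)))
    return results


warnings = find_warnings(LOG)
for rank, location, message in warnings:
    print(f"rank {rank}: {location}: {message}")
```

On the sample lines this reports only the `enqueue.cc:1451` out-of-memory warning, which is the line that identifies the failure as OOM rather than a generic CUDA error.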