ytsssun commented Sep 24, 2025

Issue #, if available:
#690

Description of changes:
Use the default maxBytes and ncclBuffSize values when running the all-to-all test case on P4 instances.
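
As a rough illustration of the idea (not the actual diff), the selection logic might look like the sketch below. The type `ncclParams`, the helpers `isP4` and `paramsFor`, and the placeholder override values are all hypothetical. The sketch also assumes that other instance types keep using tuned values and that an empty field means "leave the default in place".

```go
package main

import (
	"fmt"
	"strings"
)

// ncclParams holds optional overrides for an NCCL test run.
// An empty field means "do not override; keep the default".
// This type is hypothetical and only serves the illustration.
type ncclParams struct {
	MaxBytes     string // hypothetical: largest message size used by the test
	NcclBuffSize string // hypothetical: NCCL buffer size setting
}

// Placeholder override values, chosen for illustration only.
// They are not taken from the repository.
const (
	tunedMaxBytes     = "16G"
	tunedNcclBuffSize = "8388608"
)

// isP4 reports whether the instance type belongs to the P4 family,
// based on its name prefix (for example "p4d.24xlarge").
func isP4(instanceType string) bool {
	return strings.HasPrefix(strings.ToLower(instanceType), "p4")
}

// paramsFor returns the overrides for a test on an instance type.
// For the all-to-all test on P4 it returns no overrides, so the
// defaults apply; everything else gets the placeholder tuned values.
func paramsFor(testName, instanceType string) ncclParams {
	if testName == "alltoall_perf" && isP4(instanceType) {
		return ncclParams{}
	}
	return ncclParams{MaxBytes: tunedMaxBytes, NcclBuffSize: tunedNcclBuffSize}
}

func main() {
	for _, it := range []string{"p4d.24xlarge", "p5.48xlarge"} {
		fmt.Printf("%s: %+v\n", it, paramsFor("alltoall_perf", it))
	}
}
```

Returning an empty struct for the P4 all-to-all case keeps the special case in one place, so callers only need to skip setting a value when a field is empty.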

Testing done:
Ran the all-to-all test case on P4 with this change, and the test passed:

    mpi_test.go:123: Multi node job completed
--- PASS: TestMPIJobPytorchTraining (40.41s)
    --- PASS: TestMPIJobPytorchTraining/multi-node:alltoall_perf (40.41s)
        --- PASS: TestMPIJobPytorchTraining/multi-node:alltoall_perf/MPIJob_succeeds (40.31s)
PASS
ok      github.com/aws/aws-k8s-tester/test/cases/nvidia 52.345s

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

ndbaker1 linked an issue on Sep 25, 2025 that may be closed by this pull request
Signed-off-by: Yutong Sun <yutongsu@amazon.com>
ytsssun force-pushed the fix-all-to-all-for-p4 branch from acb86fe to 4c6b91d on October 1, 2025 21:42
ytsssun commented Oct 1, 2025

Addressed the comment.

ndbaker1 merged commit 7bd6f5c into aws:main on Oct 2, 2025
9 of 10 checks passed

Development

Successfully merging this pull request may close these issues.

Multi-node NCCL case all-to-all failed on P4 instance