Bugfix/e2e neuron nccom #521

mattcjo · 2024-12-13T20:02:46Z

Description

This PR fixes the calculation logic for neuronPerNode, neuronCorePerNode, and efaPerNode in the checkNodeTypes function. Previously, the logic only calculated capacity from the first node (nodes.Items[0]), leading to incorrect capacity values like neuronCorePerNode = 0.

As a result, running the Neuron multi-node tests directly failed with the following error:

kubectl logs pod/multi-node-nccom-test-launcher-6hrgq

Defaulted container "nccom-test-launcher" out of: nccom-test-launcher, init-worker-ips (init)
 * Starting OpenBSD Secure Shell server sshd
   ...done.
[sudo] password for ubuntu: Invalid number of worker threads (0)

What Changed?

Accurate Capacity Calculation: The logic now iterates over all nodes to calculate the total capacity for neuron, neuroncore, and EFA resources.
Per-Node Capacity Calculation: Per-node values are calculated as (total capacity / node count) to ensure proper values for multi-node clusters.
Improved Error Handling: If nodes are not found or capacity details are missing, the function now returns clear error messages.
Logging Improvements: Informational and warning logs were added for missing capacity details, node counts, and final per-node capacity values.

Root Cause

The root cause was that neuronCorePerNode was calculated from a single node's capacity details instead of from all cluster nodes. This left neuronCorePerNode set to 0, and that value was then passed to the pods through environment variables such as NEURON_CORE_PER_NODE=0, giving them an incorrect view of the available capacity.

Solution

Loop over all nodes to calculate the total capacity for neuron, neuroncore, and EFA across the cluster.
Divide total capacity by node count to compute per-node capacities for each attribute.
Log any nodes missing capacity information.
Add robust error handling to fail early if no nodes or capacity information is found.

Testing

NOTE: The logs have been truncated to show only the critical pieces.

go test -v ./... \
  -run ^TestNeuronNodes$/multi-node \
  -timeout 60m \
  -neuronTestImage=632572741643.dkr.ecr.us-east-1.amazonaws.com/aws-k8s-tester/neuron-test:latest \
  -nodeType=trn1.32xlarge \
  -efaEnabled=true
2024/12/13 20:00:34 [INFO] Processing node ip-192-168-79-55.ec2.internal
2024/12/13 20:00:34 [INFO] Processing node ip-192-168-87-201.ec2.internal
2024/12/13 20:00:34 [INFO] Total Nodes: 2
2024/12/13 20:00:34 [INFO] Total Neuron Count: 32, Neuron Per Node: 16
2024/12/13 20:00:34 [INFO] Total Neuron Core Count: 64, Neuron Core Per Node: 32
2024/12/13 20:00:34 [INFO] Total EFA Count: 16, EFA Per Node: 8
=== RUN   TestNeuronNodes
=== RUN   TestNeuronNodes/multi-node
    neuron_test.go:116: Applying multi node manifest
    neuron_test.go:121: Applied manifest successfully
=== RUN   TestNeuronNodes/multi-node/NCCOM_test_succeeds
    neuron_test.go:132: Waiting for MPIJob to complete
.
.
.
        [1,0]<stdout>:nccom_neff_allr_x10_536870912_fp32_r64
        [1,0]<stdout>:+---+[1,0]<stdout>:----+---------+---------+------------+-------+--------+---------+---------+---------+--------+[1,0]<stdout>:---------+---------+-------+
        [1,0]<stdout>:  B   NC  [1,0]<stdout>: NC USED   WEIGHTS   MODE         INF/S   IRES/S   L(1)      L(50)     L(99)     CCL(1)  [1,0]<stdout>: CCL(50)   CCL(99)   %USER
        [1,0]<stdout>:  1   [1,0]<stdout>:1    32        dynamic   [1,0]<stdout>:LIBMODE_CC   3.71    3.71     8614880 [1,0]<stdout>:  8615152   8615377 [1,0]<stdout>:  64458    64867     66149   [1,0]<stdout>:  N/A
        [1,0]<stdout>:+---+----+---------+---------+------------+-------+[1,0]<stdout>:--------+---------+---------+---------+--------+---------+---------+-------+
               size(B)    count(elems)    type    time:avg(us)    algbw(GB/s)    busbw(GB/s)
                     8               2    fp32          565.39           0.00           0.00
                    16               4    fp32          654.02           0.00           0.00
                    32               8    fp32          551.89           0.00           0.00
                    64              16    fp32          531.51           0.00           0.00
                   128              32    fp32          309.51           0.00           0.00
                   256              64    fp32          548.34           0.00           0.00
                   512             128    fp32          546.11           0.00           0.00
                  1024             256    fp32          896.31           0.00           0.00
                  2048             512    fp32           890.1           0.00           0.00
                  4096            1024    fp32           670.8           0.01           0.01
                  8192            2048    fp32          787.22           0.01           0.02
                 16384            4096    fp32         1014.55           0.02           0.03
                 32768            8192    fp32          896.11           0.03           0.07
                 65536           16384    fp32          800.97           0.08           0.15
                131072           32768    fp32         1026.83           0.12           0.23
                262144           65536    fp32          562.85           0.43           0.85
                524288          131072    fp32          695.69           0.70           1.38
               1048576          262144    fp32           502.7           1.94           3.82
               2097152          524288    fp32          340.44           5.74          11.29
               4194304         1048576    fp32          468.29           8.34          16.42
               8388608         2097152    fp32          670.48          11.65          22.94
              16777216         4194304    fp32          977.58          15.98          31.47
              33554432         8388608    fp32         1305.05          23.95          47.14
              67108864        16777216    fp32         2485.16          25.15          49.51
             134217728        33554432    fp32         5143.83          24.30          47.84
             268435456        67108864    fp32         8303.74          30.11          59.27
             536870912       134217728    fp32        16477.71          30.34          59.74
            1073741824       268435456    fp32        32753.48          30.53          60.11
            2147483648       536870912    fp32        64997.61          30.77          60.58
        Avg bus bandwidth:	16.3070GB/s

--- PASS: TestNeuronNodes (376.58s)
    --- PASS: TestNeuronNodes/multi-node (376.58s)
        --- PASS: TestNeuronNodes/multi-node/NCCOM_test_succeeds (376.24s)
PASS
ok  	github.com/aws/aws-k8s-tester/e2e2/test/cases/neuron	396.687s

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

e2e2/test/cases/neuron/main_test.go

mattcjo added 3 commits December 13, 2024 19:48

Update checkNodeTypes to correctly set neuronCorePerNode for when only multi-node is ran.

aef2fd1

Remove unnecessary WARN statements

8808916

Change WARN to INFO log

6f5c88b

mattcjo requested a review from Pavani-Panakanti December 13, 2024 20:50

ndbaker1 requested changes Dec 13, 2024

3 review comments on e2e2/test/cases/neuron/main_test.go (outdated, resolved)

Update based on comments in PR - aws#521

5b8ffb3

mattcjo requested a review from ndbaker1 December 13, 2024 21:19

ndbaker1 approved these changes Dec 13, 2024

mattcjo merged commit a55aaac into aws:main Dec 13, 2024
5 checks passed

mattcjo deleted the bugfix/e2e-neuron-nccom branch December 13, 2024 21:26
