Bugfix/e2e neuron nccom #521
Merged
Description
This PR fixes the calculation of neuronPerNode, neuronCorePerNode, and efaPerNode in the checkNodeTypes function. Previously, the logic read capacity only from the first node (nodes.Items[0]), which produced incorrect values such as neuronCorePerNode = 0.
This would lead to the following error when trying to run the Neuron multi-node tests directly:
What Changed?
Root Cause
neuronCorePerNode was calculated from a single node's capacity instead of by iterating over all cluster nodes. As a result it was set to 0, and that value was then passed on through environment variables such as NEURON_CORE_PER_NODE=0, giving pods incorrect capacity assumptions.
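The difference between reading one node and scanning all of them can be illustrated with a minimal, self-contained sketch. It is not the code from this PR: the node type is a simplified stand-in for the Kubernetes node object, the resource names and capacity numbers are assumptions for illustration, and taking the maximum across nodes is only one plausible way to aggregate.

```go
package main

import "fmt"

// Resource names assumed for illustration; the names used by the
// actual test code may differ.
const (
	neuronResource     = "aws.amazon.com/neuron"
	neuronCoreResource = "aws.amazon.com/neuroncore"
	efaResource        = "vpc.amazonaws.com/efa"
)

// node is a hypothetical, simplified stand-in for a Kubernetes node:
// only its name and a capacity map are modeled.
type node struct {
	Name     string
	Capacity map[string]int64
}

// perNodeCapacity scans every node and returns the largest capacity
// reported for the given resource. Nodes that do not expose the
// resource contribute 0 and therefore never mask a real value.
func perNodeCapacity(nodes []node, resource string) int64 {
	var best int64
	for _, n := range nodes {
		if c := n.Capacity[resource]; c > best {
			best = c
		}
	}
	return best
}

func main() {
	// Illustrative cluster: the first node exposes no Neuron or EFA
	// resources, the second one does.
	nodes := []node{
		{Name: "node-a", Capacity: map[string]int64{}},
		{Name: "node-b", Capacity: map[string]int64{
			neuronResource:     16,
			neuronCoreResource: 32,
			efaResource:        8,
		}},
	}

	// Reading only the first node reproduces the reported symptom.
	fmt.Println("first node only:", nodes[0].Capacity[neuronCoreResource])

	// Scanning all nodes finds the real per-node capacity.
	fmt.Println("all nodes:", perNodeCapacity(nodes, neuronCoreResource))
}
```

In this example the first-node lookup yields 0 while the scan over all nodes yields 32, which mirrors how NEURON_CORE_PER_NODE could end up as 0 when the first listed node carries no Neuron capacity.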
Solution
Testing
NOTE: The logs have been truncated to show only the critical pieces.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.