@mattcjo mattcjo commented Aug 7, 2024

Issue #, if available:

Description of changes:

This PR adds an end-to-end (E2E) BERT training test. The test was validated on a four-node cluster of p3.16xlarge instances.

The training results are shown below. These logs were taken from the master pod that coordinated the E2E BERT training job.

[1,31]<stdout>:Process 31 - Training time: 10.09 seconds
[1,31]<stdout>:Process 31 - Throughput: 9.91 samples/second
[1,29]<stdout>:Process 29 - Training time: 10.05 seconds
[1,29]<stdout>:Process 29 - Throughput: 9.95 samples/second
[1,28]<stdout>:Process 28 - Training time: 10.09 seconds
[1,28]<stdout>:Process 28 - Throughput: 9.91 samples/second
[1,25]<stdout>:Process 25 - Training time: 10.04 seconds
[1,25]<stdout>:Process 25 - Throughput: 9.96 samples/second
[1,27]<stdout>:Process 27 - Training time: 10.10 seconds
[1,27]<stdout>:Process 27 - Throughput: 9.90 samples/second
[1,20]<stdout>:Process 20 - Training time: 10.09 seconds
[1,20]<stdout>:Process 20 - Throughput: 9.91 samples/second
[1,3]<stdout>:Process 3 - Training time: 10.07 seconds
[1,3]<stdout>:Process 3 - Throughput: 9.93 samples/second
[1,0]<stdout>:Process 0 - Training time: 10.03 seconds
[1,0]<stdout>:Process 0 - Throughput: 9.97 samples/second
[1,23]<stdout>:Process 23 - Training time: 10.04 seconds
[1,23]<stdout>:Process 23 - Throughput: 9.96 samples/second
[1,24]<stdout>:Process 24 - Training time: 10.10 seconds
[1,24]<stdout>:Process 24 - Throughput: 9.90 samples/second
[1,2]<stdout>:Process 2 - Training time: 10.14 seconds
[1,2]<stdout>:Process 2 - Throughput: 9.86 samples/second
[1,5]<stdout>:Process 5 - Training time: 10.08 seconds
[1,5]<stdout>:Process 5 - Throughput: 9.92 samples/second
[1,21]<stdout>:Process 21 - Training time: 10.08 seconds
[1,21]<stdout>:Process 21 - Throughput: 9.92 samples/second
[1,22]<stdout>:Process 22 - Training time: 10.07 seconds
[1,22]<stdout>:Process 22 - Throughput: 9.93 samples/second
[1,30]<stdout>:Process 30 - Training time: 10.09 seconds
[1,30]<stdout>:Process 30 - Throughput: 9.91 samples/second
[1,1]<stdout>:Process 1 - Training time: 10.07 seconds
[1,1]<stdout>:Process 1 - Throughput: 9.93 samples/second
[1,17]<stdout>:Process 17 - Training time: 10.11 seconds
[1,17]<stdout>:Process 17 - Throughput: 9.89 samples/second
[1,12]<stdout>:Process 12 - Training time: 10.01 seconds
[1,12]<stdout>:Process 12 - Throughput: 9.99 samples/second
[1,6]<stdout>:Process 6 - Training time: 10.04 seconds
[1,6]<stdout>:Process 6 - Throughput: 9.96 samples/second
[1,18]<stdout>:Process 18 - Training time: 10.12 seconds
[1,18]<stdout>:Process 18 - Throughput: 9.88 samples/second
[1,7]<stdout>:Process 7 - Training time: 10.11 seconds
[1,7]<stdout>:Process 7 - Throughput: 9.89 samples/second
[1,15]<stdout>:Process 15 - Training time: 10.14 seconds
[1,15]<stdout>:Process 15 - Throughput: 9.86 samples/second
[1,19]<stdout>:Process 19 - Training time: 10.12 seconds
[1,19]<stdout>:Process 19 - Throughput: 9.89 samples/second
[1,14]<stdout>:Process 14 - Training time: 9.96 seconds
[1,14]<stdout>:Process 14 - Throughput: 10.04 samples/second
[1,13]<stdout>:Process 13 - Training time: 10.05 seconds
[1,13]<stdout>:Process 13 - Throughput: 9.95 samples/second
[1,16]<stdout>:Process 16 - Training time: 10.10 seconds
[1,16]<stdout>:Process 16 - Throughput: 9.90 samples/second
[1,26]<stdout>:Process 26 - Training time: 10.11 seconds
[1,26]<stdout>:Process 26 - Throughput: 9.89 samples/second
[1,10]<stdout>:Process 10 - Training time: 10.12 seconds
[1,10]<stdout>:Process 10 - Throughput: 9.88 samples/second
[1,11]<stdout>:Process 11 - Training time: 10.10 seconds
[1,11]<stdout>:Process 11 - Throughput: 9.90 samples/second
[1,8]<stdout>:Process 8 - Training time: 10.09 seconds
[1,8]<stdout>:Process 8 - Throughput: 9.91 samples/second
[1,4]<stdout>:Process 4 - Training time: 10.05 seconds
[1,4]<stdout>:Process 4 - Throughput: 9.95 samples/second
[1,9]<stdout>:Process 9 - Training time: 10.08 seconds
[1,9]<stdout>:Process 9 - Throughput: 9.92 samples/second
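The per-process lines above are easier to compare once aggregated. The following is a minimal sketch, not part of this PR, that parses log lines in the format shown and reports the process count, mean training time, and summed throughput. The function name `summarize_training_log` and the returned keys are hypothetical.

```python
import re
from statistics import mean

# Matches lines such as:
#   [1,31]<stdout>:Process 31 - Training time: 10.09 seconds
#   [1,31]<stdout>:Process 31 - Throughput: 9.91 samples/second
LINE_RE = re.compile(
    r"^\[\d+,\d+\]<stdout>:Process (\d+) - (Training time|Throughput): ([0-9.]+)"
)


def summarize_training_log(log_text):
    """Aggregate per-process training time and throughput (hypothetical helper)."""
    times = {}
    throughputs = {}
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # ignore anything that is not a metric line
        rank = int(match.group(1))
        value = float(match.group(3))
        if match.group(2) == "Training time":
            times[rank] = value
        else:
            throughputs[rank] = value
    return {
        "processes": len(throughputs),
        "mean_training_time_s": mean(times.values()) if times else None,
        "total_throughput": sum(throughputs.values()),
    }
```

Run against the master pod's output, this gives a single summary to compare between runs instead of 64 separate lines.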

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

mattcjo and others added 30 commits June 26, 2024 21:15
…ce to be consistent with the other test images
Comment on lines +74 to +80
resources:
  requests:
    nvidia.com/gpu: 8
    vpc.amazonaws.com/efa: 0
  limits:
    nvidia.com/gpu: 8
    vpc.amazonaws.com/efa: 0
Contributor

I don't think we should hardcode this, since we may need to run the test on different node configurations (e.g., node type, node count).
See the following references for how to avoid hardcoding it:
https://github.com/aws/aws-k8s-tester/blob/main/e2e2/test/cases/nvidia/main_test.go#L98-L144
https://github.com/aws/aws-k8s-tester/blob/main/e2e2/test/cases/nvidia/manifests/mpi-job-nccl-test-multi-node.yaml
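To illustrate the general idea only: the resource counts can be placeholders that are filled in from the cluster's actual node configuration before the manifest is applied. The sketch below shows this in Python with `string.Template`. The test harness in this repository is written in Go, so the linked files remain the reference for the real approach; the placeholder names `gpu_per_node` and `efa_per_node` and the function `render_resources` are assumptions made for this example.

```python
from string import Template

# Same resources block as in the diff, with the counts replaced by placeholders.
# The placeholder names are hypothetical.
RESOURCES_TEMPLATE = Template(
    "resources:\n"
    "  requests:\n"
    "    nvidia.com/gpu: $gpu_per_node\n"
    "    vpc.amazonaws.com/efa: $efa_per_node\n"
    "  limits:\n"
    "    nvidia.com/gpu: $gpu_per_node\n"
    "    vpc.amazonaws.com/efa: $efa_per_node\n"
)


def render_resources(gpu_per_node, efa_per_node):
    """Render the resources block for a given node configuration."""
    if gpu_per_node < 1:
        raise ValueError("gpu_per_node must be at least 1")
    if efa_per_node < 0:
        raise ValueError("efa_per_node must not be negative")
    return RESOURCES_TEMPLATE.substitute(
        gpu_per_node=gpu_per_node, efa_per_node=efa_per_node
    )
```

Calling `render_resources(8, 0)` reproduces the values currently hardcoded in the diff, while other instance types only need different arguments.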

Contributor Author

Sure, we can parameterize it for future-proofing. Right now all tests run on instances with 8 NVIDIA GPUs, but I have no problem with this. Will make the update.

		return ctx
	}).
	Teardown(func(ctx context.Context, t *testing.T, cfg *envconf.Config) context.Context {
		// Delete the manifest
Contributor

If we print the logs before deleting the manifest, it will be easier to troubleshoot test failures.
Reference: https://github.com/aws/aws-k8s-tester/blob/main/e2e2/test/cases/nvidia/mpi_test.go#L128-L135


for epoch in range(1):  # Short run for testing
    ddp_model.train()
    for batch in train_dataloader:
Contributor

The test is broken.

[1,0]<stderr>:Traceback (most recent call last):
[1,0]<stderr>:  File "/app/train.py", line 138, in <module>
[1,0]<stderr>:    main()
[1,0]<stderr>:  File "/app/train.py", line 123, in main
[1,0]<stderr>:    num_gpus_per_node = int(os.environ["NUM_GPUS_PER_NODE"]) 
[1,0]<stderr>:  File "/usr/local/lib/python3.10/os.py", line 680, in __getitem__
[1,0]<stderr>:    raise KeyError(key) from None
[1,0]<stderr>:KeyError: 'NUM_GPUS_PER_NODE'
[1,1]<stderr>:Traceback (most recent call last):
[1,1]<stderr>:  File "/app/train.py", line 138, in <module>
[1,1]<stderr>:    main()
[1,1]<stderr>:  File "/app/train.py", line 123, in main
[1,1]<stderr>:    num_gpus_per_node = int(os.environ["NUM_GPUS_PER_NODE"]) 
[1,1]<stderr>:  File "/usr/local/lib/python3.10/os.py", line 680, in __getitem__
[1,1]<stderr>:    raise KeyError(key) from None
[1,1]<stderr>:KeyError: 'NUM_GPUS_PER_NODE'
[1,2]<stderr>:Traceback (most recent call last):
[1,2]<stderr>:  File "/app/train.py", line 138, in <module>
[1,2]<stderr>:    main()
[1,2]<stderr>:  File "/app/train.py", line 123, in main
[1,2]<stderr>:    num_gpus_per_node = int(os.environ["NUM_GPUS_PER_NODE"]) 
[1,2]<stderr>:  File "/usr/local/lib/python3.10/os.py", line 680, in __getitem__
[1,2]<stderr>:    raise KeyError(key) from None
[1,2]<stderr>:KeyError: 'NUM_GPUS_PER_NODE'
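The traceback shows `train.py` reading `NUM_GPUS_PER_NODE` with `os.environ[...]`, which raises `KeyError` on every rank when the variable is not set in the pod. Setting the variable in the job manifest addresses the root cause. As a complementary, hypothetical sketch, a defensive read can make the failure message clearer or allow an explicit fallback. The helper name `read_num_gpus_per_node` and its `fallback` parameter are assumptions and do not appear in the PR.

```python
import os


def read_num_gpus_per_node(env=None, fallback=None):
    """Read NUM_GPUS_PER_NODE, with a clear error or an explicit fallback.

    `env` defaults to os.environ; it is a parameter so the helper is easy to test.
    """
    env = os.environ if env is None else env
    raw = env.get("NUM_GPUS_PER_NODE")
    if raw is None:
        if fallback is None:
            raise RuntimeError(
                "NUM_GPUS_PER_NODE is not set; define it in the job manifest"
            )
        return fallback
    value = int(raw)  # raises ValueError if the value is not an integer
    if value < 1:
        raise ValueError("NUM_GPUS_PER_NODE must be at least 1")
    return value
```

With this helper, a missing variable fails once with an actionable message rather than a bare `KeyError` repeated per rank.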

Contributor

Also, you forgot to add the test binary to Dockerfile.kubetest2.

@mattcjo mattcjo closed this Jan 10, 2025