Add OCI example tests for OFED userspace tools and iperf3 #477
The iperf3 tests are not passing for me on an image with iperf3 installed. Aside from that, everything else works as expected and looks good.
If the iperf3 tests pass on someone else's known-good configuration, please record that here.
I am not able to reproduce the
tox -e integration-tests -- examples/oracle/oracle-example-cluster-test.py -k "TestOracleClusterPerformance"
integration-tests: commands[0]> .tox/integration-tests/bin/python -m pytest --log-cli-level=INFO -svv examples/oracle/oracle-example-cluster-test.py -k TestOracleClusterPerformance
========================================================================================================================= test session starts ==========================================================================================================================
platform linux -- Python 3.12.3, pytest-8.3.4, pluggy-1.5.0 -- /home/davide/repo/tools/pycloudlib/.tox/integration-tests/bin/python
cachedir: .tox/integration-tests/.pytest_cache
rootdir: /home/davide/repo/tools/pycloudlib
configfile: pyproject.toml
plugins: xdist-3.6.1, mock-3.14.0, cov-6.0.0
collecting ...
------------------------------------------------------------------------------------------------------------------------- live log collection --------------------------------------------------------------------------------------------------------------------------
INFO oci.circuit_breaker:__init__.py:27 Default Auth client Circuit breaker strategy enabled
collected 12 items / 11 deselected / 1 selected
examples/oracle/oracle-example-cluster-test.py::TestOracleClusterPerformance::test_iperf3
---------------------------------------------------------------------------------------------------------------------------- live log setup ----------------------------------------------------------------------------------------------------------------------------
INFO pycloudlib.cloud.OCI:cloud.py:354 No public key path provided, using: /home/davide/.ssh/id_ed25519.pub
INFO oracle-example-cluster-test:oracle-example-cluster-test.py:143 Instance ocid1.instance.oc1.phx.anyhqljsniwq6syc5nsrp6gpjsdamswpjs6aluliyypm2v5k25h3ackuc3ia already has a secondary VNIC, not attaching one.
INFO oracle-example-cluster-test:oracle-example-cluster-test.py:143 Instance ocid1.instance.oc1.phx.anyhqljsniwq6sycbiiymnuolvlp3hm7iryh5xr5rwzqh4vzz74gpkp43tea already has a secondary VNIC, not attaching one.
---------------------------------------------------------------------------------------------------------------------------- live log call -----------------------------------------------------------------------------------------------------------------------------
INFO pycloudlib.instance:instance.py:285 executing: sh -c 'iperf3 -s -1'
INFO pycloudlib.instance:instance.py:106 Using ipv4 address: 129.146.167.132
INFO paramiko.transport:transport.py:1944 Connected (version 2.0, client OpenSSH_9.6p1)
INFO paramiko.transport:transport.py:1944 Authentication (publickey) successful!
INFO pycloudlib.instance:instance.py:285 executing: sh -c 'iperf3 -c 10.0.1.99 -P 40 -Z | grep SUM'
INFO pycloudlib.instance:instance.py:106 Using ipv4 address: 129.146.4.55
INFO paramiko.transport:transport.py:1944 Connected (version 2.0, client OpenSSH_9.6p1)
INFO paramiko.transport:transport.py:1944 Authentication (publickey) successful!
INFO oracle-example-cluster-test:oracle-example-cluster-test.py:450 iperf3 output: [SUM] 0.00-1.00 sec 5.39 GBytes 46.3 Gbits/sec 3326
[SUM] 1.00-2.00 sec 5.36 GBytes 46.0 Gbits/sec 2993
[SUM] 2.00-3.00 sec 5.36 GBytes 46.0 Gbits/sec 2759
[SUM] 3.00-4.00 sec 5.35 GBytes 46.0 Gbits/sec 2745
[SUM] 4.00-5.00 sec 5.36 GBytes 46.0 Gbits/sec 2633
[SUM] 5.00-6.00 sec 5.35 GBytes 46.0 Gbits/sec 2931
[SUM] 6.00-7.00 sec 5.36 GBytes 46.0 Gbits/sec 2641
[SUM] 7.00-8.00 sec 5.35 GBytes 46.0 Gbits/sec 2662
[SUM] 8.00-9.00 sec 5.36 GBytes 46.0 Gbits/sec 2883
[SUM] 9.00-10.01 sec 5.36 GBytes 45.8 Gbits/sec 2857
[SUM] 0.00-10.01 sec 53.6 GBytes 46.0 Gbits/sec 28430 sender
[SUM] 0.00-10.01 sec 53.6 GBytes 46.0 Gbits/sec receiver
iperf3 measured throughput: 46.0
PASSED
=========================================================================================================================== warnings summary ===========================================================================================================================
examples/oracle/oracle-example-cluster-test.py: 132 warnings
/home/davide/repo/tools/pycloudlib/.tox/integration-tests/lib/python3.12/site-packages/oci/base_client.py:77: DeprecationWarning: datetime.datetime.utcnow() is deprecated and scheduled for removal in a future version. Use timezone-aware objects to represent datetimes in UTC: datetime.datetime.now(datetime.UTC).
return " " + str(datetime.utcnow()) + ": "
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================================================================================================== 1 passed, 11 deselected, 132 warnings in 23.71s ============================================================================================================
integration-tests: OK (24.44=setup[0.04]+cmd[24.41] seconds)
congratulations :) (24.50 seconds)
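For readers following the log above, the "iperf3 measured throughput: 46.0" line is derived from the `[SUM] ... receiver` summary that `grep SUM` keeps. The sketch below shows one way such a value could be extracted. It is not the code from the PR: the function name `parse_receiver_gbits` is hypothetical, and it assumes iperf3 reports the rate in Gbits/sec, as it does in this run.

```python
import re

# Matches the final receiver summary line, for example:
# [SUM]   0.00-10.01  sec  53.6 GBytes  46.0 Gbits/sec      receiver
# Assumption: the rate is reported in Gbits/sec (true for the run above,
# but iperf3 may choose other units on slower links).
_SUM_RECEIVER = re.compile(r"^\[SUM\].*?([\d.]+)\s+Gbits/sec.*\breceiver\s*$")


def parse_receiver_gbits(iperf3_output: str) -> float:
    """Return the receiver-side aggregate throughput in Gbits/sec.

    Hypothetical helper: scans the output line by line and returns the
    rate from the first '[SUM] ... receiver' line it finds.
    """
    for line in iperf3_output.splitlines():
        match = _SUM_RECEIVER.match(line.strip())
        if match:
            return float(match.group(1))
    raise ValueError("no '[SUM] ... receiver' line found in iperf3 output")
```

A test can then compare the returned value against whatever minimum it accepts; the per-second interval lines and the sender summary are ignored because they do not end in "receiver".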
Davide has verified that the iperf3 tests work on his image, and I have verified that all other tests work as expected and look good to me, so I think this should be good to merge.
@MitchellAugustin thank you for reviewing this!
PR Overview
This PR adds OCI example tests for OFED userspace tools and iperf3 while refactoring RDMA test logic.
- Introduces a new helper function (ensure_second_vnics_ready) to verify secondary VNICs are present and refactors the RDMA test fixture.
- Adds tests to validate the installation and basic output checks for mst, mlxconfig, mlxfwmanager, flint, mlxfwreset, and iperf3 performance.
Reviewed Changes
File | Description |
---|---|
examples/oracle/oracle-example-cluster-test.py | Refactored RDMA tests; added new tests for OFED CLI tools and iperf3 |
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (1)
examples/oracle/oracle-example-cluster-test.py:129
- Typo detected in the skip message, 'beiing' should be corrected to 'being'.
pytest.skip("The image beiing used is not RDMA ready")
changes look great. thanks for working with me to improve some of the various docstrings and organization of the code. 💙
Sorry for delayed comments, finishing my review now!
@MitchellAugustin @a-dubs while validating the latest version of the tests I noticed that I cannot reach the same
I think it is OK to lower the accepted threshold here, since we aren't tailoring this test to reach line rate for any specific device. In general, though, to reach line rate via iperf3 at speeds above 40 Gbps, you may need to use multiple iperf3 streams/processes. This is something that we need to do when testing the DGXes at 100 Gbps+, since individual iperf3 threads aren't necessarily capable of rates that high.
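To illustrate the multi-process suggestion, here is a minimal sketch of how one server/client command pair per process could be generated, each on its own port. It is not part of this PR: `build_iperf3_commands` and its parameters are hypothetical names. The flags are the ones already seen in the log above (`-s -1`, `-c`, `-P`, `-Z`) plus iperf3's `-p` port option, and 5201 is the iperf3 default port.

```python
from typing import List, Tuple


def build_iperf3_commands(
    server_ip: str,
    processes: int = 4,
    streams_per_process: int = 10,
    base_port: int = 5201,
) -> List[Tuple[str, str]]:
    """Return (server_cmd, client_cmd) pairs, one per iperf3 process.

    Hypothetical helper: each pair uses a distinct port so that several
    one-off servers (-s -1) can run side by side on the same host.
    """
    commands = []
    for index in range(processes):
        port = base_port + index
        server_cmd = f"iperf3 -s -1 -p {port}"
        client_cmd = (
            f"iperf3 -c {server_ip} -p {port} -P {streams_per_process} -Z"
        )
        commands.append((server_cmd, client_cmd))
    return commands
```

The pairs would need to be started concurrently, and the aggregate throughput is then the sum of the per-process receiver rates rather than a single `[SUM]` line.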
@MitchellAugustin I am currently using the
ah sorry, somehow I missed that
Add test cases to the Oracle cluster example, to validate the presence of Nvidia firmware CLI tools and to confirm that the throughput measured by iperf3 is acceptable.
lgtm
Description
In the context of testing OFED on OCI instances, I prepared a list of CLI tools that must be installed by the package.
It was suggested that I add tests to examples/oracle/oracle-example-cluster-test.py to validate that these commands are installed. Note that these tests simply check that no error is returned and that stdout includes a few expected keywords.
The CLI tools covered by the tests are:
mst
mlxconfig
mlxfwmanager
flint
mlxfwreset
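The kind of check described above (no error returned, expected keywords present in stdout) can be sketched as follows. This is an illustration rather than the code added by the PR: `missing_keywords` and `assert_tool_ok` are hypothetical helpers, and the caller is assumed to supply the return code and stdout from whatever mechanism runs the command on the instance.

```python
from typing import Iterable, List


def missing_keywords(stdout: str, keywords: Iterable[str]) -> List[str]:
    """Return the keywords that do not appear in stdout (case-insensitive)."""
    lowered = stdout.lower()
    return [word for word in keywords if word.lower() not in lowered]


def assert_tool_ok(
    tool: str, return_code: int, stdout: str, keywords: Iterable[str]
) -> None:
    """Fail if the tool returned an error or its output lacks a keyword.

    Hypothetical helper: 'tool' is only used to build readable messages.
    """
    assert return_code == 0, f"{tool} exited with code {return_code}"
    missing = missing_keywords(stdout, keywords)
    assert not missing, f"{tool} output is missing keywords: {missing}"
```

Keeping the keyword comparison case-insensitive makes the check tolerant of small formatting differences between tool versions, at the cost of being a fairly loose smoke test.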
I also added a test to check the throughput measured by iperf3. The expected throughput leaves some room below the 50 Gbps supported by the physical interface.
Additional Context and Relevant Issues
The original manual testing of the OFED packages comes from ATLA-29
Test Steps
I manually executed the newly added tests on running instances where OFED and iperf3 were already installed. All tests pass.