Adds bandwidth.yml playbook for NVIDIA nvbandwidth #834
bandwidth.yml is run via cudatests.yml
ansible/adhoc/cudatests.yml
Outdated
| gather_facts: true | ||
| tags: cuda_samples | ||
| tasks: | ||
| - ansible.builtin.import_role: |
Given we don't even run devicequery, I think we should just remove this task entirely TBH. But leave the role pending thinking more!
ansible/roles/cuda/defaults/main.yml
Outdated
| cuda_persistenced_state: started | ||
| # variables for nvbandwidth (for bandwidth.yml tasks run in cudatests.yml) | ||
| cuda_bandwidth_path: "/var/lib/{{ ansible_user }}/cuda_bandwidth" | ||
| cuda_bandwidth_release_url: "https://github.com/NVIDIA/nvbandwidth/archive/refs/tags/v0.8.tar.gz" |
I'd rather break the version out here and then use that var in the creates: on the "Download CUDA bandwith test release" task.
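A minimal sketch of what that could look like. The variable name cuda_bandwidth_version is hypothetical, and the download task's module and options are assumptions, since the task body is not shown in this thread; only the task name, the two existing variables and the nvbandwidth-0.8 directory name come from the diff.

```yaml
# roles/cuda/defaults/main.yml (sketch)
# cuda_bandwidth_version is a hypothetical variable name
cuda_bandwidth_version: "0.8"
cuda_bandwidth_path: "/var/lib/{{ ansible_user }}/cuda_bandwidth"
cuda_bandwidth_release_url: "https://github.com/NVIDIA/nvbandwidth/archive/refs/tags/v{{ cuda_bandwidth_version }}.tar.gz"
---
# bandwidth.yml tasks (sketch)
# Assumption: the release is fetched and unpacked with unarchive
- name: Download CUDA bandwith test release
  ansible.builtin.unarchive:
    src: "{{ cuda_bandwidth_release_url }}"
    dest: "{{ cuda_bandwidth_path }}"
    remote_src: true
    # Skip the download when this version is already unpacked
    creates: "{{ cuda_bandwidth_path }}/nvbandwidth-{{ cuda_bandwidth_version }}"
```

With the version in one place, the chdir on the build task could use the same variable instead of the hardcoded nvbandwidth-0.8.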
| - name: Build CUDA bandwidth test | ||
| ansible.builtin.shell: | ||
| cmd: source /cvmfs/software.eessi.io/versions/2023.06/init/bash && module load Boost/1.82.0-GCC-12.3.0 && . /etc/profile.d/sh.local && cmake .. && make -j {{ ansible_processor_vcpus }} |
You can do this more readably using one of the many multiline YAML options, e.g.:
| cmd: source /cvmfs/software.eessi.io/versions/2023.06/init/bash && module load Boost/1.82.0-GCC-12.3.0 && . /etc/profile.d/sh.local && cmake .. && make -j {{ ansible_processor_vcpus }} | |
| cmd: >- | |
|   source /cvmfs/software.eessi.io/versions/2023.06/init/bash && | |
|   module load Boost/1.82.0-GCC-12.3.0 && | |
|   . /etc/profile.d/sh.local && | |
|   cmake .. && | |
|   make -j {{ ansible_processor_vcpus }} |
| chdir: "{{ cuda_bandwidth_path }}/nvbandwidth-0.8/build/" | ||
| register: cuda_bandwidth_output | ||
| - name: Save CUDA bandwidth output to bandwidth_results.txt |
Is there no useful summary we can do here? So someone not familiar with the system can get a quick idea of "it works" or "it doesn't"?
You can run a single test case with ./nvbandwidth -t <testcase name>, e.g. ./nvbandwidth -t device_to_device_memcpy_read_ce, whereas running ./nvbandwidth with no arguments runs all the test cases. The single test case runs a lot quicker and gives much shorter output than the full run. Would it be useful to put a task before "Run CUDA bandwidth test" that runs only device_to_device_memcpy_read_ce and prints the result to the console, to show the user that it works?
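Roughly what I have in mind, as a sketch only. The task names and the cuda_bandwidth_quick variable are hypothetical, and it assumes the nvbandwidth binary ends up in the build directory used by the build task; the same environment setup the build needs (EESSI init, module load) may also be required at run time.

```yaml
# Sketch: quick smoke test before the full nvbandwidth run
- name: Run a single quick nvbandwidth test case
  ansible.builtin.command:
    cmd: ./nvbandwidth -t device_to_device_memcpy_read_ce
    # Assumption: binary is built into this directory
    chdir: "{{ cuda_bandwidth_path }}/nvbandwidth-0.8/build/"
  register: cuda_bandwidth_quick  # hypothetical variable name
  changed_when: false  # running the benchmark changes nothing on the host

- name: Show quick test output on the console
  ansible.builtin.debug:
    var: cuda_bandwidth_quick.stdout_lines
```

A non-zero exit from the command fails the task, so a clean run of these two tasks would already be a basic "it works" signal.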
| - name: Save CUDA bandwidth output to bandwidth_results.txt | ||
| ansible.builtin.copy: | ||
| content: "{{ cuda_bandwidth_output.stdout }}" | ||
| dest: "{{ appliances_environment_root }}/cudatests/bandwidth_results.txt" |
When cuda group contains multiple nodes they will all write to the same file.
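One possible fix, sketched under the assumption that a per-host file name is acceptable: include inventory_hostname in the destination so each node writes its own file. The file naming scheme here is only an illustration.

```yaml
# Sketch: one results file per host in the cuda group
- name: Save CUDA bandwidth output to a per-host results file
  ansible.builtin.copy:
    content: "{{ cuda_bandwidth_output.stdout }}"
    # inventory_hostname makes the path unique for each node
    dest: "{{ appliances_environment_root }}/cudatests/{{ inventory_hostname }}_bandwidth_results.txt"
```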
removes samples.yml tasks from adhoc/cudatests.yml
Modifies ansible/adhoc/cudatests.yml to run the NVIDIA nvbandwidth test. This replaces the older bandwidthTest CUDA Samples utility removed in #687.