@claudia-lola
Contributor

@claudia-lola claudia-lola commented Oct 22, 2025

Modifies ansible/adhoc/cudatests.yml to run the NVIDIA nvbandwidth test. This replaces the older bandwidthTest CUDA Samples utility removed in #687.

@claudia-lola claudia-lola requested a review from a team as a code owner October 22, 2025 14:59
@claudia-lola claudia-lola self-assigned this Oct 22, 2025
@sjpb sjpb changed the title Adds bandwidth.yml playbook to download, build, and run nvbandwidth. Adds bandwidth.yml playbook for NVIDIA nvbandwidth Oct 22, 2025
gather_facts: true
tags: cuda_samples
tasks:
- ansible.builtin.import_role:
Collaborator

Given we don't even run devicequery, I think we should just remove this task entirely TBH. But leave the role pending thinking more!

cuda_persistenced_state: started
# variables for nvbandwidth (for bandwidth.yml tasks run in cudatests.yml)
cuda_bandwidth_path: "/var/lib/{{ ansible_user }}/cuda_bandwidth"
cuda_bandwidth_release_url: "https://github.com/NVIDIA/nvbandwidth/archive/refs/tags/v0.8.tar.gz"
Collaborator

I'd rather break the version out here and then use that var in the creates: on the "Download CUDA bandwith test release" task.
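
A minimal sketch of that suggestion. It assumes the download task uses ansible.builtin.unarchive (the task body is not shown in this thread) and introduces a hypothetical variable name, cuda_bandwidth_version. The task name is copied as quoted above.

```yaml
# Role variables: the version is broken out into its own variable.
# cuda_bandwidth_version is a hypothetical name, not taken from the PR.
cuda_bandwidth_version: "0.8"
cuda_bandwidth_path: "/var/lib/{{ ansible_user }}/cuda_bandwidth"
cuda_bandwidth_release_url: "https://github.com/NVIDIA/nvbandwidth/archive/refs/tags/v{{ cuda_bandwidth_version }}.tar.gz"
---
# Tasks: assumed shape of the download task, with creates: keyed on the version
# so a version bump triggers a fresh download and extract.
- name: Download CUDA bandwith test release
  ansible.builtin.unarchive:
    src: "{{ cuda_bandwidth_release_url }}"
    dest: "{{ cuda_bandwidth_path }}"
    remote_src: true
    creates: "{{ cuda_bandwidth_path }}/nvbandwidth-{{ cuda_bandwidth_version }}"
```

The same variable could then replace the hard-coded nvbandwidth-0.8 in the build task's chdir.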


- name: Build CUDA bandwidth test
  ansible.builtin.shell:
    cmd: source /cvmfs/software.eessi.io/versions/2023.06/init/bash && module load Boost/1.82.0-GCC-12.3.0 && . /etc/profile.d/sh.local && cmake .. && make -j {{ ansible_processor_vcpus }}
Collaborator

You can do this more readably using one of the many multiline yaml options, e.g.:

Suggested change
cmd: source /cvmfs/software.eessi.io/versions/2023.06/init/bash && module load Boost/1.82.0-GCC-12.3.0 && . /etc/profile.d/sh.local && cmake .. && make -j {{ ansible_processor_vcpus }}
cmd: >-
  source /cvmfs/software.eessi.io/versions/2023.06/init/bash &&
  module load Boost/1.82.0-GCC-12.3.0 &&
  . /etc/profile.d/sh.local &&
  cmake .. &&
  make -j {{ ansible_processor_vcpus }}

    chdir: "{{ cuda_bandwidth_path }}/nvbandwidth-0.8/build/"
  register: cuda_bandwidth_output

- name: Save CUDA bandwidth output to bandwidth_results.txt
Collaborator

@sjpb sjpb Oct 24, 2025

Is there no useful summary we can do here? So someone not familiar with the system can get a quick idea of "it works" or "it doesn't"?

Contributor Author

@claudia-lola claudia-lola Oct 29, 2025

You can run an individual test case with ./nvbandwidth -t <testcase name>, e.g. ./nvbandwidth -t device_to_device_memcpy_read_ce, whereas running ./nvbandwidth on its own runs all the test cases. The example I gave here runs a lot quicker and gives a much shorter output than running all the test cases. Would it be useful to put a task before name: Run CUDA bandwidth test which just runs the device_to_device_memcpy_read_ce test case and prints the result to the console, to show the user that it works?
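
A rough sketch of what such a quick-check task might look like. It assumes the nvbandwidth binary ends up in the build directory, that it needs the same environment setup as the build step, and that bash is available at /bin/bash; the registered variable name cuda_bandwidth_quick is hypothetical.

```yaml
# Quick smoke test: run one short test case rather than the full suite.
- name: Run a single nvbandwidth test case as a quick check
  ansible.builtin.shell:
    cmd: >-
      source /cvmfs/software.eessi.io/versions/2023.06/init/bash &&
      module load Boost/1.82.0-GCC-12.3.0 &&
      . /etc/profile.d/sh.local &&
      ./nvbandwidth -t device_to_device_memcpy_read_ce
    chdir: "{{ cuda_bandwidth_path }}/nvbandwidth-0.8/build/"  # assumed binary location
    executable: /bin/bash  # 'source' is a bash builtin
  register: cuda_bandwidth_quick  # hypothetical variable name
  changed_when: false  # the test only reads hardware state

# Print the short output to the console so the user can see that it works.
- name: Show quick check output
  ansible.builtin.debug:
    var: cuda_bandwidth_quick.stdout_lines
```

If the command exits non-zero, the shell task fails, which already gives a basic "it works" or "it doesn't" signal before the full run.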

- name: Save CUDA bandwidth output to bandwidth_results.txt
  ansible.builtin.copy:
    content: "{{ cuda_bandwidth_output.stdout }}"
    dest: "{{ appliances_environment_root }}/cudatests/bandwidth_results.txt"
Collaborator

When the cuda group contains multiple nodes, they will all write to the same file.
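
One possible way to avoid the collision, sketched here as an assumption rather than the fix adopted in the PR, is to include the host name in the destination so each node gets its own file. The file naming scheme is hypothetical.

```yaml
# Per-host results file: inventory_hostname makes the destination unique per node.
- name: Save CUDA bandwidth output to a per-host results file
  ansible.builtin.copy:
    content: "{{ cuda_bandwidth_output.stdout }}"
    dest: "{{ appliances_environment_root }}/cudatests/bandwidth_results_{{ inventory_hostname }}.txt"
```

inventory_hostname is a standard Ansible magic variable, so no extra fact gathering is needed for it.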

@claudia-lola claudia-lola force-pushed the bandwidth-test branch 2 times, most recently from 09ef693 to abb3185 Compare October 28, 2025 15:40
@claudia-lola claudia-lola requested a review from sjpb October 28, 2025 16:51
removes samples.yml tasks from adhoc/cudatest.yml