Commit 8a24386

Frontier module update (#842)
1 parent 59d9431 commit 8a24386

11 files changed: +69 -28 lines changed

.github/pull_request_template.md

Lines changed: 1 addition & 1 deletion
@@ -54,5 +54,5 @@ To make sure the code is performing as expected on GPU devices, I have:
 - [ ] Ran the code on MI200+ GPUs and ensure the new features performed as expected (the GPU results match the CPU results)
 - [ ] Enclosed the new feature via `nvtx` ranges so that they can be identified in profiles
 - [ ] Ran a Nsight Systems profile using `./mfc.sh run XXXX --gpu -t simulation --nsys`, and have attached the output file (`.nsys-rep`) and plain text results to this PR
-- [ ] Ran an Omniperf profile using `./mfc.sh run XXXX --gpu -t simulation --omniperf`, and have attached the output file and plain text results to this PR.
+- [ ] Ran a Rocprof Systems profile using `./mfc.sh run XXXX --gpu -t simulation --rsys --hip-trace`, and have attached the output file and plain text results to this PR.
 - [ ] Ran my code using various numbers of different GPUs (1, 2, and 8, for example) in parallel and made sure that the results scale similarly to what happens if you run without the new code/feature

.github/workflows/frontier/build.sh

Lines changed: 6 additions & 1 deletion
@@ -1,4 +1,9 @@
 #!/bin/bash

+build_opts=""
+if [ "$1" == "gpu" ]; then
+    build_opts="--gpu"
+fi
+
 . ./mfc.sh load -c f -m g
-./mfc.sh test --dry-run -j 8 --gpu
+./mfc.sh test --dry-run -j 8 $build_opts

.github/workflows/frontier/submit.sh

Lines changed: 14 additions & 1 deletion
@@ -13,16 +13,29 @@ else
     exit 1
 fi

+if [ "$2" == "cpu" ]; then
+    sbatch_device_opts="\
+#SBATCH -n 32 # Number of cores required"
+elif [ "$2" == "gpu" ]; then
+    sbatch_device_opts="\
+#SBATCH -n 8 # Number of cores required"
+else
+    usage
+    exit 1
+fi
+
+
 job_slug="`basename "$1" | sed 's/\.sh$//' | sed 's/[^a-zA-Z0-9]/-/g'`-$2"

 sbatch <<EOT
 #!/bin/bash
 #SBATCH -JMFC-$job_slug # Job name
 #SBATCH -A CFD154 # charge account
 #SBATCH -N 1 # Number of nodes required
-#SBATCH -n 8 # Number of cores required
+$sbatch_device_opts
 #SBATCH -t 01:59:00 # Duration of the job (Ex: 15 mins)
 #SBATCH -o$job_slug.out # Combined output and error messages file
+#SBATCH -p extended # Extended partition for shorter queues
 #SBATCH -q debug # Use debug QOS - only one job per user allowed in queue!
 #SBATCH -W # Do not exit until the submitted job terminates.
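The device handling added to `submit.sh` can be sketched in Python as follows. This is only an illustration of the logic visible in the hunk above, not part of the commit: the function names `job_slug` and `sbatch_device_opts` and the `TASKS_PER_DEVICE` table are hypothetical, and only the task counts (32 for `cpu`, 8 for `gpu`) and the slug rules come from the diff.

```python
import re
from pathlib import PurePosixPath

# Task counts per device, as set by the new `#SBATCH -n` lines in the diff.
TASKS_PER_DEVICE = {"cpu": 32, "gpu": 8}


def job_slug(script_path: str, device: str) -> str:
    # Mirrors: basename "$1" | sed 's/\.sh$//' | sed 's/[^a-zA-Z0-9]/-/g', then "-$2".
    name = PurePosixPath(script_path).name
    name = re.sub(r"\.sh$", "", name)           # drop a trailing ".sh"
    name = re.sub(r"[^a-zA-Z0-9]", "-", name)   # replace every other character with "-"
    return f"{name}-{device}"


def sbatch_device_opts(device: str) -> str:
    # The shell script prints its usage text and exits with status 1 here;
    # this sketch raises instead.
    if device not in TASKS_PER_DEVICE:
        raise ValueError(f"unknown device: {device!r}")
    return f"#SBATCH -n {TASKS_PER_DEVICE[device]}"


if __name__ == "__main__":
    print(job_slug(".github/workflows/frontier/test.sh", "gpu"))
    print(sbatch_device_opts("cpu"))
```

Appending the device to the slug keeps the job name and the `.out` file distinct when the CPU and GPU jobs are submitted for the same script.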

.github/workflows/frontier/test.sh

Lines changed: 5 additions & 2 deletions
@@ -3,5 +3,8 @@
 gpus=`rocm-smi --showid | awk '{print $1}' | grep -Eo '[0-9]+' | uniq | tr '\n' ' '`
 ngpus=`echo "$gpus" | tr -d '[:space:]' | wc -c`

-./mfc.sh test --max-attempts 3 -j $ngpus -- -c frontier
-
+if [ "$job_device" == "gpu" ]; then
+    ./mfc.sh test --max-attempts 3 -j $ngpus -- -c frontier
+else
+    ./mfc.sh test --max-attempts 3 -j 32 -- -c frontier
+fi
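A rough Python rendering of how `test.sh` picks its `-j` value is given below, under stated assumptions. The helper names `gpu_ids` and `parallel_jobs` are hypothetical, and the `sample` text is made up for illustration; it is not real `rocm-smi --showid` output. Note one difference: the shell script counts characters with `wc -c`, so it matches this sketch only while every GPU ID is a single digit.

```python
import re
from itertools import groupby
from typing import List

CPU_TEST_JOBS = 32  # fixed -j value used on the CPU branch in the diff


def gpu_ids(showid_output: str) -> List[str]:
    # awk '{print $1}': first whitespace-separated field of each non-blank line
    first_fields = [line.split()[0] for line in showid_output.splitlines() if line.strip()]
    # grep -Eo '[0-9]+': every run of digits found in those fields
    digit_runs = [run for field in first_fields for run in re.findall(r"[0-9]+", field)]
    # uniq: collapse adjacent duplicates only
    return [key for key, _ in groupby(digit_runs)]


def parallel_jobs(device: str, ids: List[str]) -> int:
    # GPU runs use one test job per detected GPU; CPU runs use a fixed count.
    return len(ids) if device == "gpu" else CPU_TEST_JOBS


if __name__ == "__main__":
    # Hypothetical input, not captured from a real system.
    sample = "GPU[0] a\nGPU[0] b\nGPU[1] a\nGPU[1] b\n"
    ids = gpu_ids(sample)
    print(ids, parallel_jobs("gpu", ids), parallel_jobs("cpu", ids))
```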

.github/workflows/test.yml

Lines changed: 1 addition & 4 deletions
@@ -97,9 +97,6 @@ jobs:
       matrix:
         device: ['cpu', 'gpu']
         lbl: ['gt', 'frontier']
-        exclude:
-          - device: cpu
-            lbl: frontier
     runs-on:
       group: phoenix
       labels: ${{ matrix.lbl }}
@@ -116,7 +113,7 @@ jobs:

       - name: Build
         if: matrix.lbl == 'frontier'
-        run: bash .github/workflows/frontier/build.sh
+        run: bash .github/workflows/frontier/build.sh ${{ matrix.device }}

       - name: Test
         if: matrix.lbl == 'frontier'
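The effect of dropping the `exclude` entry can be illustrated with a simplified model of matrix expansion. The `expand_matrix` helper below is hypothetical and covers only the plain cross-product-minus-exclusions case; it is not GitHub Actions' actual implementation.

```python
from itertools import product
from typing import Dict, List, Sequence


def expand_matrix(matrix: Dict[str, List[str]],
                  exclude: Sequence[Dict[str, str]] = ()) -> List[Dict[str, str]]:
    keys = list(matrix)
    # Every combination of one value per key.
    combos = [dict(zip(keys, values)) for values in product(*(matrix[k] for k in keys))]
    # Drop a combination when it matches all key/value pairs of some exclude rule.
    return [
        combo for combo in combos
        if not any(all(combo.get(k) == v for k, v in rule.items()) for rule in exclude)
    ]


matrix = {"device": ["cpu", "gpu"], "lbl": ["gt", "frontier"]}
before = expand_matrix(matrix, exclude=[{"device": "cpu", "lbl": "frontier"}])
after = expand_matrix(matrix)

if __name__ == "__main__":
    print(len(before), len(after))
    print([combo for combo in after if combo not in before])
```

Under this model the only job that appears after the change is the `cpu`/`frontier` combination, which is the case the new `build.sh` argument and the CPU branches in `submit.sh` and `test.sh` handle.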

docs/documentation/running.md

Lines changed: 3 additions & 3 deletions
@@ -98,13 +98,13 @@ Learn more about NVIDIA Nsight Compute [here](https://docs.nvidia.com/nsight-com


 #### AMD GPUs
-- Rocprof (ROC): `./mfc.sh run ... -t simulation --roc --hip-trace [rocprof flags]` allows one to visualize MFC's system-wide performance with [Perfetto UI](https://ui.perfetto.dev/).
+- Rocprof Systems (RSYS): `./mfc.sh run ... -t simulation --rsys --hip-trace [rocprof flags]` allows one to visualize MFC's system-wide performance with [Perfetto UI](https://ui.perfetto.dev/).
 When used, `--roc` will run the simulation and generate files in the case directory for all targets.
 `results.json` can then be imported in [Perfetto's UI](https://ui.perfetto.dev/).
 Learn more about AMD Rocprof [here](https://rocm.docs.amd.com/projects/rocprofiler/en/docs-5.5.1/rocprof.html)
 It is best to run case files with few timesteps to keep the report file sizes manageable.
-- Omniperf (OMNI): `./mfc.sh run ... -t simulation --omni [omniperf flags]` allows one to conduct kernel-level profiling with [AMD's Omniperf](https://rocm.docs.amd.com/projects/omniperf/en/latest/index.html).
-When used, `--omni` will output profiling information for all subroutines, including rooflines, cache usage, register usage, and more, after the simulation is run.
+- Rocprof Compute (RCU): `./mfc.sh run ... -t simulation --rcu -n <name> [rocprof-compute flags]` allows one to conduct kernel-level profiling with [ROCm Compute Profiler](https://rocm.docs.amd.com/projects/rocprofiler-compute/en/latest/what-is-rocprof-compute.html).
+When used, `--rcu` will output profiling information for all subroutines, including rooflines, cache usage, register usage, and more, after the simulation is run.
 Adding this argument will moderately slow down the simulation and run the MFC executable several times.
 For this reason, it should only be used with case files with few timesteps.
110110

toolchain/mfc/args.py

Lines changed: 2 additions & 2 deletions
@@ -113,8 +113,8 @@ def add_common_arguments(p, mask = None):
     run.add_argument("--clean", action="store_true", default=False, help="Clean the case before running.")
     run.add_argument("--ncu", nargs=argparse.REMAINDER, type=str, help="Profile with NVIDIA Nsight Compute.")
     run.add_argument("--nsys", nargs=argparse.REMAINDER, type=str, help="Profile with NVIDIA Nsight Systems.")
-    run.add_argument("--omni", nargs=argparse.REMAINDER, type=str, help="Profile with ROCM omniperf.")
-    run.add_argument("--roc", nargs=argparse.REMAINDER, type=str, help="Profile with ROCM rocprof.")
+    run.add_argument("--rcu", nargs=argparse.REMAINDER, type=str, help="Profile with ROCM rocprof-compute.")
+    run.add_argument("--rsys", nargs=argparse.REMAINDER, type=str, help="Profile with ROCM rocprof-systems.")

     # BENCH
     add_common_arguments(bench)

toolchain/mfc/run/run.py

Lines changed: 7 additions & 7 deletions
@@ -45,17 +45,17 @@ def __profiler_prepend() -> typing.List[str]:

         return ["nsys", "profile", "--stats=true", "--trace=mpi,nvtx,openacc"] + ARG("nsys")

-    if ARG("omni") is not None:
-        if not does_command_exist("omniperf"):
-            raise MFCException("Failed to locate [bold red]ROCM Omniperf[/bold red] (omniperf).")
+    if ARG("rcu") is not None:
+        if not does_command_exist("rocprof-compute"):
+            raise MFCException("Failed to locate [bold red]ROCM rocprof-compute[/bold red] (rocprof-compute).")

-        return ["omniperf", "profile"] + ARG("omni") + ["--"]
+        return ["rocprof-compute", "profile", "-n", ARG("name").replace('-', '_').replace('.', '_')] + ARG("rcu") + ["--"]

-    if ARG("roc") is not None:
+    if ARG("rsys") is not None:
         if not does_command_exist("rocprof"):
-            raise MFCException("Failed to locate [bold red]ROCM rocprof[/bold red] (rocprof).")
+            raise MFCException("Failed to locate [bold red]ROCM rocprof-systems[/bold red] (rocprof-systems).")

-        return ["rocprof"] + ARG("rsys")
+        return ["rocprof"] + ARG("rsys")

     return []
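The profiler selection in this hunk can be exercised in isolation with a sketch like the one below. It is an approximation under stated assumptions: `profiler_prefix` and its `command_exists` parameter are hypothetical stand-ins for MFC's `__profiler_prepend`, `ARG`, and `does_command_exist`, and a plain `RuntimeError` replaces `MFCException`. The returned command lists follow the added lines of the diff.

```python
import shutil
from typing import Callable, Dict, List, Optional


def _on_path(command: str) -> bool:
    return shutil.which(command) is not None


def profiler_prefix(args: Dict[str, Optional[object]],
                    command_exists: Callable[[str], bool] = _on_path) -> List[str]:
    rcu = args.get("rcu")
    if rcu is not None:
        if not command_exists("rocprof-compute"):
            raise RuntimeError("Failed to locate rocprof-compute.")
        # The run name is sanitized exactly as in the diff: '-' and '.' become '_'.
        name = str(args["name"]).replace("-", "_").replace(".", "_")
        return ["rocprof-compute", "profile", "-n", name] + list(rcu) + ["--"]

    rsys = args.get("rsys")
    if rsys is not None:
        # As in the diff, the executable that is checked and launched is still `rocprof`.
        if not command_exists("rocprof"):
            raise RuntimeError("Failed to locate rocprof.")
        return ["rocprof"] + list(rsys)

    return []


if __name__ == "__main__":
    always = lambda command: True  # pretend both tools are installed
    print(profiler_prefix({"rcu": [], "name": "my-case.v1"}, always))
    print(profiler_prefix({"rsys": ["--hip-trace"]}, always))
```

Passing the existence check in as a parameter keeps the sketch testable on machines without ROCm installed.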

toolchain/modules

Lines changed: 3 additions & 3 deletions
@@ -48,9 +48,9 @@ p-gpu nvhpc/24.5 hpcx/2.19-cuda cuda/12.1.1
 p-gpu MFC_CUDA_CC=70,75,80,89,90 NVHPC_CUDA_HOME=$CUDA_HOME CC=nvc CXX=nvc++ FC=nvfortran

 f OLCF Frontier
-f-all cce/18.0.0 cpe/24.07 rocm/6.1.3 cray-mpich/8.1.28
-f-all cray-fftw cray-hdf5 cray-python omniperf
-f-gpu craype-accel-amd-gfx90a
+f-all cpe/25.03 rocm/6.3.1
+f-all cray-fftw cray-hdf5 cray-python
+f-gpu craype-accel-amd-gfx90a rocprofiler-compute/3.0.0

 d NCSA Delta
 d-all python/3.11.6

toolchain/pyproject.toml

Lines changed: 13 additions & 1 deletion
@@ -39,7 +39,19 @@ dependencies = [

     # Chemistry
     "cantera==3.1.0",
-    "pyrometheus == 1.0.3"
+    "pyrometheus == 1.0.3",
+
+    # Frontier Profiling
+    "astunparse==1.6.2",
+    "colorlover",
+    "dash>=1.12.0",
+    "pymongo",
+    "tabulate",
+    "tqdm",
+    "dash-svg",
+    "dash-bootstrap-components",
+    "kaleido",
+    "plotille"
 ]

 [tool.hatch.metadata]
