# AthenaPK scaling instructions
- Assumes a Power9 node with 4x NVIDIA V100 GPUs
- Recommended environment: Spectrum MPI and the GCC host compiler (see the module sketch below)
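
The exact environment setup is system specific; the following is a minimal sketch of loading the recommended toolchain via environment modules, where all module names are assumptions to adapt to your machine:

```bash
# Hypothetical module setup for a Power9 + V100 system; the module names
# are assumptions -- check `module avail` on your machine.
module load gcc           # GCC host compiler
module load spectrum-mpi  # Spectrum MPI (CUDA-aware)
module load cuda          # CUDA toolkit for the V100 GPUs
module load cmake         # CMake for configuring the build
```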
## Build instructions
```bash
# get source
git clone https://gitlab.com/theias/hpc/jmstone/athena-parthenon/athenapk.git athenaPK
cd athenaPK
# change to branch for scaling test
git checkout pgrete/pack-in-one
# get submodules (mainly Kokkos and Parthenon)
git submodule init
git submodule update
# Configure and build, reusing the Summit machine file (same architecture)
mkdir build-cuda-mpi && cd build-cuda-mpi
cmake -DMACHINE_CFG=$(pwd)/../external/parthenon/cmake/machinecfg/Summit.cmake ..
make -j8 athenaPK
```
Alternatively, configure and build from the repository root using the RZAnsel toolchain file:
```bash
cmake -S. -B build -DCMAKE_TOOLCHAIN_FILE=$(pwd)/external/parthenon/cmake/machinecfg/RZAnsel.cmake
cmake --build build
```
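
Before running the scaling tests it may be worth a quick single-rank smoke test. The sketch below reuses the `build-cuda-mpi` directory and the advection input from the scaling section, on a small mesh, just to confirm the binary runs on a GPU:

```bash
# Single-rank smoke test; paths assume the build-cuda-mpi recipe above.
cd build-cuda-mpi
./src/athenaPK -i ../inputs/advection_3d.in \
  parthenon/mesh/nx1=64 parthenon/mesh/nx2=64 parthenon/mesh/nx3=64 \
  parthenon/meshblock/nx1=64 parthenon/meshblock/nx2=64 parthenon/meshblock/nx3=64 \
  parthenon/time/nlim=10 parthenon/mesh/refinement=none
```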
## Scaling instructions
### Static, uniform mesh scaling
- For static meshes we'll use a workload of 256^3 cells per GPU
- Adjust the launch command as needed, e.g., use the `-M "-gpu"` parameter of `jsrun` instead of the `MY_SPECTRUM_OPTIONS` environment variable (see the sketch after this list)
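
For reference, a `jsrun`-based launch of the 4-GPU case might look like the sketch below; the resource-set layout (`-n 4 -a 1 -g 1 -c 1`) is an assumption to adapt to the site's recommendations, with `-M "-gpu"` enabling CUDA-aware MPI in place of `MY_SPECTRUM_OPTIONS`:

```bash
# Hypothetical jsrun launch: 4 resource sets with 1 rank + 1 GPU each;
# "-M -gpu" is passed through to Spectrum MPI for CUDA-aware MPI.
jsrun -n 4 -a 1 -g 1 -c 1 -M "-gpu" ./src/athenaPK -i ../inputs/advection_3d.in \
  parthenon/meshblock/nx1=256 parthenon/meshblock/nx2=256 parthenon/meshblock/nx3=256 \
  parthenon/mesh/nx1=512 parthenon/mesh/nx2=512 parthenon/mesh/nx3=256 \
  parthenon/time/nlim=10 parthenon/mesh/refinement=none
```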
The runs below launch through Spectrum MPI's `mpirun`, which honors the `MY_SPECTRUM_OPTIONS` environment variable:

```bash
# enable CUDA-aware MPI
export MY_SPECTRUM_OPTIONS="--gpu"
# make Kokkos assign the node's GPUs round-robin to the local ranks
export KOKKOS_NUM_DEVICES=4
cd build-cuda-mpi
# meshblock size and mesh dimensions
export MB=256
export MX=256
export MY=256
export MZ=256
# 1 GPU
mpirun -n 1 ./src/athenaPK -i ../inputs/advection_3d.in parthenon/meshblock/nx1=$MB parthenon/meshblock/nx2=$MB parthenon/meshblock/nx3=$MB parthenon/time/nlim=10 parthenon/mesh/nx1=$MX parthenon/mesh/nx2=$MY parthenon/mesh/nx3=$MZ parthenon/mesh/refinement=none
# should be about 2.2e8 zone-cycles/wsec_step
# 2 GPUs
export MX=512
mpirun -n 2 ./src/athenaPK -i ../inputs/advection_3d.in parthenon/meshblock/nx1=$MB parthenon/meshblock/nx2=$MB parthenon/meshblock/nx3=$MB parthenon/time/nlim=10 parthenon/mesh/nx1=$MX parthenon/mesh/nx2=$MY parthenon/mesh/nx3=$MZ parthenon/mesh/refinement=none
# should be about 4.4e8 zone-cycles/wsec_step
# 4 GPUs
export MY=512
mpirun -n 4 ./src/athenaPK -i ../inputs/advection_3d.in parthenon/meshblock/nx1=$MB parthenon/meshblock/nx2=$MB parthenon/meshblock/nx3=$MB parthenon/time/nlim=10 parthenon/mesh/nx1=$MX parthenon/mesh/nx2=$MY parthenon/mesh/nx3=$MZ parthenon/mesh/refinement=none
# should be about 8.6e8 zone-cycles/wsec_step
# Test with overdecomposition (8 meshblocks per GPU)
export MB=128
mpirun -n 4 ./src/athenaPK -i ../inputs/advection_3d.in parthenon/meshblock/nx1=$MB parthenon/meshblock/nx2=$MB parthenon/meshblock/nx3=$MB parthenon/time/nlim=10 parthenon/mesh/nx1=$MX parthenon/mesh/nx2=$MY parthenon/mesh/nx3=$MZ parthenon/mesh/refinement=none
# should be about 9.5e8 zone-cycles/wsec_step
# And much more overdecomposition (512 meshblocks per GPU)
export MB=32
mpirun -n 4 ./src/athenaPK -i ../inputs/advection_3d.in parthenon/meshblock/nx1=$MB parthenon/meshblock/nx2=$MB parthenon/meshblock/nx3=$MB parthenon/time/nlim=10 parthenon/mesh/nx1=$MX parthenon/mesh/nx2=$MY parthenon/mesh/nx3=$MZ parthenon/mesh/refinement=none
# should be about 2.2e8 zone-cycles/wsec_step
# And now with process<->GPU overdecomposition (requires MPS, see the sketch
# below): 32 ranks on a single host sharing 4 GPUs
mpirun -n 32 ./src/athenaPK -i ../inputs/advection_3d.in parthenon/meshblock/nx1=$MB parthenon/meshblock/nx2=$MB parthenon/meshblock/nx3=$MB parthenon/time/nlim=10 parthenon/mesh/nx1=$MX parthenon/mesh/nx2=$MY parthenon/mesh/nx3=$MZ parthenon/mesh/refinement=none
# should be about 3.2e8 zone-cycles/wsec_step
```
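
The last run oversubscribes each GPU with 8 MPI ranks, which requires the CUDA Multi-Process Service (MPS) so the ranks can share a GPU efficiently. A minimal sketch for starting MPS on the node follows; the pipe/log directories are arbitrary choices, and batch systems often provide their own MPS hooks instead:

```bash
# Start the CUDA MPS control daemon before launching the oversubscribed run;
# the directories below are arbitrary choices.
export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps-pipe
export CUDA_MPS_LOG_DIRECTORY=/tmp/mps-log
mkdir -p $CUDA_MPS_PIPE_DIRECTORY $CUDA_MPS_LOG_DIRECTORY
nvidia-cuda-mps-control -d
# ... run the 32-rank test ...
# shut MPS down afterwards
echo quit | nvidia-cuda-mps-control
```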
### To be continued...