
# AthenaPK scaling instructions


## Prerequisites

- Assumes a Power9 node with 4x V100 GPUs
- Recommended environment: Spectrum MPI and the GCC host compiler (a minimal module setup is sketched below)
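
The exact module set is system-specific; the following is only a minimal sketch of an environment setup on a Power9/V100 machine, assuming generic module names (`gcc`, `spectrum-mpi`, `cuda`, `cmake`) that will likely need site-specific versions:

```bash
# illustrative only: module names and versions vary between sites
module load gcc spectrum-mpi cuda cmake
```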
```bash
# get source
git clone https://gitlab.com/theias/hpc/jmstone/athena-parthenon/athenapk.git athenaPK
cd athenaPK

# change to branch for scaling test
git checkout pgrete/pack-in-one

# get submodules (mainly Kokkos and Parthenon)
git submodule init
git submodule update

# Configure and build. Reusing Summit machine file (same architecture)
mkdir build-cuda-mpi && cd build-cuda-mpi
cmake -DMACHINE_CFG=$(pwd)/../external/parthenon/cmake/machinecfg/Summit.cmake ..
make -j8 athenaPK
```

## Building on RZAnsel

```bash
# get source
git clone https://gitlab.com/theias/hpc/jmstone/athena-parthenon/athenapk.git athenaPK
cd athenaPK

# change to branch for scaling test
git checkout pgrete/pack-in-one

# get submodules (mainly Kokkos and Parthenon)
git submodule init
git submodule update

# Configure and build using the RZAnsel machine file
cmake -S. -B build -DCMAKE_TOOLCHAIN_FILE=$(pwd)/external/parthenon/cmake/machinecfg/RZAnsel.cmake
cmake --build build
```

## Scaling instructions

### Static, uniform mesh scaling

- For static meshes we'll use a workload of 256^3 cells per GPU
- Adjust the launch command as needed (e.g., use the `-M "-gpu"` parameter of `jsrun` instead of the `MY_SPECTRUM_OPTIONS` environment variable); a sketch of such a `jsrun` launch follows this list
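
For reference, a `jsrun`-based launch of the single-GPU case (as on Summit or RZAnsel) might look like the sketch below; the resource-set flags (`-n`, `-a`, `-g`, `-c`) are illustrative assumptions and should be adapted to the local configuration:

```bash
# sketch: one resource set with 1 task, 1 GPU, and GPU-aware Spectrum MPI enabled via -M "-gpu"
jsrun -n 1 -a 1 -g 1 -c 1 -M "-gpu" ./src/athenaPK -i ../inputs/advection_3d.in \
  parthenon/meshblock/nx1=256 parthenon/meshblock/nx2=256 parthenon/meshblock/nx3=256 \
  parthenon/time/nlim=10 \
  parthenon/mesh/nx1=256 parthenon/mesh/nx2=256 parthenon/mesh/nx3=256 \
  parthenon/mesh/refinement=none
```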
```bash
# enable Cuda aware MPI
export MY_SPECTRUM_OPTIONS="--gpu"
# make Kokkos pick GPUs round robin
export KOKKOS_NUM_DEVICES=4

cd build-cuda-mpi

# meshblock size (MB) and total mesh dimensions (MX, MY, MZ)
export MB=256
export MX=256
export MY=256
export MZ=256


ibrun -n 1 ./src/athenaPK -i ../inputs/advection_3d.in parthenon/meshblock/nx1=$MB parthenon/meshblock/nx2=$MB parthenon/meshblock/nx3=$MB parthenon/time/nlim=10 parthenon/mesh/nx1=$MX parthenon/mesh/nx2=$MY parthenon/mesh/nx3=$MZ parthenon/mesh/refinement=none
# should be about 2.2e8 zone-cycles/wsec_step

export MX=512
ibrun -n 2 ./src/athenaPK -i ../inputs/advection_3d.in parthenon/meshblock/nx1=$MB parthenon/meshblock/nx2=$MB parthenon/meshblock/nx3=$MB parthenon/time/nlim=10 parthenon/mesh/nx1=$MX parthenon/mesh/nx2=$MY parthenon/mesh/nx3=$MZ parthenon/mesh/refinement=none
# should be about 4.4e8 zone-cycles/wsec_step

export MY=512
ibrun -n 4 ./src/athenaPK -i ../inputs/advection_3d.in parthenon/meshblock/nx1=$MB parthenon/meshblock/nx2=$MB parthenon/meshblock/nx3=$MB parthenon/time/nlim=10 parthenon/mesh/nx1=$MX parthenon/mesh/nx2=$MY parthenon/mesh/nx3=$MZ parthenon/mesh/refinement=none
# should be about 8.6e8 zone-cycles/wsec_step

# Test with overdecomposition
export MB=128
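# 512x512x256 mesh split into 128^3 blocks -> 4*4*2 = 32 blocks, i.e., 8 blocks per rank/GPU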
ibrun -n 4 ./src/athenaPK -i ../inputs/advection_3d.in parthenon/meshblock/nx1=$MB parthenon/meshblock/nx2=$MB parthenon/meshblock/nx3=$MB parthenon/time/nlim=10 parthenon/mesh/nx1=$MX parthenon/mesh/nx2=$MY parthenon/mesh/nx3=$MZ parthenon/mesh/refinement=none
# should be about 9.5e8 zone-cycles/wsec_step

# And much more overdecomposition
export MB=32
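# 512x512x256 mesh split into 32^3 blocks -> 16*16*8 = 2048 blocks, i.e., 512 blocks per rank/GPU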
ibrun -n 4 ./src/athenaPK -i ../inputs/advection_3d.in parthenon/meshblock/nx1=$MB parthenon/meshblock/nx2=$MB parthenon/meshblock/nx3=$MB parthenon/time/nlim=10 parthenon/mesh/nx1=$MX parthenon/mesh/nx2=$MY parthenon/mesh/nx3=$MZ parthenon/mesh/refinement=none
# should be about 2.2e8 zone-cycles/wsec_step

# And now with process<->GPU overdecomposition (requires MPS): using 32 ranks on a single host with 4 GPUs
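# 2048 blocks over 32 ranks = 64 blocks per rank; 8 ranks share each GPU, hence the need for MPS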
ibrun -n 32 ./src/athenaPK -i ../inputs/advection_3d.in parthenon/meshblock/nx1=$MB parthenon/meshblock/nx2=$MB parthenon/meshblock/nx3=$MB parthenon/time/nlim=10 parthenon/mesh/nx1=$MX parthenon/mesh/nx2=$MY parthenon/mesh/nx3=$MZ parthenon/mesh/refinement=none
# should be about 3.2e8 zone-cycles/wsec_step
```
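
The throughput figures can be read as cells updated per wall-clock second of a step: the single-GPU run updates 256^3 ≈ 1.7e7 cells, so ~2.2e8 zone-cycles/wsec_step corresponds to roughly 0.08 s per cycle. Under ideal weak scaling the per-GPU rate stays constant, which is why the aggregate figure approximately doubles with each doubling of the mesh.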

### To be continued...