
TACC Open Hackathon 2024


Some notes for organizing our efforts

!!! Need at least three people for every day

Agenda (all times CST)

  • Tues Oct 8 10 AM – 11:30 AM online
    • Meet with mentor
  • Tues Oct 15 9 AM – 5 PM online
    • Cluster intro
    • Introductory team presentations
    • Work with mentor
  • Tues Oct 22 – Thurs Oct 24 9 AM – 5 PM hybrid
    • Work on code with mentor

Our Goals

Primary

Improve MPI scaling for Parthenon applications with many separately enrolled fields

  • Ideas

    • Use smaller fixed-capacity communication buffers that are greedily filled and sent repeatedly until all data is exchanged (see the first sketch after this list)
    • Use contiguous buffers large enough to accommodate all fields (not respecting sparsity)
    • Others?
  • Example problems: parthenon_vibe, advection, fine_advection

    • Modify the example to vary the number of separately enrolled fields at runtime (see the enrollment sketch below)
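
A minimal sketch of the first idea, assuming plain MPI and C++ (this is not Parthenon's actual buffer machinery, and SendFieldsStaged is a made-up name): fields are packed into a staging buffer of fixed capacity, and whenever the buffer fills up it is sent, repeating until every field has been shipped. A matching receiver would keep posting receives (sizing each chunk with MPI_Get_count) until all expected data has arrived.

#include <mpi.h>

#include <algorithm>
#include <cstddef>
#include <vector>

// Send all fields to `peer` through a staging buffer of at most `capacity` doubles.
void SendFieldsStaged(const std::vector<std::vector<double>> &fields,
                      std::size_t capacity, int peer, int tag, MPI_Comm comm) {
  std::vector<double> staging;
  staging.reserve(capacity);

  auto flush = [&]() {
    if (staging.empty()) return;
    MPI_Send(staging.data(), static_cast<int>(staging.size()), MPI_DOUBLE,
             peer, tag, comm);
    staging.clear();
  };

  for (const auto &field : fields) {
    std::size_t offset = 0;
    while (offset < field.size()) {
      // Greedily fill the remaining space in the staging buffer.
      const std::size_t room = capacity - staging.size();
      const std::size_t chunk = std::min(room, field.size() - offset);
      staging.insert(staging.end(), field.data() + offset,
                     field.data() + offset + chunk);
      offset += chunk;
      if (staging.size() == capacity) flush();  // buffer full: ship it
    }
  }
  flush();  // send whatever is left
}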

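For the example-problem modification, a sketch of what runtime-configurable enrollment could look like in a package's Initialize(), assuming Parthenon's StateDescriptor/Metadata API (the field names and exact flag set here are placeholders; num_vars and vec_size mirror the parameters in the sample input further down):

#include <memory>
#include <string>
#include <vector>

#include <parthenon/package.hpp>

using namespace parthenon::package::prelude;

std::shared_ptr<StateDescriptor> Initialize(ParameterInput *pin) {
  auto pkg = std::make_shared<StateDescriptor>("advection_package");

  // Read the number of separately enrolled fields (and their vector length)
  // from the <Advection> block of the input file.
  const int num_vars = pin->GetOrAddInteger("Advection", "num_vars", 1);
  const int vec_size = pin->GetOrAddInteger("Advection", "vec_size", 1);

  Metadata m({Metadata::Cell, Metadata::Independent, Metadata::FillGhost},
             std::vector<int>({vec_size}));

  // Enroll num_vars independent fields so boundary communication has to
  // handle many separate variables.
  for (int i = 0; i < num_vars; ++i) {
    pkg->AddField("advected_" + std::to_string(i), m);
  }
  return pkg;
}
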
Secondary

Improve buffer kernel performance for few (large) blocks

Sample input (using plain advection example)

<parthenon/job>
problem_id = advection

<parthenon/mesh>
refinement = none

nx1 = 256
x1min = -0.5
x1max = 0.5
ix1_bc = periodic
ox1_bc = periodic

nx2 = 256
x2min = -0.5
x2max = 0.5
ix2_bc = periodic
ox2_bc = periodic

nx3 = 256
x3min = -0.5
x3max = 0.5
ix3_bc = periodic
ox3_bc = periodic

<parthenon/meshblock>
nx1 = 128
nx2 = 128
nx3 = 128

<parthenon/time>
nlim = 25
tlim = 1.0
integrator = rk2
ncycle_out_mesh = -10000

<Advection>
cfl = 0.45
vx = 1.0
vy = 1.0
vz = 1.0
profile = hard_sphere

refine_tol = 0.3    # controls the package-specific refinement tagging function
derefine_tol = 0.03
compute_error = false
num_vars = 1 # number of variables
vec_size = 10 # size of each variable
fill_derived = false # whether to fill one-copy test vars

Current performance

Sample performance on a single GH200, running the input above with meshblock sizes of 64, 128, and 256 (a team-based packing sketch follows the numbers):

nb64.out:|-> 6.62e-02 sec 3.6% 100.0% 0.0% ------ 51 boundary_communication.cpp::96::SendBoundBufs [for]
nb128.out:|-> 1.44e-01 sec 11.0% 100.0% 0.0% ------ 51 boundary_communication.cpp::96::SendBoundBufs [for]
nb256.out:|-> 5.45e-01 sec 25.9% 100.0% 0.0% ------ 51 boundary_communication.cpp::96::SendBoundBufs [for]
nb64.out:|-> 8.81e-02 sec 4.8% 100.0% 0.0% ------ 51 boundary_communication.cpp::274::SetBounds [for]
nb128.out:|-> 1.69e-01 sec 12.9% 100.0% 0.0% ------ 51 boundary_communication.cpp::274::SetBounds [for]
nb256.out:|-> 6.44e-01 sec 30.6% 100.0% 0.0% ------ 51 boundary_communication.cpp::274::SetBounds [for]
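
One direction for the few-large-blocks case, sketched with Kokkos hierarchical parallelism (names and data layout are made up; this is not the actual SendBoundBufs kernel from boundary_communication.cpp): expose parallelism both across buffers (teams) and within each buffer (threads/vector lanes), so that a handful of large buffers can still saturate the GPU.

#include <Kokkos_Core.hpp>

int main(int argc, char *argv[]) {
  Kokkos::initialize(argc, argv);
  {
    const int nbuf = 4;         // few buffers, as with few large blocks
    const int ncell = 1 << 20;  // many cells per buffer

    Kokkos::View<double **> src("src", nbuf, ncell);
    Kokkos::View<double **> buf("buf", nbuf, ncell);
    Kokkos::deep_copy(src, 1.0);

    using team_policy = Kokkos::TeamPolicy<>;
    using member_type = team_policy::member_type;

    // One team per buffer; each team's threads/vector lanes stride the cells,
    // so a single large buffer is packed by many threads at once.
    Kokkos::parallel_for(
        "PackBuffers", team_policy(nbuf, Kokkos::AUTO),
        KOKKOS_LAMBDA(const member_type &team) {
          const int b = team.league_rank();
          Kokkos::parallel_for(Kokkos::TeamVectorRange(team, ncell),
                               [&](const int i) { buf(b, i) = src(b, i); });
        });
    Kokkos::fence();
  }
  Kokkos::finalize();
  return 0;
}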

Diagnose (and improve?) particle efficiency at scale

  • Example problem: particles-example

Multigrid performance

  • Example problem:

NCCL/RCCL evaluation

  • This would be a heavy lift to fully implement

  • Example problem:
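
For reference, a minimal MPI-bootstrapped NCCL point-to-point sketch (illustrative only, not Parthenon code): the unique id is broadcast over MPI, then device-resident buffers are exchanged with ncclSend/ncclRecv on a stream. RCCL exposes the same API on AMD GPUs.

#include <mpi.h>
#include <nccl.h>
#include <cuda_runtime.h>
#include <cstdio>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank, nranks;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nranks);

  // One GPU per rank (assumes ranks are mapped to devices round-robin).
  int ndev = 0;
  cudaGetDeviceCount(&ndev);
  cudaSetDevice(rank % ndev);

  // Bootstrap NCCL by broadcasting the unique id over MPI.
  ncclUniqueId id;
  if (rank == 0) ncclGetUniqueId(&id);
  MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);
  ncclComm_t comm;
  ncclCommInitRank(&comm, nranks, id, rank);

  cudaStream_t stream;
  cudaStreamCreate(&stream);

  // Device-resident stand-ins for boundary communication buffers.
  const size_t count = 1 << 20;
  double *send_buf, *recv_buf;
  cudaMalloc((void **)&send_buf, count * sizeof(double));
  cudaMalloc((void **)&recv_buf, count * sizeof(double));

  // Ring exchange: send to the next rank, receive from the previous one.
  const int next = (rank + 1) % nranks;
  const int prev = (rank + nranks - 1) % nranks;
  ncclGroupStart();
  ncclSend(send_buf, count, ncclDouble, next, comm, stream);
  ncclRecv(recv_buf, count, ncclDouble, prev, comm, stream);
  ncclGroupEnd();
  cudaStreamSynchronize(stream);

  if (rank == 0) printf("NCCL ring exchange of %zu doubles complete\n", count);

  cudaFree(send_buf);
  cudaFree(recv_buf);
  ncclCommDestroy(comm);
  cudaStreamDestroy(stream);
  MPI_Finalize();
  return 0;
}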

CUDA asynchronous memory copies

  • Example problem:
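
A small self-contained sketch of cudaMemcpyAsync with pinned host memory and separate streams (illustrative only; nothing here is Parthenon code): the copy is queued on its own stream so other work can overlap with it, and synchronization happens only where the data is actually needed.

#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(double *data, size_t n, double s) {
  size_t i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] *= s;
}

int main() {
  const size_t n = 1 << 22;
  const size_t bytes = n * sizeof(double);

  // Pinned host memory is required for truly asynchronous copies.
  double *h_buf;
  cudaMallocHost((void **)&h_buf, bytes);
  for (size_t i = 0; i < n; ++i) h_buf[i] = 1.0;

  double *d_buf;
  cudaMalloc((void **)&d_buf, bytes);

  cudaStream_t copy_stream, compute_stream;
  cudaStreamCreate(&copy_stream);
  cudaStreamCreate(&compute_stream);

  // Queue the host-to-device copy on its own stream.
  cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, copy_stream);
  cudaStreamSynchronize(copy_stream);  // wait only where the data is needed

  scale<<<(n + 255) / 256, 256, 0, compute_stream>>>(d_buf, n, 2.0);

  // Copy the result back asynchronously on the compute stream.
  cudaMemcpyAsync(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost, compute_stream);
  cudaStreamSynchronize(compute_stream);

  printf("h_buf[0] = %g\n", h_buf[0]);

  cudaFree(d_buf);
  cudaFreeHost(h_buf);
  cudaStreamDestroy(copy_stream);
  cudaStreamDestroy(compute_stream);
  return 0;
}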

Team

Ben Ryan

  • Secondary goal interests
    • Particle scaling

Luke

  • Secondary goal interests
    • Multigrid parallel performance

Philipp

  • Secondary goal interests
    • Improve buffer kernel performance for few (large) blocks

Patrick

  • Secondary goal interests

Alex

  • Secondary goal interests

Nirmal

  • Secondary goal interests

Ben Prather

  • Secondary goal interests
    • Single-meshblock bottlenecks
    • Interface for downstreams to add CUDA async copies?

Jonah

  • Secondary goal interests