
WIP: Prototype for "poor man's" load balancing #5405


Draft · wants to merge 17 commits into dev

Conversation

@franzpoeschel (Contributor) commented Jul 1, 2025

PIConGPU is, at its core, not a load-balanced code, and there is good reason not to change that.
However, there are use cases which require at least a basic form of load balancing, and the existing checkpoint-restart mechanism offers a low-hanging fruit for implementing a simple (and maybe even somewhat simplistic) load-balancing scheme.

Schematic idea for load balancing:

  1. Halt the simulation
  2. Virtually write a checkpoint (see below)
  3. Restart from that same checkpoint, but repartition the domain in the process, i.e. load data in a more balanced decomposition

Virtual checkpoint: The openPMD plugin can read/write data not only from/to disk, but also from/to a stream. Disk access can hence be avoided by restarting in situ, streaming the data to "yourself" and applying a different domain decomposition in the process.
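
A minimal serial sketch of the "stream to yourself" idea with the openPMD-api (the stream name selfRestart.sst, the record E/x, and the half-domain read window are made up for illustration; READ_LINEAR assumes a recent openPMD-api built against an ADIOS2 with SST enabled, and PIConGPU's real restart path of course goes through its openPMD plugin and MPI):

    // Hedged sketch, not PIConGPU code: writer and reader of one SST stream in a single process.
    #include <openPMD/openPMD.hpp>

    #include <numeric>
    #include <thread>
    #include <vector>

    namespace io = openPMD;

    // Writer side: publish one field chunk, as the "old" decomposition would.
    void writeSide()
    {
        io::Series out("selfRestart.sst", io::Access::CREATE);
        auto iteration = out.writeIterations()[0];
        auto E_x = iteration.meshes["E"]["x"];
        E_x.resetDataset({io::Datatype::DOUBLE, {64}});
        std::vector<double> data(64);
        std::iota(data.begin(), data.end(), 0.0);
        E_x.storeChunk(data, {0}, {64});
        iteration.close(); // flush the step into the stream
    }

    // Reader side: load the same field back, but with a different window
    // (here: only the second half, standing in for a repartitioned subdomain).
    void readSide()
    {
        io::Series in("selfRestart.sst", io::Access::READ_LINEAR);
        for (auto iteration : in.readIterations())
        {
            auto E_x = iteration.meshes["E"]["x"];
            auto chunk = E_x.loadChunk<double>({32}, {32});
            iteration.close(); // chunk data is valid once the step is closed
            // ... scatter *chunk into the new local subdomain ...
        }
    }

    int main()
    {
        // Writer and reader live in the same process: "streaming to yourself".
        std::thread writer(writeSide);
        std::thread reader(readSide);
        writer.join();
        reader.join();
    }

The only point of the sketch is that the reader is free to request a different chunk than the one the writer stored; that freedom is exactly what the load-balanced restart exploits.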

Note that (1) in-situ restarting and (2) load-balanced restarting are orthogonal features; they just both play into this use case.

There is currently no point in reviewing the code structure; this is experimental. I'm opening the PR to have a place for discussing the idea.

Plan:

  • In-situ restarting from a virtual checkpoint
    • I'm currently abusing the --checkpoint.period option for this; there is no CLI control atm
  • Writing checkpoints at finer granularity: one patch per supercell instead of one per MPI process
  • Restarting from checkpoints with "dynamic" particle patches, i.e. multiple patches that fall into the local subdomain. This already allows relaunching PIConGPU with a different number of MPI processes.
    • Patches that fall only partially into the local subdomain are currently ignored. This is fine for the first prototype, as we don't need to go below the supercell level for load balancing, at least for now.
  • Actually load-balance: This involves:
    1. Figuring out a new domain decomposition (there is no canonical answer for this; a simple greedy sketch is given right after this list)
    2. Somehow changing the simulation's domain decomposition from within the IO plugin (I will probably need help with that)
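
For the decomposition question, a very simple (and certainly not optimal) starting point could be a greedy 1D split by particle counts per supercell slab. Everything below is a hypothetical sketch, not code from this PR:

    #include <cstddef>
    #include <numeric>
    #include <vector>

    // Given the particle count of each supercell slab along one axis, pick the
    // first slab owned by each rank such that every rank receives roughly the
    // same number of particles. Rank 0 always starts at slab 0.
    std::vector<std::size_t> greedySlabPartition(
        std::vector<std::size_t> const &particlesPerSlab,
        std::size_t numRanks)
    {
        std::size_t const total = std::accumulate(
            particlesPerSlab.begin(), particlesPerSlab.end(), std::size_t{0});
        std::size_t const targetPerRank = (total + numRanks - 1) / numRanks;

        std::vector<std::size_t> firstSlabOfRank{0};
        std::size_t accumulated = 0;
        for (std::size_t slab = 0; slab < particlesPerSlab.size(); ++slab)
        {
            if (accumulated >= targetPerRank && firstSlabOfRank.size() < numRanks)
            {
                firstSlabOfRank.push_back(slab); // start the next rank's range here
                accumulated = 0;
            }
            accumulated += particlesPerSlab[slab];
        }
        // Ranks that got nothing receive an empty range at the end.
        while (firstSlabOfRank.size() < numRanks)
            firstSlabOfRank.push_back(particlesPerSlab.size());
        return firstSlabOfRank;
    }

Anything smarter (multi-dimensional splits, weighting fields against particles, hysteresis to avoid rebalancing too often) could be swapped in behind the same kind of interface later.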

Minor side-Todos:

  • This works with the SST engine. SSC might be more adequate, but I could not make it work on the first try, so I gave up for now.
  • Maybe check if we can control the startup of the SST engine more finely. It currently needs fully threaded MPI; if we could steer MPI operations a bit more precisely at that moment, we might be able to relax that (see the sketch after this list).
  • Take things like id_provider and the RNG states out of restarting
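
Regarding the threaded-MPI point: "fully threaded MPI" boils down to requesting and verifying MPI_THREAD_MULTIPLE at startup, roughly as in this generic (non-PIConGPU) sketch:

    #include <mpi.h>

    #include <cstdio>

    int main(int argc, char **argv)
    {
        // Request full thread support and bail out if the MPI library cannot provide it.
        int provided = MPI_THREAD_SINGLE;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
        if (provided < MPI_THREAD_MULTIPLE)
        {
            std::fprintf(stderr, "MPI_THREAD_MULTIPLE not available (got %d)\n", provided);
            MPI_Abort(MPI_COMM_WORLD, 1);
        }
        // ... run the simulation ...
        MPI_Finalize();
        return 0;
    }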

The points above will (hopefully) result in a very basic, still unoptimized and somewhat incomplete load-balancing implementation. The points below are an outline for work after this PR:

  • Particle patches: These will be the main scaling issue of the first version of load balancing. Two reasons for this:
    • Communication patterns: Particle patches are written by everyone, and every reader reads everything. This is not a problem when writing checkpoint files, since files are a form of aggregation. In streaming, however, this implies an N->N communication pattern.
      1. Simple solution: Do not stream the particle patches. They are small and can go to a file on the parallel filesystem.
      2. Overengineered solution: Aggregate and deaggregate particle patches. I see no use in this for now.
      3. Solution without using the parallel file system: Use ADIOS2 Local Values (see this PR). Those participate in metadata aggregation, but are restricted to a single value per process. That need not be a problem, see below (a sketch against the plain ADIOS2 API follows after this list).
    • Number of particle patches: Since we now write patches at a much finer granularity, there are many more of them, by orders of magnitude. The solution here would be to write particle patches in two levels:
      1. Lower level: One patch per supercell
      2. Higher level: One patch per process. The patch does not point into the particle regions, but into the lower-level patches' regions. Only the higher level needs to be communicated in an N->N pattern, but it is small and can be written as an ADIOS2 LocalValue dataset, so it is not a problem.
  • Sizes of IO operations: Currently, writing and reading a checkpoint takes a long time since each supercell has a separate IO operation. We can expect this to overload the cluster networks at some point. IO operations should be merged as much as possible, in reading and writing -- while staying logically distinct through particle patches.
  • Memory consumption: Is manageable, but large. When using the SST engine in ADIOS2, the major memory consumer will be the host-side streaming buffers, which need to hold the entire simulation volume and particles, i.e. roughly a clone of the GPU's simulation data on the host. Writing to and reading from openPMD can by now be controlled very finely thanks to René's work on slicing.
    We should investigate whether we can make the BP5 serializer of SST use the node-local SSDs as a streaming buffer instead of RAM. This would be a project within ADIOS2, but it would benefit other streaming use cases, too. No idea, though, whether memory maps and RDMA play nicely together.
    Workarounds until then:
    1. Don't completely fill up the GPU memory.
    2. If you want to completely fill up the GPU memory, use an actual disk checkpoint instead.
  • PML fields: Have weird geometries, ignored for now.
  • Other plugins that participate in checkpoint-restart: Not considered at all for now.
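
To illustrate the Local Values idea from the particle-patches item above (variable name and per-rank value are made up): a variable defined with the shape {adios2::LocalValueDim} contributes exactly one value per writing rank, and readers see all contributions as a small 1D array of length nRanks, which is all the higher-level, one-patch-per-process index would need.

    #include <adios2.h>
    #include <mpi.h>

    #include <cstdint>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        adios2::ADIOS adios(MPI_COMM_WORLD);
        adios2::IO io = adios.DeclareIO("patchIndex");

        // LocalValueDim: one value per writing rank, aggregated into a global
        // 1D array of length nRanks on the reader side.
        auto numPatches = io.DefineVariable<std::uint64_t>(
            "numPatchesPerRank", {adios2::LocalValueDim});

        adios2::Engine writer = io.Open("patchIndex.bp", adios2::Mode::Write);
        writer.BeginStep();
        // Hypothetical per-rank patch count, standing in for the real index.
        std::uint64_t const myNumPatches = 40 + static_cast<std::uint64_t>(rank);
        writer.Put(numPatches, myNumPatches);
        writer.EndStep();
        writer.Close();

        MPI_Finalize();
        return 0;
    }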

Note to myself:

FABRIC_PROVIDER=shm FABRIC_IFACE=shm PIC_USE_THREADED_MPI=MPI_THREAD_MULTIPLE picongpu -s 100 -g 32 32 32 -d 1 1 1 --checkpoint.period 50:50 --checkpoint.openPMD.ext sst --checkpoint.openPMD.json '{"adios2":{"engine":{"parameters":{"DataTransport": "fabric"}}}}'

UCX_TLS=shm UCX_POSIX_USE_PROC_LINK=n PIC_USE_THREADED_MPI=MPI_THREAD_MULTIPLE picongpu -s 100 -g 32 32 32 -d 1 1 1 --checkpoint.period 50:50 --checkpoint.openPMD.ext sst --checkpoint.openPMD.json '{"adios2":{"engine":{"parameters":{"DataTransport": "ucx"}}}}'
