
WIP: Prototype for "poor man's" load balancing #5405


Draft · wants to merge 17 commits into dev

Conversation

@franzpoeschel (Contributor) commented Jul 1, 2025

PIConGPU is, at its core, not a load-balanced code, and there is good reason not to change that.
However, there are use cases which require at least a basic form of load balancing, and the existing checkpoint-restart mechanism offers a low-hanging fruit for implementing a simple (and maybe even somewhat simplistic) load-balancing scheme.

Schematic idea for load balancing:

  1. Halt the simulation
  2. Virtually write a checkpoint (see below)
  3. Restart from that same checkpoint, but repartition the domain in the process, i.e. load data in a more balanced decomposition

Virtual checkpoint: The openPMD plugin can read/write data not only from/to disk, but also from/to a stream. Disk access can hence be avoided by restarting in situ, streaming the data to "yourself" and applying a different domain decomposition in the process.
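
A minimal serial sketch of the "stream to yourself" idea with the openPMD-api (the stream name selfRestart.sst, the record E/x, and the half-domain read window are made up for illustration; READ_LINEAR assumes a recent openPMD-api built against an ADIOS2 with SST enabled, and PIConGPU's real restart path of course goes through its openPMD plugin and MPI):

    // Hedged sketch, not PIConGPU code: writer and reader of one SST stream in a single process.
    #include <openPMD/openPMD.hpp>

    #include <numeric>
    #include <thread>
    #include <vector>

    namespace io = openPMD;

    // Writer side: publish one field chunk, as the "old" decomposition would.
    void writeSide()
    {
        io::Series out("selfRestart.sst", io::Access::CREATE);
        auto iteration = out.writeIterations()[0];
        auto E_x = iteration.meshes["E"]["x"];
        E_x.resetDataset({io::Datatype::DOUBLE, {64}});
        std::vector<double> data(64);
        std::iota(data.begin(), data.end(), 0.0);
        E_x.storeChunk(data, {0}, {64});
        iteration.close(); // flush the step into the stream
    }

    // Reader side: load the same field back, but with a different window
    // (here: only the second half, standing in for a repartitioned subdomain).
    void readSide()
    {
        io::Series in("selfRestart.sst", io::Access::READ_LINEAR);
        for (auto iteration : in.readIterations())
        {
            auto E_x = iteration.meshes["E"]["x"];
            auto chunk = E_x.loadChunk<double>({32}, {32});
            iteration.close(); // chunk data is valid once the step is closed
            // ... scatter *chunk into the new local subdomain ...
        }
    }

    int main()
    {
        // Writer and reader live in the same process: "streaming to yourself".
        std::thread writer(writeSide);
        std::thread reader(readSide);
        writer.join();
        reader.join();
    }

The only point of the sketch is that the reader is free to request a different chunk than the one the writer stored; that freedom is exactly what the load-balanced restart exploits.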

Note that (1) in-situ restarting and (2) load-balanced restarting are orthogonal features; they just both play into this use case.

There is currently no point in reviewing the code structure; this is experimental. I'm opening the PR to have a place for discussing the idea.

Plan:

  • In-situ restarting from a virtual checkpoint
    • I'm currently abusing the --checkpoint.period option for this; there is no CLI control atm
  • Writing checkpoints at finer granularity: one patch per supercell instead of one per MPI process
  • Restarting from checkpoints with "dynamic" particle patches, i.e. multiple patches that fall into the local subdomain. This already allows relaunching PIConGPU with a different number of MPI processes.
    • Patches that fall only partially into the local subdomain are currently ignored. This is fine for the first prototype, as we don't need to go below the supercell level for load balancing, at least for now.
  • Actually load-balance: This involves:
    1. Figuring out a new domain decomposition (there is no canonical answer for this; a simple greedy sketch is given right after this list)
    2. Somehow changing the simulation's domain decomposition from within the IO plugin (I will probably need help with that)
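
For the decomposition question, a very simple (and certainly not optimal) starting point could be a greedy 1D split by particle counts per supercell slab. Everything below is a hypothetical sketch, not code from this PR:

    #include <cstddef>
    #include <numeric>
    #include <vector>

    // Given the particle count of each supercell slab along one axis, pick the
    // first slab owned by each rank such that every rank receives roughly the
    // same number of particles. Rank 0 always starts at slab 0.
    std::vector<std::size_t> greedySlabPartition(
        std::vector<std::size_t> const &particlesPerSlab,
        std::size_t numRanks)
    {
        std::size_t const total = std::accumulate(
            particlesPerSlab.begin(), particlesPerSlab.end(), std::size_t{0});
        std::size_t const targetPerRank = (total + numRanks - 1) / numRanks;

        std::vector<std::size_t> firstSlabOfRank{0};
        std::size_t accumulated = 0;
        for (std::size_t slab = 0; slab < particlesPerSlab.size(); ++slab)
        {
            if (accumulated >= targetPerRank && firstSlabOfRank.size() < numRanks)
            {
                firstSlabOfRank.push_back(slab); // start the next rank's range here
                accumulated = 0;
            }
            accumulated += particlesPerSlab[slab];
        }
        // Ranks that got nothing receive an empty range at the end.
        while (firstSlabOfRank.size() < numRanks)
            firstSlabOfRank.push_back(particlesPerSlab.size());
        return firstSlabOfRank;
    }

Anything smarter (multi-dimensional splits, weighting fields against particles, hysteresis to avoid rebalancing too often) could be swapped in behind the same kind of interface later.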

Minor side-Todos:

  • This works with the SST engine. SSC might be more adequate, but I could not make it work on the first try, so I gave up for now.
  • Maybe check if we can control the startup of the SST engine more finely. It currently needs fully threaded MPI; if we could steer MPI operations a bit more precisely at that moment, we might be able to relax that (see the sketch after this list).
  • Take things like id_provider and the RNG states out of restarting
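
Regarding the threaded-MPI point: "fully threaded MPI" boils down to requesting and verifying MPI_THREAD_MULTIPLE at startup, roughly as in this generic (non-PIConGPU) sketch:

    #include <mpi.h>

    #include <cstdio>

    int main(int argc, char **argv)
    {
        // Request full thread support and bail out if the MPI library cannot provide it.
        int provided = MPI_THREAD_SINGLE;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
        if (provided < MPI_THREAD_MULTIPLE)
        {
            std::fprintf(stderr, "MPI_THREAD_MULTIPLE not available (got %d)\n", provided);
            MPI_Abort(MPI_COMM_WORLD, 1);
        }
        // ... run the simulation ...
        MPI_Finalize();
        return 0;
    }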

The points above will (hopefully) result in a very basic, still unoptimized and somewhat incomplete load-balancing implementation. The points below are an outline for work after this PR:

  • Particle patches: These will be the main scaling issue of the first version of load balancing. Two reasons for this:
    • Communication patterns: Particle patches are written by everyone, and every reader reads everything. This is not a problem when writing checkpoint files, since files are a form of aggregation. In streaming, however, this implies an N->N communication pattern.
      1. Simple solution: Do not stream the particle patches. They are small and can go to a file on the parallel filesystem.
      2. Overengineered solution: Aggregate and deaggregate particle patches. I see no use in this for now.
      3. Solution without using the parallel file system: Use ADIOS2 Local Values (see this PR). Those participate in metadata aggregation, but are restricted to a single value per process. That need not be a problem, see below (a sketch against the plain ADIOS2 API follows after this list).
    • Number of particle patches: Since we now write patches at a much finer granularity, there are many more of them, by orders of magnitude. The solution here would be to write particle patches in two levels:
      1. Lower level: One patch per supercell
      2. Higher level: One patch per process. The patch does not point into the particle regions, but into the lower-level patches' regions. Only the higher level needs to be communicated in an N->N pattern, but it is small and can be written as an ADIOS2 LocalValue dataset, so it is not a problem.
  • Sizes of IO operations: Currently, writing and reading a checkpoint takes a long time since each supercell has a separate IO operation. We can expect this to overload the cluster networks at some point. IO operations should be merged as much as possible, in reading and writing -- while staying logically distinct through particle patches.
  • Memory consumption: Is manageable, but large. When using the SST engine in ADIOS2, the major memory consumer will be the host-side streaming buffers, which need to hold the entire simulation volume and particles, i.e. roughly a clone of the GPU's simulation data on the host. Writing to and reading from openPMD can by now be controlled very finely thanks to René's work on slicing.
    We should investigate whether we can make the BP5 serializer of SST use the node-local SSDs as a streaming buffer instead of RAM. This would be a project within ADIOS2, but it would benefit other streaming use cases, too. No idea, though, whether memory maps and RDMA play nicely together.
    Workarounds until then:
    1. Don't completely fill up the GPU memory.
    2. If you want to completely fill up the GPU memory, use an actual disk checkpoint instead.
  • PML fields: Have weird geometries, ignored for now.
  • Other plugins that participate in checkpoint-restart: Not considered at all for now.
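
To illustrate the Local Values idea from the particle-patches item above (variable name and per-rank value are made up): a variable defined with the shape {adios2::LocalValueDim} contributes exactly one value per writing rank, and readers see all contributions as a small 1D array of length nRanks, which is all the higher-level, one-patch-per-process index would need.

    #include <adios2.h>
    #include <mpi.h>

    #include <cstdint>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        adios2::ADIOS adios(MPI_COMM_WORLD);
        adios2::IO io = adios.DeclareIO("patchIndex");

        // LocalValueDim: one value per writing rank, aggregated into a global
        // 1D array of length nRanks on the reader side.
        auto numPatches = io.DefineVariable<std::uint64_t>(
            "numPatchesPerRank", {adios2::LocalValueDim});

        adios2::Engine writer = io.Open("patchIndex.bp", adios2::Mode::Write);
        writer.BeginStep();
        // Hypothetical per-rank patch count, standing in for the real index.
        std::uint64_t const myNumPatches = 40 + static_cast<std::uint64_t>(rank);
        writer.Put(numPatches, myNumPatches);
        writer.EndStep();
        writer.Close();

        MPI_Finalize();
        return 0;
    }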

Note to myself:

FABRIC_PROVIDER=shm FABRIC_IFACE=shm PIC_USE_THREADED_MPI=MPI_THREAD_MULTIPLE picongpu -s 100 -g 32 32 32 -d 1 1 1 --checkpoint.period 50:50 --checkpoint.openPMD.ext sst --checkpoint.openPMD.json '{"adios2":{"engine":{"parameters":{"DataTransport": "fabric"}}}}'

UCX_TLS=shm UCX_POSIX_USE_PROC_LINK=n PIC_USE_THREADED_MPI=MPI_THREAD_MULTIPLE picongpu -s 100 -g 32 32 32 -d 1 1 1 --checkpoint.period 50:50 --checkpoint.openPMD.ext sst --checkpoint.openPMD.json '{"adios2":{"engine":{"parameters":{"DataTransport": "ucx"}}}}'
