Particle creation takes a lot of time on Nvidia GPU node #4506

lwJi · 2025-06-13T19:10:27Z


Particle creation takes much longer on an Nvidia GPU node (Grace Hopper) than on a CPU node (Intel Cascade Lake). Here is the segment of code we tested. Is there a way to make it efficient on the GPU node as well?

  // Create particle containers
  using Container = amrex::AmrParticleContainer<3, 2>;
  using ParticleTile = Container::ParticleTileType;
  std::vector<Container> containers(ghext->num_patches());
  for (int patch = 0; patch < ghext->num_patches(); ++patch) {
    const auto &restrict patchdata = ghext->patchdata.at(patch);
    containers.at(patch) = Container(patchdata.amrcore.get());
    const int level = 0;
    const auto &restrict leveldata = patchdata.leveldata.at(level);
    const amrex::MFIter mfi(*leveldata.fab);
    assert(mfi.isValid());
    ParticleTile *const particle_tile = &containers.at(patch).GetParticles(
        level)[make_pair(mfi.index(), mfi.LocalTileIndex())];

    // Set particle positions
    const int proc = amrex::ParallelDescriptor::MyProc();
    for (int n = 0; n < npoints; ++n) {
      // TODO: Loop over points only once
      if (patches.at(n) == patch) {
        amrex::Particle<3, 2> p;
        p.id() = Container::ParticleType::NextID();
        p.cpu() = proc;
        p.pos(0) = posx[n]; // AMReX distribution position
        p.pos(1) = posy[n];
        p.pos(2) = posz[n];
        p.rdata(0) = localsx[n]; // actual particle coordinate
        p.rdata(1) = localsy[n];
        p.rdata(2) = localsz[n];
        p.idata(0) = proc; // source process
        p.idata(1) = n;    // source index
        particle_tile->push_back(p);
      }
    }
  }

Answered by atmyers


atmyers · 2025-06-13T21:08:27Z

Maintainer

Hi lwJi,

It looks like your code is relying on managed memory and calling push_back to add each particle to the ParticleContainer one-by-one. That will result in a lot of memory traffic back and forth between the host and the device, which I think is why your code is slow.

The fastest way is to generate the particles on the GPU instead of the CPU. To do that, you could follow the code here. That routine 1) launches a kernel to count the number of particles that will be added in each cell, 2) calls Gpu::exclusive_scan on the resulting counts to get a set of offsets, then 3) launches a second kernel where threads fill in the data using those offsets.
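The three steps above (count, exclusive scan, fill) can be sketched on the host in plain C++. This is only an illustration of the pattern, not the AMReX routine referred to: `SketchParticle`, `generate_particles`, and the per-cell `density` input are hypothetical names, and the loops stand in for what would be GPU kernels and a call to Gpu::exclusive_scan.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a particle; this is not an AMReX type.
struct SketchParticle {
    double x;  // position along a 1D row of cells
    int cell;  // index of the cell that created the particle
};

// Host-side sketch of the count / exclusive scan / fill pattern.
// Assumption: density[i] is a non-negative per-cell input, and each cell
// creates floor(density[i]) particles spaced evenly inside a cell of width dx.
std::vector<SketchParticle> generate_particles(const std::vector<double>& density,
                                               double dx) {
    const std::size_t ncells = density.size();

    // Pass 1 (the first kernel on a GPU): each cell counts its own particles.
    std::vector<std::size_t> counts(ncells);
    for (std::size_t i = 0; i < ncells; ++i) {
        assert(density[i] >= 0.0);
        counts[i] = static_cast<std::size_t>(density[i]);
    }

    // Exclusive scan: offsets[i] = counts[0] + ... + counts[i-1].
    // The extra last entry holds the total number of particles.
    std::vector<std::size_t> offsets(ncells + 1, 0);
    for (std::size_t i = 0; i < ncells; ++i) {
        offsets[i + 1] = offsets[i] + counts[i];
    }

    // Pass 2 (the second kernel on a GPU): each cell writes only into its own
    // slice [offsets[i], offsets[i] + counts[i]), so cells never collide.
    std::vector<SketchParticle> particles(offsets[ncells]);
    for (std::size_t i = 0; i < ncells; ++i) {
        for (std::size_t k = 0; k < counts[i]; ++k) {
            const double frac = (k + 0.5) / counts[i];
            particles[offsets[i] + k] =
                SketchParticle{(i + frac) * dx, static_cast<int>(i)};
        }
    }
    return particles;
}
```

Because every cell knows its output slice before the fill pass starts, the iterations of both loops are independent of each other, which is what makes the pattern suitable for one GPU thread per cell.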

If you still want to generate the particles on the host, you could also push them back onto a pinned ParticleTile, then copy them all at once to the GPU. Here is an example. That might make the most sense for you, as it looks like you already have a vector of positions and other particle data on the host.

8 replies

lwJi Jun 16, 2025
Author

Thanks @atmyers. By the way, where can I find the definition of ContainerLike in this line?

WeiqunZhang Jun 16, 2025
Maintainer

git grep is our friend.

$ git grep -n "ContainerLike"
Src/Particle/AMReX_ParticleContainer.H:1349:    using ContainerLike = amrex::ParticleContainer_impl<ParticleType, NArrayReal, NArrayInt, NewAllocator>;
Src/Particle/AMReX_ParticleContainer.H:1362:    ContainerLike<NewAllocator>
Src/Particle/AMReX_ParticleContainer.H:1365:        ContainerLike<NewAllocator> tmp(m_gdb);

lwJi Jun 17, 2025
Author

Thanks.
Does the following use of a pinned ParticleTile look good to you?

  // Create particle containers
  using Container = amrex::AmrParticleContainer<3, 2>;
  using ParticleTile = Container::ParticleTileType;
  std::vector<Container> containers(ghext->num_patches());
  for (int patch = 0; patch < ghext->num_patches(); ++patch) {
    const auto &restrict patchdata = ghext->patchdata.at(patch);
    containers.at(patch) = Container(patchdata.amrcore.get());
    const int level = 0;
    const auto &restrict leveldata = patchdata.leveldata.at(level);
    const amrex::MFIter mfi(*leveldata.fab);
    assert(mfi.isValid());
    ParticleTile &particle_tile = containers.at(patch).GetParticles(
        level)[make_pair(mfi.index(), mfi.LocalTileIndex())];

    using PinnedTile = typename amrex::ParticleContainer_impl<
        Container::ParticleType, 0, 0,
        amrex::PinnedArenaAllocator>::ParticleTileType;
    PinnedTile pinned_tile;
    pinned_tile.define(particle_tile.NumRuntimeRealComps(),
                       particle_tile.NumRuntimeIntComps());

    // Set particle positions
    const int proc = amrex::ParallelDescriptor::MyProc();
    for (int n = 0; n < npoints; ++n) {
      // TODO: Loop over points only once
      if (patches.at(n) == patch) {
        amrex::Particle<3, 2> p;
        p.id() = Container::ParticleType::NextID();
        p.cpu() = proc;
        p.pos(0) = posx[n]; // AMReX distribution position
        p.pos(1) = posy[n];
        p.pos(2) = posz[n];
        p.rdata(0) = localsx[n]; // actual particle coordinate
        p.rdata(1) = localsy[n];
        p.rdata(2) = localsz[n];
        p.idata(0) = proc; // source process
        p.idata(1) = n;    // source index
        pinned_tile.push_back(p);
      }
    }

    auto old_np = particle_tile.numParticles();
    auto new_np = old_np + pinned_tile.numParticles();
    particle_tile.resize(new_np);
    amrex::copyParticles(particle_tile, pinned_tile, 0, old_np,
                         pinned_tile.numParticles());
  }

atmyers Jun 17, 2025
Maintainer

Hi lwJi,

I haven't tried to compile or run your code, but the basic approach looks right to me. I don't see any mistakes at first glance.

lwJi Jun 17, 2025
Author

Thanks @atmyers. Using pinned ParticleTile does solve the problem. I will also try the first approach later.
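Separately from the host/device issue, both listings in this thread carry the comment "TODO: Loop over points only once", because every patch rescans all `npoints` points. One possible way to address that, sketched here in plain C++ with no AMReX calls, is to group the point indices by patch in a single pass and then have each patch iterate only over its own bucket. `bucket_by_patch` is a hypothetical helper name; `patches` plays the same role as in the code above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Group point indices by owning patch in a single pass over the points.
// patches[n] is the patch that owns point n, as in the thread's code.
// bucket_by_patch is a hypothetical helper, not part of AMReX.
std::vector<std::vector<int>> bucket_by_patch(const std::vector<int>& patches,
                                              int num_patches) {
    std::vector<std::vector<int>> buckets(num_patches);
    for (std::size_t n = 0; n < patches.size(); ++n) {
        const int patch = patches[n];
        assert(patch >= 0 && patch < num_patches);
        // Indices are appended in increasing order, so each bucket stays sorted.
        buckets[patch].push_back(static_cast<int>(n));
    }
    return buckets;
}
```

With the buckets built once before the patch loop, the inner loop for a given patch would run over `buckets[patch]` instead of testing `patches.at(n) == patch` for every point, which reduces the work from roughly npoints times the number of patches to npoints.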
