PeleLMeX vs PeleC GPU Memory Usage #514
3 comments · 5 replies
-
I haven't previously done a direct comparison, but PeleLMeX is definitely a bit of a memory hog, and I'm not surprised that it would require significantly more memory than PeleC for the same size grid, given the algorithmic and code-structure differences. That said, it seems like you're seeing ~5x the memory usage for LMeX vs. C, which is quite substantial. PeleLMeX stores a lot of intermediates (transport coefficients, various transport terms, etc.) as vectors of MultiFabs containing the full domain data across all levels (see how many vectors of MultiFabs with `NUM_SPECIES` components are defined in PeleLMeX_Data.cpp). In PeleC, a lot of this is computed at the FAB level and discarded rather than being stored. You may just have to spread your simulation over a larger number of GPUs with PeleLMeX. Take a look at the …
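To make that concrete, here is a minimal sketch of the two storage patterns (illustrative names only, not the actual Pele data structures):

```cpp
#include <AMReX.H>
#include <AMReX_MultiFab.H>
#include <AMReX_Vector.H>

// Persistent, level-wide storage (PeleLMeX-style pattern): each container holds
// full-domain data on its level and stays allocated for the whole time step.
// Struct and member names are placeholders for illustration.
struct LevelScratch
{
    amrex::MultiFab diff_coeffs;   // NUM_SPECIES components, lives on the GPU arena
    amrex::MultiFab species_flux;  // NUM_SPECIES components, ditto
};
// One entry per AMR level -> memory scales with total cell count times
// the number of species-sized intermediates kept resident.
amrex::Vector<LevelScratch> scratch_per_level;

// Transient, per-box storage (PeleC-style pattern): the same quantities exist
// only as FAB-sized temporaries inside the MFIter loop and are released
// as soon as each box has been processed.
void per_box_work (amrex::MultiFab& state, int num_species)
{
    for (amrex::MFIter mfi(state, amrex::TilingIfNotGPU()); mfi.isValid(); ++mfi)
    {
        const amrex::Box& bx = mfi.tilebox();
        amrex::FArrayBox tmp(bx, num_species, amrex::The_Async_Arena()); // freed after use
        amrex::ignore_unused(tmp); // ... fill and consume tmp here ...
    }
}
```

As a rough order of magnitude: with ~26M cells across both levels and the ~21 species of drm19, one species-sized set of MultiFabs is about 26e6 × 21 × 8 B ≈ 4.4 GB, so keeping even a handful of such intermediates resident adds up quickly.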
-
I only skimmed all this, but it looks like you are aware that AMReX uses a memory pool; by default it allocates something like 80% of GPU memory at initialization. It looks like you're using …
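For reference, the arena-related runtime parameters look something like this (a sketch only; verify names and defaults against the AMReX GPU documentation for your version):

```
# Sketch only -- check names/defaults for your AMReX version.
amrex.the_arena_init_size = 0          # start the device arena small and let it grow on demand,
                                       # instead of reserving most of GPU memory at startup
amrex.the_arena_is_managed = 0         # plain device memory rather than managed/unified memory
amrex.abort_on_out_of_gpu_memory = 1   # fail with a clear message when the arena cannot grow
```

Also note that an external monitor such as nvitop reports the arena reservation, not what the solver is actually using at that moment, so a reported number can reflect the pool size as much as the true working set.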
-
On the memory requirements, 2.5x is not surprising to me either. There are a couple of structural reasons for this:
On PeleC vs. PeleLMeX, we did some comparisons on the ECP challenge problem a few years back, and PeleLMeX was about 10x faster. But this is highly dependent on your CFL number and the stiffness of the chemistry.
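Roughly speaking, if both codes run at their explicit CFL limits and the per-step chemistry cost is comparable, the time-step ratio scales like the inverse Mach number (a back-of-the-envelope sketch; $c$ is the sound speed):

```math
\Delta t_{\text{PeleC}} \lesssim \mathrm{CFL}\,\frac{\Delta x}{|u|+c},
\qquad
\Delta t_{\text{PeleLMeX}} \lesssim \mathrm{CFL}\,\frac{\Delta x}{|u|},
\qquad
\frac{\Delta t_{\text{PeleLMeX}}}{\Delta t_{\text{PeleC}}}
 \sim \frac{|u|+c}{|u|} \approx \frac{1}{\mathrm{Ma}}
 \quad (\mathrm{Ma} \ll 1).
```

That is broadly where a ~10x figure can come from, and the advantage shrinks as the Mach number grows or when stiff chemistry dominates the per-step cost.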
-
Hello,
I have recently been testing whether PeleLMeX would offer performance improvements over PeleC for some of my simulations. I was successfully able to restart from a plot file and get LMeX running, but I am noticing that the memory usage of PeleLMeX appears to be ~twice that of PeleC in my tests.
I have configured the input files for PeleC and PeleLMeX such that the cases are as identical as possible. I have compiled both with the same compilation flags and libraries, and have set `TINY_PROFILE = TRUE` and `MEM_PROFILE = TRUE` for both. The case is a bluff-body-stabilized methane-air flame, using EB for the geometry and drm19 for the chemistry; it runs on a base grid of `432 x 144 x 72` with one level of AMR (refining on temperature, vorticity magnitude, and Y_CO). I have been testing the codes on 8x NVIDIA A100-40GB cards, compiled with CUDA 12.6. With AMR, both codes have approximately 26M cells (~4.5M on level 0, ~21M on level 1) and share the same AMR settings.
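For illustration, the kind of AMReX grid parameters I mean are the standard inputs shown below (values other than the base grid are placeholders, not my actual inputs):

```
# Illustrative only -- not the actual inputs from this case.
amr.n_cell          = 432 144 72   # base grid quoted above
amr.max_level       = 1            # one level of refinement
amr.ref_ratio       = 2
amr.blocking_factor = 16           # placeholder value
amr.max_grid_size   = 64           # placeholder value
```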
I did have to disable regridding in PeleLMeX due to running out of memory. I have also run other tests varying a few of the parameters that come up in my GitHub searches, but changing these did not appear to make much of a difference (besides `amrex.the_arena_is_managed=0`, which led to an immediate OOM crash in LMeX).

From my tests, I have found that PeleLMeX uses around twice the memory of PeleC. While the PeleC runs use around 160GB of GPU memory, PeleLMeX uses the full 280GB that I have allocated (which I confirmed with `nvitop`), and it still doesn't appear to be enough. I did another test of LMeX with `amr.max_level=0`, again restarting from a plot file, and even then the utilization was hovering close to what PeleC uses with one level of AMR.

I am wondering if this is expected behavior caused by the differences in numerical schemes between the codes, or if there is something wrong with how I have built and run PeleLMeX. I have attached the run outputs (including TinyProfiler output at the bottom) from the three tests: one for PeleC, one for LMeX, and one for LMeX with AMR disabled.
PeleC.log
PeleLMeX_noAMR.log
PeleLMeX_withAMR.log
Thank you for your help!