Estimating the memory needed for ESMF_RegridWeightGen #357
Replies: 9 comments
Hi David,
That seems like a lot of memory for that case. Would you mind trying it with the argument -p none? That turns off some extra extrapolation at the pole, which could potentially add more memory. If it gives you an unmapped point error, use the flag -i to turn that error off.
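(Concretely, the command from the question quoted below would become the following with that change; every flag here comes from this thread.)

    ESMF_RegridWeightGen -p none -i --ignore_degenerate \
        -s src.nc -d dst.nc -m bilinear -w w.nc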
Let me know if that helps. If not, is there someplace where I can get the files to try (e.g. do you have an account on Derecho)?
- Bob
On Feb 21, 2025, at 5:46 AM, David Hassell wrote:
Requirements
Reviewed ESMF Reference Manual <https://earthsystemmodeling.org/doc/>
Searched GitHub Discussions <https://github.com/orgs/esmf-org/discussions?discussions_q=>
Affiliation(s)
NCAS
ESMF Version
v8.8.0b0
Issue
Hello,
Is it possible to estimate physical memory required for a weights calculation?
I'm using
ESMF_RegridWeightGen -i --ignore_degenerate -s src.nc -d dst.nc -m bilinear -w w.nc
where src.nc is 768 x 768 = 589,824 grid points, and dst.nc is 43200 x 4200 = 181,440,000 grid points.
I have attempted to run this on 3072 PEs where each of the 48 groups of 64 PEs has shared access to 512 GB of RAM, giving a total of ~24 TB, but I'm still getting out-of-memory errors.
It's wholly possible that my parallelised setup is wrong (!), but to try to diagnose if that's the case, or if I just need more resources, it would be useful to know what the memory requirement ought to be.
Many thanks,
David
Autotag
@oehmke <https://github.com/oehmke>
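For a rough sense of scale on the question above - a back-of-envelope sketch under stated assumptions, not an ESMF formula - bilinear regridding produces at most four weights per destination point, so the weight matrix itself is comparatively small:

    # Assumptions: <= 4 non-zero weights per destination point (bilinear),
    # one 8-byte double per weight plus two 4-byte integer indices per entry.
    # This bounds weight storage only, not peak runtime memory.
    DST=$((43200 * 4200))       # 181,440,000 destination points
    ENTRIES=$((DST * 4))        # ~726 million sparse-matrix entries
    BYTES=$((ENTRIES * 16))     # ~11.6 GB for the matrix itself
    echo "$((BYTES / 2**30)) GiB"   # prints: 10 GiB

Since ~10 GiB is tiny next to ~24 TB, an out-of-memory failure here is more plausibly a per-PE peak during grid creation or the search phase, or a per-task memory cap imposed by the launcher, than the weights themselves.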
Hi Bob, Thanks! Both grids include the south pole (not the north) - I shall try with -p none. David
Hi Bob, still fails on memory. I don't have anywhere handy to put the src and dst files - they're only 8 MB - I could upload them to the discussion, if that's OK. Thanks, David
Too bad that didn’t work, but yep, that’s fine, just attach them and I’ll take a look.
Thanks,
- Bob
Thank you! The files are attached. The command is:
ESMF_RegridWeightGen -p none -i --ignore_degenerate -s src.nc -d dst.nc -m bilinear -w w.nc -t CFGRID --netcdf4 --src_regional
David
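For readers following the flags as they accumulate, here is the same command restated with the flag meanings paraphrased from the ESMF_RegridWeightGen documentation (worth double-checking against the manual for your version):

    # -p none             : no pole handling/extrapolation
    # -i                  : ignore unmapped destination points
    # --ignore_degenerate : skip degenerate cells instead of erroring
    # -t CFGRID           : both grid files are CF-convention NetCDF
    # --netcdf4           : write the weight file in NetCDF-4 format
    # --src_regional      : treat the source grid as regional, not periodic
    ESMF_RegridWeightGen -p none -i --ignore_degenerate \
        -s src.nc -d dst.nc -m bilinear -w w.nc \
        -t CFGRID --netcdf4 --src_regional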
Hi David, I tried this and it looks like it worked. I ran it on 1024 PEs, where each of the 32 groups of 32 PEs has access to 256 GB. This is less than your case, so I wonder if something else is going wrong. Maybe try the above layout and see if it works for you? If not, let me know how I can help figure out what's going wrong. Thanks,
Hi Bob, Many thanks for trying and succeeding! I have tried the layout you used (1024 PEs, each group of 32 having shared access to 512 GB of memory), and still suffered an out-of-memory error on a PE. I am using slurm's srun (as opposed to, say, mpirun) - ought that to be OK? Also, I am using the executable as installed with esmpy. David
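One srun-specific thing worth ruling out: if the batch request leaves Slurm's default per-task or per-node memory limit in place, the cgroup limit can OOM-kill a PE even when the node has plenty of free RAM. A sketch of a Slurm request matching the layout above (hypothetical script; partition, account, and site defaults omitted and vary by machine):

    #!/bin/bash
    #SBATCH --nodes=32              # 32 groups of PEs
    #SBATCH --ntasks-per-node=32    # 32 PEs per node -> 1024 PEs total
    #SBATCH --mem=0                 # request all of each node's memory
    srun ESMF_RegridWeightGen -p none -i --ignore_degenerate \
        -s src.nc -d dst.nc -m bilinear -w w.nc \
        -t CFGRID --netcdf4 --src_regional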
Hi David,
I think those should both be ok. However, if it’s not too hard to use mpirun, it might be worth a shot to try it. I’m just trying to eliminate differences in our two cases to see where the problem may be. What version of ESMF are you using? I was using the latest.
- Bob
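(For reference, the equivalent launch with mpirun would be something like the line below; -np is common to OpenMPI and MPICH, but exact option spellings vary between MPI implementations.)

    mpirun -np 1024 ESMF_RegridWeightGen -p none -i --ignore_degenerate \
        -s src.nc -d dst.nc -m bilinear -w w.nc \
        -t CFGRID --netcdf4 --src_regional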
Hi Bob, I'm using the latest, I think:
$ ESMF_RegridWeightGen -V
ESMF_VERSION_STRING: 8.8.0
Apparently mpirun is not available on the machine I've been using. (I've just realized that it might be on another - I'm going to check...) Cheers, David
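(For anyone replicating that check, these are standard shell commands, nothing ESMF- or site-specific:)

    command -v mpirun || echo "no mpirun on PATH"   # is an MPI launcher available?
    ESMF_RegridWeightGen -V                         # prints the ESMF_VERSION_STRING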