Estimating the memory needed for ESMF_RegridWeightGen #357
Replies: 9 comments
Hi David,
That seems like a lot of memory for that case. Would you mind trying it with the argument -p none? That turns off some extra extrapolation at the pole, which could potentially add more memory. If it gives you an unmapped point error, use the flag -i to turn that error off.
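(Concretely, the command from the question quoted below would become the following with that change; every flag here comes from this thread.)

    ESMF_RegridWeightGen -p none -i --ignore_degenerate \
        -s src.nc -d dst.nc -m bilinear -w w.nc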
Let me know if that helps. If not, is there someplace where I can get the files to try (e.g. do you have an account on Derecho)?
- Bob
On Feb 21, 2025, at 5:46 AM, David Hassell wrote:
Requirements
Reviewed ESMF Reference Manual <https://earthsystemmodeling.org/doc/>
Searched GitHub Discussions <https://github.com/orgs/esmf-org/discussions?discussions_q=>
Affiliation(s)
NCAS
ESMF Version
v8.8.0b0
Issue
Hello,
Is it possible to estimate physical memory required for a weights calculation?
I'm using
ESMF_RegridWeightGen -i --ignore_degenerate -s src.nc -d dst.nc -m bilinear -w w.nc
where src.nc is 768 x 768 = 589,824 grid points, and dst.nc is 43200 x 4200 = 181,440,000 grid points.
I have attempted to run this on 3072 PEs where each of the 48 groups of 64 PEs has shared access to 512 GB of RAM, giving a total of ~24 TB, but I'm still getting out-of-memory errors.
It's wholly possible that my parallelised setup is wrong (!), but to try to diagnose if that's the case, or if I just need more resources, it would be useful to know what the memory requirement ought to be.
Many thanks,
David
Autotag
@oehmke <https://github.com/oehmke>
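For a rough sense of scale on the question above - a back-of-envelope sketch under stated assumptions, not an ESMF formula - bilinear regridding produces at most four weights per destination point, so the weight matrix itself is comparatively small:

    # Assumptions: <= 4 non-zero weights per destination point (bilinear),
    # one 8-byte double per weight plus two 4-byte integer indices per entry.
    # This bounds weight storage only, not peak runtime memory.
    DST=$((43200 * 4200))       # 181,440,000 destination points
    ENTRIES=$((DST * 4))        # ~726 million sparse-matrix entries
    BYTES=$((ENTRIES * 16))     # ~11.6 GB for the matrix itself
    echo "$((BYTES / 2**30)) GiB"   # prints: 10 GiB

Since ~10 GiB is tiny next to ~24 TB, an out-of-memory failure here is more plausibly a per-PE peak during grid creation or the search phase, or a per-task memory cap imposed by the launcher, than the weights themselves.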
Hi Bob, Thanks! Both grids include the south pole (not the north) - I shall try with -p none. David
Hi Bob, still fails on memory. I don't have anywhere handy to put the src and dst files - they're only 8 MB - I could upload them to the discussion, if that's OK. Thanks, David
Too bad that didn’t work, but yep, that’s fine, just attach them and I’ll take a look.
Thanks,
- Bob
Thank you! The files are attached. The command is:
ESMF_RegridWeightGen -p none -i --ignore_degenerate -s src.nc -d dst.nc -m bilinear -w w.nc -t CFGRID --netcdf4 --src_regional
David
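For readers following the flags as they accumulate, here is the same command restated with the flag meanings paraphrased from the ESMF_RegridWeightGen documentation (worth double-checking against the manual for your version):

    # -p none             : no pole handling/extrapolation
    # -i                  : ignore unmapped destination points
    # --ignore_degenerate : skip degenerate cells instead of erroring
    # -t CFGRID           : both grid files are CF-convention NetCDF
    # --netcdf4           : write the weight file in NetCDF-4 format
    # --src_regional      : treat the source grid as regional, not periodic
    ESMF_RegridWeightGen -p none -i --ignore_degenerate \
        -s src.nc -d dst.nc -m bilinear -w w.nc \
        -t CFGRID --netcdf4 --src_regional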
Hi David, I tried this and it looks like it worked. I ran it on 1024 PEs, where each of the 32 groups of 32 PEs has access to 256 GB. This is less than your case, so I wonder if something else is going wrong. Maybe try the above layout and see if it works for you? If not, let me know how I can help figure out what's going wrong. Thanks,
Hi Bob, Many thanks for trying and succeeding! I have tried the layout you used (1024 PEs, each group of 32 having shared access to 512 GB of memory), and still suffered an out-of-memory error on a PE. I am using slurm's srun (as opposed to, say, mpirun) - ought that to be OK? Also, I am using the executable as installed with esmpy. David
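One srun-specific thing worth ruling out: if the batch request leaves Slurm's default per-task or per-node memory limit in place, the cgroup limit can OOM-kill a PE even when the node has plenty of free RAM. A sketch of a Slurm request matching the layout above (hypothetical script; partition, account, and site defaults omitted and vary by machine):

    #!/bin/bash
    #SBATCH --nodes=32              # 32 groups of PEs
    #SBATCH --ntasks-per-node=32    # 32 PEs per node -> 1024 PEs total
    #SBATCH --mem=0                 # request all of each node's memory
    srun ESMF_RegridWeightGen -p none -i --ignore_degenerate \
        -s src.nc -d dst.nc -m bilinear -w w.nc \
        -t CFGRID --netcdf4 --src_regional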
Hi David,
I think those should both be ok. However, if it’s not too hard to use mpirun, it might be worth a shot to try it. I’m just trying to eliminate differences in our two cases to see where the problem may be. What version of ESMF are you using? I was using the latest.
- Bob
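(For reference, the equivalent launch with mpirun would be something like the line below; -np is common to OpenMPI and MPICH, but exact option spellings vary between MPI implementations.)

    mpirun -np 1024 ESMF_RegridWeightGen -p none -i --ignore_degenerate \
        -s src.nc -d dst.nc -m bilinear -w w.nc \
        -t CFGRID --netcdf4 --src_regional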
Hi Bob, I'm using the latest, I think:
$ ESMF_RegridWeightGen -V
ESMF_VERSION_STRING: 8.8.0
Apparently mpirun is not available on the machine I've been using. (I've just realized that it might be on another - I'm going to check...) Cheers, David
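(For anyone replicating that check, these are standard shell commands, nothing ESMF- or site-specific:)

    command -v mpirun || echo "no mpirun on PATH"   # is an MPI launcher available?
    ESMF_RegridWeightGen -V                         # prints the ESMF_VERSION_STRING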