NaNs detected in repro sum input #7271
-
Hello everyone, any help is appreciated. EAMxx used to work fine with the parameters I gave it. Then I edited the code to add the Kokkos profiling library, names for the `parallel_for` calls that support them, and region names; that still worked. Then, while I was trying to profile with the NVIDIA tools, something changed (maybe in the underlying software versions on the system) and I now see the following. I can't make sense of it, so any help would be great. Thank you in advance. The srun command is:
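For context, the kind of source change described above, naming kernels and adding profiling regions, typically looks like the sketch below. This is a generic illustration of the Kokkos profiling API, not the actual EAMxx edits; the kernel and region names are made up.

```cpp
#include <Kokkos_Core.hpp>

void compute(int n) {
  Kokkos::View<double*> data("data", n);

  // A labeled kernel: the string shows up in Kokkos-aware profiling
  // tools (e.g. when a kokkos-tools library is loaded at run time).
  Kokkos::parallel_for("my_compute_kernel", n, KOKKOS_LAMBDA(const int i) {
    data(i) = 2.0 * i;
  });

  // Explicit region markers around a larger section of code.
  Kokkos::Profiling::pushRegion("my_region");
  // ... more work ...
  Kokkos::Profiling::popRegion();
}
```

Labels and regions like these are inert when no profiling tool is attached, which is why adding them alone would not be expected to change numerical behavior.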
-
@mihelog I edited your post slightly for better readability (code blocks). Is this error reproducible? If so, additionally: how are you triggering this command?
Are you on a compute node with GPUs? If so, please make sure of the two following items. First, that you have activated only the corresponding run environment, i.e.,
Second, make the srun command GPU-aware with cc @ndkeen
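As a generic illustration (the flag values here are assumptions, not the thread's actual command), a GPU-aware srun on a Slurm system with four GPUs per node might look like:

```shell
# Request the GPUs explicitly and bind each task to a nearby GPU
# (executable name and counts are placeholders).
srun --nodes=1 --ntasks-per-node=4 \
     --gpus-per-node=4 --gpu-bind=closest \
     ./eamxx_executable
```

Without the GPU flags on srun itself, tasks can end up launched without GPU access even when the surrounding sbatch allocation requested GPUs.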
-
[Moving to a discussion, since our extensive testing didn't detect this issue; it is likely a user-side configuration issue that we can continue discussing here.]
-
First off, thank you SO MUCH for your quick responses! I'm grateful for that. Likewise, thanks so much in advance for any further help. I'm invoking this through a complicated script made for my local system that used to work; I don't know whether my changes broke it or a system update did. Here's what I did:
1. I added "-G 4" to the srun invocation. No change. The sbatch script that ran also has "--gpus-per-node=4". To be exact, this is what it has:
2. I tried invoking it through a "clean" interactive session, but that threw another error, likely unrelated and due to that environment; FYI, it is: "ERROR: (cime_cpl_init) :: namelist read returns an end of file or end of record condition".
3. I cleaned up my environment variables to only the system defaults plus these two user-defined ones:
4. I confirmed that the second file you pointed to above runs, so the environment variables specified in it are loaded. Likewise, all the modules in that script are loaded. The following are loaded in addition, as system defaults:
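One more sanity check that may help alongside the steps above (generic commands, not taken from the thread): confirm from inside the allocation that the tasks actually see the GPUs and that the expected modules are loaded.

```shell
# Inside the job allocation: can a task see the GPUs?
srun --ntasks=1 nvidia-smi

# What Slurm believes was granted to this job.
echo "GPUs per node: $SLURM_GPUS_PER_NODE"

# Currently loaded modules (Lmod / Environment Modules).
module list
```

If `nvidia-smi` fails here while the sbatch allocation requested GPUs, the problem is in how srun inherits the GPU request rather than in the application itself.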
-
Agreed. I'm invoking a custom script (not written by me, unfortunately) that uses sbatch to submit another sbatch script, which includes the srun command. I'm puzzled here because it did use to work, broke at some point, and remained broken even after I reverted my changes. What I did was add some Kokkos profiling region names in the source code, link against the Kokkos library, and add the NVIDIA profiling tools to srun. That worked well, but even removing them again didn't fix anything. So who knows. As a reminder, this is the srun command that ends up being executed through sbatch: Anyway, thanks in advance again. Here is the script.
-
I see what you mean. But it's the new repo I was using: when my collaborator gave me the script, the fetch function was already inactive (do_fetch_code=false), and I pulled the new repo manually, so that should not be a problem. It is, though, an indication that maybe the script I'm using is old. Apologies if I'm missing something basic, by the way. When I fetch and update the submodules, I get the error at the end of this post when running the script. I tried with several Python versions (all 3.x, though). Maybe my script is old too? When I restore my script edits in the "cime" directory, nothing changes, which makes sense because I had commented out all my changes. A diff between the version of the "eamxx" directory I was using and this new one shows many changes, so it's possible I was using an older version. Could it be an input-file problem too? I'm including the input .yaml file below just in case.
YAML FILE
THE ERROR
-
It doesn't want to cooperate
-
Hello everyone, since I have the author of the scripts here, I was wondering if I can get help on something different. I'm trying to add a constraint in the Slurm flags. The equivalent in an sbatch script would be -C "gpu&hbm80g" (these are the higher-memory GPUs at NERSC). I can't figure out where to add this constraint in the collection of scripts. Any ideas? Thank you!
-
There are several ways to do it. This seems to work:
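For future readers, one generic route (assuming a standard CIME case directory, and that your CIME version supports the `--batch-args` option on `case.submit`; the constraint value is the one from the question) is to pass the extra Slurm flag straight through at submission time:

```shell
# From the CIME case directory: forward extra flags to sbatch.
# "-C gpu&hbm80g" requests the higher-memory GPU nodes at NERSC.
./case.submit --batch-args="-C gpu&hbm80g"
```

Alternatively, batch directives can be adjusted persistently via `xmlchange` on the batch-related XML settings, which is closer to what the follow-up replies in this thread discuss.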
-
Thank you both! I figured out xmlchange, but that didn't work for me since I had a layer of scripts that generated more scripts... and yada yada. But the environment variable and case.submit led me down the right path, and I figured out where to append to the parameters string. Thank you again so much!
That's fine; it will just mean you're not outputting anything from the EAMxx side as part of the simulation. If you're interested in profiling the IO layer, you can try to insert it back carefully, but I think the IO layer isn't as interesting to profile, so I would recommend ignoring it, at least to start.