NaNs detected in repro sum input #7271
-
Hello everyone, any help is appreciated. EAMxx used to work fine with the parameters I gave it. Then I edited the code to add the Kokkos profiling library, names for the `parallel_for` calls that support them, and region names; that still worked. Then, while I was trying to profile with the NVIDIA tools, something changed (maybe in the underlying software versions on the system) and I now see the following. I can't make sense of it, so any help would be great. Thank you in advance. The srun command is:
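For context, the kind of source change described above, naming kernels and adding profiling regions, typically looks like the sketch below. This is a generic illustration of the Kokkos profiling API, not the actual EAMxx edits; the kernel and region names are made up.

```cpp
#include <Kokkos_Core.hpp>

void compute(int n) {
  Kokkos::View<double*> data("data", n);

  // A labeled kernel: the string shows up in Kokkos-aware profiling
  // tools (e.g. when a kokkos-tools library is loaded at run time).
  Kokkos::parallel_for("my_compute_kernel", n, KOKKOS_LAMBDA(const int i) {
    data(i) = 2.0 * i;
  });

  // Explicit region markers around a larger section of code.
  Kokkos::Profiling::pushRegion("my_region");
  // ... more work ...
  Kokkos::Profiling::popRegion();
}
```

Labels and regions like these are inert when no profiling tool is attached, which is why adding them alone would not be expected to change numerical behavior.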
-
@mihelog I edited your post slightly for better readability (code blocks). Is this error reproducible? If so, additionally: how are you triggering this command?
Are you on a compute node with GPUs? If so, please make sure of the two following items. First, that you have activated only the corresponding run environment, i.e.,
Second, make the srun command GPU-aware with cc @ndkeen
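As a generic illustration (the flag values here are assumptions, not the thread's actual command), a GPU-aware srun on a Slurm system with four GPUs per node might look like:

```shell
# Request the GPUs explicitly and bind each task to a nearby GPU
# (executable name and counts are placeholders).
srun --nodes=1 --ntasks-per-node=4 \
     --gpus-per-node=4 --gpu-bind=closest \
     ./eamxx_executable
```

Without the GPU flags on srun itself, tasks can end up launched without GPU access even when the surrounding sbatch allocation requested GPUs.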
-
[Moving to a discussion, since our extensive testing didn't detect this issue; it is likely a user-side configuration issue that we can continue discussing here.]
-
First off, thank you SO MUCH for your quick responses! I'm grateful for that. Likewise, thanks so much in advance for any further help. I'm invoking this through a complicated script made for my local system that used to work; I don't know whether my changes broke it or a system update did. Here's what I did:
1. I added "-G 4" to the srun invocation. No change. The sbatch script that ran also has "--gpus-per-node=4". To be exact, this is what it has:
2. I tried invoking it through a "clean" interactive session, but that threw another error, likely unrelated and due to that environment; FYI, it is: "ERROR: (cime_cpl_init) :: namelist read returns an end of file or end of record condition".
3. I cleaned up my environment variables to only the system defaults plus these two user-defined ones:
4. I confirmed that the second file you pointed to above runs, so the environment variables specified in it are loaded. Likewise, all the modules in that script are loaded. The following are loaded in addition, as system defaults:
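One more sanity check that may help alongside the steps above (generic commands, not taken from the thread): confirm from inside the allocation that the tasks actually see the GPUs and that the expected modules are loaded.

```shell
# Inside the job allocation: can a task see the GPUs?
srun --ntasks=1 nvidia-smi

# What Slurm believes was granted to this job.
echo "GPUs per node: $SLURM_GPUS_PER_NODE"

# Currently loaded modules (Lmod / Environment Modules).
module list
```

If `nvidia-smi` fails here while the sbatch allocation requested GPUs, the problem is in how srun inherits the GPU request rather than in the application itself.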
-
Agreed. I'm invoking a custom script (not written by me, unfortunately) that uses sbatch to submit another sbatch script, which includes the srun command. I'm puzzled here because it did use to work, broke at some point, and remained broken even after I reverted my changes. What I did was add some Kokkos profiling region names in the source code, link against the Kokkos library, and add the NVIDIA profiling tools to srun. That worked well, but even removing them again didn't fix anything. So who knows. As a reminder, this is the srun command that ends up being executed through sbatch: Anyway, thanks in advance again. Here is the script.
-
I see what you mean. But it's the new repo I was using: when my collaborator gave me the script, the fetch function was already inactive (do_fetch_code=false), and I pulled the new repo manually, so that should not be a problem. It is, though, an indication that maybe the script I'm using is old. Apologies if I'm missing something basic, by the way. When I fetch and update the submodules, I get the error at the end of this post when running the script. I tried with several Python versions (all 3.x, though). Maybe my script is old too? When I restore my script edits in the "cime" directory, nothing changes, which makes sense because I had commented out all my changes. A diff between the version of the "eamxx" directory I was using and this new one shows many changes, so it's possible I was using an older version. Could it be an input-file problem too? I'm including the input .yaml file below just in case.
YAML FILE
THE ERROR
-
It doesn't want to cooperate
-
Hello everyone, since I have the author of the scripts here, I was wondering if I can get help on something different. I'm trying to add a constraint in the Slurm flags. The equivalent in an sbatch script would be -C "gpu&hbm80g" (these are the higher-memory GPUs at NERSC). I can't figure out where to add this constraint in the collection of scripts. Any ideas? Thank you!
-
There are several ways to do it. This seems to work:
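For future readers, one generic route (assuming a standard CIME case directory, and that your CIME version supports the `--batch-args` option on `case.submit`; the constraint value is the one from the question) is to pass the extra Slurm flag straight through at submission time:

```shell
# From the CIME case directory: forward extra flags to sbatch.
# "-C gpu&hbm80g" requests the higher-memory GPU nodes at NERSC.
./case.submit --batch-args="-C gpu&hbm80g"
```

Alternatively, batch directives can be adjusted persistently via `xmlchange` on the batch-related XML settings, which is closer to what the follow-up replies in this thread discuss.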
-
Thank you both! I figured out xmlchange, but that didn't work for me since I had a layer of scripts that generated more scripts... and yada yada. But the environment variable and case.submit led me down the right path, and I figured out where to append to the parameters string. Thank you again so much!
That's fine; it will just mean you're not outputting anything from the EAMxx side as part of the simulation. If you're interested in profiling the IO layer, you can try to insert it back carefully, but I think the IO layer isn't as interesting to profile, so I would recommend ignoring it, at least to start.