Debugging PI crash #88
base: main
…t will also run on cascadelake, broadwell)
…-parens to limit re-ordering)
… the dependency versions
!redeploy |
🚀 Attempted to deploy 🖥️
|
This last build (with
|
@manodeep I'm a bit confused - why are so many errors popping up now that should have been revealed before? |
@JhanSrbinovsky The latest round of errors does not look like a compiler issue - I will ask within the team. But the Canberra folks are on a public holiday today - so it will have to wait till tomorrow. In general, changing the compiler (in a complex codebase) can lead to all kinds of weird errors - different assumptions being made by the compiler leading to different generated code and consequently different runtime behaviour. That in turn can uncover hidden bugs in the source, or even compiler bugs themselves. I will keep digging and figure out an efficient working solution (we already know that building everything with |
@manodeep you're not in Canberra? Anyway, I misunderstood: I assumed these errors were popping up WHEN you used -O0. We constantly reveal hidden bugs that -O2 lets slide when we revert to zero optimisation. It is surprising that it is the other way around. OR do you mean that mixing optimisation levels doesn't work? That wouldn't surprise me greatly, not that I have any direct experience trying to do that (deliberately). |
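To make the point about re-ordering concrete: floating-point addition is not associative, so a compiler that is free to regroup an expression can produce a different result from the one the source order implies. The sketch below is in Python rather than Fortran and is not taken from the model code; the function names are hypothetical and exist only for illustration.

```python
def sum_as_written(a: float, b: float, c: float) -> float:
    """Evaluate (a + b) + c, following the parentheses in the source."""
    return (a + b) + c


def sum_reassociated(a: float, b: float, c: float) -> float:
    """Evaluate a + (b + c), one regrouping an optimiser could choose."""
    return a + (b + c)


if __name__ == "__main__":
    big, small = 1.0e16, 1.0
    # The large terms cancel first, so the small term survives.
    print(sum_as_written(big, -big, small))
    # The small term is absorbed into -big before the cancellation.
    print(sum_reassociated(big, -big, small))
```

With these inputs the two groupings give 1.0 and 0.0 respectively, which is the kind of difference that can tip a later comparison or divide into a runtime error.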
…ending for loops. Also removed the -O1 flag
!redeploy |
I am in Melbourne and follow the VIC public holidays - so today is a working day for me :) The divide-by-zero error shows up with |
🚀 Attempted to deploy 🖥️
|
The last build crashes with this error:
forrtl: error (73): floating divide by zero
Image PC Routine Line Source
libpthread-2.28.s 0000146990AF5990 Unknown Unknown Unknown
fms_ACCESS-ESM.x 000000000083147F thickness_restart 2260 ocean_thickness.F90
fms_ACCESS-ESM.x 00000000007F4FDE ocean_thickness_i 633 ocean_thickness.F90
fms_ACCESS-ESM.x 000000000045DFB6 ocean_model_init 1269 ocean_model.F90
fms_ACCESS-ESM.x 000000000043219C main 371 ocean_solo.F90
fms_ACCESS-ESM.x 000000000041319D Unknown Unknown Unknown
....
That offending line of code looks to be another divide. Takeaways:
|
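When reading tracebacks like the one above, the useful frame is usually the first one that carries a real line number. A minimal sketch of pulling that frame out of a pasted traceback follows. It assumes the five-column `Image PC Routine Line Source` layout shown in this thread; the `Frame` type and the `first_located_frame` helper are hypothetical names, not part of any existing tool.

```python
import re
from typing import NamedTuple, Optional


class Frame(NamedTuple):
    image: str
    routine: str
    line: int
    source: str


# Columns: image, program counter (hex), routine, line number, source file.
# Frames whose line is "Unknown" do not match because a digit run is required.
_FRAME_RE = re.compile(
    r"^(\S+)\s+([0-9A-Fa-f]{8,})\s+(\S+)\s+(\d+)\s+(\S+)\s*$"
)


def first_located_frame(traceback_text: str) -> Optional[Frame]:
    """Return the first frame that has a numeric line, or None."""
    for raw in traceback_text.splitlines():
        match = _FRAME_RE.match(raw.strip())
        if match:
            image, _pc, routine, line, source = match.groups()
            return Frame(image, routine, int(line), source)
    return None


if __name__ == "__main__":
    sample = """forrtl: error (73): floating divide by zero
Image PC Routine Line Source
libpthread-2.28.s 0000146990AF5990 Unknown Unknown Unknown
fms_ACCESS-ESM.x 000000000083147F thickness_restart 2260 ocean_thickness.F90
fms_ACCESS-ESM.x 00000000007F4FDE ocean_thickness_i 633 ocean_thickness.F90
"""
    print(first_located_frame(sample))
```

For the traceback above this points at `thickness_restart`, line 2260 of `ocean_thickness.F90`.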
…o the if condition.
!redeploy |
Here is the MOM5 commit for the latest build |
🚀 Attempted to deploy 🖥️
|
!redeploy |
Seems to be an unrelated build failure coming from UM7 - here are the contents of
==> [2025-06-02-14:32:36.983157] um7: Executing phase: 'edit'
==> [2025-06-02-14:32:36.985170] Find (not recursive): /g/data/vk83/prerelease/apps/spack/0.22/restricted/ukmo/release/linux-rocky8-x86_64_v3/oneapi-2025.0.4/oasis3-mct-git.7036f26ece68c26083fec2fe96e3cb1faed7559d_access-esm1.5-mvmism73hrpqzydjynrcoo2ezuzobaev/lib ['libpsmile.MPI1.a', 'libmct.a', 'libmpeu.a', 'libscrip.a']
==> [2025-06-02-14:32:36.985486] Find complete: /g/data/vk83/prerelease/apps/spack/0.22/restricted/ukmo/release/linux-rocky8-x86_64_v3/oneapi-2025.0.4/oasis3-mct-git.7036f26ece68c26083fec2fe96e3cb1faed7559d_access-esm1.5-mvmism73hrpqzydjynrcoo2ezuzobaev/lib ['libpsmile.MPI1.a', 'libmct.a', 'libmpeu.a', 'libscrip.a']
==> [2025-06-02-14:32:36.986980] Find (not recursive): /g/data/vk83/prerelease/apps/spack/0.22/restricted/ukmo/release/linux-rocky8-x86_64_v3/oneapi-2025.0.4/netcdf-fortran-4.5.2-gzopzrncyjfmvctitb75t7hstkv23cpy/lib ['libnetcdff.so']
==> [2025-06-02-14:32:36.987081] Find complete: /g/data/vk83/prerelease/apps/spack/0.22/restricted/ukmo/release/linux-rocky8-x86_64_v3/oneapi-2025.0.4/netcdf-fortran-4.5.2-gzopzrncyjfmvctitb75t7hstkv23cpy/lib ['libnetcdff.so']
==> [2025-06-02-14:32:36.987979] Find (not recursive): /g/data/vk83/prerelease/apps/spack/0.22/restricted/ukmo/release/linux-rocky8-x86_64_v3/oneapi-2025.0.4/dummygrib-1.0-p3e3j56ivaraifnitkmrj5my4nz4tafl/lib ['libdummygrib.so']
==> [2025-06-02-14:32:36.988078] Find complete: /g/data/vk83/prerelease/apps/spack/0.22/restricted/ukmo/release/linux-rocky8-x86_64_v3/oneapi-2025.0.4/dummygrib-1.0-p3e3j56ivaraifnitkmrj5my4nz4tafl/lib ['libdummygrib.so']
==> [2025-06-02-14:32:36.988130] Find (recursive): /g/data/vk83/prerelease/apps/spack/0.22/restricted/ukmo/release/linux-rocky8-x86_64_v3/oneapi-2025.0.4/dummygrib-1.0-p3e3j56ivaraifnitkmrj5my4nz4tafl ['libdummygrib.so']
==> [2025-06-02-14:32:36.992092] Find complete: /g/data/vk83/prerelease/apps/spack/0.22/restricted/ukmo/release/linux-rocky8-x86_64_v3/oneapi-2025.0.4/dummygrib-1.0-p3e3j56ivaraifnitkmrj5my4nz4tafl ['libdummygrib.so']
==> [2025-06-02-14:32:36.992185] Find (not recursive): /g/data/vk83/prerelease/apps/spack/0.22/restricted/ukmo/release/linux-rocky8-x86_64_v3/oneapi-2025.0.4/dummygrib-1.0-p3e3j56ivaraifnitkmrj5my4nz4tafl/lib ['libdummygrib.a']
==> [2025-06-02-14:32:36.992266] Find complete: /g/data/vk83/prerelease/apps/spack/0.22/restricted/ukmo/release/linux-rocky8-x86_64_v3/oneapi-2025.0.4/dummygrib-1.0-p3e3j56ivaraifnitkmrj5my4nz4tafl/lib ['libdummygrib.a']
==> [2025-06-02-14:32:37.056323] um7: Executing phase: 'build'
==> [2025-06-02-14:32:37.058304] '/g/data/vk83/prerelease/apps/spack/0.22/restricted/ukmo/release/linux-rocky8-x86_64_v3/oneapi-2025.0.4/fcm-2021.05.0-537j3rk2xhgg4g54ee5f3tlbbj7thhyb/bin/fcm' 'build' '-f' '-j' '4' 'ummodel_hg3/cfg/bld-hadgem3-spack.cfg'
Build command started on Mon Jun 2 14:32:38 2025.
->Parse configuration: start
Config file (bld): ummodel_hg3/cfg/bld-hadgem3-spack.cfg
Config file (bld): /scratch/tm70/tm70_ci/tmp/restricted/spack-stage/spack-stage-um7-git.1ea43190add8627fb317906d257f278763b55125_access-esm1.6-kvrgwgqaj7e7iu647kywc32lp4uheuqt/spack-src/umbase_hg3/cfg/bld.cfg
->Parse configuration: 1 second
->Setup destination: start
Destination: tm70_ci@gadi-login-04.gadi.nci.org.au:/scratch/tm70/tm70_ci/tmp/restricted/spack-stage/spack-stage-um7-git.1ea43190add8627fb317906d257f278763b55125_access-esm1.6-kvrgwgqaj7e7iu647kywc32lp4uheuqt/spack-src/ummodel_hg3
->Setup destination: 0 second
->Setup build: start
->Setup build: 2 seconds
->Pre-process: start
No. of files scanned for PP dependency: 2968
field_length_mod.F90:78:2: fatal error: typsize.h: No such file or directory
#include <typptra.h>
^~~~~~~~~~~
compilation terminated.
[FAIL] cpp -P -traditional -DC_LONG_LONG_INT=c_long_long_int -DMPP=mpp -DC_LOW_U=c_low_u -DFRL8=frl8 -DLINUX=linux -DBUFRD_IO=bufrd_io -DLITTLE_END=little_end -DLINUX_INTEL_COMPILER=linux_intel_compiler -DACCESS=access -DOASIS3=oasis3 -DCONTROL=control -DREPROD=reprod -DMPP=mpp -DATMOS=atmos -DGLOBAL=global -DA04_ALL=a04_all -DA01_3A=a01_3a -DA02_3A=a02_3a -DA03_8C=a03_8c -DA04_3D=a04_3d -DA05_4A=a05_4a -DA06_4A=a06_4a -DA08_7A=a08_7a -DA09_2A=a09_2a -DA10_2A=a10_2a -DA11_2A=a11_2a -DA12_2A=a12_2a -DA13_2A=a13_2a -DA14_1B=a14_1b -DA15_1A=a15_1a -DA16_1A=a16_1a -DA17_2B=a17_2b -DA18_0A=a18_0a -DA19_1A=a19_1a -DA25_0A=a25_0a -DA26_1A=a26_1a -DA30_1A=a30_1a -DA31_0A=a31_0a -DA32_1A=a32_1a -DA33_0A=a33_0a -DA34_0A=a34_0a -DA35_0A=a35_0a -DA38_0A=a38_0a -DA70_1B=a70_1b -DA71_1A=a71_1a -DC70_1A=c70_1a -DC72_0A=c72_0a -DC80_1A=c80_1a -DC82_1A=c82_1a -DC84_1A=c84_1a -DC92_2A=c92_2a -DC94_1A=c94_1a -DC95_2A=c95_2a -DC96_1C=c96_1c -DC97_3A=c97_3a -DCABLE_17TILES=cable_17tiles -DCABLE_SOIL_LAYERS=cable_soil_layers -DTIMER=timer -I/scratch/tm70/tm70_ci/tmp/restricted/spack-stage/spack-stage-um7-git.1ea43190add8627fb317906d257f278763b55125_access-esm1.6-kvrgwgqaj7e7iu647kywc32lp4uheuqt/spack-src/ummodel_hg3/inc -I/scratch/tm70/tm70_ci/tmp/restricted/spack-stage/spack-stage-um7-git.1ea43190add8627fb317906d257f278763b55125_access-esm1.6-kvrgwgqaj7e7iu647kywc32lp4uheuqt/spack-src/umbase_hg3/inc field_length_mod.F90 failed (1) at /g/data/vk83/prerelease/apps/spack/0.22/restricted/ukmo/release/linux-rocky8-x86_64_v3/oneapi-2025.0.4/fcm-2021.05.0-537j3rk2xhgg4g54ee5f3tlbbj7thhyb/bin/../lib/FCM1/BuildSrc.pm line 751. |
🚀 Attempted to deploy 🖥️
|
The last build crashed with the "mysteriously appearing-disappearing error" - this needs someone from the ocean team to look at it.
FATAL from PE 29: ==>Error from fm_util_mod(fm_util_check_for_bad_fields)[ocean_tracer_mod(ocean_prog_tracer_init)]: List length > number of good fields for list: /ocean_mod/prog_tracers
FATAL from PE 48: ==>Error from fm_util_mod(fm_util_check_for_bad_fields)[ocean_tracer_mod(ocean_prog_tracer_init)]: List length > number of good fields for list: /ocean_mod/prog_tracers
FATAL from PE 75: ==>Error from fm_util_mod(fm_util_check_for_bad_fields)[ocean_tracer_mod(ocean_prog_tracer_init)]: List length > number of good fields for list: /ocean_mod/prog_tracers
FATAL from PE 92: ==>Error from fm_util_mod(fm_util_check_for_bad_fields)[ocean_tracer_mod(ocean_prog_tracer_init)]: List length > number of good fields for list: /ocean_mod/prog_tracers
For a sanity check, I re-ran with the (previously crashing with divide-by-zero) exes from |
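Since every rank reports the same FATAL message, it can help to collapse the per-PE lines into one entry per distinct message before comparing runs. A minimal sketch, assuming the `FATAL from PE <n>: <message>` layout seen above; `group_fatal_by_message` is a hypothetical helper, not an existing FMS or payu utility.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Matches lines such as "FATAL from PE 29: ==>Error from ...".
_FATAL_RE = re.compile(r"^FATAL from PE\s+(\d+):\s*(.*)$")


def group_fatal_by_message(log_text: str) -> Dict[str, List[int]]:
    """Map each distinct FATAL message to the list of PEs that raised it."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for raw in log_text.splitlines():
        match = _FATAL_RE.match(raw.strip())
        if match:
            pe, message = int(match.group(1)), match.group(2)
            groups[message].append(pe)
    return dict(groups)


if __name__ == "__main__":
    message = (
        "==>Error from fm_util_mod(fm_util_check_for_bad_fields)"
        "[ocean_tracer_mod(ocean_prog_tracer_init)]: "
        "List length > number of good fields for list: /ocean_mod/prog_tracers"
    )
    log = "\n".join(f"FATAL from PE {pe}: {message}" for pe in (29, 48, 75, 92))
    for text, pes in group_fatal_by_message(log).items():
        print(f"{len(pes)} PEs {pes}: {text}")
```

Applied to the four lines above, this yields a single message shared by PEs 29, 48, 75 and 92.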
@manodeep I've just pushed changes that should get you past those errors. Note, with these changes you'll need to change the name of the ocean exe in your config from |
!redeploy |
🚀 Attempted to deploy 🖥️
|
The latest build goes past the init stages but crashes with the following error:
WARNING from PE 135: set_date_c: Year zero is invalid. Resetting year to 1
forrtl: error (65): floating invalid
Image PC Routine Line Source
libpthread-2.28.s 0000151822C2A990 Unknown Unknown Unknown
mom5_access_cm 00000000011DBE22 blmix_kpp 2662 ocean_vert_kpp_mom4p1.F90
mom5_access_cm 00000000011C1B8C vert_mix_kpp_mom4 1265 ocean_vert_kpp_mom4p1.F90
mom5_access_cm 0000000000A68D42 vert_mix_coeff 3098 ocean_vert_mix.F90
mom5_access_cm 0000000000453385 update_ocean_mode 1639 ocean_model.F90
mom5_access_cm 000000000042541F main 450 ocean_solo.F90
mom5_access_cm 000000000041378D Unknown Unknown Unknown
libc-2.28.so 00001518226787E5 __libc_start_main Unknown Unknown
mom5_access_cm 00000000004136AE Unknown Unknown Unknown
forrtl: error (65): floating invalid
Image PC Routine Line Source
libpthread-2.28.s 00001519035A1990 Unknown Unknown Unknown
mom5_access_cm 00000000011DBE22 blmix_kpp 2662 ocean_vert_kpp_mom4p1.F90
mom5_access_cm 00000000011C1B8C vert_mix_kpp_mom4 1265 ocean_vert_kpp_mom4p1.F90
mom5_access_cm 0000000000A68D42 vert_mix_coeff 3098 ocean_vert_mix.F90 |
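For context on the "floating invalid" error: as far as I understand it, this corresponds to an IEEE 754 invalid operation being trapped at runtime. When trapping is off, the same operations silently produce NaN, which then propagates through later arithmetic. The Python sketch below only illustrates that non-trapping behaviour; it says nothing about what actually happens at line 2662 of `ocean_vert_kpp_mom4p1.F90`, and both function names are hypothetical.

```python
import math
from typing import Dict, Iterable, Optional


def invalid_operation_examples() -> Dict[str, float]:
    """A few operations that IEEE 754 classes as invalid or NaN-propagating."""
    inf = math.inf
    return {
        "inf - inf": inf - inf,       # invalid operation, result is NaN
        "inf * 0.0": inf * 0.0,       # invalid operation, result is NaN
        "nan + 1.0": math.nan + 1.0,  # an existing NaN propagates
    }


def first_nan_index(values: Iterable[float]) -> Optional[int]:
    """Return the index of the first NaN in values, or None if there is none."""
    for index, value in enumerate(values):
        if math.isnan(value):
            return index
    return None


if __name__ == "__main__":
    for label, result in invalid_operation_examples().items():
        print(label, "->", result)
    print(first_nan_index([1.0, 2.0, math.inf - math.inf, 4.0]))
```

A check like `first_nan_index` on the inputs to the failing routine is one way to tell whether the bad value is created there or arrives from upstream.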
!redeploy |
🚀 Attempted to deploy 🖥️
|
Ahh right - got to add the new compiler to the list of spack compilers ... |
This warning occurs even for successful runs - see ACCESS-NRI/access-esm1.6-configs#105. It's something we need to fix, but I think it's unrelated to your issues here. |
Thanks Dougie! I copied the warning to show the context - I know it's benign :) |
Do not merge
Debugging why PI crashes with oneAPI
🚀 The latest prerelease access-esm1p6/pr88-23 at a9dc871 is here: #88 (comment) 🚀