Debugging PI crash #88

Draft: wants to merge 52 commits into base: main

manodeep commented May 28, 2025

Do not merge

Debugging why PI crashes with oneAPI


🚀 The latest prerelease access-esm1p6/pr88-23 at a9dc871 is here: #88 (comment) 🚀

manodeep added 30 commits March 24, 2025 13:38
@manodeep
Author

!redeploy

🚀 Attempted to deploy access-esm1p6 Prerelease pr88-18 with commit 785fb77

🖥️ Gadi Deployment ✔️

Usage Instructions

This access-esm1.6 model will be deployed to Gadi as:

  • latest as a Release (when merged).
  • pr88-18 as a Prerelease (during this PR).

This Prerelease is accessible on Gadi using:

module use /g/data/vk83/prerelease/modules
module load access-esm1p6/pr88-18

When using the above modules, the binaries shall be on your $PATH.

For advanced users, this Prerelease is also accessible on Gadi via /g/data/vk83/prerelease/apps/spack/0.22/spack in the access-esm1p6-pr88-18 environment.

Configuration Information

This Prerelease is deployed using:

  • access-nri/spack on branch 0.22
  • access-nri/spack-packages version 2025.03.002
  • access-nri/spack-config version 2025.02.2

If the above was not what was expected, commit changes to config/versions.json in this PR.

@manodeep
Author

This last build (with -O1) crashed with yet another error, but this one looks like a config error.

FATAL from PE   136: ==>Error from fm_util_mod(fm_util_check_for_bad_fields)[ocean_tracer_mod(ocean_prog_tracer_init)]: List length > number of good fields for list: /ocean_mod/prog_tracers


FATAL from PE   142: ==>Error from fm_util_mod(fm_util_check_for_bad_fields)[ocean_tracer_mod(ocean_prog_tracer_init)]: List length > number of good fields for list: /ocean_mod/prog_tracers


FATAL from PE   146: ==>Error from fm_util_mod(fm_util_check_for_bad_fields)[ocean_tracer_mod(ocean_prog_tracer_init)]: List length > number of good fields for list: /ocean_mod/prog_tracers


FATAL from PE     1: ==>Error from fm_util_mod(fm_util_check_for_bad_fields)[ocean_tracer_mod(ocean_prog_tracer_init)]: List length > number of good fields for list: /ocean_mod/prog_tracers


FATAL from PE   163: ==>Error from fm_util_mod(fm_util_check_for_bad_fields)[ocean_tracer_mod(ocean_prog_tracer_init)]: List length > number of good fields for list: /ocean_mod/prog_tracers

...
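
Every PE prints the same FATAL line, so it can help to collapse the log by message before reading it. Below is a minimal sketch, assuming the log lines follow the `FATAL from PE <n>: <message>` shape shown above. The function name `group_fatal_by_message` is hypothetical and not part of any ACCESS or FMS tooling.

```python
import re
from collections import defaultdict

# Matches lines such as "FATAL from PE   136: ==>Error from ...".
FATAL_RE = re.compile(r"^FATAL from PE\s+(\d+):\s*(.*)$")


def group_fatal_by_message(log_text):
    """Map each distinct FATAL message to the sorted list of PEs that reported it."""
    groups = defaultdict(set)
    for line in log_text.splitlines():
        match = FATAL_RE.match(line.strip())
        if match:
            groups[match.group(2)].add(int(match.group(1)))
    return {message: sorted(pes) for message, pes in groups.items()}


if __name__ == "__main__":
    message = (
        "==>Error from fm_util_mod(fm_util_check_for_bad_fields)"
        "[ocean_tracer_mod(ocean_prog_tracer_init)]: "
        "List length > number of good fields for list: /ocean_mod/prog_tracers"
    )
    sample = f"FATAL from PE   136: {message}\n\nFATAL from PE     1: {message}\n"
    for text, pes in group_fatal_by_message(sample).items():
        print(f"{len(pes)} PE(s) {pes}: {text}")
```

If all PEs report a single message, as here, that points at something shared (configuration or input) rather than a rank-specific fault, though the grouping itself proves nothing.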

@JhanSrbinovsky
Collaborator

@manodeep I'm a bit confused - why are there so many errors popping up that should've been revealed before?

@manodeep
Author

manodeep commented Jun 2, 2025

@JhanSrbinovsky The latest round of errors does not look like a compiler issue - I will ask within the team. But the Canberra folks are on a public holiday today - so it will have to wait till tomorrow.

In general, changing the compiler (in a complex codebase) can lead to all kinds of weird errors - different assumptions being made by the compiler leading to different generated code and consequently different runtime behaviour. That in turn can uncover hidden bugs in the source, or even compiler bugs themselves.
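
One concrete way that "different generated code" can change results is floating-point reassociation: when the compiler's floating-point model permits it, a vectorised reduction may add the same numbers in a different order. The sketch below only models that effect in Python. The two-lane split is an illustrative assumption, not what any particular compiler emits for this codebase.

```python
def sum_left_to_right(values):
    """Accumulate in source order, as an unoptimised scalar loop would."""
    total = 0.0
    for value in values:
        total += value
    return total


def sum_two_lanes(values):
    """Accumulate even- and odd-indexed elements separately, then combine.

    This models a 2-wide vectorised reduction (an illustrative assumption).
    """
    lanes = [0.0, 0.0]
    for index, value in enumerate(values):
        lanes[index % 2] += value
    return lanes[0] + lanes[1]


if __name__ == "__main__":
    values = [1e16, 1.0, -1e16, 1.0]
    print("source order:", sum_left_to_right(values))
    print("two lanes   :", sum_two_lanes(values))
```

Both orderings are valid ways to sum the same four numbers, yet they round differently, which is enough to push a borderline value across a threshold elsewhere in a model.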

I will keep digging and figure out an efficient working solution (we already know that building everything with -O0 fixes the crash, but unsurprisingly, that build is extremely slow).

@JhanSrbinovsky
Collaborator

@manodeep you're not in Canberra?

Anyway, I misunderstood: I assumed these errors were popping up WHEN you used -O0. We constantly reveal hidden bugs that -O2 lets slide when we revert to zero optimisation. It's surprising that it is the other way around. Or do you mean that mixing optimisation levels doesn't work? That wouldn't surprise me greatly, not that I have any direct experience trying to do that (deliberately).

@manodeep
Author

manodeep commented Jun 2, 2025

!redeploy

@manodeep
Author

manodeep commented Jun 2, 2025

> @manodeep you're not in Canberra?
>
> Anyway, I misunderstood: I assumed these errors were popping up WHEN you used -O0. We constantly reveal hidden bugs that -O2 lets slide when we revert to zero optimisation. It's surprising that it is the other way around. Or do you mean that mixing optimisation levels doesn't work? That wouldn't surprise me greatly, not that I have any direct experience trying to do that (deliberately).

I am in Melbourne and follow the VIC public holidays - so today is a working day for me :)

The divide-by-zero error shows up with -O2 and goes away with -O0 (I couldn't test with -O1). It seems to be related to vectorisation, so it only shows up at higher optimisation levels. Mixing optimisation levels between different source files is fine - the biggest constraint is that the same compiler has to be used for all the Fortran code.
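
One commonly cited way vectorisation can introduce a divide-by-zero is that a vectorised loop may evaluate the division on every lane and apply the guard afterwards as a mask, whereas the scalar loop never executes the division on the guarded elements. Whether that is what happens in the MOM5 code here is not established in this thread. The sketch below only models the idea, with Python's `ZeroDivisionError` standing in for the floating-point trap.

```python
def scalar_guarded_divide(numerators, denominators):
    """Scalar loop: the division only runs where the guard is true."""
    results = []
    for num, den in zip(numerators, denominators):
        results.append(num / den if den != 0.0 else 0.0)
    return results


def vector_style_divide(numerators, denominators):
    """Model of a vectorised loop: divide on every lane, then select by mask."""
    # Every lane is evaluated, including those the mask will later discard,
    # so a zero denominator raises here even though its result is unused.
    quotients = [num / den for num, den in zip(numerators, denominators)]
    return [q if den != 0.0 else 0.0 for q, den in zip(quotients, denominators)]


if __name__ == "__main__":
    nums, dens = [1.0, 2.0], [0.5, 0.0]
    print("scalar:", scalar_guarded_divide(nums, dens))
    try:
        vector_style_divide(nums, dens)
    except ZeroDivisionError as err:
        print("vector-style:", err)
```

In this model, disabling vectorisation for the loop corresponds to falling back to the first function.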

github-actions bot commented Jun 2, 2025

🚀 Attempted to deploy access-esm1p6 Prerelease pr88-19 with commit 5e50f56

🖥️ Gadi Deployment ✔️

Usage Instructions

access-esm1.6, defined in ``, will be deployed to Gadi as:

  • latest as a Release (when merged).
  • pr88-19 as a Prerelease (during this PR).

This Prerelease is accessible on Gadi using:

module use /g/data/vk83/prerelease/modules
module load access-esm1p6/pr88-19

When using the above modules, the binaries shall be on your $PATH.

For advanced users, this Prerelease is also accessible on Gadi via /g/data/vk83/prerelease/apps/spack/0.22/spack in the access-esm1p6-pr88-19 environment.

Configuration Information

This Prerelease is deployed using:

  • access-nri/spack on branch 0.22
  • access-nri/spack-packages version 2025.03.002
  • access-nri/spack-config version 2025.02.2

If the above was not what was expected, commit changes to config/versions.json in this PR.

@manodeep
Author

manodeep commented Jun 2, 2025

The last build crashes with this error:

forrtl: error (73): floating divide by zero
Image              PC                Routine            Line        Source             
libpthread-2.28.s  0000146990AF5990  Unknown               Unknown  Unknown
fms_ACCESS-ESM.x   000000000083147F  thickness_restart        2260  ocean_thickness.F90
fms_ACCESS-ESM.x   00000000007F4FDE  ocean_thickness_i         633  ocean_thickness.F90
fms_ACCESS-ESM.x   000000000045DFB6  ocean_model_init         1269  ocean_model.F90
fms_ACCESS-ESM.x   000000000043219C  main                      371  ocean_solo.F90
fms_ACCESS-ESM.x   000000000041319D  Unknown               Unknown  Unknown
....
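
When many ranks print the same traceback, pulling out the first frame that has source information saves some scrolling. Below is a minimal sketch, assuming the five-column `Image / PC / Routine / Line / Source` layout shown above. `Frame`, `parse_frames` and `first_source_frame` are hypothetical helper names. Long routine names appear cut off in the Routine column (for example `ocean_thickness_i`), and the parser keeps them as printed.

```python
import re
from typing import NamedTuple, Optional


class Frame(NamedTuple):
    image: str
    routine: str
    line: Optional[int]
    source: Optional[str]


# Image, hexadecimal PC, Routine, Line, Source (whitespace separated).
FRAME_RE = re.compile(r"^(\S+)\s+([0-9A-Fa-f]{8,16})\s+(\S+)\s+(\S+)\s+(\S+)\s*$")


def parse_frames(traceback_text):
    """Return the stack frames found in a forrtl-style traceback, outermost last."""
    frames = []
    for raw in traceback_text.splitlines():
        match = FRAME_RE.match(raw.strip())
        if not match:
            continue  # header, error banner, or elided lines
        image, _pc, routine, line, source = match.groups()
        frames.append(
            Frame(
                image=image,
                routine=routine,
                line=int(line) if line.isdigit() else None,
                source=None if source == "Unknown" else source,
            )
        )
    return frames


def first_source_frame(frames):
    """Return the innermost frame that carries a line number, or None."""
    for frame in frames:
        if frame.line is not None:
            return frame
    return None


if __name__ == "__main__":
    sample = """forrtl: error (73): floating divide by zero
Image              PC                Routine            Line        Source
libpthread-2.28.s  0000146990AF5990  Unknown               Unknown  Unknown
fms_ACCESS-ESM.x   000000000083147F  thickness_restart        2260  ocean_thickness.F90
"""
    print(first_source_frame(parse_frames(sample)))
```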

The offending line (line 2260 of ocean_thickness.F90, in thickness_restart) looks to be another divide. Takeaways:

  • the !DIR$ NOVECTOR compiler directive works and prevents the compiler from vectorising the code. I also confirmed this separately by compiling ocean_topog.F90 by itself on the command line
  • the odd error about the list length being greater than the number of good fields has gone away (errors walk in mysterious ways, apparently)

@manodeep
Author

manodeep commented Jun 2, 2025

!redeploy

@manodeep
Author

manodeep commented Jun 2, 2025

Here is the MOM5 commit for the latest build

github-actions bot commented Jun 2, 2025

🚀 Attempted to deploy access-esm1p6 Prerelease pr88-20 with commit b41f693

🖥️ Gadi Deployment ❌

@manodeep
Author

manodeep commented Jun 2, 2025

!redeploy

@manodeep
Author

manodeep commented Jun 2, 2025

Seems to be an unrelated build failure coming from UM7 - here are the contents of spack-build-out.txt:

==> [2025-06-02-14:32:36.983157] um7: Executing phase: 'edit'
==> [2025-06-02-14:32:36.985170] Find (not recursive): /g/data/vk83/prerelease/apps/spack/0.22/restricted/ukmo/release/linux-rocky8-x86_64_v3/oneapi-2025.0.4/oasis3-mct-git.7036f26ece68c26083fec2fe96e3cb1faed7559d_access-esm1.5-mvmism73hrpqzydjynrcoo2ezuzobaev/lib ['libpsmile.MPI1.a', 'libmct.a', 'libmpeu.a', 'libscrip.a']
==> [2025-06-02-14:32:36.985486] Find complete: /g/data/vk83/prerelease/apps/spack/0.22/restricted/ukmo/release/linux-rocky8-x86_64_v3/oneapi-2025.0.4/oasis3-mct-git.7036f26ece68c26083fec2fe96e3cb1faed7559d_access-esm1.5-mvmism73hrpqzydjynrcoo2ezuzobaev/lib ['libpsmile.MPI1.a', 'libmct.a', 'libmpeu.a', 'libscrip.a']
==> [2025-06-02-14:32:36.986980] Find (not recursive): /g/data/vk83/prerelease/apps/spack/0.22/restricted/ukmo/release/linux-rocky8-x86_64_v3/oneapi-2025.0.4/netcdf-fortran-4.5.2-gzopzrncyjfmvctitb75t7hstkv23cpy/lib ['libnetcdff.so']
==> [2025-06-02-14:32:36.987081] Find complete: /g/data/vk83/prerelease/apps/spack/0.22/restricted/ukmo/release/linux-rocky8-x86_64_v3/oneapi-2025.0.4/netcdf-fortran-4.5.2-gzopzrncyjfmvctitb75t7hstkv23cpy/lib ['libnetcdff.so']
==> [2025-06-02-14:32:36.987979] Find (not recursive): /g/data/vk83/prerelease/apps/spack/0.22/restricted/ukmo/release/linux-rocky8-x86_64_v3/oneapi-2025.0.4/dummygrib-1.0-p3e3j56ivaraifnitkmrj5my4nz4tafl/lib ['libdummygrib.so']
==> [2025-06-02-14:32:36.988078] Find complete: /g/data/vk83/prerelease/apps/spack/0.22/restricted/ukmo/release/linux-rocky8-x86_64_v3/oneapi-2025.0.4/dummygrib-1.0-p3e3j56ivaraifnitkmrj5my4nz4tafl/lib ['libdummygrib.so']
==> [2025-06-02-14:32:36.988130] Find (recursive): /g/data/vk83/prerelease/apps/spack/0.22/restricted/ukmo/release/linux-rocky8-x86_64_v3/oneapi-2025.0.4/dummygrib-1.0-p3e3j56ivaraifnitkmrj5my4nz4tafl ['libdummygrib.so']
==> [2025-06-02-14:32:36.992092] Find complete: /g/data/vk83/prerelease/apps/spack/0.22/restricted/ukmo/release/linux-rocky8-x86_64_v3/oneapi-2025.0.4/dummygrib-1.0-p3e3j56ivaraifnitkmrj5my4nz4tafl ['libdummygrib.so']
==> [2025-06-02-14:32:36.992185] Find (not recursive): /g/data/vk83/prerelease/apps/spack/0.22/restricted/ukmo/release/linux-rocky8-x86_64_v3/oneapi-2025.0.4/dummygrib-1.0-p3e3j56ivaraifnitkmrj5my4nz4tafl/lib ['libdummygrib.a']
==> [2025-06-02-14:32:36.992266] Find complete: /g/data/vk83/prerelease/apps/spack/0.22/restricted/ukmo/release/linux-rocky8-x86_64_v3/oneapi-2025.0.4/dummygrib-1.0-p3e3j56ivaraifnitkmrj5my4nz4tafl/lib ['libdummygrib.a']
==> [2025-06-02-14:32:37.056323] um7: Executing phase: 'build'
==> [2025-06-02-14:32:37.058304] '/g/data/vk83/prerelease/apps/spack/0.22/restricted/ukmo/release/linux-rocky8-x86_64_v3/oneapi-2025.0.4/fcm-2021.05.0-537j3rk2xhgg4g54ee5f3tlbbj7thhyb/bin/fcm' 'build' '-f' '-j' '4' 'ummodel_hg3/cfg/bld-hadgem3-spack.cfg'
Build command started on Mon Jun  2 14:32:38 2025.
->Parse configuration: start
Config file (bld): ummodel_hg3/cfg/bld-hadgem3-spack.cfg
Config file (bld): /scratch/tm70/tm70_ci/tmp/restricted/spack-stage/spack-stage-um7-git.1ea43190add8627fb317906d257f278763b55125_access-esm1.6-kvrgwgqaj7e7iu647kywc32lp4uheuqt/spack-src/umbase_hg3/cfg/bld.cfg
->Parse configuration: 1 second
->Setup destination: start
Destination: tm70_ci@gadi-login-04.gadi.nci.org.au:/scratch/tm70/tm70_ci/tmp/restricted/spack-stage/spack-stage-um7-git.1ea43190add8627fb317906d257f278763b55125_access-esm1.6-kvrgwgqaj7e7iu647kywc32lp4uheuqt/spack-src/ummodel_hg3
->Setup destination: 0 second
->Setup build: start
->Setup build: 2 seconds
->Pre-process: start
No. of files scanned for PP dependency: 2968
field_length_mod.F90:78:2: fatal error: typsize.h: No such file or directory
 #include <typptra.h>
  ^~~~~~~~~~~
compilation terminated.
[FAIL] cpp -P -traditional -DC_LONG_LONG_INT=c_long_long_int -DMPP=mpp -DC_LOW_U=c_low_u -DFRL8=frl8 -DLINUX=linux -DBUFRD_IO=bufrd_io -DLITTLE_END=little_end -DLINUX_INTEL_COMPILER=linux_intel_compiler -DACCESS=access -DOASIS3=oasis3 -DCONTROL=control -DREPROD=reprod -DMPP=mpp -DATMOS=atmos -DGLOBAL=global -DA04_ALL=a04_all -DA01_3A=a01_3a -DA02_3A=a02_3a -DA03_8C=a03_8c -DA04_3D=a04_3d -DA05_4A=a05_4a -DA06_4A=a06_4a -DA08_7A=a08_7a -DA09_2A=a09_2a -DA10_2A=a10_2a -DA11_2A=a11_2a -DA12_2A=a12_2a -DA13_2A=a13_2a -DA14_1B=a14_1b -DA15_1A=a15_1a -DA16_1A=a16_1a -DA17_2B=a17_2b -DA18_0A=a18_0a -DA19_1A=a19_1a -DA25_0A=a25_0a -DA26_1A=a26_1a -DA30_1A=a30_1a -DA31_0A=a31_0a -DA32_1A=a32_1a -DA33_0A=a33_0a -DA34_0A=a34_0a -DA35_0A=a35_0a -DA38_0A=a38_0a -DA70_1B=a70_1b -DA71_1A=a71_1a -DC70_1A=c70_1a -DC72_0A=c72_0a -DC80_1A=c80_1a -DC82_1A=c82_1a -DC84_1A=c84_1a -DC92_2A=c92_2a -DC94_1A=c94_1a -DC95_2A=c95_2a -DC96_1C=c96_1c -DC97_3A=c97_3a -DCABLE_17TILES=cable_17tiles -DCABLE_SOIL_LAYERS=cable_soil_layers -DTIMER=timer -I/scratch/tm70/tm70_ci/tmp/restricted/spack-stage/spack-stage-um7-git.1ea43190add8627fb317906d257f278763b55125_access-esm1.6-kvrgwgqaj7e7iu647kywc32lp4uheuqt/spack-src/ummodel_hg3/inc -I/scratch/tm70/tm70_ci/tmp/restricted/spack-stage/spack-stage-um7-git.1ea43190add8627fb317906d257f278763b55125_access-esm1.6-kvrgwgqaj7e7iu647kywc32lp4uheuqt/spack-src/umbase_hg3/inc field_length_mod.F90 failed (1) at /g/data/vk83/prerelease/apps/spack/0.22/restricted/ukmo/release/linux-rocky8-x86_64_v3/oneapi-2025.0.4/fcm-2021.05.0-537j3rk2xhgg4g54ee5f3tlbbj7thhyb/bin/../lib/FCM1/BuildSrc.pm line 751.
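
For a "No such file or directory" failure like this one, a quick check is whether any of the `-I` directories on the failing `cpp` command line actually contains the header. Below is a minimal sketch that takes the command as a string. `include_dirs` and `find_header` are hypothetical helpers, and the header name is passed in explicitly rather than parsed from the log.

```python
import shlex
from pathlib import Path


def include_dirs(command):
    """Return the directories given as -I<dir> or -I <dir> on a command line."""
    tokens = shlex.split(command)
    dirs = []
    index = 0
    while index < len(tokens):
        token = tokens[index]
        if token == "-I" and index + 1 < len(tokens):
            dirs.append(tokens[index + 1])
            index += 2
            continue
        if token.startswith("-I") and len(token) > 2:
            dirs.append(token[2:])
        index += 1
    return dirs


def find_header(header, command):
    """Return the include directories from `command` that contain `header`."""
    return [d for d in include_dirs(command) if (Path(d) / header).is_file()]


if __name__ == "__main__":
    # Shortened, hypothetical command line; paste the real one from the log.
    cmd = "cpp -P -traditional -DMPP=mpp -I/path/to/ummodel_hg3/inc -I/path/to/umbase_hg3/inc field_length_mod.F90"
    print(include_dirs(cmd))
    print(find_header("typsize.h", cmd) or "header not found on any -I path")
```

An empty result would suggest the header was never staged into the build tree, which is a different problem from a compiler issue.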

github-actions bot commented Jun 2, 2025

🚀 Attempted to deploy access-esm1p6 Prerelease pr88-21 with commit b41f693

🖥️ Gadi Deployment ✔️

Usage Instructions

access-esm1.6, defined in ./spack.yaml, will be deployed to Gadi as:

  • latest as a Release (when merged).
  • pr88-21 as a Prerelease (during this PR).

This Prerelease is accessible on Gadi using:

module use /g/data/vk83/prerelease/modules
module load access-esm1p6/pr88-21

When using the above modules, the binaries shall be on your $PATH.

For advanced users, this Prerelease is also accessible on Gadi via /g/data/vk83/prerelease/apps/spack/0.22/spack in the access-esm1p6-pr88-21 environment.

Configuration Information

This Prerelease is deployed using:

  • access-nri/spack on branch 0.22
  • access-nri/spack-packages version 2025.03.002
  • access-nri/spack-config version 2025.02.2

If the above was not what was expected, commit changes to config/versions.json in this PR.

@manodeep
Author

manodeep commented Jun 2, 2025

The last build crashed with the "mysteriously appearing-disappearing error" - this needs someone from the ocean team to look at it.

FATAL from PE    29: ==>Error from fm_util_mod(fm_util_check_for_bad_fields)[ocean_tracer_mod(ocean_prog_tracer_init)]: List length > number of good fields for list: /ocean_mod/prog_tracers


FATAL from PE    48: ==>Error from fm_util_mod(fm_util_check_for_bad_fields)[ocean_tracer_mod(ocean_prog_tracer_init)]: List length > number of good fields for list: /ocean_mod/prog_tracers


FATAL from PE    75: ==>Error from fm_util_mod(fm_util_check_for_bad_fields)[ocean_tracer_mod(ocean_prog_tracer_init)]: List length > number of good fields for list: /ocean_mod/prog_tracers


FATAL from PE    92: ==>Error from fm_util_mod(fm_util_check_for_bad_fields)[ocean_tracer_mod(ocean_prog_tracer_init)]: List length > number of good fields for list: /ocean_mod/prog_tracers

For a sanity check, I re-ran with the (previously crashing with divide-by-zero) exes from pr88-15 and I do get the same divide-by-zero. Something has changed with the config somewhere ...

@dougiesquire
Contributor

@manodeep I've just pushed changes that should get you past those errors.

Note, with these changes you'll need to change the name of the ocean exe in your config from fms_ACCESS-ESM.x to mom5_access_cm
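
A rough sketch of that rename as a text edit follows. It assumes the executable is set on a line of the form `exe: <name>` in a YAML-style config. That layout, the sample fragment and the helper name `rename_ocean_exe` are all assumptions for illustration, so check your actual config before relying on it.

```python
import re

OLD_EXE = "fms_ACCESS-ESM.x"
NEW_EXE = "mom5_access_cm"


def rename_ocean_exe(config_text, old=OLD_EXE, new=NEW_EXE):
    """Replace `exe: <old>` with `exe: <new>`; return (new_text, replacements)."""
    pattern = re.compile(
        r"^([ \t]*exe:[ \t]*)" + re.escape(old) + r"([ \t]*(?:#.*)?)$",
        re.MULTILINE,
    )
    return pattern.subn(lambda m: m.group(1) + new + m.group(2), config_text)


if __name__ == "__main__":
    # Hypothetical config fragment, not copied from a real configuration.
    sample = "submodels:\n  - name: ocean\n    model: mom\n    exe: fms_ACCESS-ESM.x\n"
    updated, count = rename_ocean_exe(sample)
    print(f"{count} replacement(s)")
    print(updated)
```

Returning the replacement count makes it easy to notice when the old name was not found at all.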

@manodeep
Author

manodeep commented Jun 3, 2025

!redeploy

github-actions bot commented Jun 3, 2025

🚀 Attempted to deploy access-esm1p6 Prerelease pr88-22 with commit f387236

🖥️ Gadi Deployment ✔️

Usage Instructions

access-esm1.6, defined in ./spack.yaml, will be deployed to Gadi as:

  • latest as a Release (when merged).
  • pr88-22 as a Prerelease (during this PR).

This Prerelease is accessible on Gadi using:

module use /g/data/vk83/prerelease/modules
module load access-esm1p6/pr88-22

When using the above modules, the binaries shall be on your $PATH.

For advanced users, this Prerelease is also accessible on Gadi via /g/data/vk83/prerelease/apps/spack/0.22/spack in the access-esm1p6-pr88-22 environment.

Configuration Information

This Prerelease is deployed using:

  • access-nri/spack on branch 0.22
  • access-nri/spack-packages version 2025.05.002
  • access-nri/spack-config version 2025.02.2

If the above was not what was expected, commit changes to config/versions.json in this PR.

@manodeep
Author

manodeep commented Jun 3, 2025

The latest build goes past the init stages but crashes with the following error:

WARNING from PE   135: set_date_c: Year zero is invalid. Resetting year to 1

forrtl: error (65): floating invalid
Image              PC                Routine            Line        Source             
libpthread-2.28.s  0000151822C2A990  Unknown               Unknown  Unknown
mom5_access_cm     00000000011DBE22  blmix_kpp                2662  ocean_vert_kpp_mom4p1.F90
mom5_access_cm     00000000011C1B8C  vert_mix_kpp_mom4        1265  ocean_vert_kpp_mom4p1.F90
mom5_access_cm     0000000000A68D42  vert_mix_coeff           3098  ocean_vert_mix.F90
mom5_access_cm     0000000000453385  update_ocean_mode        1639  ocean_model.F90
mom5_access_cm     000000000042541F  main                      450  ocean_solo.F90
mom5_access_cm     000000000041378D  Unknown               Unknown  Unknown
libc-2.28.so       00001518226787E5  __libc_start_main     Unknown  Unknown
mom5_access_cm     00000000004136AE  Unknown               Unknown  Unknown
forrtl: error (65): floating invalid
Image              PC                Routine            Line        Source             
libpthread-2.28.s  00001519035A1990  Unknown               Unknown  Unknown
mom5_access_cm     00000000011DBE22  blmix_kpp                2662  ocean_vert_kpp_mom4p1.F90
mom5_access_cm     00000000011C1B8C  vert_mix_kpp_mom4        1265  ocean_vert_kpp_mom4p1.F90
mom5_access_cm     0000000000A68D42  vert_mix_coeff           3098  ocean_vert_mix.F90
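
For context, "floating invalid" typically signals an IEEE 754 invalid operation, that is, one whose result would be NaN if the exception were not trapped. The sketch below shows a few such operations in Python, where they silently return NaN. It does not identify which operation fails at line 2662 of ocean_vert_kpp_mom4p1.F90.

```python
import math


def classify(value):
    """Label a float as NaN (invalid operation), infinite, or finite."""
    if math.isnan(value):
        return "invalid (NaN)"
    if math.isinf(value):
        return "infinite"
    return "finite"


EXAMPLES = {
    "inf - inf": math.inf - math.inf,    # invalid operation
    "inf * 0.0": math.inf * 0.0,         # invalid operation
    "inf / inf": math.inf / math.inf,    # invalid operation
    "1e308 * 10.0": 1e308 * 10.0,        # overflow to infinity, not invalid
    "1.0 + 2.0": 1.0 + 2.0,              # ordinary finite result
}

if __name__ == "__main__":
    for expression, value in EXAMPLES.items():
        print(f"{expression:>14} -> {classify(value)}")
```

One practical consequence is that an infinity or uninitialised value produced earlier in a run can surface as an invalid operation much later, so the crash site is not necessarily where the bad value originated.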

@manodeep
Author

manodeep commented Jun 3, 2025

!redeploy

github-actions bot commented Jun 3, 2025

🚀 Attempted to deploy access-esm1p6 Prerelease pr88-23 with commit a9dc871

🖥️ Gadi Deployment ❌

@manodeep
Author

manodeep commented Jun 3, 2025

Ahh right - got to add the new compiler to the list of spack compilers ...

@dougiesquire
Contributor

> WARNING from PE 135: set_date_c: Year zero is invalid. Resetting year to 1

This warning occurs even for successful runs - see ACCESS-NRI/access-esm1.6-configs#105. It's something we need to fix, but I think it's unrelated to your issues here.

@manodeep
Author

manodeep commented Jun 3, 2025

Thanks Dougie! I copied the warning to show the context - I know it's benign :)

3 participants