
fix issue that parallel gets set to None with EB 5+ #1104

trz42
Collaborator

@trz42 trz42 commented May 28, 2025

Fixes an issue caused by the post_ready_hook trying to obtain the parallelism via

self.cfg['max_parallel'] 

which is not set by EasyBlock.set_parallel. The latter always (for EB 4 and EB 5) sets

self.cfg['parallel'] 

or

self.cfg.parallel 
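The mismatch described above can be sketched with a stand-in config object. This is only an illustration: `FakeConfig` and `fake_set_parallel` are hypothetical names, not EasyBuild classes or functions, and the assumption that both keys exist and default to `None` is made purely for the example.

```python
class FakeConfig:
    """Hypothetical stand-in for an easyconfig; NOT the EasyBuild class."""

    def __init__(self):
        # Assumption for this sketch: both keys exist and default to None.
        self._params = {'parallel': None, 'max_parallel': None}

    def __getitem__(self, key):
        return self._params[key]

    def __setitem__(self, key, value):
        self._params[key] = value

    @property
    def parallel(self):
        # Attribute-style access to the same value as cfg['parallel'].
        return self._params['parallel']


def fake_set_parallel(cfg, cores):
    """Mimics the behaviour described above: only 'parallel' is written."""
    cfg['parallel'] = cores


cfg = FakeConfig()
fake_set_parallel(cfg, 8)

from_max_parallel = cfg['max_parallel']  # the key the hook was reading
from_parallel = cfg['parallel']          # the key that was actually set
print(from_max_parallel, from_parallel, cfg.parallel)
```

In this sketch the hook-style lookup of `max_parallel` yields `None` even though a parallelism of 8 was set, which is the symptom named in the PR title.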

@trz42 trz42 added the bug (Something isn't working) and 2023.06-software.eessi.io (2023.06 version of software.eessi.io) labels May 28, 2025
@eessi-bot-surf

Instance eessi-bot-surf is configured to build for:

  • architectures: x86_64/amd/zen4, x86_64/amd/zen2
  • repositories: eessi-hpc.org-2023.06-software, eessi.io-2023.06-compat, eessi-hpc.org-2023.06-compat, eessi.io-2023.06-software

@trz42
Collaborator Author

trz42 commented May 28, 2025

Just need to build this once (for a single architecture) ...
bot: build instance:eessi-bot-mc-aws repository:eessi.io-2023.06-software architecture:x86_64/generic

@eessi-bot-surf

eessi-bot-surf bot commented May 28, 2025

Updates by the bot instance eessi-bot-surf (click for details)
  • received bot command build instance:eessi-bot-mc-aws repository:eessi.io-2023.06-software architecture:x86_64/generic from trz42

    • expanded format: build instance:eessi-bot-mc-aws repository:eessi.io-2023.06-software architecture:x86_64/generic
  • handling command build instance:eessi-bot-mc-aws repository:eessi.io-2023.06-software architecture:x86_64/generic resulted in:

    • no jobs were submitted

@eessi-bot-aws

eessi-bot-aws bot commented May 28, 2025

New job on instance eessi-bot-mc-aws for CPU micro-architecture x86_64-generic for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2025.05/pr_1104/65836

date job status comment
May 28 17:56:50 UTC 2025 submitted job id 65836 awaits release by job manager
May 28 17:57:45 UTC 2025 released job awaits launch by Slurm scheduler
May 28 18:03:48 UTC 2025 running job 65836 is running
May 28 18:10:55 UTC 2025 finished
😁 SUCCESS (click triangle for details)
Details
✅ job output file slurm-65836.out
✅ no message matching FATAL:
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-generic-17484554550.tar.gz
size: 0 MiB (16297 bytes)
entries: 1
modules under 2023.06/software/linux/x86_64/generic/modules/all
no module files in tarball
software under 2023.06/software/linux/x86_64/generic/software
no software packages in tarball
other under 2023.06/software/linux/x86_64/generic
2023.06/init/easybuild/eb_hooks.py
May 28 18:10:55 UTC 2025 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ OK ] ( 1/10) EESSI_LAMMPS_lj %device_type=cpu %module_name=LAMMPS/29Aug2024-foss-2023b-kokkos %scale=1_node /aeb2d9df @BotBuildTests:x86_64_generic+default
P: perf: 396.224 timesteps/s (r:0, l:None, u:None)
[ OK ] ( 2/10) EESSI_LAMMPS_lj %device_type=cpu %module_name=LAMMPS/2Aug2023_update2-foss-2023a-kokkos %scale=1_node /04ff9ece @BotBuildTests:x86_64_generic+default
P: perf: 428.298 timesteps/s (r:0, l:None, u:None)
[ OK ] ( 3/10) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_allreduce %module_name=OSU-Micro-Benchmarks/7.2-gompi-2023b %scale=1_node %device_type=cpu /775175bf @BotBuildTests:x86_64_generic+default
P: latency: 2.95 us (r:0, l:None, u:None)
[ OK ] ( 4/10) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_allreduce %module_name=OSU-Micro-Benchmarks/7.1-1-gompi-2023a %scale=1_node %device_type=cpu /52707c40 @BotBuildTests:x86_64_generic+default
P: latency: 3.0 us (r:0, l:None, u:None)
[ OK ] ( 5/10) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_alltoall %module_name=OSU-Micro-Benchmarks/7.2-gompi-2023b %scale=1_node %device_type=cpu /b1aacda9 @BotBuildTests:x86_64_generic+default
P: latency: 5.68 us (r:0, l:None, u:None)
[ OK ] ( 6/10) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_alltoall %module_name=OSU-Micro-Benchmarks/7.1-1-gompi-2023a %scale=1_node %device_type=cpu /c6bad193 @BotBuildTests:x86_64_generic+default
P: latency: 6.01 us (r:0, l:None, u:None)
[ OK ] ( 7/10) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.2-gompi-2023b %scale=1_node /15cad6c4 @BotBuildTests:x86_64_generic+default
P: latency: 0.7 us (r:0, l:None, u:None)
[ OK ] ( 8/10) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.1-1-gompi-2023a %scale=1_node /6672deda @BotBuildTests:x86_64_generic+default
P: latency: 0.73 us (r:0, l:None, u:None)
[ OK ] ( 9/10) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.2-gompi-2023b %scale=1_node /2a9a47b1 @BotBuildTests:x86_64_generic+default
P: bandwidth: 10624.63 MB/s (r:0, l:None, u:None)
[ OK ] (10/10) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.1-1-gompi-2023a %scale=1_node /1b24ab8e @BotBuildTests:x86_64_generic+default
P: bandwidth: 10428.24 MB/s (r:0, l:None, u:None)
[ PASSED ] Ran 10/10 test case(s) from 10 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-65836.out
✅ no message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@trz42 trz42 requested a review from ocaisa May 28, 2025 17:58
eb_hooks.py Outdated
@@ -133,7 +133,8 @@ def post_ready_hook(self, *args, **kwargs):
     # Check whether we have EasyBuild 4 or 5
     parallel_param = 'parallel'
     if EASYBUILD_VERSION >= '5':
-        parallel_param = 'max_parallel'
+        # parallel_param = 'max_parallel' # EasyBlock.set_parallel sets 'parallel'
+        parallel_param = 'parallel'
Member

This just reverts the previous change and reintroduces the warning about using a deprecated parameter.

Member

I think we need a better solution as the warning appears for anyone using EESSI-extend

Collaborator Author

Yeah, but self.cfg['max_parallel'] doesn't exist, and --max-parallel does not seem to be fully working in EB 5+.

Could we disable that specific warning?

Collaborator Author

Is the warning really appearing for anyone or only when we adjust the parallelism in the post_ready_hook?

Member

Why don't we just use the parallel property (I'm guessing that is self.parallel) then in the EB5 scenario?

Member

> Is the warning really appearing for anyone or only when we adjust the parallelism in the post_ready_hook?

I think yes. I fixed it because it bothered me to see it all the time while preparing the CI tutorial.

Collaborator Author

> Why don't we just use the parallel property (I'm guessing that is self.parallel) then in the EB5 scenario?

Hmm, does that exist?

Trying it for TensorFlow I get

  File "/local/scratch/roeblitz1/software-layer/eb_hooks.py", line 141, in post_ready_hook
    print_msg("self.parallel = '%s'", self.parallel)
                                      ^^^^^^^^^^^^^
AttributeError: 'PythonBundle' object has no attribute 'parallel'

@trz42
Collaborator Author

trz42 commented May 28, 2025

Suggesting a different fix/workaround ...
bot: build instance:eessi-bot-mc-aws repository:eessi.io-2023.06-software architecture:x86_64/generic

@eessi-bot-surf

eessi-bot-surf bot commented May 28, 2025

Updates by the bot instance eessi-bot-surf (click for details)
  • received bot command build instance:eessi-bot-mc-aws repository:eessi.io-2023.06-software architecture:x86_64/generic from trz42

    • expanded format: build instance:eessi-bot-mc-aws repository:eessi.io-2023.06-software architecture:x86_64/generic
  • handling command build instance:eessi-bot-mc-aws repository:eessi.io-2023.06-software architecture:x86_64/generic resulted in:

    • no jobs were submitted

@eessi-bot-aws

eessi-bot-aws bot commented May 28, 2025

New job on instance eessi-bot-mc-aws for CPU micro-architecture x86_64-generic for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2025.05/pr_1104/65837

date job status comment
May 28 20:28:23 UTC 2025 submitted job id 65837 awaits release by job manager
May 28 20:29:09 UTC 2025 released job awaits launch by Slurm scheduler
May 28 20:36:12 UTC 2025 running job 65837 is running
May 28 20:43:19 UTC 2025 finished
😁 SUCCESS (click triangle for details)
Details
✅ job output file slurm-65837.out
✅ no message matching FATAL:
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-generic-17484646050.tar.gz
size: 0 MiB (16298 bytes)
entries: 1
modules under 2023.06/software/linux/x86_64/generic/modules/all
no module files in tarball
software under 2023.06/software/linux/x86_64/generic/software
no software packages in tarball
other under 2023.06/software/linux/x86_64/generic
2023.06/init/easybuild/eb_hooks.py
May 28 20:43:19 UTC 2025 test result
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ OK ] ( 1/10) EESSI_LAMMPS_lj %device_type=cpu %module_name=LAMMPS/29Aug2024-foss-2023b-kokkos %scale=1_node /aeb2d9df @BotBuildTests:x86_64_generic+default
P: perf: 415.523 timesteps/s (r:0, l:None, u:None)
[ OK ] ( 2/10) EESSI_LAMMPS_lj %device_type=cpu %module_name=LAMMPS/2Aug2023_update2-foss-2023a-kokkos %scale=1_node /04ff9ece @BotBuildTests:x86_64_generic+default
P: perf: 448.824 timesteps/s (r:0, l:None, u:None)
[ OK ] ( 3/10) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_allreduce %module_name=OSU-Micro-Benchmarks/7.2-gompi-2023b %scale=1_node %device_type=cpu /775175bf @BotBuildTests:x86_64_generic+default
P: latency: 3.07 us (r:0, l:None, u:None)
[ OK ] ( 4/10) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_allreduce %module_name=OSU-Micro-Benchmarks/7.1-1-gompi-2023a %scale=1_node %device_type=cpu /52707c40 @BotBuildTests:x86_64_generic+default
P: latency: 3.02 us (r:0, l:None, u:None)
[ OK ] ( 5/10) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_alltoall %module_name=OSU-Micro-Benchmarks/7.2-gompi-2023b %scale=1_node %device_type=cpu /b1aacda9 @BotBuildTests:x86_64_generic+default
P: latency: 6.1 us (r:0, l:None, u:None)
[ OK ] ( 6/10) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_alltoall %module_name=OSU-Micro-Benchmarks/7.1-1-gompi-2023a %scale=1_node %device_type=cpu /c6bad193 @BotBuildTests:x86_64_generic+default
P: latency: 5.75 us (r:0, l:None, u:None)
[ OK ] ( 7/10) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.2-gompi-2023b %scale=1_node /15cad6c4 @BotBuildTests:x86_64_generic+default
P: latency: 0.73 us (r:0, l:None, u:None)
[ OK ] ( 8/10) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.1-1-gompi-2023a %scale=1_node /6672deda @BotBuildTests:x86_64_generic+default
P: latency: 0.73 us (r:0, l:None, u:None)
[ OK ] ( 9/10) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.2-gompi-2023b %scale=1_node /2a9a47b1 @BotBuildTests:x86_64_generic+default
P: bandwidth: 10607.13 MB/s (r:0, l:None, u:None)
[ OK ] (10/10) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.1-1-gompi-2023a %scale=1_node /1b24ab8e @BotBuildTests:x86_64_generic+default
P: bandwidth: 10803.5 MB/s (r:0, l:None, u:None)
[ PASSED ] Ran 10/10 test case(s) from 10 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-65837.out
✅ no message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@@ -136,6 +136,8 @@ def post_ready_hook(self, *args, **kwargs):
parallel_param = 'max_parallel'
Contributor

A better approach with EasyBuild 5.x would be to use self.cfg.parallel instead of self.cfg['parallel'].

Member

In general, I think the information gathering here is in the wrong place. You only need to do this if you plan to change the parallelism, so this whole block could be moved to when you need it, right before if new_parallel != parallel:

Member

Ah sorry, operation_func requires it (right now), but it should at least be within the if statement.

@trz42
Collaborator Author

trz42 commented Jun 5, 2025

@ocaisa @boegel this just hit me again. Any movement on this?

@@ -136,6 +136,8 @@ def post_ready_hook(self, *args, **kwargs):
         parallel_param = 'max_parallel'
     # get current parallelism setting
     parallel = self.cfg[parallel_param]
+    if parallel == None:
+        return # self.cfg doesn't contain 'parallel' or 'max_parallel'
Member

@ocaisa ocaisa Jun 5, 2025

This exits too early: it may be the case that this is None but you still want to limit the parallelism. What we really want is to call https://github.com/easybuilders/easybuild-framework/pull/3811/files#diff-8136c25c2706ba344aa622b3441a997a04037c1f5e6acadc006ae27f132554f4R1833 since that covers all cases.

That's an open framework PR though so I can be pragmatic. In general this query is triggered for every easyconfig, but in reality you only need to know it if you plan on changing it. The query should happen inside the if block. In that case you'll only see the warning when it affects what you are trying to build (right now you see it on every call to eb). I'm ok with reverting #1089 as long as the querying of the parallelism is moved into the if self.name in PARALLELISM_LIMITS: block
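A rough sketch of the restructuring suggested here, under stated assumptions: `FakeBlock`, `CountingConfig`, and `post_ready_hook_sketch` are hypothetical stand-ins, and the limits table is simplified to a plain "name to maximum cores" mapping (the real `PARALLELISM_LIMITS` in eb_hooks.py works with `operation_func`, which is not modelled here). The point is only that the parallelism lookup happens inside the `if self.name in PARALLELISM_LIMITS:` branch, and that `None` does not cause an early return.

```python
# Hypothetical, simplified limits table: software name -> maximum cores.
PARALLELISM_LIMITS = {'TensorFlow': 1}


class CountingConfig:
    """Stand-in config that counts reads; a read models a possible warning."""

    def __init__(self, parallel):
        self._params = {'parallel': parallel}
        self.reads = 0

    def __getitem__(self, key):
        self.reads += 1
        return self._params[key]

    def __setitem__(self, key, value):
        self._params[key] = value


class FakeBlock:
    """Stand-in for an easyblock instance; NOT the EasyBuild class."""

    def __init__(self, name, parallel):
        self.name = name
        self.cfg = CountingConfig(parallel)


def post_ready_hook_sketch(block):
    # Query the parallelism only for software that has a limit configured.
    if block.name in PARALLELISM_LIMITS:
        parallel = block.cfg['parallel']
        limit = PARALLELISM_LIMITS[block.name]
        # None means "no explicit setting"; the limit still applies.
        new_parallel = limit if parallel is None else min(parallel, limit)
        if new_parallel != parallel:
            block.cfg['parallel'] = new_parallel


unaffected = FakeBlock('zlib', 8)
post_ready_hook_sketch(unaffected)
reads_for_unaffected = unaffected.cfg.reads  # config never queried

limited = FakeBlock('TensorFlow', 8)
post_ready_hook_sketch(limited)

unset = FakeBlock('TensorFlow', None)
post_ready_hook_sketch(unset)
```

With this layout, software without an entry in the table never touches the config, so any warning tied to the lookup would only show up for builds the hook actually adjusts.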

Member

I think getattr(self, 'parallel', self.cfg['parallel']) will work for both scenarios without the need for a version check.
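One detail worth checking with this form: Python evaluates the default argument of `getattr` before the call, so `self.cfg['parallel']` is still looked up even when `self.parallel` exists. The sketch below uses hypothetical stand-ins (`NoisyConfig`, `WithAttr`, `WithoutAttr`), where a counted lookup models whatever side effect the config access has, such as a deprecation warning; it is not EasyBuild code.

```python
class NoisyConfig:
    """Hypothetical config that records every lookup of 'parallel'."""

    def __init__(self, value):
        self._value = value
        self.lookups = 0

    def __getitem__(self, key):
        if key != 'parallel':
            raise KeyError(key)
        self.lookups += 1  # models a side effect such as a warning
        return self._value


class WithAttr:
    """Stand-in block that has a 'parallel' attribute."""

    def __init__(self, cfg):
        self.cfg = cfg
        self.parallel = 4


class WithoutAttr:
    """Stand-in block without a 'parallel' attribute."""

    def __init__(self, cfg):
        self.cfg = cfg


def read_parallel_eager(block):
    # The default expression is evaluated before getattr() runs.
    return getattr(block, 'parallel', block.cfg['parallel'])


def read_parallel_lazy(block):
    # Only touches the config when the attribute is missing.
    if hasattr(block, 'parallel'):
        return block.parallel
    return block.cfg['parallel']


a = WithAttr(NoisyConfig(16))
eager_value = read_parallel_eager(a)
eager_lookups = a.cfg.lookups

b = WithAttr(NoisyConfig(16))
lazy_value = read_parallel_lazy(b)
lazy_lookups = b.cfg.lookups

c = WithoutAttr(NoisyConfig(16))
fallback_value = read_parallel_lazy(c)
```

Both helpers return the attribute when it exists, but only the `hasattr`-based variant avoids the config lookup in that case.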

@boegel
Contributor

boegel commented Jun 20, 2025

This is now being fixed via EESSI/software-layer-scripts#17, so let's close?

@trz42
Collaborator Author

trz42 commented Jun 20, 2025

Closing as suggested because the issue is handled via EESSI/software-layer-scripts#17

@trz42 trz42 closed this Jun 20, 2025
Labels
2023.06-software.eessi.io (2023.06 version of software.eessi.io), bug (Something isn't working), ready-to-review

4 participants