@adrienbernede
Member

@adrienbernede adrienbernede commented Oct 1, 2025

This PR adds testing on Matrix.

Note that this is working and all checks pass. However, we'll hold off on merging until we discuss this as a team. Access to machine resources may make this testing addition impractical.

TODO:


  # Map CPU core allocations
- declare -A core_counts=(["lassen"]=40 ["poodle"]=28 ["dane"]=28 ["corona"]=32 ["rzansel"]=48 ["tioga"]=32 ["tuolumne"]=48)
+ declare -A core_counts=(["lassen"]=40 ["poodle"]=28 ["dane"]=28 ["corona"]=32 ["rzansel"]=48 ["tioga"]=32 ["tuolumne"]=48 ["matrix"]=48)
@pguthrey pguthrey Oct 11, 2025

https://hpc.llnl.gov/hardware/compute-platforms says 112, which is what I have been using

Member

Yes, I know how many cores a node has on the machine. The value is set lower than that because compilation fails frequently if you run a parallel make with all cores. We do this on other platforms as well.
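
For readers following the thread, here is a minimal sketch of how a capped per-machine value like this could feed a parallel build. The `build_jobs` helper and the fallback of 8 are hypothetical and not part of this PR; only the `core_counts` map is taken from the diff. It requires bash 4 or newer for associative arrays.

```bash
#!/usr/bin/env bash
# Map copied from the diff in this PR: values are build-parallelism caps,
# deliberately lower than the physical core count of each node.
declare -A core_counts=(["lassen"]=40 ["poodle"]=28 ["dane"]=28 ["corona"]=32 ["rzansel"]=48 ["tioga"]=32 ["tuolumne"]=48 ["matrix"]=48)

# Hypothetical helper (not in the PR): print the cap for a machine,
# falling back to an assumed conservative default for unknown machines.
build_jobs() {
    local machine="$1"
    local fallback=8   # assumption for illustration only
    echo "${core_counts[$machine]:-$fallback}"
}

# Print the build command that would be used on matrix.
echo "make -j $(build_jobs matrix)"
```

On matrix this prints `make -j 48`, well under the 112 cores listed for the node.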

Comment on lines +26 to +33
# Arguments for top level allocation
MATRIX_SHARED_ALLOC: "--exclusive --time=60 --nodes=1"
# Arguments for job level allocation
MATRIX_JOB_ALLOC: "--nodes=1"
# Project specific variants for matrix
PROJECT_MATRIX_VARIANTS: "~shared +cuda cuda_arch=75 +tests"
# Project specific deps for matrix
PROJECT_MATRIX_DEPS:

You might need to specify the number of GPUs (possibly unique to matrix). I use

srun -n1 -p pdebug --gres=gpu:4 --exclusive ...

and

  LLNL_MATRIX_SLURM_SCHEDULER_PARAMETERS:
    value: "--nodes=1 --ntasks-per-node=1 --gres=gpu:4 --time=00:20:00 --cpus-per-task=112 -p pdebug --exclusive"
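
As a rough sketch of that suggestion applied to the variables under review: the GPU request could be appended to the existing allocation strings. Whether `--gres=gpu:4` belongs on the shared allocation, the job allocation, or both is an assumption here and is not settled in this thread. The script only prints the resulting values and does not call Slurm.

```bash
#!/usr/bin/env bash
# Values copied from the snippet under review.
MATRIX_SHARED_ALLOC="--exclusive --time=60 --nodes=1"
MATRIX_JOB_ALLOC="--nodes=1"

# GPU request taken from the srun example in the comment above.
GPU_REQUEST="--gres=gpu:4"

# Assumption: append the GPU request to both allocation strings.
shared_with_gpus="${MATRIX_SHARED_ALLOC} ${GPU_REQUEST}"
job_with_gpus="${MATRIX_JOB_ALLOC} ${GPU_REQUEST}"

# Print the values in the same YAML-like form used by the CI config.
echo "MATRIX_SHARED_ALLOC: \"${shared_with_gpus}\""
echo "MATRIX_JOB_ALLOC: \"${job_with_gpus}\""
```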

@rhornung67
Member

@adrienbernede this is all working now when I test locally. However, it is almost impossible to get an allocation on matrix; the job times out waiting for one almost every time. I'm going to pursue this to see what can be done.

@adayton1
Member

@rhornung67, is this still a WIP or is it ready for review?

@rhornung67
Member

@adayton1 please review if you want to. I want to discuss with the team whether it makes sense to merge this since our priority on matrix is very low.


5 participants