[WIP] add matrix #1923
base: develop
# Map CPU core allocations
- declare -A core_counts=(["lassen"]=40 ["poodle"]=28 ["dane"]=28 ["corona"]=32 ["rzansel"]=48 ["tioga"]=32 ["tuolumne"]=48)
+ declare -A core_counts=(["lassen"]=40 ["poodle"]=28 ["dane"]=28 ["corona"]=32 ["rzansel"]=48 ["tioga"]=32 ["tuolumne"]=48 ["matrix"]=48)
https://hpc.llnl.gov/hardware/compute-platforms says 112, which is what I have been using
Yes, I know how many cores a node has on the machine. It is set to use fewer than that because compilation frequently fails if you run parallel make with all cores. We do this on other platforms as well.
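As a rough illustration of the point above, the capped value from the map can be looked up per machine and passed to `make -j` instead of the full hardware core count. This is a minimal sketch, not the project's actual build script: `pick_jobs` and its fallback value of 4 are hypothetical names chosen for illustration, and associative arrays require bash 4 or newer.

```shell
#!/usr/bin/env bash
# Sketch only: core_counts mirrors the map in this PR; pick_jobs is a hypothetical helper.
declare -A core_counts=(["lassen"]=40 ["poodle"]=28 ["dane"]=28 ["corona"]=32 ["rzansel"]=48 ["tioga"]=32 ["tuolumne"]=48 ["matrix"]=48)

# Print the capped job count for a host, or a conservative
# fallback (assumed default: 4) when the host is not in the map.
pick_jobs() {
  local host="$1"
  local fallback="${2:-4}"
  echo "${core_counts[$host]:-$fallback}"
}

# Show the make invocation that would be used on matrix.
jobs_for_matrix="$(pick_jobs matrix)"
echo "make -j ${jobs_for_matrix}"
```

Keeping the cap in one map means a machine's build parallelism can be tuned without touching the rest of the script.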
# Arguments for top level allocation
MATRIX_SHARED_ALLOC: "--exclusive --time=60 --nodes=1"
# Arguments for job level allocation
MATRIX_JOB_ALLOC: "--nodes=1"
# Project specific variants for matrix
PROJECT_MATRIX_VARIANTS: "~shared +cuda cuda_arch=75 +tests"
# Project specific deps for matrix
PROJECT_MATRIX_DEPS:
You might need to specify the number of GPUs (possibly unique to matrix). I use
srun -n1 -p pdebug --gres=gpu:4 --exclusive ...
and
LLNL_MATRIX_SLURM_SCHEDULER_PARAMETERS:
value: "--nodes=1 --ntasks-per-node=1 --gres=gpu:4 --time=00:20:00 --cpus-per-task=112 -p pdebug --exclusive"
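One way the GPU suggestion could be folded into the job-level allocation is sketched below. This is an assumption, not something the PR confirms: the flags are copied from the review comments, while `gpus_per_node` and `job_alloc` are hypothetical names used only for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: flags come from the review comments; variable names are hypothetical.
gpus_per_node=4   # value used in the reviewer's srun example
job_alloc=(--nodes=1 "--gres=gpu:${gpus_per_node}")

# Join the array into the single-string form used by the *_ALLOC variables.
MATRIX_JOB_ALLOC="${job_alloc[*]}"
echo "MATRIX_JOB_ALLOC: \"${MATRIX_JOB_ALLOC}\""
```

Building the flags as an array first keeps each option a separate word, which avoids quoting surprises if an option is later added or removed.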
@adrienbernede this is all working now when I test locally. However, it is almost impossible to get an allocation on matrix and it times out waiting for an allocation almost every time. I'm going to pursue this to see what can be done.

@rhornung67, is this still a WIP or is it ready for review?

@adayton1 please review if you want to. I want to discuss with the team whether it makes sense to merge this since our priority on matrix is very low.
This PR adds testing on Matrix.
Note that this is working and all checks pass. However, we'll hold off on merging until we discuss this as a team. Access to machine resources may make this testing addition impractical.
TODO: