Description
I successfully installed legate-boost on Perlmutter and can run workloads on multiple GPUs within a single node as well as on CPU. However, when I attempt to run a multi-node job through Slurm, it fails immediately with an OMPI/PMIx error (full logs below). I could not find any guidance in the documentation on multi-node setup or troubleshooting.
The documentation covers the multi-node flags (e.g. --nodes, --launcher) (docs.nvidia.com) and the UCX/GASNet requirements, but it does not explicitly address errors caused by an MPI build that lacks PMIx support.
Request for Help
- Could you clarify the recommended launcher and dependencies for running multi-node, multi-GPU jobs with Legate Boost on Slurm clusters (specifically Perlmutter)?
- Does the conda-based installation via the gex channel (as above) support multi-node out of the box, or is a custom build with install.py --network gasnet1 required?
- Any instructions or examples for multi-node scheduling would be extremely helpful; a sketch of the kind of batch script I was expecting to work is included below.
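For concreteness, this is roughly the batch script I was expecting to work on Perlmutter (the account, qos, and time values are placeholders, and I am not sure whether --launcher srun is the right choice given the error below):

#!/bin/bash
#SBATCH --constraint=gpu
#SBATCH --account=<account>       # placeholder
#SBATCH --qos=regular             # placeholder
#SBATCH --nodes=2
#SBATCH --gpus-per-node=4
#SBATCH --time=00:30:00

module load conda
conda activate legate-boost-env

# Same invocation as the interactive attempt shown in the error section
legate --nodes 2 --launcher srun --gpus 4 --ranks-per-node 1 ./housing.py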
Thank you for your time and support!
Installation steps
I used the same environment setup commands I had used successfully for other Legate tasks.
$ module load conda
$ conda create -n legate-boost-env python=3.10
$ conda activate legate-boost-env
$ conda install \
    -c legate/label/gex-experimental \
    -c legate/label/experimental \
    -c legate \
    -c conda-forge \
    -c nvidia \
    'legate=*=*gex*' \
    cupynumeric \
    legate-mpi-wrapper \
    realm-gex-wrapper
$ conda install -c conda-forge 'cmake>=3.22.1'
$ /global/homes/n/ngraddon/.conda/envs/legate-boost-env/mpi-wrapper/build-mpi-wrapper.sh
$ /global/homes/n/ngraddon/.conda/envs/legate-boost-env/gex-wrapper/build-gex-wrapper.sh
$ conda install \
    -c legate \
    -c conda-forge \
    -c nvidia \
    legate-boost
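In case it is relevant to the error below, this is how I would check which MPI runtime the environment actually resolves (I am assuming the conda packages pull in Open MPI; these are standard shell/Open MPI commands, not legate-specific ones):

$ which mpirun
$ mpirun --version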
Working commands:
$ legate --cpu 1 --gpus 0 ./housing.py
$ legate --gpus 4 ./housing.py
Multi-Node Error:
$ legate --nodes 2 --launcher srun --gpus 4 --ranks-per-node 1 ./housing.py
Error message:
[nid001217:1593825] OPAL ERROR: Unreachable in file pmix3x_client.c at line 111
The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM's PMI support and therefore cannot
execute. There are several options for building PMI support under
SLURM, depending upon the SLURM version you are using:
version 16.05 or later: you can use SLURM's PMIx support. This
requires that you configure and build SLURM --with-pmix
Versions earlier than 16.05: you must use either SLURM's PMI-1 or
PMI-2 support. SLURM builds PMI-1 by default, or you can manually
install PMI-2. You must then build Open MPI using --with-pmi pointing
to the SLURM PMI library location.
Please configure as appropriate and try again.
--------------------------------------------------------------------------
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[nid001217:1593825] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
srun: error: nid001217: task 0: Exited with exit code 1
srun: Terminating StepId=40531224.0
[nid001220:1156306] OPAL ERROR: Unreachable in file pmix3x_client.c at line 111
--------------------------------------------------------------------------
The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM's PMI support and therefore cannot
execute. There are several options for building PMI support under
SLURM, depending upon the SLURM version you are using:
version 16.05 or later: you can use SLURM's PMIx support. This
requires that you configure and build SLURM --with-pmix
Versions earlier than 16.05: you must use either SLURM's PMI-1 or
PMI-2 support. SLURM builds PMI-1 by default, or you can manually
install PMI-2. You must then build Open MPI using --with-pmi pointing
to the SLURM PMI library location.
Please configure as appropriate and try again.
--------------------------------------------------------------------------
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[nid001220:1156306] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
srun: error: nid001220: task 1: Exited with exit code 1
(legate-boost-env) nihaal@nid001217:~> exit
exit
srun: error: nid001217: task 0: Exited with exit code 1
srun: Terminating StepId=40531224.interactive
salloc: Relinquishing job allocation 40531224
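If it helps with debugging, I am happy to run further diagnostics in the environment. For example (assuming the environment's MPI is Open MPI and ompi_info is available), checking whether it was built with PMIx support and which MPI plugin types the system Slurm offers:

$ ompi_info | grep -i pmix
$ srun --mpi=list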