MPI Slurm interaction #1077
Replies: 13 comments 1 reply
-
Hello David, There are two main ways to launch MPI applications with Charliecloud, which we refer to as a "host" launch and a "guest" launch. A host launch is where the parallel launcher on the host is used to launch multiple containers (which usually join together into a shared namespace). A guest launch is where the parallel launcher inside the container is used to launch the application. They take the following forms:
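A minimal sketch of the two forms, assuming an image directory `./my_image` that contains an MPI program `/app` (both names are placeholders):

```sh
# Host launch: Slurm's srun starts one container per rank; the ranks
# wire up through the host-provided PMI.
srun ch-run ./my_image -- /app

# Guest launch: one container, whose internal mpirun/mpiexec starts
# the ranks (normally limited to the current node).
ch-run ./my_image -- mpirun -np 2 /app
```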
A limitation of the guest launch approach is that it is usually single-node only, because the parallel launcher within the container doesn't know how to launch containers on other nodes. As for your error, I believe what you are running into is that the MPI install within the container sees Slurm variables in your environment and therefore thinks it can use Slurm mechanisms to launch processes. To work around this we suggest hiding those Slurm variables from the container when doing a guest launch. That being said, you mentioned you want to run across several nodes, so I would recommend a host launch. Using the example you provided, this would look something like the host launch in the sketch below. Let me know if this does/doesn't help 😃
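A hedged sketch of both suggestions, using the image and executable names from the original question; `--unset-env` and the exact glob are assumptions about the installed ch-run version, so check `ch-run --help` first:

```sh
# Guest-launch workaround (single node): hide Slurm's variables so the
# container's MPI falls back to its own process starter.
ch-run --unset-env='SLURM*' -w ./test_mpi_image -- mpiexec -n 2 /ALPACA

# Host launch (multi-node): let Slurm start one container per rank.
srun -n 2 ch-run -w ./test_mpi_image -- /ALPACA
```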
-
Hi Heasterday, I am unable to execute srun correctly due to the host system's Slurm/Munge configuration. The application sort of completes, but a lot of Slurm, Munge and MPI error messages are generated, so I'm not confident the results are valid. I normally execute `mpiexec -n 2 ch-run -w ./test_mpi_image -- /executable`, but the problem is that if I ask for e.g. 2 ranks, my application is run twice with a single rank each, i.e. I get twice the output folders/data rather than one twice-as-fast application. I need MPI to spread my application across different nodes, not just for compute parallelism but also because of the RAM requirements. Any suggestions on how to distribute my containerized application across multiple nodes? David
-
Do you believe the Slurm/Munge/MPI errors are an incompatibility between the container MPI and the host MPI, or are they present for non-containerized applications as well? If it's the former, I may be able to give you some things to try. Using mpirun/mpiexec to launch a container on every node is more complex, because you won't have the PMI compatibility layer, so the host MPI install will need to be as close to the container install as possible. What outcome do you get if you launch with the following command line: […] Could you point me to a Dockerfile for how you built MPI for your image?
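Since an mpirun/mpiexec host launch needs the host and container MPI installs to match closely, one quick way to compare them is to check both versions. A hedged sketch, assuming mpirun is on the PATH both on the host and inside the image, with the image path taken from earlier in the thread:

```sh
# Compare MPI versions on the host and inside the container; for an
# mpirun-based host launch these should match as closely as possible.
mpirun --version
ch-run ./test_mpi_image -- mpirun --version
```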
-
The original Dockerfile that my colleague created, which I modified to include libpmi, starts with `FROM ubuntu:latest AS buildstage` […]. I was able to execute ch-run ... mpiexec ... successfully without Slurm, but when I tried it on a system with Slurm using salloc, I got lots of Slurm errors. David
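A reconstruction of the two situations described, as a hedged illustration rather than the exact commands used (image and executable paths are carried over from the original question):

```sh
# Works: guest launch on a machine where Slurm is not involved.
ch-run -w ./test_mpi_image -- mpiexec -n 2 /ALPACA

# Fails with Slurm errors: the same guest launch run from inside a Slurm
# allocation, where SLURM_* variables are present in the environment.
salloc -N 1          # interactive allocation
ch-run -w ./test_mpi_image -- mpiexec -n 2 /ALPACA
```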
-
Thank you for the example Dockerfile; I will build a simple MPI app with it and see what it takes to run on our systems. Something to note on our example Dockerfile: it assumes that our CentOS 8 Dockerfile is being used as its base. The big things we do in that image are installing general dependencies and adding to the linker's search path. I will let you know what I find from testing with your Dockerfile.
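For context, "adding to the linker's search path" usually amounts to something like the following inside the image build; the path shown is only an example, not necessarily the one their base image uses:

```sh
# Make libraries installed under /usr/local/lib findable at run time.
echo /usr/local/lib > /etc/ld.so.conf.d/usr-local.conf
ldconfig
```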
-
Using the provided Dockerfile I was able to build Intel's IMB benchmark and run it across two nodes on our Slurm cluster. Some things to note: […]
David, please look at the errors I provided and their workarounds and let me know if they are relevant to your environment. NOTE: I didn't evaluate performance at all, because that is likely very site-dependent.
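For reference, a two-node host launch of the IMB binary would look roughly like this; the image name, the install path of IMB-MPI1 inside it, and the srun options are all assumptions:

```sh
# One rank per node across two nodes, running the PingPong benchmark.
srun -N 2 --ntasks-per-node=1 ch-run ./imb_image -- /usr/local/bin/IMB-MPI1 PingPong
```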
-
I couldn't resolve the Slurm issues with OpenMPI, so I tried MPICH (Intel MPI) and no longer get Slurm errors. However, I am now getting errors related to the bootstrap proxies. I am using the system version of MPI via binding, and I get the same problem even if I execute `mpiexec -n 2 ch-run -w image_mpich -- /executable`.
Do I need to install the InfiniBand drivers inside the container? David
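To illustrate what "using the system version of MPI via binding" typically involves with ch-run; the host Intel MPI install prefix below is a guess and should be replaced with the path your module actually provides, and depending on the ch-run version the destination directory may need to exist in the image first:

```sh
# Bind the host's Intel MPI install into the image at the same path, so the
# bound-in mpiexec and its libraries can find each other.
ch-run -b /opt/intel/impi/2019.8.254:/opt/intel/impi/2019.8.254 \
       -w image_mpich -- /executable
```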
-
Could I get a copy of the Slurm errors you were getting so I can look into them? Also, were these errors generated using our example OpenMPI base or the Dockerfile you provided? I would be very interested in the errors from both for comparison. Typically we recommend, where possible, building the MPI install in the container with the desired communication library (UCX, libfabric, etc.) and all its dependencies. My guess is that something required by the libraries you are binding in is missing. A shot in the dark: would it be possible for me to get a guest account on some platform with a similar configuration? The thought is that I could then test whether something needs to be done differently at build/runtime for an image in your environment vs ours.
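One hedged way to test the guess that something required by the bound-in libraries is missing: bind the host library directory into the image and run ldd from inside the container. The bind destination and the library soname below are assumptions:

```sh
# Bind the host's /usr/lib64 to a scratch mount point in the image and check
# whether the bound-in MPI library has unresolved dependencies there.
ch-run -b /usr/lib64:/mnt/0 image_mpich -- ldd /mnt/0/libmpi.so.12 | grep 'not found'
```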
-
The OpenMPI version gave errors about being unable to find libpmi.so.1, which on the host is located in /usr/lib64; normally I would bind that directory into the container. I could create a new directory on the host and populate it with symlinks, but I don't want to go down that path if possible. Unfortunately, I can't provide access to the system. The system's preferred MPI is Intel MPI and OpenMPI isn't well supported, so I want to focus on MPICH (Intel MPI versions 2019.7.217 and 2019.8.254). Also, UCX isn't installed on the system; it uses the I_MPI_* environment settings, with the defaults coming from the module system. I've tried setting FI_PROVIDER=tcp and I_MPI_FABRICS=tcp and still get:
[mpiexec@i22r07c05s06] check_exit_codes (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:117): unable to run bstrap_proxy on i22r07c05s06 (pid 20924, exit code 65280)
I will try it on another system and get back to you. David
-
For clarification: if I execute […]. As I am able to successfully execute […]. Is this correct?
-
You could bind in libpmi from the host, but I would recommend the container image having it already. On that note, the recommended way to inject a host install of libraries is […]. Re: […]
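For what it's worth, Charliecloud ships a ch-fromhost helper for copying host files such as libraries into an image; whether that is the tool being recommended above isn't stated here, and option names vary by version, so treat this as a sketch and verify against `ch-fromhost --help`:

```sh
# Copy the host's libpmi into the image (the --path option and the
# library location are assumptions; verify before use).
ch-fromhost --path /usr/lib64/libpmi.so.1 ./test_mpi_image
```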
-
We're trying out the new “Discussions” feature, so I am going to move this thread to that section. Please LMK if anything goes wrong.
-
When I try to srun on the cluster I get an error. Can anyone help me rectify the issue?
-
Hi,
I've been trying to execute an MPI program built with OpenMPI across several nodes of an HPC system running Slurm (salloc), from within a container, using the following command:
`ch-run -w ./test_mpi_image -- mpiexec -n 2 /ALPACA`
(This command executes successfully when not using Slurm)
And get the following error:
The SLURM process starter for OpenMPI was unable to locate a
usable "srun" command in its path. Please check your path
and try again.
An internal error has occurred in ORTE:
[[56714,0],0] FORCE-TERMINATE AT (null):1 - error plm_slurm_module.c(471)
This is something that should be reported to the developers.
Do you have any documentation on how to configure Slurm to avoid this error?
Would binding the system's Slurm executables and libraries into the container using the -b option resolve this issue?
As I am running on a production system, I am limited in what experimentation I can do.
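A small, hedged diagnostic for this specific error: check whether srun is visible inside the container and which SLURM_* variables leak in from the salloc environment, since Open MPI selects its Slurm launcher based on those variables:

```sh
# If SLURM_* variables are present but srun is not on the container's PATH,
# Open MPI picks its Slurm launcher and then cannot find srun.
ch-run ./test_mpi_image -- sh -c 'command -v srun; env | grep ^SLURM'
```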