When using Distributed on a Slurm cluster, I am unable to connect to the workers. The reason is that SlurmClusterManager requires the output from `start_worker` to be in the form `julia_worker:PORT#IP.IP.IP.IP`. However, when using `srun` to launch workers across the allocated resources, this is what I get:
┌ Debug: srun command: `srun -D /home/affans /home/affans/.julia/juliaup/julia-1.10.3+0.x64.linux.gnu/bin/julia --worker`
└ @ SlurmClusterManager REPL[7]:25
julia_worker:julia_worker:julia_worker:julia_worker:julia_worker:julia_worker:julia_worker:julia_worker:julia_worker:julia_worker:9427#172.16.1.26
9431#172.1.1.26
9425#172.1.1.26
9428#172.1.1.26
9430#172.1.1.26
9429#172.1.1.26
9423#172.1.1.26
9432#172.1.1.26
9426#172.1.1.26
9424#172.1.1.26
So somehow the print statements (to stdout) are racing with each other? I asked for 10 workers here, and all 10 `julia_worker:` prefixes were printed on the same line, followed by the `PORT#IP` parts on separate lines. Here is another example of the output:
[ Info: Worker 1 output: julia_worker:julia_worker:julia_worker:julia_worker:julia_worker:julia_worker:julia_worker:julia_wor$
[ Info: Worker 2 output: julia_worker:julia_worker:julia_worker:julia_worker:julia_worker:julia_worker:julia_worker:julia_wor$
[ Info: Worker 3 output: julia_worker:julia_worker:julia_worker:julia_worker:julia_worker:julia_worker:julia_worker:julia_wor$
[ Info: Worker 4 output: 9369#172.16.1.41
[ Info: Worker 5 output: julia_worker:julia_worker:julia_worker:julia_worker:julia_worker:julia_worker:julia_worker:julia_wor$
[ Info: Worker 6 output: 9361#172.16.1.41
[ Info: Worker 7 output: julia_worker:9360#172.16.1.41
[ Info: Worker 8 output: 9365#172.16.1.41
Any idea what could be causing this?
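One way to tolerate this kind of interleaving would be to ignore the `julia_worker:` prefix entirely and pull each `PORT#IP` pair out of the combined output with a regular expression. The following is only a minimal sketch of that idea, not how SlurmClusterManager or Distributed actually parse worker output. `parse_worker_endpoints` is a hypothetical helper name, the sample text is abbreviated from the first log above, and the sketch assumes that each `PORT#IP` fragment itself arrives intact.

```julia
# Hypothetical helper (not part of SlurmClusterManager or Distributed):
# recover (port, host) pairs from worker startup output even when the
# `julia_worker:` prefixes of several workers are interleaved.
function parse_worker_endpoints(output::AbstractString)
    endpoints = Tuple{Int,String}[]
    # Match "PORT#A.B.C.D" anywhere in the text, regardless of what precedes it.
    for m in eachmatch(r"(\d+)#(\d{1,3}(?:\.\d{1,3}){3})", output)
        port = parse(Int, m.captures[1])
        host = String(m.captures[2])
        push!(endpoints, (port, host))
    end
    return endpoints
end

# Abbreviated sample taken from the srun output shown earlier.
garbled = """
julia_worker:julia_worker:julia_worker:9427#172.16.1.26
9431#172.1.1.26
9425#172.1.1.26
"""

endpoints = parse_worker_endpoints(garbled)
for (port, host) in endpoints
    println("worker listening on ", host, ":", port)
end
```

This only helps if the parsing side sees the combined stream. When each line is read and validated separately, as in the per-worker `Info` output, a line such as `9369#172.16.1.41` would still have to be accepted without its prefix.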
Version info:
julia> versioninfo()
Julia Version 1.10.3
Commit 0b4590a5507 (2024-04-30 10:59 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 32 × Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-15.0.7 (ORCJIT, broadwell)
Threads: 32 default, 0 interactive, 16 GC (on 32 virtual cores)
Environment:
LD_LIBRARY_PATH = /cm/shared/apps/slurm/16.05.8/lib64/slurm:/cm/shared/apps/slurm/16.05.8/lib64:/cm/shared/apps/openmpi/gcc/64/1.10.1/lib64
JULIA_NUM_THREADS = 32
LD_RUN_PATH = /cm/shared/apps/openmpi/gcc/64/1.10.1/lib64
julia> Distributed.VERSION
v"1.10.3"