This works on Julia v1.6.3 but fails on v1.7.0-rc1 and nightly on a Slurm cluster (using ClusterManagers v0.4.2). The main Julia script, named slurmtrial.jl:
using Distributed, ClusterManagers

# Launch one worker per Slurm task (56 in the job script below)
addprocs_slurm(parse(Int, ENV["SLURM_NTASKS"]));

# Activate the current environment on every worker
@everywhere begin
    using Pkg
    Pkg.activate(Base.dirname(Base.active_project()))
end

rmprocs.(workers())
The job script I use to submit this (change the Julia path and the output file names to run the same code on a different Julia version):
#!/bin/bash
#SBATCH --time="10"
#SBATCH --job-name=test
#SBATCH -o test18.out
#SBATCH -e test18.err
#SBATCH --ntasks=56
cd $SCRATCH/jobs
julia18="$SCRATCH/julia/julia-82d8a36491/bin/julia"
julia17="$SCRATCH/julia/julia-1.7.0-rc1/bin/julia"
julia16="$SCRATCH/julia/julia-1.6.3/bin/julia"
$julia18 -e 'include("$(ENV["HOME"])/slurmtrial.jl")'
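The job script is submitted with sbatch in the usual way; the file name here is an assumption, since the report does not give one:
$ sbatch slurmtrial.sbatch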
I am using 2 nodes with 28 cores each, so a total of 56 workers. The error sometimes does not occur if I only use a few cores on one node (e.g. 2 cores).
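If the failure is purely a race between many workers activating the same shared environment at once, rather than anything Slurm-specific, it may also reproduce with plain local workers. A minimal sketch of that assumption (not something tested in this report), using the same pattern as slurmtrial.jl:

using Distributed
addprocs(56)    # plain local workers instead of addprocs_slurm
@everywhere begin
    using Pkg
    # every worker activates the same environment at roughly the same time
    Pkg.activate(Base.dirname(Base.active_project()))
end
rmprocs(workers())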
Output on v1.6 (this is the expected behavior):
$ cat test16.err
Activating environment at `/scratch/username/.julia/environments/v1.6/Project.toml`
Output on v1.7 and v1.8 (nightly):
$ cat test17.err
Activating project at `/scratch/username/.julia/environments/v1.7`
ERROR: LoadError: On worker 2:
IOError: unlink("/scratch/username/.julia/logs/manifest_usage.toml"): no such file or directory (ENOENT)
Stacktrace:
[1] uv_error
@ ./libuv.jl:97 [inlined]
[2] unlink
@ ./file.jl:958
[3] #rm#12
@ ./file.jl:276
[4] #checkfor_mv_cp_cptree#13
@ ./file.jl:323
[5] #mv#17
@ ./file.jl:411 [inlined]
[6] write_env_usage
@ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/Pkg/src/Types.jl:495
[7] EnvCache
@ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/Pkg/src/Types.jl:337
[8] EnvCache
@ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/Pkg/src/Types.jl:317 [inlined]
[9] add_snapshot_to_undo
@ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/Pkg/src/API.jl:1627
[10] add_snapshot_to_undo
@ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/Pkg/src/API.jl:1623 [inlined]
[11] #activate#282
@ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/Pkg/src/API.jl:1589
[12] activate
@ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/Pkg/src/API.jl:1552
[13] top-level scope
@ ~/slurmtrial.jl:5
[14] eval
@ ./boot.jl:373
[15] #103
@ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/Distributed/src/process_messages.jl:274
[16] run_work_thunk
@ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/Distributed/src/process_messages.jl:63
[17] run_work_thunk
@ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/Distributed/src/process_messages.jl:72
[18] #96
@ ./task.jl:411
...and 39 more exceptions.
Stacktrace:
[1] sync_end(c::Channel{Any})
@ Base ./task.jl:369
[2] macro expansion
@ ./task.jl:388 [inlined]
[3] remotecall_eval(m::Module, procs::Vector{Int64}, ex::Expr)
@ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/Distributed/src/macros.jl:223
[4] top-level scope
@ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/Distributed/src/macros.jl:207
[5] include(fname::String)
@ Base.MainInclude ./client.jl:451
[6] top-level scope
@ none:1
in expression starting at /home/username/slurmtrial.jl:3
Note that the number of exceptions raised here is 40, not 56 (this number varies from run to run).
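Reading the stack trace, the failing step is the mv inside Pkg's write_env_usage onto the shared /scratch/username/.julia/logs/manifest_usage.toml; since the destination is being removed (rm via checkfor_mv_cp_cptree), this is apparently mv with force=true, which checks that the destination exists and then unlinks it before renaming. With 56 workers doing this concurrently, a worker can lose the race between that existence check and the unlink and hit ENOENT. Below is a minimal sketch of that race with plain local workers; this is my reading of the trace, not a verified diagnosis, and the helper name clobber is made up:

using Distributed
addprocs(8)

# Hypothetical helper, not Pkg code: repeatedly move a fresh temp file onto a shared target.
@everywhere function clobber(target, iterations)
    for _ in 1:iterations
        tmp, io = mktemp()
        close(io)
        # mv with force=true sees that `target` exists and unlinks it before renaming;
        # another worker can unlink it in between, giving ENOENT as in the trace above
        mv(tmp, target; force=true)
    end
end

target = joinpath(mktempdir(), "manifest_usage.toml")
touch(target)

# May intermittently fail with IOError: unlink(...): no such file or directory (ENOENT)
@sync for p in workers()
    @async remotecall_wait(clobber, p, target, 1_000)
end

If that is what is happening, the variable exception count above is simply however many workers lost the race in a given run, and activating the environment on one worker at a time (or only on the master process) would presumably avoid the concurrent writes.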