
Activating environment on remote workers on a cluster fails on v1.7 (works on 1.6) #42405

@jishnub


This works on Julia v1.6.3 but fails on v1.7.0-rc1 and on nightly on a Slurm cluster (using ClusterManagers v0.4.2).

The main Julia script, named slurmtrial.jl:

using Distributed, ClusterManagers
addprocs_slurm(parse(Int, ENV["SLURM_NTASKS"]));
@everywhere begin
    using Pkg
    Pkg.activate(Base.dirname(Base.active_project()))
end
rmprocs.(workers())

The job script I use to submit this (change the Julia path and the output file names to run the same code on different Julia versions):

#!/bin/bash
#SBATCH --time="10"
#SBATCH --job-name=test
#SBATCH -o test18.out
#SBATCH -e test18.err
#SBATCH --ntasks=56

cd $SCRATCH/jobs
julia18="$SCRATCH/julia/julia-82d8a36491/bin/julia"
julia17="$SCRATCH/julia/julia-1.7.0-rc1/bin/julia"
julia16="$SCRATCH/julia/julia-1.6.3/bin/julia"
$julia18 -e 'include("$(ENV["HOME"])/slurmtrial.jl")'

I am using 2 nodes with 28 cores each, so 56 workers in total. The error sometimes does not occur if I use only a few cores on one node (e.g. 2 cores).
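One possible way to give the workers the master's environment without calling `Pkg.activate` on each of them is to start them with `--project` already set, through the `exeflags` keyword of `addprocs`. This is an untested sketch under assumptions: it assumes `addprocs_slurm` forwards `exeflags` to the Slurm launch command (ClusterManagers passes keyword arguments on to `addprocs`), and it uses a local `addprocs(2)` as a stand-in so that it runs without Slurm. The name `worker_projects` is illustrative only.

```julia
using Distributed

# Path of the master's active Project.toml,
# e.g. ~/.julia/environments/v1.7/Project.toml
proj = Base.active_project()

# Local stand-in for the cluster launch. On Slurm this would presumably be
#   addprocs_slurm(parse(Int, ENV["SLURM_NTASKS"]); exeflags = `--project=$proj`)
# (assumption: addprocs_slurm forwards exeflags to the worker command line).
addprocs(2; exeflags = `--project=$proj`)

# Each worker reports its active project; Pkg.activate is never called.
worker_projects = [remotecall_fetch(Base.active_project, p) for p in workers()]
println(worker_projects)
```

Since the project is fixed on the command line, the workers never go through `Pkg.activate`, which is where the failure below originates.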

Output on v1.6 (this is what is expected):

$ cat test16.err
  Activating environment at `/scratch/username/.julia/environments/v1.6/Project.toml`

Output on v1.7 and v1.8 (nightly):

$ cat test17.err
  Activating project at `/scratch/username/.julia/environments/v1.7`
ERROR: LoadError: On worker 2:
IOError: unlink("/scratch/username/.julia/logs/manifest_usage.toml"): no such file or directory (ENOENT)
Stacktrace:
  [1] uv_error
    @ ./libuv.jl:97 [inlined]
  [2] unlink
    @ ./file.jl:958
  [3] #rm#12
    @ ./file.jl:276
  [4] #checkfor_mv_cp_cptree#13
    @ ./file.jl:323
  [5] #mv#17
    @ ./file.jl:411 [inlined]
  [6] write_env_usage
    @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/Pkg/src/Types.jl:495
  [7] EnvCache
    @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/Pkg/src/Types.jl:337
  [8] EnvCache
    @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/Pkg/src/Types.jl:317 [inlined]
  [9] add_snapshot_to_undo
    @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/Pkg/src/API.jl:1627
 [10] add_snapshot_to_undo
    @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/Pkg/src/API.jl:1623 [inlined]
 [11] #activate#282
    @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/Pkg/src/API.jl:1589
 [12] activate
    @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/Pkg/src/API.jl:1552
 [13] top-level scope
    @ ~/slurmtrial.jl:5
 [14] eval
    @ ./boot.jl:373
 [15] #103
    @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/Distributed/src/process_messages.jl:274
 [16] run_work_thunk
    @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/Distributed/src/process_messages.jl:63
 [17] run_work_thunk
    @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/Distributed/src/process_messages.jl:72
 [18] #96
    @ ./task.jl:411

...and 39 more exceptions.

Stacktrace:
 [1] sync_end(c::Channel{Any})
   @ Base ./task.jl:369
 [2] macro expansion
   @ ./task.jl:388 [inlined]
 [3] remotecall_eval(m::Module, procs::Vector{Int64}, ex::Expr)
   @ Distributed /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/Distributed/src/macros.jl:223
 [4] top-level scope
   @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.7/Distributed/src/macros.jl:207
 [5] include(fname::String)
   @ Base.MainInclude ./client.jl:451
 [6] top-level scope
   @ none:1
in expression starting at /home/username/slurmtrial.jl:3

Note that the number of exceptions raised here is 40, not 56, and that this number varies between runs.
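The trace ends in `write_env_usage` replacing the depot's shared `manifest_usage.toml` via `mv`, which then fails in `unlink` with ENOENT. Together with the varying exception count and the dependence on worker count, this suggests (but does not prove) that the workers race on that file when they all call `Pkg.activate` at once through `@everywhere`. Under that assumption, a minimal workaround sketch is to activate the environment on one worker at a time. A local `addprocs(2)` stands in for `addprocs_slurm`, and how long serial activation takes for 56 workers is untested.

```julia
using Distributed

# Local stand-in for addprocs_slurm(parse(Int, ENV["SLURM_NTASKS"])).
addprocs(2)

# Pkg must be loaded on every worker so that Pkg.activate can be called there.
@everywhere using Pkg

projdir = dirname(Base.active_project())

# Activate on one worker at a time: remotecall_wait blocks until that worker
# has finished, so no two workers rewrite the usage log simultaneously.
for p in workers()
    remotecall_wait(Pkg.activate, p, projdir)
end

# Collect each worker's active project to check that activation took effect.
active = [remotecall_fetch(Base.active_project, p) for p in workers()]
println(active)
```

This keeps the original script's behaviour and only changes the order of activation, at the cost of activating the workers sequentially.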


Labels: parallelism (Parallel or distributed computation)
