Bug report
Nextflow with Azure Batch appears to fail when reading from multiple storage containers in the same workflow. Processes report the input files as not existing, and `.fusion.log` contains messages about 403 authentication errors. This is apparently similar to a previously fixed issue, but it persists in 24.10.0, so it may have a different cause.
See also the discussion on Slack.
Expected behavior and actual behavior
We expected that it was possible to read from multiple Azure containers in the same workflow; apparently it is not.
Steps to reproduce the problem
Here is a small workflow to illustrate the problem:
```nextflow
process multi {
    conda "conda-forge::gawk"

    input:
    path(p1)
    path(p2)

    output:
    path("both.txt")

    script:
    """
    cat ${p1} ${p2} > both.txt
    """
}

workflow {
    p1 = Channel.fromPath(params.p1)
    p2 = Channel.fromPath(params.p2)
    multi(p1, p2)
}
```
Running

```bash
nextflow run main.nf \
    -profile azure_batch \
    -w az://output/multi \
    --p1 az://input1/foo.txt \
    --p2 az://input2/bar.txt
```

fails, whereas

```bash
nextflow run main.nf \
    -profile azure_batch \
    -w az://output/multi \
    --p1 az://output/foo.txt \
    --p2 az://output/bar.txt
```

works fine.
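We have not yet tested whether credential scope is the culprit, but one possible workaround sketch would be to replace managed identity with a single account-level SAS token that grants access to all three containers (`azure.storage.sasToken` is a documented Nextflow option; the token value below is a placeholder):

```groovy
azure {
    storage {
        accountName = '[...]'
        // placeholder: an account-level SAS token with read/list access to
        // input1 and input2, plus write access for the output/work container
        sasToken = '<account-sas-token>'
    }
}
```

We cannot confirm that this avoids the 403, so it is only offered as a point of comparison for anyone reproducing the issue.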
The config in question, containing the `azure_batch` profile (with some redacted info):
```groovy
nextflow.enable.moduleBinaries = true

process {
    resourceLimits = [ cpus: 128, memory: 200.GB, time: 24.h ]
    errorStrategy = { task.exitStatus in [143, 137, 104, 134, 139] ? 'retry' : 'finish' }
    maxRetries = 1
    maxErrors = '-1'

    cpus = { 1 * task.attempt }
    memory = { 10.GB * task.attempt }
    time = { 12.h * task.attempt }
}

profiles {
    azure_batch {
        process {
            executor = 'azurebatch'
            machineType = "Standard_D2_v3,Standard_D4_v3,Standard_D8_v3,Standard_D16_v3,Standard_D32_v3"
        }
        managedIdentity {
            system = true
        }
        wave {
            enabled = true
            strategy = ['conda']
        }
        fusion {
            enabled = true
            exportStorageCredentials = true
        }
        azure {
            managedIdentity {
                system = true
            }
            storage {
                accountName = '[...]'
            }
            batch {
                location = '[...]'
                accountName = '[...]'
                autoPoolMode = true
                deletePoolsOnCompletion = true
                pools {
                    auto {
                        autoScale = true
                        vmCount = 1
                        maxVmCount = 100
                        virtualNetwork = '[...]'
                    }
                }
            }
        }
    }
}
```
Program output
Running nextflow prints:

```
executor > azurebatch (fusion enabled) (1)
[22/bfb52c] multi (1) [100%] 1 of 1, failed: 1
Execution cancelled -- Finishing pending tasks before exit
ERROR ~ Error executing process > 'multi (1)'

Caused by:
  The task exited with an exit code representing a failure

Command executed:

  cat foo.txt bar.txt > both.txt

Command exit status:
  1

Command output:
  (empty)

Command error:
  + cat foo.txt bar.txt
  cat: foo.txt: No such file or directory
  cat: bar.txt: No such file or directory

Work dir:
  [...]

Container:
  [...]

Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`

 -- Check '.nextflow.log' file for details
```
The `.nextflow.log` does not contain anything that stands out, whereas the `.fusion.log` contains:

```
RESPONSE 403: 403 Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature.
```
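To narrow this down, a variant of the reproducer where each task reads only one file might help (a sketch we have not run against this setup): if it succeeds while the two-input version fails, the 403 would only be triggered when a single task's credential has to span two containers.

```nextflow
process single {
    input:
    path(p)

    output:
    path("copy.txt")

    script:
    """
    cat ${p} > copy.txt
    """
}

workflow {
    // one task per file, so each task only touches one container
    single(Channel.fromPath(params.p1).mix(Channel.fromPath(params.p2)))
}
```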
Environment
- Nextflow version: 24.10.0
- Java version: 21.0.4
- Operating system: Ubuntu 24.04.1 LTS
- Bash version: fish 3.7.0/bash 5.2.21(1)
Additional context
We have not been able to verify whether the problem is Fusion-related. The pipeline still fails (with a similar but not identical error message) when running with `fusion.enabled = false`, but it has been difficult to diagnose whether this is the same issue or an unrelated problem with getting `azcopy` to where it needs to be during execution.
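For reference, the comparison run only flips the Fusion flag in the profile; with Fusion off, the Azure Batch executor stages `az://` inputs and outputs via `azcopy` on the compute nodes, which is where our uncertainty about the second failure mode comes from. A sketch of the override we used, assuming the same `azure_batch` profile as above:

```groovy
// assumed override for the non-Fusion comparison run; file staging
// then falls back to azcopy rather than a Fusion-mounted filesystem
fusion {
    enabled = false
}
```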