
Nextflow with Azure batch appears to fail when reading from multiple containers #5448

@zjupNN

Description


Bug report

Nextflow on Azure Batch appears to fail when a workflow reads from multiple Azure Blob Storage containers. Processes report the input files as nonexistent, and .fusion.log contains 403 authentication errors. This resembles a previously fixed issue, but it persists in 24.10.0, so the cause may be different.

See also the discussion on Slack.

Expected behavior and actual behavior

We expected to be able to read from multiple Azure containers in the same workflow; this does not appear to work.

Steps to reproduce the problem

Here is a small workflow to illustrate the problem:

process multi {
  conda "conda-forge::gawk"

  input:
  path(p1)
  path(p2)

  output:
  path("both.txt")

  """
  cat ${p1} ${p2} > both.txt
  """
}

workflow {
  p1 = Channel.fromPath(params.p1)
  p2 = Channel.fromPath(params.p2)
  multi(p1, p2)
}

Running

nextflow run main.nf \
  -profile azure_batch \
  -w az://output/multi \
  --p1 az://input1/foo.txt \
  --p2 az://input2/bar.txt

fails, whereas

nextflow run main.nf \
  -profile azure_batch \
  -w az://output/multi \
  --p1 az://output/foo.txt \
  --p2 az://output/bar.txt

works fine.
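The only difference between the two invocations is how many containers they touch. Nextflow's Azure Blob Storage paths have the form az://<container>/<path>, so the container is the first path segment. The sketch below lists the distinct containers used by each invocation. The helper names az_container and containers_used are hypothetical, for illustration only.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the container name of an az://<container>/<path> URI.
az_container() {
  local uri="$1"
  local rest="${uri#az://}"    # strip the az:// scheme
  printf '%s\n' "${rest%%/*}"  # keep everything before the first '/'
}

# Hypothetical helper: list the distinct containers referenced by the given URIs.
containers_used() {
  local uri
  for uri in "$@"; do
    az_container "$uri"
  done | sort -u
}

echo "failing run:"
containers_used az://output/multi az://input1/foo.txt az://input2/bar.txt
echo "working run:"
containers_used az://output/multi az://output/foo.txt az://output/bar.txt
```

For the failing run this prints input1, input2 and output; for the working run it prints only output, i.e. the failure appears once inputs come from containers other than the one holding the work directory.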

The configuration in question, including the azure_batch profile (some values redacted):

nextflow.enable.moduleBinaries = true

process {
    resourceLimits = [ cpus: 128, memory: 200.GB, time: 24.h ]

    errorStrategy = { task.exitStatus in [143, 137, 104, 134, 139] ? 'retry' : 'finish' }
    maxRetries = 1
    maxErrors = '-1'

    cpus = { 1 * task.attempt }
    memory = { 10.GB * task.attempt }
    time = { 12.h * task.attempt }
}

profiles {
  azure_batch {
    process {
      executor = 'azurebatch'
      machineType = "Standard_D2_v3,Standard_D4_v3,Standard_D8_v3,Standard_D16_v3,Standard_D32_v3"
    }

    managedIdentity {
      system = true
    }

    wave {
      enabled = true
      strategy = ['conda']
    }

    fusion {
      enabled = true
      exportStorageCredentials = true
    }

    azure {
      managedIdentity {
        system = true
      }

      storage {
        accountName = '[...]'
      }

      batch {
        location = '[...]'
        accountName = '[...]'

        autoPoolMode = true
        deletePoolsOnCompletion = true

        pools {
          auto {
            autoScale = true
            vmCount = 1
            maxVmCount = 100
            virtualNetwork = '[...]'
          }
        }
      }
    }
  }
}

Program output

Running nextflow prints:

executor >  azurebatch (fusion enabled) (1)
[22/bfb52c] multi (1) [100%] 1 of 1, failed: 1
Execution cancelled -- Finishing pending tasks before exit
ERROR ~ Error executing process > 'multi (1)'

Caused by:
  The task exited with an exit code representing a failure


Command executed:

  cat foo.txt bar.txt > both.txt

Command exit status:
  1

Command output:
  (empty)

Command error:
  + cat foo.txt bar.txt
  cat: foo.txt: No such file or directory
  cat: bar.txt: No such file or directory

Work dir:
  [...]

Container:
  [...]

Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`

-- Check '.nextflow.log' file for details

The .nextflow.log does not contain anything that stands out, whereas the .fusion.log contains:

RESPONSE 403: 403 Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature.
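To gauge how many requests were rejected, one can count the 403 responses in a local copy of the task's .fusion.log. This is a minimal sketch under stated assumptions: the count_403 name is hypothetical, the log is assumed to have been downloaded locally (the default path ./.fusion.log is also an assumption), and it matches only the "RESPONSE 403" text quoted above, which other Fusion versions may word differently.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count lines in a local copy of a Fusion log that
# report an HTTP 403. The "RESPONSE 403" marker is taken from the log
# excerpt in this issue; its exact wording is an assumption.
count_403() {
  local log="${1:-.fusion.log}"  # assumed default location of the local copy
  if [ ! -r "$log" ]; then
    echo "cannot read $log" >&2
    return 2
  fi
  # grep -c prints 0 and exits 1 when nothing matches; '|| true' keeps
  # a zero count from being treated as a failure.
  grep -c 'RESPONSE 403' "$log" || true
}

# Example: count_403 path/to/.fusion.log
```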

Environment

  • Nextflow version: 24.10.0
  • Java version: 21.0.4
  • Operating system: Ubuntu 24.04.1 LTS
  • Bash version: fish 3.7.0/bash 5.2.21(1)

Additional context

We have not been able to confirm whether the problem is related to Fusion. The pipeline still fails, with a similar but different error message, when run with fusion.enabled: false, but it has been hard to tell whether that is the same issue or an unrelated problem with making azcopy available where it is needed during task execution.
