feat: remove storage account name if present from azure path #4692


Open · wants to merge 3 commits into base: master

Conversation

endre-seqera

Closes: #4683

Remove the storage account name, if present, from the Azure path.

Tests

Tested using the nextflow-publishdir pipeline.

Before:

nextflow run https://github.com/endre-seqera/nextflow-publishdir -r main --outdir "az://nfazurestore.eend-test" #FAILS

ERROR ~ Error executing process > 'TEST_PUBLISH_DIR (3)'

Caused by:
  /nfazurestore.eend-test: Unable to determine if root directory exists


 -- Check '.nextflow.log' file for details

After:

 ./launch.sh run -c ~/azure-nextflow.config https://github.com/endre-seqera/nextflow-publishdir -r main --outdir "az://nfazurestore.eend-test"
[00/6cfd76] process > TEST_PUBLISH_DIR (3) [100%] 3 of 3 ✔
[Screenshot: 2024-01-26 at 14:32:03]


netlify bot commented Jan 26, 2024

Deploy Preview for nextflow-docs-staging canceled.

🔨 Latest commit: dd0e720
🔍 Latest deploy log: https://app.netlify.com/projects/nextflow-docs-staging/deploys/686c1044b77587000899ee3a

Signed-off-by: endre-seqera <endre.sukosd@seqera.io>
@endre-seqera force-pushed the publishdir-azure-storage-account-name-remove branch from aee2c56 to fcd676f on January 26, 2024 at 16:09
Signed-off-by: endre-seqera <endre.sukosd@seqera.io>
@endre-seqera force-pushed the publishdir-azure-storage-account-name-remove branch from f64509e to 4704d95 on January 26, 2024 at 16:48
@pditommaso
Member

Granted this may work, I'm a bit concerned about altering the container (aka bucket) name under the hood. This could result in Nextflow reporting a different object file path than the one specified by the user in the trace files, report, and other provenance records.

It may be worth adding full support for it.

@endre-seqera
Author

endre-seqera commented Jan 29, 2024

Granted this may work, I'm a bit concerned about altering the container (aka bucket) name under the hood. This could result in Nextflow reporting a different object file path than the one specified by the user in the trace files, report, and other provenance records.

I understand your concern, but the bucket (container) name is not altered; only the additional information (the storage account name) is removed. So the object file path will still be correct, just missing this extra information.

But this extra information is redundant in a way, because it is already known to the user: the user has to configure it with their Azure credentials in nextflow.config, so there should be no doubt about which storage account is being used:

 azure {
   storage {
     accountName = '<YOUR STORAGE ACCOUNT NAME>'
   }
 }

The "convention" used in Seqera Platform, "${providerSchema}://${storageName}.${bucket}", is arbitrary; it exists only to provide a unique path and differentiate between potentially duplicated bucket (container) names in different storage accounts (as in Data Explorer, where results from multiple credentials can show up).

But no one expects Azure paths in this format, and for a given, specific Nextflow configuration (or for one specific credential) this information (the storage account name) is not necessary.
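As a concrete illustration of the stripping described above, here is a minimal sketch in Java. The class and method names (`AzPathNormalizer`, `normalize`) are hypothetical, not the actual code in this PR; the account name would come from the `azure.storage.accountName` setting shown earlier.

```java
// Minimal sketch: remove a configured storage-account prefix from an
// az:// URI, so "az://ACCOUNT.CONTAINER/path" becomes "az://CONTAINER/path".
// Names are hypothetical; the real change lives in Nextflow's Azure
// path handling.
public class AzPathNormalizer {

    static String normalize(String path, String accountName) {
        String prefix = "az://" + accountName + ".";
        if (path.startsWith(prefix))
            return "az://" + path.substring(prefix.length());
        return path; // already in plain az://container/... form
    }

    public static void main(String[] args) {
        // "nfazurestore" stands in for the configured account name
        System.out.println(normalize("az://nfazurestore.eend-test/results", "nfazurestore"));
        // -> az://eend-test/results
        System.out.println(normalize("az://eend-test/results", "nfazurestore"));
        // -> unchanged: az://eend-test/results
    }
}
```

Note that only the leading account segment is touched; everything after the first "." up to the path separator is the container name and is preserved.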

@endre-seqera force-pushed the publishdir-azure-storage-account-name-remove branch from 456c4c3 to 4704d95 on February 2, 2024 at 10:01
@adamrtalbot
Collaborator

The "convention" used in Seqera Platform, "${providerSchema}://${storageName}.${bucket}", is arbitrary; it exists only to provide a unique path and differentiate between potentially duplicated bucket (container) names in different storage accounts (as in Data Explorer, where results from multiple credentials can show up).

But no one expects Azure paths in this format, and for a given, specific Nextflow configuration (or for one specific credential) this information (the storage account name) is not necessary.

Agreed. If we could at least click a file or directory in the Data Explorer interface to add it as input, that would make the platform compatible. Extending this to actually support multiple storage accounts would be the next step: #4445 (comment)

@eparisis

Hi there, this still seems to be an issue. I've had this problem when selecting my data in Seqera Platform from the Data Explorer. The path would be parsed as az://storageaccount.container/data in my pipeline, which would result in an ERROR ~ Status code 400, InvalidResourceName error. When entering the path manually as described in the docs (az://container/data), the pipeline works as expected. It took me a long time to figure out what was wrong.

@robnewman
Contributor

@endre-seqera @bentsherman @pditommaso Do we need to sit down and work through this to get to a resolution next week? Customers are encountering this and it would be good to resolve and merge. Thanks!

@pditommaso
Member

I'm still skeptical about this approach, especially considering that a future goal is to allow the use of multiple blob containers across different accounts; therefore the account name information should not be stripped from the URI.

I have a tentative implementation retaining the account name in the URI that I'll upload at my earliest convenience. It's also required to make sure both azcopy and Fusion are able to handle this change. @jordeu, any clue about the latter?

@jordeu
Collaborator

jordeu commented Jul 1, 2025

Right now, Fusion gets the storage account from an environment variable only. If necessary, we could find other solutions.

But I'd like to challenge the need to make the URI universal, carrying all the keys needed to identify a data source. There are other, similar scenarios where we need more pieces to identify a source, and right now Nextflow is not flexible enough.

Let me give an example: imagine we want to use the same pipeline with data that we have in local MinIO S3 storage, data in AWS S3, or data on LakeFS S3. Currently this is not possible with Nextflow, because we identify the "source" from the URI scheme (i.e. s3://...) and it is hardcoded.

What if it were possible to define our own "schemes" at the Nextflow config level? Something like:

sources {
    s3 {
        provider = aws
        config {
            region = eu-west1
            ...
        }
    }

    minio {
        provider = s3
        config {
            server = ...
            ...
        }
    }
}

Then all paths that start with s3://... would use the first source, and all paths that start with minio://... would use the second source.

This will allow other use cases, like overwriting the "source" location.

Example of general config:

sources {
     refgenomes {
           provider = aws 
           prefix = ngi-igenomes
           config {
                  ...
           }
     }
}

That, with a profile, could be overridden by a local copy of the "refgenomes" source:

sources {
     refgenomes {
             provider = s3
             prefix = my-igenomes-mirror
             config {
                   server = my-minio
                   ...
             }
      }
}

So we could always use the path refgenomes://homo_sapiens/.... independently of which replica of the source we are using.
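The scheme-to-source lookup proposed above could be sketched as follows. This is a hypothetical Java sketch under the assumptions of this comment: the `sources` block, `SourceResolver`, and the hard-coded map below are all illustrations of the proposal, not an existing Nextflow feature or API.

```java
import java.util.Map;

// Sketch of resolving a custom scheme such as "refgenomes://..." to its
// configured provider and prefix. SOURCES stands in for the proposed
// `sources` block in nextflow.config; nothing here is an existing API.
public class SourceResolver {

    record Source(String provider, String prefix) {}

    static final Map<String, Source> SOURCES = Map.of(
        "refgenomes", new Source("s3", "my-igenomes-mirror")
    );

    /** Rewrites "refgenomes://homo_sapiens/..." into
     *  "s3://my-igenomes-mirror/homo_sapiens/...". */
    static String resolve(String uri) {
        int sep = uri.indexOf("://");
        if (sep < 0) return uri;
        Source src = SOURCES.get(uri.substring(0, sep));
        if (src == null) return uri; // unknown scheme: use as-is
        return src.provider() + "://" + src.prefix() + "/" + uri.substring(sep + 3);
    }

    public static void main(String[] args) {
        System.out.println(resolve("refgenomes://homo_sapiens/GRCh38/fasta"));
        // -> s3://my-igenomes-mirror/homo_sapiens/GRCh38/fasta
    }
}
```

Swapping the map entry per profile is what makes the same refgenomes:// path point at either the AWS original or the local mirror.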

I think that this concept of "sources" maps very well to the concept of "data link" that is associated with the Seqera platform.

Overall, what I mean is that what we are trying to solve here for Azure is a more generic problem that we'll face in other use cases needing more flexibility, such as mixing multiple data sources in a single pipeline.

@pditommaso
Member

I think this is a good idea, but it's beyond the goal of this ticket, which consists of supporting the syntax az://ACCOUNT.BUCKET/path introduced by Seqera Platform.

Development

Successfully merging this pull request may close these issues.

publishDir compatibility with Azure Blob Storage
7 participants