
Data lineage tracking (aka CID store) #5715


Merged
merged 96 commits into from
Apr 23, 2025

Conversation

pditommaso
Member

@pditommaso pditommaso commented Jan 27, 2025

Tentative implementation for addressable data store (very basic POC so far).

Update on 1 Mar 2025 from #5787 by @jorgee

M1 Implementation of CID store for provenance

Changes:

  • CID store is specified by workflow.data.store.location
  • Workflow Hash is created based on the workflow and parameters description
  • workflow, tasks and outputs metadata are stored in <cid.store.location>/.meta
  • references to other cid metadata are cid://<workflow_hash|task_hash>/<output_target_path>
  • CID NIO Filesystem to access data based on CID URLs
  • nextflow cid command to log, show and get lineage from CID store metadata
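
As a rough illustration of the second and fourth bullets, the sketch below derives a workflow hash from a workflow description plus its parameters and builds a `cid://` reference from it. The PR text does not state the hashing algorithm or serialization, so SHA-256 over canonical JSON is an assumption here, and the names `workflow_hash` and `cid_reference` are hypothetical.

```python
import hashlib
import json


def workflow_hash(description: dict, params: dict) -> str:
    """Hash a workflow description together with its parameters.

    Assumption: SHA-256 over canonical JSON. The actual algorithm and
    serialization used by the CID store are not given in the PR text.
    """
    payload = json.dumps(
        {"workflow": description, "params": params},
        sort_keys=True,              # key order must not change the hash
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cid_reference(hash_value: str, output_target_path: str) -> str:
    """Build a cid://<hash>/<output_target_path> reference."""
    return f"cid://{hash_value}/{output_target_path.lstrip('/')}"
```

Sorting the keys before hashing means two runs with the same parameters in a different order map to the same hash, which is the property a content-addressed store would presumably want.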

Known Limitations:

  • Outputs published to absolute paths or URLs that are not subfolders of the outputDir are not currently tracked in the CID store, because we cannot infer their relative output target path. We could create a hash of the parent directory of the URL or absolute path and use it as the relative folder.
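
The workaround suggested in this limitation could look roughly like the sketch below: paths under the output directory keep their relative location, and anything else gets a folder named after a hash of its parent directory. The function name, the SHA-256 choice, and the 16-character truncation are all assumptions for illustration, not something the PR implements.

```python
import hashlib


def output_target_path(published: str, output_dir: str) -> str:
    """Return a relative target path for a published output.

    Hypothetical helper: the hash function and truncation length are
    illustrative assumptions only.
    """
    base = output_dir.rstrip("/") + "/"
    if published.startswith(base):
        # Published under outputDir: the relative path can be inferred.
        return published[len(base):]
    # Otherwise hash the parent directory and use it as the relative folder.
    parent, _, name = published.rstrip("/").rpartition("/")
    parent_hash = hashlib.sha256(parent.encode("utf-8")).hexdigest()[:16]
    return f"{parent_hash}/{name}"
```

Two files published to the same external directory end up in the same hashed folder, so their sibling relationship is preserved.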

Signed-off-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com>
@pditommaso pditommaso marked this pull request as draft January 27, 2025 13:15

netlify bot commented Jan 27, 2025

Deploy Preview for nextflow-docs-staging canceled.

| Name | Link |
|---|---|
| 🔨 Latest commit | 854a9de |
| 🔍 Latest deploy log | https://app.netlify.com/sites/nextflow-docs-staging/deploys/68089db9fce4e30008a8fa2c |

Signed-off-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com>
Signed-off-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com>
@pditommaso pditommaso force-pushed the master branch 2 times, most recently from 5a93547 to 27345a6 Compare February 10, 2025 21:46
@pditommaso
Member Author

@jorgee apologies, can the latest changes be made as a PR against this branch? That would make it much simpler for me to understand what's new.

@jorgee
Contributor

jorgee commented Feb 13, 2025

> @jorgee apologies, can the latest changes be made as a PR against this branch? That would make it much simpler for me to understand what's new.

I have reverted the changes in this branch and created a new one in PR #5787

Signed-off-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com>
Signed-off-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com>
Signed-off-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com>
jorgee and others added 15 commits February 17, 2025 18:16
Signed-off-by: jorgee <jorge.ejarque@seqera.io>
Signed-off-by: jorgee <jorge.ejarque@seqera.io>
Signed-off-by: jorgee <jorge.ejarque@seqera.io>
Signed-off-by: jorgee <jorge.ejarque@seqera.io>
Signed-off-by: jorgee <jorge.ejarque@seqera.io>
Signed-off-by: jorgee <jorge.ejarque@seqera.io>
Signed-off-by: jorgee <jorge.ejarque@seqera.io>
Signed-off-by: jorgee <jorge.ejarque@seqera.io>
Signed-off-by: jorgee <jorge.ejarque@seqera.io>
Signed-off-by: jorgee <jorge.ejarque@seqera.io>
Signed-off-by: jorgee <jorge.ejarque@seqera.io>
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
Member

@bentsherman bentsherman Apr 17, 2025

After playing around with the lineage command, I am skeptical about how much we are overloading this lid pseudo-filesystem. I thought it was just a nice add-on that we could experiment with, but now I think it's just getting in the way.

Currently there are three main uses for lid paths:

  1. lid://<hash>[#props]: returns a metadata record or sub-path. This has no practical utility in a Nextflow script, not even for workflow outputs. Now that #outputs is a list, I can't access an output by name (e.g. #outputs.samples), which means I can't use channel.fromPath() to access an LID output in the same way as a samplesheet. So the LID output is no longer a drop-in replacement for samplesheets.

On the command line, it would be simpler to just provide the hash and use jq:

```bash
# before
# oops, forgot to escape the #...
nextflow li describe lid://<hash>#params

# after
nextflow li describe <hash> | jq .params
```

In a web interface like the platform, you'll use a graphical interface to navigate this metadata, so the fragment syntax is not needed there.

  2. lid:///?<name>=<value>&...: used by the find command to retrieve a collection of metadata records. This also has no utility in a Nextflow script, because it is unrelated to domain-specific data like #outputs. It is only used by the find command, so the URI syntax is just getting in the way:

```bash
# before
# oops, forgot to escape the & ...
nextflow li find lid:///?type=DataOutput&workflowRun=lid://2265a814fd1c205ecc5b629070d759e2

# after
nextflow li find type=DataOutput workflowRun=2265a814fd1c205ecc5b629070d759e2
```

  3. lid://<hash>/<path>: returns a content-addressed file. This is the original use case and the only one that still makes sense as far as I can tell. I think this works perfectly both on the command line and in the Nextflow script/runtime.
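
To make the three forms listed above concrete, here is a minimal sketch that tells them apart using Python's standard `urllib.parse`. It is not how the lineage filesystem parses paths; the function name `classify_lid` and the returned tuple shapes are assumptions made for illustration.

```python
from urllib.parse import parse_qs, urlsplit


def classify_lid(uri: str) -> tuple:
    """Classify a lid:// URI as 'metadata', 'query', or 'file'.

    Hypothetical helper; the return shapes are illustrative only.
    """
    parts = urlsplit(uri)
    if parts.scheme != "lid":
        raise ValueError(f"not a lid URI: {uri}")
    if parts.query:
        # lid:///?<name>=<value>&... : a search over metadata records
        return ("query", parse_qs(parts.query))
    if parts.path not in ("", "/"):
        # lid://<hash>/<path> : a content-addressed file
        return ("file", (parts.netloc, parts.path.lstrip("/")))
    # lid://<hash>[#props] : a metadata record, optionally a sub-property
    return ("metadata", (parts.netloc, parts.fragment or None))
```

For example, `classify_lid("lid://abc123/results/out.txt")` yields the `'file'` form, which is the only one the comment proposes to keep.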

Based on this analysis, I think we should ditch (1) and (2) entirely and use lid:// only to refer to files.

Maybe we could use the fragment to refer to a specific output, e.g. lid://<hash>#samples. That would at least restore the original use case of passing a workflow output as input to a downstream pipeline.

@jorgee
Contributor

jorgee commented Apr 22, 2025

> 8. Command lineage find 'type=DataOutput' works OK, but the equivalent channel.fromPath('lid:///?type=DataOutput') does not seem to work 🔴

The ? is interpreted as a glob, but setting the option glob: false also fails: the path is not detected as a global query. I will check it. However, I think the result in this case is somewhat ambiguous. What do you expect to get: the DataOutput descriptions or the real target paths?
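
The clash described here can be sketched as follows: a generic glob check sees the `?` in the query URI and treats the path as a pattern, unless the lineage query form is recognized first. This is only an illustration of the ordering problem; the set of glob characters and all function names are assumptions, not Nextflow's actual detection logic.

```python
import re
from urllib.parse import urlsplit

# Characters commonly treated as glob metacharacters (assumed set).
GLOB_CHARS = re.compile(r"[*?\[\]{}]")


def looks_like_glob(path: str) -> bool:
    """Naive check: any glob metacharacter marks the path as a pattern."""
    return GLOB_CHARS.search(path) is not None


def is_lineage_query(path: str) -> bool:
    """True for the lid:///?<name>=<value> form (empty hash, non-empty query)."""
    parts = urlsplit(path)
    return parts.scheme == "lid" and parts.netloc == "" and bool(parts.query)


def should_expand_glob(path: str) -> bool:
    # Test for a lineage query first, so its '?' is never read as a wildcard.
    return not is_lineage_query(path) and looks_like_glob(path)
```

With this ordering, `lid:///?type=DataOutput` is not expanded as a glob even though the naive check flags it, while `data/*.fastq` still is.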

jorgee added 2 commits April 22, 2025 11:20
Signed-off-by: jorgee <jorge.ejarque@seqera.io>
Signed-off-by: jorgee <jorge.ejarque@seqera.io>
@jorgee
Contributor

jorgee commented Apr 22, 2025

Based on previous comments, I have pushed some minor changes:

  • WorkflowRun.resolvedConfig to config
  • XxxOutputs -> XxxOutput, outputs to output and inputs to input
  • Command describe to view and log to list
  • TaskRun.scriptChecksum -> TaskRun.script including the resolved task script
  • DataOutput -> FileOutput

Signed-off-by: jorgee <jorge.ejarque@seqera.io>
@bentsherman
Member

TODOs from our discussion

This PR:

  • rename log to list
  • rename describe to view

Separate PR(s):

  • document lineage command
  • add channel.fromLineage() factory to query metadata
  • remove URL query syntax
    • use key-value pairs on command line
    • use key-value pairs in channel factory, e.g. channel.fromLineage(foo: 'bar')
  • simplify URL fragment syntax
    • use jq on command line
    • use json-path in channel factory (I think splitJson operator already implements json-path)
    • use URL fragment to retrieve workflow output by name e.g. channel.fromPath('lid://<hash>#samples')
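
The key-value query style proposed in the list above could be sketched as below: command-line arguments of the form `key=value` are parsed into a dictionary, and the same criteria are then matched against metadata records. The record layout (flat dictionaries) and the names `parse_query_args` and `find_records` are hypothetical; the proposed `channel.fromLineage()` factory does not exist in this PR.

```python
def parse_query_args(args):
    """Turn ['type=DataOutput', 'workflowRun=<hash>'] into a dict."""
    query = {}
    for arg in args:
        key, sep, value = arg.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got: {arg!r}")
        query[key] = value
    return query


def find_records(records, **criteria):
    """Return the records whose fields equal every given criterion.

    Assumes each record is a flat dict; real lineage records may be nested.
    """
    return [
        record
        for record in records
        if all(record.get(key) == value for key, value in criteria.items())
    ]
```

A call such as `find_records(records, **parse_query_args(["type=DataOutput"]))` mirrors the shape of `channel.fromLineage(foo: 'bar')`, and none of the arguments need shell escaping.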

Signed-off-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com>
Signed-off-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com>
Signed-off-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com>
@pditommaso
Member Author

Ok, I've moved the h2 stuff to the corresponding repo https://github.com/nextflow-io/nf-lineage-h2

@jorgee
Contributor

jorgee commented Apr 23, 2025

This PR:

  • rename log to list
  • rename describe to view

This is already changed in the current PR

@pditommaso
Member Author

Let's move ahead, then

Signed-off-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com>
Signed-off-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com>
@pditommaso
Member Author

@pditommaso pditommaso merged commit 20e06da into master Apr 23, 2025
9 checks passed
@pditommaso pditommaso deleted the cid-store branch April 23, 2025 08:24

```groovy
private static String HISTORY_FILE_NAME = ".history"
private static final String METADATA_FILE = '.data.json'
private static final String METADATA_PATH = '.meta'
```
Member

@jorgee what is the rationale for the .meta subfolder? It looks like the only thing that is created in the lineage folder. Do you intend to store other things under lineage as well?

Contributor

@bentsherman, it was storing the output data and the metadata in the first implementation, but currently this sub-folder is not needed.

Member

Okay, not urgent but something to consider in the final cleanup before 25.04


3 participants