Data lineage tracking (aka CID store) #5715
Signed-off-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com>
✅ Deploy Preview for nextflow-docs-staging canceled.
5a93547 to 27345a6
@jorgee apologies, can the latest changes be made as a PR against this branch? That would make it much simpler for me to understand what's new.
Signed-off-by: jorgee <jorge.ejarque@seqera.io>
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
modules/nextflow/src/main/groovy/nextflow/cli/CmdLineage.groovy
After playing around with the `lineage` command, I am skeptical about how much we are overloading this `lid` pseudo-filesystem. I thought it was just a nice add-on that we could experiment with, but now I think it's just getting in the way.

Currently there are three main uses for `lid` paths:

1. `lid://<hash>[#props]`: returns a metadata record or sub-path. This has no practical utility in a Nextflow script, not even for workflow outputs. Now that `#outputs` is a list, I can't access an output by name (e.g. `#outputs.samples`), which means I can't use `channel.fromPath()` to access an LID output in the same way as a samplesheet. So the LID output is no longer a drop-in replacement for samplesheets.

   On the command line, it would be simpler to just provide the hash and use `jq`:

       # before
       # oops, forgot to escape the #...
       nextflow li describe lid://<hash>#params
       # after
       nextflow li describe <hash> | jq .params

   In a web interface like the platform, you'll use a graphical interface to navigate this metadata, so the fragment syntax is not needed there.

2. `lid:///?<name>=<value>&...`: used by the `find` command to retrieve a collection of metadata records. This also has no utility in a Nextflow script, because it is unrelated to domain-specific data like `#outputs`. It is only used by the `find` command, so the URI syntax is just getting in the way:

       # before
       # oops, forgot to escape the & ...
       nextflow li find lid:///?type=DataOutput&workflowRun=lid://2265a814fd1c205ecc5b629070d759e2
       # after
       nextflow li find type=DataOutput workflowRun=2265a814fd1c205ecc5b629070d759e2

3. `lid://<hash>/<path>`: returns a content-addressed file. This is the original use case and the only one that still makes sense as far as I can tell. I think this works perfectly both on the command line and in the Nextflow script/runtime.
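To make the three overloaded forms concrete, here is a minimal Python sketch that tells them apart with the standard `urllib.parse` module. It is not the Nextflow implementation; the function name `classify_lid` and the labels `"metadata"`, `"query"` and `"file"` are hypothetical and exist only for illustration.

```python
from urllib.parse import urlsplit, parse_qsl


def classify_lid(uri):
    """Classify a lid:// URI into one of the three uses listed above.

    Hypothetical helper: the labels and return shape are assumptions,
    not part of the Nextflow code base.
    """
    parts = urlsplit(uri)
    if parts.scheme != "lid":
        raise ValueError(f"not a lid URI: {uri!r}")

    # Form 2: lid:///?<name>=<value>&...  (empty hash, query string present)
    if not parts.netloc and parts.query:
        return ("query", dict(parse_qsl(parts.query)))

    # Form 3: lid://<hash>/<path>  (hash plus a non-empty file path)
    if parts.netloc and parts.path not in ("", "/"):
        return ("file", (parts.netloc, parts.path.lstrip("/")))

    # Form 1: lid://<hash>[#props]  (hash plus an optional fragment)
    if parts.netloc:
        return ("metadata", (parts.netloc, parts.fragment or None))

    raise ValueError(f"unrecognised lid URI: {uri!r}")


if __name__ == "__main__":
    for example in (
        "lid://abc123#params",
        "lid:///?type=DataOutput&workflowRun=lid://abc123",
        "lid://abc123/results/samples.csv",
    ):
        print(example, "->", classify_lid(example))
```

The sketch shows that a single scheme has to be disambiguated by which URI components happen to be present, which is the overloading the comment objects to.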
Based on this analysis, I think we should ditch (1) and (2) entirely and use `lid://` only to refer to files.
Maybe we could use the fragment to refer to a specific output, e.g. `lid://<hash>#samples`. That would at least restore the original use case of passing a workflow output as input to a downstream pipeline.
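As a rough illustration of the proposed `find` syntax (plain `name=value` arguments instead of a query URI), the following Python sketch parses such arguments and filters a list of records. The helper names `parse_filters` and `find_records` and the sample records are assumptions made for this example; the actual command may match fields differently.

```python
def parse_filters(args):
    """Turn ['type=DataOutput', ...] into {'type': 'DataOutput', ...}.

    Hypothetical helper, not the Nextflow CLI parser.
    """
    filters = {}
    for arg in args:
        key, sep, value = arg.partition("=")
        if not sep or not key:
            raise ValueError(f"expected <name>=<value>, got {arg!r}")
        filters[key] = value
    return filters


def find_records(records, filters):
    """Return the records whose fields equal every requested filter value."""
    return [
        record
        for record in records
        if all(record.get(key) == value for key, value in filters.items())
    ]


if __name__ == "__main__":
    # Made-up records; only the field names come from the example command.
    records = [
        {"type": "DataOutput", "workflowRun": "2265a814fd1c205ecc5b629070d759e2"},
        {"type": "TaskRun", "workflowRun": "2265a814fd1c205ecc5b629070d759e2"},
    ]
    args = ["type=DataOutput", "workflowRun=2265a814fd1c205ecc5b629070d759e2"]
    print(find_records(records, parse_filters(args)))
```

Because each filter is an ordinary shell word, nothing here needs quoting or escaping on the command line, which is the point of the proposal.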
The |
Based on previous comments, I have pushed some minor changes:
TODOs from our discussion
This PR:
Separate PR(s):
OK, I've moved the h2 stuff to the corresponding repo: https://github.com/nextflow-io/nf-lineage-h2
This is already changed in the current PR.
Let's move ahead, then.
private static String HISTORY_FILE_NAME = ".history"
private static final String METADATA_FILE = '.data.json'
private static final String METADATA_PATH = '.meta'
@jorgee what is the rationale for the `.meta` subfolder? It looks like the only thing that is created in the `lineage` folder. Do you intend to store other things under `lineage` as well?
@bentsherman, in the first implementation the folder stored both the output data and the metadata, but currently this sub-folder is not needed.
Okay, not urgent but something to consider in the final cleanup before 25.04
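For readers following the `.meta` question, here is a small Python sketch of the on-disk layout the quoted constants seem to imply, with and without the extra subfolder. The layout `<root>/.meta/<key>/.data.json` is an assumption inferred from the constant names, not something this thread confirms, and `record_path` is a hypothetical helper.

```python
from pathlib import Path

# Values copied from the constants quoted in the diff above.
METADATA_FILE = ".data.json"
METADATA_PATH = ".meta"


def record_path(lineage_root, key, use_meta_subfolder=True):
    """Build the assumed location of a metadata record.

    Assumed layout (not confirmed by the PR):
        <root>/.meta/<key>/.data.json   when use_meta_subfolder is True
        <root>/<key>/.data.json         when it is False
    """
    base = Path(lineage_root)
    if use_meta_subfolder:
        base = base / METADATA_PATH
    return base / key / METADATA_FILE


if __name__ == "__main__":
    print(record_path("lineage", "abc123"))
    print(record_path("lineage", "abc123", use_meta_subfolder=False))
```

If nothing else lives under the lineage folder, the second form is one directory level shallower, which is the cleanup being suggested.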
Tentative implementation for addressable data store (very basic POC so far).
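As background on what "addressable" means here, the following Python sketch shows the general idea of a content-addressed store: the key of an entry is derived from a hash of its content. It is an in-memory toy, not the PR's implementation; the class name `ContentStore` and the choice of SHA-256 are assumptions for illustration only.

```python
import hashlib


class ContentStore:
    """Toy content-addressed store: each blob is keyed by a hash of its bytes.

    Hypothetical sketch; SHA-256 is an assumed hash function and the
    store keeps everything in memory.
    """

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        # The content ID depends only on the bytes, so identical
        # content always maps to the same key.
        cid = hashlib.sha256(data).hexdigest()
        self._blobs[cid] = data
        return cid

    def get(self, cid: str) -> bytes:
        return self._blobs[cid]


if __name__ == "__main__":
    store = ContentStore()
    cid = store.put(b"sample output")
    print(cid, store.get(cid))
```

Storing the same bytes twice yields the same ID, so entries are naturally deduplicated and an ID can be used to verify the content it refers to.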
Update on 1 Mar 2025 from #5787 by @jorgee
M1 Implementation of CID store for provenance
Changes:
Known Limitations: