Description
This is an example of how to use meta_conduct from Python code to extract metadata from all files of a datalad dataset. In this example, we assume that we want to execute three extractors on every file and that the results of the extraction should be written to stdout.

The example uses meta-conduct to perform the extraction. It defines a pipeline that traverses all files of a datalad dataset and feeds each file into a processing pipeline of three extractors. Each "extractor" is an instance of the generic class datalad_metalad.pipeline.processor.extract.MetadataExtractor. The instances are configured during instantiation through the parameters extractor_name and extractor_type. The values for those parameters are provided as arguments to meta_conduct.

This is the example code. The dataset directory is given in the global variable dataset_dir, and the names of the three extractors are given in the global variables X, Y, and Z. (The code executes all three extractors as file-level extractors and assumes that all provided extractors are either file-level extractors or legacy extractors.)
from datalad.api import meta_conduct

# The dataset on which the extractors should run
dataset_dir = "<path to your datalad-dataset>"

# The extractors that should be executed
X = "metalad_core"
Y = "datalad_core"
Z = "metalad_example_file"

pipeline = {
    'provider': {
        'module': 'datalad_metalad.pipeline.provider.datasettraverse',
        'class': 'DatasetTraverser',
        'name': 'trav',
        'arguments': {}
    },
    'processors': [
        {
            'module': 'datalad_metalad.pipeline.processor.extract',
            'class': 'MetadataExtractor',
            'name': 'ext1',
            'arguments': {}
        },
        {
            'module': 'datalad_metalad.pipeline.processor.extract',
            'class': 'MetadataExtractor',
            'name': 'ext2',
            'arguments': {}
        },
        {
            'module': 'datalad_metalad.pipeline.processor.extract',
            'class': 'MetadataExtractor',
            'name': 'ext3',
            'arguments': {}
        }
    ]
}

# The same pipeline with an added consumer that stores the metadata
# records in the dataset (see below)
adding_pipeline = {
    **pipeline,
    'consumer': {
        'module': 'datalad_metalad.pipeline.consumer.add_stdin',
        'class': 'StdinAdder',
        'name': 'adder',
        'arguments': {}
    }
}

arguments = [
    f'trav.top_level_dir={dataset_dir}', 'trav.item_type=file',
    f'ext1.extractor_name={X}', 'ext1.extractor_type=file',
    f'ext2.extractor_name={Y}', 'ext2.extractor_type=file',
    f'ext3.extractor_name={Z}', 'ext3.extractor_type=file',
]

for result in meta_conduct(configuration=pipeline,
                           arguments=arguments,
                           result_renderer='disabled'):
    print(result)
The code above will print the resulting metadata records to stdout. They can then be processed further and, for example, be added to the git-repo of the dataset via meta-add. To this end, it is sufficient to pipe the results into a call like the following, which instructs meta-add to read content from stdin and to expect it in JSON-lines format, i.e. one metadata record per input line:
datalad meta-add --json-lines -
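
Note that print(result) in the loop above emits Python's dict representation, not JSON. To produce output that meta-add can consume as JSON lines, the final loop can serialize each record with json.dumps instead. The following is only a sketch; whether the complete result record, or only a sub-field of it, is the metadata record that meta-add expects may depend on your metalad version, so inspect one record first and adjust the serialization accordingly:

import json

for result in meta_conduct(configuration=pipeline,
                           arguments=arguments,
                           result_renderer='disabled'):
    # Emit one JSON object per line; stringify values that are not
    # JSON-serializable (e.g. path objects).
    print(json.dumps(result, default=str))

Saved as a script (for example extract_metadata.py, a hypothetical file name) and run from inside the dataset, its output can then be piped directly into the call above:

python extract_metadata.py | datalad meta-add --json-lines -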
Adding to the git-repo can also be achieved by adding a consumer component to the pipeline definition. In the code above, the variable adding_pipeline contains such a consumer. Running the pipeline adding_pipeline is equivalent to running datalad meta-conduct and piping the result into datalad meta-add as described above. (If you want to try it, you might have to specify the parameter adder.dataset=<path to your datalad-dataset> to tell the consumer in which git-repo the metadata records should be stored.)
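
A minimal sketch of such a run, reusing the pipeline definition and arguments from above (the adder.dataset value is passed in the same way as the traverser and extractor parameters):

# Run the pipeline that contains the consumer; the consumer stores each
# metadata record in the git-repo of the dataset given by adder.dataset.
adding_arguments = arguments + [f'adder.dataset={dataset_dir}']

for result in meta_conduct(configuration=adding_pipeline,
                           arguments=adding_arguments,
                           result_renderer='disabled'):
    print(result)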