Information issue: example for using meta_conduct from python-code #395

@christian-monch

Description

This is an example of how to use meta_conduct from Python code to extract metadata from all files of a DataLad dataset.

In this example, we assume that three extractors should be executed on every file and that the extraction results should be written to stdout.

The example uses meta_conduct to perform the extraction. It defines a pipeline that traverses all files of a DataLad dataset and feeds each file into a processing chain of three extractors. Each "extractor" is an instance of the generic class datalad_metalad.pipeline.processor.extract.MetadataExtractor. The instances are configured during instantiation through the parameters extractor_name and extractor_type, whose values are provided as arguments to meta_conduct.

Here is the example code. The dataset directory is given in the global variable dataset_dir, and the names of the three extractors in the global variables X, Y, and Z. The code runs all three as file-level extractors and therefore assumes that every provided extractor is either a file-level extractor or a legacy extractor:

from datalad.api import meta_conduct

# The dataset on which the extractors should run
dataset_dir = "<path to your datalad-dataset>"

# The extractors that should be executed
X = "metalad_core"
Y = "datalad_core"
Z = "metalad_example_file"


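# The extraction pipeline: a provider that traverses the dataset and yields
# its files, followed by three extractor processors, each of which runs one
# extractor on every traversed item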
pipeline = {
    'provider': {
        'module': 'datalad_metalad.pipeline.provider.datasettraverse',
        'class': 'DatasetTraverser',
        'name': 'trav',
        'arguments': {}
    },
    'processors': [
        {
            'module': 'datalad_metalad.pipeline.processor.extract',
            'class': 'MetadataExtractor',
            'name': 'ext1',
            'arguments': {}
        },
        {
            'module': 'datalad_metalad.pipeline.processor.extract',
            'class': 'MetadataExtractor',
            'name': 'ext2',
            'arguments': {}
        },
        {
            'module': 'datalad_metalad.pipeline.processor.extract',
            'class': 'MetadataExtractor',
            'name': 'ext3',
            'arguments': {}
        }
    ]
}

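# The same pipeline, extended with a consumer that adds the resulting
# metadata records to a dataset (see the note at the end of this example)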
adding_pipeline = {
    **pipeline,
    'consumer': {
        'module': 'datalad_metalad.pipeline.consumer.add_stdin',
        'class': 'StdinAdder',
        'name': 'adder',
        'arguments': {}
    }
}


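# Constructor arguments for the named pipeline elements: the traverser is
# pointed at the dataset and set to emit file items; each extractor
# processor is given its extractor name and the extractor type "file"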
arguments = [
    f'trav.top_level_dir={dataset_dir}', 'trav.item_type=file',
    f'ext1.extractor_name={X}', 'ext1.extractor_type=file',
    f'ext2.extractor_name={Y}', 'ext2.extractor_type=file',
    f'ext3.extractor_name={Z}', 'ext3.extractor_type=file',
]

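# Run the extraction pipeline and print every result to stdout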
for result in meta_conduct(configuration=pipeline,
                           arguments=arguments,
                           result_renderer='disabled'):
    print(result)

The code above prints the resulting metadata records to stdout. They can then be processed further and, for example, added to the git-repo of the dataset via meta-add. To this end, it is sufficient to pipe the results into a call like the following, which instructs meta-add to read content from stdin and to expect it in JSON-lines format, i.e. one metadata record per input line:

datalad meta-add --json-lines -
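
If the records should be emitted from Python in a form suitable for that pipe, a minimal sketch could look like this. It assumes that each result dictionary carries its metadata record under the key metadata_record; that key is an assumption, so adjust it to the result structure your metalad version actually produces:

import json

for result in meta_conduct(configuration=pipeline,
                           arguments=arguments,
                           result_renderer='disabled'):
    # Emit one JSON object per line, i.e. the JSON-lines format
    # that `datalad meta-add --json-lines -` expects on stdin.
    # NOTE: 'metadata_record' is an assumed result key.
    record = result.get('metadata_record')
    if record is not None:
        print(json.dumps(record, default=str))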

Adding to the git-repo can also be achieved by adding a consumer component to the pipeline definition. In the code above, the variable adding_pipeline contains such a consumer. Running the pipeline adding_pipeline would be equivalent to running datalad meta-conduct and piping the result into datalad meta-add as described above. (If you want to try it, you might have to specify the parameter adder.dataset=<path to your datalad-dataset> to tell the consumer in which git-repo the metadata records should be stored; the sketch below includes it.)
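
For illustration, running the adding pipeline from Python could look like this. The adder.dataset argument follows the note above; the remaining setup is identical to the extraction-only run:

# Point the consumer at the dataset that should store the metadata records
adding_arguments = arguments + [f'adder.dataset={dataset_dir}']

for result in meta_conduct(configuration=adding_pipeline,
                           arguments=adding_arguments,
                           result_renderer='disabled'):
    print(result)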
