Description
This is an example of how to use meta_conduct from Python code to extract metadata from all files of a datalad dataset. In this example, we assume that we want to execute three extractors on every file and that the results of the extraction should be written to stdout.

The example uses meta-conduct to perform the extraction. It defines a pipeline that traverses all files of a datalad dataset and feeds each file into a processing pipeline of three extractors. Each "extractor" is an instance of the generic class datalad_metalad.pipeline.processor.extract.MetadataExtractor. The instances are configured during instantiation through the parameters extractor_name and extractor_type. The values for those parameters are provided as arguments to meta_conduct.

This is the example code. The dataset directory is given in the global variable dataset_dir, and the names of the three extractors are given in the global variables X, Y, and Z. (The code executes all three extractors as file-level extractors and assumes that all provided extractors are either file-level extractors or legacy extractors.)
from datalad.api import meta_conduct

# The dataset on which the extractors should run
dataset_dir = "<path to your datalad-dataset>"

# The extractors that should be executed
X = "metalad_core"
Y = "datalad_core"
Z = "metalad_example_file"

pipeline = {
    'provider': {
        'module': 'datalad_metalad.pipeline.provider.datasettraverse',
        'class': 'DatasetTraverser',
        'name': 'trav',
        'arguments': {}
    },
    'processors': [
        {
            'module': 'datalad_metalad.pipeline.processor.extract',
            'class': 'MetadataExtractor',
            'name': 'ext1',
            'arguments': {}
        },
        {
            'module': 'datalad_metalad.pipeline.processor.extract',
            'class': 'MetadataExtractor',
            'name': 'ext2',
            'arguments': {}
        },
        {
            'module': 'datalad_metalad.pipeline.processor.extract',
            'class': 'MetadataExtractor',
            'name': 'ext3',
            'arguments': {}
        }
    ]
}

# The same pipeline with an added consumer that stores the metadata
# records in the dataset (see below)
adding_pipeline = {
    **pipeline,
    'consumer': {
        'module': 'datalad_metalad.pipeline.consumer.add_stdin',
        'class': 'StdinAdder',
        'name': 'adder',
        'arguments': {}
    }
}

arguments = [
    f'trav.top_level_dir={dataset_dir}', 'trav.item_type=file',
    f'ext1.extractor_name={X}', 'ext1.extractor_type=file',
    f'ext2.extractor_name={Y}', 'ext2.extractor_type=file',
    f'ext3.extractor_name={Z}', 'ext3.extractor_type=file',
]

for result in meta_conduct(configuration=pipeline,
                           arguments=arguments,
                           result_renderer='disabled'):
    print(result)
The code above will print the resulting metadata records to stdout. They can then be processed further and, for example, be added to the git-repo of the dataset via meta-add. To this end, it is sufficient to pipe the results into a call like the following, which instructs meta-add to read content from stdin and to expect it in JSON-lines format, i.e. one metadata record per input line:
datalad meta-add --json-lines -
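
Note that print(result) in the loop above emits Python's dict representation, not JSON. To produce output that meta-add can consume as JSON lines, the final loop can serialize each record with json.dumps instead. The following is only a sketch; whether the complete result record, or only a sub-field of it, is the metadata record that meta-add expects may depend on your metalad version, so inspect one record first and adjust the serialization accordingly:

import json

for result in meta_conduct(configuration=pipeline,
                           arguments=arguments,
                           result_renderer='disabled'):
    # Emit one JSON object per line; stringify values that are not
    # JSON-serializable (e.g. path objects).
    print(json.dumps(result, default=str))

Saved as a script (for example extract_metadata.py, a hypothetical file name) and run from inside the dataset, its output can then be piped directly into the call above:

python extract_metadata.py | datalad meta-add --json-lines -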
Adding to the git-repo can also be achieved by adding a consumer component to the pipeline definition. In the code above, the variable adding_pipeline contains such a consumer. Running the pipeline adding_pipeline is equivalent to running datalad meta-conduct and piping the result into datalad meta-add as described above. (If you want to try it, you might have to specify the parameter adder.dataset=<path to your datalad-dataset> to tell the consumer in which git-repo the metadata records should be stored.)
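
A minimal sketch of such a run, reusing the pipeline definition and arguments from above (the adder.dataset value is passed in the same way as the traverser and extractor parameters):

# Run the pipeline that contains the consumer; the consumer stores each
# metadata record in the git-repo of the dataset given by adder.dataset.
adding_arguments = arguments + [f'adder.dataset={dataset_dir}']

for result in meta_conduct(configuration=adding_pipeline,
                           arguments=adding_arguments,
                           result_renderer='disabled'):
    print(result)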