
Browsing resources and releases

Fabio Cumbo edited this page Sep 20, 2021 · 2 revisions

To easily retrieve information about genomes, samples, datasets, and clusters from MetaRefSGB, we provide an inspector tool that is already integrated into the pipeline and can be called by running the following command in your terminal:

MetaRefSGB --inspect --genome=663737656 --db=~/db --release=Jan21

In particular, this command searches for the MetaRefSGB Unique Genome Identifier 663737656 in the Jan21 release and prints the results on screen as a dictionary, as reported below.

{
    "hits": [
        {
            "category": "Metagenome-assembled Genome",
            "closest_references": [
                "152744499"
            ],
            "completeness": "96.64",
            "contamination": "1.34",
            "dataset_id": "AsnicarF_2020",
            "ecosystem": "Host-associated",
            "ecosystem_category": "Human,Mammals",
            "ecosystem_subtype": "Gut",
            "ecosystem_type": "Digestive system",
            "fgb": "FGB1476",
            "ggb": "GGB3740",
            "mag_id": "AsnicarF_2020__833__bin.16",
            "metarefsgb_id": "663737656",
            "notes": null,
            "sample_id": "833",
            "sgb": "SGB5075",
            "sgb_centroid": "384699434",
            "specific_ecosystem": "Fecal",
            "strain_heterogeneity": "50.0"
        }
    ]
}
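As a minimal sketch of how this result could be consumed downstream, the Python snippet below parses a trimmed copy of the dictionary shown above, assuming it is valid JSON (the `null` value suggests so). The function name `summarize_hits` is hypothetical and not part of MetaRefSGB. Note that numeric fields such as completeness are reported as strings, so they are converted explicitly.

```python
import json

# Trimmed copy of the inspector result shown above (only a few fields kept).
raw = '''
{
    "hits": [
        {
            "category": "Metagenome-assembled Genome",
            "completeness": "96.64",
            "contamination": "1.34",
            "dataset_id": "AsnicarF_2020",
            "mag_id": "AsnicarF_2020__833__bin.16",
            "metarefsgb_id": "663737656",
            "notes": null,
            "sample_id": "833",
            "sgb": "SGB5075"
        }
    ]
}
'''


def summarize_hits(text):
    """Hypothetical helper: return one small summary dict per hit.

    Assumes the inspector output is JSON with a top-level "hits" list.
    Completeness and contamination are strings in the output, so they
    are converted to floats here.
    """
    result = json.loads(text)
    summaries = []
    for hit in result.get("hits", []):
        summaries.append({
            "metarefsgb_id": hit["metarefsgb_id"],
            "sgb": hit["sgb"],
            "completeness": float(hit["completeness"]),
            "contamination": float(hit["contamination"]),
        })
    return summaries


summaries = summarize_hits(raw)
for s in summaries:
    print(s["metarefsgb_id"], s["sgb"], s["completeness"], s["contamination"])
```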

Remember that all the releases are linked together. This means that when you specify a release, all the previous releases up to the specified one will be loaded.

Similarly, the inspector can be used with a sample (e.g. --sample=833), a dataset (e.g. --dataset=AsnicarF_2020), or a cluster (e.g. --cluster=SGB5075, --cluster=GGB3740, --cluster=FGB1476).

Remember to always add the --db argument followed by the path to the main folder of the MetaRefSGB database, which is the same folder used when running the main pipeline that assigns new genomes to SGBs.

If you need to run the inspector on many IDs, you are encouraged to use the --file argument followed by the path to a one-column file with a predefined list of genomes, samples, datasets, or clusters. Please note that the first line of this file must contain a header that describes the data in your list. For instance, if you need to run the inspector on a list of MetaRefSGB Unique Genome Identifiers, you can run the following command:

MetaRefSGB --inspect --file=~/mygenomes.txt --db=~/db --release=Jan21

Here, ~/mygenomes.txt should look like the following snippet:

# metarefsgb_id
663737656
841942266
618704549
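If the identifiers already live in a script or a table, a file in this layout can be generated programmatically. The following is a minimal sketch in Python; the helper name `write_id_list` is hypothetical, and the header format (`# ` followed by the column name) is copied from the snippet above.

```python
import os
import tempfile
from pathlib import Path


def write_id_list(path, ids, header="metarefsgb_id"):
    """Hypothetical helper: write a one-column list file for --file.

    The first line is the header ("# <header>"), followed by one ID
    per line, mirroring the example file shown above.
    """
    lines = [f"# {header}"] + [str(i) for i in ids]
    Path(path).write_text("\n".join(lines) + "\n")
    return path


# Demo: write the three example identifiers to a file in a temporary folder.
demo_dir = tempfile.mkdtemp()
list_path = os.path.join(demo_dir, "mygenomes.txt")
write_id_list(list_path, ["663737656", "841942266", "618704549"])
print(Path(list_path).read_text())
```

Passing header="sample_id" or header="dataset_id" would produce the corresponding list for samples or datasets.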

Remember to change the header to sample_id or dataset_id if you need to search for multiple samples or datasets. The --file option does not support searching for clusters.

By default, the output is printed on screen. To write it to a file instead, use the --output argument as shown below:

MetaRefSGB --inspect --genome=663737656 --db=~/db --release=Jan21 --output=~/663737656.json
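When the inspector is driven from a script rather than typed in a terminal, the same command can be assembled as an argument list. The sketch below is illustrative only: `build_inspect_command` is a hypothetical helper, and it uses just the flags shown on this page. Paths starting with ~ are expanded explicitly, because a subprocess started without a shell does not expand them.

```python
import os
import shutil
import subprocess


def build_inspect_command(genome, db, release, output=None):
    """Hypothetical helper: build the inspector command as a list.

    Only flags documented on this page are used. "~" is expanded with
    os.path.expanduser since no shell is involved in subprocess.run.
    """
    cmd = [
        "MetaRefSGB",
        "--inspect",
        f"--genome={genome}",
        f"--db={os.path.expanduser(db)}",
        f"--release={release}",
    ]
    if output is not None:
        cmd.append(f"--output={os.path.expanduser(output)}")
    return cmd


cmd = build_inspect_command("663737656", "~/db", "Jan21", output="~/663737656.json")
print(" ".join(cmd))

# Run the command only if the MetaRefSGB executable is available on PATH.
if shutil.which("MetaRefSGB") is not None:
    subprocess.run(cmd, check=True)
```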

This tool can also be used to inspect the MetaRefSGB Data Models (MDM), which helps contributors understand how to share their data. You can inspect the MDM by running:

MetaRefSGB --inspect --schema=MAG

This will print the content of the MAG model on screen. To inspect the genome and metadata models as well, just replace MAG with genome or metadata. Please have a look at the MDM Schema section of this Wiki for a detailed explanation of how we manage data in MetaRefSGB.
