
8b. Building a GTDB database and optional modification

jaclew edited this page Nov 21, 2023 · 3 revisions

Example: Building a GTDB database and optional modification

In this example, taxonomy and genomes will be obtained from GTDB to create the FlexTaxD database (Fdb). After the Fdb is built, FlexTaxD is used to compile the Fdb into a Kraken2 classification-database.

Create environment using mamba (or conda) and install FlexTaxD with dependencies for visualisation and compilation of Kraken2 database:

mamba create -n flextaxd_example flextaxd ncbi-datasets-cli inquirer biopython matplotlib kraken2
conda activate flextaxd_example

Obtain input taxonomy-files from GTDB:

wget https://data.gtdb.ecogenomic.org/releases/latest/bac120_taxonomy.tsv.gz
gunzip bac120_taxonomy.tsv.gz
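Before building the Fdb, it can be worth checking that the taxonomy file unpacked as expected. The sketch below assumes the usual GTDB layout of two tab-separated columns: a genome accession, then a semicolon-separated lineage whose last field is the `s__` species name. The function name `count_species` is illustrative and not part of FlexTaxD.

```shell
# Count distinct species in a GTDB taxonomy file.
# Assumption: column 1 = accession, column 2 = lineage "d__...;p__...;...;s__...".
count_species() {
    # Take the lineage column, keep its last ';'-separated field, count unique values.
    cut -f 2 "$1" | awk -F ';' '{ print $NF }' | sort -u | wc -l | tr -d ' '
}
```

Running `count_species bac120_taxonomy.tsv` gives the number of species-level leaf-nodes to expect, while `wc -l < bac120_taxonomy.tsv` gives the number of genome entries in the file.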

Create the Fdb:

flextaxd --db gtdb.fdb --taxonomy_file bac120_taxonomy.tsv --taxonomy_type GTDB

Downloading genomes (GTDB representative genomes; warning: this file is ~61 GB):

flextaxd-create --db gtdb.fdb --genomes_path genomes --download
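Given the size of the download, a quick free-space check before starting may save a failed run. The helper below is a minimal sketch; `has_free_gib` is a hypothetical name, and the threshold passed to it is an assumption that should leave headroom above the ~61 GB quoted above.

```shell
# Succeed (exit status 0) if the filesystem holding directory $2
# has at least $1 GiB available.
has_free_gib() {
    need_kib=$(( $1 * 1024 * 1024 ))
    # df -Pk prints POSIX-format output in 1024-byte blocks;
    # the fourth column of the second line is the available space.
    avail_kib=$(df -Pk "$2" | awk 'NR == 2 { print $4 }')
    [ "$avail_kib" -ge "$need_kib" ]
}
```

For example, `has_free_gib 150 . || echo "Not enough free space in $(pwd)"` prints a warning when less than 150 GiB is available in the current directory.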

Download alternative 1: Custom version of GTDB.

flextaxd-create --db gtdb.fdb --genomes_path genomes --download --rep_path <URL to GTDB representative genome tarball>

Download alternative 2: Manual relocation of files.

# Download GTDB representative genome tarball
wget https://data.gtdb.ecogenomic.org/releases/latest/genomic_files_reps/gtdb_genomes_reps.tar.gz

# Unpack tarball
tar -zxf gtdb_genomes_reps.tar.gz

# Move files to genomes-directory
mkdir -p genomes
find gtdb_genomes_reps* -name "*fna.gz" -exec mv {} genomes/ \;
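After relocating the files, counting them is a cheap way to confirm that the move completed. This is a small sketch; `count_genomes` is an illustrative helper, not a FlexTaxD command.

```shell
# Count the *.fna.gz files located directly inside a directory.
count_genomes() {
    find "$1" -maxdepth 1 -type f -name '*fna.gz' | wc -l | tr -d ' '
}
```

`count_genomes genomes` should report one file per representative genome; a result of 0 suggests the `find ... -exec mv` step did not match anything.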

Because the GTDB taxonomy file includes all GTDB datasets, not only the representative genomes, the Fdb must be purged of entries that have no downloaded genome:

flextaxd --db gtdb.fdb --purge_database genomes --purge_database_force

The argument --purge_database traverses the directory of genome files and removes nodes in the Fdb that do not have a genome. GTDB leaf-nodes can hold multiple genomes (the representative genome and its sister genomes), and the sister genomes have no distinct nodes of their own. The argument --purge_database_force is therefore applied to remove these sister genomes from the leaf-nodes, leaving only the representative genome.
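To get a feel for how many entries the purge affects, the taxonomy file can be compared with the genome directory. The sketch below only illustrates that comparison and is not how FlexTaxD implements the purge. It assumes that taxonomy accessions carry an `RS_` or `GB_` prefix (for example `RS_GCF_000000000.1`) and that genome files are named `<accession>_genomic.fna.gz`; `missing_genomes` is a hypothetical helper name.

```shell
# List taxonomy accessions that have no matching genome file on disk.
# Assumptions: accessions look like "RS_GCF_..." / "GB_GCA_...",
# and genome files are named "<accession>_genomic.fna.gz".
missing_genomes() {
    tax_file=$1
    genome_dir=$2
    tmp=$(mktemp -d)
    # Accessions in the taxonomy, with the RS_/GB_ prefix removed.
    cut -f 1 "$tax_file" | sed -E 's/^(RS|GB)_//' | sort -u > "$tmp/tax.txt"
    # Accessions present on disk, derived from the file names.
    find "$genome_dir" -type f -name '*fna.gz' \
        | sed -E 's|.*/||; s/_genomic\.fna\.gz$//' | sort -u > "$tmp/disk.txt"
    # comm -23 keeps lines that occur only in the first (taxonomy) list.
    comm -23 "$tmp/tax.txt" "$tmp/disk.txt"
    rm -rf "$tmp"
}
```

`missing_genomes bac120_taxonomy.tsv genomes | wc -l` then gives a rough count of taxonomy entries without a genome file, which is the kind of entry the purge is meant to handle.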

Building the Kraken2 database:

flextaxd-create --db gtdb.fdb --genomes_path genomes --dbprogram kraken2 --create_db --db_name kraken2.gtdb_bac120 --processes 20

Optional: Expansion of tularensis group

The GTDB-database provides general resolution of bacteria. However, specialists may want finer resolution within a group of interest. Below, the tularensis group is expanded with a custom taxonomy to increase classification precision.

Obtain modification-files (taxonomy and genome-map):

wget https://github.com/FOI-Bioinformatics/flextaxd/raw/master/wiki/example_data/tularensis/ftd.tree2tax.tul.tsv
wget https://github.com/FOI-Bioinformatics/flextaxd/raw/master/wiki/example_data/tularensis/genomes_map.tul.tsv

Make a copy of the GTDB-database for modification:

cp gtdb.fdb gtdb_tularensis.fdb

Visualise the Francisellaceae sub-tree of the Fdb:

flextaxd --db gtdb_tularensis.fdb --vis_type tree --vis_node Francisellaceae --vis_depth 0 --vis_label_size 7

Figure 1: The GTDB database, showing the Francisellaceae sub-tree. The tularensis node is indicated by a red box.

Expanding the tularensis node with custom taxonomy:

flextaxd --db gtdb_tularensis.fdb --mod_file ftd.tree2tax.tul.tsv --genomeid2taxid genomes_map.tul.tsv --parent "Francisella tularensis" --replace

Visualise the Francisellaceae sub-tree of the Fdb, after modification:

flextaxd --db gtdb_tularensis.fdb --vis_type tree --vis_node Francisellaceae --vis_depth 0 --vis_label_size 7

Figure 2: The GTDB database, showing the Francisellaceae sub-tree including the expansion of tularensis (indicated by a red box).

Building the Kraken2 database:

flextaxd-create --db gtdb_tularensis.fdb --genomes_path genomes --dbprogram kraken2 --create_db --db_name kraken2.gtdb_bac120_tularensis --processes 20

# When prompted, let FlexTaxD download the genomes of the tularensis-expansion:

>There is a discrepancy of genomes found in the database and the specified genome-folder, 62290 genomes were found and 9 genomes are missing.
>You may want to purge your database from missing genomes using "flextaxd --purge_database"
>Do you want to download these genomes from NCBI? (y)es, (n)o, (c)ancel: y
