Skip to content

8e. Walkthrough 16S GTDB kraken2 database

jaclew edited this page Nov 21, 2023 · 2 revisions

(This walkthrough is work in progress and is not yet fully tested)

In this walkthrough we will show two ways of creating a Kraken2 16S database with FlexTaxD using GTDB taxonomy. One major difference between the two methods is that the second method keeps the NCBI level structure in GTDB (Genus Speces etc) which can make intepretations easier.

1 - Setup

1.1 - Environments

Install mamba in your base conda environment.

conda install mamba -n base -c conda-forge

We will then create a self-contained conda environment with flextaxd and kraken2:

mamba create -n flextaxd flextaxd kraken2

Please note that you can also use for example KrakenUniq or Ganon.

(For user new to conda please see the conda guide)

# Remember to activate the flextaxd environment
conda activate flextaxd

1.2 Fetch source files

from zedono_record Ali Alishum. (2021). DADA2 formatted 16S rRNA gene sequences for both bacteria & archaea (Version Version 4.1) [Data set]. Zenodo.

16S database files
wget https://zenodo.org/record/4409439/files/GTDB_arc_bact_taxo.txt
wget https://zenodo.org/record/4409439/files/GTDB_bac120_arc122_ssu_r95_Species.fa.gz

Download the GTDBtaxonomy: Parks, D.H., Chuvochina, M., Chaumeil, PA. et al. A complete domain-to-species taxonomy for Bacteria and Archaea. Nat Biotechnol 38, 1079–1086 (2020). GTDB

GTDB source directly
wget https://data.ace.uq.edu.au/public/gtdb/data/releases/latest/bac120_taxonomy.tsv
wget https://data.ace.uq.edu.au/public/gtdb/data/releases/latest/ar122_taxonomy.tsv

1.3 Prepare genomes

Method 1 (zedodo source)

awk '/^>/ {close(F) ; F = substr($1,2,length($1)-1)".fasta"} {print >> F}' GTDB_bac120_arc122_ssu_r95_Species.fa

Method 2 (GTDB source)

The GTDB_bac120_arc122_ssu_r95_Species_GTDB.fa file was modified changing GB_GCF_00000000_1 "[1-5]" to GCF_00000000.1 .[1-5] as this is the standard format from Genbank/Refseq and GTDB. The command below also removes the "GB" and or "RS_" from the output. The file headers in the fasta file are expected to change from

>GS_GCF_XXX_1
to
>GCF_XXX.1
sed -i 's/_1 /.1 /g' GTDB_bac120_arc122_ssu_r95_Species_GTDB.fa
sed -i 's/_2 /.2 /g' GTDB_bac120_arc122_ssu_r95_Species_GTDB.fa
#Then for 3-5

# Observe the difference from version 1 here the substr start has changed from 2 to 5 to remove "GB_" "RS_" in the headers
awk '/^>/ {close(F) ; F = substr($1,5,length($1)-1)".fasta"} {print >> F}' GTDB_bac120_arc122_ssu_r95_Species_GTDB.fa

Thanks to @HitMonk for the awk commands preparing the 16S datafile for FlexTaxD

Create genomeid2taxid file

grep ">" GTDB_bac120_arc122_ssu_r95_Species_GTDB.fa | cut -c2- | sed 's/ /      /' > g2id.txt

1.4 Two ways of creating your Kraken2 16S database with FlexTaxD

2.1 Process method 1

prepare files

awk -F'(' '{print $1}' GTDB_arc_bact_taxo.txt | awk '{print $2"_"$3}' > GTDB_arc_bact_taxo_tree.txt
sed -i 's/Bacteria/root;Bacteria/g' GTDB_arc_bact_taxo_tree.txt
sed -i 's/Archaea/root;Archaea/g' GTDB_arc_bact_taxo_tree.txt

Make unique tree

cat GTDB_arc_bact_taxo_tree.txt | sort | uniq > GTDB_arc_bact_taxo_tree_unique.txt

Run flextaxd

Create flextaxd taxonomy database

flextaxd -db 16S_database.db -tf GTDB_arc_bact_taxo_tree_unique.txt -tt CanSNPer --genomeid2taxid g2id.txt --dump --dbprogram kraken2 -o taxonomy --verbose --logs logs/zenodo

Create kraken database

flextaxd-create -db 16S_database.db --create_db --db_name GTDB_arc_bact_taxo_krakendb --genomes_path GBRS_genomes --dbprogram kraken2 -o taxonomy --verbose --processes 40

3.1 Process method 2

3.1.1 Database root

We start this process with creating the base taxonomy to allow the two sources Archaea and Bacteria to be merged. Below is the most basic tree with the two nodes that we want to merge. Check the formats to make sure that the file is properly formatted.

parent  child   rank
root    cellular_organism       no rank
cellular_organism       Archaea kingdom
cellular_organism       Bacteria        kingdom

One alternative is to use NCBI as a source check points 2.1 to 3 here Merge NCBI and GTDB walkthrough

3.1.2 Create FlexTaxD databases

Create the nessesary FlexTaxD databases for the merge.

flextaxd -db base.db -tf base_tree.txt --verbose --logs logs/GTDB
flextaxd -db bac120.db -tf bac120_taxonomy.tsv -tt QIIME --verbose --logs logs/GTDB
flextaxd -db ar122.db -tf ar122_taxonomy.tsv -tt QIIME --verbose --logs logs/GTDB
cp base.db GTDB_16S_database.db

3.1.3 Merge databases

Merge the databases with the base tree. If NCBI is used as a base remember to add the --replace parameter.

flextaxd -db GTDB_16S_database.db -md bac120.db --parent "Bacteria" --verbose --logs logs/GTDB
flextaxd -db GTDB_16S_database.db -md ar122.db --parent "Archaea" --verbose --logs logs/GTDB

3.1.4 Dump database

Export the database to nodes.dmp and names.dmp and the build kraken2 database.

flextaxd -db GTDB_16S_database.db --dump -o taxonomy --dbprogram kraken2 --verbose --logs logs/GTDB
flextaxd-create -db GTDB_16S_database.db --create_db --db_name GTDB_orig_krakendb --genomes_path genomes --dbprogram kraken2 -o taxonomy --verbose

3.1.5 Bracken analysis

It is possible to use this database for further downstream analysis with bracken, to simplify this process and the bracken database build add the ´--keep´parameter to the flextaxd-createcommand in 3.1.4 this will keep the library and taxonomy files in the kraken2 directory which are nessesary for bracken database build. Here is a guide to use bracken. Below is an example how create a bracken database on the merged GTDB 16S database

conda install bracken
flextaxd-create -db GTDB_16S_database.db --create_db --db_name GTDB_orig_krakendb --genomes_path genomes --dbprogram kraken2 -o taxonomy --verbose --keep
bracken-build -d GTDB_orig_krakendb  -t 8 -k 35 -l 600
Clone this wiki locally