-
Notifications
You must be signed in to change notification settings - Fork 8
8e. Walkthrough 16S GTDB kraken2 database
(This walkthrough is work in progress and is not yet fully tested)
In this walkthrough we will show two ways of creating a Kraken2 16S database with FlexTaxD using GTDB taxonomy. One major difference between the two methods is that the second method keeps the NCBI level structure in GTDB (Genus Speces etc) which can make intepretations easier.
Install mamba in your base conda environment.
conda install mamba -n base -c conda-forge
We will then create a self-contained conda environment with flextaxd and kraken2:
mamba create -n flextaxd flextaxd kraken2
Please note that you can also use for example KrakenUniq or Ganon.
(For user new to conda please see the conda guide)
# Remember to activate the flextaxd environment
conda activate flextaxd
from zedono_record Ali Alishum. (2021). DADA2 formatted 16S rRNA gene sequences for both bacteria & archaea (Version Version 4.1) [Data set]. Zenodo.
16S database files
wget https://zenodo.org/record/4409439/files/GTDB_arc_bact_taxo.txt
wget https://zenodo.org/record/4409439/files/GTDB_bac120_arc122_ssu_r95_Species.fa.gz
Download the GTDBtaxonomy: Parks, D.H., Chuvochina, M., Chaumeil, PA. et al. A complete domain-to-species taxonomy for Bacteria and Archaea. Nat Biotechnol 38, 1079–1086 (2020). GTDB
GTDB source directly
wget https://data.ace.uq.edu.au/public/gtdb/data/releases/latest/bac120_taxonomy.tsv
wget https://data.ace.uq.edu.au/public/gtdb/data/releases/latest/ar122_taxonomy.tsv
awk '/^>/ {close(F) ; F = substr($1,2,length($1)-1)".fasta"} {print >> F}' GTDB_bac120_arc122_ssu_r95_Species.fa
The GTDB_bac120_arc122_ssu_r95_Species_GTDB.fa file was modified changing GB_GCF_00000000_1 "[1-5]" to GCF_00000000.1 .[1-5] as this is the standard format from Genbank/Refseq and GTDB. The command below also removes the "GB" and or "RS_" from the output. The file headers in the fasta file are expected to change from
>GS_GCF_XXX_1
to
>GCF_XXX.1
sed -i 's/_1 /.1 /g' GTDB_bac120_arc122_ssu_r95_Species_GTDB.fa
sed -i 's/_2 /.2 /g' GTDB_bac120_arc122_ssu_r95_Species_GTDB.fa
#Then for 3-5
# Observe the difference from version 1 here the substr start has changed from 2 to 5 to remove "GB_" "RS_" in the headers
awk '/^>/ {close(F) ; F = substr($1,5,length($1)-1)".fasta"} {print >> F}' GTDB_bac120_arc122_ssu_r95_Species_GTDB.fa
Thanks to @HitMonk for the awk commands preparing the 16S datafile for FlexTaxD
grep ">" GTDB_bac120_arc122_ssu_r95_Species_GTDB.fa | cut -c2- | sed 's/ / /' > g2id.txt
prepare files
awk -F'(' '{print $1}' GTDB_arc_bact_taxo.txt | awk '{print $2"_"$3}' > GTDB_arc_bact_taxo_tree.txt
sed -i 's/Bacteria/root;Bacteria/g' GTDB_arc_bact_taxo_tree.txt
sed -i 's/Archaea/root;Archaea/g' GTDB_arc_bact_taxo_tree.txt
Make unique tree
cat GTDB_arc_bact_taxo_tree.txt | sort | uniq > GTDB_arc_bact_taxo_tree_unique.txt
Create flextaxd taxonomy database
flextaxd -db 16S_database.db -tf GTDB_arc_bact_taxo_tree_unique.txt -tt CanSNPer --genomeid2taxid g2id.txt --dump --dbprogram kraken2 -o taxonomy --verbose --logs logs/zenodo
Create kraken database
flextaxd-create -db 16S_database.db --create_db --db_name GTDB_arc_bact_taxo_krakendb --genomes_path GBRS_genomes --dbprogram kraken2 -o taxonomy --verbose --processes 40
We start this process with creating the base taxonomy to allow the two sources Archaea and Bacteria to be merged. Below is the most basic tree with the two nodes that we want to merge. Check the formats to make sure that the file is properly formatted.
parent child rank
root cellular_organism no rank
cellular_organism Archaea kingdom
cellular_organism Bacteria kingdom
One alternative is to use NCBI as a source check points 2.1 to 3 here Merge NCBI and GTDB walkthrough
Create the nessesary FlexTaxD databases for the merge.
flextaxd -db base.db -tf base_tree.txt --verbose --logs logs/GTDB
flextaxd -db bac120.db -tf bac120_taxonomy.tsv -tt QIIME --verbose --logs logs/GTDB
flextaxd -db ar122.db -tf ar122_taxonomy.tsv -tt QIIME --verbose --logs logs/GTDB
cp base.db GTDB_16S_database.db
Merge the databases with the base tree. If NCBI is used as a base remember to add the --replace
parameter.
flextaxd -db GTDB_16S_database.db -md bac120.db --parent "Bacteria" --verbose --logs logs/GTDB
flextaxd -db GTDB_16S_database.db -md ar122.db --parent "Archaea" --verbose --logs logs/GTDB
Export the database to nodes.dmp and names.dmp and the build kraken2 database.
flextaxd -db GTDB_16S_database.db --dump -o taxonomy --dbprogram kraken2 --verbose --logs logs/GTDB
flextaxd-create -db GTDB_16S_database.db --create_db --db_name GTDB_orig_krakendb --genomes_path genomes --dbprogram kraken2 -o taxonomy --verbose
It is possible to use this database for further downstream analysis with bracken, to simplify this process and the bracken database build add the ´--keep´parameter to the flextaxd-create
command in 3.1.4 this will keep the library and taxonomy files in the kraken2 directory which are nessesary for bracken database build. Here is a guide to use bracken. Below is an example how create a bracken database on the merged GTDB 16S database
conda install bracken
flextaxd-create -db GTDB_16S_database.db --create_db --db_name GTDB_orig_krakendb --genomes_path genomes --dbprogram kraken2 -o taxonomy --verbose --keep
bracken-build -d GTDB_orig_krakendb -t 8 -k 35 -l 600