Prepare a report for taxonomic assignment based on ITS sequences, using BLAST.
graph TD
A[Input FASTA Sequences] --> B[Sequence Quality Control]
B --> C[BLAST Search]
C --> DA(NCBI RefSeq Database)
C --> DB(UNITE Database)
C --> DC(Custom BCCDC Database)
DA --> E[Taxonomy Lookup]
DB --> E
DC --> E
E --> F[Collect & Filter Results]
F --> G[Build HTML Report]
    The pipeline requires a list of BLAST databases to run against. It should follow the following format:
ID,DBNAME,PATH
ncbi,its_ncbi,/path/to/ncbi/2024-05-16_its_ncbi
unite,its_unite,/path/to/unite/2024-05-13_its_unite
...where we expect to find the actual database files at:
/path/to/ncbi/2024-05-16_its_ncbi/its_ncbi.ndb
/path/to/ncbi/2024-05-16_its_ncbi/its_ncbi.nhr
/path/to/ncbi/2024-05-16_its_ncbi/its_ncbi.nin
...etc
/path/to/unite/2024-05-13_its_unite/its_unite.ndb
/path/to/unite/2024-05-13_its_unite/its_unite.nhr
/path/to/unite/2024-05-13_its_unite/its_unite.nin
...etc
The pipeline also assumes that there is a metadata.json file alongside the database files
/path/to/ncbi/2024-05-16_its_ncbi/metadata.json
/path/to/unite/2024-05-13_its_unite/metadata.json
The contents of the metadata file may vary by database, but we assume that:
- The file contains a single top-level object (not an array or atomic value).
- The top-level object includes these fields:
version
date
The values associated with those fields will be incorporated into the blast results. All other fields in
the metadata.json file are ignored.
nextflow run BCCDC-PHL/its-nf \
  --databases </path/to/blast/databases.csv> \
  --taxonkit_db </path/to/taxonkit/database/> \
  --fasta_input </path/to/fasta_dir> \
  --outdir </path/to/output_dir>
By default, minimum identity and coverage thresholds of 95% will be applied to the blast results.
Alternate thresholds can be applied using the --minid and --mincov flags.
nextflow run BCCDC-PHL/its-nf \
  --databases </path/to/blast/databases.csv> \
  --taxonkit_db </path/to/taxonkit/database/> \
  --fasta_input </path/to/fasta_dir> \
  --minid 99.0 \
  --mincov 97.5 \
  --outdir </path/to/output_dir>
Collecting database metadata from the metadata.json file can be skipped using the --no_db_metadata flag.
nextflow run BCCDC-PHL/its-nf \
  --databases </path/to/blast/databases.csv> \
  --taxonkit_db </path/to/taxonkit/database/> \
  --no_db_metadata \
  --fasta_input </path/to/fasta_dir> \
  --outdir </path/to/output_dir>
Each sequence will have a separate output directory, named using the seq ID parsed from the fasta header. That directory will contain:
<seq_id>_<db_id>_blast.csv
<seq_id>_<db_id>_blast_best_bitscore.csv
<seq_id>_<db_id>_blast_filtered.csv
<seq_id>_<db_id>_lineages.tsv
<seq_id>_<db_id>_seq_qc.csv
The _blast.csv, _blast_filtered.csv and blast_best_bitscore.csv files have the following headers:
query_seq_id
subject_accession
subject_strand
query_length
query_start
query_end
subject_length
subject_start
subject_end
alignment_length
percent_identity
percent_coverage
num_mismatch
num_gaps
e_value
bitscore
subject_taxids
subject_names
genus
species
database_name
database_version
database_date
...though if the --no_db_metadata flag is used when running the pipeline, the last three fields will be omitted.
The _blast_best_bitscore.csv file will only include one entry per species per database if there are multiple matches from the same
species with equally-good bitscores.
The _lineages.tsv file is generated by taxonkit, and has the following headers:
query_taxid
lineage
lineage_taxids
query_taxon_name
lineage_ranks
...where the lineage, lineage_taxids, and lineage_ranks are themselves semicolon-separated lists.
The seq_qc.csv file has the following headers:
seq_length
num_ambiguous_bases
num_n_bases
There will also be collected ouputs in the top-level of the --outdir directory, named:
collected_blast.csv
collected_blast_best_bitscore.csv
...which will include results from all sequences.