ITS BLAST Pipeline (its-nf)

Prepare a report for taxonomic assignment based on ITS sequences, using BLAST.

Workflow

graph TD
A[Input FASTA Sequences] --> B[Sequence Quality Control]
B --> C[BLAST Search]
C --> DA(NCBI RefSeq Database)
C --> DB(UNITE Database)
C --> DC(Custom BCCDC Database)
DA --> E[Taxonomy Lookup]
DB --> E
DC --> E
E --> F[Collect & Filter Results]
F --> G[Build HTML Report]

Usage

The pipeline requires a CSV file listing the BLAST databases to search. The file should have the following format:

ID,DBNAME,PATH
ncbi,its_ncbi,/path/to/ncbi/2024-05-16_its_ncbi
unite,its_unite,/path/to/unite/2024-05-13_its_unite

...where we expect to find the actual database files at:

/path/to/ncbi/2024-05-16_its_ncbi/its_ncbi.ndb
/path/to/ncbi/2024-05-16_its_ncbi/its_ncbi.nhr
/path/to/ncbi/2024-05-16_its_ncbi/its_ncbi.nin
...etc
/path/to/unite/2024-05-13_its_unite/its_unite.ndb
/path/to/unite/2024-05-13_its_unite/its_unite.nhr
/path/to/unite/2024-05-13_its_unite/its_unite.nin
...etc
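
As a quick sanity check before launching the pipeline, the expected file locations can be derived from the databases CSV. The following is a minimal sketch, not part of the pipeline itself. The function name `expected_db_files` and the choice of checking only the `.ndb`, `.nhr` and `.nin` extensions are assumptions made for illustration.

```python
import csv
import io
from pathlib import Path


def expected_db_files(databases_csv_text, extensions=("ndb", "nhr", "nin")):
    """Map each database ID to the file paths implied by its DBNAME and PATH.

    Hypothetical helper: only the three extensions listed above are checked,
    although a real BLAST database includes further files.
    """
    expected = {}
    for row in csv.DictReader(io.StringIO(databases_csv_text)):
        db_dir = Path(row["PATH"])
        expected[row["ID"]] = [
            db_dir / f"{row['DBNAME']}.{ext}" for ext in extensions
        ]
    return expected


example_csv = """ID,DBNAME,PATH
ncbi,its_ncbi,/path/to/ncbi/2024-05-16_its_ncbi
unite,its_unite,/path/to/unite/2024-05-13_its_unite
"""

paths = expected_db_files(example_csv)
for db_id, files in paths.items():
    for db_file in files:
        status = "found" if db_file.exists() else "missing"
        print(db_id, db_file, status)
```

In practice the CSV text would be read from the file passed to --databases rather than from an inline string.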

The pipeline also assumes that there is a metadata.json file alongside the database files:

/path/to/ncbi/2024-05-16_its_ncbi/metadata.json
/path/to/unite/2024-05-13_its_unite/metadata.json

The contents of the metadata file may vary by database, but we assume that:

The file contains a single top-level object (not an array or atomic value).
The top-level object includes these fields:

version
date

The values associated with those fields will be incorporated into the BLAST results. All other fields in the metadata.json file are ignored.
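
The assumptions about metadata.json can be checked with a few lines of Python. This is a minimal sketch rather than the pipeline's own code. The function name `read_db_metadata` is hypothetical, and mapping `version` and `date` onto the `database_version` and `database_date` output columns is an assumption based on the column names in the Outputs section. The example values are made up.

```python
import json


def read_db_metadata(metadata_json_text):
    """Validate a metadata.json document and keep only the fields that are used.

    Hypothetical helper: the returned key names mirror the database_version and
    database_date output columns, which is an assumption.
    """
    metadata = json.loads(metadata_json_text)
    if not isinstance(metadata, dict):
        raise ValueError("metadata.json must contain a single top-level object")
    missing = [key for key in ("version", "date") if key not in metadata]
    if missing:
        raise ValueError("metadata.json is missing fields: " + ", ".join(missing))
    # All other fields are ignored.
    return {
        "database_version": metadata["version"],
        "database_date": metadata["date"],
    }


# Made-up example content; real metadata files may carry other extra fields.
example = '{"version": "example-version", "date": "2024-05-16", "source": "ignored"}'
print(read_db_metadata(example))
```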

nextflow run BCCDC-PHL/its-nf \
  --databases </path/to/blast/databases.csv> \
  --taxonkit_db </path/to/taxonkit/database/> \
  --fasta_input </path/to/fasta_dir> \
  --outdir </path/to/output_dir>

By default, minimum identity and coverage thresholds of 95% are applied to the BLAST results. Alternate thresholds can be set using the --minid and --mincov flags.

nextflow run BCCDC-PHL/its-nf \
  --databases </path/to/blast/databases.csv> \
  --taxonkit_db </path/to/taxonkit/database/> \
  --fasta_input </path/to/fasta_dir> \
  --minid 99.0 \
  --mincov 97.5 \
  --outdir </path/to/output_dir>
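
To illustrate what the thresholds do, the sketch below applies them to rows that use the `percent_identity` and `percent_coverage` columns described under Outputs. It is not the pipeline's implementation. The function name `passes_thresholds` is hypothetical, and treating a value exactly equal to the threshold as passing is an assumption.

```python
def passes_thresholds(hit, minid=95.0, mincov=95.0):
    """Return True if a BLAST hit meets both thresholds.

    Assumption: the comparison is inclusive (>=); the pipeline may differ.
    """
    return (
        float(hit["percent_identity"]) >= minid
        and float(hit["percent_coverage"]) >= mincov
    )


# Made-up hits, with values stored as strings as they would be in a CSV row.
hits = [
    {"subject_accession": "hypothetical_1", "percent_identity": "99.5", "percent_coverage": "98.0"},
    {"subject_accession": "hypothetical_2", "percent_identity": "96.0", "percent_coverage": "99.0"},
    {"subject_accession": "hypothetical_3", "percent_identity": "99.9", "percent_coverage": "90.0"},
]

default_kept = [h["subject_accession"] for h in hits if passes_thresholds(h)]
strict_kept = [
    h["subject_accession"]
    for h in hits
    if passes_thresholds(h, minid=99.0, mincov=97.5)
]
print(default_kept)
print(strict_kept)
```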

Collecting database metadata from the metadata.json file can be skipped using the --no_db_metadata flag.

nextflow run BCCDC-PHL/its-nf \
  --databases </path/to/blast/databases.csv> \
  --taxonkit_db </path/to/taxonkit/database/> \
  --no_db_metadata \
  --fasta_input </path/to/fasta_dir> \
  --outdir </path/to/output_dir>

Outputs

Each sequence will have a separate output directory, named using the sequence ID parsed from the FASTA header. That directory will contain:

<seq_id>_<db_id>_blast.csv
<seq_id>_<db_id>_blast_best_bitscore.csv
<seq_id>_<db_id>_blast_filtered.csv
<seq_id>_<db_id>_lineages.tsv
<seq_id>_<db_id>_seq_qc.csv

The _blast.csv, _blast_filtered.csv and _blast_best_bitscore.csv files have the following headers:

query_seq_id
subject_accession
subject_strand
query_length
query_start
query_end
subject_length
subject_start
subject_end
alignment_length
percent_identity
percent_coverage
num_mismatch
num_gaps
e_value
bitscore
subject_taxids
subject_names
genus
species
database_name
database_version
database_date

...though if the --no_db_metadata flag is used when running the pipeline, the last three fields will be omitted.

If multiple matches from the same species have equally good bitscores, the _blast_best_bitscore.csv file will include only one entry per species per database.
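
One way to picture the best-bitscore selection is sketched below. It is an illustration under stated assumptions, not the pipeline's code: the function name `best_bitscore_hits` is hypothetical, the top bitscore is assumed to be determined separately for each database, and the first hit encountered is kept when a species is tied with itself. The sketch also relies on the `database_name` column, so it assumes the pipeline was run without --no_db_metadata. The accessions and scores are made up.

```python
def best_bitscore_hits(hits):
    """Keep top-scoring hits, one per species per database.

    Assumptions: the best bitscore is found per database, and the first hit
    seen for a (database, species) pair is the one retained.
    """
    best_by_db = {}
    for hit in hits:
        db = hit["database_name"]
        score = float(hit["bitscore"])
        if db not in best_by_db or score > best_by_db[db]:
            best_by_db[db] = score

    seen = set()
    kept = []
    for hit in hits:
        db = hit["database_name"]
        key = (db, hit["species"])
        if float(hit["bitscore"]) == best_by_db[db] and key not in seen:
            seen.add(key)
            kept.append(hit)
    return kept


# Made-up hits for a single query sequence.
hits = [
    {"subject_accession": "hypothetical_1", "species": "Candida albicans", "bitscore": "900", "database_name": "its_ncbi"},
    {"subject_accession": "hypothetical_2", "species": "Candida albicans", "bitscore": "900", "database_name": "its_ncbi"},
    {"subject_accession": "hypothetical_3", "species": "Candida dubliniensis", "bitscore": "900", "database_name": "its_ncbi"},
    {"subject_accession": "hypothetical_4", "species": "Candida albicans", "bitscore": "850", "database_name": "its_ncbi"},
    {"subject_accession": "hypothetical_5", "species": "Candida albicans", "bitscore": "880", "database_name": "its_unite"},
]

best = best_bitscore_hits(hits)
print([h["subject_accession"] for h in best])
```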

The _lineages.tsv file is generated by taxonkit, and has the following headers:

query_taxid
lineage
lineage_taxids
query_taxon_name
lineage_ranks

...where the lineage, lineage_taxids, and lineage_ranks are themselves semicolon-separated lists.
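
Because three of the columns are parallel semicolon-separated lists, a row is easiest to work with once the lists are zipped together. The following is a minimal sketch; the function name `parse_lineage_row` is hypothetical, and the example row is abbreviated for illustration, so a real taxonkit lineage would contain more ranks.

```python
def parse_lineage_row(line):
    """Split one data row of a _lineages.tsv file into structured fields.

    Hypothetical helper: expects exactly the five tab-separated columns listed
    above, in that order, and should not be given the header row.
    """
    fields = line.rstrip("\n").split("\t")
    query_taxid, lineage, lineage_taxids, query_taxon_name, lineage_ranks = fields
    names = lineage.split(";")
    taxids = lineage_taxids.split(";")
    ranks = lineage_ranks.split(";")
    if not (len(names) == len(taxids) == len(ranks)):
        raise ValueError("lineage columns have different lengths")
    return {
        "query_taxid": query_taxid,
        "query_taxon_name": query_taxon_name,
        "lineage": [
            {"rank": rank, "taxid": taxid, "name": name}
            for rank, taxid, name in zip(ranks, taxids, names)
        ],
    }


# Abbreviated, illustrative row (not copied from real pipeline output).
row = "5476\tEukaryota;Fungi;Candida;Candida albicans\t2759;4751;5475;5476\tCandida albicans\tsuperkingdom;kingdom;genus;species\n"
parsed = parse_lineage_row(row)
genus = next(e["name"] for e in parsed["lineage"] if e["rank"] == "genus")
print(parsed["query_taxon_name"], genus)
```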

The _seq_qc.csv file has the following headers:

seq_length
num_ambiguous_bases
num_n_bases
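
The three QC values can be approximated for a single sequence as sketched below. How the pipeline defines an ambiguous base is not stated here, so this sketch makes an assumption: `num_ambiguous_bases` counts IUPAC ambiguity codes other than N, and N is counted separately in `num_n_bases`. The function name `seq_qc` is hypothetical.

```python
# IUPAC nucleotide ambiguity codes, excluding N (counted separately below).
AMBIGUOUS_NON_N = set("RYSWKMBDHV")


def seq_qc(sequence):
    """Compute simple QC counts for one nucleotide sequence.

    Assumption: ambiguous bases exclude N; the pipeline's definition may differ.
    """
    seq = sequence.upper()
    return {
        "seq_length": len(seq),
        "num_ambiguous_bases": sum(1 for base in seq if base in AMBIGUOUS_NON_N),
        "num_n_bases": seq.count("N"),
    }


qc = seq_qc("ACGTNNRYacgt")
print(qc)
```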

There will also be collected outputs in the top level of the --outdir directory, named:

collected_blast.csv
collected_blast_best_bitscore.csv

...which will include results from all sequences.
