Skip to content

BCCDC-PHL/its-nf

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

20 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

ITS BLAST Pipeline (its-nf)

Prepare a report for taxonomic assignment based on ITS sequences, using BLAST.

Workflow

graph TD
A[Input FASTA Sequences] --> B[Sequence Quality Control]
B --> C[BLAST Search]
C --> DA(NCBI RefSeq Database)
C --> DB(UNITE Database)
C --> DC(Custom BCCDC Database)
DA --> E[Taxonomy Lookup]
DB --> E
DC --> E
E --> F[Collect & Filter Results]
F --> G[Build HTML Report]

Loading

Usage

The pipeline requires a list of BLAST databases to run against. It should follow the following format:

ID,DBNAME,PATH
ncbi,its_ncbi,/path/to/ncbi/2024-05-16_its_ncbi
unite,its_unite,/path/to/unite/2024-05-13_its_unite

...where we expect to find the actual database files at:

/path/to/ncbi/2024-05-16_its_ncbi/its_ncbi.ndb
/path/to/ncbi/2024-05-16_its_ncbi/its_ncbi.nhr
/path/to/ncbi/2024-05-16_its_ncbi/its_ncbi.nin
...etc
/path/to/unite/2024-05-13_its_unite/its_unite.ndb
/path/to/unite/2024-05-13_its_unite/its_unite.nhr
/path/to/unite/2024-05-13_its_unite/its_unite.nin
...etc

The pipeline also assumes that there is a metadata.json file alongside the database files

/path/to/ncbi/2024-05-16_its_ncbi/metadata.json
/path/to/unite/2024-05-13_its_unite/metadata.json

The contents of the metadata file may vary by database, but we assume that:

  • The file contains a single top-level object (not an array or atomic value).
  • The top-level object includes these fields:
version
date

The values associated with those fields will be incorporated into the blast results. All other fields in the metadata.json file are ignored.

nextflow run BCCDC-PHL/its-nf \
  --databases </path/to/blast/databases.csv> \
  --taxonkit_db </path/to/taxonkit/database/> \
  --fasta_input </path/to/fasta_dir> \
  --outdir </path/to/output_dir>

By default, minimum identity and coverage thresholds of 95% will be applied to the blast results. Alternate thresholds can be applied using the --minid and --mincov flags.

nextflow run BCCDC-PHL/its-nf \
  --databases </path/to/blast/databases.csv> \
  --taxonkit_db </path/to/taxonkit/database/> \
  --fasta_input </path/to/fasta_dir> \
  --minid 99.0 \
  --mincov 97.5 \
  --outdir </path/to/output_dir>

Collecting database metadata from the metadata.json file can be skipped using the --no_db_metadata flag.

nextflow run BCCDC-PHL/its-nf \
  --databases </path/to/blast/databases.csv> \
  --taxonkit_db </path/to/taxonkit/database/> \
  --no_db_metadata \
  --fasta_input </path/to/fasta_dir> \
  --outdir </path/to/output_dir>

Outputs

Each sequence will have a separate output directory, named using the seq ID parsed from the fasta header. That directory will contain:

<seq_id>_<db_id>_blast.csv
<seq_id>_<db_id>_blast_best_bitscore.csv
<seq_id>_<db_id>_blast_filtered.csv
<seq_id>_<db_id>_lineages.tsv
<seq_id>_<db_id>_seq_qc.csv

The _blast.csv, _blast_filtered.csv and blast_best_bitscore.csv files have the following headers:

query_seq_id
subject_accession
subject_strand
query_length
query_start
query_end
subject_length
subject_start
subject_end
alignment_length
percent_identity
percent_coverage
num_mismatch
num_gaps
e_value
bitscore
subject_taxids
subject_names
genus
species
database_name
database_version
database_date

...though if the --no_db_metadata flag is used when running the pipeline, the last three fields will be omitted.

The _blast_best_bitscore.csv file will only include one entry per species per database if there are multiple matches from the same species with equally-good bitscores.

The _lineages.tsv file is generated by taxonkit, and has the following headers:

query_taxid
lineage
lineage_taxids
query_taxon_name
lineage_ranks

...where the lineage, lineage_taxids, and lineage_ranks are themselves semicolon-separated lists.

The seq_qc.csv file has the following headers:

seq_length
num_ambiguous_bases
num_n_bases

There will also be collected ouputs in the top-level of the --outdir directory, named:

collected_blast.csv
collected_blast_best_bitscore.csv

...which will include results from all sequences.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Packages

No packages published

Contributors 2

  •  
  •