CallinCeriani/blast2tree

Blast2Tree

An experimental Linux pipeline optimized for haploid fungi, enabling rapid genus-to-species-level identification of multiple genomes with uncertain classification at a user-defined taxonomic level. Additionally, it extracts sequences of interest for manual review. Sequences should be single-copy and taxonomically informative.

Requires:

  • A working conda or miniconda installation (to make sure it is up to date, run conda update -n base --all)
  • A fasta file (.fa) with your reference markers (headers in default NCBI format) for each of the known species across your chosen classification level (e.g. ITS.fa)
  • A fasta file (.fa) containing the single sequence that best represents the reference marker (e.g. reference.fa). If you are not sure which sequence to use, take the best-hit marker after blast and extraction
  • Assembled genomes in the .fasta or .fna format (e.g. isolate_100.fasta)
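
Before a run, it can help to sanity-check the two .fa inputs listed above. The sketch below is not part of blast2tree; the helper names count_fasta_records and is_single_sequence are hypothetical, and the only assumption is that each sequence starts with a ">" header line.

```shell
# Count FASTA records (header lines starting with ">") in a file.
count_fasta_records() {
    grep -c '^>' "$1"
}

# Succeed only if the file holds exactly one sequence,
# which is what the reference file (e.g. reference.fa) should contain.
is_single_sequence() {
    [ "$(count_fasta_records "$1")" -eq 1 ]
}

# Usage:
#   count_fasta_records ITS.fa
#   is_single_sequence reference.fa && echo "reference.fa looks fine"
```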

Utilizes:

How to install

  1. Download
  2. Install the conda environment with
conda env create -f /path/to/download/blast2tree_environment.yml
  3. Add the blast2tree script to your PATH. In the directory containing the script, run
 pwd

to get the directory, followed by

 echo 'export PATH="$PATH:/path/to/script/dir"' >> ~/.bashrc && source ~/.bashrc

and then

chmod +x /path/to/blast2tree
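
To confirm the PATH change took effect, a check along these lines can be run in a new terminal. This is a generic shell sketch rather than anything shipped with blast2tree, and check_on_path is a hypothetical helper name.

```shell
# Report whether a command can be found on PATH.
check_on_path() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1 found at: $(command -v "$1")"
    else
        echo "$1 not found on PATH" >&2
        return 1
    fi
}

# Usage:
#   check_on_path blast2tree
```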

To run:

  • Add your genome files (either .fasta or .fna) to a folder containing a file for the reference (.fa) and the markers (.fa)

  • Then do

conda activate Blast2Tree
  • To get the help menu, do
blast2tree -h
  • To view your phylogenetic tree, activate the Blast2Tree conda environment and do
figtree

Your results can then be found in the .treefile in the _Out folder.

Processing parameters

Threads|-t

Default = 2

Working directory|--wd

Uses your current directory as the expected working directory.

Run name|--s

Run name and corresponding logfile output identifier.

--MARKER_NAME

Name of your gene marker, e.g. ITS or BT

--Input_seq

This fasta file contains the reference sequences at your specific taxonomic level. e.g. ITS.fa

--CutValue

The minimum length at which you are willing to compare your specified genes after extraction. Sequences longer than this CutValue will not be reconstructed. It therefore helps to know your expected sequence size (65% of it is a good starting point): the longer the sequence, the more resolution you can achieve.
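
As a small worked example of the 65% starting point, the sketch below computes a candidate value from an expected marker length. The helper name suggest_cut_value and the 600 bp length are assumptions for illustration only; pick the length that matches your own marker.

```shell
# Suggest a starting --CutValue as 65% of the expected marker length
# (integer arithmetic, result in base pairs).
suggest_cut_value() {
    expected_len=$1
    echo $(( expected_len * 65 / 100 ))
}

suggest_cut_value 600   # prints 390
```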

--THRESHOLD

The minimum length required for final processing, which keeps quality high by favouring longer sequences. Sequences shorter than this value are removed from the final analysis (the tree-making process) and moved to the leftovers.fasta file.
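
To make the effect of a length threshold concrete, here is a minimal awk sketch that splits a FASTA file into kept and leftover records. It only illustrates the idea and is not blast2tree's own implementation; the function name split_by_length and the file names in the usage line are hypothetical.

```shell
# Split a FASTA file by sequence length:
#   records with length >= threshold go to the "kept" file,
#   shorter records go to the "leftovers" file.
# Multi-line sequences are joined onto a single line on output.
split_by_length() {
    awk -v min="$2" -v kept="$3" -v left="$4" '
        function emit() {
            if (header != "") {
                out = (length(seq) >= min) ? kept : left
                print header > out
                print seq > out
            }
        }
        /^>/ { emit(); header = $0; seq = ""; next }
             { seq = seq $0 }
        END  { emit() }
    ' "$1"
}

# Usage:
#   split_by_length hits.fasta 300 kept.fasta leftovers.fasta
```

An output file is only created if at least one record is written to it.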

Analysis functions

Pre-align & trim|--Z

Standardises reference markers before using them in blast search and downstream processing

Build|--A

Creates blastdb for each genome and does a blast search against your genomes using your provided reference markers (e.g. ITS.fa). Thereafter, it extracts the relevant hit sequences.

Extract|--B

Determines the longest hit from your blast search and extracts it, along with any shorter sequences related to the marker that produced the hit. After extraction, identify the marker that gave the best hit for your data and add it to a file called reference.fa with a unique header, e.g. >best_ref
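
The manual step of copying the best-hit marker into reference.fa can be scripted. The sketch below is one possible way, assuming you already know a substring of the best marker's header; make_reference is a hypothetical helper, and the accession shown in the usage line is a placeholder.

```shell
# Copy the first record whose header contains a given substring
# into a new file, replacing its header with a unique one.
# Arguments: markers file, header substring, output file, [new header]
make_reference() {
    awk -v pat="$2" -v new="${4:-best_ref}" '
        /^>/ {
            keep = (!done && index($0, pat) > 0)
            if (keep) { print ">" new; done = 1 }
            next
        }
        keep { print }
    ' "$1" > "$3"
}

# Usage (placeholder accession):
#   make_reference ITS.fa "AB123456" reference.fa
```

If no header matches, the output file is left empty, so it is worth checking the result before continuing.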

Reconstruct|--C

If sequences are below the --THRESHOLD value, this step attempts to reconstruct the markers from both overlapping and non-overlapping sequences on separate contigs, regardless of directionality, to increase their length. It also filters the relevant hits in preparation for the Tree step (--D).

Tree|--D

This does alignment, trimming, and construction of a standard phylogenetic tree.

Utility functions

Rename contigs|--K

Renames all the .fasta files' contigs in a directory, based on the filename(s). Output is in the directory renamed_contigs.
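
For readers who want to see what filename-based renaming looks like, here is a rough stand-alone equivalent. The exact header scheme blast2tree uses is not documented here, so the <filename>_<number> pattern is an assumption, as is the helper name rename_contigs_in_dir.

```shell
# Rewrite contig headers in every .fasta file of a directory as
# ><file name without extension>_<running number>, writing the
# results to a renamed_contigs subdirectory. Originals are untouched.
rename_contigs_in_dir() (
    dir=$1
    mkdir -p "$dir/renamed_contigs"
    for f in "$dir"/*.fasta; do
        [ -e "$f" ] || continue
        name=$(basename "$f" .fasta)
        awk -v name="$name" '
            /^>/ { n++; print ">" name "_" n; next }
                 { print }
        ' "$f" > "$dir/renamed_contigs/$name.fasta"
    done
)

# Usage:
#   rename_contigs_in_dir /path/to/genomes
```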

Make files|--M

Makes a folder for each .fasta file in a directory, named after the file, and moves each file into its corresponding folder.
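
The same layout can be produced by hand with a short loop, which may be useful for checking what to expect. This is a plain shell sketch, not the code behind --M, and make_folders is a hypothetical name.

```shell
# For every .fasta file in a directory, create a folder named after
# the file (without the extension) and move the file into it.
make_folders() (
    cd "$1" || exit 1
    for f in *.fasta; do
        [ -e "$f" ] || continue
        d=${f%.fasta}
        mkdir -p "$d" && mv "$f" "$d/"
    done
)

# Usage:
#   make_folders /path/to/genomes
```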

How to uninstall

  1. To remove the environment, run conda remove -n Blast2Tree --all
  2. To remove the PATH entry, open ~/.bashrc (e.g. nano ~/.bashrc) and delete the export PATH line added during installation
  3. To remove the downloaded program, run e.g. rm -rf /path/to/blast2tree-v0.0.1
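
If you prefer not to edit ~/.bashrc by hand, the PATH line written during installation can be removed with a small helper like the one below. It is a sketch that assumes the line has exactly the form used in the install step; remove_path_line is a hypothetical name, and a .bak copy is kept in case something else matches.

```shell
# Remove the line export PATH="$PATH:<dir>" from a shell rc file,
# keeping a backup copy next to it as <file>.bak.
remove_path_line() {
    rc=$1
    dir=$2
    cp "$rc" "$rc.bak"
    # grep -v exits non-zero when nothing is left to print, hence "|| true"
    grep -vF "export PATH=\"\$PATH:$dir\"" "$rc.bak" > "$rc" || true
}

# Usage:
#   remove_path_line ~/.bashrc /path/to/script/dir
```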

Disclaimer

This version of the code is still being developed.

About

Extract and compare single copy DNA sequences of taxonomic interest
