A complete Starship and Starship-like region detection tool combining genome-graph based and starfish detection

SAMtoBAM/stargraph

stargraph is a tool that detects Starship-like regions (SLRs) using a genome-graph based approach and combines these with Starship results from the more conservative tool starfish.
The combination of both tools provides a comprehensive view of genomic regions impacted by Starships.
stargraph requires a minimum of two contiguous long-read assemblies in order to identify these Starship and Starship-like regions.

stargraph requires the same input as starfish and some starfish output.
Because stargraph requires starfish output, the starfish pipeline should be run first in order to feed stargraph with both tyrosine recombinase (TyrR) and Starship positions. See Step 0.

Pipeline

Using the tool requires 7 steps:
-1 --> 0 Preprocessing/Set up:
              -1: Preprocessing of input assembly data
               0: Running starfish
 1 --> 5 stargraph:
               1: Generating a genome-graph
               2: Identifying Presence/Absence Variants (PAVs)
               3: Elevating PAVs to Starship-like regions, identifying 'haplotypes' and plotting insertion sites
               4: Combining SLRs with Starships to generate a non-redundant dataset
               5: Generating alignments and Network analyses
 6 --> 7 additional stargraph modules:
               6: COMING SOON; allstars to classify newly found elements by similarity to a database of described/named elements
               7: cargobay uses a database of public fungal assemblies to find evidence of HGT

Apptainer usage

docker pull ghcr.io/samtobam/stargraph:latest

Conda installation

conda install samtobam::stargraph

STEP -1. Preprocessing of input assembly data

To help with building genome-graphs (and for starfish compatibility), sample information should be stored in the header of each contig for each assembly:
               [sample_name][delim][contig/scaffold_name]
e.g. The fasta header for a contig called 'CP097570.1' from a sample/strain called CEA10 using the underscore as a delimiter:
               >CEA10_CP097570.1
The '_' underscore is highly recommended because it is compatible with starfish and other tools and is used less often in names than other separators.
And just to be sure: the separator (use the underscore please) cannot appear in the sample or contig/scaffold name.
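The renaming described above can be sketched with a small shell function. This is only an illustration: pansn_rename is a hypothetical helper name (not a stargraph command), and it assumes plain, uncompressed FASTA input.

```shell
# Prefix every contig header with "<sample><delim>", leaving any trailing
# description (everything after the first space) untouched.
# pansn_rename is a hypothetical helper, not part of stargraph.
pansn_rename() {
    local sample="$1" fasta="$2" delim="${3:-_}"
    awk -v s="$sample" -v d="$delim" '
        /^>/ { print ">" s d substr($0, 2); next }   # rewrite header lines only
        { print }                                    # sequence lines pass through
    ' "$fasta"
}
# Example: pansn_rename CEA10 CEA10.raw.fa > CEA10.panSN.fa
```

The function writes to standard output, so the original assembly is never modified in place.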

This PanSN-spec-like naming modification needs to be done for all assemblies in your dataset.
For assemblies directly downloaded from NCBI, the trailing information (e.g. 'Aspergillus fumigatus CEA10 chromosome 8') after the contig accession can be left as is and will be ignored (because a space separates it from the contig name).
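Before going further, it may be worth checking that every header really contains the separator exactly once in the contig name. The sketch below does that, ignoring anything after the first space as described above. check_pansn_headers is a hypothetical helper (not part of stargraph), and it assumes a single-character separator and uncompressed FASTA.

```shell
# Verify that each FASTA header has the form [sample][delim][contig], with the
# delimiter occurring exactly once in the first whitespace-delimited token.
# check_pansn_headers is a hypothetical helper, not part of stargraph.
check_pansn_headers() {
    local fasta="$1" delim="${2:-_}"
    awk -v d="$delim" '
        /^>/ {
            name = substr($1, 2)        # header token without ">" or description
            n = split(name, parts, d)   # count delimiter-separated fields
            if (n != 2 || parts[1] == "" || parts[2] == "") {
                printf "bad header: %s\n", $0 > "/dev/stderr"
                bad = 1
            }
        }
        END { exit (bad ? 1 : 0) }      # non-zero exit if any header failed
    ' "$fasta"
}
# Example: check_pansn_headers CEA10.panSN.fa _ || echo "fix headers first"
```

Note that accessions which themselves contain an underscore (for example RefSeq-style names) would be reported here, consistent with the rule that the separator cannot appear in the sample or contig name.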

Following this you need to create a txt file (e.g. assemblies_panSN.txt) containing one path per line to each of the PanSN-spec-like renamed assemblies
And voila, the primary input required for stargraph is ready.
Feed this assemblies_panSN.txt file to stargraph; input parameter -a | --assemblies.
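One way to build that list file is sketched below. The directory layout and the .panSN.fa suffix are assumptions made for the example, and list_assemblies is a hypothetical helper. Absolute paths are written here as a cautious choice; the text above only asks for one path per line.

```shell
# Print one absolute path per line for every renamed assembly in a directory.
# The ".panSN.fa" suffix is an assumption; adjust it to your own file naming.
# list_assemblies is a hypothetical helper, not part of stargraph.
list_assemblies() {
    local dir="$1" suffix="${2:-.panSN.fa}" abs f
    abs=$(cd "$dir" && pwd) || return 1
    for f in "$abs"/*"$suffix"; do
        # Skip the literal pattern when nothing matches.
        if [ -e "$f" ]; then
            printf '%s\n' "$f"
        fi
    done
}
# Example: list_assemblies renamed_assemblies > assemblies_panSN.txt
```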

STEP 0. Running starfish first (wrapper included)

stargraph requires some starfish output in order to run in its entirety.
This includes:
               1. The de-novo annotations of tyrosine recombinases, used to elevate PAVs to SLRs
                          (usually can use: starfish_output/geneFinder/*.filt.gff or starfish_output/${prefix}.filt.SRGs_combined.gff)
                          (stargraph input parameter -r | --tyrRs)
               2. A final list of curated Starship elements (combined with SLRs to generate the non-redundant dataset)
                          (usually can use: starfish_output/elementFinder/*.elements.ann.feat)
                          (stargraph input parameter -e | --elements)

Therefore starfish needs to be run first (it is installed in the stargraph environment).
You can follow the starfish tutorials provided on the github/wiki.
Or
To simplify running starfish and ensure compatibility, you can use the wrapper starfish_wrapper.sh provided by stargraph.
The wrapper runs the primary steps required, mostly with default parameters.
In this case the input is the same list of paths to the PanSN-spec-like renamed assemblies as used for stargraph, e.g. assemblies_panSN.txt
Note: You can use all available assemblies for starfish but only long, contiguous assemblies for stargraph.

starfish_wrapper.sh -a assemblies_panSN.txt
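Since starfish can take all assemblies while stargraph should only be given long, contiguous ones, it can help to tabulate contig counts and total lengths before deciding which assemblies go into the stargraph list. The sketch below prints those two numbers per assembly; it does not apply any cutoff, because no threshold is stated here. assembly_stats is a hypothetical helper and assumes uncompressed FASTA.

```shell
# For each assembly path in a list file, print: path, contig count, total bases.
# assembly_stats is a hypothetical helper, not part of stargraph.
assembly_stats() {
    local list="$1" fasta
    while IFS= read -r fasta; do
        [ -n "$fasta" ] || continue                  # skip empty lines
        awk -v f="$fasta" '
            /^>/ { n++; next }                       # count header lines
            { gsub(/[ \t\r]/, ""); bases += length($0) }
            END { printf "%s\t%d\t%d\n", f, n, bases }
        ' "$fasta"
    done < "$list"
}
# Example: assembly_stats assemblies_panSN.txt | sort -k2,2n
```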

The final wrapper output includes a set of putative Starships.

Additional steps in the wrapper include the de-novo detection of DUF3723 and MYB/SANT genes associated with Starships, to be used in Starship and SLR visualisation.
These annotations are combined in the output file starfish_output/${prefix}.filt.SRGs_combined.gff (best input for stargraph --tyrRs).
Because these MYB/SANT genes are putatively located near the Starship edge opposite the captain, they are particularly helpful in visualising the ends of some elements.

STEP 1-5 Running stargraph

stargraph's initial module will find Starship-like regions and combine them with Starships.
To do this, combine the list of assembly paths with the starfish output:
the tyrosine recombinase annotations (starfish_output/geneFinder/*.filt.gff or starfish_output/${prefix}.filt.SRGs_combined.gff | --tyrRs)
the Starship annotations (starfish_output/elementFinder/*.elements.ann.feat | --elements)

stargraph.sh -a assemblies_panSN.txt -r starfish_output/geneFinder/*.filt.gff -e starfish_output/elementFinder/*.elements.ann.feat
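The command above relies on each glob matching a single file. If a pattern matches several files, or none, the arguments passed to -r and -e would not be what was intended. A minimal guard, assuming paths without spaces, could look like this; single_match is a hypothetical helper and not part of stargraph.

```shell
# Resolve a glob pattern to exactly one existing file, or fail with a message.
# single_match is a hypothetical helper, not part of stargraph.
single_match() {
    # Intentionally unquoted so the shell expands the pattern.
    set -- $1
    if [ "$#" -ne 1 ] || [ ! -e "$1" ]; then
        echo "expected exactly one existing file, got $#: $*" >&2
        return 1
    fi
    printf '%s\n' "$1"
}
# Example:
#   tyrRs=$(single_match 'starfish_output/geneFinder/*.filt.gff') || exit 1
#   elements=$(single_match 'starfish_output/elementFinder/*.elements.ann.feat') || exit 1
#   stargraph.sh -a assemblies_panSN.txt -r "$tyrRs" -e "$elements"
```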

Required inputs:
-a | --assemblies		A txt file with each line containing the path to an assembly using the PanSN-spec-like naming scheme for each contig ([sample][delim][contig/scaffold])
-r | --tyrRs			Output file from starfish annotate that contains locations for the tyrosine recombinases in all assemblies (geneFinder/*.filt.gff)
-e | --elements			Output file from starfish insert (preferably manually curated) that contains locations for the Starships (elementFinder/*.elements.ann.feat)


Recommended inputs:
-t | --threads			Number of threads for tools that accept this option (default: 1)
-i | --identifier		The identifying tag used for tyrosine recombinases; given as the -i option for starfish annotate (Default: tyr)

pggb specific inputs:
-i | --identity			-p option in pggb (Default: Automatically calculated using mash distances)
-l | --length			-s option in pggb (Default: 20000 ; a conservative value increased from default pggb values)
-k | --kmersize			-k option in pggb (Default: 19 ; same as pggb)
-G | --poaparam			-G option in pggb (Default: 7919,8069; a conservative value increased from default pggb values)

Optional parameters:
-s | --separator		PanSN-spec-like naming separator used (Default: _)
-w | --window			Size of windows used for PAV detection (Default: 1000)
-m | --minsize			Minimum size of PAVs to be kept (Default: 30000)
-x | --maxsize			Maximum size of SLRs to be kept; filter only applied after starship merging (Default: 2000000)
-k | --kmerthreshold	The minimum 'max-containment/jaccard-similarity' value to be used for clustering of elements for visualisation (Default: 0.3)
-f | --flank			Size of flanking region used when plotting element alignments (Default: 75000)
-p | --prefix			Prefix for output (Default: stargraph)
-o | --output			Name of output folder for all results (Default: stargraph_output)
-c | --cleanup			Remove a large number of files produced by each of the tools that can take up a lot of space. Choose between 'yes' or 'no' (default: 'yes')
-h | --help			Print this help message

The final output stargraph_output/${prefix}.starships_SLRs.tsv contains the final results of stargraph: a non-redundant list of Starships and Starship-like elements.
Additionally:
1.*.pggb : all genome-graph output from both pggb and odgi
                  information about the identified regions of Presence/Absence variation
2.PAVs_to_SLRs : information on the elevation of PAVs to SLRs using the provided tyrosine recombinase locations
                            also contains information on PAVs not elevated to SLR status but containing DUF3723 or MYB genes
3.SLR_plots : plots showing the alignment of SLRs clustered based on k-mer max-containment, including one insertion site per cluster
4.SLR_starship_combination : information on generating the non-redundant dataset combining the provided Starships with the newly identified SLRs
5.SLR_starship_network_alignments : plots showing the alignment of Starships and SLRs clustered together based on k-mer max-containment
                                                               plots of networks using both Jaccard similarity and containment

COMING SOON STEP 6 Running allstars

Use the allstars module in order to classify your elements using a manually curated database of named elements (db/named_starships_database.curated.fa)

allstars.sh -e stargraph_output/${prefix}.starships_SLRs.fa -l $CONDA_PREFIX/db/named_starships_database.curated.fa

STEP 7 Running cargobay

Use a database of all public fungal assemblies on NCBI (thank you sourmash team!) in order to look for your elements in other species

The elements fasta file (-e) can be the stargraph_output/4.SLR_starship_combination/*.starships_SLRs.fa
The bed file (-b) can be the stargraph_output/4.SLR_starship_combination/*.starships_SLRs.bed
The assemblies fasta (-a) can be stargraph_output/*.assemblies.fa.gz
The gff3 file (-g) can be the gff3 with all the SRGs starfish_output/${prefix}.filt.SRGs_combined.gff
The metadata file (-m) needs to be created: a simple tsv with two columns, sample in the first column and species (as given on NCBI) in the second column
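The two-column metadata file can be drafted from the assembly list used earlier. The sketch below takes the sample name from the first header of each assembly (the part before the separator) and pairs it with a species name given on the command line. It assumes uncompressed FASTA and one species for the whole list; make_metadata is a hypothetical helper, so edit the result by hand for mixed-species datasets.

```shell
# Emit "sample<TAB>species" for each assembly path in a list file.
# The sample name is read from the first FASTA header, up to the separator.
# make_metadata is a hypothetical helper, not part of cargobay or stargraph.
make_metadata() {
    local list="$1" species="$2" delim="${3:-_}" fasta
    while IFS= read -r fasta; do
        [ -n "$fasta" ] || continue                  # skip empty lines
        awk -v d="$delim" -v sp="$species" '
            /^>/ {
                split(substr($1, 2), p, d)           # p[1] is the sample name
                printf "%s\t%s\n", p[1], sp
                exit                                 # first header is enough
            }
        ' "$fasta"
    done < "$list"
}
# Example: make_metadata assemblies_panSN.txt "Aspergillus fumigatus" > metadata.tsv
```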

cargobay.sh -e elements.fa -b elements.bed -a assemblies.fa -g annotation.gff3 -m metadata.tsv

Required inputs:
-e | --elements		A multifasta file containing all the elements (Starships and SLRs) to be searched for
-b | --elementsbed	A bed file containing all the positions of the elements (Starships and SLRs)
-a | --assemblies	A multifasta file containing all the assemblies used to detect the Starships and SLRs
-g | --gff3			An annotation file containing all or a subset of genes of interest for plotting (at a minimum the de-novo annotated tyrRs)
-m | --metadata		A tsv file containing two columns of metadata; the first column is the sample name and the second an NCBI genus and species name (e.g. Aspergillus fumigatus)

Recommended inputs:
-t | --threads		Number of threads for tools that accept this option (default: 1)
-s | --separator	Separator used to split sample and Starship/SLR names (Default: "_")
-i | --identifier	Identifier for gff3 to find and highlight tyrosine recombinase genes (Default: 'tyr')

Optional parameters:
-c | --containment	The minimum proportion containment threshold for identifying candidates for HGT using the sourmash database (Default: 0.5)
-f | --flank		Number of basepairs up and downstream of the element to be used for plotting (Default: 50000)
-p | --prefix		Prefix for output (Default: cargobay)
-o | --output		Name of output folder for all results (Default: cargobay_output)
-c | --cleanup		Remove a large number of files produced by each of the tools that can take up a lot of space. Choose between 'yes' or 'no' (default: 'yes')
-h | --help			Print this help message
