stargraph is a tool that detects Starship-like regions (SLRs) using a genome-graph-based approach and combines them with Starship results from the more conservative tool starfish.
Together, the two tools provide a comprehensive view of the genomic regions impacted by Starships.
stargraph requires a minimum of two contiguous long-read assemblies in order to identify these Starship and Starship-like regions.
stargraph requires the same input as starfish, plus some starfish output.
Because stargraph takes starfish output as input, the starfish pipeline should be run first in order to feed stargraph with both tyrosine recombinase (TyrR) and Starship positions. See Step 0.
Using the tool requires 7 steps:
-1 --> 0 Preprocessing/Set up:
-1: Preprocessing of input assembly data
0: Running starfish
1 --> 5 stargraph:
1: Generating a genome-graph
2: Identifying Presence/Absence Variants (PAVs)
3: Elevating PAVs to Starship-like regions, identifying 'haplotypes' and plotting insertion sites
4: Combining SLRs with Starships to generate a non-redundant dataset
5: Generating alignments and Network analyses
6 --> 7 additional stargraph modules:
6: COMING SOON; allstars to classify newly found elements by similarity to a database of described/named elements
7: cargobay to use a database of public fungal assemblies to find evidence of HGT
docker pull ghcr.io/samtobam/stargraph:latest
conda install samtobam::stargraph
To help with building genome-graphs (and for starfish compatibility), sample information should be stored in the header of each contig for each assembly, in the form:
[sample_name][delim][contig/scaffold_name]
e.g. The fasta header for a contig called 'CP097570.1' from a sample/strain called CEA10 using the underscore as a delimiter:
>CEA10_CP097570.1
The '_' underscore is highly recommended because it is compatible with starfish and other tools and occurs less often in names than other separators.
To be clear: the separator (use the underscore please) cannot appear in the sample name or in the contig/scaffold name
This PanSN-spec-like naming modification needs to be done for all assemblies in your dataset
For assemblies directly downloaded from NCBI, the trailing information (e.g. 'Aspergillus fumigatus CEA10 chromosome 8') after the contig accession can be left as is and will be ignored (because a space separates it from the contig name)
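The separator rule above can be checked before running anything else. The following is a minimal sketch, not part of stargraph: it assumes the underscore as delimiter and uses a small demo file (demo_panSN.fa, a hypothetical name) in which one contig accession itself contains an underscore. It lists every header whose name does not split into exactly one sample and one contig field.

```shell
# Demo file (hypothetical): the second header has an extra "_" inside the contig name
printf '>CEA10_CP097570.1 chromosome 8\nACGT\n>CEA10_NC_007194.1\nACGT\n' > demo_panSN.fa

# Take the header name (text before the first space, without ">") and
# report it if splitting on "_" does not give exactly two fields
bad_headers=$(grep '^>' demo_panSN.fa | awk '{ name = substr($1, 2); if (split(name, parts, "_") != 2) print name }')
echo "$bad_headers"
```

Any header reported here would need its sample or contig name changed (or a different separator chosen) before building the assembly list.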
Following this you need to create a txt file (e.g. assemblies_panSN.txt) containing one path per line to each of the PanSN-spec-like renamed assemblies
And voila, the primary input required for stargraph is ready.
Feed this assemblies_panSN.txt file to stargraph using the input parameter -a | --assemblies.
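The renaming and list-building steps described in this section could be scripted roughly as follows. This is a sketch under stated assumptions, not a stargraph utility: it assumes one FASTA per sample stored as raw_assemblies/<sample>.fa (a hypothetical layout), takes the sample name from the file name, and writes renamed copies to panSN_assemblies/ (also hypothetical). A tiny demo assembly is created so the example runs end to end.

```shell
# Hypothetical layout: one FASTA per sample, named raw_assemblies/<sample>.fa
mkdir -p raw_assemblies panSN_assemblies
# Tiny demo assembly so the sketch runs as is (replace with real data)
printf '>CP097570.1 Aspergillus fumigatus CEA10 chromosome 8\nACGTACGT\n' > raw_assemblies/CEA10.fa

delim="_"
for fasta in raw_assemblies/*.fa; do
    sample=$(basename "$fasta" .fa)
    # Prefix every header with <sample><delim>; sequence lines are left untouched.
    # Assumes the sample name has no "/" or "&" (special to sed) and no delimiter.
    sed "s/^>/>${sample}${delim}/" "$fasta" > "panSN_assemblies/${sample}.panSN.fa"
done

# One absolute path per line, as expected by -a | --assemblies
printf '%s\n' "$PWD"/panSN_assemblies/*.panSN.fa > assemblies_panSN.txt
```

The trailing NCBI description is kept after the space, which matches the note above that it is ignored.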
stargraph requires some input from starfish in order to run in its entirety
This includes:
1. The de-novo annotations of tyrosine recombinases, used to elevate PAVs to SLRs
(usually can use: starfish_output/geneFinder/*.filt.gff or starfish_output/${prefix}.filt.SRGs_combined.gff)
(stargraph input parameter -r | --tyrRs)
2. A final list of curated Starship elements (combined with SLRs to generate the non-redundant dataset)
(usually can use: starfish_output/elementFinder/*.elements.ann.feat)
(stargraph input parameter -e | --elements)
Therefore starfish needs to be run first (it is installed in the stargraph environment)
You can follow the starfish tutorials provided on the starfish github/wiki
Alternatively, to simplify running starfish and ensure compatibility, you can use the wrapper starfish_wrapper.sh provided by stargraph
The wrapper runs the primary steps required, mostly with default parameters
In this case the input is the same list of paths to the PanSN-spec-like renamed assemblies as used for stargraph, e.g. assemblies_panSN.txt
Note: You can use all available assemblies for starfish but only long-contiguous assemblies for stargraph
starfish_wrapper.sh -a assemblies_panSN.txt
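Since starfish can take every available assembly while stargraph should only receive contiguous long-read assemblies, it can help to keep two path lists. The sketch below is one possible way to derive the second list, using the number of contigs as a rough proxy for contiguity. The cutoff (max_contigs), the file names and the demo assemblies are all hypothetical; neither tool prescribes this filter.

```shell
# Demo (hypothetical): one contiguous and one more fragmented assembly
printf '>A_chr1\nACGT\n>A_chr2\nACGT\n' > A.panSN.fa
printf '>B_c1\nAC\n>B_c2\nAC\n>B_c3\nAC\n>B_c4\nAC\n' > B.panSN.fa
printf 'A.panSN.fa\nB.panSN.fa\n' > all_assemblies_demo.txt

max_contigs=3   # hypothetical cutoff; choose a value that suits your genomes
while read -r fasta; do
    # Count FASTA headers, i.e. contigs/scaffolds, in this assembly
    n=$(grep -c '^>' "$fasta")
    if [ "$n" -le "$max_contigs" ]; then
        echo "$fasta"
    fi
done < all_assemblies_demo.txt > contiguous_assemblies_demo.txt
```

The full list would then go to starfish_wrapper.sh -a and the filtered list to stargraph.sh -a.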
The final wrapper output includes a set of putative Starships.
Additional steps in the wrapper include the de-novo detection of DUF3723 and MYB/SANT genes associated with Starships to be used in Starship and SLR visualisation
These annotations are combined in the output file starfish_output/${prefix}.filt.SRGs_combined.gff (best input for stargraph --tyrRs )
Because these MYB/SANT genes are putatively located near the Starship edge opposite the captain, they are particularly helpful in visualising the ends of some elements
stargraph's initial module will find Starship-like regions and combine them with Starships
To do this, combine the list of assembly paths with the starfish output:
the tyrosine recombinase annotations (starfish_output/geneFinder/*.filt.gff or starfish_output/${prefix}.filt.SRGs_combined.gff | --tyrRs)
the Starship annotations (starfish_output/elementFinder/*.elements.ann.feat | --elements)
stargraph.sh -a assemblies_panSN.txt -r starfish_output/geneFinder/*.filt.gff -e starfish_output/elementFinder/*.elements.ann.feat
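The -r and -e paths in the command above are shell globs, so they only behave as intended when each pattern matches exactly one file. A small guard like the following can catch the other cases before a long run starts. This is a sketch only: pick_one is a hypothetical helper, not something shipped with stargraph or starfish.

```shell
# Hypothetical helper: succeed only if the expanded glob is exactly one existing file
pick_one() {
    # "$@" holds the glob matches (or the literal pattern if nothing matched)
    if [ "$#" -ne 1 ] || [ ! -f "$1" ]; then
        echo "expected exactly one file, got: $*" >&2
        return 1
    fi
    printf '%s\n' "$1"
}

# Possible usage with the paths described above:
# tyrRs=$(pick_one starfish_output/geneFinder/*.filt.gff) || exit 1
# elements=$(pick_one starfish_output/elementFinder/*.elements.ann.feat) || exit 1
# stargraph.sh -a assemblies_panSN.txt -r "$tyrRs" -e "$elements"
```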
Required inputs:
-a | --assemblies A txt file with each line containing the path to an assembly using the PanSN-spec-like naming scheme for each contig ([sample][delim][contig/scaffold])
-r | --tyrRs Output file from starfish annotate that contains locations for the tyrosine recombinases in all assemblies (geneFinder/*.filt.gff)
-e | --elements Output file from starfish insert (preferably manually curated) that contains locations for the Starships (elementFinder/*.elements.ann.feat)
Recommended inputs:
-t | --threads Number of threads for tools that accept this option (default: 1)
-i | --identifier The identifying tag used for tyrosine recombinases; given as the -i option for starfish annotate (Default: tyr)
pggb specific inputs:
-i | --identity -p option in pggb (Default: Automatically calculated using mash distances)
-l | --length -s option in pggb (Default: 20000 ; a conservative value increased from default pggb values)
-k | --kmersize -k option in pggb (Default: 19 ; same as pggb)
-G | --poaparam -G option in pggb (Default: 7919,8069; a conservative value increased from default pggb values)
Optional parameters:
-s | --separator PanSN-spec-like naming separator used (Default: _)
-w | --window Size of windows used for PAV detection (Default: 1000)
-m | --minsize Minimum size of PAVs to be kept (Default: 30000)
-x | --maxsize Maximum size of SLRs to be kept; filter only applied after starship merging (Default: 2000000)
-k | --kmerthreshold The minimum 'max-containment/jaccard-similarity' value to be used for clustering of elements for visualisation (Default: 0.3)
-f | --flank Size of flanking region used when plotting element alignments (Default: 75000)
-p | --prefix Prefix for output (Default: stargraph)
-o | --output Name of output folder for all results (Default: stargraph_output)
-c | --cleanup Remove a large number of files produced by each of the tools that can take up a lot of space. Choose between 'yes' or 'no' (default: 'yes')
-h | --help Print this help message
The final output stargraph_output/${prefix}.starships_SLRs.tsv contains the final results of stargraph; a non-redundant list of Starships and Starship-like elements
Additionally, the output folder contains the following subfolders:
1.*.pggb : all genome-graph output from both pggb and odgi; information about the identified regions of Presence/Absence variation
2.PAVs_to_SLRs : information on the elevation of PAVs to SLRs using the provided tyrosine recombinase locations; also contains information on PAVs not elevated to SLR status but containing DUF3723 or MYB genes
3.SLR_plots : plots showing the alignment of SLRs clustered based on k-mer max-containment, including one insertion site per cluster
4.SLR_starship_combination : information on generating the non-redundant dataset combining the provided Starships with the newly identified SLRs
5.SLR_starship_network_alignments : plots showing the alignment of Starships and SLRs clustered together based on k-mer max-containment; plots of networks using both Jaccard similarity and containment
Use the allstars module in order to classify your elements using a manually curated database of named elements (db/named_starships_database.curated.fa)
allstars.sh -e stargraph_output/${prefix}.starships_SLRs.fa -l $CONDA_PREFIX/db/named_starships_database.curated.fa
Use a database of all public fungal assemblies on NCBI (thank you sourmash team!) in order to look for your elements in other species
The elements fasta file (-e) can be the stargraph_output/4.SLR_starship_combination/*.starships_SLRs.fa
The bed file (-b) can be the stargraph_output/4.SLR_starship_combination/*.starships_SLRs.bed
The assemblies fasta (-a) can be stargraph_output/*.assemblies.fa.gz
The gff3 file (-g) can be the gff3 with all the SRGs starfish_output/${prefix}.filt.SRGs_combined.gff
The metadata file (-m) needs to be created: a simple tsv with two columns, the sample name in the first column and the species (as given on NCBI) in the second column
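The metadata file described above could be generated from the same list of renamed assemblies. The sketch below makes several assumptions that should be adjusted to the dataset: the underscore is the separator, every sample belongs to one species (the species variable), no header line is written, and the demo file names are hypothetical.

```shell
# Demo input (hypothetical): one PanSN-style assembly and a path list
printf '>CEA10_CP097570.1\nACGT\n' > CEA10.demo.panSN.fa
echo "CEA10.demo.panSN.fa" > assemblies_demo.txt

species="Aspergillus fumigatus"   # assumption: all samples are this species
while read -r fasta; do
    # Sample name = text between ">" and the first "_" of the first header
    sample=$(head -n 1 "$fasta" | sed 's/^>//' | cut -d '_' -f 1)
    printf '%s\t%s\n' "$sample" "$species"
done < assemblies_demo.txt > metadata.tsv
```

For a dataset with several species, the second column would need to be filled per sample, for example from a lookup table.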
cargobay.sh -e elements.fa -b elements.bed -a assemblies.fa -g annotation.gff3 -m metadata.tsv
Required inputs:
-e | --elements A multifasta file containing all the elements (Starships and SLRs) to be searched for
-b | --elementsbed A bed file containing all the positions of the elements (Starships and SLRs)
-a | --assemblies A multifasta file containing all the assemblies used to detect the Starships and SLRs
-g | --gff3 An annotation file containing all or a subset of genes of interest for plotting (at a minimum the de-novo annotated tyrRs)
-m | --metadata A tsv file containing metadata in two columns; the first column is the sample name and the second an NCBI genus species name (e.g. Aspergillus fumigatus)
Recommended inputs:
-t | --threads Number of threads for tools that accept this option (default: 1)
-s | --separator Separator used to split sample and Starship/SLR names (Default: "_")
-i | --identifier Identifier for gff3 to find and highlight tyrosine recombinase genes (Default: 'tyr')
Optional parameters:
-c | --containment The minimum proportion containment threshold for identifying candidates for HGT using the sourmash database (Default: 0.5)
-f | --flank Number of basepairs up and downstream of the element to be used for plotting (Default: 50000)
-p | --prefix Prefix for output (Default: cargobay)
-o | --output Name of output folder for all results (Default: cargobay_output)
-c | --cleanup Remove a large number of files produced by each of the tools that can take up a lot of space. Choose between 'yes' or 'no' (default: 'yes')
-h | --help Print this help message