stargraph is a tool that detects Starship-like regions (SLRs) using a genome-graph-based approach and combines them with Starship results from the more conservative tool starfish.
The combination of both tools provides a comprehensive view of genomic regions impacted by Starships.
stargraph requires a minimum of two contiguous long-read assemblies in order to identify these Starships and Starship-like regions.

stargraph requires the same input as starfish, plus some starfish output.
Because stargraph requires starfish output, the starfish pipeline should be run first in order to feed stargraph with both tyrosine recombinase (TyrR) and Starship positions. See Step 0.

Pipeline

Using the tool requires 7 steps:
-1 --> 0 Preprocessing/Set up:
   -1: Preprocessing of input assembly data
   0: Running starfish
1 --> 5 stargraph:
   1: Generating a genome-graph
   2: Identifying Presence/Absence Variants (PAVs)
   3: Elevating PAVs to Starship-like regions, identifying 'haplotypes' and plotting insertion sites
   4: Combining SLRs with Starships to generate a non-redundant dataset
   5: Generating alignments and Network analyses
6 --> 7 additional stargraph modules:
   6: COMING SOON; allstars classifies newly found elements by similarity to a database of described/named elements
   7: cargobay uses a database of public fungal assemblies to find evidence of HGT

Apptainer usage

docker pull ghcr.io/samtobam/stargraph:latest

Conda installation

conda install samtobam::stargraph

STEP -1. Preprocessing of input assembly data

To help with building genome-graphs (and for starfish compatibility), sample information should be stored in the header of each contig in each assembly, in the form:
[sample_name][delim][contig/scaffold_name]
e.g. the fasta header for a contig called 'CP097570.1' from a sample/strain called CEA10, using the underscore as a delimiter:
>CEA10_CP097570.1
The underscore ('_') is highly recommended because it is compatible with starfish and other tools and appears less often in names than other separators.
Note that the separator (preferably the underscore) cannot appear anywhere in the sample name or in the contig/scaffold name.
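
The rule above can be checked mechanically. The following is a minimal sketch, not part of stargraph: the function name check_panSN is hypothetical, and it assumes uncompressed fasta files. It reports every header whose name (the text before the first space) does not contain the separator exactly once. A contig accession that itself contains an underscore (for example one starting with 'NZ_') would be reported when '_' is the separator.

```shell
# Hypothetical helper (not shipped with stargraph): list headers whose name
# does not contain the separator exactly once. Returns non-zero if any are found.
check_panSN() {
    local fasta="$1" delim="${2:-_}"
    awk -v d="$delim" '
        /^>/ {
            name = substr($1, 2)      # header text up to the first whitespace
            rest = name; count = 0
            # count literal occurrences of the separator
            while ((i = index(rest, d)) > 0) {
                count++
                rest = substr(rest, i + length(d))
            }
            if (count != 1) { print name; bad = 1 }
        }
        END { exit bad }
    ' "$fasta"
}

# Example usage (file name is a placeholder):
# check_panSN CEA10.panSN.fa || echo "fix the headers listed above"
```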

This PanSN-spec-like naming modification needs to be done for all assemblies in your dataset
For assemblies downloaded directly from NCBI, the trailing information after the contig accession (e.g. 'Aspergillus fumigatus CEA10 chromosome 8') can be left as is; it will be ignored because it is separated from the contig name by a space
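
One way to apply this renaming is sketched below. This is an illustration only: the function name rename_panSN and the file names are hypothetical, and the input is assumed to be an uncompressed fasta file. It prefixes every header with the sample name and separator and leaves any description after the first space untouched.

```shell
# Hypothetical helper (not shipped with stargraph): prefix every fasta header
# with "<sample><delim>" and print the renamed assembly to stdout.
rename_panSN() {
    local sample="$1" fasta="$2" delim="${3:-_}"
    awk -v s="$sample" -v d="$delim" '
        /^>/ { print ">" s d substr($0, 2); next }   # header lines
        { print }                                    # sequence lines unchanged
    ' "$fasta"
}

# Example usage (file names are placeholders):
# rename_panSN CEA10 CEA10.fa > CEA10.panSN.fa
```

With the NCBI example above, a header '>CP097570.1 Aspergillus fumigatus CEA10 chromosome 8' becomes '>CEA10_CP097570.1 Aspergillus fumigatus CEA10 chromosome 8'.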

Following this you need to create a txt file (e.g. assemblies_panSN.txt) containing one path per line to each of the PanSN-spec-like renamed assemblies
And voila, the primary input required for stargraph is ready.
Feed this assemblies_panSN.txt file to stargraph; input parameter -a | --assemblies.
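
The list file can be written by hand or generated. A minimal sketch follows, assuming all renamed assemblies sit in one directory and end in '.panSN.fa' (that suffix and the function name make_assembly_list are assumptions, not stargraph conventions). Absolute paths are written so the list does not depend on the working directory; this is a design choice of the sketch, not a documented requirement.

```shell
# Hypothetical helper (not shipped with stargraph): write one absolute path
# per line for every *.panSN.fa file found in a directory.
make_assembly_list() {
    local dir="$1" out="${2:-assemblies_panSN.txt}" f
    : > "$out"                                  # create or truncate the list
    for f in "$dir"/*.panSN.fa; do
        [ -e "$f" ] || continue                 # skip the unexpanded pattern if nothing matches
        printf '%s/%s\n' "$(cd "$(dirname "$f")" && pwd)" "$(basename "$f")" >> "$out"
    done
}

# Example usage (directory name is a placeholder):
# make_assembly_list renamed_assemblies assemblies_panSN.txt
```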

STEP 0. Running starfish first (wrapper included)

stargraph requires some starfish input in order to run in its entirety
This includes:
   1. The de-novo annotations of Tyrosine recombinases used to elevate PAVs to SLRs
      (usually can use: starfish_output/geneFinder/*.filt.gff or starfish_output/${prefix}.filt.SRGs_combined.gff )
      (stargraph input parameter -r | --tyrRs)
   2. A final list of curated Starship elements (combined with SLRs to generate the non-redundant dataset)
      (usually can use: 'starfish_output/elementFinder/*.elements.ann.feat')
      (stargraph input parameter -e | --elements)

Therefore starfish needs to be run first (it is installed in the stargraph environment)
You can follow the starfish tutorials provided on the starfish github/wiki
Or
To simplify running starfish and ensure compatibility, you can use the wrapper starfish_wrapper.sh provided by stargraph
The wrapper runs the primary steps required, mostly with default parameters
In this case the input is the same list of paths to the PanSN-spec-like renamed assemblies as used for stargraph, e.g. assemblies_panSN.txt
Note: You can use all available assemblies for starfish, but only long, contiguous assemblies for stargraph

starfish_wrapper.sh -a assemblies_panSN.txt

The final wrapper output includes a set of putative Starships.

Additional steps in the wrapper include the de-novo detection of DUF3723 and MYB/SANT genes associated with Starships, to be used in Starship and SLR visualisation
These annotations are combined in the output file starfish_output/${prefix}.filt.SRGs_combined.gff (the best input for stargraph --tyrRs)
Because these MYB/SANT genes are putatively located near the Starship edge opposite the captain, they are particularly helpful in visualising the ends of some elements

STEP 1-5 Running stargraph

stargraph's initial module finds Starship-like regions and combines them with Starships
To do this, combine the list of assembly paths with the starfish output:
   the tyrosine recombinase annotations (starfish_output/geneFinder/*.filt.gff or starfish_output/${prefix}.filt.SRGs_combined.gff | --tyrRs)
   the Starship annotations (starfish_output/elementFinder/*.elements.ann.feat | --elements)

stargraph.sh -a assemblies_panSN.txt -r starfish_output/geneFinder/*.filt.gff -e starfish_output/elementFinder/*.elements.ann.feat
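
The command above relies on each wildcard matching a single file. The sketch below, which assumes that -r and -e each expect one file (the parameter descriptions refer to an "output file"), resolves a pattern and fails if it matches zero or several files. The function name resolve_one is hypothetical, and paths containing spaces are not handled.

```shell
# Hypothetical helper (not shipped with stargraph): print the single file
# matching a glob pattern, or fail if there are zero or several matches.
resolve_one() {
    # $1 is deliberately unquoted here so that the shell expands the pattern.
    set -- $1
    if [ "$#" -ne 1 ] || [ ! -e "$1" ]; then
        echo "expected exactly one match, found: $*" >&2
        return 1
    fi
    printf '%s\n' "$1"
}

# Example usage (paths as in the command above):
# tyrRs=$(resolve_one 'starfish_output/geneFinder/*.filt.gff') || exit 1
# elements=$(resolve_one 'starfish_output/elementFinder/*.elements.ann.feat') || exit 1
# stargraph.sh -a assemblies_panSN.txt -r "$tyrRs" -e "$elements"
```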

Required inputs:
-a | --assemblies		A txt file with each line containing the path to an assembly using the PanSN-spec-like naming scheme for each contig ([sample][delim][contig/scaffold])
-r | --tyrRs			Output file from starfish annotate that contains locations for the tyrosine recombinases in all assemblies (geneFinder/*.filt.gff)
-e | --elements			Output file from starfish insert (preferably manually curated) that contains locations for the Starships (elementFinder/*.elements.ann.feat)


Recommended inputs:
-t | --threads			Number of threads for tools that accept this option (default: 1)
-i | --identifier		The identifying tag used for tyrosine recombinases; given as the -i option for starfish annotate (Default: tyr)

pggb specific inputs:
-i | --identity			-p option in pggb (Default: Automatically calculated using mash distances)
-l | --length			-s option in pggb (Default: 20000 ; a conservative value increased from default pggb values)
-k | --kmersize			-k option in pggb (Default: 19 ; same as pggb)
-G | --poaparam			-G option in pggb (Default: 7919,8069; a conservative value increased from default pggb values)

Optional parameters:
-s | --separator		PanSN-spec-like naming separator used (Default: _)
-w | --window			Size of windows used for PAV detection (Default: 1000)
-m | --minsize			Minimum size of PAVs to be kept (Default: 30000)
-x | --maxsize			Maximum size of SLRs to be kept; filter only applied after starship merging (Default: 2000000)
-k | --kmerthreshold	The minimum 'max-containment/jaccard-similarity' value to be used for clustering of elements for visualisation (Default: 0.3)
-f | --flank			Size of flanking region used when plotting element alignments (Default: 75000)
-p | --prefix			Prefix for output (Default: stargraph)
-o | --output			Name of output folder for all results (Default: stargraph_output)
-c | --cleanup			Remove a large number of files produced by each of the tools that can take up a lot of space. Choose between 'yes' or 'no' (default: 'yes')
-h | --help			Print this help message

The final output, stargraph_output/${prefix}.starships_SLRs.tsv, contains the final results of stargraph: a non-redundant list of Starships and Starship-like elements
Additional output folders:
1.*.pggb : all genome-graph output from both pggb and odgi
   information about the identified regions of Presence/Absence variation
2.PAVs_to_SLRs : information on the elevation of PAVs to SLRs using the provided tyrosine recombinase locations
   also contains information on PAVs not elevated to SLR status but containing DUF3723 or MYB genes
3.SLR_plots : plots showing the alignment of SLRs clustered based on k-mer max-containment, including one insertion site per cluster
4.SLR_starship_combination : information on generating the non-redundant dataset combining the provided Starships with the newly identified SLRs
5.SLR_starship_network_alignments : plots showing the alignment of Starships and SLRs clustered together based on k-mer max-containment
   plots of networks using both Jaccard similarity and containment

COMING SOON STEP 6 Running allstars

Use the allstars module in order to classify your elements using a manually curated database of named elements (db/named_starships_database.curated.fa)

allstars.sh -e stargraph_output/${prefix}.starships_SLRs.fa -l $CONDA_PREFIX/db/named_starships_database.curated.fa

STEP 7 Running cargobay

Use a database of all public fungal assemblies on NCBI (thank you sourmash team!) in order to look for your elements in other species

The elements fasta file (-e) can be the stargraph_output/4.SLR_starship_combination/*.starships_SLRs.fa
The bed file (-b) can be the stargraph_output/4.SLR_starship_combination/*.starships_SLRs.bed
The assemblies fasta (-a) can be stargraph_output/*.assemblies.fa.gz
The gff3 file (-g) can be the gff3 with all the SRGs: starfish_output/${prefix}.filt.SRGs_combined.gff
The metadata file (-m) needs to be created: a simple tsv with two columns, the sample in the first column and the species (as given on NCBI) in the second
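
The metadata file can be written by hand. When every sample belongs to the same species, it can also be derived from the assembly list, as in the sketch below. The function name make_metadata is hypothetical, and the sketch assumes uncompressed fasta files whose headers follow the [sample][delim][contig] scheme from Step -1. For mixed-species datasets, edit the second column afterwards.

```shell
# Hypothetical helper (not shipped with stargraph): print "sample<TAB>species"
# for each assembly in a list file, taking the sample name from the text
# before the first separator in the first fasta header.
make_metadata() {
    local list="$1" species="$2" delim="${3:-_}" fasta sample
    while IFS= read -r fasta; do
        [ -n "$fasta" ] || continue             # skip empty lines
        sample=$(awk -v d="$delim" '
            /^>/ { s = substr($1, 2); print substr(s, 1, index(s, d) - 1); exit }
        ' "$fasta")
        printf '%s\t%s\n' "$sample" "$species"
    done < "$list"
}

# Example usage (single-species dataset assumed):
# make_metadata assemblies_panSN.txt "Aspergillus fumigatus" > metadata.tsv
```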

cargobay.sh -e elements.fa -b elements.bed -a assemblies.fa -g annotation.gff3 -m metadata.tsv

Required inputs:
-e | --elements		A multifasta file containing all the elements (Starships and SLRs) to be searched for
-b | --elementsbed	A bed file containing all the positions of the elements (Starships and SLRs)
-a | --assemblies	A multifasta file containing all the assemblies used to detect the Starships and SLRs
-g | --gff3			An annotation file containing all or a subset of genes of interest for plotting (at a minimum the de-novo annotated tyrRs)
-m | --metadata		A tsv file containing two columns of metadata; the first column is the sample name and the second is an NCBI genus and species name (e.g. Aspergillus fumigatus)

Recommended inputs:
-t | --threads		Number of threads for tools that accept this option (default: 1)
-s | --separator	Separator used to split sample and Starship/SLR names (Default: "_")
-i | --identifier	Identifier for gff3 to find and highlight tyrosine recombinase genes (Default: 'tyr')

Optional parameters:
-c | --containment	The minimum proportion containment threshold for identifying candidates for HGT using the sourmash database (Default: 0.5)
-f | --flank		Number of basepairs up and downstream of the element to be used for plotting (Default: 50000)
-p | --prefix		Prefix for output (Default: cargobay)
-o | --output		Name of output folder for all results (Default: cargobay_output)
-c | --cleanup		Remove a large number of files produced by each of the tools that can take up a lot of space. Choose between 'yes' or 'no' (default: 'yes')
-h | --help			Print this help message
