-
Notifications
You must be signed in to change notification settings - Fork 6
Manual
This manual describes important details and considerations for each command in the starfish
workflow. Additional utility scripts in the aux/
directory are not explained here but rather have in-built manuals.
TIP: If you set up starfish
's command auto-completion, you can use the tab key to auto-complete commands and quickly access the -h
option, which brings up a concise help menu.
Term | Description |
---|---|
captain | a tyrosine recombinase (tyr/YR) belonging to the Starship superfamily of fungal DNA transposons |
unaffiliated captain | a captain not associated with a cargo-carrying mobile element |
cargo | all sequence within a Starship excluding the captain |
Starship | a mobile genetic element that consists of a captain YR at the 5' end and some length of cargo sequence downstream from the captain. Typically bounded by direct repeats and occasionally by terminal inverted repeats |
neighbourhood | a group of 1 or more genes located within some arbitrary distance of each other in a genome. Output by starfish sketch
|
empty/insertion site | a site that is missing an element in one genome but contains an element in at least 1 other genome. Can either be 'clean', in the case where the site consists of the known target site of the element, or messy, where no target site is identifiable. Messy sites are typically much longer than clean sites |
homologous genomic region | a region within one genome that is homologous with at least 1 other region in another genome. In the case of starfish dereplicate , homology between regions is determined through a user-defined threshold of conserved gene orthogroups |
fragment | a gene neighbourhood containing some number of Starship associated genes |
starfish
predicts the beginning and end coordinates of elements using three types of boundaries, listed here in decreasing order of confidence:
Boundary | Description |
---|---|
flank | Determined by aligning the flanking sequence of an element with the flanking sequence of an empty target site. Indicates that candidate direct repeats corresponding to the target site could be found at the 5' and 3' ends of the element. Typically indicative of a clean insertion site. Output by starfish flank
|
insert | Determined by aligning the flanking sequence of an element with the flanking sequence of an empty target site. Indicates that while an empty site was found, no direct repeats could be annotated. Typically indicative of a messy insertion site. Output by starfish insert
|
extend | Determined by aligning a known element to the sequence downstream of an unaffiliated captain Output by starfish extend
|
Starships are currently classified into 11 phylogenetic families that are named after the first element in that family to be described Gluck-Thaler and Vogan 2024. Membership to a family is determined based on similarity to HMM profiles of YR reference sequences: YRsuperfams.p1-512.hmm. Each family is further subdivided into element “naves” (“navis”, singular; latin for “ship”) based on orthology relationships between captain YRs. Naves are typically calculated on the fly using either Orthofinder or mmseqs easy-cluster
. Elements from the same navis typically have the same target site, although may carry different cargo. It is therefore useful to further group Starships into 'haplotypes' based on pairwise k-mer or sequence-based similarities between their cargo sequences (see starfish sim
and starfish group
). This lets you distinguish between elements piloted by closely related captains but that carry different cargo.
Superfamily | Family | Navis |
---|---|---|
Starship | Phoenix | e.g., Horizon |
Starship | Hephaestus | e.g., Mithridate |
Starship | Tardis | |
Starship | Serenity | |
Starship | Prometheus | e.g., Chrysaor |
Starship | Enterprise | |
Starship | Galactica | e.g., Defiant |
Starship | Moya | |
Starship | Arwing | |
Starship | Voyager | e.g., Voyager |
Starship | Family 11 |
All genomes in your analysis must have a unique genomeID. All headers in assembly and protein FASTA files, as well as gene names in GFF3 files MUST be formatted as follows:
<genomeID><separator><uniqueIdentifier>
This is generically referred to as a featureID or gene name. Gene names in GFF3 files must be stored in the attributes column as:
<nameField><geneName>
Most commands have options to specify a character for separator
as well as a string for the nameField
. Do not use :
, ;
, or |
as the separator character, and make sure it is not present in either genomeID or featureID, with the exception of files output by starfish annotate
. If providing GFF3 files, only features labeled as gene
will be parsed and all others will be ignored. Although not recommended, it is possible for separator
to appear in the uniqueIdentifier
field (e.g., aspfla1_scaffold_1)
starfish
makes extensive use of BED files, which are formatted as follows:
Column | Content |
---|---|
1 | contigID |
2 | start coordinate |
3 | end coordinate |
4 | featureID |
5 | tagID |
6 | strand |
7 | neighbourhoodID/elementID |
8 | attributes with ; delimited fields |
usage: starfish annotate [args]
de novo HMM-validated gene annotation with metaeuk.
Required:
-a, --assembly FILE 2 column tsv: genomeID, path to assembly FASTA.
-p, --profile FILE profile HMM file.
(can contain multiple)
-P, --proteins FILE FASTA file of query amino acid sequences.
-x, --prefix STR prefix for naming all output files.
-i, --idtag STR string used as prefix for predicted gene featureIDs.
-o, --outdir DIR output directory.
Required, with defaults:
-s, --separator STR character separating genomeID from featureID.
(default: '_')
-n, --nameField STR GFF attribute field where features are named.
(default: 'Name=')
--targetFeat STR feature to parse from GFF file
(default: mRNA)
-t, --tempdir DIR directory for writing temporary files.
(default: --outdir)
-T, --threads INT number of threads to use for parallel processes.
(default: 6)
--metaeukopts STR options to pass to metaeuk easy-predict.
(default: see manual)
--hmmsearchopts STR options to pass to hmmsearch.
(default: '--max -E 0.001')
Optional:
-g, --gff FILE 2 column tsv: genome code, path to GFF.
--noCheck skip format checking.
-f, --force force all steps, instead of skipping if output already exists.
-h, --help print more details and exit.
This command facilitates de novo annotating assemblies for genes of interest using the metaeuk easy-predict
workflow. It will first annotate assemblies for protein coding sequences with similarity to the provided amino acid sequences. Then, it will validate the newly predicted sequences using the provided profile HMMs using hmmsearch
, and will retain only those sequences with a match. HMM validation is recommended to decrease the probability of false positives.
NB: do NOT run this command for different genes using the same output directory. Unique output directories should be used for each gene if annotating multiple genes! Failure to do so will result in unexpected errors.
If a GFF3 file is provided, the coordinates of all the new sequences will be compared against existing gene annotations using bedtools
. If overlap is found, the existing gene names will be lifted over and will replace the metaeuk-assigned names. If multiple existing features overlap with a newly predicted gene, their existing featureIDs will be concatenated by :
into a hybrid gene name as follows:
<genomeID><separator><featureID1>:<featureID2>[:<featureIDn>]
Otherwise, all newly predicted genes will be named as follows:
<genomeID><separator><idtag><count>
where 'idtag' is specified by -i and 'count' is an analysis specific on-the-fly assignment.
The results from multiple annotate
analyses are often combined with each other. Make sure that each analysis was run with a unique 'idtag' to avoid duplicated gene names.
usage: starfish consolidate [args]
combine new and existing gene annotations.
Required:
-o, --outdir DIR output directory.
Required, with defaults:
-s, --separator STR character separating genomeID from featureID.
(default: '_')
-n, --nameField STR GFF attribute field where features are named.
(default: 'Name=')
--targetFeat STR feature to parse from GFF file
(default: mRNA)
At least one pair of old and new files:
-g, --oldGff FILE GFF file of existing feature coordinates.
-G, --newGff FILE GFF file of new coordinates output by starfish annotate.
-p, --oldProtein FILE FASTA file of existing AA sequences.
(matching -g)
-P, --newProtein FILE FASTA file of new AA sequences output by starfish annotate.
(matching -G)
-a, --oldAnn FILE 3 column tsv: sequenceID (matching -g), fieldID, ann.
-A, --newAnn FILE 3 column tsv: sequenceID (matching -G), fieldID, ann.
Optional:
--noCheck skip format checking.
-h, --help print more details and exit.
This command integrates newly predicted gene annotations and amino acid sequences into existing GFF3, FASTA, and annotation tsv files. Old genes from existing GFF3 and FASTA files that overlap with new genes will have their names, sequences and coordinates replaced with the those of the new genes. If consolidating annotation tsv files, the old annotations will be added to the new annotations instead of replacing them.
Old genes that don't overlap with any new genes will retain their information, as will new genes that don't overlap with any old genes. Consolidated GFF3, FASTA, and annotation tsv files will be printed to a new file with the original file prefix and the additional suffix .consolidated.
Annotation tsv files can have multiple entries per gene. Genes with multiple entries will have all annotations concatenated into a single string with ,
as a separator in the consolidated output file.
usage: starfish sketch [args]
identify genomic neighborhoods containing genes of interest.
Required:
-q, --queries FILE 2 column tsv: geneID of interest, tag.
-g, --gff FILE 2 column tsv: genomeID, path to GFF.
-m, --merge INT max distance to merge query genes, in bp.
-i, --idtag STR prefix for naming neighborhoods.
-x, --prefix STR prefix for naming output files.
-o, --outdir DIR output directory.
Required, with defaults:
-f, --flanking INT size of flanking region to extract, ≤ -m, in bp.
(default: 0)
-s, --separator STR character separating genomeID from featureID.
(default: '_')
-n, --nameField STR GFF3 attribute field where gene features are named.
(default: 'Name=')
--targetFeat STR feature to parse from GFF file
(default: mRNA)
Optional:
-r, --rules FILE tag-based rules for identifying regions of interest.
-a, --annotation FILE 3 column tsv: sequenceID, fieldID, annotation.
--noCheck skip format checking.
-h, --help print more details and exit.
This command identifies genomic neighbourhoods containing query genes of interest. Query genes within --merge
distance are first grouped into the same neighbourhood. Singleton genes are assigned to their own neighbourhood. Then, the up- and downstream --flanking
regions of each neighbourhood are extracted, and neighbourhoods within --merge
distance of each other are combined.
Outputs a list of mutually exclusive genomic neighbourhoods that contain the query genes of interest separated by at least --merge
distance from each other. Genes flanking a neighbourhood will sometimes be printed even if --flanking == 0
when their coordinates overlap with a query gene at the neighbourhood boundary.
To help identify neighbourhoods of interest, tags specified in the --rules
file are used to print out a list of neighbourhoodIDs that contain genes annotated with the specified tags. Tags correspond to those in the tag column of the output BED file. Each rule should consist of a string of comma-separated tags, and each line in --rules
should contain 1 rule.
usage: starfish insert [args]
predict element boundaries and insertion sites.
Required:
-a, --assembly FILE 2 column tsv: genomeID, path to assembly FASTA.
-d, --database PATH BLAST database containing all assemblies in --assembly.
-b, --bed FILE BED file with coordinates of candidate captains.
(output by starfish sketch)
-i, --idtag STR string used for candidate captains in tag col of --bed.
-x, --prefix STR prefix for naming all output files.
-o, --outdir DIR output directory.
Required, with defaults:
-s, --separator STR character separating genomeID from featureID.
(default: '_')
-T, --threads INT number of threads for parallel BLASTn search.
(default: 2)
--upstream I-I bp range to search upstream of candidate captains.
(default: 0-17000)
--downstream I-I bp range to search downstream of hits to upstreams.
(default: 0-14000)
--length I-I min-max insertion length.
(default: 2000-700000)
--pid INT min perc identity of alignments in blastn and nucmer.
(default: 90)
--hsp INT min alignment length for blastn and nucmer.
(in bp; default: 1000)
--insertcov FLT max perc coverage of an insert in an empty contig.
(default: 0.25)
--flankcov FLT max perc coverage of an up- to a downstream region.
(default: 0.25)
--blastopts STR options to pass to blastn.
(default: '-task dc-megablast -evalue 0.0001
-max_target_seqs 1000000')
--nucmeropts STR options to pass to nucmer.
(default: '--mum')
--deltafilteropts STR options to pass to delta-filter.
(default: '-m')
Optional:
-t, --taxonomy FILE 3 column tsv: genome code, taxID to include in
alignments, taxID to exclude from alignments.
(or '.' instead of taxID if not required)
-h, --help print more details and exit.
This command predicts the boundaries of mobile elements based on pairwise alignments between the flanking regions of the element and candidate empty sites. First, the upstream region of each candidate captain is used to query each provided assembly using blastn
. Then, the downstream region of each resulting hit is extracted and used to query the sequence downstream of the original candidate captain. Boundaries are then determined for each set of alignments that pass the filters. As input, it is highly recommended to provide a file to --bed
that contains only the coordinates of candidate captain genes (e.g., tyrs).
For each possible alignment between two different genomes for a given element, the single longest blastn
alignment is chosen and all others ignored. This may not always be the best way to identify a homologous pairwise alignment, especially between more divergent genomes. This feature may be improved in future releases, e.g., by chaining together alignments within a certain distance of each other and identifying the longest of such alignment neighborhoods.
To reduce the risk of false positives, several filtering steps are implemented:
- Any pair of boundaries whose length falls outside the range of
--length
is removed - Any alignment with less than
--pid
percent identity and--hsp
alignment length is ignored - Any pair of boundaries whose upstream and downstream flanking sequences align over at least
--flankcov
of their length is removed. Helps avoid false boundaries identified when the same TE flanking a captain. - Any pair of boundaries whose sequence aligns over at least
--insertcov
of its length to the contig containing a candidate insertion site is removed. Helps avoid recovering element fragments instead of actual elements.
Several hardcoded values within the command script are used to flag possible false positives in the outputted .stats file based on the expected minimum and maximum length of the direct repeats, the maximum length of an empty site, and the minimum and maximum lengths of predicted elements. You should never have to edit these, but if you want to, they are located at the top of the command script.
Checkpoint files are periodically printed out to help restart interrupted analyses. If you change ANY filtering parameters, you should remove any existing checkpoint files and restart the analysis.
The outputted BED file may contain multiple candidate 'insert' boundary features for a given element that will be compared against each other in flank
or summarize
. Boundaries associated with putatively 'clean' target sites will have the sequence of the target site printed in their attributes column.
usage: starfish flank [args]
annotate flanking repeats at element boundaries.
Required:
-a, --assembly FILE 2 column tsv: genomeID, path to assembly FASTA.
-b, --bed FILE BED file with insert features of mobile elements.
(output by starfish insert)
-x, --prefix STR prefix for naming all output files.
-o, --outdir DIR the output directory.
Required, with defaults:
-u, --upstream INT bp to search up and down from upstream boundary.
(default: 100)
-d, --downstream INT bp to search up and down from downstream boundary.
(default: 100)
-i, --maxTIR INT TIRs farther than this from DRs will be filtered out.
(in bp; default: 50)
-e, --edit FLT TIR and DR pairs must be at least this similar.
(based on Levenshtein edit distance; default: 0.75)
Optional:
-h, --help print more details and exit
This command annotates direct repeats (DRs) and terminal inverted repeats (TIRs) around candidate element boundaries using a modified version of CNEfinder
. Although the resulting DR and TIR annotations are not always correct (and the output often messy), they provide an informed estimate of where the true boundaries of an element lies, and are helpful for comparing different candidate boundaries to each other. If you are interested in correctly annotating the DRs of a given element, it is strongly recommended to manually examine the element boundaries yourself.
For each candidate boundary associated with a putatively 'clean' target site, the sequence of the target site will be used as a query to search around the upstream and downstream boundaries. Boundary features that are not associated with a clean target site will not be searched for DRs. If no DRs were found for any insert boundary associated with a given element, this command will select the pair of insert boundaries whose upstream boundary is closest to the 5' end of the captain. If a pair of DRs is found, then the sequence surrounding them will be compared to each other for TIR-like motifs.
If multiple DRs and TIRs are found for a given element across different insert boundaries, then the DR or DR-TIR pair with the longest combined sequence will be chosen and all others ignored. The boundaries of the element will be defined by this single DR or DR-TIR pair. The empty target site from which that DR-TIR pair was derived is then chosen as the ‘reference’ target site for that element.
usage: starfish summarize [args]
pick element boundaries, identify overlaps, and name sites.
Required:
-a, --assembly FILE 2 column tsv: genomeID, path to assembly FASTA.
-b, --bed FILE BED file with feature coordinates of mobile elements.
-x, --prefix STR prefix for naming all output files.
-o, --outdir DIR output directory.
Required, with defaults:
-s, --separator STR the character separating genomeID from featureID.
(default: '_')
Optional:
-S, --stats FILE stats.txt file.
(output by starfish insert)
-f, --flank FILE singleDR.stats file.
(output by starfish flank)
-g, --gff FILE 2 column tsv: genome code, path to GFF.
-n, --nameField STR GFF attribute field where features are named.
(default: 'Name=')
--targetFeat STR feature to parse from GFF file
(default: mRNA)
-A, --ann FILE 3 column tsv: sequenceID, fieldID, annotation.
(for ann field in --bed)
-t, --tag FILE 2 column tsv: sequenceID, tag.
(for tag field in --bed)
-h, --help print more details and exit.
This command summarizes and consolidates the output of the Element Finder module. It identifies all candidate Starship elements from --bed
that have predicted boundary features, will choose the boundary feature that is closest to the captain (if multiple boundary features are present), will identify any elements that are nested or overlap each other by at least 1 bp, will print all elements to a new BED file, and will then print out their nucleic acid sequence. Relevant metadata, such as DRs and TIRs associated with each named element will be printed out to a summary file. Any features that are not assigned to an element (e.g., unaffiliated candidate captains) will not be printed out. Genes part of nested or overlapping elements will be assigned multiple elementIDs. Element regions are populated with gene features parsed from --gff
. For reproducibility, this command will also assign names to all insertion sites as follows <genomeID><SEP><count>
and will print out a new .named.stats
file with these names.
usage: starfish dereplicate [args]
situate element, site, and frag haplotypes in genomic regions using flanking orthologs.
Required:
-e, --element FILE MCL-formatted group file with element IDs.
(e.g., output by starfish group)
-t, --tagged FILE BED or GFF file of genes commonly found in elements
but not currently associated with any elements.
(e.g., non-captain tyrs)
-F, --feat FILE elements.feat file.
(output by starfish summarize)
-S, --stats FILE named.stats file.
(output by starfish summarize)
-O, --orthologs FILE MCL-formatted group file of all genes in all genomes.
-g, --gff FILE 2 column tsv: genomeID, path to GFF.
-x, --prefix STR prefix for naming all output files.
-o, --outdir DIR output directory.
Required, with defaults:
-s, --separator STR character separating genomeID from featureID.
(default: '_')
-n, --nameField STR GFF attribute field where features are named.
(default: 'Name=')
--targetFeat STR feature to parse from GFF file
(default: mRNA)
-f, --flanking INT number of flanking OGs up- and downstream of
neighborhoods for determining region homology.
(default: 6)
-M, --mismatching INT max number of OG mismatches when grouping into regions.
(default: 1)
-d, --distance INT max length of up- and downstream sequence to search for
flanking OGs.
(in bp; default: 600000)
--emptyconstant FLT multiplier controlling max length of empty haplotypes
at a region with respect to ref element.
(default: 2.0)
--fragconstant FLT multiplier controlling max length of fragmented
haplotypes at a region with respect to ref element.
(default: 2.0)
--minrepeat INT min length of repeats to parse from --repeats.
(default: 250)
--mergecount FLT min fraction of haplotypes that must overlap between
2 regions to merge them.
(default: 0.5)
--mergelength FLT min fraction of bp overlap between two haplotypes to
consider them overlapping.
(default: 0.25)
Optional:
--repeats FILE 2 column tsv: genome code, path to repeat GFF.
(repeats will be ignored when comparing flanks)
--maxfrag consider any empty haplotype with at least
1 element-associated OG as fragmented.
--restrict do not use element-associated OGs for determining
region homology.
-h, --help print more details and exit.
This command enables the comparative analysis of elements, fragmented elements, and insertion sites across multiple individuals by grouping haplotypes into homologous genomic regions.
First, the nearest --flanking
gene ortholog groups within --distance
bp of each element's upstream and downstream boundaries are retrieved and sequentially compared against each other. Ortholog groups present in a greater number of individuals will be prioritized. Similar sets of upstream and downstream orthologs are then grouped into homologous genomic regions, allowing for up to --mismatching
missing or mismatching ortholog groups. The set of examined genes and ortholog groups can be further filtered by providing a --repeats
file (in which case any gene that overlaps with a repeat will be ignored) and/or by using the --restrict option (ignores any ortholog group that is present at least once in an element).
Each region is then used to search for other regions containing similar sets of gene ortholog groups. These new regions are labeled as either empty haplotypes, if the ortholog groups are all adjacent to each other, or fragmented haplotypes, if exactly one gap no longer than (--fragconstant
* the longest element length) separates the sets of upstream and downstream ortholog groups. Note that because only genes assigned to ortholog groups in the --orthologs
file are examined, two ortholog groups may be considered adjacent even though intervening genes may exists between them. As a sanity check, the maximum allowed adjacency distance between two ortholog groups for a given region is (--emptyconstant
* the longest element length). To try and ensure that empty haplotypes do not represent fragmented elements, any empty haplotype containing at least 1 gene present in --tagged will be considered a fragmented haplotype. Using the --maxfrag
option will add any ortholog group present at least once in an element to --tagged
.
To ensure some level of mutual exclusivity between regions, all haplotypes from all regions are compared against each other to identify overlapping haplotypes, where haplotypes are considered overlapping if at least --mergelength
fraction of at least one haplotype's bp coordinates overlap with the other. Regions are then merged together if at least --mergecount
fraction of haplotypes within at least 1 region are overlapping.
Predicted insertion sites from the --stats
file are cross-referenced against the list of empty and fragmented haplotypes and labeled as 'validated' if they overlap. A single reference element per homologous region will be chosen for each element family present in that region, prioritizing elements with flank boundaries and validated insertion sites. If a BED file with unaffiliated captain genes tagged as 'tyr' in the 5th field is provided, the coordinates of these genes will specifically be linked to particular fragmented haplotypes because these haplotypes may represent an unannotated or fragmented element.
Note that only genomes with genes present in --orthologs
can be analyzed with this command. For best practices, --orthologs
should be a file of low-copy number orthologs and it is recommended that users experiment with different --orthologs
files and different --flanking
and --mismatching
values to ensure consistency across different parameter settings.
usage: starfish sim [args]
calculate k-mer similarity between captains or elements.
Required:
-m, --mode STR what features to compare, either 'cap' or 'element'.
-t, --type STR sequence type to compare, either 'nucl' or 'prot'.
-b, --bed FILE BED file with feature coordinates of mobile elements.
-x, --prefix STR prefix for naming all output files.
-o, --outdir DIR output directory.
Required, if --type nucl:
-a, --assembly FILE 2 column tsv: genomeID, path to assembly FASTA.
Required, if --type prot:
-p, --protein FILE 2 column tsv: genomeID, path to protein FASTA.
Required, with defaults:
--kmer STR k-mer size for sourmash.
(default: 510 if 'nucl', 17 if 'prot')
--scaled STR scaled parameter for sourmash.
(default: 100 if 'nucl', 20 if 'prot')
Optional:
-d, --dereplicated FILE dereplicated.txt file.
(output by starfish dereplicate)
-g, --group FILE MCL-formatted group file with elementIDs.
-F, --feat FILE elements.feat file.
(output by starfish summarize; if -m element)
-r, --restrict FILE file where each row contains tab-separated regionIDs.
-h, --help print more details and exit.
This command facilitates pairwise k-mer comparisons between captain sequences or full length element sequences using sourmash noabund
. Comparisons can either be based on nucleotide or predicted amino acid sequences, but --mode element
is restricted to nucleotide-based comparisons only. All possible comparisons will be printed out, even those with 0% similarity. Providing a --restrict
file will restrict comparisons to only those regionIDs listed together on the same line
usage: starfish group [args]
sort captains or elements into homologous groups using mcl.
Required:
-s, --sim FILE 3-column tsv: refID, queID, similarity.
(e.g., output by starfish sim)
-i, --idtag STR string used as a prefix to name groups.
-o, --outdir DIR output directory.
Required, with defaults:
-t, --thresh FLT min similarity threshold [0-1).
(default: 0)
-T, --threads INT number of threads.
(default: 1)
-I, --infl FLT inflation paramenter.
(default: 1.5)
Optional:
-f, --feat FILE .feat file.
(output by starfish summarize; if comparing elements)
-h, --help print more details and exit.
This command sorts captains and elements into groups using the mcl
algorithm and a user-provided --sim
file with all pairwise similarities between entities. Higher inflation values mean more starting nodes for clustering, and thus will result in a more fine-grained analysis. As recommended by the mcl manual, if a minimum threshold is specified, each value x
will be remapped to value y
where y = x - threshold
. When remapping, values below threshold are ignored. Entities not part of a multi-member group are automatically assigned the id prefix 'sng' for singleton. The prefix of the --sim
file will automatically be parsed and used as a prefix to name all output files for good record keeping.
usage: starfish extend [args]
extend downstream element boundary using BLASTn alignments.
Required:
-a, --assembly FILE 2 column tsv with: genomeID, path to assembly FASTA.
-q, --query FILE BED file with feature coordinates of mobile elements.
(output by starfish summarize)
-t, --target FILE BED file with coordinates of candidate captains.
(output by starfish sketch)
-i, --idtag STR tag used to identify candidate captains in --target.
-g, --group FILE MCL-formatted group file with captain geneIDs.
-x, --prefix STR prefix for naming all output files.
-o, --outdir DIR output directory.
Required, with defaults:
--hsp INT min HSP length for BLASTn alignments.
(default: 5000)
--pid INT min percent identity of HSPs.
(default: 90)
--gap INT max gap between adjacent HSPs before extension terminates.
(in bp; default: 20000)
--max INT max length of extended elements.
(in bp; default: 1000000)
--min INT min length of extended elements.
(in bp; default: 15000)
--blastopts STR options to pass to BLASTn.
(default: '-task dc-megablast -evalue 0.0001
-max_target_seqs 1000000')
--upstream INT bp to begin aligning upstream of captains.
(default: 10000)
Optional:
-h, --help print more details and exit
This command annotates genomic regions that are or most likely have been elements based on similarity to known elements. It uses blastn
to search for elements and element fragments that could not be recovered using the Element Finder module. The universal principal of 'garbage in, garbage out' could not be more true for this command: please ensure that all elements in the provided --query
file are high confidence elements or else you will propagate low quality element predictions throughout your dataset.
First, the flanking regions of all unaffiliated captain genes in --target
will be progressively aligned to the flanking regions of known captains with either an 'insert' or 'flank' boundary in --query
. Alignments begin --upstream
bp from the 5' end of the captain-coding genes, and only captains part of the same homologous group in the --group
file will be compared to each other. The script will progressively align HSPs starting from the target captain's 5' upstream and moving in the 3' downstream direction until a gap greater than --gap
is encountered between adjacent HSPs. Captains nested within known elements are not aligned, even if present in --target
: this often leads to false positive boundary estimates
usage: starfish augment [args]
de novo gene annotation with metaeuk and an amino acid database.
Required:
-m, --mode STR either 'region' or 'contig'
-d, --db FILE mmseqs2 amino acid database of interest.
-a, --assembly FILE 2 column tsv with: genomeID, path to assembly FASTA.
-x, --prefix STR prefix for naming all output files.
-i, --idtag STR string used as prefix for predicted gene featureIDs.
-o, --outdir DIR output directory.
Required, if --mode region:
-F, --annField STR fieldID in --bed ann column to add new ann to.
-b, --bed FILE BED file with neighborhoods or mobile elements.
(output by starfish sketch or summarize)
Required, with defaults:
-s, --separator STR character separating genomeID from featureID.
(default: '_')
-n, --nameField STR GFF attribute field where features are named.
(default: 'Name=')
--targetFeat STR feature to parse from GFF file
(default: mRNA)
-T, --threads INT number of threads to use for parallel processes.
(default: 6)
--metaeukopts STR options to pass to metaeuk easy-predict.
(default: see -h)
Optional, if --mode contig:
--bygenome annotate all contigs from the same genome at once
--allcontigs annotate all contigs at once
Optional:
--mappings FILE 2 column tsv: database entry, annotation.
(entries must be found in --db)
--noCheck skip format checking.
-f, --force force all steps even if output already exists.
-h, --help print more details and exit.
This command facilitates annotating genomic regions of interest (e.g., elements, neighbourhoods) with the metaeuk easy-predict
workflow and a mmseqs2
profile database of interest. The coordinates of all new sequences will be compared against existing gene annotations in the --bed
tile using bedtools, such that existing annotationIDs will be given priority and any new overlapping annotations will be ignored. Otherwise, if a new gene does not overlap at all with an old gene, the new genes will be assigned a de novo identifier on the fly: <genomeID><separator><idtag><count>
.
usage: starfish cargo [args]
define element pangenomes and calculate cargo-based similarities between elements.
Required:
-b, --bed FILE BED file with feature coordinates of mobile elements.
-x, --prefix STR prefix for naming all output files.
-o, --outdir DIR output directory.
Required, with defaults:
-t, --thresh LIST thresholds for determining core, common,
and rare annotations.
(default: 0.2,0.9)
One of the following, for grouping genes together by annotations:
-f, --field STR annotation field in --bed for parsing gene annotations.
-a, --ann FILE 2 column tsv: geneID, annotation.
One or more of the following, for defining sets of elements to compare:
-g, --group FILE MCL-formatted group file with element IDs.
-d, --dereplicated FILE dereplicated.txt file with all elements per region.
(output by starfish dereplicate)
-l, --list STR 1 column list of elementIDs.
Optional:
-i, --idtag STR prefix of members of interest in --group.
(only members with this prefix will be parsed)
-h, --help print more details and exit.
This command identifies cargo genes of interest in various user-defined sets of mobile elements. Users first define which attributes should be used to identify cargo genes as part of the same anotation group (e.g., by an annotation field in the *.elements.bed
file, or by a two column --ann
file). Users then provide a list of two thresholds used to determine unique, rare, common and core cargo (e.g., a value of 0.2,0.9 means that core cargo is defined as being present in >= 0.9, common cargo >=0.2 and <0.9, rare cargo >1/N and <0.2, and unique cargo will always =1/N.
Sets of elements from which the pangenomes are derived may be defined multiple ways. If a --group
file is provided, a file listing the cargo pangenome per element group will be printed. If a --dereplicated
file is provided, a file listing the cargo pangenome per homologous genomic region will be printed. Finally, if a --list
of elementIDs is provided, a file listing the cargo pangenome across all elements in the list will be printed. All cargo summaries are based on an annotation's presence/absence, and will not take copy number into account. For each set, a reference element representing the element with the most annotation will be printed to the summary file, along with a count of the total number of elements in the set.
Only cargo with annotations will be considered. If a gene has multiple annotations (e.g., in the case of a multi-domain protein), then all annotations will be considered individualy. However, each unique annotation will only be considered once for each gene. A Jaccard index is then calculated for all pairwise combinations of elements within each set based on overlap in their cargo gene annotations, and the results are printed to a similarity file for further analysis. All genes are used for the calculation.
usage: starfish format [args]
format fasta and gff files and databases (assumes genomeIDs assigned).
At least one of the following:
-a, --assembly FILE 2 column tsv: genomeID, path to assembly FASTA.
-p, --protein FILE 2 column tsv: genomeID, path to protein FASTA.
-g, --gff FILE 2 column tsv: genomeID, path to GFF.
Required, with defaults:
-s, --separator STR character separating genomeID from featureID.
(default: '_')
-n, --nameField STR GFF attribute field where features are named.
(default: 'Name=')
--targetFeat STR feature to parse from GFF file
(default: mRNA)
Optional:
--check check file formatting and skip reformatting
--truncate truncate sequence headers to max 50 chars
--gffdb FILE print concatenated gffs to FILE
--assemblydb STR create a blastn database with prefix STR
--proteindb STR create a blastp database with prefix STR
-h, --help print more details and exit.
This command will format all sequence names in the supplied fasta and GFF3 files to conform to the formatting expectations of starfish
: <genomeID><separator><featureID>
. It will optionally truncate any sequence header to 50chars or less, because headers longer than this may lead to downstream errors in some of the dependencies. Headers in fasta files, and contigIDs and gene names in GFF3 files are first checked if they already conform to expectations, and are then formatted as necessary. If characters are already present in a sequence name, they will be removed. Only gene
features will be printed out to the formatted gff. As a sanity check, the format of new files are validated after printing, so they need not be checked for proper formatting again. Reformatted files have the prefix of their parent files and are printed to their the parent file's directory, with the suffix: starfish_format.<file_suffix>
Each genomeID can only be present once per file provided to --fasta
file, and multiple fasta path files may be provided (e.g., in the case where both protein fastas and assembly fastas need to be re-formatted).
Finally, this script will also remove any :
, ;
or |
present in the sequence name, because :
is a special character reserved for starfish annotate
and ;
and |
will mess up gff parsing and system calls. If a fasta and gff files are provided, this script will check that each sequence in fasta has an entry in the gff, and vice versa and will print out warnings for sequences that are uniquely present in one file.
usage: starfish format-ncbi [args]
format ncbi assemblies and assign genomeIDs.
Required:
-r, --report FILE an ncbi genome report file.
-a, --assemblies FILE 1 column tsv: paths to assemblies.
Required, with defaults:
-s, --separator STR the character separating genomeID from featureID.
(default: '_')
Optional:
-c, --codes FILE 1 column file listing existing genome codes.
-h, --help print more details and exit.
This command will format assemblies downloaded directly from ncbi to conform to the formatting expectations of starfish
: <genomeID><separator><featureID>
. New genomeIDs are derived from the --report file and are formatted: <first three letters of genus><first three letters of species><count>
. A --codes file with existing genome codes may be provided: any newly generated genome codes will not overlap with these, and for a given species with multiple codes, the command will attempt to continue counting from the genome code with the highest numberical count suffix.
Formatted files are printed to the directory where their unformatted parent file is located. This command will print each assemblies new genome code to STDOUT for record keeping. As a sanity check, the format of new files are validated after printing, so they need not be checked for proper formatting again.
an example of an ncbi genome report file can be found here: ftp:\/\/ftp.ncbi.nlm.nih.gov\/genomes\/GENOME_REPORTS
usage: starfish dereplicate-hood [args]
situate neighborhoods in genomic regions using flanking orthologs.
Required:
-b, --bed FILE BED file of neighborhoods of interest.
(output by starfish sketch)
-O, --orthologs FILE MCL-formatted group file of all genes in all genomes.
-g, --gff FILE 2 column tsv: genomeID, path to GFF.
-x, --prefix STR prefix for naming all output files.
-o, --outdir DIR output directory.
Required, with defaults:
-s, --separator STR character separating genomeID from featureID.
(default: '_')
-n, --nameField STR GFF attribute field where features are named.
(default: 'Name=')
--targetFeat STR feature to parse from GFF file
(default: mRNA)
-f, --flanking INT number of flanking OGs up- and downstream of
neighborhoods for determining region homology.
(default: 6)
-M, --mismatching INT max number of OG mismatches when grouping into regions.
(default: 1)
-d, --distance INT max length of up- and downstream sequence to search for
flanking OGs.
(in bp; default: 600000)
--emptyconstant FLT multiplier controlling max length of empty haplotypes
at a region with respect to ref neighborhood.
(default: 2.0)
--fragconstant FLT multiplier controlling max length of fragmented
haplotypes at a region with respect to ref neighborhood.
(default: 2.0)
--minrepeat INT min length of repeats to parse from --repeats.
(default: 250)
--mergecount FLT min fraction of haplotypes that must overlap between
2 regions to merge them.
(default: 0.5)
--mergelength FLT min fraction of bp overlap between two haplotypes to
consider them overlapping.
(default: 0.25)
Optional:
--groups FILE MCL-formatted group file with neighborhood IDs.
--repeats FILE 2 column tsv: genome code, path to repeat GFF.
(repeats will be ignored when comparing flanks)
--maxfrag consider any empty haplotype with at least
1 neighborhood-associated OG as fragmented.
--restrict do not use neighborhood-associated OGs for
determining region homology.
-h, --help print more details and exit.
This command is a derivative of starfish dereplicate
that is adapted for any generic
neighbourhood of interest. It takes a set of neighbourhoods and finds the corresponding
homologous regions that lack that neighbourhood among the provided genomes (i.e., empty
haplotypes). See the dereplicate
command for more details.
usage: starfish pair-viz [args]
execute element- element/site/region alignments, with optional circos visualization.
Required:
-m, --mode STR either 'all', 'regions', or 'pairwise'.
-t, --type STR subject sequence type, either 'element' or 'empty'.
(queries are always elements)
-A, --aligner STR alignment software to use, either 'mafft' or 'nucmer'.
-a, --assembly FILE 2 column tsv: genomeID, path to assembly FASTA.
-b, --bed FILE BED file with feature coordinates of mobile elements.
(output by starfish summarize)
-o, --outdir DIR output directory.
Required, with defaults:
-T, --threads INT number of threads.
(default: 1)
-u, --upstream INT bp to align and visualize upstream.
(default: 100000)
-d, --downstream INT bp to align and visualize downstream.
(default: 100000)
Required, if --mode all/pairwise and --type empty:
-f, --flank FILE singleDR.stats file.
(output by starfish flank)
-S, --stats FILE named.stats file.
(output by starfish summarize)
Required, if --mode pairwise:
--que STR query elementID.
--sub STR subject elementID or empty site contigID.
Required, if --mode regions:
-r, --regions FILE regions.txt file.
(output by starfish dereplicate)
Required if --aligner mafft, with defaults:
--mafftopts STR options to pass to mafft.
(default: \'--auto\')
Required if --aligner nucmer, with defaults:
--nucmeropts STR options to pass to nucmer.
(default: \'--mum\')
--deltafilteropts STR options to pass to delta-filter.
(default: \'-m -l 1000 -i 90\')
--color1 HEX hexcode of query element contig.
(without leading '#'; default: b3b1b2)
--color2 HEX hexcode of subject element/empty site contig.
(without leading '#'; default: b3b1b2)
--color3 HEX hexcode of links between subject and query.
(without leading '#'; default: b3b1b2)
--color4 HEX hexcode of elements.
(without leading '#'; default: ff264e)
Optional:
--tags print gene tags in the BED file as gene labels.
--force force all steps even if output already exists.
--noali skip alignment step and keep unaligned fastas.
-h, --help print more details and exit.
This command facilitates alignment generation and visualization for assessing element annotation quality and fuelling exploratory analyses. It makes extensive use of the mafft
and nucmer
alignment softwares, as well as circos
visualization package.
--mode all
comparisons of --type empty
visualize the alignment of an element against its reference empty insertion site (as annotated in --flank
or --stats
).
--mode all
comparisons of --type element
visualize a target element with extend boundaries against its query element
--mode regions
visualize comparisons of element haplotypes-empty haplotypes for all element haplotypes in homologous genomic regions that also have at least 1 empty haplotype. For each homologous region group, the empty region on the longest contig will be selected to serve as the visualization subject (prioritizing those with validated insertion sites).
Any element may be aligned to any insertion site present in --stats
using --mode pairwise
. If you want to align multiple specific queries and subjects, subset --bed
to contain only the query elements of interest and use --mode all
circos
visualizations will only be conducted for nucmer
alignments. Default mafft options correspond to einsi alignment method, which is recommended for sequences containing large unalignable regions
usage: starfish locus-viz [args]
execute alignments and generate synteny schematics with gggenomes.
Required:
-m, --mode STR either 'region', 'element' or 'schematic'.
-a, --assembly FILE 2 column tsv: genomeID, path to assembly FASTA.
-b, --bed FILE BED file with feature coordinates of mobile elements.
-x, --prefix STR prefix for naming output files.
-o, --outdir DIR output directory.
Required, with defaults:
-T, --threads INT number of threads for alignment software.
(default: 1)
-U, --upstream INT bp to align and visualize upstream.
(default: 0)
-D, --downstream INT bp to align and visualize downstream.
(default: 0)
--units STR units of .png, either 'cm' or 'in'.
(default: in)
Required, if --mode region or element:
-A, --aligner STR aligner to use, either 'nucmer' or 'minimap2'.
--nucmeropts STR options to pass to nucmer.
(default: \'--mum\')
--deltafilteropts STR options to pass to delta-filter.
(default: \'-m -l 1000 -i 90\')
--minimapopts STR options to pass to minimap2.
(default: \'-x asm20 -N 0 -c\')
Required, if --mode region (all output by starfish dereplicate):
-r, --regions FILE regions.txt file.
-d, --dereplicated FILE dereplicated.txt file.
-j, --jaccard FILE haplotype_jaccard.sim file.
Required, if --mode element or schematic
-l, --list FILE 1 column of elementID(s) to visualize.
(order will reflect figure if --mode element)
Required, if --mode schematic
-t, --ticks INT length in bp between tick marks scale.
(default: 25000)
Optional:
-g, --gff FILE 2 column tsv: genome code, path to GFF.
-n, --nameField STR GFF attribute field where features are named.
(required if --gff; default: 'Name=')
--targetFeat STR feature to parse from GFF file
(required if --gff; default: mRNA)
--gc FILE BED file with GC content per contig.
(e.g., output by /utils/seq-gc.sh)
--tags FILE 2 column tsv: geneID, gene tag.
(tags in --bed have priority)
--dedup only visualize non-redundant empty haplotypes.
(useful if visualizing many genomes)
--featureID label all gene models with their featureID.
--regionOrder arrange haplotypes as they appear in the regions file.
(for --mode region only)
--nocap parse all elements, even if they do not have a captain
--force force all steps even if output already exists.
-h, --help print more details and exit.
This command facilitates the generation of synteny plots using either nucmer
or minimap2
and gggenomes
. It accepts three different types of input data: --mode region-aligner
for homologous genomic regions, --mode element-aligner
for a custom list of elements, and --mode schematic
for individual elements. A 'best guess' dimension for the .svg and .png output files will be determined based on the number of elements and element length. If this best guess results in a wonky figure, or if you want to change any aspect of the generated synteny plot post hoc, edit the .R script that is printed out and execute it directly from the commandline rather than running this command again.
Note: base R
with the gggenomes
package must be installed and accessible from commandline. Extensive benchmarking of minimap2
has revealed known issues regarding its chaining algorithm, which frequenctly results in missing alignments. We recommend using --aligner nucmer
for this reason.
usage: starfish genome-viz [args]
visualize elements and sites in a genome with circos.
Required:
-m, --mode STR genomes to visualize, either 'all' or 'single'.
-a, --assembly FILE 2 column tsv: genomeID, path to assembly FASTA.
-b, --bed FILE BED file with feature coordinates of mobile elements.
-o, --outdir DIR output directory.
At most one of the following:
-S, --stats FILE named.stats file.
(output by starfish summarize)
-r, --regions FILE regions.txt file.
(output by starfish dereplicate)
Required, if --mode single:
-q, --que STR query genomeID.
Required, with defaults:
-s, --separator STR character separating genomeID from featureID.
(default: '_')
-l, --length INT min length of contigs to visualize, in bp.
(default: 50000)
-e, --empty INT max number of elementIDs to show a given insertion site.
(default: 3)
--color1 HEX hexcode of genome contigs.
(without leading '#'; default: b3b1b2)
--color2 HEX hexcode of insert and flank elements.
(without leading '#'; default: ff264e)
--color3 HEX hexcode of extend elements.
(without leading '#'; default: ff264e)
Optional:
--force force all steps even if output already exists.
-h, --help print more details and exit.
This script visualizes genome-wide elements and insertion sites or regions on a genome-by-genome basis using circos
. --mode all
will visualize each genome in the --bed file, while --mode single
will visualize a single query genome in the --bed file. Elements present in the focal genome are colored red. All genomic regions with at least one element in at least one genome or all insertion sites are labeled around the perimeter of the plot.