Skip to content
Emile Gluck-Thaler edited this page Jul 5, 2024 · 8 revisions

Introduction

This manual describes important details and considerations for each command in the starfish workflow. Additional utility scripts in the aux/ directory are not explained here but rather have in-built manuals.

TIP: If you set up starfish's command auto-completion, you can use the tab key to auto-complete commands and quickly access the -h option, which brings up a concise help menu.

Some useful definitions

Term Description
captain a tyrosine recombinase (tyr/YR) belonging to the Starship superfamily of fungal DNA transposons
unaffiliated captain a captain not associated with a cargo-carrying mobile element
cargo all sequence within a Starship excluding the captain
Starship a mobile genetic element that consists of a captain YR at the 5' end and some length of cargo sequence downstream from the captain. Typically bounded by direct repeats and occasionally by terminal inverted repeats
neighbourhood a group of 1 or more genes located within some arbitrary distance of each other in a genome. Output by starfish sketch
empty/insertion site a site that is missing an element in one genome but contains an element in at least 1 other genome. Can either be 'clean', in the case where the site consists of the known target site of the element, or messy, where no target site is identifiable. Messy sites are typically much longer than clean sites
homologous genomic region a region within one genome that is homologous with at least 1 other region in another genome. In the case of starfish dereplicate, homology between regions is determined through a user-defined threshold of conserved gene orthogroups
fragment a gene neighbourhood containing some number of Starship associated genes

starfish predicts the beginning and end coordinates of elements using three types of boundaries, listed here in decreasing order of confidence:

Boundary Description
flank Determined by aligning the flanking sequence of an element with the flanking sequence of an empty target site. Indicates that candidate direct repeats corresponding to the target site could be found at the 5' and 3' ends of the element. Typically indicative of a clean insertion site. Output by starfish flank
insert Determined by aligning the flanking sequence of an element with the flanking sequence of an empty target site. Indicates that while an empty site was found, no direct repeats could be annotated. Typically indicative of a messy insertion site. Output by starfish insert
extend Determined by aligning a known element to the sequence downstream of an unaffiliated captain Output by starfish extend

Starship classification framework

Starships are currently classified into 11 phylogenetic families that are named after the first element in that family to be described Gluck-Thaler and Vogan 2024. Membership to a family is determined based on similarity to HMM profiles of YR reference sequences: YRsuperfams.p1-512.hmm. Each family is further subdivided into element “naves” (“navis”, singular; latin for “ship”) based on orthology relationships between captain YRs. Naves are typically calculated on the fly using either Orthofinder or mmseqs easy-cluster. Elements from the same navis typically have the same target site, although may carry different cargo. It is therefore useful to further group Starships into 'haplotypes' based on pairwise k-mer or sequence-based similarities between their cargo sequences (see starfish sim and starfish group). This lets you distinguish between elements piloted by closely related captains but that carry different cargo.

Superfamily Family Navis
Starship Phoenix e.g., Horizon
Starship Hephaestus e.g., Mithridate
Starship Tardis
Starship Serenity
Starship Prometheus e.g., Chrysaor
Starship Enterprise
Starship Galactica e.g., Defiant
Starship Moya
Starship Arwing
Starship Voyager e.g., Voyager
Starship Family 11

Formatting requirements

All genomes in your analysis must have a unique genomeID. All headers in assembly and protein FASTA files, as well as gene names in GFF3 files MUST be formatted as follows:

<genomeID><separator><uniqueIdentifier>

This is generically referred to as a featureID or gene name. Gene names in GFF3 files must be stored in the attributes column as:

<nameField><geneName>

Most commands have options to specify a character for separator as well as a string for the nameField. Do not use :, ;, or | as the separator character, and make sure it is not present in either genomeID or featureID, with the exception of files output by starfish annotate. If providing GFF3 files, only features labeled as gene will be parsed and all others will be ignored. Although not recommended, it is possible for separator to appear in the uniqueIdentifier field (e.g., aspfla1_scaffold_1)

BED file format

starfish makes extensive use of BED files, which are formatted as follows:

Column Content
1 contigID
2 start coordinate
3 end coordinate
4 featureID
5 tagID
6 strand
7 neighbourhoodID/elementID
8 attributes with ; delimited fields

Gene Finder module

annotate

usage: starfish annotate [args]

de novo HMM-validated gene annotation with metaeuk.

Required:
-a, --assembly    FILE   2 column tsv: genomeID, path to assembly FASTA.
-p, --profile     FILE   profile HMM file.
                         (can contain multiple)
-P, --proteins    FILE   FASTA file of query amino acid sequences.
-x, --prefix      STR    prefix for naming all output files.
-i, --idtag       STR    string used as prefix for predicted gene featureIDs.
-o, --outdir      DIR    output directory.

Required, with defaults:
-s, --separator   STR    character separating genomeID from featureID.
                         (default: '_')
-n, --nameField   STR    GFF attribute field where features are named.
                         (default: 'Name=')
--targetFeat      STR    feature to parse from GFF file
                         (default: mRNA)
-t, --tempdir     DIR    directory for writing temporary files.
                         (default: --outdir)
-T, --threads     INT    number of threads to use for parallel processes.
                         (default: 6)
--metaeukopts     STR    options to pass to metaeuk easy-predict.
                         (default: see manual)
--hmmsearchopts   STR    options to pass to hmmsearch.
                         (default: '--max -E 0.001')

Optional:
-g, --gff         FILE   2 column tsv: genome code, path to GFF.
--noCheck                skip format checking.
-f, --force              force all steps, instead of skipping if output already exists.
-h, --help               print more details and exit.

This command facilitates de novo annotating assemblies for genes of interest using the metaeuk easy-predict workflow. It will first annotate assemblies for protein coding sequences with similarity to the provided amino acid sequences. Then, it will validate the newly predicted sequences using the provided profile HMMs using hmmsearch, and will retain only those sequences with a match. HMM validation is recommended to decrease the probability of false positives.

NB: do NOT run this command for different genes using the same output directory. Unique output directories should be used for each gene if annotating multiple genes! Failure to do so will result in unexpected errors.

If a GFF3 file is provided, the coordinates of all the new sequences will be compared against existing gene annotations using bedtools. If overlap is found, the existing gene names will be lifted over and will replace the metaeuk-assigned names. If multiple existing features overlap with a newly predicted gene, their existing featureIDs will be concatenated by : into a hybrid gene name as follows:

<genomeID><separator><featureID1>:<featureID2>[:<featureIDn>]

Otherwise, all newly predicted genes will be named as follows:

<genomeID><separator><idtag><count>

where 'idtag' is specified by -i and 'count' is an analysis specific on-the-fly assignment.

The results from multiple annotate analyses are often combined with each other. Make sure that each analysis was run with a unique 'idtag' to avoid duplicated gene names.

consolidate

usage: starfish consolidate [args]

combine new and existing gene annotations.

Required:
-o, --outdir      DIR    output directory.

Required, with defaults:
-s, --separator   STR    character separating genomeID from featureID.
                         (default: '_')
-n, --nameField   STR    GFF attribute field where features are named.
                         (default: 'Name=')
--targetFeat      STR    feature to parse from GFF file
                         (default: mRNA)

At least one pair of old and new files:
-g, --oldGff      FILE   GFF file of existing feature coordinates.
-G, --newGff      FILE   GFF file of new coordinates output by starfish annotate.
-p, --oldProtein  FILE   FASTA file of existing AA sequences.
                         (matching -g)
-P, --newProtein  FILE   FASTA file of new AA sequences output by starfish annotate.
                         (matching -G)
-a, --oldAnn      FILE   3 column tsv: sequenceID (matching -g), fieldID, ann.
-A, --newAnn      FILE   3 column tsv: sequenceID (matching -G), fieldID, ann.

Optional:
--noCheck                skip format checking.
-h, --help               print more details and exit.

This command integrates newly predicted gene annotations and amino acid sequences into existing GFF3, FASTA, and annotation tsv files. Old genes from existing GFF3 and FASTA files that overlap with new genes will have their names, sequences and coordinates replaced with the those of the new genes. If consolidating annotation tsv files, the old annotations will be added to the new annotations instead of replacing them.

Old genes that don't overlap with any new genes will retain their information, as will new genes that don't overlap with any old genes. Consolidated GFF3, FASTA, and annotation tsv files will be printed to a new file with the original file prefix and the additional suffix .consolidated.

Annotation tsv files can have multiple entries per gene. Genes with multiple entries will have all annotations concatenated into a single string with , as a separator in the consolidated output file.

sketch

usage: starfish sketch [args]

identify genomic neighborhoods containing genes of interest.

Required:
-q, --queries     FILE   2 column tsv: geneID of interest, tag.
-g, --gff         FILE   2 column tsv: genomeID, path to GFF.
-m, --merge       INT    max distance to merge query genes, in bp.
-i, --idtag       STR    prefix for naming neighborhoods.
-x, --prefix      STR    prefix for naming output files.
-o, --outdir      DIR    output directory.

Required, with defaults:
-f, --flanking    INT    size of flanking region to extract, ≤ -m, in bp.
                         (default: 0)
-s, --separator   STR    character separating genomeID from featureID.
                         (default: '_')
-n, --nameField   STR    GFF3 attribute field where gene features are named.
                         (default: 'Name=')
--targetFeat      STR    feature to parse from GFF file
                         (default: mRNA)

Optional:
-r, --rules       FILE   tag-based rules for identifying regions of interest.
-a, --annotation  FILE   3 column tsv: sequenceID, fieldID, annotation.
--noCheck                skip format checking.
-h, --help               print more details and exit.

This command identifies genomic neighbourhoods containing query genes of interest. Query genes within --merge distance are first grouped into the same neighbourhood. Singleton genes are assigned to their own neighbourhood. Then, the up- and downstream --flanking regions of each neighbourhood are extracted, and neighbourhoods within --merge distance of each other are combined.

Outputs a list of mutually exclusive genomic neighbourhoods that contain the query genes of interest separated by at least --merge distance from each other. Genes flanking a neighbourhood will sometimes be printed even if --flanking == 0 when their coordinates overlap with a query gene at the neighbourhood boundary.

To help identify neighbourhoods of interest, tags specified in the --rules file are used to print out a list of neighbourhoodIDs that contain genes annotated with the specified tags. Tags correspond to those in the tag column of the output BED file. Each rule should consist of a string of comma-separated tags, and each line in --rules should contain 1 rule.

Element Finder module

insert

usage: starfish insert [args]

predict element boundaries and insertion sites. 
 
Required:
-a, --assembly      FILE   2 column tsv: genomeID, path to assembly FASTA.
-d, --database      PATH   BLAST database containing all assemblies in --assembly.
-b, --bed           FILE   BED file with coordinates of candidate captains.
                           (output by starfish sketch)
-i, --idtag         STR    string used for candidate captains in tag col of --bed.
-x, --prefix        STR    prefix for naming all output files.
-o, --outdir        DIR    output directory.

Required, with defaults:
-s, --separator     STR    character separating genomeID from featureID.
                           (default: '_')
-T, --threads       INT    number of threads for parallel BLASTn search.
                           (default: 2)
--upstream          I-I    bp range to search upstream of candidate captains.
                           (default: 0-17000)
--downstream        I-I    bp range to search downstream of hits to upstreams.
                           (default: 0-14000)
--length            I-I    min-max insertion length.
                           (default: 2000-700000)
--pid               INT    min perc identity of alignments in blastn and nucmer.
                           (default: 90)
--hsp               INT    min alignment length for blastn and nucmer.
                           (in bp; default: 1000)
--insertcov         FLT    max perc coverage of an insert in an empty contig.
                           (default: 0.25)
--flankcov          FLT    max perc coverage of an up- to a downstream region.
                           (default: 0.25)
--blastopts         STR    options to pass to blastn.
                           (default: '-task dc-megablast -evalue 0.0001
                           -max_target_seqs 1000000')
--nucmeropts        STR    options to pass to nucmer.
                           (default: '--mum')
--deltafilteropts   STR    options to pass to delta-filter.
                           (default: '-m')
Optional:  
-t, --taxonomy      FILE   3 column tsv: genome code, taxID to include in 
                           alignments, taxID to exclude from alignments.
                           (or '.' instead of taxID if not required)
-h, --help                 print more details and exit.

This command predicts the boundaries of mobile elements based on pairwise alignments between the flanking regions of the element and candidate empty sites. First, the upstream region of each candidate captain is used to query each provided assembly using blastn. Then, the downstream region of each resulting hit is extracted and used to query the sequence downstream of the original candidate captain. Boundaries are then determined for each set of alignments that pass the filters. As input, it is highly recommended to provide a file to --bed that contains only the coordinates of candidate captain genes (e.g., tyrs).

For each possible alignment between two different genomes for a given element, the single longest blastn alignment is chosen and all others ignored. This may not always be the best way to identify a homologous pairwise alignment, especially between more divergent genomes. This feature may be improved in future releases, e.g., by chaining together alignments within a certain distance of each other and identifying the longest of such alignment neighborhoods.

To reduce the risk of false positives, several filtering steps are implemented:

  1. Any pair of boundaries whose length falls outside the range of --length is removed
  2. Any alignment with less than --pid percent identity and --hsp alignment length is ignored
  3. Any pair of boundaries whose upstream and downstream flanking sequences align over at least --flankcov of their length is removed. Helps avoid false boundaries identified when the same TE flanking a captain.
  4. Any pair of boundaries whose sequence aligns over at least --insertcov of its length to the contig containing a candidate insertion site is removed. Helps avoid recovering element fragments instead of actual elements.

Several hardcoded values within the command script are used to flag possible false positives in the outputted .stats file based on the expected minimum and maximum length of the direct repeats, the maximum length of an empty site, and the minimum and maximum lengths of predicted elements. You should never have to edit these, but if you want to, they are located at the top of the command script.

Checkpoint files are periodically printed out to help restart interrupted analyses. If you change ANY filtering parameters, you should remove any existing checkpoint files and restart the analysis.

The outputted BED file may contain multiple candidate 'insert' boundary features for a given element that will be compared against each other in flank or summarize. Boundaries associated with putatively 'clean' target sites will have the sequence of the target site printed in their attributes column.

flank

usage: starfish flank [args]

annotate flanking repeats at element boundaries.

Required:
-a, --assembly    FILE   2 column tsv: genomeID, path to assembly FASTA.
-b, --bed         FILE   BED file with insert features of mobile elements.
                         (output by starfish insert)
-x, --prefix      STR    prefix for naming all output files.
-o, --outdir      DIR    the output directory.

Required, with defaults:
-u, --upstream    INT    bp to search up and down from upstream boundary.
                         (default: 100)
-d, --downstream  INT    bp to search up and down from downstream boundary.
                         (default: 100)
-i, --maxTIR      INT    TIRs farther than this from DRs will be filtered out.
                         (in bp; default: 50)
-e, --edit        FLT    TIR and DR pairs must be at least this similar.
                         (based on Levenshtein edit distance; default: 0.75)

Optional:
-h, --help               print more details and exit

This command annotates direct repeats (DRs) and terminal inverted repeats (TIRs) around candidate element boundaries using a modified version of CNEfinder. Although the resulting DR and TIR annotations are not always correct (and the output often messy), they provide an informed estimate of where the true boundaries of an element lies, and are helpful for comparing different candidate boundaries to each other. If you are interested in correctly annotating the DRs of a given element, it is strongly recommended to manually examine the element boundaries yourself.

For each candidate boundary associated with a putatively 'clean' target site, the sequence of the target site will be used as a query to search around the upstream and downstream boundaries. Boundary features that are not associated with a clean target site will not be searched for DRs. If no DRs were found for any insert boundary associated with a given element, this command will select the pair of insert boundaries whose upstream boundary is closest to the 5' end of the captain. If a pair of DRs is found, then the sequence surrounding them will be compared to each other for TIR-like motifs.

If multiple DRs and TIRs are found for a given element across different insert boundaries, then the DR or DR-TIR pair with the longest combined sequence will be chosen and all others ignored. The boundaries of the element will be defined by this single DR or DR-TIR pair. The empty target site from which that DR-TIR pair was derived is then chosen as the ‘reference’ target site for that element.

summarize

usage: starfish summarize [args]

pick element boundaries, identify overlaps, and name sites.

Required:
-a, --assembly    FILE   2 column tsv: genomeID, path to assembly FASTA.
-b, --bed         FILE   BED file with feature coordinates of mobile elements.
-x, --prefix      STR    prefix for naming all output files.
-o, --outdir      DIR    output directory.

Required, with defaults:
-s, --separator   STR    the character separating genomeID from featureID.
                         (default: '_')

Optional:
-S, --stats       FILE   stats.txt file.
                         (output by starfish insert)
-f, --flank       FILE   singleDR.stats file.
                         (output by starfish flank)
-g, --gff         FILE   2 column tsv: genome code, path to GFF.
-n, --nameField   STR    GFF attribute field where features are named.
                         (default: 'Name=')
--targetFeat      STR    feature to parse from GFF file
                         (default: mRNA)
-A, --ann         FILE   3 column tsv: sequenceID, fieldID, annotation.
                         (for ann field in --bed)
-t, --tag         FILE   2 column tsv: sequenceID, tag.
                         (for tag field in --bed)
-h, --help               print more details and exit.

This command summarizes and consolidates the output of the Element Finder module. It identifies all candidate Starship elements from --bed that have predicted boundary features, will choose the boundary feature that is closest to the captain (if multiple boundary features are present), will identify any elements that are nested or overlap each other by at least 1 bp, will print all elements to a new BED file, and will then print out their nucleic acid sequence. Relevant metadata, such as DRs and TIRs associated with each named element will be printed out to a summary file. Any features that are not assigned to an element (e.g., unaffiliated candidate captains) will not be printed out. Genes part of nested or overlapping elements will be assigned multiple elementIDs. Element regions are populated with gene features parsed from --gff. For reproducibility, this command will also assign names to all insertion sites as follows <genomeID><SEP><count> and will print out a new .named.stats file with these names.

Region Finder module

dereplicate

usage: starfish dereplicate [args]

situate element, site, and frag haplotypes in genomic regions using flanking orthologs.

Required:
-e, --element       FILE   MCL-formatted group file with element IDs.
                           (e.g., output by starfish group)
-t, --tagged        FILE   BED or GFF file of genes commonly found in elements
                           but not currently associated with any elements.
                           (e.g., non-captain tyrs)
-F, --feat          FILE   elements.feat file.
                           (output by starfish summarize)
-S, --stats         FILE   named.stats file.
                           (output by starfish summarize)
-O, --orthologs     FILE   MCL-formatted group file of all genes in all genomes.
-g, --gff           FILE   2 column tsv: genomeID, path to GFF.
-x, --prefix        STR    prefix for naming all output files.
-o, --outdir        DIR    output directory.

Required, with defaults:
-s, --separator     STR    character separating genomeID from featureID.
                           (default: '_')
-n, --nameField     STR    GFF attribute field where features are named.
                           (default: 'Name=')
--targetFeat        STR    feature to parse from GFF file
                           (default: mRNA)
-f, --flanking      INT    number of flanking OGs up- and downstream of
                           neighborhoods for determining region homology.
                           (default: 6)
-M, --mismatching   INT    max number of OG mismatches when grouping into regions.
                           (default: 1)
-d, --distance      INT    max length of up- and downstream sequence to search for
                           flanking OGs.
                           (in bp; default: 600000)
--emptyconstant     FLT    multiplier controlling max length of empty haplotypes
                           at a region with respect to ref element.
                           (default: 2.0)
--fragconstant      FLT    multiplier controlling max length of fragmented
                           haplotypes at a region with respect to ref element.
                           (default: 2.0)
--minrepeat         INT    min length of repeats to parse from --repeats.
                           (default: 250)
--mergecount        FLT    min fraction of haplotypes that must overlap between
                           2 regions to merge them.
                           (default: 0.5)
--mergelength       FLT    min fraction of bp overlap between two haplotypes to
                           consider them overlapping.
                           (default: 0.25)

Optional:
--repeats           FILE   2 column tsv: genome code, path to repeat GFF.
                           (repeats will be ignored when comparing flanks)
--maxfrag                  consider any empty haplotype with at least 
                           1 element-associated OG as fragmented.
--restrict                 do not use element-associated OGs for determining
                           region homology.
-h, --help                 print more details and exit.

This command enables the comparative analysis of elements, fragmented elements, and insertion sites across multiple individuals by grouping haplotypes into homologous genomic regions.

First, the nearest --flanking gene ortholog groups within --distance bp of each element's upstream and downstream boundaries are retrieved and sequentially compared against each other. Ortholog groups present in a greater number of individuals will be prioritized. Similar sets of upstream and downstream orthologs are then grouped into homologous genomic regions, allowing for up to --mismatching missing or mismatching ortholog groups. The set of examined genes and ortholog groups can be further filtered by providing a --repeats file (in which case any gene that overlaps with a repeat will be ignored) and/or by using the --restrict option (ignores any ortholog group that is present at least once in an element).

Each region is then used to search for other regions containing similar sets of gene ortholog groups. These new regions are labeled as either empty haplotypes, if the ortholog groups are all adjacent to each other, or fragmented haplotypes, if exactly one gap no longer than (--fragconstant * the longest element length) separates the sets of upstream and downstream ortholog groups. Note that because only genes assigned to ortholog groups in the --orthologs file are examined, two ortholog groups may be considered adjacent even though intervening genes may exists between them. As a sanity check, the maximum allowed adjacency distance between two ortholog groups for a given region is (--emptyconstant * the longest element length). To try and ensure that empty haplotypes do not represent fragmented elements, any empty haplotype containing at least 1 gene present in --tagged will be considered a fragmented haplotype. Using the --maxfrag option will add any ortholog group present at least once in an element to --tagged.

To ensure some level of mutual exclusivity between regions, all haplotypes from all regions are compared against each other to identify overlapping haplotypes, where haplotypes are considered overlapping if at least --mergelength fraction of at least one haplotype's bp coordinates overlap with the other. Regions are then merged together if at least --mergecount fraction of haplotypes within at least 1 region are overlapping.

Predicted insertion sites from the --stats file are cross-referenced against the list of empty and fragmented haplotypes and labeled as 'validated' if they overlap. A single reference element per homologous region will be chosen for each element family present in that region, prioritizing elements with flank boundaries and validated insertion sites. If a BED file with unaffiliated captain genes tagged as 'tyr' in the 5th field is provided, the coordinates of these genes will specifically be linked to particular fragmented haplotypes because these haplotypes may represent an unannotated or fragmented element.

Note that only genomes with genes present in --orthologs can be analyzed with this command. For best practices, --orthologs should be a file of low-copy number orthologs and it is recommended that users experiment with different --orthologs files and different --flanking and --mismatching values to ensure consistency across different parameter settings.

Auxiliary commands

sim

usage: starfish sim [args]

calculate k-mer similarity between captains or elements.
 
Required:
-m, --mode          STR    what features to compare, either 'cap' or 'element'.
-t, --type          STR    sequence type to compare, either 'nucl' or 'prot'.
-b, --bed           FILE   BED file with feature coordinates of mobile elements.
-x, --prefix        STR    prefix for naming all output files.
-o, --outdir        DIR    output directory.

Required, if --type nucl:
-a, --assembly      FILE   2 column tsv: genomeID, path to assembly FASTA.

Required, if --type prot:
-p, --protein       FILE   2 column tsv: genomeID, path to protein FASTA.

Required, with defaults:
--kmer              STR    k-mer size for sourmash.
                           (default: 510 if 'nucl', 17 if 'prot')
--scaled            STR    scaled parameter for sourmash.
                           (default: 100 if 'nucl', 20 if 'prot')

Optional:
-d, --dereplicated  FILE   dereplicated.txt file.
                           (output by starfish dereplicate)
-g, --group         FILE   MCL-formatted group file with elementIDs.
-F, --feat          FILE   elements.feat file.
                           (output by starfish summarize; if -m element)
-r, --restrict      FILE   file where each row contains tab-separated regionIDs.
-h, --help                 print more details and exit.

This command facilitates pairwise k-mer comparisons between captain sequences or full length element sequences using sourmash noabund. Comparisons can either be based on nucleotide or predicted amino acid sequences, but --mode element is restricted to nucleotide-based comparisons only. All possible comparisons will be printed out, even those with 0% similarity. Providing a --restrict file will restrict comparisons to only those regionIDs listed together on the same line

group

usage: starfish group [args]

sort captains or elements into homologous groups using mcl.

Required:
-s, --sim         FILE   3-column tsv: refID, queID, similarity.
                         (e.g., output by starfish sim)
-i, --idtag       STR    string used as a prefix to name groups.
-o, --outdir      DIR    output directory.

Required, with defaults:
-t, --thresh      FLT    min similarity threshold [0-1).
                         (default: 0)
-T, --threads     INT    number of threads.
                         (default: 1)
-I, --infl        FLT    inflation paramenter.
                         (default: 1.5)

Optional:
-f, --feat        FILE   .feat file.
                         (output by starfish summarize; if comparing elements)
-h, --help               print more details and exit.

This command sorts captains and elements into groups using the mcl algorithm and a user-provided --sim file with all pairwise similarities between entities. Higher inflation values mean more starting nodes for clustering, and thus will result in a more fine-grained analysis. As recommended by the mcl manual, if a minimum threshold is specified, each value x will be remapped to value y where y = x - threshold. When remapping, values below threshold are ignored. Entities not part of a multi-member group are automatically assigned the id prefix 'sng' for singleton. The prefix of the --sim file will automatically be parsed and used as a prefix to name all output files for good record keeping.

extend

usage: starfish extend [args]

extend downstream element boundary using BLASTn alignments.

Required:
-a, --assembly    FILE   2 column tsv with: genomeID, path to assembly FASTA.
-q, --query       FILE   BED file with feature coordinates of mobile elements.
                         (output by starfish summarize)
-t, --target      FILE   BED file with coordinates of candidate captains.
                         (output by starfish sketch)
-i, --idtag       STR    tag used to identify candidate captains in --target.
-g, --group       FILE   MCL-formatted group file with captain geneIDs.
-x, --prefix      STR    prefix for naming all output files.
-o, --outdir      DIR    output directory.

Required, with defaults:
--hsp             INT    min HSP length for BLASTn alignments.
                         (default: 5000)
--pid             INT    min percent identity of HSPs.
                         (default: 90)
--gap             INT    max gap between adjacent HSPs before extension terminates.
                         (in bp; default: 20000)
--max             INT    max length of extended elements.
                         (in bp; default: 1000000)
--min             INT    min length of extended elements.
                         (in bp; default: 15000)
--blastopts       STR    options to pass to BLASTn.
                         (default: '-task dc-megablast -evalue 0.0001
                         -max_target_seqs 1000000')
--upstream        INT    bp to begin aligning upstream of captains.
                         (default: 10000)

Optional:
-h, --help               print more details and exit

This command annotates genomic regions that are or most likely have been elements based on similarity to known elements. It uses blastn to search for elements and element fragments that could not be recovered using the Element Finder module. The universal principal of 'garbage in, garbage out' could not be more true for this command: please ensure that all elements in the provided --query file are high confidence elements or else you will propagate low quality element predictions throughout your dataset.

First, the flanking regions of all unaffiliated captain genes in --target will be progressively aligned to the flanking regions of known captains with either an 'insert' or 'flank' boundary in --query. Alignments begin --upstream bp from the 5' end of the captain-coding genes, and only captains part of the same homologous group in the --group file will be compared to each other. The script will progressively align HSPs starting from the target captain's 5' upstream and moving in the 3' downstream direction until a gap greater than --gap is encountered between adjacent HSPs. Captains nested within known elements are not aligned, even if present in --target: this often leads to false positive boundary estimates

augment

usage: starfish augment [args]

de novo gene annotation with metaeuk and an amino acid database.

Required:
-m, --mode        STR    either 'region' or 'contig'
-d, --db          FILE   mmseqs2 amino acid database of interest.
-a, --assembly    FILE   2 column tsv with: genomeID, path to assembly FASTA.
-x, --prefix      STR    prefix for naming all output files.
-i, --idtag       STR    string used as prefix for predicted gene featureIDs.
-o, --outdir      DIR    output directory.

Required, if --mode region:
-F, --annField    STR    fieldID in --bed ann column to add new ann to.
-b, --bed         FILE   BED file with neighborhoods or mobile elements.
                         (output by starfish sketch or summarize)

Required, with defaults:
-s, --separator   STR    character separating genomeID from featureID.
                         (default: '_')
-n, --nameField   STR    GFF attribute field where features are named.
                         (default: 'Name=')
--targetFeat      STR    feature to parse from GFF file
                         (default: mRNA)
-T, --threads     INT    number of threads to use for parallel processes.
                         (default: 6)                         
--metaeukopts     STR    options to pass to metaeuk easy-predict.
                         (default: see -h)

Optional, if --mode contig:
--bygenome               annotate all contigs from the same genome at once
--allcontigs             annotate all contigs at once

Optional:
--mappings        FILE   2 column tsv: database entry, annotation.
                         (entries must be found in --db)
--noCheck                skip format checking.
-f, --force              force all steps even if output already exists.
-h, --help               print more details and exit.

This command facilitates annotating genomic regions of interest (e.g., elements, neighbourhoods) with the metaeuk easy-predict workflow and a mmseqs2 profile database of interest. The coordinates of all new sequences will be compared against existing gene annotations in the --bed tile using bedtools, such that existing annotationIDs will be given priority and any new overlapping annotations will be ignored. Otherwise, if a new gene does not overlap at all with an old gene, the new genes will be assigned a de novo identifier on the fly: <genomeID><separator><idtag><count>.

cargo

usage: starfish cargo [args]

define element pangenomes and calculate cargo-based similarities between elements.

Required:
-b, --bed           FILE   BED file with feature coordinates of mobile elements.
-x, --prefix        STR    prefix for naming all output files.
-o, --outdir        DIR    output directory.

Required, with defaults:
-t, --thresh        LIST   thresholds for determining core, common,
                           and rare annotations.
                           (default: 0.2,0.9)

One of the following, for grouping genes together by annotations:
-f, --field         STR    annotation field in --bed for parsing gene annotations.
-a, --ann           FILE   2 column tsv: geneID, annotation.

One or more of the following, for defining sets of elements to compare:
-g, --group         FILE   MCL-formatted group file with element IDs.
-d, --dereplicated  FILE   dereplicated.txt file with all elements per region.
                           (output by starfish dereplicate)
-l, --list          STR    1 column list of elementIDs.

Optional:
-i, --idtag         STR    prefix of members of interest in --group.
                           (only members with this prefix will be parsed)
-h, --help                 print more details and exit.

This command identifies cargo genes of interest in various user-defined sets of mobile elements. Users first define which attributes should be used to identify cargo genes as part of the same anotation group (e.g., by an annotation field in the *.elements.bed file, or by a two column --ann file). Users then provide a list of two thresholds used to determine unique, rare, common and core cargo (e.g., a value of 0.2,0.9 means that core cargo is defined as being present in >= 0.9, common cargo >=0.2 and <0.9, rare cargo >1/N and <0.2, and unique cargo will always =1/N.

Sets of elements from which the pangenomes are derived may be defined multiple ways. If a --group file is provided, a file listing the cargo pangenome per element group will be printed. If a --dereplicated file is provided, a file listing the cargo pangenome per homologous genomic region will be printed. Finally, if a --list of elementIDs is provided, a file listing the cargo pangenome across all elements in the list will be printed. All cargo summaries are based on an annotation's presence/absence, and will not take copy number into account. For each set, a reference element representing the element with the most annotation will be printed to the summary file, along with a count of the total number of elements in the set.

Only cargo with annotations will be considered. If a gene has multiple annotations (e.g., in the case of a multi-domain protein), then all annotations will be considered individualy. However, each unique annotation will only be considered once for each gene. A Jaccard index is then calculated for all pairwise combinations of elements within each set based on overlap in their cargo gene annotations, and the results are printed to a similarity file for further analysis. All genes are used for the calculation.

format

usage: starfish format [args]

format fasta and gff files and databases (assumes genomeIDs assigned).

At least one of the following:
-a, --assembly    FILE   2 column tsv: genomeID, path to assembly FASTA.
-p, --protein     FILE   2 column tsv: genomeID, path to protein FASTA.
-g, --gff         FILE   2 column tsv: genomeID, path to GFF.

Required, with defaults:
-s, --separator   STR    character separating genomeID from featureID.
                         (default: '_')
-n, --nameField   STR    GFF attribute field where features are named.
                         (default: 'Name=')
--targetFeat      STR    feature to parse from GFF file
                         (default: mRNA)

Optional:
--check                  check file formatting and skip reformatting
--truncate               truncate sequence headers to max 50 chars
--gffdb           FILE   print concatenated gffs to FILE
--assemblydb      STR    create a blastn database with prefix STR
--proteindb       STR    create a blastp database with prefix STR
-h, --help               print more details and exit.

This command will format all sequence names in the supplied fasta and GFF3 files to conform to the formatting expectations of starfish: <genomeID><separator><featureID>. It will optionally truncate any sequence header to 50chars or less, because headers longer than this may lead to downstream errors in some of the dependencies. Headers in fasta files, and contigIDs and gene names in GFF3 files are first checked if they already conform to expectations, and are then formatted as necessary. If characters are already present in a sequence name, they will be removed. Only gene features will be printed out to the formatted gff. As a sanity check, the format of new files are validated after printing, so they need not be checked for proper formatting again. Reformatted files have the prefix of their parent files and are printed to their the parent file's directory, with the suffix: starfish_format.<file_suffix>

Each genomeID can only be present once per file provided to --fasta file, and multiple fasta path files may be provided (e.g., in the case where both protein fastas and assembly fastas need to be re-formatted).

Finally, this script will also remove any :, ; or | present in the sequence name, because : is a special character reserved for starfish annotate and ; and | will mess up gff parsing and system calls. If a fasta and gff files are provided, this script will check that each sequence in fasta has an entry in the gff, and vice versa and will print out warnings for sequences that are uniquely present in one file.

format-ncbi

usage: starfish format-ncbi [args]

format ncbi assemblies and assign genomeIDs.
 
Required:
-r, --report      FILE   an ncbi genome report file.
-a, --assemblies  FILE   1 column tsv: paths to assemblies.

Required, with defaults:
-s, --separator   STR    the character separating genomeID from featureID.
                         (default: '_')

Optional:
-c, --codes       FILE   1 column file listing existing genome codes.
-h, --help               print more details and exit.

This command will format assemblies downloaded directly from ncbi to conform to the formatting expectations of starfish: <genomeID><separator><featureID>. New genomeIDs are derived from the --report file and are formatted: <first three letters of genus><first three letters of species><count>. A --codes file with existing genome codes may be provided: any newly generated genome codes will not overlap with these, and for a given species with multiple codes, the command will attempt to continue counting from the genome code with the highest numberical count suffix.

Formatted files are printed to the directory where their unformatted parent file is located. This command will print each assemblies new genome code to STDOUT for record keeping. As a sanity check, the format of new files are validated after printing, so they need not be checked for proper formatting again.

an example of an ncbi genome report file can be found here: ftp:\/\/ftp.ncbi.nlm.nih.gov\/genomes\/GENOME_REPORTS

dereplicate-hood

usage: starfish dereplicate-hood [args]

situate neighborhoods in genomic regions using flanking orthologs.

Required:
-b, --bed           FILE   BED file of neighborhoods of interest.
                           (output by starfish sketch)
-O, --orthologs     FILE   MCL-formatted group file of all genes in all genomes.
-g, --gff           FILE   2 column tsv: genomeID, path to GFF.
-x, --prefix        STR    prefix for naming all output files.
-o, --outdir        DIR    output directory.

Required, with defaults:
-s, --separator     STR    character separating genomeID from featureID.
                           (default: '_')
-n, --nameField     STR    GFF attribute field where features are named.
                           (default: 'Name=')
--targetFeat        STR    feature to parse from GFF file
                           (default: mRNA)
-f, --flanking      INT    number of flanking OGs up- and downstream of
                           neighborhoods for determining region homology.
                           (default: 6)
-M, --mismatching   INT    max number of OG mismatches when grouping into regions.
                           (default: 1)
-d, --distance      INT    max length of up- and downstream sequence to search for
                           flanking OGs.
                           (in bp; default: 600000)
--emptyconstant     FLT    multiplier controlling max length of empty haplotypes
                           at a region with respect to ref neighborhood.
                           (default: 2.0)
--fragconstant      FLT    multiplier controlling max length of fragmented
                           haplotypes at a region with respect to ref neighborhood.
                           (default: 2.0)
--minrepeat         INT    min length of repeats to parse from --repeats.
                           (default: 250)
--mergecount        FLT    min fraction of haplotypes that must overlap between
                           2 regions to merge them.
                           (default: 0.5)
--mergelength       FLT    min fraction of bp overlap between two haplotypes to
                           consider them overlapping.
                           (default: 0.25)

Optional:
--groups            FILE   MCL-formatted group file with neighborhood IDs.
--repeats           FILE   2 column tsv: genome code, path to repeat GFF.
                           (repeats will be ignored when comparing flanks)
--maxfrag                  consider any empty haplotype with at least 
                           1 neighborhood-associated OG as fragmented.
--restrict                 do not use neighborhood-associated OGs for
                           determining region homology.
-h, --help                 print more details and exit.

This command is a derivative of starfish dereplicate that is adapted for any generic neighbourhood of interest. It takes a set of neighbourhoods and finds the corresponding homologous regions that lack that neighbourhood among the provided genomes (i.e., empty haplotypes). See the dereplicate command for more details.

Visualization commands

pair-viz

usage: starfish pair-viz [args]

execute element- element/site/region alignments, with optional circos visualization.

Required:
-m, --mode          STR    either 'all', 'regions', or 'pairwise'.
-t, --type          STR    subject sequence type, either 'element' or 'empty'.
                           (queries are always elements)
-A, --aligner       STR    alignment software to use, either 'mafft' or 'nucmer'.
-a, --assembly      FILE   2 column tsv: genomeID, path to assembly FASTA.
-b, --bed           FILE   BED file with feature coordinates of mobile elements.
                           (output by starfish summarize)
-o, --outdir        DIR    output directory.

Required, with defaults:
-T, --threads       INT    number of threads.
                           (default: 1)
-u, --upstream      INT    bp to align and visualize upstream.
                           (default: 100000)
-d, --downstream    INT    bp to align and visualize downstream.
                           (default: 100000)

Required, if --mode all/pairwise and --type empty:
-f, --flank         FILE   singleDR.stats file.
                           (output by starfish flank)
-S, --stats         FILE   named.stats file.
                           (output by starfish summarize)

Required, if --mode pairwise:
--que               STR    query elementID.
--sub               STR    subject elementID or empty site contigID.

Required, if --mode regions:
-r, --regions       FILE   regions.txt file.
                           (output by starfish dereplicate)

Required if --aligner mafft, with defaults:
--mafftopts         STR    options to pass to mafft.
                           (default: \'--auto\')

Required if --aligner nucmer, with defaults:
--nucmeropts        STR    options to pass to nucmer.
                           (default: \'--mum\')
--deltafilteropts   STR    options to pass to delta-filter.
                           (default: \'-m -l 1000 -i 90\')
--color1            HEX    hexcode of query element contig.
                           (without leading '#'; default: b3b1b2)
--color2            HEX    hexcode of subject element/empty site contig.
                           (without leading '#'; default: b3b1b2)
--color3            HEX    hexcode of links between subject and query.
                           (without leading '#'; default: b3b1b2)
--color4            HEX    hexcode of elements.
                           (without leading '#'; default: ff264e)

Optional:
--tags                     print gene tags in the BED file as gene labels.
--force                    force all steps even if output already exists.
--noali                    skip alignment step and keep unaligned fastas.
-h, --help                 print more details and exit.

This command facilitates alignment generation and visualization for assessing element annotation quality and fuelling exploratory analyses. It makes extensive use of the mafft and nucmer alignment softwares, as well as circos visualization package.

--mode all comparisons of --type empty visualize the alignment of an element against its reference empty insertion site (as annotated in --flank or --stats).

--mode all comparisons of --type element visualize a target element with extend boundaries against its query element

--mode regions visualize comparisons of element haplotypes-empty haplotypes for all element haplotypes in homologous genomic regions that also have at least 1 empty haplotype. For each homologous region group, the empty region on the longest contig will be selected to serve as the visualization subject (prioritizing those with validated insertion sites).

Any element may be aligned to any insertion site present in --stats using --mode pairwise. If you want to align multiple specific queries and subjects, subset --bed to contain only the query elements of interest and use --mode all

circos visualizations will only be conducted for nucmer alignments. Default mafft options correspond to einsi alignment method, which is recommended for sequences containing large unalignable regions

locus-viz

usage: starfish locus-viz [args]

execute alignments and generate synteny schematics with gggenomes.

Required:
-m, --mode          STR    either 'region', 'element' or 'schematic'.
-a, --assembly      FILE   2 column tsv: genomeID, path to assembly FASTA.
-b, --bed           FILE   BED file with feature coordinates of mobile elements.
-x, --prefix        STR    prefix for naming output files.
-o, --outdir        DIR    output directory.

Required, with defaults:
-T, --threads       INT    number of threads for alignment software.
                           (default: 1)
-U, --upstream      INT    bp to align and visualize upstream.
                           (default: 0)
-D, --downstream    INT    bp to align and visualize downstream.
                           (default: 0)
--units             STR    units of .png, either 'cm' or 'in'.
                           (default: in)

Required, if --mode region or element:
-A, --aligner       STR    aligner to use, either 'nucmer' or 'minimap2'.
--nucmeropts        STR    options to pass to nucmer.
                           (default: \'--mum\')
--deltafilteropts   STR    options to pass to delta-filter.
                           (default: \'-m -l 1000 -i 90\')
--minimapopts       STR    options to pass to minimap2.
                           (default: \'-x asm20 -N 0 -c\')

Required, if --mode region (all output by starfish dereplicate):
-r, --regions       FILE   regions.txt file.
-d, --dereplicated  FILE   dereplicated.txt file.
-j, --jaccard       FILE   haplotype_jaccard.sim file.

Required, if --mode element or schematic
-l, --list          FILE   1 column of elementID(s) to visualize.
                           (order will reflect figure if --mode element)

Required, if --mode schematic
-t, --ticks         INT    length in bp between tick marks scale.
                           (default: 25000)

Optional:
-g, --gff           FILE   2 column tsv: genome code, path to GFF.
-n, --nameField     STR    GFF attribute field where features are named.
                           (required if --gff; default: 'Name=')
--targetFeat        STR    feature to parse from GFF file
                           (required if --gff; default: mRNA)
--gc                FILE   BED file with GC content per contig.
                           (e.g., output by /utils/seq-gc.sh)
--tags              FILE   2 column tsv: geneID, gene tag.
                           (tags in --bed have priority)
--dedup                    only visualize non-redundant empty haplotypes.
                           (useful if visualizing many genomes)
--featureID                label all gene models with their featureID.
--regionOrder              arrange haplotypes as they appear in the regions file.
                           (for --mode region only)
--nocap                    parse all elements, even if they do not have a captain
--force                    force all steps even if output already exists.
-h, --help                 print more details and exit.

This command facilitates the generation of synteny plots using either nucmer or minimap2 and gggenomes. It accepts three different types of input data: --mode region-aligner for homologous genomic regions, --mode element-aligner for a custom list of elements, and --mode schematic for individual elements. A 'best guess' dimension for the .svg and .png output files will be determined based on the number of elements and element length. If this best guess results in a wonky figure, or if you want to change any aspect of the generated synteny plot post hoc, edit the .R script that is printed out and execute it directly from the commandline rather than running this command again.

Note: base R with the gggenomes package must be installed and accessible from commandline. Extensive benchmarking of minimap2 has revealed known issues regarding its chaining algorithm, which frequenctly results in missing alignments. We recommend using --aligner nucmer for this reason.

genome-viz

usage: starfish genome-viz [args]

visualize elements and sites in a genome with circos.

Required:
-m, --mode        STR    genomes to visualize, either 'all' or 'single'.
-a, --assembly    FILE   2 column tsv: genomeID, path to assembly FASTA.
-b, --bed         FILE   BED file with feature coordinates of mobile elements. 
-o, --outdir      DIR    output directory.

At most one of the following:
-S, --stats       FILE   named.stats file.
                         (output by starfish summarize)
-r, --regions     FILE   regions.txt file.
                         (output by starfish dereplicate)

Required, if --mode single:
-q, --que         STR    query genomeID.

Required, with defaults:
-s, --separator   STR    character separating genomeID from featureID.
                         (default: '_')
-l, --length      INT    min length of contigs to visualize, in bp.
                         (default: 50000)
-e, --empty       INT    max number of elementIDs to show a given insertion site.
                         (default: 3)
--color1          HEX    hexcode of genome contigs.
                         (without leading '#'; default: b3b1b2)
--color2          HEX    hexcode of insert and flank elements.
                         (without leading '#'; default: ff264e)
--color3          HEX    hexcode of extend elements.
                         (without leading '#'; default: ff264e)

Optional:
--force                  force all steps even if output already exists.
-h, --help               print more details and exit.

This script visualizes genome-wide elements and insertion sites or regions on a genome-by-genome basis using circos. --mode all will visualize each genome in the --bed file, while --mode single will visualize a single query genome in the --bed file. Elements present in the focal genome are colored red. All genomic regions with at least one element in at least one genome or all insertion sites are labeled around the perimeter of the plot.