
Step by step tutorial

Emile Gluck-Thaler edited this page Feb 19, 2025 · 45 revisions

Table of Contents

  1. Overview
  2. Prepare workspace
  3. Gene finder module
  4. Element finder module
  5. Region finder module

Overview

starfish is organized into three main modules: Gene finder, Element finder, and Region finder. Each has dedicated commands that are typically run sequentially. Auxiliary commands that provide additional utilities and generate visualizations are also available through the command line. Several useful stand-alone scripts are located in the $STARFISHDIR/aux/ directory.

We have provided example data from Gluck-Thaler et al. 2022 to illustrate a step-by-step analysis. These data consist of 6 Starship Voyager (belonging to the Voyager-family) and 1 Starship Defiant (belonging to the Prometheus-family) insertions found in 6 Macrophomina phaseolina genomes. To complete the tutorial, you must have installed starfish on your machine. The tutorial takes ~25 min with 2 processors. Many commands produce checkpoint files that are useful for restarting an interrupted analysis. Simply remove these .checkpoint files to start an analysis from scratch.
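As a minimal sketch of restarting from scratch, the snippet below previews and then deletes checkpoint files under a working directory. It assumes the files end in .checkpoint, as described above; the demo_run directory and the file names inside it are placeholders created only for illustration.

```shell
# Toy directory standing in for a starfish output directory (hypothetical names)
workdir=$(mktemp -d)/demo_run
mkdir -p "$workdir"
touch "$workdir/step1.checkpoint" "$workdir/step2.checkpoint" "$workdir/results.bed"

# Preview which checkpoint files would be removed
find "$workdir" -type f -name '*.checkpoint' -print

# Remove them so the next run starts from scratch; other outputs are untouched
find "$workdir" -type f -name '*.checkpoint' -delete

remaining=$(find "$workdir" -type f | wc -l)
echo "files left: $remaining"
```

Running the preview first is a cheap way to confirm that only checkpoint files match before anything is deleted.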

If you want to apply the steps in this tutorial to your own data, you might have to modify some of the commands to take into account the names of your files. For example, the following code only works for gff3 files that end in .final.gff3:

realpath gff3/* | perl -pe 's/^(.+?([^\/]+?).final.gff3)$/\2\t\1/' > ome2gff.txt

Prepare workspace

activate the starfish conda environment:

conda activate starfish

The location of several files and auxiliary scripts used for a starfish analysis will depend on how you installed the program. We need to know the parent directory of the directory where the main starfish executable is stored in order to find these files. Assuming the starfish command is in your PATH (which it would be if, for example, you installed starfish using conda), let's store this directory in a variable:

STARFISHDIR=$(dirname -- $(command -v -- starfish))/../
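Since later steps read from $STARFISHDIR/aux/ and $STARFISHDIR/db/, it can be worth confirming that the variable points where you expect. The helper below is a hypothetical sketch (check_starfishdir is not a starfish command), and it is demonstrated on a toy directory layout so that it runs without an installation.

```shell
# Report whether a candidate STARFISHDIR contains the aux/ and db/ directories
check_starfishdir() {
    dir=$1
    for sub in aux db; do
        if [ ! -d "$dir/$sub" ]; then
            echo "missing: $dir/$sub" >&2
            return 1
        fi
    done
    echo "ok: $dir"
}

# Toy layout for illustration only
demo=$(mktemp -d)
mkdir "$demo/aux" "$demo/db"
check_starfishdir "$demo"
```

In a real session you would call check_starfishdir "$STARFISHDIR" right after setting the variable.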

copy the starfish example data directory to a new test directory (examples/ can be downloaded from the github repo):

cp -r examples/ test/
cd test/

create ome2*.txt files detailing the absolute path to each genome's gff3 and assembly. These serve as control files used by multiple commands to find input data:

realpath assembly/* | perl -pe 's/^(.+?([^\/]+?).fasta)$/\2\t\1/' > ome2assembly.txt
realpath gff3/* | perl -pe 's/^(.+?([^\/]+?).final.gff3)$/\2\t\1/' > ome2gff.txt
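Because several commands rely on these control files, a quick sanity check can save time. The following is a sketch under the assumption that each line should hold exactly two tab-separated columns (genome name, absolute path); check_control_file is a hypothetical helper and the toy files exist only for the demonstration.

```shell
check_control_file() {
    # Fail if any line does not have exactly two tab-separated columns
    awk -F'\t' 'NF != 2 { print "bad line " NR ": " $0; bad = 1 } END { exit bad }' "$1" || return 1
    # Fail if any path in column 2 does not exist
    cut -f2 "$1" | while IFS= read -r path; do
        [ -e "$path" ] || { echo "missing file: $path"; exit 1; }
    done
}

# Toy control file for illustration only
demo=$(mktemp -d)
touch "$demo/genomeA.fasta"
printf 'genomeA\t%s\n' "$demo/genomeA.fasta" > "$demo/ome2assembly.txt"
check_control_file "$demo/ome2assembly.txt" && echo "control file looks fine"
```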

concatenate all gff3 files into a single file (this concatenated file will be useful for consolidating de novo predicted genes with existing gene annotations, if desired):

cat gff3/*.gff3 > macpha6.gff3

concatenate all assembly files and make a blastn database for easy sequence searching:

mkdir blastdb
cut -f2 ome2assembly.txt | xargs cat > blastdb/macpha6.assemblies.fna
makeblastdb -in blastdb/macpha6.assemblies.fna -out blastdb/macpha6.assemblies -parse_seqids -dbtype nucl

Note that if you are annotating Starships on a per-genome level, the focal genome must be included in the blast database

calculate %GC content across all genomes (useful for visualizing elements later):

$STARFISHDIR/aux/seq-gc.sh -Nbw 1000 blastdb/macpha6.assemblies.fna > macpha6.assemblies.gcContent_w1000.bed
rm blastdb/macpha6.assemblies.fna

parse the provided eggnog mapper annotations (only works for old emapper data!! the format of the output file has changed in more recent emapper versions, so you will have to modify this command for new emapper data, see below):

cut -f1,12  ann/*emapper.annotations | grep -v  '#' | grep -v -P '\t-' | perl -pe 's/\t/\tEMAP\t/' | grep -vP '\tNA' > ann/macph6.gene2emap.txt

retrieve the narrowest eggnog ortholog group per sequence and convert to mcl format (useful for quickly classifying genes into orthogroups):

cut -f1,10 ann/*emapper.annotations | grep -v '#' | perl -pe 's/^([^\s]+?)\t([^\|]+).+$/\1\t\2/' > ann/macph6.gene2og.txt

important! Note that the above command only works with older emapper annotation output, such as the files provided as part of this tutorial. In order to parse the output from more recent versions of emapper, you must use the following command to retrieve the narrowest eggnog ortholog group per sequence:

cut -f1,5 ann/*emapper.annotations | grep -v '#' | perl -pe 's/(^.+?)\t.+,([^,]+)$/\1\t\2/' | perl -pe 's/@/\t/' > ann/macph6.gene2og.txt

now, convert to .mcl format (will produce the file macph6.gene2og.mcl, helps with downstream text parsing):

$STARFISHDIR/aux/geneOG2mclFormat.pl -i ann/macph6.gene2og.txt -o ann/

important! we are using precalculated emapper ortholog groups here purely out of convenience. If you have or plan to generate higher quality ortholog groups (e.g., through Orthofinder or Pangloss), we seriously recommend you use those instead of emapper ortholog groups. The output of these programs should already be in the required mcl tsv format, where the first column is the ortholog group ID, and each subsequent column contains a member of that ortholog group.
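To make the mcl tsv layout concrete, here is a small awk sketch that turns a two-column gene-to-orthogroup table into one line per ortholog group. It is not the geneOG2mclFormat.pl script and may differ from it in details such as ordering; the gene and group names are made up.

```shell
demo=$(mktemp -d)

# Toy gene2og table: gene ID, then ortholog group ID
printf 'geneA\tOG1\ngeneB\tOG2\ngeneC\tOG1\n' > "$demo/toy.gene2og.txt"

# Collect the members of each group, then print: groupID <TAB> member1 <TAB> member2 ...
awk -F'\t' '
    { if ($2 in members) members[$2] = members[$2] "\t" $1; else members[$2] = $1 }
    END { for (og in members) print og "\t" members[og] }
' "$demo/toy.gene2og.txt" | sort > "$demo/toy.gene2og.mcl"

cat "$demo/toy.gene2og.mcl"
```

Here OG1 ends up on a single line with geneA and geneC as its members, and OG2 on a line with geneB.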

Gene finder module

we begin by de novo annotating all tyrosine recombinases (tyrs/YRs) in the provided assemblies. In practice, we can de novo annotate any gene we want, as long as we have an HMM file of a predicted domain within that gene (used to validate the predicted sequences) and a multifasta of amino acid sequences of that gene (used to actually de novo annotate sequences).

first, create a dedicated directory for good housekeeping:

mkdir geneFinder

de novo annotate tyrs with the provided YR HMM and amino acid queries (~10min):

starfish annotate -T 2 -x macpha6_tyr -a ome2assembly.txt -g ome2gff.txt -p $STARFISHDIR/db/YRsuperfams.p1-512.hmm -P $STARFISHDIR/db/YRsuperfamRefs.faa -i tyr -o geneFinder/

NB: If annotating multiple genes in addition to tyrs, do NOT run this command for different genes using the same output directory. Unique output directories should be used for each gene if annotating multiple genes! Failure to do so will result in unexpected errors.

you should observe the following printed output:

found 1 new tyr genes and 9 tyr genes that overlap with 11 existing genes

all of the final output files associated with the newly predicted genes will have the suffix filt_intersect if a gff of existing gene annotations was provided and if new genes were de novo annotated. If a gff was not provided or if no new genes were de novo annotated, the final output files will have the suffix filt. If .filt_intersect files were output, use those files moving forward; otherwise, use the .filt files.
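If you script the pipeline, the choice between the two suffixes can be made automatically. The sketch below assumes the gff output is named <prefix>.filt_intersect.gff when it exists; pick_suffix is a hypothetical helper, shown on a toy file.

```shell
pick_suffix() {
    # $1: path prefix shared by the annotate output files, e.g. geneFinder/macpha6_tyr
    if [ -e "$1.filt_intersect.gff" ]; then
        echo filt_intersect
    else
        echo filt
    fi
}

# Toy output file for illustration only
demo=$(mktemp -d)
touch "$demo/macpha6_tyr.filt_intersect.gff"
suffix=$(pick_suffix "$demo/macpha6_tyr")
echo "use files ending in .$suffix.*"
```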

Now consolidate the newly predicted gene coordinates with the existing concatenated gff3:

starfish consolidate -o ./ -g macpha6.gff3 -G geneFinder/macpha6_tyr.filt_intersect.gff

create a .txt file with the path to the new consolidated gff file (to be used as the new control file for specifying gff data):

realpath macpha6_tyr.filt_intersect.consolidated.gff | perl -pe 's/^/macpha6\t/' > ome2consolidatedGFF.txt

organize tyrs into mutually exclusive neighborhoods separated by at least 10kb to avoid adjacent tyrs messing up subsequent analyses. If multiple tyrs are found in close proximity (as can happen with fragmented annotations), it becomes difficult to identify which tyr might be the captain:

starfish sketch -m 10000 -q geneFinder/macpha6_tyr.filt_intersect.ids -g ome2consolidatedGFF.txt -i s -x macpha6 -o geneFinder/

you should observe the following printed output:

found 10 neighborhoods containing input query genes

neighborhoods will often contain intervening genes located between genes of interest so pull out the coordinates of candidate captains only:

grep -P '\ttyr\t' geneFinder/macpha6.bed > geneFinder/macpha6.tyr.bed 

Key output files

  • the *.filt_intersect.* files output by starfish annotate contain all newly predicted genes and amino acid sequences, if a gff of existing gene annotations was provided as in this example (otherwise, look for files with the suffix *.filt.*). Newly predicted genes that overlap with an existing gene will keep their original sequenceID (or, if overlapping with multiple existing genes, will be assigned a new sequenceID consisting of the concatenated existing gene IDs separated by ':'), but will be assigned the newly predicted amino acid sequence in the *.filt_intersect.fas file. Newly predicted genes that don't overlap with an existing gene will be assigned a new sequenceID that is generated on the fly and that incorporates the string specified by the argument --idtag.

  • the .bed file output from starfish sketch contains the coordinates of all genes of interest organized into neighborhoods. At this point, it should only contain the coordinates of tyr genes. This bed file will be used as the starting point for the next module.

Element finder module

now we can move on to annotating mobile elements associated with the candidate captain tyrs. In order to be found, elements must have the same basic architecture as a fungal Starship or bacterial integrative and conjugative element: a captain gene (tyrs in this example, but could be something else) with zero or more cargo genes downstream of its 3' end. If the element has cargo genes located upstream of its 5' end, they will NOT be annotated (this decision was made because it helps filter out false positive hits).

create a dedicated directory for good housekeeping:

mkdir elementFinder

search for elements using the coordinates of the predicted tyrs as starting points for a BLAST-based search. The upstream sequence associated with each tyr will be searched against all genomes in the provided BLAST database (takes ~1min for this dataset but can take significantly longer depending on the number of captains and the size of the genome database; see Figure S1 in the starfish publication for more details):

starfish insert -T 2 -a ome2assembly.txt -d blastdb/macpha6.assemblies -b geneFinder/macpha6.tyr.bed -i tyr -x macpha6 -o elementFinder/

you should observe the following printed output:

found element boundaries and insertion sites for 7 tyr captains out of 10 input captains

an insert.bed file containing the boundaries of candidate elements will be output, along with a .stats file that contains data on the insertion sites associated with each boundary prediction. These are intermediate files that you do not have to look at, but they are good to have for record keeping and are used for some downstream analyses. Since the upstream sequences of tyrs are searched against multiple genomes, there will likely be multiple candidate boundaries and insertion sites per tyr, many of which will be filtered out at later steps.

although we haven't tested it for this tutorial, you can iteratively use the output of starfish insert as the input for another execution of starfish insert using different parameters. In subsequent executions of starfish insert, only candidate tyrs that do NOT already have a predicted boundary will be searched with the new parameters. This enables you to explore parameter space and maximize the number of boundaries found for each tyr. We recommend modifying the --pid parameter, which controls the minimum percent identity of alignments in blastn and nucmer; the --hsp parameter, which controls the minimum alignment length for blastn and nucmer; and the --upstream and --downstream parameters, which control the distances to boundary edges from BLAST hits. For example, if you were comparing more distantly related species, you might consider decreasing the --pid and --hsp values. Increasing --upstream and --downstream might be well suited for recovering elements from hard-to-align regions that do not have clean insertion sites.

NB: the insert.bed files produced by iterative rounds of starfish insert beyond the first have metadata for both the newly predicted elements and those predicted in previous rounds, but the insert.stats file only lists the newly predicted element boundaries (i.e., those from round 2, 3, etc.). The insert.stats entries from previous rounds should be manually copied and pasted to the end of the new insert.stats files. A permanent fix is in the works.

starfish insert -T 2 -a ome2assembly.txt -d blastdb/macpha6.assemblies -b elementFinder/macpha6.insert.bed -i tyr -x macpha6_round2 -o elementFinder/ --pid 80 --hsp 750
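The manual copy-and-paste of earlier insert.stats entries can also be done from the shell. This is a sketch on toy files: the round-2 file name is inferred from the -x macpha6_round2 prefix, and the file contents are placeholders. Whether the stats files carry header lines is not stated here, so inspect them before appending.

```shell
demo=$(mktemp -d)

# Placeholder contents standing in for real insert.stats entries
printf 'entry_from_round1_a\nentry_from_round1_b\n' > "$demo/macpha6.insert.stats"
printf 'entry_from_round2_a\n' > "$demo/macpha6_round2.insert.stats"

# Keep a backup, then append the round-1 entries to the end of the round-2 file
cp "$demo/macpha6_round2.insert.stats" "$demo/macpha6_round2.insert.stats.bak"
cat "$demo/macpha6.insert.stats" >> "$demo/macpha6_round2.insert.stats"

# Number of entries now present in the merged file
wc -l < "$demo/macpha6_round2.insert.stats"
```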

now we use starfish summarize to consolidate the output from starfish insert. A single 'reference' boundary, based on maximizing the length of flanking sequence alignments between the element and the empty site and on minimizing the distance between the boundary and the captain start site, will be selected for each captain such that each captain is associated with at most one pair of boundaries that now define the full length element. A final .bed file containing these elements, along with any annotated genes that fall within their boundaries, will be printed out (if gffs are provided). Note that the boundaries of the elements are identified as distinct features in the .bed file whose feature IDs follow custom naming conventions. If they are a boundary that has support from starfish insert, their feature ID will be written as <associated captain ID>|<associated contig ID with empty site>|<upstream / downstream>, and their attributes field will contain the fields SEQ, whose value is the sequence of the putative empty site, and ALIGNED, whose value is the total alignment length of flanking sequences in the element vs the empty site associated with this particular boundary. Each insertion site previously identified using starfish insert will be assigned a unique ID and this information will be printed to a new .stats file. A .feat file containing useful element metadata will be output, along with a .fna file containing the nucleotide sequences of the elements. Note that the headers in the .fna file will have the strand orientation of the element appended to the element ID.

Rarely, two captains may be located opposite each other such that they are each associated with an element that completely overlaps the other, e.g., (captain>---cargo---<captain). This results in two completely overlapping elements, and without further information, it is impossible to tell which of the captains is the 'true' captain. Warnings of overlapping elements will be printed to STDERR and it's up to the user to manually inspect these elements and choose a single captain. Manually annotating the direct repeats can sometimes help in determining which is the true captain.

starfish summarize -a ome2assembly.txt -b elementFinder/macpha6.insert.bed -x macpha6 -o elementFinder/ -S elementFinder/macpha6.insert.stats -g ome2consolidatedGFF.txt -A ann/macph6.gene2emap.txt -t geneFinder/macpha6_tyr.filt_intersect.ids 

If you want to add colors to specific features of interest in the BED file for visualization purposes with genome browsers, you can use the following awk one-liner:

awk '{
    if ($4 ~ /DR/) print $0 "\t255,255,0";        # Yellow for 'DR' in column 4
    else if ($4 ~ /TIR/) print $0 "\t255,165,0";  # Orange for 'TIR' in column 4
    else if ($5 ~ /tyr|cap/) print $0 "\t255,0,0"; # Red for 'tyr' or 'cap' in column 5
    else if ($5 != ".") print $0 "\t128,0,128";   # Purple if column 5 is not '.'
    else print $0 "\t169,169,169";                # Dark gray otherwise
}' macpha6.elements.bed > macpha6.elements.color.bed

once elements have been predicted, we can begin assigning them to families, naves and haplotypes. Starships are currently classified into 11 phylogenetic families that are named after the first element in that family to be described (Gluck-Thaler and Vogan 2024). Membership in a family is determined based on similarity to HMM profiles of YR reference sequences: YRsuperfams.p1-512.hmm. Each family is further subdivided into element “naves” (“navis”, singular; latin for “ship”) based on orthology relationships between captain YRs. Naves are typically calculated on the fly using either Orthofinder or mmseqs easy-cluster. Elements from the same navis typically have the same target site, although they may carry different cargo. It is therefore useful to further group Starships into 'haplotypes' based on pairwise k-mer or sequence-based similarities between their cargo sequences (see starfish sim and starfish group). This lets you distinguish between elements piloted by closely related captains but that carry different cargo. Here, we use a combination of captain naves and cargo haplotypes to group elements by "navis-haplotype" combinations that represent distinct types of Starships.

first, assign all Starships to a family by searching the candidate captain sequences against the reference captain HMM profile database and identifying the best profile hit to each sequence:

hmmsearch --noali --notextw -E 0.001 --max --cpu 12 --tblout elementFinder/macpha6_tyr_vs_YRsuperfams.out $STARFISHDIR/db/YRsuperfams.p1-512.hmm geneFinder/macpha6_tyr.filt_intersect.fas
perl -p -e 's/ +/\t/g' elementFinder/macpha6_tyr_vs_YRsuperfams.out | cut -f1,3,5 | grep -v '#' | sort -k3,3g | awk '!x[$1]++' > elementFinder/macpha6_tyr_vs_YRsuperfams_besthits.txt
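The second command keeps one line per sequence: after sorting by E-value in ascending order, awk prints only the first line seen for each sequence ID. The toy example below isolates that sort-then-deduplicate idiom; the IDs, profile names, and E-values are invented for illustration.

```shell
demo=$(mktemp -d)

# Toy table: sequence ID, profile name, E-value (values made up)
printf 'tyr1\tVoyager\t1e-50\ntyr1\tPrometheus\t1e-10\ntyr2\tPrometheus\t1e-80\n' > "$demo/hits.txt"

# sort -g understands scientific notation; awk keeps the first (lowest E-value) line per ID
sort -k3,3g "$demo/hits.txt" | awk '!seen[$1]++' > "$demo/besthits.txt"

cat "$demo/besthits.txt"
```

In this toy table tyr1 is assigned to the profile with E-value 1e-50 rather than the one with 1e-10.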

replace all captain IDs with starship IDs to simplify downstream parsing:

grep -P '\tcap\t' elementFinder/macpha6.elements.bed | cut -f4,7 > elementFinder/macpha6.cap2ship.txt
$STARFISHDIR/aux/searchReplace.pl --strict -i elementFinder/macpha6_tyr_vs_YRsuperfams_besthits.txt -r elementFinder/macpha6.cap2ship.txt > elementFinder/macpha6_elements_vs_YRsuperfams_besthits.txt

now group all Starships into naves using mmseqs2 easy-cluster on the captain sequences with a very permissive 50% identity / 25% coverage threshold:

mmseqs easy-cluster geneFinder/macpha6_tyr.filt_intersect.fas elementFinder/macpha6_tyr elementFinder/ --threads 2 --min-seq-id 0.5 -c 0.25 --alignment-mode 3 --cov-mode 0 --cluster-reassign
$STARFISHDIR/aux/mmseqs2mclFormat.pl -i elementFinder/macpha6_tyr_cluster.tsv -g navis -o elementFinder/

use sourmash and mcl to group all elements into haplotypes based on pairwise k-mer similarities of element nucleotide sequences:

starfish sim -m element -t nucl -b elementFinder/macpha6.elements.bed -x macpha6 -o elementFinder/ -a ome2assembly.txt
starfish group -m mcl -s elementFinder/macpha6.element.nucl.sim -i hap -o elementFinder/ -t 0.05

replace captain IDs with starship IDs in the naves file:

$STARFISHDIR/aux/searchReplace.pl -i elementFinder/macpha6_tyr_cluster.mcl -r elementFinder/macpha6.cap2ship.txt > elementFinder/macpha6.element_cluster.mcl

merge navis with haplotype to create a navis-haplotype label for each Starship:

$STARFISHDIR/aux/mergeGroupfiles.pl -t elementFinder/macpha6.element_cluster.mcl -q elementFinder/macpha6.element.nucl.I1.5.mcl > elementFinder/macpha6.element.navis-hap.mcl

convert mcl to gene2og format to simplify downstream parsing:

awk '{ for (i = 2; i <= NF; i++) print $i"\t"$1 }' elementFinder/macpha6.element.navis-hap.mcl > elementFinder/macpha6.element.navis-hap.txt

now add the family and navis-haplotype info to the element.feat file to consolidate metadata:

join -t$'\t' -1 1 -2 2 <(sort -t$'\t' -k1,1 elementFinder/macpha6.element.navis-hap.txt | grep -P '_e|_s') <(sort -t$'\t' -k2,2 elementFinder/macpha6.elements.feat) | awk -F'\t' '{print}' > elementFinder/macpha6.elements.temp.feat
echo -e "#elementID\tfamilyID\tnavisHapID\tcontigID\tcaptainID\telementBegin\telementEnd\telementLength\tstrand\tboundaryType\temptySiteID\temptyContig\temptyBegin\temptyEnd\temptySeq\tupDR\tdownDR\tDRedit\tupTIR\tdownTIR\tTIRedit\tnestedInside\tcontainNested" > elementFinder/macpha6.elements.ann.feat
join -t$'\t' -1 1 -2 1 <(sort -t$'\t' -k1,1 elementFinder/macpha6_elements_vs_YRsuperfams_besthits.txt | grep -P '_e|_s' | cut -f1,2) <(sort -t$'\t' -k1,1 elementFinder/macpha6.elements.temp.feat) | awk -F'\t' '{print}' >> elementFinder/macpha6.elements.ann.feat

A note on quality control: it is strongly recommended to visually inspect the alignments of each element against its 'reference' insertion site to manually filter out false positives. In our experience, false positives typically occur when an element is inserted into another transposable element (which makes it difficult to recover a 'true' insertion site) or when the empty site associated with an element is located on a very short contig, in which case we are less confident about whether it is truly homologous to the flanking regions of the element. Any low confidence elements identified through manual inspection can be removed from further analysis.

Examples of false positives:

A beautiful example of a true positive:

we have simplified the inspection process to enable users to use circos to visualize nucmer alignments of elements and their flanking regions against their empty sites (takes ~2min):

mkdir pairViz
starfish pair-viz -m all -t empty -T 2 -A nucmer -a ome2assembly.txt -b elementFinder/macpha6.elements.bed -S elementFinder/macpha6.elements.named.stats -o pairViz/

Summary of key output files

  • *.insert.bed contains coordinates of all predicted element boundaries based on all candidate insertions.
  • *.insert.stats contains useful metadata on candidate insertions.
  • *.elements.bed contains all captain, boundary, and gene features of predicted elements.
  • *.elements.feat contains element metadata.
  • *.elements.fna contains element sequences.
  • *.elements.named.stats is an updated version of the insert.stats file that contains named insertion sites.

Region finder module

The final step of a starfish analysis involves situating elements and insertion sites into genomic regions so we know which insertions are shared across individuals or not. This allows us not only to genotype specific element insertions across multiple individuals in a unified fashion, but also to identify different copies of a given element that is present across multiple sites within a population. In this procedure, we leverage two types of orthologous relationships: navis-haplotype relationships between elements (which allows us to genotype the presence/absence of specific elements at specific sites) and ortholog relationships between all genes across all genomes (which allows us to identify homologous genomic regions). For a given region, an individual will automatically be genotyped as an “empty” haplotype, if orthogroups from the upstream element flank are adjacent to orthogroups from the downstream element flank; as a “fragmented” haplotype, if it contains Starship-associated orthogroups; or as an "element" haplotype, if it contains at least 1 element insertion. If a specific region can't be found in a given individual, that individual will not receive a genotype.

We use ortholog information from the eggnog mapper analysis to group all genes into ortholog groups since this is just a quick analysis. We would otherwise recommend running a more comprehensive analysis, like Orthofinder, mmseqs, or Pangloss because not all genes will be assigned to an emapper orthogroup and emapper orthogroups typically do not provide fine-grained information.

create a dedicated directory for good housekeeping:

mkdir regionFinder

create a file with tyrs that are not found in any elements (this will let us assign them to fragmented haplotypes in the dereplicate analysis, which can be helpful if annotating 'dead' or 'derelict' element copies):

grep -f <(comm -23 <(cut -f1 geneFinder/macpha6_tyr.filt_intersect.ids | sort) <(grep -P '\tcap\t|\ttyr\t' elementFinder/macpha6.elements.bed | cut -f4| sort)) geneFinder/macpha6.tyr.bed > regionFinder/unaffiliated_tyrs.bed
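The core of this command is comm -23, which prints lines unique to the first of two sorted lists. The toy example below isolates that step with made-up IDs and a hypothetical four-column bed layout. It also uses grep -w -F, an optional variation that avoids partial ID matches (for example, tyr1 matching tyr10); whether your IDs need this depends on their naming.

```shell
demo=$(mktemp -d)

# All candidate tyr IDs, and the subset already found inside elements (toy data)
printf 'tyr1\ntyr10\ntyr3\n' | sort > "$demo/all_tyrs.txt"
printf 'tyr10\n' | sort > "$demo/tyrs_in_elements.txt"

# IDs present in the first list but absent from the second
comm -23 "$demo/all_tyrs.txt" "$demo/tyrs_in_elements.txt" > "$demo/unaffiliated_ids.txt"

# Toy bed file; -F treats IDs as fixed strings and -w requires whole-word matches
printf 'contig1\t100\t200\ttyr1\ncontig1\t500\t900\ttyr10\ncontig2\t10\t80\ttyr3\n' > "$demo/toy.tyr.bed"
grep -w -F -f "$demo/unaffiliated_ids.txt" "$demo/toy.tyr.bed" > "$demo/unaffiliated_tyrs.bed"

cat "$demo/unaffiliated_tyrs.bed"
```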

you can increase confidence in the determination of region homology by only looking at gene orthogroups with low copy numbers (controlled by -c) that are missing from few genomes (controlled by -a). This ensures that the orthogroups used to group regions into homologous regions are more unique, which improves our ability to identify homology between regions:

$STARFISHDIR/aux/filterOG.pl -O ann/macph6.gene2og.mcl -a 1 -c 5 -o ann/

it is more useful to play around with copy number thresholds than with genome absence thresholds because starfish dereplicate will automatically filter orthogroups to retain those that are present in a greater number of individuals.

now, dereplicate your data to identify independently segregating element insertions:

starfish dereplicate -e elementFinder/macpha6.element.navis-hap.mcl -t regionFinder/unaffiliated_tyrs.bed -F elementFinder/macpha6.elements.feat -S elementFinder/macpha6.elements.named.stats -O ann/macph6.gene2og.a1.c5.txt -g ome2gff.txt -x macpha6 -o regionFinder/ --flanking 3 --mismatching 1

the --flanking option controls how many ortholog groups on each flanking side of an element insertion are used to determine region homology: the higher this value, the more strict the criteria for determining homology. the --mismatching option controls how many mismatches are permitted when grouping flanking regions between individuals into the same homologous genomic region: the lower this value, the more strict the criteria for determining homology. We would normally recommend going with the default --flanking 6 and --mismatching 1; however, because the Defiant insertion in this example dataset is in a gene-sparse region, it can only be recovered with --flanking 3 since fewer than 6 ortholog groups are present in the flanking regions around its insertion.

a lot of information will be printed to STDOUT, but somewhere in there you should see:

found 5 regions with at least 1 cross-referenced element-insertion site pair

element haplotypes consist of predicted mobile element sequences. Empty haplotypes consist of a contiguous sequence formed by the flanking regions of an element. Fragmented haplotypes consist of a non-empty sequence flanked by the flanking regions of an element but missing a predicted element. Coordinates of predicted insertion sites from the Element Finder module are cross-referenced with empty and fragmented haplotypes and are considered to be 'verified' if their coordinates overlap, since they now have two independent lines of evidence supporting their designation as a bona fide insertion site: one from the Element Finder module (determined by sequence searching) and one from the Region Finder module (determined by ortholog group matching).

one of the best pieces of evidence that a starship is real is if you find the same navis-haplotype in at least two different regions because this demonstrates that the navis-haplotype is a single transposing unit. Count the number of regions each navis-haplotype is found in:

grep -v '#' regionFinder/macpha6.fog3.d600000.m1.dereplicated.txt | cut -f2 | sort | uniq -c | perl -pe 's/ +//' | sort -k1,1nr
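To list only the navis-haplotypes found in at least two regions, the same pipeline can be followed by a small awk filter. The sketch below runs on a toy table whose header and IDs are invented; the only property taken from the command above is that column 2 holds the navis-haplotype.

```shell
demo=$(mktemp -d)

# Toy table: column 2 holds the navis-haplotype, as in the cut -f2 call above
printf '#regionID\tnavisHapID\nregion1\tnavis01-hap01\nregion2\tnavis01-hap01\nregion3\tnavis02-hap02\n' \
  > "$demo/toy.dereplicated.txt"

# Count regions per navis-haplotype and keep those with a count of 2 or more
grep -v '#' "$demo/toy.dereplicated.txt" | cut -f2 | sort | uniq -c \
  | awk '$1 >= 2 { print $2 }' > "$demo/multi_region.txt"

cat "$demo/multi_region.txt"
```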

as before, it is strongly recommended to look at haplotype alignments within each region to manually filter out false positives. We have streamlined the inspection process by using gggenomes to visualize nucmer alignments (takes ~5min):

mkdir locusViz
starfish locus-viz -T 2 -m region -a ome2assembly.txt -b elementFinder/macpha6.elements.bed -x macpha6 -o locusViz/ -A nucmer -r regionFinder/macpha6.fog3.d600000.m1.regions.txt -d regionFinder/macpha6.fog3.d600000.m1.dereplicated.txt -j regionFinder/macpha6.fog3.d600000.m1.haplotype_jaccard.sim  -g ome2consolidatedGFF.txt --tags geneFinder/macpha6_tyr.filt_intersect.ids --gc macpha6.assemblies.gcContent_w1000.bed

the gggenomes script printed out by default should accommodate the length of most regions. But depending on the region length, the visualization may be wonky. Edit the R script or the *.seqs.config file (to flip sequence orientations) and re-run the R script manually in case you want to make custom edits.

Summary of key output files

  • *.mat files contain matrix-formatted data useful for visualizing alongside phylogenetic trees (e.g., using $STARFISHDIR/aux/mat2tree.py). To aid with heatmap-based visualizations, the *.genome.mat file uses '-1' to represent the presence of an empty site, '2' to represent the presence of an element, '1' to represent the presence of a fragmented element, and '0' to represent missing data.
  • *.sim files contain pairwise jaccard similarities between different regions and haplotypes within regions that, among other things, will be useful for logically arranging haplotypes within synteny plots.
  • the *.dereplicated.txt file contains metadata about the elements present in each region, including the 'reference' element which is defined as the longest element whose predicted insertion site is found within the region.
  • the *.regions.txt file contains metadata about the element, empty, and fragmented haplotypes found in each region.
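When reading the *.genome.mat file by eye or in a script, a small lookup for the numeric codes can help. The function below is a hypothetical helper that only encodes the mapping listed above; it makes no assumption about the layout of the matrix file itself.

```shell
# Translate a genome.mat code into the state it represents
decode_genotype() {
    case "$1" in
        -1) echo "empty site" ;;
        2)  echo "element" ;;
        1)  echo "fragmented element" ;;
        0)  echo "missing data" ;;
        *)  echo "unknown code: $1" ;;
    esac
}

# Print the full legend
for code in -1 0 1 2; do
    printf '%s\t%s\n' "$code" "$(decode_genotype "$code")"
done
```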