LTR_Stream is designed to achieve sub-lineage level LTR-RT clustering in closely related species, discovering valuable genetic markers for genome comparison and for studying LTR-RT modular evolution in host genomes. It takes nucleotide sequences of intact LTR-RTs belonging to the same LTR-lineage as input. A mix of LTR-RTs from different LTR-lineages is theoretically acceptable but not recommended. LTR_Stream assigns each LTR-RT a cluster label and automatically evaluates the reliability of each cluster.
Conda should be installed with version >=23.1.0.
Mamba is recommended for speeding up conda.
Please install git with version >=2.34.1.
Please configure an SSH key for git and make sure that git clone works.
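The version requirements above can be checked before installation. The following is a minimal sketch, not part of LTR_Stream: the function names are hypothetical, and it assumes that the output of `conda --version` and `git --version` contains a dotted version number.

```python
import re

# Minimum versions taken from the requirements listed above.
MINIMUMS = {"conda": "23.1.0", "git": "2.34.1"}


def parse_version(text):
    """Extract the first dotted version number (e.g. '23.1.0') as a tuple of ints."""
    match = re.search(r"(\d+(?:\.\d+)+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def meets_minimum(version_output, minimum):
    """Return True if the version found in version_output is >= minimum.

    Versions are compared numerically, field by field, so '2.9.0' is
    correctly treated as older than '2.34.1'.
    """
    return parse_version(version_output) >= parse_version(minimum)
```

For example, `meets_minimum("git version 2.25.1", MINIMUMS["git"])` evaluates to False, which would indicate that git needs to be upgraded first.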
ltrStreamInstallPath=path_you_want_to_install_LTR_Stream
cd ${ltrStreamInstallPath} && git clone git@github.com:xjtu-omics/LTR_Stream.git
If mamba is not available, please run:
cd ${ltrStreamInstallPath}/LTR_Stream && bash Init_LTR_Stream_Env.sh
For a faster installation with mamba, please run:
cd ${ltrStreamInstallPath}/LTR_Stream && bash Init_LTR_Stream_Env.sh mamba
conda activate ltrStream
cd ${ltrStreamInstallPath}/LTR_Stream/src
snakemake -s LTR_Stream.smk -f stream --config ltrParaFile=path_of_ltrPara.tsv -j {threadsNumber}
LTR_Stream runs automatically according to the parameters set in this TSV (Tab-Separated Values) file, so please make sure all parameters are set there before you start LTR_Stream.smk. (You can choose the file name and path according to your preferences. In this documentation, we refer to this configuration file as ltrPara.tsv.) The following is an example of the file. Lines beginning with a # are comments. The values of optional parameters in this example are their default values in LTR_Stream. To facilitate parameter tuning, the parameters that significantly impact the clustering results are introduced first. A standard example of this file is under examples/.
# An example for ltrPara.tsv
# All fields are tab-separated.
# Mandatory parameters
# workDir: A blank directory for running LTR_Stream
# The outputs of LTR_Stream are in workDir/figure
workDir /xx/xx/xx
# ltrFasta: The nucleotide sequences of the LTR-RT set you want to
# analyze. Please ensure it is in standard FASTA format.
ltrFasta /xx/xx/xx.fa
# Optional parameters
# Important parameters
# minOverLapForNovelModule: Controls the number and dispersion of module sequences in the 3-D space.
# It is used in a disjoint-set data structure to judge whether there should be an edge between two
# alignment regions. It can be set within the range 0 to 1. A greater minOverLapForNovelModule leads
# to more module sequences and a more dispersed result. Default is 0.8.
minOverLapForNovelModule 0.8
# topModNum: Together with minOverLapForNovelModule, controls the number and dispersion of module
# sequences. A greater topModNum leads to more module sequences and a more dispersed result.
# LTR_Stream will output a plot of module number versus covered LTR-RTs (named coverLine.pdf under
# workDir/figure). topModNum needs to be set large enough to ensure that about 80% of LTR-RTs have
# 2-3 modules. It is estimated that topModNum should be in the range 200-800. A larger
# minOverLapForNovelModule usually corresponds to a larger topModNum, so the two parameters can be
# adjusted in coordination. Default is 250.
topModNum 250
# tsneEarlyExaggeration: A crucial parameter in t-SNE dimensionality reduction that directly affects
# the results. An excessively large tsneEarlyExaggeration will result in a linear shape in the
# three-dimensional space, while an excessively small tsneEarlyExaggeration will lead to a dispersed
# distribution, hindering sub-lineage identification. It is estimated that tsneEarlyExaggeration
# should be in the range 6-9. Default is 6.
tsneEarlyExaggeration 6
# tsnePerplexity: Larger tsnePerplexity will provide more robust results, while a smaller
# tsnePerplexity will yield more detailed clustering results. Depending on the size of the dataset,
# it is not recommended to set tsnePerplexity to less than 3% of the module sequence count for larger
# datasets, or less than 15 for smaller datasets. Default is 100.
tsnePerplexity 100
# cluCentCut: A parameter used to assess the degree of intra-class distribution aggregation in 3D
# space. A larger cluCentCut will result in coarser clustering. If LTR_Stream indicates clustering
# failure, please increase this parameter within the range of 0-1. Default is 0.1.
cluCentCut 0.1
# maxZoomInLevel: LTR_Stream achieves fine clustering of LTR-RT in complex scenarios through
# iterative expansion. This parameter controls the maximum depth of iterative expansion. If you find
# that the number of clusters is too large or some categories within subviews are verified as
# unreliable, you can set a maximum limit. The default value is -1, which means no limit is set.
maxZoomInLevel -1
# minClusterSize: Clusters containing fewer module sequences than this threshold will be considered
# noise and filtered out. If the number of clustered LTR-RTs is low, it is recommended to reduce the
# threshold accordingly. Default is 50.
minClusterSize 50
# Other parameters
# tsneLearningRate: For t-SNE dimensionality reduction, LTR_Stream requires a very small learning rate,
# with a default value of 6. It is not recommended to set this value higher than 8.
tsneLearningRate 6
# blastEvalue: Used for homology searching in BLASTn. Default is 1e-10. If the LTR-RT sequence set to
# be analyzed has particularly high similarity, you can reduce this parameter accordingly.
blastEvalue 1e-10
# Parameters used in ElPiGraph
epgLambda 0.01
epgMu 0.01
epgAlpha 0.05
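Before starting a run, the parameter file can be checked for missing mandatory entries and out-of-range values. The following is a minimal sketch under stated assumptions, not the parser used by LTR_Stream: the function names are hypothetical, each non-comment line is assumed to be `name<TAB>value`, and the defaults and ranges are copied from the example above.

```python
# Defaults copied from the example ltrPara.tsv above.
DEFAULTS = {
    "minOverLapForNovelModule": 0.8,
    "topModNum": 250,
    "tsneEarlyExaggeration": 6.0,
    "tsnePerplexity": 100.0,
    "cluCentCut": 0.1,
    "maxZoomInLevel": -1,
    "minClusterSize": 50,
    "tsneLearningRate": 6.0,
    "blastEvalue": 1e-10,
    "epgLambda": 0.01,
    "epgMu": 0.01,
    "epgAlpha": 0.05,
}
MANDATORY = ("workDir", "ltrFasta")


def parse_ltr_para(lines):
    """Parse 'name<TAB>value' lines; comment lines ('#') and blank lines are skipped."""
    params = dict(DEFAULTS)
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(f"expected 'name<TAB>value', got {raw!r}")
        name, value = fields[0].strip(), fields[1].strip()
        if name in DEFAULTS:
            # Convert numeric parameters to the type of their documented default.
            value = type(DEFAULTS[name])(value)
        params[name] = value
    return params


def check_ltr_para(params):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    for name in MANDATORY:
        if not params.get(name):
            problems.append(f"missing mandatory parameter: {name}")
    # Both parameters are documented above as lying within 0 to 1.
    for name in ("minOverLapForNovelModule", "cluCentCut"):
        if not 0 <= params[name] <= 1:
            problems.append(f"{name} should lie between 0 and 1")
    return problems
```

A typical use would be `check_ltr_para(parse_ltr_para(open(path)))`, where `path` points to your own ltrPara.tsv; an empty list means no problem was detected by these checks.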
Due to differences in dataset size and the degree of internal sequence consistency, the parameters of LTR_Stream need to be adjusted for each dataset. The two most critical parameters are minOverlap and tsnePerplexity. LTR_Stream provides intermediate visualizations to assist with parameter tuning.
Before tuning the parameters, please run LTR_Stream.smk with the default settings. If no clustering results are produced or the results are unsatisfactory, proceed with parameter adjustment.
Specifically, please begin by determining an appropriate value for minOverlap.
Based on our testing, the value of minOverlap typically falls within the range of 0.75 to 0.99. After modifying this parameter, please re-run the following command, which will generate the file figure/coverLine.pdf. Use this file to adjust minOverlap accordingly. A specific example and guidance are shown in the figure below.
snakemake -s LTR_Stream.smk -f staNovelSelectedNumVsCovered -R selectNovelUnits --config ltrParaFile=/path/to/ltrPara.tab -j {threads}
After setting an appropriate minOverlap value, please proceed to adjust tsnePerplexity.
The parameter tsnePerplexity is primarily related to the dataset size and does not have a fixed optimal range. For datasets with around 10,000 sequences, a value between 100 and 200 may be appropriate. For smaller datasets with only a few hundred sequences, values between 10 and 100 are typically suitable. After each adjustment, please run the following command, and LTR_Stream will update the corresponding file figure/tsneDistance.3d.gif. Please adjust this parameter based on the example provided in the figure below.
snakemake -s LTR_Stream.smk -f tsnePlot -R mergeModules --config ltrParaFile=/path/to/ltrPara.tab -j {threads}
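The lower bound on tsnePerplexity given in the ltrPara.tsv comments (at least 3% of the module sequence count for larger datasets, at least 15 for smaller ones) can be written as a small helper. This is only one possible reading of that recommendation: taking the larger of the two bounds is an assumption, the function name is hypothetical, and the result is a floor for the search rather than a recommended setting.

```python
import math


def perplexity_floor(n_module_sequences, small_floor=15, percent=3):
    """Return a lower bound for tsnePerplexity.

    Assumption: the two documented bounds (15, and 3% of the module
    sequence count) are combined by taking whichever is larger.
    """
    if n_module_sequences < 1:
        raise ValueError("n_module_sequences must be positive")
    return max(small_floor, math.ceil(percent * n_module_sequences / 100))
```

For instance, `perplexity_floor(1000)` gives 30, while any dataset with 500 or fewer module sequences gives 15. The value actually chosen should still be judged from figure/tsneDistance.3d.gif.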
snakemake -s LTR_Stream.smk -f stream -R stream --config ltrParaFile=/path/to/ltrPara.tab -j {threads}
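When the tuning loop is repeated many times, the three commands above can be generated from one place. The sketch below only assembles the argument lists; the target and rule names are copied from the commands in this document, while the stage labels (`coverLine`, `tsnePlot`, `stream`) and the function name are hypothetical.

```python
# (target, rule to re-run) pairs copied from the commands in this document.
STAGES = {
    "coverLine": ("staNovelSelectedNumVsCovered", "selectNovelUnits"),
    "tsnePlot": ("tsnePlot", "mergeModules"),
    "stream": ("stream", "stream"),
}


def snakemake_command(stage, para_file, threads):
    """Return the snakemake argument list for one tuning stage."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage {stage!r}; expected one of {sorted(STAGES)}")
    target, rerun_rule = STAGES[stage]
    return [
        "snakemake", "-s", "LTR_Stream.smk",
        "-f", target,
        "-R", rerun_rule,
        "--config", f"ltrParaFile={para_file}",
        "-j", str(threads),
    ]
```

The returned list can be passed to `subprocess.run(..., check=True)` with the working directory set to LTR_Stream/src, inside the activated ltrStream environment.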
All outputs will be saved in workDir/figure
GIF files showing clustering results in each 3D-subview.
TSV file recording final cluster results.
TSV file recording details of clustering including coordinate information in each subview.
TSV file recording the fold change of inter- and intra-cluster distance and the corresponding significance for each cluster. A fold change significantly greater than one indicates a reliable cluster.
Line plot showing module number and the corresponding percentage of covered LTR-RTs. Used for guiding parameter adjustment.
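The reliability criterion described above (a fold change significantly greater than one) can be applied to the per-cluster records once they are loaded. The sketch below makes several assumptions: the significance is taken to be a p-value, the threshold `alpha=0.05` is illustrative rather than a value used by LTR_Stream, and the records are passed as plain tuples because the column names of the TSV file are not given here.

```python
def reliable_clusters(rows, alpha=0.05):
    """Return the labels of clusters whose fold change is > 1 and significant.

    rows: iterable of (cluster_label, fold_change, p_value) tuples.
    alpha: illustrative significance threshold (an assumption).
    """
    return [
        label
        for label, fold_change, p_value in rows
        if fold_change > 1 and p_value < alpha
    ]
```

Clusters that do not pass this filter may deserve a second look, for example by limiting maxZoomInLevel as described in the parameter file.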
Xu, Tun, et al. "Deciphering complex interactions between LTR retrotransposons and three Papaver species using LTR_Stream." Genomics, Proteomics & Bioinformatics (2025): qzaf061.