Releases: oushujun/LTR_retriever
v3.0.4
This update further enhances robustness for large genomes, streamlines overlap computations, and lays the groundwork for more scalable LTR discovery.
Major update
- Reduced RepeatMasker memory footprint with run_RM_split.pl:
  - Splits large FASTA inputs into manageable chunks for RepeatMasker and masks them in parallel.
  - Skips already-masked chunks, then merges results into a single .masked file.
  - Runs RepeatMasker on the full dataset first, and automatically falls back to the chunked masking strategy when the primary call fails or yields no repeats, capping parallel jobs to avoid OOM kills.
- Rewrote bed_intersect_wao.pl:
  - Simplified buffering logic: only “active” B intervals are kept in memory, and they are purged by chromosome and start/end comparisons.
  - Replaced the circular-lookback logic with a single pass over an in-memory buffer, supporting arbitrary chromosome orders.
  - Dynamically detects the number of columns in B to generate the correct “wao” dummy lines when no overlaps are found.
  - Speed is comparable to the original bedtools intersect -wao.
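
The buffering idea described for bed_intersect_wao.pl can be illustrated with a minimal Python sketch. This is not the script's actual code: the function name `intersect_wao`, the list-of-lists input format, and the per-chromosome grouping are assumptions made for illustration, and BED-style half-open coordinates are assumed.

```python
from collections import defaultdict


def intersect_wao(a_rows, b_rows):
    """Yield rows of A fields + B fields + overlap length, in the style of -wao.

    Illustrative sketch only. Rows are lists of strings: chrom, start, end,
    then any extra columns. Coordinates are assumed to be half-open (BED).
    """
    # Detect the B column count so the "no overlap" dummy record has the same width.
    n_b_cols = len(b_rows[0]) if b_rows else 3
    dummy_b = [".", "-1", "-1"] + ["."] * (n_b_cols - 3)

    # Group by chromosome so the chromosome order of the inputs does not matter.
    b_by_chrom = defaultdict(list)
    for row in b_rows:
        b_by_chrom[row[0]].append(row)
    a_by_chrom = defaultdict(list)
    for row in a_rows:
        a_by_chrom[row[0]].append(row)

    for chrom, a_list in a_by_chrom.items():
        a_list.sort(key=lambda r: int(r[1]))
        b_list = sorted(b_by_chrom.get(chrom, []), key=lambda r: int(r[1]))
        active = []  # B intervals that may still overlap the current or a later A
        next_b = 0
        for a in a_list:
            a_start, a_end = int(a[1]), int(a[2])
            # Admit B intervals that start before the current A interval ends.
            while next_b < len(b_list) and int(b_list[next_b][1]) < a_end:
                active.append(b_list[next_b])
                next_b += 1
            # Purge B intervals ending at or before a_start: because A is sorted
            # by start, they cannot overlap this or any later A interval.
            active = [b for b in active if int(b[2]) > a_start]
            hits = 0
            for b in active:
                overlap = min(a_end, int(b[2])) - max(a_start, int(b[1]))
                if overlap > 0:
                    hits += 1
                    yield a + b + [str(overlap)]
            if hits == 0:
                yield a + dummy_b + ["0"]
```

Unlike bedtools, this sketch emits A records sorted by start within each chromosome rather than in input order; it is meant only to show the active-buffer and dummy-line behavior.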
Minor update
- Refactored LTR.identifier.pl: fixed a stray commented guard so that undefined scan entries are now properly skipped (#193).
v3.0.3
Major change
Introduce the -salvage [0|1] flag (default: 0) to recycle intermediate files and skip reruns when -salvage 1 is specified. This is particularly useful when processing large genomes (> 10 Gb) with limited walltime.
- Reuse existing results in the Init, Major, and Trunc steps to skip reprocessing.
- In the Major step, reuse TEsorter, HMM classification files, and processed candidates in the .defalse file.
- Add new utility scripts under bin/:
  - bed_intersect_wao.pl: bedtools-like intersect with ‘wao’ behavior and buffer
  - filter_extend.fa_by_defalse.pl: filters extended FASTA by existing entries in the .defalse file
  - filter_scn_by_defalse.pl: filters scn entries present in the .defalse file
Minor change
Modify LTR.identifier.pl:
- Make necessary changes to implement the salvage mode.
- Add fallback for zero-length boundaries ($tot_len = $seq_len) to fix @EDTA#564
Full Changelog: v3.0.2...v3.0.3
v3.0.2
New features
Added the K2P and p-distance models for divergence and age estimation. K2P is now the default model (#170, #184).
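
For readers unfamiliar with the two models, the following minimal Python sketch shows the textbook formulas on a pair of aligned LTR sequences. It is not LTR_retriever's own implementation: the function names are hypothetical, the age formula T = K / (2r) is the commonly used relation for LTR insertion time, and the rate value is only an illustrative placeholder that should be replaced with one appropriate for the species.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def count_differences(seq1, seq2):
    """Return (compared sites, transitions, transversions) for two aligned sequences.

    Columns containing gaps or ambiguous bases are skipped.
    """
    sites = transitions = transversions = 0
    for x, y in zip(seq1.upper(), seq2.upper()):
        if x not in "ACGT" or y not in "ACGT":
            continue
        sites += 1
        if x == y:
            continue
        pair = {x, y}
        if pair <= PURINES or pair <= PYRIMIDINES:
            transitions += 1  # A<->G or C<->T
        else:
            transversions += 1  # purine <-> pyrimidine
    if sites == 0:
        raise ValueError("no comparable sites in the alignment")
    return sites, transitions, transversions


def p_distance(seq1, seq2):
    """Proportion of compared sites that differ."""
    sites, ts, tv = count_differences(seq1, seq2)
    return (ts + tv) / sites


def k2p_distance(seq1, seq2):
    """Kimura two-parameter distance: -1/2 * ln((1 - 2P - Q) * sqrt(1 - 2Q))."""
    sites, ts, tv = count_differences(seq1, seq2)
    p = ts / sites  # transition proportion
    q = tv / sites  # transversion proportion
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


def insertion_age(distance, rate):
    """Estimated age in years, T = K / (2 * r), with r in substitutions/site/year."""
    return distance / (2 * rate)


ltr5 = "ACGTACGTAC"
ltr3 = "ACGTACGTAT"
ASSUMED_RATE = 1.3e-8  # illustrative placeholder, not a recommendation
print(p_distance(ltr5, ltr3), k2p_distance(ltr5, ltr3))
print(insertion_age(k2p_distance(ltr5, ltr3), ASSUMED_RATE))
```

Because K2P corrects for multiple substitutions at the same site, it gives a slightly larger distance than the p-distance for the same alignment, and therefore a slightly older age estimate.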
Enhancements
- Improved parameters for timeout blastn to avoid stalling (#167)
- Added new code to search for solo LTRs and roughly intact LTR-RTs
- Added a wrapper to calculate solo:intact LTR ratios from both LTR_retriever and EDTA results (@EDTA#279)
Bug fix
Added a script to recreate the retriever.scn.adj from the .defalse file to avoid inconsistencies.
Full Changelog: v3.0.1...v3.0.2
v3.0.1 release
New feature
Add the -stop parameter to stop the program after a user-specified step. For example, if you only want to obtain the .defalse and .pass.list files, you can stop the program after the Major filtering step (i.e., -stop major). By default, the full pipeline runs to completion.
v3.0.0 update
Bug fix
- Update get_range.pl: fix the sequence ID recognition issue for LTRharvest inputs (#177)
- Make sure candidates have sufficient flanking sequence to extend (50bp)
v2.9.9 update
New feature
Enable strand-aware outputs
For LTR candidates found on the negative strand, the locus is now presented 5' -> 3', as for candidates found on the positive strand. For example, Chr1:7890..3456 indicates that the candidate is on the - strand. This information is shown in the first column of the pass.list, the last column of the gff3 file, and the sequence names of the intact.fa file. If the element is on the - strand, its sequence in the intact.fa file is shown 5' -> 3' from the negative strand. For example, the sequence of Chr1:7890..3456 is the reverse complement of the sequence of Chr1:3456..7890. For candidates without strand information (i.e., lacking coding sequence), the strand is assumed to be positive for convenience.
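
As a hedged illustration of the strand-aware locus convention, the Python sketch below infers the strand from the coordinate order and returns the matching sequence. The helper names (`parse_locus`, `reverse_complement`, `extract`) are hypothetical, the genome is a plain dictionary for brevity, and 1-based inclusive coordinates are assumed.

```python
import re

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")
_LOCUS = re.compile(r"^(?P<chrom>.+):(?P<first>\d+)\.\.(?P<second>\d+)$")


def parse_locus(locus):
    """Split 'Chr1:7890..3456' into (chrom, start, end, strand).

    A first coordinate larger than the second marks the - strand.
    """
    match = _LOCUS.match(locus)
    if match is None:
        raise ValueError(f"unrecognized locus: {locus!r}")
    first, second = int(match["first"]), int(match["second"])
    strand = "-" if first > second else "+"
    return match["chrom"], min(first, second), max(first, second), strand


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract(genome, locus):
    """Return the locus sequence 5' -> 3' on its own strand.

    Coordinates are assumed to be 1-based and inclusive.
    """
    chrom, start, end, strand = parse_locus(locus)
    seq = genome[chrom][start - 1:end]
    return reverse_complement(seq) if strand == "-" else seq


genome = {"Chr1": "AAACCCGGGT"}
print(parse_locus("Chr1:7890..3456"))
print(extract(genome, "Chr1:2..5"), extract(genome, "Chr1:5..2"))
```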
Bug fix
- Ensure candidates have sufficient flanking sequence to extend (default 50 bp), which is necessary for LTR_retriever to determine whether the candidate is true or false. Candidates that cannot satisfy this criterion are skipped. This scenario is most likely to occur in fragmented genomes. Bug report: oushujun/EDTA#263
v2.9.8 update
New features
- Use the same LTR name for parts of INT and LTR from the same element in preparation for solving @edta#251
- Add the yml file for conda installation
Bug fix
Update get_range.pl
- Fixed a bug introduced in August 2023 (#a375c5e) that output all candidates (both LTR retrotransposons and non-LTR repeats) when generating the library file. This bug caused non-LTR sequences to appear in the library (e.g., LTR/EnSpm-CACTA).
- Fixed a bug introduced in May 2023 (#058ce29) that failed to remove masked sequences from the final library.
- Remove the RepeatMasker support to simplify the code since this functionality is never used in the official release.
Bug fix
It just gets better with community efforts!
Major Updates
- Add TEsorter to help identify non-LTR sequences. Candidate LTRs are classified as "false" if they contain non-LTR HMM profile matches, even if the candidate contains LTR/TSD and the TGCA motif. This purging removes a small number of structurally intact LTR candidates (5/2304 in rice). This implementation offers slight improvements over older versions, and the improvement should be more significant for larger genomes.

  | LTR_retriever-harvest_FINDER | sens | spec | accu | prec | FDR | F1 |
  |---|---|---|---|---|---|---|
  | retriever_v2.5 | 0.967 | 0.920 | 0.931 | 0.789 | 0.211 | 0.869 |
  | retriever_v2.6 | 0.963 | 0.931 | 0.939 | 0.811 | 0.189 | 0.881 |
  | retriever_v2.9.2 | 0.966 | 0.926 | 0.935 | 0.802 | 0.198 | 0.876 |
  | retriever_v2.9.4 | 0.967 | 0.928 | 0.937 | 0.804 | 0.196 | 0.878 |

- Add more filtering parameters to identify solo LTRs and improve the solo-intact ratio calculation (#111, #110).
- Resolve RMblast errors when it attempts to overutilize CPUs (#137).
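
The columns of the benchmark table above are related by standard definitions, assuming "sens", "prec", "FDR", and "F1" carry their usual meanings (sensitivity, precision, false discovery rate, and the harmonic mean of sensitivity and precision). The short Python sketch below rechecks the published values under that assumption; small discrepancies are expected because the inputs are already rounded to three decimals.

```python
def fdr_from_precision(precision):
    """False discovery rate is the complement of precision."""
    return 1.0 - precision


def f1_score(sensitivity, precision):
    """Harmonic mean of sensitivity (recall) and precision."""
    return 2 * sensitivity * precision / (sensitivity + precision)


# (sens, prec, FDR, F1) copied from the table in the release notes.
BENCHMARK = {
    "retriever_v2.5": (0.967, 0.789, 0.211, 0.869),
    "retriever_v2.6": (0.963, 0.811, 0.189, 0.881),
    "retriever_v2.9.2": (0.966, 0.802, 0.198, 0.876),
    "retriever_v2.9.4": (0.967, 0.804, 0.196, 0.878),
}

for name, (sens, prec, fdr, f1) in BENCHMARK.items():
    print(name, round(fdr_from_precision(prec), 3), round(f1_score(sens, prec), 3))
```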
Other improvements
- Sequence IDs must now be 13 characters or fewer, to accommodate huge chromosomes up to 999 Mb in length.
- Add missing TRF parameter (#133)
- Add check to ensure the input genome is writable (LTR_retriever won't overwrite your genome) (#125).
- Remove gap length for genome size calculation.
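
The sequence ID length requirement listed above can be checked before a run. The Python sketch below is a hypothetical helper, not part of LTR_retriever; it assumes the sequence ID is the first whitespace-delimited token of each FASTA header line.

```python
def long_sequence_ids(fasta_lines, max_len=13):
    """Return the sequence IDs longer than max_len characters.

    The ID is assumed to be the first whitespace-delimited token after '>'.
    """
    too_long = []
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            seq_id = fields[0] if fields else ""
            if len(seq_id) > max_len:
                too_long.append(seq_id)
    return too_long


example = [">Chr1 assembled chromosome", "ACGT", ">scaffold_0001234567", "ACGT"]
print(long_sequence_ids(example))
```

To check a real file, pass an open file handle, for example `long_sequence_ids(open("genome.fa"))`, where the file name is a placeholder.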
Acknowledgements
Andreas Wallberg, @Shokusei, Evan Ernst, @xie-wei-hh, @with9, and users like YOU!
Version 2.9.0: Polishing outputs
Major updates
This version has many improvements in the downstream outputs including:
- standardized the GFF3 output following these criteria and used the updated TE-related sequence ontologies
- combined structural and homological LTR annotations. Homology-based LTR fragments will be replaced by structural-based LTR annotations wherever applicable.
Other improvements
- allow users to provide paths to dependencies in the command-line.
- updated readme
- fixed a number of minor bugs.