TIR-Learner is an ensemble pipeline for annotating Terminal Inverted Repeat (TIR) transposable elements in eukaryotic genomes. Version 3 is a complete redesign and rewrite focused on efficiency, compatibility, and code quality.
- Background
- New in Version 3
- Installation
- Usage
- Program Workflow
- Output Files
- Algorithm Details
- Performance Improvements
- Contributing
- Citation
- Credits
- License
Transposable elements (TEs) are DNA sequences that can move within a genome. Terminal Inverted Repeat (TIR) transposons are a specific type of TE characterized by inverted repeat sequences at their ends. These TIRs act like bookends, marking the beginning and end of the transposable element and helping enzymes recognize and move the element.
TIR-Learner combines multiple approaches to identify and classify TIR transposons:
- Homology-based detection using curated reference libraries
- De novo structural identification
- Machine learning classification into TIR superfamilies
- Reduced I/O operations through in-memory data processing using Pandas DataFrames
- Multiprocessing support for both TIRvish and GRF tools
- Optimized sequence processing algorithms
- Updated dependencies for better maintainability
- New conda recipe for easier installation
- PyTorch backend for the CNN model
- Automatic genome file pre-scanning and validation
- Checkpoint system for progress recovery
- Overlap detection and simplification in results
- Additional parallel execution modes
- Progress tracking and verbose output options
If you do not have conda/mamba installed, we strongly recommend Miniforge. For more information, please refer to conda-forge/miniforge.
Note: In all installation commands, -c conda-forge MUST be specified before -c bioconda so that conda-forge takes priority over bioconda and the latest packages are installed (according to Channels: Specifying channels when installing packages).
You can create a new environment with any name (we use TIRLearner_env as an example) and install TIR-Learner in it with a single command:
mamba create -n TIRLearner_env -c conda-forge -c bioconda tir-learner
To install TIR-Learner into an existing, already activated environment instead:
mamba install -c conda-forge -c bioconda tir-learner
Sometimes users may want to force PyTorch to use CPU only (or conversely force using CUDA). For example, while HPC nodes may have GPUs that trigger conda to automatically install CUDA-enabled PyTorch builds, users without GPU access permissions will encounter runtime errors while using TIR-Learner.
In such cases, you can specify a CPU-only PyTorch build by adding "pytorch=*=*cpu*" to the installation commands. For instance:
mamba create -n TIRLearner_env -c conda-forge -c bioconda tir-learner "pytorch=*=*cpu*"
Similarly, you can add "pytorch=*=*cuda*" to the installation commands to enforce CUDA usage.
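After installation, it can help to confirm which kind of PyTorch build actually landed in the environment. The following is a minimal sketch, not part of TIR-Learner; the function name `describe_pytorch_build` is made up for illustration, and it relies only on `torch.version.cuda` and `torch.cuda.is_available()`.

```python
import importlib.util


def describe_pytorch_build() -> str:
    """Return a one-line summary of the PyTorch build in the current environment."""
    if importlib.util.find_spec("torch") is None:
        return "PyTorch is not installed in this environment"

    import torch  # imported lazily so the check above can fail gracefully

    # torch.version.cuda is None when the build was compiled without CUDA.
    if torch.version.cuda is None:
        build = "built without CUDA"
    else:
        build = f"built with CUDA {torch.version.cuda}"

    # is_available() is False if the build lacks CUDA or no usable GPU is visible.
    usable = torch.cuda.is_available()
    return f"PyTorch {torch.__version__} ({build}), GPU usable: {usable}"


if __name__ == "__main__":
    print(describe_pytorch_build())
```

If this reports a CUDA build but `GPU usable: False`, that is the situation described above, and reinstalling with the CPU-only build string is the likely fix.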
Clone the repository and install all the dependencies. TIR-Learner.py is the entry point of the program.
- Python >=3.8
- BLAST
- GenomeTools (gt)
- GRF (Generic Repeat Finder)
- Required Python packages:
  - biopython
  - keras >=3.3.3
  - pandas
  - pytorch
  - regex
  - scikit-learn >=1.3
  - swifter
TIR-Learner [-h] [-v] -f GENOME_FILE -s SPECIES [-n GENOME_NAME] [-l LENGTH] [-p PROCESSOR] [-w WORKING_DIR] [-o OUTPUT_DIR] [-c [CHECKPOINT_DIR]] [--verbose] [-d] [--grf_path GRF_PATH] [--gt_path GT_PATH] [-a ADDITIONAL_ARGS]
-h, --help
- Show help message and exit
-v, --version
- Show version information and exit
-f, --genome_file
- Genome file in FASTA format
- Must be properly formatted with unique sequence names
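Because sequence names must be unique, it can be worth checking a genome file before a long run. The sketch below is an independent illustration and not TIR-Learner's own pre-scan; it assumes the sequence name is the first whitespace-delimited token after `>`, and the helper name `duplicated_fasta_names` is hypothetical.

```python
from collections import Counter
from typing import Iterable, List


def duplicated_fasta_names(lines: Iterable[str]) -> List[str]:
    """Return sequence names that occur more than once among FASTA header lines."""
    names = Counter(
        line[1:].split()[0]  # name = first token after '>' (an assumption)
        for line in lines
        if line.startswith(">") and line[1:].strip()
    )
    return sorted(name for name, count in names.items() if count > 1)


# Small in-memory example; with a real file, pass the open file handle instead.
example = [">chr1 assembled", "ACGT", ">chr2", "GGCC", ">chr1 duplicate", "TTAA"]
print(duplicated_fasta_names(example))
```

For a real genome, `with open("genome.fa") as handle: duplicated_fasta_names(handle)` streams the file line by line without loading it into memory.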
-s, --species
- Species specification for analysis
- Options:
  - rice: Use rice-specific TIR reference library
  - maize: Use maize-specific TIR reference library
  - others: Use general analysis pipeline without species-specific references
- Affects which processing path (Part A or B) will be used
-n, --genome_name (default: "TIR-Learner")
- Name prefix for output files
- Used in naming temporary files and final results
- Avoid using special characters
-l, --length (default: 5000)
- Maximum length of TIR transposons to detect
-p, -t, --processor (default: all available CPU cores)
- Number of processors to use for parallel operations
-w, --working_dir (default: temporary directory in /tmp)
- Directory for storing temporary files
- Will be automatically cleaned unless in debug mode
- Requires sufficient disk space for temporary files
-o, --output_dir (default: genome file directory)
- Directory for final output files
- Will be created if it doesn't exist
- Must have write permissions
-c, --checkpoint_dir
- Directory for checkpoint files to load
- Note: Checkpoint files are saved in the output directory by default, so this option is only needed to load checkpoints from a different directory
- Options:
  - Specify with path (e.g. -c /path/to/checkpoint): Use the given directory for checkpoints
  - Specify without path (e.g. -c): Automatically search in the genome file and output directories
  - Do not specify: No checkpoint loading
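The three behaviours of -c (path given, flag alone, flag absent) correspond to a standard argparse pattern. The sketch below shows one way such an option can be declared; it is an illustration rather than TIR-Learner's actual parser, and the `AUTO_SEARCH` sentinel and `checkpoint_mode` helper are hypothetical names.

```python
import argparse

AUTO_SEARCH = "auto"  # hypothetical sentinel meaning "search the default locations"

parser = argparse.ArgumentParser(prog="checkpoint-demo")
parser.add_argument(
    "-c", "--checkpoint_dir",
    nargs="?",          # the value after -c is optional
    const=AUTO_SEARCH,  # used when -c is given without a path
    default=None,       # used when -c is not given at all
)


def checkpoint_mode(argv):
    """Describe what a given command line asks for regarding checkpoints."""
    value = parser.parse_args(argv).checkpoint_dir
    if value is None:
        return "no checkpoint loading"
    if value == AUTO_SEARCH:
        return "search genome file and output directories"
    return f"load from {value}"


for argv in ([], ["-c"], ["-c", "/path/to/checkpoint"]):
    print(argv, "->", checkpoint_mode(argv))
```

Using `nargs="?"` together with `const` is what lets a single flag distinguish "present without a value" from "absent", which a plain optional argument cannot do.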
--verbose
- Enable detailed progress output
- Shows user-friendly progress bars
-d, --debug
- Enable debug mode
- Keeps all temporary files
- Stores additional checkpoint information
--grf_path
- Path to GRF (Generic Repeat Finder)
- Required if GRF is not in system PATH
- Must point to directory containing grf-main
--gt_path
- Path to genometools
- Required if genometools is not in system PATH
- Must point to directory containing gt
-a, --additional_args
- Additional arguments to pass to the program
- Pass additional options individually using -a for each option
- Examples:
  - Single option: -a CHECKPOINT_OFF
  - Multiple options: -a CHECKPOINT_OFF -a SKIP_TIRVISH
Available options:
CHECKPOINT_OFF
- Completely disables checkpoint system
- No checkpoints will be saved or loaded
- Useful for one-off runs or when disk space is limited
NO_PARALLEL
- Disables multi-processing
- Forces serial execution of computation-intensive steps
- Useful for debugging or on systems with limited resources
SKIP_TIRVISH
- Skips TIRvish analysis step
- Reduces runtime but may miss some TIRs
- Cannot be used together with SKIP_GRF
SKIP_GRF
- Skips GRF analysis step
- Reduces runtime but may miss some TIRs
- Cannot be used together with SKIP_TIRVISH
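A repeated -a flag and the rule that SKIP_TIRVISH and SKIP_GRF are mutually exclusive can be sketched with argparse's `append` action. This is a hedged illustration of the documented behaviour, not the program's real argument handling; `parse_additional_args` is a hypothetical helper, and raising `ValueError` on the conflict is an assumption.

```python
import argparse


def parse_additional_args(argv):
    """Collect every -a value and reject the one combination documented as invalid."""
    parser = argparse.ArgumentParser(prog="additional-args-demo")
    # action="append" gathers each occurrence of -a into a list.
    parser.add_argument("-a", "--additional_args", action="append", default=None)
    options = set(parser.parse_args(argv).additional_args or [])

    # Skipping both de novo tools would leave nothing to analyse.
    if {"SKIP_TIRVISH", "SKIP_GRF"} <= options:
        raise ValueError("SKIP_TIRVISH and SKIP_GRF cannot be used together")
    return options


print(sorted(parse_additional_args(["-a", "CHECKPOINT_OFF", "-a", "SKIP_TIRVISH"])))
```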
TIR-Learner -f ./test/genome.fa -s others -p 2 -l 5000 -c ./ -w ./test/ -d --verbose

TIR-Learner v3 consists of two main processing paths:
Part A (species with a reference library: rice and maize) uses three consecutive modules:
- Module 1: Identify TIR based on existing TIR database
- Module 2: Predict TIR using de novo tool and match them with database to classify their TIR superfamily
- Module 3: Classify TIR superfamily of de novo tool predicted TIR with CNN
Part B (other species) uses a single module:
- Module 4: Predict TIR using de novo tool and classify their TIR superfamily with CNN
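The choice between the two paths can be pictured as a simple dispatch on the -s value. The sketch below assumes that rice and maize, which have reference libraries, take the three-module path and that others takes the single-module path; the function name and the module labels are illustrative and not identifiers from the code base.

```python
from typing import List

# Species for which the options above list a dedicated TIR reference library.
REFERENCE_SPECIES = {"rice", "maize"}


def modules_for_species(species: str) -> List[str]:
    """Return the ordered modules that would run for a given -s value (assumed mapping)."""
    if species in REFERENCE_SPECIES:
        # Homology search, de novo prediction matched to the database, then CNN.
        return ["Module 1", "Module 2", "Module 3"]
    if species == "others":
        # No reference library: de novo prediction classified directly by the CNN.
        return ["Module 4"]
    raise ValueError(f"unsupported species: {species!r}")


for name in ("rice", "maize", "others"):
    print(name, "->", modules_for_species(name))
```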
- Pre-scan genome file
- Route to Part A or B based on species
- Execute relevant modules
- Post-process results
- Combine predictions
- Remove overlaps
- Generate final output
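The overlap-removal step in the list above can be illustrated with a sweep over predictions sorted by position. How TIR-Learner actually chooses between overlapping predictions is not stated here, so this sketch assumes a simple rule: within each overlapping pair on the same sequence, keep the longer element. The `Prediction` type and `remove_overlaps` function are hypothetical.

```python
from typing import Iterable, List, NamedTuple


class Prediction(NamedTuple):
    seqid: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive
    superfamily: str


def remove_overlaps(predictions: Iterable[Prediction]) -> List[Prediction]:
    """Keep the longer prediction whenever two overlap on the same sequence (assumed rule)."""
    kept: List[Prediction] = []
    for pred in sorted(predictions, key=lambda p: (p.seqid, p.start, p.end)):
        last = kept[-1] if kept else None
        if last is not None and last.seqid == pred.seqid and pred.start <= last.end:
            # Overlap with the most recently kept prediction: retain the longer one.
            if pred.end - pred.start > last.end - last.start:
                kept[-1] = pred
        else:
            kept.append(pred)
    return kept


demo = [
    Prediction("chr1", 100, 500, "DTA"),
    Prediction("chr1", 400, 1200, "DTC"),
    Prediction("chr1", 2000, 2300, "DTT"),
    Prediction("chr2", 150, 450, "DTM"),
]
for pred in remove_overlaps(demo):
    print(pred)
```

Sorting first means each prediction only needs to be compared with the last one kept, so the sweep runs in O(n log n) overall.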
TIR-Learner generates four output files in the TIR-Learner-Result directory:
- GFF3 annotation file: <genome_name>_FinalAnn.gff3
  - Contains predicted TIR locations and classifications
  - Includes TIR and TSD details in attributes
- FASTA sequence file: <genome_name>_FinalAnn.fa
  - Contains extracted sequences for predicted TIRs
  - Headers include location and classification information
- GFF3 annotation file: <genome_name>_FinalAnn_filter.gff3
- FASTA sequence file: <genome_name>_FinalAnn_filter.fa
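For downstream analysis, the GFF3 outputs can be read with a few lines of Python. The sketch below relies only on the standard nine-column GFF3 layout; the example record is made up, and the attribute keys it uses (ID, Name) are generic GFF3 attributes rather than the ones TIR-Learner writes.

```python
import io
from typing import Dict, Iterable, List

GFF3_COLUMNS = ["seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes"]


def read_gff3(handle: Iterable[str]) -> List[Dict[str, object]]:
    """Parse GFF3 lines into dictionaries, skipping comments and malformed rows."""
    records = []
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != len(GFF3_COLUMNS):
            continue
        record: Dict[str, object] = dict(zip(GFF3_COLUMNS, fields))
        record["start"] = int(fields[3])
        record["end"] = int(fields[4])
        # Column 9 holds semicolon-separated key=value pairs.
        record["attributes"] = dict(
            item.split("=", 1) for item in fields[8].split(";") if "=" in item
        )
        records.append(record)
    return records


# Made-up record for illustration only.
example = io.StringIO(
    "##gff-version 3\n"
    "chr1\tTIR-Learner\tterminal_inverted_repeat_element\t100\t500\t.\t+\t.\tID=example_1;Name=demo\n"
)
records = read_gff3(example)
print(records[0]["seqid"], records[0]["start"], records[0]["end"], records[0]["attributes"])
```

With a real result, replace the `io.StringIO` object with an open file handle on the `_FinalAnn.gff3` or `_FinalAnn_filter.gff3` file.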
- Minimum TIR length: 10bp
- Maximum TIR length: 1000bp
- Maximum TIR distance: 5000bp (default)
- Minimum similarity: 80%
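These thresholds can be read as a filter over candidate elements. The sketch below is one possible interpretation, not the pipeline's implementation: it assumes similarity is the ungapped fraction of matching bases between the left TIR and the reverse complement of the right TIR, and that "TIR distance" is a single number supplied by the caller. All function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def tir_similarity(left: str, right: str) -> float:
    """Ungapped identity between the left TIR and the reverse-complemented right TIR."""
    left = left.upper()
    right_rc = reverse_complement(right).upper()
    if not left or len(left) != len(right_rc):
        return 0.0  # simplification: only equal-length TIRs are compared
    matches = sum(a == b for a, b in zip(left, right_rc))
    return matches / len(left)


def passes_structural_filter(left: str, right: str, tir_distance: int,
                             max_distance: int = 5000) -> bool:
    """Apply the length, distance, and similarity thresholds listed above."""
    return (
        10 <= len(left) <= 1000
        and 10 <= len(right) <= 1000
        and tir_distance <= max_distance
        and tir_similarity(left, right) >= 0.80
    )


left_tir = "CAGGGTTCAAAT"
right_tir = reverse_complement(left_tir)  # a perfect inverted repeat
print(passes_structural_filter(left_tir, right_tir, tir_distance=1500))
print(passes_structural_filter(left_tir, right_tir, tir_distance=6000))
```

A real aligner would allow gaps and unequal TIR lengths; the ungapped comparison is used here only to keep the example short.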
Superfamily-specific TSD patterns:
- DTA: 8bp
- DTC: 2-3bp
- DTH: 3bp (TWA)
- DTM: 7-10bp
- DTT: 2bp (TA)
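The TSD table above translates naturally into per-superfamily patterns. The sketch below encodes the lengths and motifs as listed, expanding the IUPAC code W to A or T; requiring the two flanking copies to be identical is a simplifying assumption, and `TSD_RULES` and `tsd_matches` are hypothetical names.

```python
import re

# One pattern per superfamily, taken from the list above (W = A or T).
TSD_RULES = {
    "DTA": re.compile(r"[ACGT]{8}"),
    "DTC": re.compile(r"[ACGT]{2,3}"),
    "DTH": re.compile(r"T[AT]A"),
    "DTM": re.compile(r"[ACGT]{7,10}"),
    "DTT": re.compile(r"TA"),
}


def tsd_matches(superfamily: str, left_flank: str, right_flank: str) -> bool:
    """Check that both flanks are identical and fit the superfamily's TSD pattern."""
    left, right = left_flank.upper(), right_flank.upper()
    rule = TSD_RULES[superfamily]
    # Assumption: a valid TSD is an exact duplication, so mismatches are not tolerated.
    return left == right and rule.fullmatch(left) is not None


print(tsd_matches("DTT", "TA", "TA"))
print(tsd_matches("DTH", "TGA", "TGA"))
```

Using `fullmatch` rather than `search` ensures that the whole flank, not just a substring of it, satisfies the length and motif constraints.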
- Uses sequence fragments from TIR regions
- Classifies into five major TIR superfamilies
- Trained on curated datasets
- Reduced temporary file generation
- In-memory data processing using Pandas
- Minimized external storage operations
- Python multiprocessing for TIRvish and GRF
- Swifter for pandas DataFrame parallel processing
Contributions are very welcome! Please feel free to submit a Pull Request.
The manuscript of TIR-Learner v3 is currently in preparation. Presentation slides: TIR-Learner v3: New generation TE annotation program for identifying TIRs
Previous versions:
- TIR-Learner v2 (Part of EDTA v1): Benchmarking transposable element annotation methods for creation of a streamlined, comprehensive pipeline
- TIR-Learner v1: TIR-Learner, a New Ensemble Method for TIR Transposable Element Annotation, Provides Evidence for Abundant New Transposable Elements in the Maize Genome
Special thanks to @oushujun and @WeijiaSu for their fantastic work on TIR-Learner v1 and v2! Their foundational work made this improved version possible.
This work was supported by:
- The Ou Lab at The Ohio State University
- The Ohio Supercomputer Center
- OSU Undergraduate Research Access Innovation Seed Grant
The development of TIR-Learner v3 would not have been possible without their generous support in providing opportunities, resources, and platforms for this research.
This project is licensed under the GPL-3.0 License - see the LICENSE file for details.