TIR-Learner v3

TIR-Learner is an ensemble pipeline for annotating Terminal Inverted Repeat (TIR) transposable elements in eukaryotic genomes. Version 3 is a complete redesign and rewrite that focuses on efficiency, compatibility, and code quality.

Background

Transposable elements (TEs) are DNA sequences that can move within a genome. Terminal Inverted Repeat (TIR) transposons are a specific type of TE characterized by inverted repeat sequences at their ends. These TIRs act like bookends, marking the beginning and end of the transposable element and helping enzymes recognize and move the element.
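
To make the "bookends" idea concrete, here is a minimal sketch that checks whether the two ends of a sequence are reverse complements of each other. The function names and the toy sequence are illustrative assumptions, not part of TIR-Learner, and real TIRs are usually imperfect, so an exact-match test like this is only a simplification.

```python
# Illustrative only: exact-match test for terminal inverted repeats.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def has_terminal_inverted_repeat(element: str, tir_len: int) -> bool:
    """True if the first tir_len bases equal the reverse complement
    of the last tir_len bases (perfect match, no gaps)."""
    if tir_len <= 0 or 2 * tir_len > len(element):
        return False
    left = element[:tir_len].upper()
    right = element[-tir_len:].upper()
    return left == reverse_complement(right)


# Toy element: 8 bp left arm, internal region, 8 bp right arm.
toy = "CAGGGTTA" + "ACGTACGTAC" + "TAACCCTG"
print(has_terminal_inverted_repeat(toy, 8))
```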

TIR-Learner combines multiple approaches to identify and classify TIR transposons:

Homology-based detection using curated reference libraries
De novo structural identification
Machine learning classification into TIR superfamilies

New in Version 3

Improved Efficiency

Reduced I/O operations through in-memory data processing using Pandas DataFrames
Multiprocessing support for both TIRvish and GRF tools
Optimized sequence processing algorithms

Enhanced Compatibility

Updated dependencies for better maintainability
New conda recipe for easier installation
PyTorch backend for the CNN model

New Features

Automatic genome file pre-scanning and validation
Checkpoint system for progress recovery
Overlap detection and simplification in results
Additional parallel execution modes
Progress tracking and verbose output options

Installation

Using Conda/Mamba (Recommended)

If you do not have conda/mamba installed, we strongly recommend Miniforge. For more information, see conda-forge/miniforge.

Note: In all installation commands, -c conda-forge MUST be specified before -c bioconda so that conda-forge takes priority over bioconda and the latest packages are installed (see Channels: Specifying channels when installing packages).

Install with a new environment

You can create a new environment with any name (we use TIRLearner_env as an example) and install TIR-Learner in it using the following one-liner command:

mamba create -n TIRLearner_env -c conda-forge -c bioconda tir-learner

Install in an existing environment

mamba install -c conda-forge -c bioconda tir-learner

Enforce PyTorch Build (CUDA or CPU-only)

Sometimes users may want to force PyTorch to use CPU only (or conversely force using CUDA). For example, while HPC nodes may have GPUs that trigger conda to automatically install CUDA-enabled PyTorch builds, users without GPU access permissions will encounter runtime errors while using TIR-Learner.

In such cases, you can request a CPU-only PyTorch build by adding "pytorch=*=*cpu*" to the installation command. For instance:

mamba create -n TIRLearner_env -c conda-forge -c bioconda tir-learner "pytorch=*=*cpu*"

Similarly, you can add "pytorch=*=*cuda*" to the installation command to enforce a CUDA build.
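
After installation, it can help to confirm which PyTorch build actually landed in the environment. The following is a small sketch using standard PyTorch attributes; the helper name describe_torch_build is an assumption made for this example and is not part of TIR-Learner.

```python
def describe_torch_build() -> str:
    """Summarise the installed PyTorch build (helper name is hypothetical)."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed in this environment."
    # torch.version.cuda is None when the build was compiled without CUDA.
    build = "CUDA" if torch.version.cuda is not None else "non-CUDA"
    # is_available() also requires a usable GPU and driver at runtime.
    gpu_usable = torch.cuda.is_available()
    return (
        f"PyTorch {torch.__version__}: {build} build, "
        f"GPU usable at runtime: {gpu_usable}"
    )


print(describe_torch_build())
```

A CUDA build that reports no usable GPU corresponds to the situation described above, where a CPU-only build is likely the safer choice.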

Manual Installation

Clone the repository and install all the dependencies. TIR-Learner.py is the entry point of the program.

Dependencies

Python >=3.8
BLAST
GenomeTools (gt)
GRF (Generic Repeat Finder)
Required Python packages:
- biopython
- keras >=3.3.3
- pandas
- pytorch
- regex
- scikit-learn >=1.3
- swifter
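
For a manual installation, a quick pre-flight check along the following lines can reveal missing dependencies. This is a sketch under assumptions: gt and grf-main are the executable names mentioned in this README, blastn is assumed here as a representative BLAST+ executable, and the import names (Bio, torch, sklearn) are the usual module names for the listed packages. Version constraints are not checked.

```python
import importlib.util
import shutil

# Executables expected on PATH. "blastn" is an assumption; the README
# only lists "BLAST" without naming a specific program.
EXECUTABLES = ["blastn", "gt", "grf-main"]

# Package name -> importable module name.
MODULES = {
    "biopython": "Bio",
    "keras": "keras",
    "pandas": "pandas",
    "pytorch": "torch",
    "regex": "regex",
    "scikit-learn": "sklearn",
    "swifter": "swifter",
}


def missing_dependencies(executables=EXECUTABLES, modules=MODULES):
    """Return the names of executables and packages that cannot be found."""
    missing = [exe for exe in executables if shutil.which(exe) is None]
    missing += [
        package
        for package, module in modules.items()
        if importlib.util.find_spec(module) is None
    ]
    return missing


print("Missing:", missing_dependencies() or "nothing")
```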

Usage

TIR-Learner [-h] [-v] -f GENOME_FILE -s SPECIES [-n GENOME_NAME] [-l LENGTH] [-p PROCESSOR] [-w WORKING_DIR] [-o OUTPUT_DIR] [-c [CHECKPOINT_DIR]] [--verbose] [-d] [--grf_path GRF_PATH] [--gt_path GT_PATH] [-a ADDITIONAL_ARGS]

Program Information

-h, --help
- Show help message and exit
-v, --version
- Show version information and exit

Required Arguments

-f, --genome_file
- Genome file in FASTA format
- Must be properly formatted with unique sequence names
-s, --species
- Species specification for analysis
- Options:
  - rice: Use rice-specific TIR reference library
  - maize: Use maize-specific TIR reference library
  - others: Use general analysis pipeline without species-specific references
- Determines the processing path: rice and maize use Part A, others uses Part B
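
Because the genome file must have unique sequence names, it may be worth checking this before a long run. The sketch below assumes that a sequence name is the first whitespace-delimited token of each FASTA header line; how TIR-Learner itself defines and validates names is not specified here, and the function name is hypothetical.

```python
from collections import Counter


def duplicate_sequence_names(fasta_lines):
    """Return sorted sequence names that occur more than once.

    Assumption: the name is the first whitespace-delimited token
    after '>' on each header line.
    """
    counts = Counter()
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            counts[fields[0] if fields else ""] += 1
    return sorted(name for name, count in counts.items() if count > 1)


example = [">chr1 assembled", "ACGT", ">chr2", "GGCC", ">chr1", "TTAA"]
print(duplicate_sequence_names(example))
```

The function accepts any iterable of lines, so an open file handle can be passed directly.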

Optional Arguments

Basic Configuration

-n, --genome_name (default: "TIR-Learner")
- Name prefix for output files
- Used in naming temporary files and final results
- Avoid using special characters
-l, --length (default: 5000)
- Maximum length of TIR transposons to detect
-p, -t, --processor (default: all available CPU cores)
- Number of processors to use for parallel operations

Directory Configuration

-w, --working_dir (default: temporary directory in /tmp)
- Directory for storing temporary files
- Will be automatically cleaned unless in debug mode
- Requires sufficient disk space for temporary files
-o, --output_dir (default: genome file directory)
- Directory for final output files
- Will be created if it doesn't exist
- Must have write permissions
-c, --checkpoint_dir
- Directory for checkpoint files to load
- Note: Checkpoint files will always be saved in the output directory by default, so this option is only used to load checkpoints from a different directory
- Options:
  - Specify with path (e.g. -c /path/to/checkpoint): Use given directory for checkpoints
  - Specify without path (e.g. -c): Automatically search in genome file and output directories
  - Do not specify: No checkpoint loading
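
The three states of -c (given with a path, given without a path, not given) map naturally onto argparse's optional-value pattern. The sketch below illustrates that pattern only; it is not TIR-Learner's actual argument parser, and the AUTO sentinel and checkpoint_mode helper are hypothetical.

```python
import argparse

# Hypothetical sentinel meaning "-c was given without a path".
AUTO = object()

parser = argparse.ArgumentParser(prog="checkpoint-demo")
# nargs="?" makes the value optional: const is used when the flag
# appears alone, default when the flag is absent.
parser.add_argument("-c", "--checkpoint_dir", nargs="?", const=AUTO, default=None)


def checkpoint_mode(argv):
    """Describe what the parsed -c value would mean."""
    value = parser.parse_args(argv).checkpoint_dir
    if value is None:
        return "no checkpoint loading"
    if value is AUTO:
        return "search genome file and output directories"
    return f"load from {value}"


for argv in ([], ["-c"], ["-c", "/path/to/checkpoint"]):
    print(argv, "->", checkpoint_mode(argv))
```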

Debug and Verbose Options

--verbose
- Enable detailed progress output
- Shows user-friendly progress bars
-d, --debug
- Enable debug mode
- Keeps all temporary files
- Stores additional checkpoint information

Path Configuration

--grf_path
- Path to GRF (Generic Repeat Finder)
- Required if GRF is not in system PATH
- Must point to directory containing grf-main
--gt_path
- Path to genometools
- Required if genometools is not in system PATH
- Must point to directory containing gt

Additional Arguments

-a, --additional_args
- Additional arguments to pass to the program
- Pass additional options individually using -a for each option
  Examples:
  - Single option: -a CHECKPOINT_OFF
  - Multiple options: -a CHECKPOINT_OFF -a SKIP_TIRVISH

Available options:

CHECKPOINT_OFF
- Completely disables checkpoint system
- No checkpoints will be saved or loaded
- Useful for one-off runs or when disk space is limited
NO_PARALLEL
- Disables multi-processing
- Forces serial execution of computation-intensive steps
- Useful for debugging or on systems with limited resources
SKIP_TIRVISH
- Skips TIRvish analysis step
- Reduces runtime but may miss some TIRs
- Cannot be used together with SKIP_GRF
SKIP_GRF
- Skips GRF analysis step
- Reduces runtime but may miss some TIRs
- Cannot be used together with SKIP_TIRVISH
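
The repeated -a usage and the SKIP_TIRVISH / SKIP_GRF exclusion can be sketched as follows. This is an illustration of the documented rules, not the program's real parsing code; the function name and error messages are assumptions.

```python
import argparse

KNOWN_OPTIONS = {"CHECKPOINT_OFF", "NO_PARALLEL", "SKIP_TIRVISH", "SKIP_GRF"}


def parse_additional_args(argv):
    """Collect repeated -a options and apply the documented constraints."""
    parser = argparse.ArgumentParser(prog="additional-args-demo")
    # action="append" gathers one value per -a occurrence.
    parser.add_argument("-a", "--additional_args", action="append", default=[])
    options = set(parser.parse_args(argv).additional_args)

    unknown = options - KNOWN_OPTIONS
    if unknown:
        raise ValueError(f"Unknown option(s): {sorted(unknown)}")
    # Skipping both de novo tools would leave nothing to run.
    if {"SKIP_TIRVISH", "SKIP_GRF"} <= options:
        raise ValueError("SKIP_TIRVISH and SKIP_GRF cannot be used together")
    return options


print(sorted(parse_additional_args(["-a", "CHECKPOINT_OFF", "-a", "SKIP_TIRVISH"])))
```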

Example

TIR-Learner -f ./test/genome.fa -s others -p 2 -l 5000 -c ./ -w ./test/ -d --verbose

Program Workflow

TIR-Learner v3 consists of two main processing paths:

Part A (Rice and Maize)

Uses three consecutive modules:

Module 1: Identify TIRs by homology to the existing TIR database
Module 2: Predict TIRs with de novo tools (TIRvish and GRF) and match them against the database to classify their superfamily
Module 3: Classify the superfamily of de novo predicted TIRs with the CNN

Part B (Other Species)

Uses a single module:

Module 4: Predict TIRs with de novo tools (TIRvish and GRF) and classify their superfamily with the CNN
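
The relationship between the species option and the modules that run can be summarised in a few lines. This is only a restatement of the routing described above; the function and constant names are hypothetical.

```python
PART_A_SPECIES = {"rice", "maize"}


def modules_for_species(species):
    """Return the module names that would run for a given -s value."""
    if species in PART_A_SPECIES:
        # Part A: homology search, database matching, then CNN.
        return ["Module 1", "Module 2", "Module 3"]
    if species == "others":
        # Part B: de novo prediction followed by CNN classification.
        return ["Module 4"]
    raise ValueError(f"Unsupported species: {species!r}")


for name in ("rice", "maize", "others"):
    print(name, "->", modules_for_species(name))
```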

Processing Flow

Pre-scan genome file
Route to Part A or B based on species
Execute relevant modules
Post-process results
- Combine predictions
- Remove overlaps
- Generate final output
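
The overlap-removal step can be pictured with a simple interval sweep. The policy used below (when two annotations on the same sequence overlap, keep the longer one) is an assumption chosen for illustration; the criteria TIR-Learner actually applies are not described in this README.

```python
def remove_overlaps(annotations):
    """Resolve overlaps among (seqid, start, end) tuples.

    Coordinates are 1-based and inclusive. Hypothetical policy:
    of two overlapping annotations, the longer one is kept.
    """
    kept = []
    for ann in sorted(annotations):
        prev = kept[-1] if kept else None
        if prev and prev[0] == ann[0] and ann[1] <= prev[2]:
            # Overlap on the same sequence: keep whichever is longer.
            if ann[2] - ann[1] > prev[2] - prev[1]:
                kept[-1] = ann
        else:
            kept.append(ann)
    return kept


predictions = [
    ("chr1", 100, 500),
    ("chr1", 400, 1200),
    ("chr1", 1300, 1500),
    ("chr2", 100, 500),
]
print(remove_overlaps(predictions))
```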

Output Files

TIR-Learner generates four output files in the TIR-Learner-Result directory:

Raw Results

GFF3 annotation file: <genome_name>_FinalAnn.gff3
- Contains predicted TIR locations and classifications
- Includes TIR and TSD details in attributes
FASTA sequence file: <genome_name>_FinalAnn.fa
- Contains extracted sequences for predicted TIRs
- Headers include location and classification information
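
For downstream analysis, the GFF3 output can be read with a few lines of Python. The nine tab-separated columns follow the GFF3 specification; the example record, including its source, type, and attribute keys, is invented for illustration and does not reflect TIR-Learner's actual output fields.

```python
GFF3_COLUMNS = [
    "seqid", "source", "type", "start", "end",
    "score", "strand", "phase", "attributes",
]


def parse_gff3_line(line):
    """Parse one GFF3 feature line into a dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"Expected 9 columns, got {len(fields)}")
    record = dict(zip(GFF3_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    # Column 9 holds semicolon-separated key=value pairs.
    record["attributes"] = dict(
        pair.split("=", 1)
        for pair in record["attributes"].split(";")
        if "=" in pair
    )
    return record


def read_gff3(lines):
    """Parse all feature lines, skipping blank lines and '#' lines."""
    return [
        parse_gff3_line(line)
        for line in lines
        if line.strip() and not line.startswith("#")
    ]


# Invented example record; real attribute names may differ.
sample = [
    "##gff-version 3\n",
    "chr1\texample\tfeature\t100\t500\t.\t+\t.\tID=example1;Note=placeholder\n",
]
for rec in read_gff3(sample):
    print(rec["seqid"], rec["start"], rec["end"], rec["attributes"])
```

To read a result file, pass an open file handle to read_gff3 in place of the sample list.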

Filtered Results

GFF3 annotation file: <genome_name>_FinalAnn_filter.gff3
FASTA sequence file: <genome_name>_FinalAnn_filter.fa

Algorithm Details

TIR Detection

Minimum TIR length: 10bp
Maximum TIR length: 1000bp
Maximum TIR distance: 5000bp (default)
Minimum similarity: 80%
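
The thresholds above can be expressed as a simple filter on a candidate pair of TIR arms. The sketch makes several assumptions that the README does not confirm: similarity is computed as ungapped identity between the left arm and the reverse complement of the right arm, both arms have equal length, and "distance" is a single number supplied by the caller. All names are hypothetical.

```python
MIN_TIR_LEN = 10         # bp
MAX_TIR_LEN = 1000       # bp
MAX_TIR_DISTANCE = 5000  # bp (default)
MIN_SIMILARITY = 0.80

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.upper().translate(COMPLEMENT)[::-1]


def passes_tir_thresholds(left_tir, right_tir, distance):
    """Apply the length, distance, and similarity thresholds.

    Simplification: ungapped identity on equal-length arms.
    """
    if len(left_tir) != len(right_tir):
        return False
    if not MIN_TIR_LEN <= len(left_tir) <= MAX_TIR_LEN:
        return False
    if distance > MAX_TIR_DISTANCE:
        return False
    right_rc = reverse_complement(right_tir)
    matches = sum(a == b for a, b in zip(left_tir.upper(), right_rc))
    return matches / len(left_tir) >= MIN_SIMILARITY


print(passes_tir_thresholds("CAGGGTTAAC", "GTTAACCCTG", 2000))  # perfect arms
print(passes_tir_thresholds("CAGGGTTAAC", "CAAAACCCTG", 2000))  # 7 of 10 match
```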

TSD Validation

Superfamily-specific TSD patterns:

DTA: 8bp
DTC: 2-3bp
DTH: 3bp (TWA)
DTM: 7-10bp
DTT: 2bp (TA)
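
These TSD patterns translate directly into regular expressions, with W being the IUPAC code for A or T. The sketch below additionally assumes that the two flanking copies of the TSD must be identical, which may be stricter than what TIR-Learner actually requires; the names are hypothetical.

```python
import re

# One pattern per superfamily, taken from the list above.
TSD_RULES = {
    "DTA": re.compile(r"[ACGT]{8}"),
    "DTC": re.compile(r"[ACGT]{2,3}"),
    "DTH": re.compile(r"T[AT]A"),       # TWA, W = A or T
    "DTM": re.compile(r"[ACGT]{7,10}"),
    "DTT": re.compile(r"TA"),
}


def is_valid_tsd(superfamily, left_flank, right_flank):
    """Check that both flanks carry the same TSD and that it fits
    the superfamily's pattern (exact duplication is an assumption)."""
    left, right = left_flank.upper(), right_flank.upper()
    rule = TSD_RULES[superfamily]
    return left == right and rule.fullmatch(left) is not None


print(is_valid_tsd("DTH", "TAA", "TAA"))
print(is_valid_tsd("DTH", "TCA", "TCA"))
```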

CNN Classification

Uses sequence fragments from TIR regions
Classifies into five major TIR superfamilies
Trained on curated datasets
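
As a rough picture of how sequence fragments can be turned into CNN input, the sketch below takes fragments from both ends of an element and one-hot encodes them. The encoding scheme, the fragment length, and the function names are all assumptions for illustration; the model's real preprocessing is not described in this README.

```python
BASES = "ACGT"


def terminal_fragments(element, fragment_len):
    """Concatenate the first and last fragment_len bases.

    fragment_len is a hypothetical parameter, not a documented setting.
    """
    if len(element) < 2 * fragment_len:
        raise ValueError("Element is shorter than two fragments")
    return element[:fragment_len] + element[-fragment_len:]


def one_hot(seq):
    """Encode each base as a 4-element row (A, C, G, T).

    Ambiguous bases such as N become an all-zero row.
    """
    return [[1 if base == b else 0 for b in BASES] for base in seq.upper()]


fragment = terminal_fragments("AAACCCGGGTTT", 3)
print(fragment)
print(one_hot(fragment))
```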

Performance Improvements

I/O Optimization

Reduced temporary file generation
In-memory data processing using Pandas
Minimized external storage operations

Parallel Processing

Python multiprocessing for TIRvish and GRF
Swifter for pandas DataFrame parallel processing

Contributing

Contributions are very welcome! Please feel free to submit a Pull Request.

Citation

The manuscript of TIR-Learner v3 is currently in preparation. Presentation slides:

TIR-Learner v3: New generation TE annotation program for identifying TIRs

Previous versions:

TIR-Learner v2 (Part of EDTA v1):
Benchmarking transposable element annotation methods for creation of a streamlined, comprehensive pipeline
TIR-Learner v1:
TIR-Learner, a New Ensemble Method for TIR Transposable Element Annotation, Provides Evidence for Abundant New Transposable Elements in the Maize Genome

Credits

Previous Versions

Special thanks to @oushujun and @WeijiaSu for their fantastic work on TIR-Learner v1 and v2! Their foundational work made this improved version possible.

Acknowledgments

This work was supported by:

The development of TIR-Learner v3 would not have been possible without their generous support in providing opportunities, resources, and platforms for this research.

License

This project is licensed under the GPL-3.0 License - see the LICENSE file for details.
