IsoRefiner is a refinement tool to identify exon-intron structures of transcript (RNA) isoforms using long reads. It employs multiple transcript-identification tools, filters erroneous structures, merges results from the tools, and constructs the final dataset including novel transcript structures. Its inputs are long reads and reference data (genome and annotation), and it outputs a refined dataset (GTF file). We tested IsoRefiner using Oxford Nanopore cDNA reads, although it can potentially accept other types of reads such as PacBio. We have submitted a paper describing the IsoRefiner algorithm, and it is under review.
Tanaka Y., Sunamura N., Kajitani R., Ikeguchi M., and Kunimoto R. Long-read RNA sequencing unveils a novel cryptic exon in MNAT1 along with its full-length transcript structure in TDP-43 proteinopathy. Communications Biology 8, 1056 (2025). https://doi.org/10.1038/s42003-025-08463-4
We tested IsoRefiner on Linux x86_64 environments. After installation, the isorefiner command becomes available.
conda install -y -c conda-forge -c bioconda isorefiner
The command above assumes Miniconda. Resolving dependencies may take a long time; using mamba instead of conda can shorten it.
Alternatively, you can install it and create a virtual environment simultaneously:
conda create -y -c conda-forge -c bioconda -n isorefiner_env python=3.12.8 isorefiner
conda activate isorefiner_env
python=3.12.8 is specified to shorten dependency resolution.
docker pull rkajitani/isorefiner
Start a container interactively:
docker run -it -v $(pwd):/work -w /work rkajitani/isorefiner /bin/bash
Or, run isorefiner as a command:
docker run -v $(pwd):/work -w /work --rm rkajitani/isorefiner isorefiner ...
The directory binding (-v $(pwd):/work -w /work) can be changed as needed.
Required tools are listed in the YAML file for conda. All of the required bioinformatics tools can be installed through the Bioconda channel.
cd test/isorefiner
bash cmd.sh
A dataset and a script are provided in the test directory. If the test succeeds, isorefiner_refined.gtf is written. The test dataset was generated with the simulator SQANTI-SIM.
isorefiner trans_struct_wf -r reads.fastq -g genome.fasta -a ref_annot.gtf -t 32
The command above runs the workflow that refines transcript structures; its subcommands are used internally. reads.fastq is the input long-read file (FASTQ or FASTA, gzip allowed). Multiple files can be specified as a space-delimited string (e.g., "reads_1.fastq reads_2.fastq"). genome.fasta and ref_annot.gtf are the reference genome and annotation, respectively. The number of threads (parallelization) is set to 32 in this command. The final result is isorefiner_refined.gtf.
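As a variation on the basic command, the sketch below passes two read files and sets the output name and working directory explicitly with -o and -d. It assumes, from the usage line `-r [READS ...]`, that the files are passed as separate arguments rather than as one quoted string. All file and directory names are placeholders, and the guard only reports a message when isorefiner is not on the PATH.

```shell
# Placeholder names; replace with real paths.
reads="reads_1.fastq reads_2.fastq"
out_gtf=sample1_refined.gtf
work_dir=sample1_work

if command -v isorefiner >/dev/null 2>&1; then
    # $reads is unquoted on purpose so that each file is a separate argument.
    isorefiner trans_struct_wf -r $reads -g genome.fasta -a ref_annot.gtf \
        -o "$out_gtf" -d "$work_dir" -t 32
else
    echo "isorefiner not found on PATH" >&2
fi
```

Setting -o and -d per sample keeps the results of several runs in the same directory from overwriting each other.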
The command below runs an end-to-end workflow that uses the subcommands internally. Although detailed parameters for the internal steps cannot be specified, it lets you run the workflow without preparing a complex shell script. Intermediate files are written to the directory named isorefiner_{command}_work (default) or to the directory given by the -d argument, and a log file named log.txt is created in the same directory.
Workflow of transcript-structure refinement.
isorefiner trans_struct_wf [-h] -r [READS ...] -g GENOME -a REF_GTF [-o OUT_GTF] [-d WORK_DIR] [-t THREADS]
options:
-h, --help show this help message and exit
-r [READS ...], --reads [READS ...]
Reads (FASTQ or FASTA, gzip allowed, mandatory) (default: None)
-g GENOME, --genome GENOME
Reference genome (FASTA, mandatory) (default: None)
-a REF_GTF, --ref_gtf REF_GTF
Reference genome annotation (GTF, mandatory) (default: None)
-o OUT_GTF, --out_gtf OUT_GTF
Final output file name (GTF) (default: isorefiner_refined.gtf)
-d WORK_DIR, --work_dir WORK_DIR
Working directory containing intermediate and log files (default: isorefiner_trans_struct_wf_work)
-t THREADS, --threads THREADS
Number of threads (default: 1)
output: isorefiner_refined.gtf (-o argument)
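After a run finishes or stops, the log file in the working directory is the natural first place to look. The following is a minimal sketch for inspecting it. It assumes the default working directory, and the search for the word "error" is only a guess at what a failure might look like, since the log format is not described here.

```shell
work_dir=isorefiner_trans_struct_wf_work   # default; use the -d value if one was given
log_file="$work_dir/log.txt"

if [ -f "$log_file" ]; then
    tail -n 20 "$log_file"   # last lines of the run
    # Case-insensitive search; the wording of real log messages is an assumption.
    grep -i -n "error" "$log_file" || echo "no lines mentioning 'error'"
else
    echo "no log found at $log_file"
fi
```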
Each command below corresponds to a specific step of the workflow. To specify detailed parameters, run these commands directly with their options. An example step-by-step procedure is provided in step_by_step.sh. Intermediate files are written to the directory named isorefiner_{command}_work (default) or to the directory given by the -d argument, and a log file named log.txt is created in the same directory.
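To show how the individual commands might fit together, here is a hedged dry-run sketch. It is not the repository's step_by_step.sh. The order of the steps, the use of trimmed reads for every later step, and the `filtered_*.gtf` and `filter_*_work` names are assumptions made for illustration; only the subcommands, flags, and default output names are taken from the usage text in this document. The `run` helper is hypothetical: it prints each command and executes it only when RUN=1 is set.

```shell
# Dry-run helper: always prints the command; executes it only when RUN=1.
run() {
    echo "+ $*"
    if [ "${RUN:-0}" = "1" ]; then "$@"; fi
}

threads=8
genome=genome.fasta
annot=ref_annot.gtf
reads=isorefiner_trimmed.fastq   # default output of trim for a single input
bam=isorefiner_mapped.bam        # default output of map for a single input

run isorefiner trim -r reads.fastq -t "$threads"
run isorefiner map -r "$reads" -g "$genome" -t "$threads"

# The mapping-based tools share the same -b/-g/-a interface.
for tool in bambu espresso isoquant stringtie; do
    run isorefiner "run_$tool" -b "$bam" -g "$genome" -a "$annot" -t "$threads"
done
# The assembly-based tool takes reads instead of BAM files.
run isorefiner run_rnabloom -r "$reads" -g "$genome" -t "$threads"

# Filter each GTF separately; -o and -d keep the runs from overwriting each other.
filtered=""
for tool in bambu espresso isoquant stringtie rnabloom; do
    run isorefiner filter -i "isorefiner_$tool.gtf" -r "$reads" -g "$genome" \
        -o "filtered_$tool.gtf" -d "filter_${tool}_work" -t "$threads"
    filtered="$filtered filtered_$tool.gtf"
done

# $filtered is unquoted on purpose so that each GTF is a separate argument.
run isorefiner refine -i $filtered -r "$reads" -g "$genome" -a "$annot" -t "$threads"
```

Running the script as is only prints the planned commands, which makes it easy to review them before executing with RUN=1.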
Trim nanopore reads using Porechop_ABI.
isorefiner trim [-h] -r [READS ...] [-o OUT_PREFIX] [-d WORK_DIR] [-t THREADS] [-p TOOL_OPTION]
options:
-h, --help show this help message and exit
-r [READS ...], --reads [READS ...]
Reads (FASTQ or FASTA, gzip allowed, mandatory) (default: None)
-o OUT_PREFIX, --out_prefix OUT_PREFIX
Prefix of final output files (extensions are those of input files) (default: isorefiner_trimmed)
-d WORK_DIR, --work_dir WORK_DIR
Working directory containing intermediate and log files (default: isorefiner_trim_work)
-t THREADS, --threads THREADS
Number of threads (default: 1)
-p TOOL_OPTION, --tool_option TOOL_OPTION
Option for Porechop_ABI (quoted string) (default: )
output: isorefiner_trimmed.fastq ({-o argument}.fastq)
With multiple input files: isorefiner_trimmed_1.fastq isorefiner_trimmed_2.fastq ...
File extensions are inherited from the input files.
Map reads to the reference genome using Minimap2, and sort BAM files.
isorefiner map [-h] -r [READS ...] -g GENOME [-o OUT_PREFIX] [-d WORK_DIR] [-t THREADS] [-m MM2_OPTION] [-s SORT_OPTION]
options:
-h, --help show this help message and exit
-r [READS ...], --reads [READS ...]
Reads (FASTQ or FASTA, gzip allowed, mandatory) (default: None)
-g GENOME, --genome GENOME
Reference genome (FASTA, mandatory) (default: None)
-o OUT_PREFIX, --out_prefix OUT_PREFIX
Prefix of output BAM files (default: isorefiner_mapped)
-d WORK_DIR, --work_dir WORK_DIR
Working directory containing intermediate and log files (default: isorefiner_map_work)
-t THREADS, --threads THREADS
Number of threads (default: 1)
-m MM2_OPTION, --mm2_option MM2_OPTION
Option for minimap2 (quoted string) (default: -x splice -ub -k14 --secondary=no)
-s SORT_OPTION, --sort_option SORT_OPTION
Option for samtools sort (quoted string) (default: -m 2G)
output: isorefiner_mapped.bam ({-o argument}.bam)
With multiple input files: isorefiner_mapped_1.bam isorefiner_mapped_2.bam ...
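The -m and -s options each take one quoted string that itself begins with a dash. Writing the long option with `=` is a cautious way to keep such a value from being read as a separate option. The sketch below raises the per-thread memory of samtools sort from the default `-m 2G` to `-m 4G`; the value 4G, the thread count, and the file names are illustrative assumptions.

```shell
sort_opt="-m 4G"   # passed through to samtools sort (illustrative value)

if command -v isorefiner >/dev/null 2>&1; then
    isorefiner map -r isorefiner_trimmed.fastq -g genome.fasta -t 16 \
        --sort_option="$sort_opt"
else
    echo "isorefiner not found on PATH" >&2
fi
```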
Run Bambu (read mapping-based tool).
isorefiner run_bambu [-h] -b [BAM ...] -g GENOME -a REF_GTF [-o OUT_GTF] [-d WORK_DIR] [-t THREADS]
options:
-h, --help show this help message and exit
-b [BAM ...], --bam [BAM ...]
Mapped reads files (BAM, mandatory) (default: None)
-g GENOME, --genome GENOME
Reference genome (FASTA, mandatory) (default: None)
-a REF_GTF, --ref_gtf REF_GTF
Reference genome annotation (GTF, mandatory) (default: None)
-o OUT_GTF, --out_gtf OUT_GTF
Final output file name (GTF) (default: isorefiner_bambu.gtf)
-d WORK_DIR, --work_dir WORK_DIR
Working directory containing intermediate and log files (default: isorefiner_bambu_work)
-t THREADS, --threads THREADS
Number of threads (default: 1)
output: isorefiner_bambu.gtf (-o argument)
Run ESPRESSO (read mapping-based tool).
isorefiner run_espresso [-h] -b [BAM ...] -g GENOME -a REF_GTF [-o OUT_GTF] [-d WORK_DIR] [-t THREADS] [-s TOOL_S_OPTION] [-c TOOL_C_OPTION] [-q TOOL_Q_OPTION]
options:
-h, --help show this help message and exit
-b [BAM ...], --bam [BAM ...]
Mapped reads files (BAM, mandatory) (default: None)
-g GENOME, --genome GENOME
Reference genome (FASTA, mandatory) (default: None)
-a REF_GTF, --ref_gtf REF_GTF
Reference genome annotation (GTF, mandatory) (default: None)
-o OUT_GTF, --out_gtf OUT_GTF
Final output file name (GTF) (default: isorefiner_espresso.gtf)
-d WORK_DIR, --work_dir WORK_DIR
Working directory containing intermediate and log files (default: isorefiner_espresso_work)
-t THREADS, --threads THREADS
Number of threads (default: 1)
-s TOOL_S_OPTION, --tool_s_option TOOL_S_OPTION
Option for ESPRESSO_S.pl (quoted string) (default: )
-c TOOL_C_OPTION, --tool_c_option TOOL_C_OPTION
Option for ESPRESSO_C.pl (quoted string) (default: )
-q TOOL_Q_OPTION, --tool_q_option TOOL_Q_OPTION
Option for ESPRESSO_Q.pl (quoted string) (default: )
output: isorefiner_espresso.gtf (-o argument)
Run IsoQuant (read mapping-based tool).
isorefiner run_isoquant [-h] -b [BAM ...] -g GENOME -a REF_GTF [-o OUT_GTF] [-d WORK_DIR] [-t THREADS] [-p TOOL_OPTION]
options:
-h, --help show this help message and exit
-b [BAM ...], --bam [BAM ...]
Mapped reads files (BAM, mandatory) (default: None)
-g GENOME, --genome GENOME
Reference genome (FASTA, mandatory) (default: None)
-a REF_GTF, --ref_gtf REF_GTF
Reference genome annotation (GTF, mandatory) (default: None)
-o OUT_GTF, --out_gtf OUT_GTF
Final output file name (GTF) (default: isorefiner_isoquant.gtf)
-d WORK_DIR, --work_dir WORK_DIR
Working directory containing intermediate and log files (default: isorefiner_isoquant_work)
-t THREADS, --threads THREADS
Number of threads (default: 1)
-p TOOL_OPTION, --tool_option TOOL_OPTION
Option for isoquant (quoted string) (default: --complete_genedb --data_type nanopore --stranded none --transcript_quantification unique_only
--gene_quantification unique_only --matching_strategy default --splice_correction_strategy default_ont --model_construction_strategy default_ont
--no_secondary --check_canonical --count_exons)
output: isorefiner_isoquant.gtf (-o argument)
Run StringTie (read mapping-based tool).
isorefiner run_stringtie [-h] -b [BAM ...] [-g GENOME] -a REF_GTF [-o OUT_GTF] [-d WORK_DIR] [-t THREADS] [-p TOOL_OPTION]
options:
-h, --help show this help message and exit
-b [BAM ...], --bam [BAM ...]
Mapped reads files (BAM, mandatory) (default: None)
-g GENOME, --genome GENOME
Reference genome (FASTA) (default: None)
-a REF_GTF, --ref_gtf REF_GTF
Reference genome annotation (GTF, mandatory) (default: None)
-o OUT_GTF, --out_gtf OUT_GTF
Final output file name (GTF) (default: isorefiner_stringtie.gtf)
-d WORK_DIR, --work_dir WORK_DIR
Working directory containing intermediate and log files (default: isorefiner_stringtie_work)
-t THREADS, --threads THREADS
Number of threads (default: 1)
-p TOOL_OPTION, --tool_option TOOL_OPTION
Option for StringTie (quoted string) (default: )
output: isorefiner_stringtie.gtf (-o argument)
Run RNA-Bloom (de novo assembly-based tool) and GMAP (contig mapping).
isorefiner run_rnabloom [-h] -r [READS ...] [-g GENOME] [-o OUT_GTF] [-d WORK_DIR] [-t THREADS] [--max_mem MAX_MEM] [--rnabloom_option RNABLOOM_OPTION]
[--gmap_min_cov GMAP_MIN_COV] [--gmap_min_idt GMAP_MIN_IDT] [--gmap_max_intron GMAP_MAX_INTRON] [--gmap_option GMAP_OPTION]
options:
-h, --help show this help message and exit
-r [READS ...], --reads [READS ...]
Reads (FASTQ or FASTA, gzip allowed, mandatory) (default: None)
-g GENOME, --genome GENOME
Reference genome (FASTA) (default: None)
-o OUT_GTF, --out_gtf OUT_GTF
Final output file name (GTF) (default: isorefiner_rnabloom.gtf)
-d WORK_DIR, --work_dir WORK_DIR
Working directory containing intermediate and log files (default: isorefiner_rnabloom_work)
-t THREADS, --threads THREADS
Number of threads (default: 1)
--max_mem MAX_MEM Max memory for RNA-Bloom (java -Xmx) (default: 400g)
--rnabloom_option RNABLOOM_OPTION
Option for RNA-Bloom (quoted string) (default: )
--gmap_min_cov GMAP_MIN_COV
Min alignment coverage for GMAP [0-1] (default: 0.5)
--gmap_min_idt GMAP_MIN_IDT
Min identity for GMAP [0-1] (default: 0.95)
--gmap_max_intron GMAP_MAX_INTRON
Max intron length for GMAP (bp) (default: 100000)
--gmap_option GMAP_OPTION
Option for GMAP (quoted string) (default: -n 1 --no-chimeras)
output: isorefiner_rnabloom.gtf (-o argument)
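The default `--max_mem 400g` exceeds the memory of many machines. As a minimal sketch, the snippet below derives a smaller value from the host's physical memory on Linux. The 80% fraction and the 64 GB fallback are arbitrary illustrative choices, and whether RNA-Bloom completes within a given limit depends on the data.

```shell
# Use about 80% of physical memory (MemTotal in /proc/meminfo is in kB), at least 1 GB.
if [ -r /proc/meminfo ]; then
    mem_gb=$(awk '/^MemTotal:/ { g = int($2 / 1024 / 1024 * 0.8); print (g < 1 ? 1 : g) }' /proc/meminfo)
fi
mem_gb=${mem_gb:-64}   # arbitrary fallback when the value could not be read
max_mem="${mem_gb}g"
echo "Using --max_mem $max_mem"

if command -v isorefiner >/dev/null 2>&1; then
    isorefiner run_rnabloom -r isorefiner_trimmed.fastq -g genome.fasta \
        --max_mem "$max_mem" -t 16
else
    echo "isorefiner not found on PATH" >&2
fi
```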
Filter transcript isoforms (GTF format).
isorefiner filter [-h] -i INPUT_GTF -r [READS ...] -g GENOME [-o OUT_GTF] [-d WORK_DIR] [-t THREADS] [--max_indel MAX_INDEL] [--max_clip MAX_CLIP] [--min_idt MIN_IDT]
[--min_cov MIN_COV] [--min_mean_depth MIN_MEAN_DEPTH]
options:
-h, --help show this help message and exit
-i INPUT_GTF, --input_gtf INPUT_GTF
Input transcript isoform structures (GTF, mandatory) (default: None)
-r [READS ...], --reads [READS ...]
Reads (FASTQ or FASTA, gzip allowed, mandatory) (default: None)
-g GENOME, --genome GENOME
Reference genome (FASTA, mandatory) (default: None)
-o OUT_GTF, --out_gtf OUT_GTF
Final output file name (GTF) (default: isorefiner_filtered.gtf)
-d WORK_DIR, --work_dir WORK_DIR
Working directory containing intermediate and log files (default: isorefiner_filter_work)
-t THREADS, --threads THREADS
Number of threads (default: 1)
--max_indel MAX_INDEL
Max indel for read mapping (default: 20)
--max_clip MAX_CLIP Max clip (unaligned) length for read mapping (default: 200)
--min_idt MIN_IDT Min identity for read mapping [0-1] (default: 0.9)
--min_cov MIN_COV Min coverage for filtering [0-1] (default: 0.95)
--min_mean_depth MIN_MEAN_DEPTH
Min mean coverage depth for filtering (default: 1.0)
output: isorefiner_filtered.gtf (-o argument)
Merge and refine transcript isoforms.
isorefiner refine [-h] -i [INPUT_GTF ...] -r [READS ...] -g GENOME -a REF_GTF [-o OUT_GTF] [-d WORK_DIR] [-t THREADS] [--max_indel MAX_INDEL] [--max_clip MAX_CLIP]
[--min_idt MIN_IDT] [--intron_dist_th INTRON_DIST_TH]
options:
-h, --help show this help message and exit
-i [INPUT_GTF ...], --input_gtf [INPUT_GTF ...]
Input transcript isoform structures (GTF, mandatory) (default: None)
-r [READS ...], --reads [READS ...]
Reads (FASTQ or FASTA, gzip allowed, mandatory) (default: None)
-g GENOME, --genome GENOME
Reference genome (FASTA, mandatory) (default: None)
-a REF_GTF, --ref_gtf REF_GTF
Reference genome annotation (GTF, mandatory) (default: None)
-o OUT_GTF, --out_gtf OUT_GTF
Final output file name (GTF) (default: isorefiner_refined.gtf)
-d WORK_DIR, --work_dir WORK_DIR
Working directory containing intermediate and log files (default: isorefiner_refine_work)
-t THREADS, --threads THREADS
Number of threads (default: 1)
--max_indel MAX_INDEL
Max indel for read mapping (default: 20)
--max_clip MAX_CLIP Max clip (unaligned) length for read mapping (default: 200)
--min_idt MIN_IDT Min identity for read mapping [0-1] (default: 0.9)
--intron_dist_th INTRON_DIST_TH
Intron distance threshold to exclude erroneous isoforms (default: 20)
output: isorefiner_refined.gtf (-o argument)
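A quick sanity check on the final GTF is to count its transcript records, for example to compare against the reference annotation. The helper below is a generic sketch, not part of IsoRefiner. `count_transcripts` is a hypothetical name, and it assumes the file follows the usual tab-separated GTF layout with "transcript" in the third column. The demonstration uses a tiny synthetic file, not real IsoRefiner output.

```shell
# Count records whose feature column (3rd) is "transcript".
# Reads the given file(s), or standard input when no file is given.
count_transcripts() {
    awk -F '\t' '$3 == "transcript" { n++ } END { print n + 0 }' "$@"
}

# Tiny synthetic GTF for demonstration only.
demo_gtf=$(mktemp)
printf 'chr1\tdemo\ttranscript\t100\t900\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\n' > "$demo_gtf"
printf 'chr1\tdemo\texon\t100\t300\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\n' >> "$demo_gtf"
printf 'chr1\tdemo\ttranscript\t100\t900\t.\t+\t.\tgene_id "g1"; transcript_id "t2";\n' >> "$demo_gtf"
count_transcripts "$demo_gtf"   # prints 2
rm -f "$demo_gtf"
```

On a real run, the same function can be applied to both files, e.g. `count_transcripts ref_annot.gtf` and `count_transcripts isorefiner_refined.gtf`.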