IsoRefiner

IsoRefiner is a tool for identifying and refining the exon-intron structures of transcript (RNA) isoforms from long reads. It runs multiple transcript-identification tools, filters out erroneous structures, merges the results of the tools, and constructs a final dataset that includes novel transcript structures. Its inputs are long reads and reference data (genome and annotation); its output is a refined dataset (GTF file). We tested IsoRefiner with Oxford Nanopore cDNA reads, although it can potentially accept other read types such as PacBio. A paper describing the IsoRefiner algorithm has been submitted and is under review.

Publication

Tanaka Y., Sunamura N., Kajitani R., Ikeguchi M., and Kunimoto R. Long-read RNA sequencing unveils a novel cryptic exon in MNAT1 along with its full-length transcript structure in TDP-43 proteinopathy. Communications Biology 8, 1056 (2025). https://doi.org/10.1038/s42003-025-08463-4

Installation

We tested IsoRefiner in Linux x86_64 environments. After installation, the isorefiner command is available.

Bioconda

conda install -y -c conda-forge -c bioconda isorefiner

This command assumes Miniconda. Solving dependencies may take a long time; using mamba instead of conda can reduce it.
Alternatively, you can install IsoRefiner and create a virtual environment in one step:

conda create -y -c conda-forge -c bioconda -n isorefiner_env python=3.12.8 isorefiner
conda activate isorefiner_env

python=3.12.8 is pinned to shorten dependency solving.

Docker

docker pull rkajitani/isorefiner

Start a container interactively:

docker run -it -v $(pwd):/work -w /work rkajitani/isorefiner /bin/bash

Or, run isorefiner as a command:

docker run -v $(pwd):/work -w /work --rm rkajitani/isorefiner isorefiner ...

The directory binding (-v $(pwd):/work -w /work) can be changed as needed.

Dependency

Required tools are listed in the conda YAML file (conda.yml). All of the required bioinformatics tools can be installed through the Bioconda channel.

Test

cd test/isorefiner
bash cmd.sh

A test dataset and script are in the test directory. If the test succeeds, isorefiner_refined.gtf is written. The test dataset was generated with the simulator SQANTI-SIM.

Quick start

isorefiner trans_struct_wf -r reads.fastq -g genome.fasta -a ref_annot.gtf -t 32

The command above runs the workflow that refines transcript structures; the subcommands are called internally. reads.fastq is a file of input long reads (FASTQ or FASTA, gzip allowed). Multiple files can be specified as a space-delimited string (e.g., "reads_1.fastq reads_2.fastq"). genome.fasta and ref_annot.gtf are the reference genome and annotation, respectively. This command uses 32 threads (-t 32). The final result is isorefiner_refined.gtf.
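For scripted runs, the quick-start command can be assembled programmatically. The Python sketch below only builds and prints the command line; it does not run IsoRefiner. The helper name build_workflow_cmd is hypothetical, and passing each read file as its own argument is an assumption based on the usage line -r [READS ...].

```python
import shlex


def build_workflow_cmd(reads, genome, ref_gtf, threads=1, out_gtf=None, work_dir=None):
    """Return the argv list for `isorefiner trans_struct_wf` (hypothetical helper)."""
    if not reads:
        raise ValueError("at least one read file is required")
    cmd = [
        "isorefiner", "trans_struct_wf",
        "-r", *reads,          # assumption: one argument per read file
        "-g", genome,
        "-a", ref_gtf,
        "-t", str(threads),
    ]
    if out_gtf is not None:
        cmd += ["-o", out_gtf]     # otherwise the default isorefiner_refined.gtf is used
    if work_dir is not None:
        cmd += ["-d", work_dir]    # otherwise the default working directory is used
    return cmd


cmd = build_workflow_cmd(
    ["reads_1.fastq", "reads_2.fastq"], "genome.fasta", "ref_annot.gtf", threads=32
)
print(shlex.join(cmd))
```

To execute the command, the list can be passed to subprocess.run(cmd, check=True) in an environment where isorefiner is installed.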

Workflow command usage

The command below runs an end-to-end workflow, which uses the subcommands internally. Although detailed parameters for the internal steps cannot be specified, it lets you run the whole workflow without preparing a complex shell script. Intermediate files are written to the directory isorefiner_{command}_work (default) or to the directory given by -d, and a log file named log.txt is created in the same directory.

trans_struct_wf

Workflow of transcript-structure refinement.

isorefiner trans_struct_wf [-h] -r [READS ...] -g GENOME -a REF_GTF [-o OUT_GTF] [-d WORK_DIR] [-t THREADS]

options:
  -h, --help            show this help message and exit
  -r [READS ...], --reads [READS ...]
                        Reads (FASTQ or FASTA, gzip allowed, mandatory) (default: None)
  -g GENOME, --genome GENOME
                        Reference genome (FASTA, mandatory) (default: None)
  -a REF_GTF, --ref_gtf REF_GTF
                        Reference genome annotation (GTF, mandatory) (default: None)
  -o OUT_GTF, --out_gtf OUT_GTF
                        Final output file name (GTF) (default: isorefiner_refined.gtf)
  -d WORK_DIR, --work_dir WORK_DIR
                        Working directory containing intermediate and log files (default: isorefiner_trans_struct_wf_work)
  -t THREADS, --threads THREADS
                        Number of threads (default: 1)

output: isorefiner_refined.gtf (-o argument)

Command usage for each step

Each command below corresponds to one step of the workflow. To specify detailed parameters, run these commands directly with the relevant options. An example step-by-step procedure is given in step_by_step.sh. Intermediate files are written to the directory isorefiner_{command}_work (default) or to the directory given by -d, and a log file named log.txt is created in the same directory.
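As a rough orientation, the Python sketch below prints one possible chain of the subcommands documented in this section. It is not a copy of step_by_step.sh, which remains the reference. The step order (trim, map, run_*, filter, refine), the use of trimmed reads in later steps, and the per-tool filter output and working-directory names are all assumptions made for illustration. It also assumes a single uncompressed .fastq input, so that the default output names apply.

```python
import shlex

# Tools that take BAM input, per the usage lines in this section.
TOOLS_FROM_BAM = ["bambu", "espresso", "isoquant", "stringtie"]


def step_commands(reads, genome, ref_gtf, threads=1):
    """Return argv lists for an assumed order: trim, map, run_*, filter, refine."""
    t = ["-t", str(threads)]
    trimmed = "isorefiner_trimmed.fastq"  # default `trim` output for one .fastq input
    bam = "isorefiner_mapped.bam"         # default `map` output for one input
    cmds = [
        ["isorefiner", "trim", "-r", reads, *t],
        ["isorefiner", "map", "-r", trimmed, "-g", genome, *t],
    ]
    for tool in TOOLS_FROM_BAM:
        cmds.append(["isorefiner", f"run_{tool}", "-b", bam, "-g", genome, "-a", ref_gtf, *t])
    cmds.append(["isorefiner", "run_rnabloom", "-r", trimmed, "-g", genome, *t])

    filtered = []
    for tool in TOOLS_FROM_BAM + ["rnabloom"]:
        out = f"isorefiner_filtered_{tool}.gtf"      # hypothetical output name
        work = f"isorefiner_filter_{tool}_work"      # hypothetical working directory
        cmds.append([
            "isorefiner", "filter", "-i", f"isorefiner_{tool}.gtf",
            "-r", trimmed, "-g", genome, "-o", out, "-d", work, *t,
        ])
        filtered.append(out)

    # `refine` accepts several input GTF files after -i.
    cmds.append(["isorefiner", "refine", "-i", *filtered, "-r", trimmed,
                 "-g", genome, "-a", ref_gtf, *t])
    return cmds


cmds = step_commands("reads.fastq", "genome.fasta", "ref_annot.gtf", threads=8)
for cmd in cmds:
    print(shlex.join(cmd))
```

The sketch gives each filter run its own -o and -d so that the five runs do not share the default output file or working directory.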

trim

Trim nanopore reads using Porechop_ABI.

isorefiner trim [-h] -r [READS ...] [-o OUT_PREFIX] [-d WORK_DIR] [-t THREADS] [-p TOOL_OPTION]

options:
  -h, --help            show this help message and exit
  -r [READS ...], --reads [READS ...]
                        Reads (FASTQ or FASTA, gzip allowed, mandatory) (default: None)
  -o OUT_PREFIX, --out_prefix OUT_PREFIX
                        Prefix of final output files (extensions are those of input files) (default: isorefiner_trimmed)
  -d WORK_DIR, --work_dir WORK_DIR
                        Working directory containing intermediate and log files (default: isorefiner_trim_work)
  -t THREADS, --threads THREADS
                        Number of threads (default: 1)
  -p TOOL_OPTION, --tool_option TOOL_OPTION
                        Option for Porechop_ABI (quoted string) (default: )

output: isorefiner_trimmed.fastq ({-o argument}.fastq)
  With multiple input files: isorefiner_trimmed_1.fastq isorefiner_trimmed_2.fastq ...
  File extensions are inherited from the input files.

map

Map reads to the reference genome using Minimap2, and sort BAM files.

isorefiner map [-h] -r [READS ...] -g GENOME [-o OUT_PREFIX] [-d WORK_DIR] [-t THREADS] [-m MM2_OPTION] [-s SORT_OPTION]

options:
  -h, --help            show this help message and exit
  -r [READS ...], --reads [READS ...]
                        Reads (FASTQ or FASTA, gzip allowed, mandatory) (default: None)
  -g GENOME, --genome GENOME
                        Reference genome (FASTA, mandatory) (default: None)
  -o OUT_PREFIX, --out_prefix OUT_PREFIX
                        Prefix of output BAM files (default: isorefiner_mapped)
  -d WORK_DIR, --work_dir WORK_DIR
                        Working directory containing intermediate and log files (default: isorefiner_map_work)
  -t THREADS, --threads THREADS
                        Number of threads (default: 1)
  -m MM2_OPTION, --mm2_option MM2_OPTION
                        Option for minimap2 (quoted string) (default: -x splice -ub -k14 --secondary=no)
  -s SORT_OPTION, --sort_option SORT_OPTION
                        Option for samtools sort (quoted string) (default: -m 2G)

output: isorefiner_mapped.bam ({-o argument}.bam)
  With multiple input files: isorefiner_mapped_1.bam isorefiner_mapped_2.bam ...
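When several read files are mapped, the resulting BAM names are needed for the -b argument of the run_* commands. The small Python sketch below derives them from the naming described above. The helper name expected_bam_names is hypothetical, and the numbering rule (no suffix for one input, 1-based suffixes otherwise) is inferred only from the two examples in this README.

```python
def expected_bam_names(n_inputs, prefix="isorefiner_mapped"):
    """Predict BAM file names written by `isorefiner map` (inferred naming rule)."""
    if n_inputs < 1:
        raise ValueError("n_inputs must be at least 1")
    if n_inputs == 1:
        return [f"{prefix}.bam"]
    # Assumption: suffixes are 1-based and follow the order of the input files.
    return [f"{prefix}_{i}.bam" for i in range(1, n_inputs + 1)]


print(expected_bam_names(1))
print(expected_bam_names(2))
```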

run_bambu

Run Bambu (read mapping-based tool).

isorefiner run_bambu [-h] -b [BAM ...] -g GENOME -a REF_GTF [-o OUT_GTF] [-d WORK_DIR] [-t THREADS]

options:
  -h, --help            show this help message and exit
  -b [BAM ...], --bam [BAM ...]
                        Mapped reads files (BAM, mandatory) (default: None)
  -g GENOME, --genome GENOME
                        Reference genome (FASTA, mandatory) (default: None)
  -a REF_GTF, --ref_gtf REF_GTF
                        Reference genome annotation (GTF, mandatory) (default: None)
  -o OUT_GTF, --out_gtf OUT_GTF
                        Final output file name (GTF) (default: isorefiner_bambu.gtf)
  -d WORK_DIR, --work_dir WORK_DIR
                        Working directory containing intermediate and log files (default: isorefiner_bambu_work)
  -t THREADS, --threads THREADS
                        Number of threads (default: 1)

output: isorefiner_bambu.gtf (-o argument)

run_espresso

Run ESPRESSO (read mapping-based tool).

isorefiner run_espresso [-h] -b [BAM ...] -g GENOME -a REF_GTF [-o OUT_GTF] [-d WORK_DIR] [-t THREADS] [-s TOOL_S_OPTION] [-c TOOL_C_OPTION] [-q TOOL_Q_OPTION]

options:
  -h, --help            show this help message and exit
  -b [BAM ...], --bam [BAM ...]
                        Mapped reads files (BAM, mandatory) (default: None)
  -g GENOME, --genome GENOME
                        Reference genome (FASTA, mandatory) (default: None)
  -a REF_GTF, --ref_gtf REF_GTF
                        Reference genome annotation (GTF, mandatory) (default: None)
  -o OUT_GTF, --out_gtf OUT_GTF
                        Final output file name (GTF) (default: isorefiner_espresso.gtf)
  -d WORK_DIR, --work_dir WORK_DIR
                        Working directory containing intermediate and log files (default: isorefiner_espresso_work)
  -t THREADS, --threads THREADS
                        Number of threads (default: 1)
  -s TOOL_S_OPTION, --tool_s_option TOOL_S_OPTION
                        Option for ESPRESSO_S.pl (quoted string) (default: )
  -c TOOL_C_OPTION, --tool_c_option TOOL_C_OPTION
                        Option for ESPRESSO_C.pl (quoted string) (default: )
  -q TOOL_Q_OPTION, --tool_q_option TOOL_Q_OPTION
                        Option for ESPRESSO_Q.pl (quoted string) (default: )

output: isorefiner_espresso.gtf (-o argument)

run_isoquant

Run IsoQuant (read mapping-based tool).

isorefiner run_isoquant [-h] -b [BAM ...] -g GENOME -a REF_GTF [-o OUT_GTF] [-d WORK_DIR] [-t THREADS] [-p TOOL_OPTION]

options:
  -h, --help            show this help message and exit
  -b [BAM ...], --bam [BAM ...]
                        Mapped reads files (BAM, mandatory) (default: None)
  -g GENOME, --genome GENOME
                        Reference genome (FASTA, mandatory) (default: None)
  -a REF_GTF, --ref_gtf REF_GTF
                        Reference genome annotation (GTF, mandatory) (default: None)
  -o OUT_GTF, --out_gtf OUT_GTF
                        Final output file name (GTF) (default: isorefiner_isoquant.gtf)
  -d WORK_DIR, --work_dir WORK_DIR
                        Working directory containing intermediate and log files (default: isorefiner_isoquant_work)
  -t THREADS, --threads THREADS
                        Number of threads (default: 1)
  -p TOOL_OPTION, --tool_option TOOL_OPTION
                        Option for isoquant (quoted string) (default: --complete_genedb --data_type nanopore --stranded none --transcript_quantification unique_only
                        --gene_quantification unique_only --matching_strategy default --splice_correction_strategy default_ont --model_construction_strategy default_ont
                        --no_secondary --check_canonical --count_exons)

output: isorefiner_isoquant.gtf (-o argument)

run_stringtie

Run StringTie (read mapping-based tool).

isorefiner run_stringtie [-h] -b [BAM ...] [-g GENOME] -a REF_GTF [-o OUT_GTF] [-d WORK_DIR] [-t THREADS] [-p TOOL_OPTION]

options:
  -h, --help            show this help message and exit
  -b [BAM ...], --bam [BAM ...]
                        Mapped reads files (BAM, mandatory) (default: None)
  -g GENOME, --genome GENOME
                        Reference genome (FASTA) (default: None)
  -a REF_GTF, --ref_gtf REF_GTF
                        Reference genome annotation (GTF, mandatory) (default: None)
  -o OUT_GTF, --out_gtf OUT_GTF
                        Final output file name (GTF) (default: isorefiner_stringtie.gtf)
  -d WORK_DIR, --work_dir WORK_DIR
                        Working directory containing intermediate and log files (default: isorefiner_stringtie_work)
  -t THREADS, --threads THREADS
                        Number of threads (default: 1)
  -p TOOL_OPTION, --tool_option TOOL_OPTION
                        Option for StringTie (quoted string) (default: )

output: isorefiner_stringtie.gtf (-o argument)

run_rnabloom

Run RNA-Bloom (de novo assembly-based tool) and GMAP (contig mapping).

isorefiner run_rnabloom [-h] -r [READS ...] [-g GENOME] [-o OUT_GTF] [-d WORK_DIR] [-t THREADS] [--max_mem MAX_MEM] [--rnabloom_option RNABLOOM_OPTION]
                               [--gmap_min_cov GMAP_MIN_COV] [--gmap_min_idt GMAP_MIN_IDT] [--gmap_max_intron GMAP_MAX_INTRON] [--gmap_option GMAP_OPTION]

options:
  -h, --help            show this help message and exit
  -r [READS ...], --reads [READS ...]
                        Reads (FASTQ or FASTA, gzip allowed, mandatory) (default: None)
  -g GENOME, --genome GENOME
                        Reference genome (FASTA) (default: None)
  -o OUT_GTF, --out_gtf OUT_GTF
                        Final output file name (GTF) (default: isorefiner_rnabloom.gtf)
  -d WORK_DIR, --work_dir WORK_DIR
                        Working directory containing intermediate and log files (default: isorefiner_rnabloom_work)
  -t THREADS, --threads THREADS
                        Number of threads (default: 1)
  --max_mem MAX_MEM     Max memory for RNA-Bloom (java -Xmx) (default: 400g)
  --rnabloom_option RNABLOOM_OPTION
                        Option for RNA-Bloom (quoted string) (default: )
  --gmap_min_cov GMAP_MIN_COV
                        Min alignment coverage for GMAP [0-1] (default: 0.5)
  --gmap_min_idt GMAP_MIN_IDT
                        Min identity for GMAP [0-1] (default: 0.95)
  --gmap_max_intron GMAP_MAX_INTRON
                        Max intron length for GMAP (bp) (default: 100000)
  --gmap_option GMAP_OPTION
                        Option for GMAP (quoted string) (default: -n 1 --no-chimeras)

output: isorefiner_rnabloom.gtf (-o argument)

filter

Filter transcript isoforms (GTF format).

isorefiner filter [-h] -i INPUT_GTF -r [READS ...] -g GENOME [-o OUT_GTF] [-d WORK_DIR] [-t THREADS] [--max_indel MAX_INDEL] [--max_clip MAX_CLIP] [--min_idt MIN_IDT]
                         [--min_cov MIN_COV] [--min_mean_depth MIN_MEAN_DEPTH]

options:
  -h, --help            show this help message and exit
  -i INPUT_GTF, --input_gtf INPUT_GTF
                        Input transcript isoform structures (GTF, mandatory) (default: None)
  -r [READS ...], --reads [READS ...]
                        Reads (FASTQ or FASTA, gzip allowed, mandatory) (default: None)
  -g GENOME, --genome GENOME
                        Reference genome (FASTA, mandatory) (default: None)
  -o OUT_GTF, --out_gtf OUT_GTF
                        Final output file name (GTF) (default: isorefiner_filtered.gtf)
  -d WORK_DIR, --work_dir WORK_DIR
                        Working directory containing intermediate and log files (default: isorefiner_filter_work)
  -t THREADS, --threads THREADS
                        Number of threads (default: 1)
  --max_indel MAX_INDEL
                        Max indel for read mapping (default: 20)
  --max_clip MAX_CLIP   Max clip (unaligned) length for read mapping (default: 200)
  --min_idt MIN_IDT     Min identity for read mapping [0-1] (default: 0.9)
  --min_cov MIN_COV     Min coverage for filtering [0-1] (default: 0.95)
  --min_mean_depth MIN_MEAN_DEPTH
                        Min mean coverage depth for filtering (default: 1.0)

output: isorefiner_filtered.gtf (-o argument)

refine

Merge and refine transcript isoforms.

isorefiner refine [-h] -i [INPUT_GTF ...] -r [READS ...] -g GENOME -a REF_GTF [-o OUT_GTF] [-d WORK_DIR] [-t THREADS] [--max_indel MAX_INDEL] [--max_clip MAX_CLIP]
                         [--min_idt MIN_IDT] [--intron_dist_th INTRON_DIST_TH]

options:
  -h, --help            show this help message and exit
  -i [INPUT_GTF ...], --input_gtf [INPUT_GTF ...]
                        Input transcript isoform structures (GTF, mandatory) (default: None)
  -r [READS ...], --reads [READS ...]
                        Reads (FASTQ or FASTA, gzip allowed, mandatory) (default: None)
  -g GENOME, --genome GENOME
                        Reference genome (FASTA, mandatory) (default: None)
  -a REF_GTF, --ref_gtf REF_GTF
                        Reference genome annotation (GTF, mandatory) (default: None)
  -o OUT_GTF, --out_gtf OUT_GTF
                        Final output file name (GTF) (default: isorefiner_refined.gtf)
  -d WORK_DIR, --work_dir WORK_DIR
                        Working directory containing intermediate and log files (default: isorefiner_refine_work)
  -t THREADS, --threads THREADS
                        Number of threads (default: 1)
  --max_indel MAX_INDEL
                        Max indel for read mapping (default: 20)
  --max_clip MAX_CLIP   Max clip (unaligned) length for read mapping (default: 200)
  --min_idt MIN_IDT     Min identity for read mapping [0-1] (default: 0.9)
  --intron_dist_th INTRON_DIST_TH
                        Intron distance threshold to exclude erroneous isoforms (default: 20)

output: isorefiner_refined.gtf (-o argument)
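To get a quick overview of a GTF output file, the transcripts and their exon counts can be tallied with a few lines of Python. The sketch below relies only on the general GTF layout (nine tab-separated columns, feature type in column 3, a transcript_id attribute in column 9) and does not depend on IsoRefiner. The function name summarize_gtf and the sample records are made up for illustration; they are not real IsoRefiner output.

```python
import re

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def summarize_gtf(lines):
    """Return {transcript_id: number of exon records} for an iterable of GTF lines."""
    exon_counts = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        match = TRANSCRIPT_ID.search(fields[8])
        if match is None:
            continue  # e.g. gene records without a transcript_id
        tid = match.group(1)
        exon_counts.setdefault(tid, 0)
        if fields[2] == "exon":
            exon_counts[tid] += 1
    return exon_counts


def record(feature, start, end, tid):
    """Build one made-up GTF line for the demonstration below."""
    attrs = f'gene_id "G1"; transcript_id "{tid}";'
    return "\t".join(["chr1", "example", feature, str(start), str(end), ".", "+", ".", attrs])


sample = [
    record("transcript", 100, 900, "T1"),
    record("exon", 100, 300, "T1"),
    record("exon", 500, 900, "T1"),
    record("transcript", 100, 300, "T2"),
    record("exon", 100, 300, "T2"),
]
counts = summarize_gtf(sample)
multi_exon = sum(1 for n in counts.values() if n > 1)
print(len(counts), "transcripts,", multi_exon, "with more than one exon")
```

For a real file, the same function can be applied to an open file handle, for example with open("isorefiner_refined.gtf") as fh: summarize_gtf(fh).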
