Detection of pseudouridine, m6A, and m5C on individual molecules from direct RNA nanopore signals
SWARM was trained on signals basecalled with guppy 6.4.6 for RNA002 and with dorado 0.7.2 for RNA004.
Recommended parameters RNA002:
guppy_basecaller -i $INPUTDIR --recursive -s $output_path.fastq -c guppy/ont-guppy/data/rna_r9.4.1_70bps_hac.cfg --device cuda:all:100%
Recommended parameters RNA004:
MODEL=dorado-0.7.2-linux-x64/rna004_130bps_sup@v5.0.0
dorado basecaller $MODEL $INPUTDIR -r -x cuda:all --emit-fastq > $output_path.fastq
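Before aligning, it can be useful to sanity-check how many reads the basecaller emitted; a minimal sketch using only awk (file name taken from the command above):

```shell
#each FASTQ record is 4 lines, so NR/4 gives the read count
awk 'END { print NR/4 }' $output_path.fastq
```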
minimap2 2.24 is used for alignment and samtools 1.12 for filtering and quality checks
Recommended parameters: -k 5 for synthetic IVTs and -k 14 for human transcriptomes
minimap2 -ax map-ont -k 5 ${fasta} ${input_path}/guppy_pass.fastq | samtools sort -o ${output_path}.bam
samtools index ${output_path}.bam
samtools view -b -F 2324 ${bam_file}.bam > ${bam_file}_pass_filtered.bam
samtools index ${bam_file}_pass_filtered.bam
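The filter above can discard a large share of reads, so samtools flagstat gives a quick quality check of what survived; a sketch, with an optional awk step to pull the total read count from flagstat's first output line:

```shell
#mapping summary for the filtered alignments
samtools flagstat ${bam_file}_pass_filtered.bam
#flagstat's first line is "<N> + <M> in total"; extract N
samtools flagstat ${bam_file}_pass_filtered.bam | awk 'NR==1 { print $1 }'
```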
This step is optional but highly recommended, especially for large datasets.
https://github.com/hasindu2008/slow5tools
Example conversion command:
#convert fast5 files to slow5 files using 8 I/O processes
slow5tools f2s $INPUT_DIR -d $TEMPDIR -p 8
#Merge all the slow5 files into a single file using 8 threads
slow5tools merge $TEMPDIR -o $OUTDIR/${SAMPLE}.blow5 -t 8
#remove the temporary directory
rm -rf $TEMPDIR
Our workflow supports both the f5c SAM and nanopolish TSV formats. We highly recommend opting for f5c with SAM output. This requires the slow5 conversion outlined in the previous step.
https://github.com/hasindu2008/f5c
Example event align command:
#f5c index
f5c index -t 48 $FASTQ_PATH --slow5 $SLOW5_PATH
#sam event alignment format
f5c eventalign -t 48 -r $FASTQ_PATH --rna -g $genome -b $BAM --slow5 $SLOW5_PATH --min-mapq 0 --secondary=yes --signal-index --scale-events --samples --print-read-names --sam > $OUT
We used this format in earlier stages of the project, and our workflow still supports it. Note, however, that our prediction workflow is optimised for the f5c SAM format.
https://github.com/jts/nanopolish
Example event align command:
nanopolish index -d ${fast5_path} -s ${guppy_files}/sequencing_summary.txt $fastq
nanopolish eventalign -t 48 --reads $fastq --bam $bam_file \
--genome $fasta --signal-index --scale-events --samples --print-read-names > $output_path
Simply clone from GitHub (install Git LFS first to download the large h5 files)
git lfs install
git clone https://github.com/comprna/SWARM/ && cd SWARM
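If Git LFS was not installed before cloning, the h5 model files come down as small text pointer stubs (~130 bytes) rather than real models. A quick check; the 10k threshold here is an assumption, chosen to be far below any real model size:

```shell
#list any suspiciously small h5 files, likely unfetched LFS pointers
find . -name "*.h5" -size -10k -print
#if anything is printed, fetch the real files with: git lfs pull
```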
This step is highly recommended, and it is required for using our C++ preprocessing of SAM event files. Skip it if using the eventalign.tsv format.
cd SWARM_scripts/preprocess/
#build and compile htslib, slow5tools, SWARM_preprocess
bash build.sh
SWARM supports GPU inference with tensorflow, tested with versions 2.8.0 and 2.15.0
GPU-configured tensorflow should be available on most HPC systems. Otherwise, you can install tensorflow configured for GPU as per https://www.tensorflow.org/install/
python requirements:
python=3.11.7
tensorflow==2.15.0
numpy==1.26.2
pandas==2.2.0
scikit-learn==1.4.0
pysam==0.22.1
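These can be installed into a fresh environment; a sketch assuming conda and pip are available, with version pins taken from the list above:

```shell
conda create -n swarm python=3.11.7 -y
conda activate swarm
#note: the PyPI package name for sklearn is scikit-learn
pip install tensorflow==2.15.0 numpy==1.26.2 pandas==2.2.0 scikit-learn==1.4.0 pysam==0.22.1
```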
Use this approach for faster, simultaneous preprocessing and model inference. Run build.sh from the section above first.
Models for RNA002 or RNA004 chemistry are automatically selected based on the blow5 data.
Example bash code to run SWARM read-level prediction:
module load tensorflow
export MOD=m6A # [<m6A> <m5C> <pU>]
export FASTA=Homo_sapiens.GRCh38.cdna.fa
export BLOW5=Hek293_mRNA.blow5
export SAM=Hek293_mRNA_f5C.events.sam
export OUT=Hek293_mRNA.$MOD.m1.tsv
python3 SWARM_read_level.py -m $MOD --sam $SAM --fasta $FASTA --raw $BLOW5 -o $OUT
Alternatively, preprocessing and prediction can be run separately from eventalign.tsv, but that involves large temp files.
First preprocess the event alignments.
export MOD=pU
export BAM=Hek293_mRNA_f5C.bam
export EVENTS=Hek293_mRNA.events.tsv
export OUT=Hek293_mRNA_pU
python3 SWARM_read_level.py --preprocess -m $MOD --bam $BAM --nanopolish $EVENTS -o $OUT
Then predict modification states.
export MOD=pU
export PICKLE=Hek293_mRNA_pU_T.pickle
export OUT=Hek293_mRNA_pU.pred.tsv
python3 SWARM_read_level.py --predict -m $MOD --pickle $PICKLE -o $OUT
First sort the model1 output; use cat first if pooling multiple replicates.
cat Hek293_mRNA_rep1_pU.pred.tsv Hek293_mRNA_rep2_pU.pred.tsv > Hek293_mRNA_pooled_pU.pred.tsv
sort -k 1 Hek293_mRNA_pooled_pU.pred.tsv > Hek293_mRNA_pooled_pU.pred.tsv.sorted
Run site-level detection on sorted read-level data:
INPUT=Hek293_mRNA_pooled_pU.pred.tsv.sorted
OUT=Hek293_mRNA_pooled_pU.m2.pred.tsv
python3 SWARM_site_level.py -i $INPUT -o $OUT
Use the --sam flag with SWARM_read_level.py to get mod.sam output (the pred.tsv is still produced too).
Note that this runs more slowly, as multithreaded preprocessing is not implemented for modsam output.
python3 SWARM_read_level.py -m $MOD --sam $SAM --fasta $FASTA --raw $BLOW5 -o $OUT --sam
mod.sam can also be generated from sorted read-level pred.tsv files.
This should be faster and also enables filtering of sites for cleaner results.
M1_sorted=Hek293_mRNA_pooled_pU.pred.tsv.sorted
SAM=Hek293_mRNA_f5C.events.sam
M2=Hek293_mRNA_pooled_pU.m2.pred.tsv
OUT=Hek293_mRNA_pooled_pU.mod.sam
python3 SWARM_make_modsam.py -i $M1_sorted -s $SAM -m $M2 -o $OUT
This optional step reduces model retraining time, as preprocessing only a fraction of the signals from a whole sample is usually enough for training. We trim events to 500 signals per 9mer.
python3 train_models/trim_tsv_events.py -i <eventalign.tsv> -o <out_prefix> --limit-out 500
Preprocess the trimmed files for model1 input features. Make sure to include the --out_counter argument here!
python3 SWARM_read_level.py --preprocess -m <pU/m6A/m5C/ac4C> --bam <BAM> \
--nanopolish <eventalign_trimmed.tsv> -o <out_prefix> --out_counter
Use this step for stratified sampling of the preprocessed signals. It splits an equal number of signals per 9mer for the positive/negative labels into each of the train/validation/test sets (60/20/20 split by default). Run it on each sample to make the positive/negative data. Give the same outpath if multiple samples are intended to be used under the same positive/negative label.
python3 train_models/split_training_by_9mers.py -i <preprocessed.pickle> \
--counts <preprocessed.counts> -o <outpath> --limit <signals_per_9mer> \
[--train_percent 0.6] [--validate_percent 0.2]
Use this step for finalising training/validation/testing data with matching positive/negative labels.
python3 train_models/assemble_data.py --input_positive <positive_prefix> \
--input_negative <negative_prefix> -o <outpath> [--positive_label 1] [--negative_label 0]
Use this step to train a binary classifier of modification states at single-base, single-molecule resolution.
python3 train_models/train_model1.py -i <assembled_prefix> -o <outpath> \
[--vector_len 36] [--features 7] [--labels 2] [--epochs 100]