SWARM: Single-molecule Workflow for Analysing RNA Modifications

Detection of pseudouridine, m6A, and m5C on individual molecules from direct RNA nanopore signals


Table of Contents

Preprocess raw signals
Detect RNA modifications
Train new models

Preprocess raw signals


Basecalling

SWARM was trained on signals basecalled with guppy 6.4.6 for RNA002 and with dorado 0.7.2 for RNA004.

Recommended parameters for RNA002:

guppy_basecaller -i $INPUTDIR --recursive -s $output_path.fastq -c guppy/ont-guppy/data/rna_r9.4.1_70bps_hac.cfg --device cuda:all:100%

Recommended parameters for RNA004:

MODEL=dorado-0.7.2-linux-x64/rna004_130bps_sup@v5.0.0
dorado basecaller $MODEL $INPUTDIR -r -x cuda:all --emit-fastq > $output_path.fastq

Alignment

We used minimap2 2.24 for alignment and samtools 1.12 for quality checks.

Recommended parameters: -k 5 for synthetic IVTs and -k 14 for human transcriptomes

minimap2 -ax map-ont -k 5 ${fasta} ${input_path}/guppy_pass.fastq | samtools sort -o ${output_path}.bam
samtools index ${output_path}.bam

#filter with -F 2324 to drop unmapped, reverse-strand, secondary and supplementary alignments
samtools view -b -F 2324 ${bam_file}.bam > ${bam_file}_pass_filtered.bam
samtools index ${bam_file}_pass_filtered.bam
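
As a quick sanity check on the filtered alignments, the standard samtools subcommands quickcheck and flagstat can be used (a minimal sketch, not specific to SWARM):

#verify BAM integrity and summarise mapping statistics after filtering
samtools quickcheck ${bam_file}_pass_filtered.bam && echo "BAM OK"
samtools flagstat ${bam_file}_pass_filtered.bam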

fast5 to slow5

This step is optional but highly recommended, especially for large datasets.

https://github.com/hasindu2008/slow5tools

Example conversion command:

#convert fast5 files to slow5 files using 8 I/O processes
slow5tools f2s $INPUT_DIR -d $TEMPDIR  -p 8

#Merge all the slow5 files into a single file using 8 threads
slow5tools merge $TEMPDIR -o $OUTDIR/${SAMPLE}.blow5 -t 8

#remove the temporary directory
rm -rf  $TEMPDIR
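
Optionally, the merged file can be inspected to confirm the conversion succeeded (a minimal sketch; slow5tools stats is a generic slow5tools subcommand, not part of SWARM):

#print a summary of the merged blow5 (read counts, compression, etc.)
slow5tools stats $OUTDIR/${SAMPLE}.blow5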

Event alignment

f5c

Our workflow supports both the f5c SAM and nanopolish TSV formats. We highly recommend opting for f5c and SAM files. This requires the slow5 conversion outlined in the previous step.

https://github.com/hasindu2008/f5c

Example event align command:

#f5c index
f5c index -t 48 $FASTQ_PATH --slow5 $SLOW5_PATH

#sam event alignment format
f5c  eventalign -t 48  -r $FASTQ_PATH --rna  -g $genome -b $BAM --slow5 $SLOW5_PATH --min-mapq 0 --secondary=yes --signal-index --scale-events --samples --print-read-names --sam > $OUT

nanopolish

We used this format in earlier stages of the project, and our workflow can still support it. Note that our prediction workflow is optimised for the f5c SAM format.

https://github.com/jts/nanopolish

Example event align command:

nanopolish index -d ${fast5_path} -s ${guppy_files}/sequencing_summary.txt $fastq

nanopolish eventalign -t 48 --reads $fastq --bam $bam_file \
        --genome $fasta --signal-index --scale-events --samples --print-read-names > $output_path

Detect RNA modifications


Installation

Simply clone from GitHub (install Git LFS first to download the large h5 model files)

git lfs install
git clone https://github.com/comprna/SWARM/ && cd SWARM
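
If the repository was cloned before Git LFS was set up, the large h5 model files may be LFS pointer stubs rather than the actual models. They can be fetched afterwards with standard Git LFS commands (a minimal sketch, not SWARM-specific):

#fetch the LFS-tracked model files and list them to confirm they downloaded
git lfs pull
git lfs ls-files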

The following build step is required for our C++ preprocessing of SAM event files and is therefore highly recommended. Skip it if you are using the eventalign.tsv format.

cd SWARM_scripts/preprocess/

#build and compile htslib, slow5tools, SWARM_preprocess
bash build.sh

Dependencies

SWARM supports GPU inference with TensorFlow, tested with versions 2.8.0 and 2.15.0.

GPU-configured TensorFlow should be available on most HPC systems. Otherwise, you can install a GPU-enabled TensorFlow as described at https://www.tensorflow.org/install/

python requirements:

python=3.11.7
tensorflow==2.15.0
numpy==1.26.2
pandas==2.2.0
scikit-learn==1.4.0
pysam==0.22.1
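
One way to satisfy these requirements is a dedicated virtual environment. The sketch below assumes pip on Linux with Python 3.11 and uses the tensorflow[and-cuda] extra as one way of obtaining a GPU-enabled TensorFlow build; adapt it to your own setup or HPC module system if needed:

#create and activate a clean Python 3.11 environment
python3.11 -m venv swarm-env && source swarm-env/bin/activate

#install the pinned requirements (GPU-enabled TensorFlow via the and-cuda extra)
pip install "tensorflow[and-cuda]==2.15.0" numpy==1.26.2 pandas==2.2.0 scikit-learn==1.4.0 pysam==0.22.1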

Read-level single-base detection

sam + slow5 preprocessing (preferred)

Use this approach for faster, simultaneous preprocessing and model inference. Run build.sh from the section above first.

Models for RNA002 or RNA004 chemistry are automatically selected based on the blow5 data.

Example bash code to run SWARM read-level prediction:

module load tensorflow

export MOD=m6A    # one of: m6A, m5C, pU
export FASTA=Homo_sapiens.GRCh38.cdna.fa
export BLOW5=Hek293_mRNA.blow5
export SAM=Hek293_mRNA_f5C.events.sam
export OUT=Hek293_mRNA.$MOD.m1.tsv

python3 SWARM_read_level.py -m $MOD --sam $SAM --fasta $FASTA --raw $BLOW5 -o $OUT 

eventalign.tsv preprocessing

Alternatively, preprocessing and prediction can be run separately starting from an eventalign.tsv file, but this involves large temporary files.

First preprocess the event alignments.

export MOD=pU
export BAM=Hek293_mRNA_f5C.bam
export EVENTS=Hek293_mRNA.events.tsv
export OUT=Hek293_mRNA_pU

python3 SWARM_read_level.py --preprocess -m $MOD --bam $BAM --nanopolish $EVENTS -o $OUT

Then predict modification states.

export MOD=pU
export PICKLE=Hek293_mRNA_pU_T.pickle
export OUT=Hek293_mRNA_pU.pred.tsv

python3 SWARM_read_level.py --predict -m $MOD --pickle $PICKLE -o $OUT

Site-level detection

First sort the model1 (read-level) output; use cat first if pooling multiple replicates.

cat Hek293_mRNA_rep1_pU.pred.tsv Hek293_mRNA_rep2_pU.pred.tsv > Hek293_mRNA_pooled_pU.pred.tsv
sort -k 1 Hek293_mRNA_pooled_pU.pred.tsv > Hek293_mRNA_pooled_pU.pred.tsv.sorted

Run site-level detection on sorted read-level data:

INPUT=Hek293_mRNA_pooled_pU.pred.tsv.sorted
OUT=Hek293_mRNA_pooled_pU.m2.pred.tsv
python3 SWARM_site_level.py -i $INPUT -o $OUT 

modsam output

Use the --sam tag with SWARM_read_level.py to get mod.sam output (the pred.tsv is still produced too).

Note that this runs more slowly, as multithreaded preprocessing is not implemented for modsam output.

python3 SWARM_read_level.py -m $MOD --sam $SAM --fasta $FASTA --raw $BLOW5 -o $OUT --sam

mod.sam can also be generated from sorted read-level pred.tsv files.

This should be faster and also enables filtering of sites for cleaner results.

M1_sorted=Hek293_mRNA_pooled_pU.pred.tsv.sorted
SAM=Hek293_mRNA_f5C.events.sam
M2=Hek293_mRNA_pooled_pU.m2.pred.tsv
OUT=Hek293_mRNA_pooled_pU.mod.sam
python3 SWARM_make_modsam.py -i $M1_sorted -s $SAM -m $M2 -o $OUT

Train new models


Train read-level prediction

Trim eventalign files

This optional step reduces the time needed to retrain models, as preprocessing only a fraction of the signals from a whole sample is usually enough for training. We trim events to keep 500 signals per 9mer.

python3 train_models/trim_tsv_events.py -i <eventalign.tsv> -o <out_prefix> --limit-out 500

Preprocess trimmed files

Preprocess the trimmed files into model1 input features. Make sure to include the --out_counter argument here!

python3 SWARM_read_level.py --preprocess -m <pU/m6A/m5C/ac4C> --bam <BAM> \
--nanopolish <eventalign_trimmed.tsv> -o <out_prefix> --out_counter

Split training/validation/testing data

Use this step for stratified sampling of the preprocessed signals. It splits an equal number of signals per 9mer for the positive and negative labels into each of the train/validation/test sets (60/20/20 split by default). Run it on each sample to produce the positive and negative data, and give the same outpath if multiple samples are intended to be used under the same positive/negative label.

python3 train_models/split_training_by_9mers.py -i <preprocessed.pickle> \
--counts <preprocessed.counts> -o <outpath> --limit <signals_per_9mer> \
[--train_percent 0.6] [--validate_percent 0.2]

Assemble data

Use this step for finalising training/validation/testing data with matching positive/negative labels.

python3 train_models/assemble_data.py --input_positive <positive_prefix> \
--input_negative <negative_prefix> -o <outpath> [--positive_label 1] [--negative_label 0]

Train read-level model

Use this step to train a binary classifier of modification states at single-base, single-molecule resolution.

python3 train_models/train_model1.py -i <assembled_prefix> -o <outpath> \
[--vector_len 36] [--features 7] [--labels 2] [--epochs 100]
