This pipeline is designed to quantify splicing events within minigene-based mutagenesis libraries. Minigene systems are widely used to study the regulatory mechanisms of RNA splicing, particularly in the context of variant interpretation and functional genomics. By introducing specific mutations into synthetic constructs (minigenes), researchers can assess how sequence changes affect splicing outcomes in a controlled cellular environment.
Software
pigz = 2.7
BWA = 0.7.17
HISAT2 = 2.2.1
Samtools = 1.21
bamtools = 2.5.2
FLASH2 = 2.2.00
fastp = 0.23.4
RegTools = 1.0.0
R = 4.3.1
nextflow = 23.10.0
R Packages
data.table = 1.15.4
UpSetR = 1.4.0
gplots = 3.1.3.1
corrplot = 0.92
reshape2 = 1.4.4
optparse = 1.7.4
psych = 2.4.3
reactable = 0.4.4
tidyverse = 2.0.0
stringr = 1.5.1
PerformanceAnalytics = 2.0.4
parallelly = 1.37.1
dendextend = 1.17.1
sparkline = 2.0
ggVennDiagram = 1.5.2
htmltools = 0.5.8
Bioconductor Packages
GenomicRanges = 1.54.1
Rsamtools = 2.18.0
Biostrings = 2.70.3
sample | replicate | directory | read1 | read2 | reference | barcode |
---|---|---|---|---|---|---|
s1 | rep1 | /path/of/directory/ | s1_rep1_r1.fastq.gz | s1_rep1_r2.fastq.gz | s1_ref.fa | s1_barcode.txt |
s1 | rep2 | /path/of/directory/ | s1_rep2_r1.fastq.gz | s1_rep2_r2.fastq.gz | s1_ref.fa | s1_barcode.txt |
s1 | rep3 | /path/of/directory/ | s1_rep3_r1.fastq.gz | s1_rep3_r2.fastq.gz | s1_ref.fa | s1_barcode.txt |
s2 | rep1 | /path/of/directory/ | s2_rep1_r1.fastq.gz | s2_rep1_r2.fastq.gz | s2_ref.fa | s2_barcode.txt |
s2 | rep2 | /path/of/directory/ | s2_rep2_r1.fastq.gz | s2_rep2_r2.fastq.gz | s2_ref.fa | s2_barcode.txt |
s2 | rep3 | /path/of/directory/ | s2_rep3_r1.fastq.gz | s2_rep3_r2.fastq.gz | s2_ref.fa | s2_barcode.txt |
Important
- The sample sheet must be a CSV file, and its header must match the example above.
- All files listed for a sample must be located in that sample's directory.
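The two requirements above can be checked before submitting a run. The following is a minimal sketch, not part of the pipeline: the function name `check_sample_sheet` and the `check_files` switch are hypothetical, and it assumes the header must match the example exactly, in the same column order.

```python
import csv
import io
from pathlib import Path

# Header taken from the example sample sheet above.
EXPECTED_HEADER = ["sample", "replicate", "directory", "read1", "read2", "reference", "barcode"]


def check_sample_sheet(handle, check_files=True):
    """Return a list of problems found in a sample sheet (empty if none).

    Hypothetical helper: the pipeline may perform its own, different checks.
    """
    reader = csv.DictReader(handle)
    if reader.fieldnames != EXPECTED_HEADER:
        return [f"unexpected header: {reader.fieldnames}"]

    problems = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        directory = Path(row["directory"] or "")
        for column in ("read1", "read2", "reference", "barcode"):
            name = row[column] or ""
            if not name:
                problems.append(f"line {line_no}: empty {column}")
            elif check_files and not (directory / name).is_file():
                problems.append(f"line {line_no}: {column} not found: {directory / name}")
    return problems


if __name__ == "__main__":
    sheet = io.StringIO(
        "sample,replicate,directory,read1,read2,reference,barcode\n"
        "s1,rep1,/path/of/directory/,s1_rep1_r1.fastq.gz,s1_rep1_r2.fastq.gz,s1_ref.fa,s1_barcode.txt\n"
    )
    for problem in check_sample_sheet(sheet):
        print(problem)
```

With `check_files=False` only the header and empty cells are checked, which is useful when the sheet is prepared on a machine that cannot see the data directory.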
The reference must be the sequence(s) of the minigene.
- Random Intron Library: The reference file should contain all variant sequences. For example, if the library includes 100 random intron sequences, the reference fasta file must include all 100 corresponding sequences.
- Mutagenesis Library: The reference file only needs the wild-type sequence.
Important
- The reference must be a FASTA file.
- Exons must be in uppercase.
- Introns must be in lowercase.
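To illustrate the case convention, the sketch below splits a reference sequence into exon and intron segments based on letter case. It is only an illustration of how such a reference can be read, not the pipeline's own code; the function name `split_exons_introns` is hypothetical, and coordinates are reported as 0-based, half-open intervals by assumption.

```python
from itertools import groupby


def split_exons_introns(sequence):
    """Split a minigene sequence into (kind, start, end) segments by letter case.

    Uppercase runs are reported as exons and lowercase runs as introns.
    Coordinates are 0-based and half-open (an assumption for this sketch).
    """
    segments = []
    position = 0
    for is_upper, run in groupby(sequence, key=str.isupper):
        length = len(list(run))
        kind = "exon" if is_upper else "intron"
        segments.append((kind, position, position + length))
        position += length
    return segments


if __name__ == "__main__":
    # Toy sequence: exon, intron, exon.
    for segment in split_exons_introns("ACGTacgtttGG"):
        print(segment)
```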
Provide the barcode association file in the format below:
barcode | variant | varid | count |
---|---|---|---|
CCAAATAATCATTAGGATCAGCATATTAATCTAGATTC | GTAAGTCGAAGAATTCTTGGGGAAACAGATTGAAATAACTTGGGAAGTAGTTCTTTCTCTTAGTGTGAAAGTATGTTCTCA | var1 | 56 |
ATCTATCAGAATGTATATTGGGATAAAAATAGTGATTC | GTAAGTGATTCAGGAGAGTTTCGTTCAGATTGAAATAACTTGGGAAGTAGTTCTTTCTCTTAGTGTGAAAGTATGTTCTCA | var2 | 293 |
Important
- The barcode association file must be a TSV file, and its header must match the example above.
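A barcode association file in this layout can be loaded into a lookup table as sketched below. The function name `read_barcode_association` is hypothetical, and the strict header check is an assumption based on the requirement above.

```python
import csv
import io

# Header taken from the example barcode association file above.
EXPECTED_HEADER = ["barcode", "variant", "varid", "count"]


def read_barcode_association(handle):
    """Map each barcode to (varid, count) from a tab-separated association file."""
    reader = csv.DictReader(handle, delimiter="\t")
    if reader.fieldnames != EXPECTED_HEADER:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    table = {}
    for row in reader:
        table[row["barcode"]] = (row["varid"], int(row["count"]))
    return table


if __name__ == "__main__":
    example = io.StringIO(
        "barcode\tvariant\tvarid\tcount\n"
        "CCAAATAATCATTAGGATCAGCATATTAATCTAGATTC\tGTAAGTCGAAGAATTC\tvar1\t56\n"
    )
    print(read_barcode_association(example))
```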
Submit the bash script below to run the pipeline:
#!/bin/bash
#BSUB -o %J.o
#BSUB -e %J.e
#BSUB -R "select[mem>1000] rusage[mem=1000]"
#BSUB -M 1000
#BSUB -q normal
#BSUB -J nf_splicing
# modules
module load HGI/common/nextflow/23.10.0
module load HGI/softpack/users/fs18/nf_splicing
#--------------#
# user specify #
#--------------#
# LSF group
export LSB_DEFAULT_USERGROUP=hgi
# Paths
export INPUTSAMPLE=$PWD/inputs/sample_sheet.csv
export OUTPUTRES=$PWD/outputs
#-----------#
# pipelines #
#-----------#
nextflow run -resume nf_splicing/main.nf --sample_sheet "$INPUTSAMPLE" \
                                         --outdir "$OUTPUTRES" \
                                         --library random
--sample_sheet path of the sample sheet
--outdir the directory path of output results, default: the current directory
--do_pe_reads whether to process paired-end reads, default: false
Basic:
--library random, muta, default: muta
--barcode_template barcode template, default: NNNNATNNNNATNNNNATNNNNATNNNNATNNNNATNN
--barcode_marker barcode marker, default: CTACTGATTCGATGCAAGCTTG
Fastp:
--fastp_cut_mean_quality mean quality for fastp, default: 20
Flash2:
--flash2_min_overlap min overlap for flash2, default: 10
--flash2_max_overlap max overlap for flash2, default: 250
--flash2_min_overlap_outie min overlap outie for flash2, default: 20
--flash2_max_mismatch_density max mismatch density for flash2, default: 0.25
BWA:
--bwa_gap_open gap open penalty for BWA, default: 10,10
--bwa_gap_ext gap extension penalty for BWA, default: 5,5
--bwa_clip clip penalty for BWA, default: 1,1
Barcode extraction:
--filter_softclip_base softclip base for filtering, default: 5
HISAT2:
--hisat2_score_min min score for HISAT2, default: L,0,-0.3
--hisat2_mp max/min mismatch penalty for HISAT2, default: 5,2
--hisat2_sp max/min soft-clipping penalty for HISAT2, default: 2,1
--hisat2_np penalty for ambiguous bases (N) for HISAT2, default: 0
--hisat2_pen_noncansplice non-canonical splicing penalty for HISAT2, default: 0
Spliced products:
--do_spliced_products whether to process spliced products, default: false
Regtools:
--regtools_min_anchor min anchor length for regtools, default: 5
--regtools_min_intron min intron length for regtools, default: 20
Junction classification:
--classify_min_overlap min overlap for classification, default: 2
--classify_min_cov min coverage for classification, default: 2
--classify_reduce reduce the number of reads for classification, default: 2
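As an illustration of the `--barcode_template` option listed above, the sketch below turns the default template into a regular expression. It assumes that `N` stands for any of A, C, G, or T and that every other character must match literally; the pipeline's actual matching logic is not described here and may differ. The function name `template_to_regex` is hypothetical.

```python
import re

# Default value of --barcode_template from the option list above.
DEFAULT_TEMPLATE = "NNNNATNNNNATNNNNATNNNNATNNNNATNNNNATNN"


def template_to_regex(template):
    """Compile a barcode template into a regex (assumes N = any of A/C/G/T)."""
    parts = []
    for symbol in template.upper():
        parts.append("[ACGT]" if symbol == "N" else re.escape(symbol))
    return re.compile("".join(parts))


if __name__ == "__main__":
    pattern = template_to_regex(DEFAULT_TEMPLATE)
    # Barcode taken from the example barcode association file.
    barcode = "CCAAATAATCATTAGGATCAGCATATTAATCTAGATTC"
    print(pattern.fullmatch(barcode) is not None)
```

Both example barcodes in the barcode association file follow this template: four variable bases, then a fixed `AT`, repeated six times, followed by two variable bases.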
output_directory
├── extracted_barcodes
│   ├── s1_rep1
│   │   ├── canonical_barcodes.txt
│   │   └── novel_barcodes.txt
│   ├── s1_rep2
│   └── s1_rep3
├── novel_junctions
│   ├── s1_rep1
│   │   ├── junctions.bed
│   │   ├── classified_junctions.txt
│   │   ├── classified_junctions.reduce.txt
│   │   ├── classified_junctions.png
│   │   └── classified_variants.png
│   ├── s1_rep2
│   └── s1_rep3
├── novel_splicing_results
│   ├── s1_rep1
│   │   ├── spliced_alignment.bam
│   │   └── spliced_products.txt
│   ├── s1_rep2
│   └── s1_rep3
├── splicing_counts
│   ├── s1_rep1.splicing_matrix.txt
│   ├── s1_rep2.splicing_matrix.txt
│   └── s1_rep3.splicing_matrix.txt
└── splicing_reports
    └── s1
        ├── splicing_report.html
        └── *.png
These files summarize all the barcodes in the sequencing library, categorized by canonical and novel splicing events. Compared to the barcode association file, they typically contain more detected barcodes, as the pipeline permits a one-base mismatch during barcode detection.
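The one-base mismatch tolerance can be pictured as a Hamming-distance lookup against the known barcodes. The sketch below is an assumption about how such matching might work, not the pipeline's implementation: it considers substitutions only (no insertions or deletions) and discards ambiguous matches. The names `hamming_distance` and `match_barcode` are hypothetical.

```python
def hamming_distance(a, b):
    """Count positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def match_barcode(observed, known_barcodes, max_mismatch=1):
    """Return the single known barcode within max_mismatch substitutions, else None.

    Ambiguous cases (more than one known barcode within range) return None;
    this tie-breaking rule is a choice made for this sketch.
    """
    hits = [
        known
        for known in known_barcodes
        if len(known) == len(observed)
        and hamming_distance(observed, known) <= max_mismatch
    ]
    return hits[0] if len(hits) == 1 else None


if __name__ == "__main__":
    known = ["AAAA", "CCCC"]
    print(match_barcode("AAAT", known))
    print(match_barcode("AATT", known))
```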
Note
barcodes of canonical splicing events
name | barcode | varid | count |
---|---|---|---|
E1_E2_E3 | ATTAATAATTATCTCTATAGGCATGACCATGATCATAG | var1 | 23 |
E1_E3 | TTTTATTTCGATATTAATCATGATATAAATGTTCATAC | var2 | 143 |
barcodes of novel splicing events
barcode | varid | count |
---|---|---|
ATTAATAATTATCTCTATAGGCATGACCATGATCATAG | var1 | 23 |
TTTTATTTCGATATTAATCATGATATAAATGTTCATAC | var2 | 143 |
Note
junction BED file
This BED file contains all the junctions detected in the sample.
chrom | chromStart | chromEnd | name | score | strand | thickStart | thickEnd | itemRgb | blockCount | blockSizes | blockStarts |
---|---|---|---|---|---|---|---|---|---|---|---|
var1 | 25 | 451 | JUNC00000001 | 55 | ? | 25 | 451 | 255,0,0 | 2 | 35,29 | 0,397 |
var1 | 28 | 451 | JUNC00000002 | 156 | ? | 28 | 451 | 255,0,0 | 2 | 73,106 | 0,317 |
var2 | 29 | 451 | JUNC00000004 | 225 | ? | 29 | 451 | 255,0,0 | 2 | 37,303 | 0,119 |
var3 | 29 | 451 | JUNC00000006 | 84 | ? | 29 | 451 | 255,0,0 | 2 | 33,101 | 0,321 |
var3 | 29 | 451 | JUNC00000007 | 43 | ? | 29 | 451 | 255,0,0 | 2 | 29,29 | 0,393 |
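In this BED12 layout the two blocks are the aligned regions flanking the junction, so the intron itself lies between the end of the first block and the start of the second. The sketch below derives those coordinates from one row. It follows the standard BED convention (0-based, half-open) and is only an illustration; how the pipeline itself consumes this file is not described here. The function name `junction_intron` is hypothetical.

```python
def junction_intron(bed_fields):
    """Return (chrom, intron_start, intron_end) for a two-block BED12 junction row.

    Coordinates follow the BED convention: 0-based start, exclusive end.
    """
    chrom = bed_fields[0]
    chrom_start = int(bed_fields[1])
    block_sizes = [int(x) for x in bed_fields[10].rstrip(",").split(",")]
    block_starts = [int(x) for x in bed_fields[11].rstrip(",").split(",")]
    if len(block_sizes) != 2 or len(block_starts) != 2:
        raise ValueError("expected exactly two blocks per junction")
    intron_start = chrom_start + block_starts[0] + block_sizes[0]
    intron_end = chrom_start + block_starts[1]
    return chrom, intron_start, intron_end


if __name__ == "__main__":
    # First row of the example table above.
    row = ["var1", "25", "451", "JUNC00000001", "55", "?",
           "25", "451", "255,0,0", "2", "35,29", "0,397"]
    print(junction_intron(row))
```

For the first example row this gives an intron from 60 (25 + 35) to 422 (25 + 397), and the second block ends at 422 + 29 = 451, which agrees with `chromEnd`.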
classified junctions of all the variants
This TSV file contains the classified junctions for every variant.
varid | start | end | cov | annotation |
---|---|---|---|---|
var1 | 48 | 139 | 11 | exon_splicing_3p_E1; intron_retension_3p_I1 |
var2 | 77 | 340 | 3578 | exon_skipping_E2; intron_retension_5p_I1; intron_retension_3p_I2 |
classified junctions (reduced)
This TSV file contains all the classified junctions aggregated across variants, based only on the splice sites.
start | end | cov | annotation |
---|---|---|---|
48 | 139 | 11 | exon_splicing_3p_E1; intron_retension_3p_I1 |
77 | 340 | 3578 | exon_skipping_E2; intron_retension_5p_I1; intron_retension_3p_I2 |
Note
novel splicing BAM
This is the BAM output file for novel splicing events.
novel splicing products
This output file contains all the spliced products (sequences) generated from the BAM file.
Note
raw read count file
varid | canonical_inclusion_E2 | canonical_skipping_E2 | intron_retention_I1 | intron_retention_I2 | intron_retension_5p_I1 | intron_retension_5p_I2 | intron_retension_3p_I1 | intron_retension_3p_I2 | exon_splicing_3p_E1 | exon_splicing_3p_E2 | exon_splicing_3p_E3 | exon_splicing_5p_E1 | exon_splicing_5p_E2 | exon_splicing_5p_E3 | exon_skipping_E2 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
var1 | 743 | 90 | 47 | 21 | 0 | 0 | 5 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 10 |
var2 | 235 | 16 | 63 | 18 | 0 | 0 | 6 | 0 | 6 | 0 | 0 | 0 | 0 | 0 | 22 |
var3 | 229 | 20 | 62 | 14 | 0 | 0 | 7 | 0 | 7 | 0 | 0 | 0 | 0 | 0 | 18 |
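One common way to use such counts downstream is to compute, per variant, the fraction of canonical reads that include exon 2. The sketch below shows this calculation on a single row. It is an illustrative assumption, not necessarily the statistic reported by the pipeline; the function name `inclusion_fraction` is hypothetical, and the column names are taken from the table header above.

```python
def inclusion_fraction(row,
                       inclusion="canonical_inclusion_E2",
                       skipping="canonical_skipping_E2"):
    """Return inclusion / (inclusion + skipping) for one variant, or None if both are 0."""
    included = int(row[inclusion])
    skipped = int(row[skipping])
    total = included + skipped
    if total == 0:
        return None  # no canonical reads; the fraction is undefined
    return included / total


if __name__ == "__main__":
    # Counts for var1 from the example table above.
    var1 = {"varid": "var1", "canonical_inclusion_E2": 743, "canonical_skipping_E2": 90}
    print(round(inclusion_fraction(var1), 3))
```

Only the two canonical columns are used here; the other event columns would need to enter the denominator if the fraction should be relative to all reads.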
The HTML report summarizes the statistical analysis and gives a general overview of the results.