Skip to content

wtsi-hgi/nf_splicing

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

Splicing Analysis Pipeline

A nextflow pipeline of identifying and quantifying splicing events

Splicing Analysis Pipeline Structure

Table of Contents

Catalogue
  1. Description
  2. Dependencies
  3. File Format
  4. Usage
  5. Outputs
  6. Note

Description

This pipeline is designed to quantify splicing events within minigene-based mutagenesis libraries. Minigene systems are widely used to study the regulatory mechanisms of RNA splicing, particularly in the context of variant interpretation and functional genomics. By introducing specific mutations into synthetic constructs (minigenes), researchers can assess how sequence changes affect splicing outcomes in a controlled cellular environment.


Dependencies

Software
pigz = 2.7
BWA = 0.7.17
HISAT2 = 2.2.1
Samtools = 1.21
bamtools = 2.5.2
FLASH2 = 2.2.00
fastp = 0.23.4
RegTools = 1.0.0
R = 4.3.1
nextflow = 23.10.0
R Packages
data.table = 1.15.4
UpSetR = 1.4.0
gplots = 3.1.3.1
corrplot = 0.92
reshape2 = 1.4.4
optparse = 1.7.4
psych = 2.4.3
reactable = 0.4.4
tidyverse = 2.0.0
stringr = 1.5.1
performanceanalytics = 2.0.4
parallelly = 1.37.1
dendextend = 1.17.1
sparkline = 2.0
ggVennDiagram = 1.5.2
htmltools = 0.5.8
Bioconductor Packages
GenomicRanges = 1.54.1
Rsamtools = 2.18.0
Biostrings = 2.70.3

File Format

Sample Sheet

sample replicate directory read1 read2 reference barcode
s1 rep1 /path/of/directory/ s1_rep1_r1.fastq.gz s1_rep1_r2.fastq.gz s1_ref.fa s1_barcode.txt
s1 rep2 /path/of/directory/ s1_rep2_r1.fastq.gz s1_rep2_r2.fastq.gz s1_ref.fa s1_barcode.txt
s1 rep3 /path/of/directory/ s1_rep3_r1.fastq.gz s1_rep3_r2.fastq.gz s1_ref.fa s1_barcode.txt
s2 rep1 /path/of/directory/ s2_rep1_r1.fastq.gz s2_rep1_r2.fastq.gz s2_ref.fa s2_barcode.txt
s2 rep2 /path/of/directory/ s2_rep2_r1.fastq.gz s2_rep2_r2.fastq.gz s2_ref.fa s2_barcode.txt
s2 rep3 /path/of/directory/ s2_rep3_r1.fastq.gz s2_rep3_r2.fastq.gz s2_ref.fa s2_barcode.txt

Important

  1. The sample sheet must be a csv file and the header must be like above in the example
  2. All the files should be in the directory for each sample

Reference File

The reference must be the sequence(s) of the minigene.

  • Random Intron Library: The reference file should contain all variant sequences. For example, if the library includes 100 random intron sequences, the reference fasta file must include all 100 corresponding sequences.
  • Mutagenesis Library: The reference file only needs the wild-type sequence.

Important

  1. File must be a fasta file
  2. Exons must be in Uppercase
  3. Introns must be in Lowercase

Barcode Association File

Please provide the barcode association file as below

barcode variant varid count
CCAAATAATCATTAGGATCAGCATATTAATCTAGATTC GTAAGTCGAAGAATTCTTGGGGAAACAGATTGAAATAACTTGGGAAGTAGTTCTTTCTCTTAGTGTGAAAGTATGTTCTCA var1 56
ATCTATCAGAATGTATATTGGGATAAAAATAGTGATTC GTAAGTGATTCAGGAGAGTTTCGTTCAGATTGAAATAACTTGGGAAGTAGTTCTTTCTCTTAGTGTGAAAGTATGTTCTCA var2 293

Important

  1. The barcode association file must be a tsv file and the header must be like above in the example

Usage

Run -- Sanger farm

Please submit the bash script below

#!/bin/bash
#BSUB -o %J.o
#BSUB -e %J.e
#BSUB -R "select[mem>1000] rusage[mem=1000]"
#BSUB -M 1000
#BSUB -q normal
#BSUB -J nf_splicing
   
# modules
module load HGI/common/nextflow/23.10.0
module load HGI/softpack/users/fs18/nf_splicing
   
#--------------#
# user specify #
#--------------#
# LSF group
export LSB_DEFAULT_USERGROUP=hgi
   
# Paths
export INPUTSAMPLE=$PWD/inputs/sample_sheet.csv
export OUTPUTRES=$PWD/outputs
  
#-----------#
# pipelines #
#-----------#
nextflow run -resume nf_splicing/main.nf --sample_sheet $INPUTSAMPLE \
                                         --library      random

Options

Mandatory arguments

--sample_sheet                path of the sample sheet
--outdir                      the directory path of output results, default: the current directory
--do_pe_reads                 whether to process paired-end reads, default: false

Optional arguments

Basic:
--library                     random, muta, default: muta
--barcode_template            barcode template, default: NNNNATNNNNATNNNNATNNNNATNNNNATNNNNATNN
--barcode_marker              barcode marker, default: CTACTGATTCGATGCAAGCTTG

Fastp:
--fastp_cut_mean_quality      mean quality for fastp, default: 20

Flash2:
--flash2_min_overlap          min overlap for flash2, default: 10
--flash2_max_overlap          max overlap for flash2, default: 250
--flash2_min_overlap_outie    min overlap outie for flash2, default: 20
--flash2_max_mismatch_density max mismatch density for flash2, default: 0.25

BWA:
--bwa_gap_open                gap open penalty for BWA, default: 10,10
--bwa_gap_ext                 gap extension penalty for BWA, default: 5,5
--bwa_clip                    clip penalty for BWA, default: 1,1

Barcode extraction:
--filter_softclip_base        softclip base for filtering, default: 5

HISAT2:
--hisat2_score_min            min score for HISAT2, default: L,0,-0.3
--hisat2_mp                   min/max mismatch penalty for HISAT2, default: 5,2
--hisat2_sp                   min/max splice penalty for HISAT2, default: 2,1
--hisat2_np                   non-canonical splicing penalty for HISAT2, default: 0
--hisat2_pen_noncansplice     non-canonical splicing penalty for HISAT2, default: 0

Spliced products:
--do_spliced_products         whether to process spliced products, default: false

Regtools:
--regtools_min_anchor         min anchor length for regtools, default: 5
--regtools_min_intron         min intron length for regtools, default: 20

Junction classification:
--classify_min_overlap        min overlap for classification, default: 2
--classify_min_cov            min coverage for classification, default: 2
--classify_reduce             reduce the number of reads for classification, default: 2

Outputs

Structure

πŸ“ output_directory
    β”œβ”€β”€β”€ πŸ“ extracted_barcodes
    β”‚       β”œβ”€β”€β”€ πŸ“ s1_rep1
    β”‚       β”‚       β”œβ”€β”€β”€ πŸ“„ canonical_barcodes.txt
    β”‚       β”‚       └─── πŸ“„ novel_barcodes.txt
    β”‚       β”œβ”€β”€β”€ πŸ“ s1_rep2
    β”‚       └─── πŸ“ s1_rep3
    β”œβ”€β”€β”€ πŸ“ novel_junctions
    β”‚       β”œβ”€β”€β”€ πŸ“ s1_rep1
    β”‚       β”‚       β”œβ”€β”€β”€ πŸ“„ junctions.bed
    β”‚       β”‚       β”œβ”€β”€β”€ πŸ“„ classified_junctions.txt
    β”‚       β”‚       β”œβ”€β”€β”€ πŸ“„ classified_junctions.reduce.txt
    β”‚       β”‚       β”œβ”€β”€β”€ πŸ“„ classified_junctions.png
    β”‚       β”‚       └─── πŸ“„ classified_variants.png
    β”‚       β”œβ”€β”€β”€ πŸ“ s1_rep2
    β”‚       └─── πŸ“ s1_rep3
    β”œβ”€β”€β”€ πŸ“ novel_splicing_results
    β”‚       β”œβ”€β”€β”€ πŸ“ s1_rep1
    β”‚       β”‚       β”œβ”€β”€β”€ πŸ“„ spliced_alignment.bam
    β”‚       β”‚       └─── πŸ“„ spliced_products.txt
    β”‚       β”œβ”€β”€β”€ πŸ“ s1_rep2
    β”‚       └─── πŸ“ s1_rep3
    β”œβ”€β”€β”€ πŸ“ splicing_counts
    β”‚       β”œβ”€β”€β”€ πŸ“„ s1_rep1.splicing_matrix.txt
    β”‚       β”œβ”€β”€β”€ πŸ“„ s1_rep2.splicing_matrix.txt
    β”‚       └─── πŸ“„ s1_rep3.splicing_matrix.txt
    └─── πŸ“ splicing_reports
            └─── πŸ“ s1
                    β”œβ”€β”€β”€ πŸ“„ splicing_report.html
                    └─── πŸ“„ *.png

File Description

Extracted barcodes

These files summarize all the barcodes in the sequencing library, categorized by canonical and novel splicing events. Compared to the barcode association file, they typically contain more detected barcodes, as the pipeline permits a one-base mismatch during barcode detection.

Note

πŸ“„ barcodes of canonical splicing events

name barcode varid count
E1_E2_E3 ATTAATAATTATCTCTATAGGCATGACCATGATCATAG var1 23
E1_E3 TTTTATTTCGATATTAATCATGATATAAATGTTCATAC var2 143

πŸ“„ barcodes of novel splicing events

barcode varid count
ATTAATAATTATCTCTATAGGCATGACCATGATCATAG var1 23
TTTTATTTCGATATTAATCATGATATAAATGTTCATAC var2 143

Novel junctions

Note

πŸ“„ junction bed file

This is a bed output file which contains all the junctions detected in the sample.

chrom chromStart chromEnd name score strand thickStart thickEnd itemRgb blockCount blockSizes blockStarts
var1 25 451 JUNC00000001 55 ? 25 451 255,0,0 2 35,29 0,397
var1 28 451 JUNC00000002 156 ? 28 451 255,0,0 2 73,106 0,317
var2 29 451 JUNC00000004 225 ? 29 451 255,0,0 2 37,303 0,119
var3 29 451 JUNC00000006 84 ? 29 451 255,0,0 2 33,101 0,321
var3 29 451 JUNC00000007 43 ? 29 451 255,0,0 2 29,29 0,393

πŸ“„ classified junctions of all the variants

This is a tsv output file which contains all the classified junctions for all the variants

varid start end cov annotation
var1 48 139 11 exon_splicing_3p_E1;
intron_retension_3p_I1
var2 77 340 3578 exon_skipping_E2;
intron_retension_5p_I1;
intron_retension_3p_I2

πŸ“„ classified junctions (reduced)

This is a tsv output file which contains all the classified junctions regardless of variants, only based on the splicing sites.

start end cov annotation
48 139 11 exon_splicing_3p_E1;
intron_retension_3p_I1
77 340 3578 exon_skipping_E2;
intron_retension_5p_I1;
intron_retension_3p_I2

Novel splicing results

Note

πŸ“„ novel splicing bam

This is the bam output file for novel splicing events.

πŸ“„ novel splicing products

This is an output file which contains all the spliced products (sequences) generating from the bam file.


Qauntification of splicing events

Note

πŸ“„ raw read count file

varid canonical_
inclusion_
E2
canonical_
skipping_
E2
intron_
retention_
I1
intron_
retention_
I2
intron_
retension_
5p_I1
intron_
retension_
5p_I2
intron_
retension_
3p_I1
intron_
retension_
3p_I2
exon_
splicing_
3p_E1
exon_
splicing_
3p_E2
exon_
splicing_
3p_E3
exon_
splicing_
5p_E1
exon_
splicing_
5p_E2
exon_
splicing_
5p_E3
exon_
skipping_
E2
var1 743 90 47 21 0 0 5 0 5 0 0 0 0 0 10
var2 235 16 63 18 0 0 6 0 6 0 0 0 0 0 22
var3 229 20 62 14 0 0 7 0 7 0 0 0 0 0 18

Splicing reports

This summary HTML file presents the summary of the statistical analysis and general overviews.


Note

Junction Classification

Junction Classification

About

A nextflow pipeline of identifying and quantifying splicing events

Resources

License

Stars

Watchers

Forks

Packages

No packages published