Skip to content

Commit 8c192fe

Browse files
committed
minor improvements #3
1 parent 40f0496 commit 8c192fe

File tree

8 files changed

+92
-102
lines changed

8 files changed

+92
-102
lines changed

CITATION.cff

Lines changed: 15 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -14,6 +14,21 @@ authors:
1414
given-names: Stephan
1515
family-names: Reichl
1616
orcid: 'https://orcid.org/0000-0001-8555-7198'
17+
- affiliation: CeMM Research Center for Molecular Medicine
18+
given-names: Raphael
19+
family-names: Bednarsky
20+
orcid: 'https://orcid.org/0009-0005-0404-3424'
21+
- affiliation: CeMM Research Center for Molecular Medicine
22+
given-names: Jake
23+
family-names: Burton
24+
orcid: 'https://orcid.org/0000-0001-7251-8800'
25+
- affiliation: CeMM Research Center for Molecular Medicine
26+
given-names: Fangwen
27+
family-names: Zhao
28+
- affiliation: CeMM Research Center for Molecular Medicine
29+
given-names: Lina
30+
family-names: Dobnikar
31+
orcid: 'https://orcid.org/0000-0002-0616-647X'
1732
- given-names: Christoph
1833
family-names: Bock
1934
orcid: 'https://orcid.org/0000-0001-6091-3088'

README.md

Lines changed: 24 additions & 18 deletions
Original file line numberDiff line numberDiff line change
@@ -7,20 +7,24 @@
77
[![Snakemake](https://img.shields.io/badge/Snakemake->=8.25.3-green)](https://snakemake.readthedocs.io/en/stable/)
88

99
# RNA-seq Data Processing, Quantification & Annotation Pipeline
10-
A [Snakemake 8](https://snakemake.readthedocs.io/en/stable/) workflow for end-to-end processing, quantification, and annotation of gene expression for RNA-seq experiments, starting from single- or paired-end reads within raw/unaligned/unmapped [uBAM](https://gatk.broadinstitute.org/hc/en-us/articles/360035532132-uBAM-Unmapped-BAM-Format) files, including a comprehensive MultiQC report.
10+
A [Snakemake](https://snakemake.readthedocs.io/en/stable/) workflow for end-to-end processing, quantification, and annotation of gene expression for RNA-seq experiments, starting from single- or paired-end reads within raw/unaligned/unmapped [uBAM](https://gatk.broadinstitute.org/hc/en-us/articles/360035532132-uBAM-Unmapped-BAM-Format) files, including a comprehensive MultiQC report.
1111

12-
> [!NOTE]
12+
> [!NOTE]
1313
> This workflow adheres to the module specifications of [MrBiomics](https://github.com/epigen/MrBiomics), an effort to augment research by modularizing (biomedical) data science. For more details, instructions, and modules check out the project's repository.
1414
>
1515
> ⭐️ **Star and share modules you find valuable** 📤 - help others discover them, and guide our future work!
1616
1717
> [!IMPORTANT]
18-
> **If you use this workflow in a publication, please don't forget to give credit to the authors by citing it using this DOI [10.5281/zenodo.15119355](https://doi.org/10.5281/zenodo.15119355) and acknowledging the [rna-seq-star-deseq2 workflow](https://github.com/snakemake-workflows/rna-seq-star-deseq2) DOI [10.5281/zenodo.4737358](https://doi.org/10.5281/zenodo.4737358) from which some structure and code were adapted.**
18+
> **If you use this workflow in a publication, please don't forget to give credit to the authors by citing it using this DOI [10.5281/zenodo.15119355](https://doi.org/10.5281/zenodo.15119355) and acknowledge the [rna-seq-star-deseq2 workflow](https://github.com/snakemake-workflows/rna-seq-star-deseq2) DOI [10.5281/zenodo.4737358](https://doi.org/10.5281/zenodo.4737358) from which some structure and code were adapted.**
1919
2020
![Workflow Rulegraph](./workflow/dags/rulegraph.svg)
2121

2222
# 🖋️ Authors
2323
- [Stephan Reichl](https://github.com/sreichl)
24+
- [Raphael Bednarsky](https://github.com/bednarsky)
25+
- [Jake Burton](https://github.com/burtonjake)
26+
- [Fangwen Zhao](https://github.com/fwzhao)
27+
- [Lina Dobnikar](https://github.com/ld401)
2428
- [Christoph Bock](https://github.com/chrbock)
2529

2630
# 💿 Software
@@ -34,8 +38,9 @@ This project wouldn't be possible without the following software and their depen
3438
| MultiQC | https://doi.org/10.1093/bioinformatics/btw354 |
3539
| RSeQC | https://doi.org/10.1093/bioinformatics/bts356 |
3640
| biomaRt | https://doi.org/10.1038/nprot.2009.97 |
37-
| EDASeq | https://doi.org/10.1186/1471-2105-12-480 |
3841
| gffutils | https://github.com/daler/gffutils |
42+
| GenomicRanges | https://doi.org/10.1371/journal.pcbi.1003118 |
43+
| rtracklayer | https://doi.org/10.1093/bioinformatics/btp328 |
3944
| pandas | https://doi.org/10.5281/zenodo.3509134 |
4045
| fastp | https://doi.org/10.1093/bioinformatics/bty560 |
4146
| pigz | https://zlib.net/pigz/ |
@@ -47,15 +52,15 @@ This is a template for the Methods section of a scientific publication and is in
4752

4853
**Quantification.** Gene expression quantification was performed on the filtered and trimmed reads. The STAR aligner (ver) [ref] was utilized in `--quantMode GeneCounts` mode to count reads overlapping annotated genes based on the Ensembl [ver `config: ref: release`] gene annotation for the [`config: ref: species`] genome (build [`config: ref: build`]) and using parameters [`config["star_args"]`]. Library strandedness (`[none/yes/reverse]`, specified per sample in `config/annotation.csv`) was accounted for during counting. Counts for individual samples were aggregated into a single gene-by-sample count matrix using a custom Python script utilizing the pandas library (ver) [ref]. Quality control metrics from various tools, including fastp (ver) [ref], RSeQC (ver) [ref] and STAR (ver) [ref], were aggregated using MultiQC (ver) [ref].
4954

50-
**Annotation.** Gene annotations from Ensembl were retrieved using the R package biomaRt (ver) [ref]. Annotations included Ensembl gene ID, gene symbol (`external_gene_name`), gene biotype, and description. Additionally, exon-based GC content (`exon_gc`) and cumulative exon length (`exon_length`) were calculated for each gene using a custom R function adapted from EDASeq (ver) [ref], leveraging biomaRt (ver) [ref] to fetch exon coordinates and sequences. This exon-based approach was chosen as sequencing reads in poly(A)-selected libraries primarily derive from exonic regions, making these metrics more appropriate for downstream bias correction (e.g., Conditional Quantile Normalization - CQN) than whole-gene metrics. A sample annotation file was generated, integrating input annotations with QC metrics.
55+
**Annotation.** Gene annotations from Ensembl were retrieved using the R package biomaRt (ver) [ref]. Annotations included Ensembl gene ID, gene symbol (`external_gene_name`), gene biotype, and description. Additionally, exon-based GC content (`exon_gc`) and cumulative exon length (`exon_length`) were calculated for each gene based on the genome GFT and FASTA files using a custom R function, leveraging GenomicRanges (ver) [ref] and rtracklayer (ver) [ref]. This exon-based approach was chosen because only exonic reads are quantified during alignment and contribute to the count matrix (also sequencing reads in poly(A)-selected libraries primarily derive from exonic regions), making these metrics more appropriate for downstream bias correction (e.g., Conditional Quantile Normalization - CQN) than whole-gene metrics (i.e., including introns). A sample annotation file was generated, integrating input annotations with QC metrics.
5156

52-
The processing and quantification described here was performed using a publicly available Snakemake [ver] (ref) workflow [[10.5281/zenodo.15119355](https://doi.org/10.5281/zenodo.15119355)], which adopted code from anotehr workflow (ref) [10.5281/zenodo.4737358](https://doi.org/10.5281/zenodo.4737358).
57+
The processing and quantification described here was performed using a publicly available Snakemake [ver] (ref) workflow [[10.5281/zenodo.15119355](https://doi.org/10.5281/zenodo.15119355)], which adapted code from another workflow (ref) [10.5281/zenodo.4737358](https://doi.org/10.5281/zenodo.4737358).
5358

5459
# 🚀 Features
5560
This workflow offers several key advantages for RNA-seq analysis over existing pipelines:
5661

5762
* **MrBiomics Module:** Designed for modularity and seamless integration with other [MrBiomics](https://github.com/epigen/MrBiomics/) analysis workflows (e.g., filtering/normalization, differential expression, unsupervised analysis) and includes example analysis recipes.
58-
* **Exon-Centric Annotation:** Calculates *exon-based* GC content and length, providing more accurate metrics for downstream bias correction (like CQN) for typical poly(A)-selected RNA-seq libraries.
63+
* **Exon-Centric Annotation:** Calculates *exon-based* GC content and length, providing more accurate metrics for downstream bias correction (like CQN).
5964
* **Robust Input Handling:** Starts directly from raw uBAM files and includes automated checks to ensure consistency between annotated read types (single/paired) and BAM file content.
6065
* **Efficiency Focused:** Optimized for performance and disk space using data streaming between processing steps and temporary intermediate files.
6166
* **User-Friendly:** Offers comprehensive documentation, clear configuration, standard Snakemake usage, and practical usage tips including detailed QC guidelines.
@@ -73,22 +78,23 @@ The workflow performs the following steps that produce the outlined results:
7378
- De-interleaves the filtered FASTQ stream into separate compressed R1 and R2 files for paired-end data, or compresses directly for single-end data using shell commands and `pigz`.
7479
> [!NOTE]
7580
> `fastp` adapter auto-detection is disabled because we use STDIN mode (i.e., stream the data through pipes) to be disk space efficient.
81+
> We do not deduplicate aligned reads. ["...the computational removal of duplicates does improve neither accuracy nor precision and can actually worsen the power and the False Discovery Rate (FDR) for differential gene expression."](https://www.nature.com/articles/srep25533)
7682
- **Quantification:**
7783
- Uses STAR `GeneCounts` to quantify reads per gene based on the specified Ensembl reference genome and annotation (`star/{sample}/`).
7884
- Handles unstranded, forward-stranded, and reverse-stranded library protocols based on the `strandedness` column.
7985
- Aggregates counts into a single matrix (`counts/counts.csv`).
8086
- **Annotation:**
8187
- Outputs gene annotations (`counts/gene_annotation.csv`).
8288
- Retrieves gene annotations (Ensembl ID, gene symbol, biotype, description) from Ensembl using `biomaRt`.
83-
- Calculates **exon-based** GC content and cumulative exon length for each gene, suitable for poly(A) selected libraries.
84-
- Outputs a sample annotation table containing sample-wise general MultiQC statistics (`counts/sample_annotation.csv`).
89+
- Calculates **exon-based** GC content and cumulative exon length for each gene using the genome's GFT and FASTA files.
90+
- Outputs a sample annotation table containing sample-wise general MultiQC statistics (`counts/sample_annotation.csv`).
8591
> [!NOTE]
8692
> Gene annotation can take a while since it depends on the availability of external data sources accessed via `biomaRt`.
8793
>
88-
> GC-content and length are **exon-based**: In poly(A)‑selected libraries (such as Illumina TruSeq, Smart-seq or QuantSeq), the sequencing reads mainly come from exonic regions. Therefore, potential correction for GC bias and gene length should ideally use exon‑level GC content and effective exon length rather than whole‑gene metrics that include introns.
94+
> GC-content and length are **exon-based**, because we only use exonic reads during the count matrix generation. Furthermore, in poly(A)‑selected libraries (such as Illumina TruSeq, Smart-seq or QuantSeq), the sequencing reads mainly come from exonic regions. Therefore, potential correction for GC bias and gene length should use exon‑level GC content and effective exon length rather than whole‑gene metrics that include introns.
8995
- **QC & Reporting:**
90-
- Employs RSeQC tools to generate key quality metrics like strand specificity and read distribution across genomic features (`rseqc/`).
91-
- Aggregates QC metrics from fastp, STAR and RSeQC into a single report using MultiQC (`report/multiqc_report.html`) with [AI summaries](https://seqera.io/blog/ai-summaries-multiqc/).
96+
- Employs `RSeQC` tools to generate key quality metrics like strand specificity and read distribution across genomic features (`rseqc/`).
97+
- Aggregates QC metrics from `fastp`, `STAR` and `RSeQC` into a single report using `MultiQC` (`report/multiqc_report.html`) with [AI summaries](https://seqera.io/blog/ai-summaries-multiqc/).
9298

9399
---
94100

@@ -111,15 +117,14 @@ The workflow produces the following directory structure:
111117
> [!IMPORTANT]
112118
> Resources are downloaded automatically to `resources/{config::project_name}/rnaseq_pipeline/)`, are large (>`3GB`) and have to be manually removed if not needed anymore.
113119
114-
115120
# 🛠️ Usage
116121

117122
Here are some tips for the usage of this workflow:
118123

119124
- Configure and run the workflow first for a few samples.
120125
- Once everything works, run it for all samples.
121-
- If you are unsure about the memory requirements for alignment (the most memory intensive step) provide the Snakemake parameter `--retries X`, where `X` denotes the number of retries, and with every retry the memory is increased to `attempts * config:mem`.
122-
- To save disk space the intermediate gzipped FASTQ files are marked as temporary using Snakemake's `temp()` directive. To remove them upon successful completion you have to include the `--delete-temp-output` flag in your Snakemake command.
126+
- If you are unsure about the memory requirements for alignment (the most memory intensive step) provide the Snakemake parameter `--retries X`, where `X` denotes the number of retries, and with every retry the memory is increased to `attempts * config::mem`.
127+
- To save disk space the intermediate gzipped `FASTQ` files are marked as temporary using Snakemake's `temp()` directive. To remove them upon successful completion you have to run Snakemake with the `--delete-temp-output` flag.
123128

124129
This workflow is written with Snakemake and its usage is described in the [Snakemake Workflow Catalog](https://snakemake.github.io/snakemake-workflow-catalog?usage=epigen/rnaseq_pipeline).
125130

@@ -133,9 +138,10 @@ Explore detailed examples showcasing module usage in comprehensive end-to-end an
133138
# 🔍 Quality Control
134139
Below are some guidelines for the manual quality control of each sample using the generated `MultiQC` report, but keep in mind that every experiment/dataset is different. Thresholds are general suggestions and may vary based on experiment type, organism, and library prep.
135140

136-
- **Alignment Rate (STAR):** % (Uniquely) Mapped Reads > 70-80%. Low rates might indicate contamination or reference issues.
137-
- **Alignment Scores & Gene Counts (STAR):** High proportion of uniquely mapped reads assigned to exonic regions (e.g., >60-70% for poly(A) mRNA-seq). Low rates could suggest gDNA contamination or high intronic reads. In case of many intronic reads or non-poly(A) mRNA-seq protocols do not use exon-based gene annotations (gc-content and length).
138-
- **Read Quality (fastp):** High average quality scores across reads after trimming. Ensure effective adapter removal.
141+
- **Read Depth (STAR)**: Count of `(Uniquely) Mapped Reads` >10M is minimum, >20M reads is optimal for differential expression analysis.
142+
- **Alignment Rate (STAR):** `% (Uniquely) Mapped Reads` >70-80%. Low rates might indicate contamination or reference issues.
143+
- **Alignment Scores & Gene Counts (STAR):** High proportion of uniquely mapped reads assigned to exonic regions (e.g., >60-70% for poly(A) mRNA-seq). Low rates could suggest gDNA contamination or high intronic reads. In case of many intronic reads or non-poly(A) mRNA-seq protocols do not use exon-based gene annotations (gc-content and length) downstream.
144+
- **Read Quality (fastp):** `% > Q30` (=Percentage of bases with Phred score > 30, after filtering/trimming) > 90%. High average quality scores across reads after trimming. Ensure effective adapter removal.
139145
- **Library Complexity (fastp/RSeQC):** % Duplication Rate should not be excessively high (highly variable, interpret in context of expression). Very high rates might indicate low input or PCR issues.
140146
- **Strand Specificity (RSeQC):** For stranded protocols, >90-95% reads should match the expected strand.
141147
- Inspect [**Genome Browser Tracks**](https://github.com/epigen/genome_tracks/) using UCSC Genome Browser (online) or IGV (local)

config/config.yaml

Lines changed: 3 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -27,13 +27,14 @@ adapter_fasta: path/to/adapters.fa
2727

2828
# fastp specific arguments for adapter trimming and quality control filtering
2929
# see: https://github.com/OpenGene/fastp
30-
# useful parameters are --adapter_sequence GTCTCGTGGGCTCGG/auto TODO
30+
# useful parameters are --adapter_sequence GTCTCGTGGGCTCGG/auto
3131
# or provide (nextera) adapter fasta (.fa) file and/or nucleotide adapter sequence of the used protocol --adapter_fasta path/to/adapters.fa
3232
fastp_args: "--adapter_sequence auto --trim_poly_g"
3333

3434
# STAR aligner specific arguments
3535
# see: https://github.com/alexdobin/STAR
36-
star_args: ""
36+
# here we provide the ENCODE standard options for long RNA-seq pipeline (3.3.2 ENCODE options)
37+
star_args: "--alignIntronMax 1000000 --alignIntronMin 20 --alignMatesGapMax 1000000 --alignSJDBoverhangMin 1 --alignSJoverhangMin 8 --outFilterMismatchNmax 999 --outFilterMismatchNoverReadLmax 0.04 --outFilterMultimapNmax 20 --outFilterType BySJout"
3738

3839

3940

workflow/envs/biomart.yaml

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -4,8 +4,10 @@ channels:
44
- nodefaults
55
dependencies:
66
- bioconductor-biomart=2.56.1
7+
- bioconductor-genomicranges=1.52.0
8+
- bioconductor-rsamtools=2.16.0
9+
- bioconductor-rtracklayer=1.60.0
710
- r-tidyverse=2.0.0
8-
- bioconductor-edaseq=2.34.0
911
# remove once we can update to bioconductor-biomart of the 3.18 release, which will
1012
# include this proper fix for the underlying compatibility issue:
1113
# https://github.com/Bioconductor/BiocFileCache/pull/50

workflow/rules/processing.smk

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -47,7 +47,7 @@ rule trim_filter:
4747
input:
4848
bams = lambda wc: annot.loc[wc.sample, "bam_file"],
4949
read_type_check = os.path.join(result_path,".check_read_type","{sample}.done"),
50-
adapter_fasta = config["adapter_fasta"] if config["adapter_fasta"]!="" else []
50+
adapter_fasta = config["adapter_fasta"] if config["adapter_fasta"]!="" else [],
5151
output:
5252
fastq_filtered_R1 = temp(os.path.join(result_path,"fastp","{sample}","{sample}_R1.filtered.fastq.gz")),
5353
fastq_filtered_R2 = temp(os.path.join(result_path,"fastp","{sample}","{sample}_R2.filtered.fastq.gz")),
@@ -100,7 +100,7 @@ rule align:
100100
reads_per_gene = os.path.join(result_path,"star","{sample}","ReadsPerGene.out.tab"),
101101
resources:
102102
# dynamic memory allocation based on attempts (multiple attempts can be configured with --retries X)
103-
mem_mb=lambda wildcards, attempt: attempt*int(config.get("mem", "16000")),
103+
mem_mb=lambda wildcards, attempt: attempt*int(config.get("mem", "32000")),
104104
threads: 24
105105
log:
106106
"logs/star/{sample}.log",

workflow/rules/qc.smk

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -106,6 +106,8 @@ rule rseqc_readdis:
106106
priority: 1
107107
log:
108108
"logs/rseqc/rseqc_readdis/{sample}.log",
109+
resources:
110+
mem_mb=lambda wildcards, input: max(4 * input.size_mb, 4000)
109111
conda:
110112
"../envs/rseqc.yaml"
111113
shell:

0 commit comments

Comments
 (0)