Skip to content

🧬Custom Bulk RNA-seq workflow for QC, DESeq2 differential expression, and visualization in R 🧬 (RNA-seq, bioinformatics, bulk-rnaseq, visualization, DESeq2, edgeR)

Notifications You must be signed in to change notification settings

hshim1/Bulk-RNAseq-visualization-workflow-ZRN01

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

53 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

Bulk RNA-seq Time-Course Analysis: ZRN01 Dataset #1 (RM)

Analysis and pipeline prepared by: Ha-Na Shim

Contact: hshim1@uchicago.edu for questions, feedback, or suggestions regarding this workflow.

Tools/packages used

R RStudio pheatmap ggplot2 tidyverse nf-core Snakemake Salmon Kallisto MultiQC

DESeq2 hex clusterProfiler hex ggplot2 hex tidyverse hex dplyr hex edgeR hex

Table of Contents

Required packages and R version

The following packages are required to run the workflow and generate the visualizations below.

R β‰₯ 4.5.1 (released 2025-06-13). Bioconductor β‰₯ 3.21 (works with R 4.5.x)

#CRAN
install.packages(c(
"tidyverse",   # includes dplyr, ggplot2, stringr, etc.
"patchwork", "ggrepel", "cowplot", "gridExtra", "scales"
))

#Bioconductor
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install(c("DESeq2", "edgeR", "clusterProfiler", "AnnotationDbi", "org.Hs.eg.db"))  

About this repository

This repository documents an applied bulk RNA-seq analysis workflow that I carried out on an unpublished dataset provided by a collaborator. The focus here is on the downstream analysis starting from processed count data. Raw sequencing data (FASTQ files), sample metadata, and other identifying information are intentionally excluded in order to protect unpublished work and maintain confidentiality.

The goal of this repository is to provide a clear example of how bulk RNA-seq data generated in the lab can be taken through quality control, differential gene expression analysis, and downstream functional interpretation using a structured and reproducible workflow. While the dataset itself is specific to a collaborator’s study, the workflow is generalizable and can be applied to new datasets generated in the lab with minimal adaptation.

Part 1 of the analysis focuses on quality control and exploratory data assessment. This includes CPM-based filtering, normalization checks, variance stabilizing transformations, correlation matrices, dendrograms, and principal component analysis with scree and loading plots. These steps help confirm that the input data are suitable for downstream analyses and that no major outliers are present.

Part 2 performs differential gene expression testing with DESeq2 across the experimental contrasts of interest. Results are summarized with volcano plots, MA plots, and histograms of DEG counts. Functional interpretation is included through GO enrichment, Venn and UpSet diagrams of shared DEG sets, and aggregate time-course lineplots for groups of functionally related genes. Figures are exported in SVG format for high-quality, publication-ready visualization.

All code is written in R and organized into two R Markdown notebooks (one for QC, one for DGE and downstream analysis), with knitted HTML reports and all figures included in the repository. Supporting functions are organized into an R/ folder for reuse across analyses. This structure makes it straightforward to extend the workflow to additional datasets while maintaining clarity and reproducibility.

Repository structure

This repository is organized for clarity and reproducibility. Source analyses live in 0-rmarkdown/, their knitted reports are in 0-knit-html/, reusable helpers live under R/, and exported figures are in 1-figures/. Raw data are intentionally not included (unpublished).

Quick tree view
Bulk-RNAseq-visualization-workflow-ZRN01-RM/
β”œβ”€ README.md
β”œβ”€ .gitignore
β”œβ”€ 0-rmarkdown/
β”‚  β”œβ”€ 1-preprocessing-and-quality-control.Rmd
β”‚  └─ 2-differential-gene-expression-figures.Rmd
β”œβ”€ 0-knit-html/
β”‚  β”œβ”€ 1-preprocessing-and-quality-control.html
β”‚  └─ 2-differential-gene-expression-figures.html
β”œβ”€ R/
β”‚  └─ …                                 
└─ 1-figures/
   β”œβ”€ quality_control/
   β”‚  β”œβ”€ pca_plot_noloadings.svg
   β”‚  β”œβ”€ rle_combined_fig.svg
   β”‚  └─ … 
   └─ differential_gene_expression/
      β”œβ”€ volcano_combined_2x2_T21_vs_D21.svg
      β”œβ”€ GO_enrichment_day7.svg
      └─ …
  
Folder-by-folder guide

0-rmarkdown/ β€” analysis notebooks

  • 1-preprocessing-and-quality-control.Rmd β€” QC, clustering, PCA, normalization checks.
  • 2-differential-gene-expression-figures.Rmd β€” DESeq2 contrasts, volcano/MA, GO, UpSet/Venn, lineplots.

0-knit-html/ β€” knitted HTML reports

  • Human-readable exports of the Rmd notebooks for quick review.

R/ β€” function library

  • Reusable helpers supporting the analysis and figure generation.

1-figures/ β€” analysis outputs (SVG)

  • quality_control/ β€” PCA (Β± loadings), scree, RLE, dispersion, correlation/dendrogram, etc.
  • differential_gene_expression/ β€” Volcano & MA plots, GO enrichment, UpSet/Venn, aggregate lineplots.

Not included

  • Raw counts and metadata (dataset is unpublished).
How to navigate this project
  1. Skim the knitted reports in 0-knit-html/ for a quick overview.
  2. Open the corresponding Rmd in 0-rmarkdown/ for code + narrative.
  3. Import helpers from R/ as needed.
  4. Figures referenced below live in 1-figures/.

Part 1: Quality control and sample clustering

This section of the workflow is primarily focused on assessing the quality of samples and to identify potential outlier samples. Although multiQC or other RNAseq QC pipelines will likely identify problematic samples upstream of generation of raw counts, sample clustering can also provide biologically insightful results.

1a. CPM distribution filtering 1b. Distribution of p-values across DESeq2 contrasts of interest 1c. log2FoldChange scatterplots to compare shrinkage vs. no shrinkage 1c. DESeq2 normalization assessment 1d. Sample-to-sample correlation matrix heatmap 1e. PCA plot (Β± loadings + Scree) 1f. Lineplot of gene subtypes across all samples

1a. CPM distribution filtering

CPM distribution F1–F2 CPM distribution F3–F4

Parameters used for CPM filtering

Criteria: >3 CPM in at least 3 replicates


total_genes <- nrow(cpm_matrix)
expressed_genes <- rowSums(cpm_matrix > 3) >= 3
n_expressed <- sum(expressed_genes)

1b. Library size comparison across all samples

Library size boxplot

All samples appear to share similar library sizes, so there are

1b. Distribution of p-values across DESeq2 contrasts of interest

p-value distributions

Interestingly, in the case of this experiment, our condition of interest (genotype) appears to induce a large transcriptional pertubation. ~9000 genes pass our pvalue filter of 0.05. This result is not necessarily problematic since we will apply a log2FoldChange cutoff, which should reduce potential FPs.

1c. log2FoldChange scatterplots to compare effect of DESeq2's apeglm shrinkage vs. no shrinkage

Shrinkage comparison

DESeq2 offers a wi

1c. DESeq2 normalization assessment

DESeq2 dispersion plot RLE boxplot visualization

1d. sample-to-sample correlation matrix heatmap

Sample-to-sample correlation matrix heatmap Sample dendrogram

1e. PCA plot (with and withlout loadings + SCREE plot

PCA plot without loadings PCA plot with loadings

Scree plot

1f. Gene subclass lineplot across

Gene biotype lineplot

For QC purposes, this plot is likely superfluous/unnecessary. Regardless, samples with technical/sequencing issues will show an altered number of counts for a particular gene subtype.

Part 2: Differential gene expression analysis

Histogram of upregulated and downregulated DEG counts for each timepoint

DEG histogram

2a. Volcano plots of differentially expressed genes

Volcano combined 2Γ—2 β€” T21 vs D21

2aa. MA plots of differentially expressed genes

MA plots T21 vs D21

Gene ontology (GO) enrichment analysis of upregulated and downregulated differentially expressed genes

GO enrichment β€” Day 4

GO enrichment β€” Day 7

GO enrichment β€” Day 10

GO enrichment β€” Day 15

edgeR Counts per million (CPM) boxpltos of genes of interest over course of differentiation

Venn diagrams of upregulated + downregulated DEGs shared across all 4 timepoints

UpSet β€” Upregulated Venn β€” Upregulated

UpSet β€” Downregulated Venn β€” Downregulated

Aggregate gene expression lineplots to visualize change in GO term-associated gene expression over time

Combined lineplots grid

Heatmap of clustered gene expression and aggregate expression of each hierarchically clustered module

Heatmap of top 1000 variable genes Lineplots of top 500 variable genes

GSEA of custom GO lists

CPM boxplots for genes of interest

About

🧬Custom Bulk RNA-seq workflow for QC, DESeq2 differential expression, and visualization in R 🧬 (RNA-seq, bioinformatics, bulk-rnaseq, visualization, DESeq2, edgeR)

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published