Bulk RNA-seq Time-Course Analysis: ZRN01 Dataset #1 (RM)

Analysis and pipeline prepared by: Ha-Na Shim

Contact: hshim1@uchicago.edu for questions, feedback, or suggestions regarding this workflow.

Tools/packages used

#CRAN
install.packages(c(
"tidyverse",   # includes dplyr, ggplot2, stringr, etc.
"patchwork", "ggrepel", "cowplot", "gridExtra", "scales"
))

#Bioconductor
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install(c("DESeq2", "edgeR", "apeglm", "clusterProfiler", "AnnotationDbi", "org.Hs.eg.db"))

About this repository

This repository documents an applied bulk RNA-seq analysis workflow that I carried out on an unpublished dataset provided by a collaborator. The focus here is on the downstream analysis starting from processed count data. Raw sequencing data (FASTQ files), sample metadata, and other identifying information are intentionally excluded in order to protect unpublished work and maintain confidentiality.

The goal of this repository is to provide a clear example of how bulk RNA-seq data generated in the lab can be taken through quality control, differential gene expression analysis, and downstream functional interpretation using a structured and reproducible workflow. While the dataset itself is specific to a collaborator’s study, the workflow is generalizable and can be applied to new datasets generated in the lab with minimal adaptation.

Part 1 of the analysis focuses on quality control and exploratory data assessment. This includes CPM-based filtering, normalization checks, variance stabilizing transformations, correlation matrices, dendrograms, and principal component analysis with scree and loading plots. These steps help confirm that the input data are suitable for downstream analyses and that no major outliers are present.

Part 2 performs differential gene expression testing with DESeq2 across the experimental contrasts of interest. Results are summarized with volcano plots, MA plots, and histograms of DEG counts. Functional interpretation is included through GO enrichment, Venn and UpSet diagrams of shared DEG sets, and aggregate time-course lineplots for groups of functionally related genes. Figures are exported in SVG format for high-quality, publication-ready visualization.

All code is written in R and organized into two R Markdown notebooks (one for QC, one for DGE and downstream analysis), with knitted HTML reports and all figures included in the repository. Supporting functions are organized into an R/ folder for reuse across analyses. This structure makes it straightforward to extend the workflow to additional datasets while maintaining clarity and reproducibility.

Repository structure

This repository is organized for clarity and reproducibility. Source analyses live in 0-rmarkdown/, their knitted reports are in 0-knit-html/, reusable helpers live under R/, and exported figures are in 1-figures/. Raw data are intentionally not included (unpublished).

Quick tree view

Bulk-RNAseq-visualization-workflow-ZRN01-RM/
├─ README.md
├─ .gitignore
├─ 0-rmarkdown/
│  ├─ 1-preprocessing-and-quality-control.Rmd
│  └─ 2-differential-gene-expression-figures.Rmd
├─ 0-knit-html/
│  ├─ 1-preprocessing-and-quality-control.html
│  └─ 2-differential-gene-expression-figures.html
├─ R/
│  └─ …                                 
└─ 1-figures/
   ├─ quality_control/
   │  ├─ pca_plot_noloadings.svg
   │  ├─ rle_combined_fig.svg
   │  └─ … 
   └─ differential_gene_expression/
      ├─ volcano_combined_2x2_T21_vs_D21.svg
      ├─ GO_enrichment_day7.svg
      └─ …

Folder-by-folder guide

`0-rmarkdown/` — analysis notebooks

1-preprocessing-and-quality-control.Rmd — QC, clustering, PCA, normalization checks.
2-differential-gene-expression-figures.Rmd — DESeq2 contrasts, volcano/MA, GO, UpSet/Venn, lineplots.

`0-knit-html/` — knitted HTML reports

Human-readable exports of the Rmd notebooks for quick review.

`R/` — function library

Reusable helpers supporting the analysis and figure generation.

`1-figures/` — analysis outputs (SVG)

quality_control/ — PCA (± loadings), scree, RLE, dispersion, correlation/dendrogram, etc.
differential_gene_expression/ — Volcano & MA plots, GO enrichment, UpSet/Venn, aggregate lineplots.

Not included

Raw counts and metadata (dataset is unpublished).

How to navigate this project

Skim the knitted reports in 0-knit-html/ for a quick overview.
Open the corresponding Rmd in 0-rmarkdown/ for code + narrative.
Import helpers from R/ as needed.
Figures referenced below live in 1-figures/.

Part 1: Quality control and sample clustering

This section of the workflow focuses on assessing sample quality and identifying potential outlier samples. Although MultiQC or other RNA-seq QC pipelines will likely flag problematic samples upstream of raw count generation, sample clustering can also provide biologically insightful results.

1a. CPM distribution filtering
1b. Distribution of p-values across DESeq2 contrasts of interest
1c. log2FoldChange scatterplots to compare shrinkage vs. no shrinkage
1c. DESeq2 normalization assessment
1d. Sample-to-sample correlation matrix heatmap
1e. PCA plot (± loadings + scree)
1f. Lineplot of gene subtypes across all samples

1a. CPM distribution filtering

Parameters used for CPM filtering

Criterion: CPM > 3 in at least 3 samples


total_genes <- nrow(cpm_matrix)
expressed_genes <- rowSums(cpm_matrix > 3) >= 3
n_expressed <- sum(expressed_genes)
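
The filter above assumes that `cpm_matrix` already exists. As a minimal sketch, the following shows one way such a matrix could be derived from raw counts in base R and then filtered with the same criterion. The toy `counts` matrix, its gene and sample names, and the `filtered_counts` object are made up for illustration; the notebook itself may compute CPM differently (for example with `edgeR::cpm()`).

```r
# Toy genes x samples count matrix (hypothetical values; library sizes are
# unrealistically small, so any nonzero count yields a large CPM)
counts <- matrix(
  c(500, 480, 510, 495,
      0,   1,   0,   2,
    250, 260, 240, 255,
      2,   0,   1,   0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), paste0("sample", 1:4))
)

# Counts per million: divide each column by its library size, scale by 1e6
lib_sizes  <- colSums(counts)
cpm_matrix <- t(t(counts) / lib_sizes) * 1e6

# Keep genes with CPM > 3 in at least 3 samples
expressed_genes <- rowSums(cpm_matrix > 3) >= 3
filtered_counts <- counts[expressed_genes, , drop = FALSE]

cat(sum(expressed_genes), "of", nrow(counts), "genes retained\n")
```

Filtering on CPM rather than raw counts keeps the threshold comparable across samples with different sequencing depths.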

1b. Library size comparison across all samples

All samples appear to share similar library sizes.
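
As a rough sketch of how library sizes can be compared, the block below sums counts per sample and expresses each library relative to the median. The simulated `counts` matrix and the 2-fold flagging threshold are assumptions chosen for illustration, not values taken from this analysis.

```r
# Simulated genes x samples count matrix (hypothetical, not the ZRN01 data)
set.seed(1)
counts <- matrix(rpois(6 * 1000, lambda = 50), ncol = 6,
                 dimnames = list(NULL, paste0("sample", 1:6)))

# Library size = total counts per sample
lib_sizes <- colSums(counts)

# Ratio of each library to the median library size
size_ratio <- lib_sizes / median(lib_sizes)

# Flag samples more than 2-fold away from the median (arbitrary threshold)
flagged <- names(size_ratio)[abs(log2(size_ratio)) > 1]

print(round(size_ratio, 3))
print(flagged)
```

Passing `lib_sizes` to `barplot()` gives the per-sample view typically shown in a QC report.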

1b. Distribution of p-values across DESeq2 contrasts of interest

Interestingly, in this experiment our condition of interest (genotype) appears to induce a large transcriptional perturbation: roughly 9,000 genes pass our p-value cutoff of 0.05. This result is not necessarily problematic, since we will also apply a log2FoldChange cutoff, which should reduce potential false positives.
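
To illustrate how a p-value distribution and the combined cutoffs can be inspected, here is a small sketch on simulated data. The column names `pvalue` and `log2FoldChange` mirror the output of `DESeq2::results()`, but the values are random, and the `lfc_cutoff` of 1 is an assumed threshold rather than the one used in this workflow.

```r
# Simulated results table; values are random and purely illustrative
set.seed(42)
n_genes <- 5000
res <- data.frame(
  log2FoldChange = rnorm(n_genes, mean = 0, sd = 1.5),
  pvalue = c(rbeta(2000, 0.3, 4),   # genes with signal: p-values pile up near 0
             runif(3000))           # null genes: uniform p-values
)

# Bin p-values without plotting; a peak in the first bin indicates signal
p_hist <- hist(res$pvalue, breaks = seq(0, 1, by = 0.05), plot = FALSE)

p_cutoff   <- 0.05
lfc_cutoff <- 1   # assumed threshold, for illustration only

pass_p    <- res$pvalue < p_cutoff
pass_both <- pass_p & abs(res$log2FoldChange) > lfc_cutoff

cat("p-value only:", sum(pass_p), "\n")
cat("p-value + |log2FC| cutoff:", sum(pass_both), "\n")
```

A flat histogram with a spike near zero is the expected shape when many genes respond to the condition; the second count shows how much the effect-size cutoff trims the list.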

1c. log2FoldChange scatterplots to compare effect of DESeq2's apeglm shrinkage vs. no shrinkage

DESeq2 offers a wi

1c. DESeq2 normalization assessment
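
One common way to assess normalization is a relative log expression (RLE) check before and after scaling by size factors. The sketch below re-implements the median-of-ratios idea behind `DESeq2::estimateSizeFactors()` in base R on simulated counts. It is an illustration under assumed data, not the notebook's code, and `rle_medians()` is a hypothetical helper defined here.

```r
# Simulated counts: 6 samples with different sequencing depths (hypothetical)
set.seed(7)
depth  <- c(0.5, 0.8, 1, 1, 1.5, 2)
base   <- rgamma(2000, shape = 2, scale = 100)      # per-gene mean expression
counts <- sapply(depth, function(d) rpois(length(base), base * d))
colnames(counts) <- paste0("sample", seq_along(depth))

# Median-of-ratios size factors
log_counts   <- log(counts)
log_geomeans <- rowMeans(log_counts)       # -Inf if any sample has a zero count
usable       <- is.finite(log_geomeans)    # drop genes with zeros
size_factors <- apply(log_counts[usable, , drop = FALSE], 2, function(lc) {
  exp(median(lc - log_geomeans[usable]))
})

# RLE: per-sample median deviation from the gene-wise median log expression
rle_medians <- function(m) {
  lg <- log2(m + 1)
  apply(lg - apply(lg, 1, median), 2, median)
}

norm_counts <- t(t(counts) / size_factors)
print(round(rbind(raw = rle_medians(counts),
                  normalized = rle_medians(norm_counts)), 2))
```

After a successful normalization the per-sample RLE medians should sit close to zero, whereas the raw values track sequencing depth.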

1d. Sample-to-sample correlation matrix heatmap
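
A minimal sketch of the computation behind a sample-to-sample correlation heatmap and its dendrogram is given here. The expression values are simulated on a log-like scale (in practice they would typically be variance-stabilized values), and the group labels, Pearson correlation, and average linkage are assumptions made for the example.

```r
# Simulated log-scale expression for two groups of three samples (hypothetical)
set.seed(3)
n_genes   <- 500
baseline  <- rnorm(n_genes, mean = 8, sd = 2)
group_eff <- rnorm(n_genes, mean = 0, sd = 1)     # per-gene shift in group B
expr <- cbind(
  sapply(1:3, function(i) baseline + rnorm(n_genes, sd = 0.2)),
  sapply(1:3, function(i) baseline + group_eff + rnorm(n_genes, sd = 0.2))
)
colnames(expr) <- c(paste0("A", 1:3), paste0("B", 1:3))

# Sample-to-sample Pearson correlation
cor_mat <- cor(expr, method = "pearson")

# Hierarchical clustering using 1 - correlation as the distance
hc       <- hclust(as.dist(1 - cor_mat), method = "average")
clusters <- cutree(hc, k = 2)

print(round(cor_mat, 3))
print(clusters)
```

`heatmap(cor_mat, symm = TRUE)` and `plot(hc)` would draw the heatmap and dendrogram from these objects; a sample that clusters away from its replicates is a candidate outlier.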

1e. PCA plot (with and without loadings) + scree plot
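
The quantities behind a PCA plot, scree plot, and loading plot can all be obtained from `prcomp()`. The sketch below uses simulated expression values in which 100 genes differ between two hypothetical groups; the data, group labels, and the choice of ten top loadings are assumptions for illustration.

```r
# Simulated log-scale expression: genes x samples (hypothetical data)
set.seed(11)
n_genes  <- 1000
group    <- rep(c("A", "B"), each = 3)
baseline <- rnorm(n_genes, mean = 8, sd = 2)
effect   <- c(rnorm(100, mean = 0, sd = 2), rep(0, n_genes - 100))
expr <- sapply(group, function(g) {
  baseline + (g == "B") * effect + rnorm(n_genes, sd = 0.3)
})
rownames(expr) <- paste0("gene", seq_len(n_genes))
colnames(expr) <- paste0(group, 1:3)

# PCA with samples as rows; genes are centered but not scaled
pca <- prcomp(t(expr), center = TRUE, scale. = FALSE)

# Scree values: fraction of variance explained by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Loadings: genes with the largest absolute contribution to PC1
top_loadings <- head(sort(abs(pca$rotation[, "PC1"]), decreasing = TRUE), 10)

print(round(var_explained, 3))
print(round(pca$x[, 1:2], 2))
print(names(top_loadings))
```

Plotting `pca$x[, 1:2]` gives the sample scatter, `barplot(var_explained)` gives the scree plot, and the top loadings identify which genes drive the separation along PC1.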

1f. Gene subclass lineplot across all samples

For QC purposes, this plot is likely unnecessary. Still, samples with technical or sequencing issues tend to show an altered number of counts for a particular gene subtype.
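
As a sketch of the underlying calculation, the block below sums counts per gene subtype and converts them to per-sample fractions. It assumes that "gene subtype" corresponds to a biotype-style annotation; the `gene_type` vector, the simulated counts, and the inflated rRNA signal in `sample4` are all made up for illustration.

```r
# Toy counts and a hypothetical gene-type annotation
set.seed(5)
gene_type <- rep(c("protein_coding", "lncRNA", "rRNA"), times = c(80, 15, 5))
counts <- sapply(1:4, function(i) rpois(length(gene_type), lambda = 100))
colnames(counts) <- paste0("sample", 1:4)

# Simulate a technical problem: rRNA carry-over in sample4
is_rrna <- gene_type == "rRNA"
counts[is_rrna, "sample4"] <- counts[is_rrna, "sample4"] * 20

# Sum counts per gene type, then convert to per-sample fractions
type_counts <- rowsum(counts, group = gene_type)
type_frac   <- t(t(type_counts) / colSums(type_counts))

print(round(type_frac, 3))
```

Drawing one line per gene type across samples, for example with `matplot(t(type_frac), type = "b")`, makes a sample like `sample4` stand out immediately.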