Analysis and pipeline prepared by: Ha-Na Shim
Contact: hshim1@uchicago.edu for questions, feedback, or suggestions regarding this workflow.
- Data & Inputs
- Part 1: Quality control and sample assessment
- Downstream DGE & Functional Analysis (stubs)
- Reproducibility
The following packages are required to run the workflow and generate the visualizations below.
R ≥ 4.5.1 (released 2025-06-13). Bioconductor ≥ 3.21 (works with R 4.5.x)
#CRAN
install.packages(c(
"tidyverse", # includes dplyr, ggplot2, stringr, etc.
"patchwork", "ggrepel", "cowplot", "gridExtra", "scales"
))
#Bioconductor
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install(c("DESeq2", "edgeR", "clusterProfiler", "AnnotationDbi", "org.Hs.eg.db"))
This repository documents an applied bulk RNA-seq analysis workflow that I carried out on an unpublished dataset provided by a collaborator. The focus here is on the downstream analysis starting from processed count data. Raw sequencing data (FASTQ files), sample metadata, and other identifying information are intentionally excluded in order to protect unpublished work and maintain confidentiality.
The goal of this repository is to provide a clear example of how bulk RNA-seq data generated in the lab can be taken through quality control, differential gene expression analysis, and downstream functional interpretation using a structured and reproducible workflow. While the dataset itself is specific to a collaborator's study, the workflow is generalizable and can be applied to new datasets generated in the lab with minimal adaptation.
Part 1 of the analysis focuses on quality control and exploratory data assessment. This includes CPM-based filtering, normalization checks, variance stabilizing transformations, correlation matrices, dendrograms, and principal component analysis with scree and loading plots. These steps help confirm that the input data are suitable for downstream analyses and that no major outliers are present.
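As a rough illustration of the clustering and PCA checks described above, here is a minimal base-R sketch on simulated data. The matrix `expr`, the group labels, and the sample names are all hypothetical stand-ins; the actual notebooks work on variance-stabilized DESeq2 output, which is not reproduced here.

```r
# Toy log-scale expression matrix: 200 genes x 6 samples (simulated, hypothetical).
set.seed(1)
n_genes <- 200
group <- rep(c("A", "B"), each = 3)
base_expr <- rnorm(n_genes, mean = 8, sd = 2)
group_shift <- rnorm(n_genes, mean = 0, sd = 1)  # per-gene effect applied to group B
expr <- sapply(seq_along(group), function(i) {
  base_expr + (group[i] == "B") * group_shift + rnorm(n_genes, sd = 0.2)
})
rownames(expr) <- paste0("gene", seq_len(n_genes))
colnames(expr) <- paste0(group, "_rep", rep(1:3, times = 2))

# Sample-to-sample Pearson correlation, then hierarchical clustering
# on 1 - correlation as the distance (input for a heatmap/dendrogram).
sample_cor <- cor(expr, method = "pearson")
sample_tree <- hclust(as.dist(1 - sample_cor), method = "average")
clusters <- cutree(sample_tree, k = 2)

# PCA on samples: prcomp() expects observations in rows, so transpose.
pca <- prcomp(t(expr), center = TRUE, scale. = FALSE)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)  # values for a scree plot
loadings_pc1 <- pca$rotation[, "PC1"]          # per-gene loadings on PC1

print(round(var_explained, 3))
print(clusters)
```

In this toy setup the replicates of each group should fall into the same cluster; a sample that clusters away from its replicates would be a candidate outlier worth a closer look.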
Part 2 performs differential gene expression testing with DESeq2 across the experimental contrasts of interest. Results are summarized with volcano plots, MA plots, and histograms of DEG counts. Functional interpretation is included through GO enrichment, Venn and UpSet diagrams of shared DEG sets, and aggregate time-course lineplots for groups of functionally related genes. Figures are exported in SVG format for high-quality, publication-ready visualization.
All code is written in R and organized into two R Markdown notebooks (one for QC, one for DGE and downstream analysis), with knitted HTML reports and all figures included in the repository. Supporting functions are organized into an R/ folder for reuse across analyses. This structure makes it straightforward to extend the workflow to additional datasets while maintaining clarity and reproducibility.
This repository is organized for clarity and reproducibility. Source analyses live in 0-rmarkdown/, their knitted reports are in 0-knit-html/, reusable helpers live under R/, and exported figures are in 1-figures/. Raw data are intentionally not included (unpublished).
Quick tree view
Bulk-RNAseq-visualization-workflow-ZRN01-RM/
├─ README.md
├─ .gitignore
├─ 0-rmarkdown/
│  ├─ 1-preprocessing-and-quality-control.Rmd
│  └─ 2-differential-gene-expression-figures.Rmd
├─ 0-knit-html/
│  ├─ 1-preprocessing-and-quality-control.html
│  └─ 2-differential-gene-expression-figures.html
├─ R/
│  └─ …
└─ 1-figures/
   ├─ quality_control/
   │  ├─ pca_plot_noloadings.svg
   │  ├─ rle_combined_fig.svg
   │  └─ …
   └─ differential_gene_expression/
      ├─ volcano_combined_2x2_T21_vs_D21.svg
      ├─ GO_enrichment_day7.svg
      └─ …
Folder-by-folder guide
0-rmarkdown/
- 1-preprocessing-and-quality-control.Rmd: QC, clustering, PCA, normalization checks.
- 2-differential-gene-expression-figures.Rmd: DESeq2 contrasts, volcano/MA, GO, UpSet/Venn, lineplots.
0-knit-html/
- Human-readable exports of the Rmd notebooks for quick review.
R/
- Reusable helpers supporting the analysis and figure generation.
1-figures/
- quality_control/: PCA (± loadings), scree, RLE, dispersion, correlation/dendrogram, etc.
- differential_gene_expression/: Volcano & MA plots, GO enrichment, UpSet/Venn, aggregate lineplots.
Not included
- Raw counts and metadata (dataset is unpublished).
How to navigate this project
- Skim the knitted reports in 0-knit-html/ for a quick overview.
- Open the corresponding Rmd in 0-rmarkdown/ for code + narrative.
- Import helpers from R/ as needed.
- Figures referenced below live in 1-figures/.
This section of the workflow focuses on assessing sample quality and identifying potential outlier samples. Although MultiQC or other RNA-seq QC pipelines will likely flag problematic samples upstream of raw count generation, sample clustering can also provide biologically insightful results.
1a. CPM distribution filtering
1b. Distribution of p-values across DESeq2 contrasts of interest
1c. log2FoldChange scatterplots to compare shrinkage vs. no shrinkage
1c. DESeq2 normalization assessment
1d. Sample-to-sample correlation matrix heatmap
1e. PCA plot (± loadings + scree)
1f. Lineplot of gene subtypes across all samples
Criteria: >3 CPM in at least 3 replicates
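# Total number of genes before filtering
total_genes <- nrow(cpm_matrix)
# Keep genes with > 3 CPM in at least 3 samples (columns of cpm_matrix)
expressed_genes <- rowSums(cpm_matrix > 3) >= 3
# Number of genes passing the filter
n_expressed <- sum(expressed_genes)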
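The snippet above assumes `cpm_matrix` already exists. How it was built in the original notebook is not shown here; `edgeR::cpm()` is one common way to obtain it. The sketch below computes CPM by hand in base R on a hypothetical toy count matrix, with library sizes contrived to equal exactly one million so that CPM equals the raw count, and then applies the same criterion.

```r
# Hypothetical toy counts: 4 genes of interest x 6 samples.
small <- rbind(
  geneA = c(10, 12,  9, 11, 10, 13),
  geneB = c( 0,  1,  0,  2,  0,  1),
  geneC = c( 5,  6,  4,  0,  0,  0),
  geneD = c( 8,  7,  0,  0,  0,  0)
)
colnames(small) <- paste0("sample", 1:6)

# Add a "bulk" row so every library sums to exactly 1e6 reads
# (a toy construct that makes CPM equal to the raw count).
counts <- rbind(small, bulk = 1e6 - colSums(small))

# CPM = count / library size * 1e6, computed per sample (column).
lib_sizes <- colSums(counts)
cpm_matrix <- t(t(counts) / lib_sizes) * 1e6

# Same criterion as above: > 3 CPM in at least 3 samples.
total_genes <- nrow(cpm_matrix)
expressed_genes <- rowSums(cpm_matrix > 3) >= 3
n_expressed <- sum(expressed_genes)
filtered_counts <- counts[expressed_genes, , drop = FALSE]

cat(n_expressed, "of", total_genes, "genes pass the filter\n")
```

Here geneB never exceeds 3 CPM and geneD does so in only two samples, so both are dropped, while geneC passes because it clears the threshold in exactly three samples.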
All samples appear to share similar library sizes, so there are
Interestingly, in this experiment our condition of interest (genotype) appears to induce a large transcriptional perturbation: roughly 9,000 genes pass our p-value threshold of 0.05. This result is not necessarily problematic, since we will also apply a log2FoldChange cutoff, which should reduce the number of potential false positives.
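To make the effect of combining the two cutoffs concrete, here is a small base-R sketch on a hypothetical results table whose column names mirror DESeq2 output. The text does not state whether the 0.05 threshold applies to raw or adjusted p-values, nor which log2FoldChange cutoff is used; the sketch assumes `padj` and an illustrative |log2FoldChange| ≥ 1.

```r
# Hypothetical results table with DESeq2-style column names.
res_df <- data.frame(
  gene = paste0("gene", 1:6),
  log2FoldChange = c(2.5, -0.3, -1.8, 0.1, 1.2, -2.0),
  padj = c(0.001, 0.01, 0.03, 0.60, 0.20, NA)
)

padj_cutoff <- 0.05  # threshold from the text (assumed to apply to padj)
lfc_cutoff <- 1      # assumed value, for illustration only

# DESeq2 can report NA for padj (e.g. independent filtering); treat NA as not significant.
passes_padj <- !is.na(res_df$padj) & res_df$padj < padj_cutoff
passes_lfc <- abs(res_df$log2FoldChange) >= lfc_cutoff

# Label each gene as up, down, or not significant ("ns").
res_df$status <- ifelse(
  passes_padj & passes_lfc,
  ifelse(res_df$log2FoldChange > 0, "up", "down"),
  "ns"
)

cat("Pass padj only:", sum(passes_padj), "\n")
cat("Pass padj and log2FC:", sum(passes_padj & passes_lfc), "\n")
print(table(res_df$status))
```

In this toy table three genes pass the p-value filter but only two survive the added fold-change cutoff, which is the kind of reduction the paragraph above relies on.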
DESeq2 offers a wi
For QC purposes, this plot is likely superfluous. Regardless, samples with technical or sequencing issues would be expected to show an altered number of counts for a particular gene subtype.
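One way such a per-subtype summary could be computed is sketched below in base R. It assumes "gene subtype" means an annotation such as gene biotype; the `biotype` vector, the count matrix, and the sample names are hypothetical and are not taken from the dataset.

```r
# Hypothetical counts: 6 genes x 4 samples, with s3 carrying excess rRNA reads.
counts <- matrix(
  c(100, 110,  90, 105,
    200, 190, 210, 195,
     20,  22,  18,  21,
     10,   9,  11,  10,
      5,   6, 300,   5,
      5,   4, 200,   6),
  nrow = 6, byrow = TRUE,
  dimnames = list(paste0("gene", 1:6), paste0("s", 1:4))
)

# Hypothetical subtype annotation, one entry per gene (row).
biotype <- c("protein_coding", "protein_coding", "lncRNA", "lncRNA", "rRNA", "rRNA")

# Sum counts per subtype within each sample.
subtype_totals <- rowsum(counts, group = biotype)

# Convert to the fraction of each sample's reads assigned to each subtype.
subtype_fraction <- sweep(subtype_totals, 2, colSums(subtype_totals), "/")

print(round(subtype_fraction, 3))

# The sample with the largest rRNA fraction stands out in a lineplot of these values.
flagged <- names(which.max(subtype_fraction["rRNA", ]))
cat("Highest rRNA fraction:", flagged, "\n")
```

Each row of `subtype_fraction` can then be drawn as one line across samples; a sample whose line departs sharply from the others, like s3 here, is the pattern described above.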