Reproducible repository for bulk RNA-seq analyses (colon organoids and keratinocyte cell lines) – from my undergrad 2nd-3rd year (2022-2023)
This repository contains scripts, metadata, raw and processed count tables, and results for differential expression analyses and downstream visualizations. The recent reorganization added a recommended folder layout, helper files for reproducible environments, and a .gitignore tuned to ignore intermediate data files.
.
├── colon organoids/ # Analysis scripts and data for colon organoid experiments
├── keratinocyte/ # Analysis scripts and data for keratinocyte cell lines
├── pipeline/ # Reusable pipeline scripts for RNA-seq analysis
├── data/ # Raw and processed data files
├── results/ # Output files from analyses
├── scripts/ # Utility scripts and shared functions
├── requirements.txt # Python package dependencies
└── R_requirements.R # R package dependencies
The following shell snippet shows a common end-to-end workflow: fetch FASTQ from SRA/GEO, optionally trim, align to a reference, and generate gene-level counts with featureCounts. There is an alternative Salmon (quasi-mapping) example included.
- Bash tools for FASTQ to counts:
  - SRA Toolkit (prefetch, fasterq-dump): download FASTQ from SRA/GEO
  - FastQC: quality control of raw FASTQ files
  - fastp: trimming and filtering reads
  - HISAT2 or STAR: alignment to reference genome
  - samtools: BAM file manipulation
  - subread (featureCounts): gene-level counting
  - Salmon: fast transcript quantification (alignment-free)
Install commands (Ubuntu/Debian):
sudo apt-get update
sudo apt-get install fastqc hisat2 rna-star samtools # the STAR aligner is packaged as rna-star on Debian/Ubuntu
# SRA Toolkit
conda install -c bioconda sra-tools
# fastp
conda install -c bioconda fastp
# subread (featureCounts)
conda install -c bioconda subread
# Salmon
conda install -c bioconda salmon
Or use mamba for faster conda installs.
Role in pipeline:
- FastQC: run after download/trimming for QC reports
- fastp: trim/filter before alignment
- HISAT2/STAR: align reads to genome
- samtools: sort/index BAM files
- featureCounts: count reads per gene
- Salmon: alternative quantification (transcript/gene)
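Before running the workflow, it can help to confirm that the tools listed above are on your PATH. The following is a minimal sketch, not part of the repository: `check_tools` is a hypothetical helper name, and the tool list assumes the HISAT2 route (swap in `STAR` if you use it).

```shell
# Report which of the given commands are missing from PATH.
# check_tools is a hypothetical helper, not a script shipped with this repository.
check_tools() {
  missing=""
  for tool in "$@"; do
    # command -v succeeds only if the shell can resolve the command
    if ! command -v "$tool" >/dev/null 2>&1; then
      missing="${missing:+$missing }$tool"
    fi
  done
  if [ -n "$missing" ]; then
    echo "missing: $missing"
    return 1
  fi
  echo "all tools found"
}

# Tool names as used in the workflow; "|| true" keeps the script going when something is missing.
check_tools prefetch fasterq-dump fastqc fastp hisat2 samtools featureCounts salmon || true
```

The function returns a non-zero status when anything is missing, so a pipeline script could stop early with `check_tools ... || exit 1` instead.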
# variables
SRR=SRR22450503 # replace with SRA run accession
THREADS=8
REF_FA=/path/to/genome.fa
GTF=/path/to/annotations.gtf
HISAT2_INDEX=/path/to/hisat2_index_prefix
OUTDIR=data/raw/${SRR}
mkdir -p ${OUTDIR}
# 1) fetch FASTQ
prefetch ${SRR}
fasterq-dump --split-files --outdir ${OUTDIR} ${SRR}
# 2) trim (optional)
fastp -i ${OUTDIR}/${SRR}_1.fastq -I ${OUTDIR}/${SRR}_2.fastq \
-o ${OUTDIR}/${SRR}_1.trim.fastq -O ${OUTDIR}/${SRR}_2.trim.fastq \
-w ${THREADS} -h ${OUTDIR}/${SRR}.fastp.html -j ${OUTDIR}/${SRR}.fastp.json
# 3) align with HISAT2
hisat2 -p ${THREADS} -x ${HISAT2_INDEX} \
-1 ${OUTDIR}/${SRR}_1.trim.fastq -2 ${OUTDIR}/${SRR}_2.trim.fastq \
| samtools sort -@ ${THREADS} -o ${OUTDIR}/${SRR}.sorted.bam
samtools index ${OUTDIR}/${SRR}.sorted.bam
# 4) count reads at gene level
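# -p marks the library as paired-end; --countReadPairs (Subread >= 2.0.2) counts fragments rather than reads
featureCounts -T ${THREADS} -p --countReadPairs -a ${GTF} -o ${OUTDIR}/${SRR}.featureCounts.txt ${OUTDIR}/${SRR}.sorted.bam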
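# The counts file contains gene-level raw counts suitable for DESeq2/edgeR downstream.
# create salmon index (once); Salmon indexes a transcriptome FASTA, not the genome in REF_FA
TX_FA=/path/to/transcripts.fa # placeholder: transcript sequences matching your annotation
salmon index -t ${TX_FA} -i salmon_index -k 31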
# quantify
salmon quant -i salmon_index -l A \
-1 ${OUTDIR}/${SRR}_1.trim.fastq -2 ${OUTDIR}/${SRR}_2.trim.fastq \
-p ${THREADS} -o ${OUTDIR}/salmon_${SRR}
# export gene-level counts (tximport in R or use salmon2gene tools)
- Replace variables (paths, indexes) with your environment values.
- For GEO datasets that provide raw FASTQ links, use wget or curl instead of prefetch.
- If your reads are single-end, adjust fasterq-dump and alignment commands accordingly.
- Save these pipeline scripts in pipeline/ (created below) and run from project root so outputs go to data/raw/ or data/processed/.
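The single-end adjustment mentioned above could be handled by choosing the HISAT2 read arguments from the files that actually exist. This is a sketch under stated assumptions: `hisat2_read_args` is a hypothetical helper, the paired names follow the workflow above, and the single-end name `<SRR>.trim.fastq` is an assumed convention that you may need to change to match your trimming output.

```shell
# Print the HISAT2 read arguments for one run, based on which trimmed FASTQ files exist.
# hisat2_read_args is a hypothetical helper; <SRR>.trim.fastq is an assumed single-end name.
hisat2_read_args() {
  dir=$1
  srr=$2
  if [ -f "$dir/${srr}_1.trim.fastq" ] && [ -f "$dir/${srr}_2.trim.fastq" ]; then
    # paired-end: mate files go to -1 and -2
    echo "-1 $dir/${srr}_1.trim.fastq -2 $dir/${srr}_2.trim.fastq"
  elif [ -f "$dir/${srr}.trim.fastq" ]; then
    # single-end: unpaired reads go to -U
    echo "-U $dir/${srr}.trim.fastq"
  else
    echo "no trimmed FASTQ found for $srr in $dir" >&2
    return 1
  fi
}

# Intended use (left unquoted on purpose so the arguments split into words):
#   hisat2 -p "$THREADS" -x "$HISAT2_INDEX" $(hisat2_read_args "$OUTDIR" "$SRR") | samtools sort ...
```

Because the output relies on word splitting, this only works when the paths contain no spaces. For single-end data, featureCounts is run without `-p`.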
- colon organoids/, keratinocyte/: existing analysis folders with R scripts, count matrices and results.
- data/: recommended place for data; contains raw/ and processed/ subfolders.
- scripts/: recommended place for analysis and utility scripts.
- results/: figures, tables and exported outputs.
- R_requirements.R: R and Bioconductor package installer to reproduce the R environment.
- requirements.txt: Python dependencies for any auxiliary scripts.
Prerequisites
- R (>= 4.0) and Rtools on Windows if compiling packages.
- Optional: Python 3.8+ if you use Python scripts.
Quick start
- Clone the repository and change into it:
git clone <repo-url>
cd "RNA Seq Analysis"
- Install R packages (run inside an R session):
source("R_requirements.R")
- (Optional) Create and activate a Python virtual environment and install Python dependencies (PowerShell shown; on Linux/macOS activate with source .venv/bin/activate instead):
python -m venv .venv; .\.venv\Scripts\Activate.ps1; pip install -r requirements.txt
- Place raw immutable inputs (FASTQ, original count matrices) in data/raw/. Place derived outputs in data/processed/.
- data/raw/: immutable original inputs.
- data/processed/: derived tables (normalized counts, filtered results).
- scripts/: modular, documented scripts; place helper functions in scripts/utils/ as needed.
- results/: final figures, tables and CSVs used for reporting.
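The layout described above could be created, and the raw inputs protected, with something like the following. It is a sketch only: `init_layout` and `lock_raw` are hypothetical helper names, and removing write permission is just one possible way to treat data/raw/ as immutable.

```shell
# Create the recommended folders under a project root (defaults to the current directory).
# init_layout and lock_raw are hypothetical helpers, not scripts shipped with this repository.
init_layout() {
  root=${1:-.}
  mkdir -p "$root/data/raw" "$root/data/processed" "$root/scripts/utils" "$root/results"
}

# Remove write permission from files already placed in data/raw/
# to guard against accidental edits of the original inputs.
lock_raw() {
  root=${1:-.}
  find "$root/data/raw" -type f -exec chmod a-w {} +
}
```

Run `init_layout` once from the repository root, copy the FASTQ files or original count matrices into data/raw/, then run `lock_raw`.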
The repository .gitignore was updated to ignore *.csv and common R artifacts to avoid committing large or intermediate tables. Important: files already tracked by git will remain tracked. To stop tracking a file already in the repo, run:
git rm --cached path/to/file.csv
git commit -m "Stop tracking large CSV"
To keep specific CSVs tracked, add explicit negation rules (e.g., !colon organoids/meta_colon.csv) to the .gitignore.
- Inspect the R scripts in colon organoids/ and keratinocyte/ to identify pipeline steps.
- Prefer running analyses from the repository root so that relative paths (e.g., data/processed/) resolve correctly.
- Example: run an R script from PowerShell:
Rscript scripts/my_analysis.R
- Record package versions using sessionInfo() in R and include this output in result folders.
- Use data/raw/ vs data/processed/ split to prevent accidental modification of raw inputs.
- Use branches and PRs for large reorganizations.
- Open an issue to propose large structural changes.
- Use branches for work and submit PRs with small, reviewable changes.
This project uses the license in LICENSE. For questions, open an issue or contact me at joy21.dev.pd@gmail.com.