DEGA is a reproducible pipeline and interactive Jupyter notebook for performing differential gene expression (DGE) analysis on gene expression datasets. The repository includes a fully documented notebook (`DEGA.ipynb`) in the colab folder that runs the analysis, plus an outputs folder containing publication-ready tables, plots, and summaries generated by the notebook.
2025-08-05: Analysis run and outputs exported. This release includes:
- `publication_ready_results.csv` and `comprehensive_deg_results.csv`.
- Figures: `volcano_plot.png`, `ma_plot.png`, `top_genes_heatmap.png`, `top_genes_boxplots.png`, `exploratory_analysis.png`, `quality_assessment.png`, and more.
- `statistical_summary.txt` summarizing the key metrics from the run.
DEGA is intended for researchers who want a clear, reproducible workflow to go from raw or pre-processed expression tables to:
- Quality assessment & exploratory data analysis (PCA, clustering, QC plots)
- Differential expression testing (fold-change, adjusted p-values)
- Visualization (volcano, MA, heatmaps, boxplots)
- Export of publication-ready tables
The repository contains notebooks that perform the entire analysis and write their output files to the outputs folder.
--
The notebook requires a standard Python stack. The first cell installs and imports the dependencies used in the analysis. Creating an isolated environment is recommended:
# create environment (conda recommended)
conda create -n dega python=3.10 -y
conda activate dega
# install core packages
pip install --upgrade pip
pip install jupyterlab geoquery GEOparse pandas numpy scipy matplotlib seaborn scikit-learn rpy2
# optional extras used by the notebook for exporting/figures
pip install openpyxl xlsxwriter plotly kaleido adjustText
The notebook's first cell includes `pip install` statements so it can be run in a fresh Colab/Binder session as well.
Open the notebook in Google Colab (or run locally) — the setup cell will install dependencies automatically.
`DEGA.ipynb` is divided into the following high-level sections (run in this order):
- Install and import libraries: ensures all Python/R dependencies are available.
- Load data & sample: load expression matrices and the sample metadata file (or download via GEO if configured).
- Preprocessing & filtering: low-expression filtering and optional normalization steps.
- Exploratory data analysis: PCA, sample QC, sample clustering, QC plots.
- Differential expression testing: statistical tests, p-value adjustment, fold-change calculation.
- Post-processing & filtering: select significant genes by p-value and log2 fold-change thresholds.
- Visualization: volcano plot, MA-plot, heatmaps, boxplots for top genes.
- Export results: write `comprehensive_deg_results.csv`, `publication_ready_results.csv`, figures, and a `statistical_summary.txt`.
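The p-value adjustment and fold-change steps listed above can be illustrated with a small standard-library sketch. It is not taken from the notebook: the function names, the pseudocount, and the choice of the Benjamini-Hochberg procedure are assumptions made for illustration, and the notebook may use a different correction or compute fold changes on already log-transformed data.

```python
import math


def log2_fold_change(mean_treated, mean_control, pseudocount=1.0):
    """log2 ratio of group means on the raw (non-log) scale.

    The pseudocount (an assumption of this sketch) avoids division by zero.
    """
    return math.log2((mean_treated + pseudocount) / (mean_control + pseudocount))


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(p_values)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for five genes.
raw_p = [0.001, 0.008, 0.039, 0.041, 0.60]
adj_p = benjamini_hochberg(raw_p)
for raw, adj in zip(raw_p, adj_p):
    print(f"raw={raw:.3f}  adjusted={adj:.5f}")
```

Note that the third and fourth genes receive the same adjusted value: the procedure takes a running minimum so that adjusted p-values never decrease as raw p-values increase.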
Notes:
- The notebook defines threshold variables (e.g. `p_threshold`, `log2fc_threshold`) near the DGE section; adjust them before running the visualization cells.
- The notebook prints progress and places outputs in the local working directory (see the `deg_analysis_output` zip for an example layout).
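As a rough sketch of how the two thresholds might interact, the snippet below labels genes as up, down, or not significant. Only the variable names `p_threshold` and `log2fc_threshold` come from the notebook; the values 0.05 and 1.0, the helper `classify_gene`, and the example genes are hypothetical.

```python
# Assumed threshold values; the notebook's actual defaults may differ.
p_threshold = 0.05
log2fc_threshold = 1.0


def classify_gene(log2fc, adj_p, p_cut, fc_cut):
    """Label a gene using an adjusted p-value cut and an absolute log2FC cut."""
    if adj_p >= p_cut or abs(log2fc) < fc_cut:
        return "not_significant"
    return "up" if log2fc > 0 else "down"


# Hypothetical rows standing in for a results table.
genes = [
    {"gene": "GENE_A", "log2fc": 2.3, "adj_p": 0.001},
    {"gene": "GENE_B", "log2fc": -1.7, "adj_p": 0.010},
    {"gene": "GENE_C", "log2fc": 0.4, "adj_p": 0.002},   # effect too small
    {"gene": "GENE_D", "log2fc": 3.1, "adj_p": 0.200},   # not significant
]

labels = {
    g["gene"]: classify_gene(g["log2fc"], g["adj_p"], p_threshold, log2fc_threshold)
    for g in genes
}
print(labels)
```

Passing the thresholds explicitly, rather than reading globals inside the function, makes it easier to re-run the classification with different cut-offs without re-executing earlier cells.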
- `comprehensive_deg_results.csv`: full results table containing expression means, standard deviations, log2 fold-change, raw and adjusted p-values, and flags for significance/regulation.
- `publication_ready_results.csv`: curated table ready for inclusion in papers/supplementary material.
- `supplementary_all_genes_analysis.csv`: additional summary metrics for all genes.
- `expression_filtered.csv`: filtered expression matrix used for downstream analysis.
- `statistical_summary.txt`: short text summary (date of analysis, number of genes analyzed, counts of up/downregulated genes, etc.).
- Figures: `volcano_plot.png`, `ma_plot.png`, `top_genes_heatmap.png`, `top_genes_boxplots.png`, `exploratory_analysis.png`, `quality_assessment.png`, `expression_clusters.png`, `treatment_effect_preview.png`, `comprehensive_treatment_validation.png`.

The latest `statistical_summary.txt` (analysis date: 2025-08-05) reports:
- Total genes analyzed: 1000
- Significant genes: 1 (Percent significant: 0.10%)
- Upregulated genes: 54
- Downregulated genes: 77
- Mean fold change (significant): 2.04
See `deg_analysis_output/statistical_summary.txt` for the full summary and top gene lists.
The notebook also demonstrates how to regenerate figures and adjust significance thresholds.
repo-root/
├── colab # Main analysis notebook
├── notebooks # Each cell in a separate file
├── outputs # Expected results (exported CSVs and figures)
├── requirements.txt # Requirements for using the repo
└── README.md # This file
- The notebook tries to install exact Python packages at runtime (see the first cell).
- For full reproducibility, record the output of `pip freeze` or export the conda environment before running the analysis.
- If results are to be used in publications, set random seeds and record the software versions used (the notebook prints the analysis date in `statistical_summary.txt`).
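One possible way to follow the seed and version advice from inside a notebook cell is sketched below. The helpers `set_seed` and `collect_versions`, the seed value 42, and the package list are illustrative assumptions, not part of DEGA; only standard-library calls are used, and NumPy is seeded only when it happens to be installed.

```python
import json
import platform
import random
from importlib import metadata


def set_seed(seed=42):
    """Seed Python's RNG, and NumPy's legacy global RNG if NumPy is available."""
    random.seed(seed)
    try:
        import numpy as np  # optional; skipped when NumPy is not installed
        np.random.seed(seed)
    except ImportError:
        pass
    return seed


def collect_versions(packages):
    """Map each package name to its installed version string."""
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


seed = set_seed(42)
versions = collect_versions(["pandas", "numpy", "scipy"])
# The JSON text could be saved next to the exported results.
print(json.dumps({"seed": seed, "versions": versions}, indent=2))
```

This complements rather than replaces `pip freeze`: it captures only the packages named in the list, which keeps the record short enough to include alongside a results table.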
Contributions and issues are welcome. Please open an issue describing the request or submit a pull request with tests and updated notebook outputs where appropriate. Suggested improvements:
- Add a command-line wrapper to run the pipeline headlessly.
- Add unit tests for core pre-processing functions.
- Add support for common normalization methods (DESeq2 via rpy2, limma-voom, edgeR).