
This project contains the code for the manuscript "Explainable Machine Learning Identifies Factors for Dosage Compensation in Aneuploid Human Cancer Cells" by Heller et al. (https://doi.org/10.1101/2025.05.12.653427).


sheltzer-lab/DosageCompensationFactors


DosageCompensationFactors

This project contains the code for the manuscript "Explainable Machine Learning Identifies Factors for Dosage Compensation in Aneuploid Human Cancer Cells" by Heller et al. (https://doi.org/10.1101/2025.05.12.653427). The project is written in R and contains code for producing the figures in the manuscript as well as further analyses.

Abstract

Aneuploidy, a hallmark of cancer, leads to widespread changes in chromosome copy number, altering the abundance of hundreds or thousands of proteins. However, evidence suggests that levels of proteins encoded on affected chromosomes are often buffered toward their abundances observed in diploid cells. Despite its prevalence, the molecular mechanisms driving this protein dosage compensation remain largely unknown. It is unclear whether all proteins are buffered to a similar degree, what factors determine buffering, and whether dosage compensation varies across different cell lines or tumor types. Moreover, its potential adaptive advantage and therapeutic relevance remain unexplored. Here, we established a novel approach to quantify protein dosage buffering in a gene copy number-dependent manner, showing that dosage compensation is widespread but variable in cancer cell lines and in vivo tumor samples. By developing multifactorial machine learning models, we identify mean gene dependency, protein complex participation, haploinsufficiency, and mRNA decay as key predictors of buffering. We also show that dosage compensation can affect oncogenic potential and that higher buffering correlates with reduced proteotoxic stress and increased drug resistance. These findings highlight protein dosage compensation as a crucial regulatory mechanism and a potential therapeutic target in aneuploid cancers.

Requirements

  • R (>= 4.3.0)

Usage

All scripts must be run with this directory (the location of the DosageCompensationFactors.Rproj file) as the working directory. To ensure this, the scripts use the R package here to set the working directory and to build file paths.
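As a minimal sketch of how path building with here can look (the file name below is hypothetical and not taken from the project, and the here package is assumed to be installed):

```r
# Minimal sketch, assuming the 'here' package is installed
# (for example through the project's renv environment).
library(here)

# here() returns the directory that 'here' identified as the project root,
# typically the folder containing the .Rproj file.
project_root <- here()

# Build a path relative to the project root instead of hard-coding it.
# "example_dataset.parquet" is a hypothetical file name for illustration.
example_path <- here("Output", "Data", "example_dataset.parquet")

print(project_root)
print(example_path)
```

Paths built this way do not depend on which subfolder a script is stored in, as long as R was started inside the project.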

R scripts can be executed with an IDE of your choice (RStudio, PyCharm, ...) or from a terminal by running

Rscript ./path/to/script.R

Package installation

This project uses renv to keep track of all required packages. Install the renv package using the following command in an R console:

install.packages("renv")

To restore the environment and install all necessary packages, execute the following command in the R console from the directory where renv.lock is located:

renv::restore()
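The installation steps can also be wrapped into a single helper. The following is only a sketch: setup_environment is a hypothetical name that does not exist in the project, and the version check simply mirrors the requirement listed above.

```r
# Minimal sketch; 'setup_environment' is a hypothetical helper,
# not a function provided by the project.
setup_environment <- function(min_r_version = "4.3.0") {
  # The README lists R (>= 4.3.0) as a requirement.
  if (getRversion() < min_r_version) {
    stop("R >= ", min_r_version, " is required, found ",
         as.character(getRversion()))
  }
  # renv::restore() expects renv.lock in the current project directory.
  if (!file.exists("renv.lock")) {
    stop("renv.lock not found; call this from the project directory.")
  }
  # Install renv only if it is not available yet.
  if (!requireNamespace("renv", quietly = TRUE)) {
    install.packages("renv")
  }
  # prompt = FALSE skips the interactive confirmation.
  renv::restore(prompt = FALSE)
}
```

Calling setup_environment() from the project directory would then install the packages recorded in renv.lock.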

Data Sources

Before running the scripts, all necessary data needs to be placed in the Data folder. The file ./Data/DatasetReferences.md describes how to obtain these datasets and where to place them.

Exported Data Format

The code in this project exports some of its intermediate data in the Apache Parquet format under ./Output/Data. Apache Parquet is a file format that allows for compressed storage and fast reading of datasets, and stores the data type of each variable with the dataset for ease of handling (https://parquet.apache.org/). Parquet files can be read with the arrow R package.
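A minimal sketch of a Parquet round trip with arrow is shown below. It assumes the arrow package is installed; the table and its column names are made up for illustration and do not correspond to any file exported by the project.

```r
# Minimal sketch, assuming the 'arrow' package is installed.
library(arrow)

# Small illustrative table; the column names are hypothetical.
example <- data.frame(
  id = c("a", "b"),
  count = c(2L, 3L),
  value = c(0.5, 1.5),
  stringsAsFactors = FALSE
)

# Write to a temporary Parquet file and read it back.
parquet_file <- tempfile(fileext = ".parquet")
write_parquet(example, parquet_file)
restored <- read_parquet(parquet_file)

# Column types are stored in the file, so no type specification
# is needed when reading.
str(restored)
```

To read one of the exported datasets, the path passed to read_parquet() would point to a file under ./Output/Data instead of a temporary file.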

Folder Structure

  • ./Code: Contains the code of the project. The code subfolders are numbered by order of execution. All files within a subfolder can be executed in parallel. Scripts that are not positioned in a subfolder (e.g. ./Code/analysis.R) contain functions that are sourced and called by other files.
  • ./Data: Contains the data required for running the code.
  • ./Output: Contains all figures, tables, reports, and datasets generated by the code.
  • ./renv: Contains scripts for establishing the renv environment and package management.
  • ./Code/00_Preparation: Contains scripts that need to be executed before analysing the data (e.g. downloading copy number data, mapping cell line and Uniprot IDs)
  • ./Code/01_Preprocessing: Contains the code to preprocess proteomics, copy number, and further datasets.
  • ./Code/02_DosageCompensation: Contains scripts that determine the buffering ratio and buffering classes for all datasets.
  • ./Code/02_DosageCompensation/Evaluation: Contains scripts for visualizing and evaluating calculated buffering ratios. These scripts need to be executed after ./Code/02_DosageCompensation/Evaluation/calculate_dc.R
  • ./Code/03_Analysis: Contains code that analyses multiple aspects of the generated buffering ratios and classes (unifactorial and multifactorial dosage compensation factor analysis, calculating sample/model buffering ratios, etc.).
  • ./Code/03_Analysis/Evaluation: Contains code that evaluates the performance of multifactorial models against each other (out-of-sample prediction) and checks if SHAP values of one factor are confounded by other factors.
  • ./Code/04_Screens: Contains code that uses buffering ratios and sample buffering ratios to identify differences in gene expression, drug response, proliferation, clinical features, mutations, ubiquitination, and gene essentiality between high and low buffering samples.
  • ./Code/05_Publication: Contains the code that generated the figures in the manuscript. Datasets generated in previous scripts are loaded and plots are exported as PDF.
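
The numbering of the code subfolders can be used to derive an execution order. The following is a rough sketch in base R only: list_pipeline_scripts and run_pipeline are hypothetical helpers that are not part of the project, and it is an assumption that every script in a numbered subfolder is meant to be run as is.

```r
# Minimal sketch; 'list_pipeline_scripts' and 'run_pipeline' are hypothetical
# helpers, not functions from the project.
list_pipeline_scripts <- function(code_dir = "Code") {
  # Numbered subfolders (e.g. 00_Preparation) sort into execution order.
  subfolders <- sort(list.dirs(code_dir, recursive = FALSE))
  subfolders <- subfolders[grepl("^[0-9]{2}_", basename(subfolders))]
  # Only scripts directly inside each numbered subfolder are collected.
  # Nested folders such as Evaluation are left out and would need to be
  # run after the scripts of their parent folder.
  lapply(setNames(subfolders, basename(subfolders)), function(folder) {
    sort(list.files(folder, pattern = "\\.R$", full.names = TRUE))
  })
}

run_pipeline <- function(code_dir = "Code") {
  for (scripts in list_pipeline_scripts(code_dir)) {
    for (script in scripts) {
      # Run each script in a separate R process via Rscript.
      status <- system2("Rscript", shQuote(script))
      if (status != 0) stop("Script failed: ", script)
    }
  }
}
```

Scripts located directly in ./Code are skipped by this sketch, since they only contain functions that are sourced by other files.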