Integrating Gene Expression, Mutation and Copy Number Data to Identify Driver Genes of Recurrent Chromosome-Arm Losses

This repository contains the code associated with the paper “Integrating Gene Expression, Mutation and Copy Number Data to Identify Driver Genes of Recurrent Chromosome-Arm Losses.” The manuscript is available at bioRxiv.

Introduction

The code in this repository is designed to identify driver genes responsible for cancer type–specific recurring arm losses using data from 20 cancer types provided by TCGA. It is written in R (tested on R versions 4.3 and 4.4) and uses Snakemake to orchestrate most of the execution pipeline. The workflow also relies on two external tools, MutSig2CV and GISTIC2.

Setup

The required R packages and Linux dependencies are listed in the dependencies.txt file.
Instructions for installing Snakemake are available in the Snakemake documentation.

Both GISTIC2 and MutSig2CV can be installed directly or run via Docker containers (see below).

Running the Code

The execution of the code involves several steps:

1. Running the Snakemake Pipeline: Calculates the mutation and CNV rates conditional on arm loss versus no arm loss, for each cancer type and frequently lost arm. It also performs differential gene expression and pathway analysis, and prepares the .seg and mutation files for the next step.
2. Running MutSig2CV and GISTIC2: These tools are run for each pair of cancer type and frequently lost arm, separately on the two groups of samples (with and without the arm loss).
3. Running PRODIGY on GISTIC2 Results: This step (also using Snakemake) applies the PRODIGY algorithm to the GISTIC2 results.
4. Summarizing the Results
5. Plotting (Optional)
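
As a rough illustration of the rate calculation in the first step, the sketch below computes the fraction of samples in which a gene is altered, separately for samples with and without a given arm loss. It is written in Python purely for illustration (the repository's pipeline is in R). The function name, the input layout, and the toy data are assumptions, not the repository's actual code or data.

```python
from collections import Counter


def alteration_rates_by_arm_status(samples):
    """Fraction of samples with the gene altered, per arm-loss group.

    `samples` is an iterable of (has_arm_loss, gene_altered) boolean pairs,
    one per tumor sample. This input layout is hypothetical.
    """
    totals = Counter()
    altered = Counter()
    for has_arm_loss, gene_altered in samples:
        group = "arm_loss" if has_arm_loss else "no_arm_loss"
        totals[group] += 1
        if gene_altered:
            altered[group] += 1
    # Only groups that actually contain samples appear in the result.
    return {group: altered[group] / totals[group] for group in totals}


# Toy data (made up): 2 samples with the arm loss, 4 without.
toy_samples = [
    (True, True), (True, False),
    (False, True), (False, False), (False, False), (False, False),
]
rates = alteration_rates_by_arm_status(toy_samples)
print(rates)
```

In this toy input the gene is altered in half of the samples with the arm loss and in a quarter of those without it; the real pipeline would compute such rates per gene, cancer type, and frequently lost arm.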

Running the Snakemake Pipeline

The configuration file that specifies the data, logs, and output directories is located at snakemake_scripts/config.yaml. By default, these directories are created under the root of the repository. The pipeline expects the following inputs:

GISTIC2 Results: Cancer type–specific results downloaded from Broad GDAC.
Raw Gene Counts: Downloaded using TCGAbiolinks (htseq-count data from the legacy archive, which is no longer available).
FPKM Gene Counts: Downloaded using TCGAbiolinks.

Since the TCGA legacy archive is no longer accessible through TCGAbiolinks, and to simplify data access, the required data has been uploaded to TODO. Please place this data in the data directory.

To run the pipeline, execute the following commands:

cd snakemake_scripts
snakemake all --cores NUMBER_OF_CORES_TO_USE
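
If the pipeline needs to be launched from a script instead of an interactive shell, the same command can be assembled programmatically. The sketch below is an assumption-laden convenience wrapper, not part of the repository: the function names are hypothetical, the core count defaults to whatever `os.cpu_count()` reports, and the optional `-n` flag asks Snakemake for a dry run that lists the planned jobs without executing them.

```python
import os
import subprocess


def snakemake_command(cores=None, dry_run=False):
    """Build the argument list for `snakemake all --cores N`."""
    if cores is None:
        # Fall back to a single core if the count cannot be determined.
        cores = os.cpu_count() or 1
    cmd = ["snakemake", "all", "--cores", str(cores)]
    if dry_run:
        cmd.append("-n")  # Snakemake's dry-run flag
    return cmd


def run_pipeline(cores=None, dry_run=False, workdir="snakemake_scripts"):
    """Run the command from the snakemake_scripts directory (hypothetical helper)."""
    subprocess.run(snakemake_command(cores, dry_run), cwd=workdir, check=True)
```

Calling `run_pipeline(dry_run=True)` from the repository root before the full run is a cheap way to check that the inputs in the data directory are picked up.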

Running MutSig2CV and GISTIC2

We encountered errors when trying to run multiple instances of these tools in parallel, so we created several Docker containers for each tool. The Docker image for GISTIC2 is available here and for MutSigCV here.

Note that the MutSigCV container does not actually have MutSig2CV installed and is only used for its Matlab setup. Alternatively, you can use a Docker image that includes MutSig2CV.

Download MutSig2CV by executing the following commands from the root of the repository:

wget http://software.broadinstitute.org/cancer/cga/sites/default/files/data/tools/mutsig/MutSig2CV.tar.gz
tar -xvzf MutSig2CV.tar.gz

The markers, reference, and CNV files required for running GISTIC2 are also available here and should be placed in the data directory. These files were also downloaded from Broad GDAC.

There are 460 runs of each tool, so the process can take a long time: using 5 containers for each tool, the runs took approximately 72 hours. The GISTIC2 and MutSig2CV runs can be executed in parallel with each other. Note that for step 3, only the GISTIC2 results are needed.
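
To give a feel for these numbers, the sketch below spreads the runs over a set of containers in round-robin order. It is only an illustration in Python: how run_gistic.sh and run_mutsig.sh actually distribute work is not described here, the container names and pair identifiers are placeholders, and the count of 230 pairs is an inference from 460 runs split over two sample groups per pair.

```python
from itertools import cycle, product

# The two sample groups each tool is run on for every (cancer type, arm) pair.
GROUPS = ("with_arm_loss", "without_arm_loss")


def assign_runs(pairs, container_names):
    """Distribute one run per (pair, group) across containers, round-robin."""
    runs = [(cancer, arm, group) for (cancer, arm), group in product(pairs, GROUPS)]
    schedule = {name: [] for name in container_names}
    for run, name in zip(runs, cycle(container_names)):
        schedule[name].append(run)
    return schedule


# Placeholder identifiers: 230 pairs x 2 groups gives the 460 runs per tool.
pairs = [(f"cancer_type_{i}", "arm") for i in range(230)]
containers = [f"container_{i}" for i in range(1, 6)]  # hypothetical names
schedule = assign_runs(pairs, containers)
runs_per_container = {name: len(runs) for name, runs in schedule.items()}
print(runs_per_container)

# Rough average wall time per run, assuming each container works serially.
hours_per_run = 72 / max(runs_per_container.values())
print(round(hours_per_run * 60), "minutes per run")
```

Under these assumptions each of the 5 containers handles 92 runs, which puts the average at roughly 47 minutes per run for the reported 72 hours.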

To run all GISTIC instances, first insert the container names in the container_names list in code/run_gistic.sh, then run the script from the code directory:

cd code
./run_gistic.sh out data log/GISTIC

To run all MutSig2CV instances, first insert the container names in the container_names list in code/run_mutsig.sh, then run the script from the code directory (preferably in a different terminal to run these in parallel):

cd code
./run_mutsig.sh out data log/MutSig
