Commit 21d5aed

Merge pull request #119 from greenelab/envest/update_readme
Envest/update readme
2 parents 4852bec + 64c9817 commit 21d5aed

4 files changed

+135
-61
lines changed

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -11,6 +11,7 @@ models
 results
 plots/main/*.pdf
 plots/supplementary/*.pdf
+plots/visualize_expression/*.pdf
 .Rproj.user
 RNAseq_titration_results.Rproj
 ._RNAseq_titration_results.Rproj

README.md

Lines changed: 134 additions & 61 deletions
Original file line numberDiff line numberDiff line change
@@ -1,113 +1,186 @@
11
# Cross-platform normalization enables machine learning model training on microarray and RNA-seq data simultaneously
22

3-
The full output of a [version](https://github.com/greenelab/RNAseq_titration_results/releases/tag/v1.1) of this analysis is available at Figshare under the DOI: [10.6084/m9.figshare.5035997.v2](https://doi.org/10.6084/m9.figshare.5035997.v2)
3+
<!-- START doctoc generated TOC please keep comment here to allow auto update -->
4+
<!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE -->
5+
**Table of Contents** *generated with [DocToc](https://github.com/thlorenz/doctoc)*
6+
7+
- [Summary](#summary)
8+
- [Requirements](#requirements)
9+
- [Obtaining and running the Docker container](#obtaining-and-running-the-docker-container)
10+
- [Download data from The Cancer Genome Atlas (TCGA)](#download-data-from-the-cancer-genome-atlas-tcga)
11+
- [Recreate manuscript results](#recreate-manuscript-results)
12+
- [Methods](#methods)
13+
- [Machine Learning Pipeline](#machine-learning-pipeline)
14+
- [Differential Expression Pipeline](#differential-expression-pipeline)
15+
- [Running individual experiments](#running-individual-experiments)
16+
- [Machine learning](#machine-learning)
17+
- [Differential expression](#differential-expression)
18+
- [Other scripts](#other-scripts)
19+
- [Manuscript versions](#manuscript-versions)
20+
- [Funding](#funding)
21+
22+
<!-- END doctoc generated TOC please keep comment here to allow auto update -->
423

524
## Summary
625

726
We performed a series of supervised and unsupervised machine learning
8-
evaluations, as well as differential expression analyses, to assess which
27+
evaluations, as well as differential expression and pathway analyses, to assess which
928
normalization methods are best suited for combining data from microarray and
1029
RNA-seq platforms.
1130

12-
We evaluated five normalization approaches for all methods:
31+
We evaluated six normalization approaches for all methods:
1332

1433
1. log-transformation (LOG)
1534
2. [non-paranormal transformation](https://arxiv.org/abs/0903.0649) (NPN)
1635
3. [quantile normalization](http://bmbolstad.com/misc/normalize/bolstad_norm_paper.pdf) (QN)
17-
4. [Training Distribution Matching](https://peerj.com/articles/1621/) (TDM)
18-
5. standardizing scores (z-scoring; Z).
36+
4. quantile normalization followed by z-scoring (QN-Z)
37+
5. [Training Distribution Matching](https://peerj.com/articles/1621/) (TDM)
38+
6. z-scoring (Z)
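
As a rough illustration of what some of these approaches do, here is a minimal pure-Python sketch of LOG, QN, Z, and QN-Z on a genes-by-samples matrix. It is not the code used in this repository (the analysis itself is run in R). The function names, the `+ 1` pseudocount, the choice to z-score each gene across samples, and the simple tie handling in quantile normalization are all assumptions made for the example. NPN and TDM are omitted.

```python
import math
from statistics import mean, pstdev


def log_transform(matrix):
    """log2(x + 1) on every value; the +1 pseudocount is an assumption."""
    return [[math.log2(v + 1) for v in row] for row in matrix]


def z_score_rows(matrix):
    """Standardize each gene (row) to mean 0 and standard deviation 1."""
    scored = []
    for row in matrix:
        mu, sigma = mean(row), pstdev(row)
        # A constant row has no spread, so map it to zeros.
        scored.append([(v - mu) / sigma if sigma > 0 else 0.0 for v in row])
    return scored


def quantile_normalize(matrix):
    """Give every sample (column) the same distribution of values."""
    n_genes, n_samples = len(matrix), len(matrix[0])
    columns = [[matrix[g][s] for g in range(n_genes)] for s in range(n_samples)]
    # Reference distribution: mean of the k-th smallest value across samples.
    sorted_cols = [sorted(col) for col in columns]
    reference = [mean(sc[k] for sc in sorted_cols) for k in range(n_genes)]
    normalized = [[0.0] * n_samples for _ in range(n_genes)]
    for s, col in enumerate(columns):
        # Ties are broken by row order here, which is a simplification.
        order = sorted(range(n_genes), key=lambda g: col[g])
        for rank, g in enumerate(order):
            normalized[g][s] = reference[rank]
    return normalized


# Toy matrix: 4 genes (rows) by 3 samples (columns).
expression = [[5, 4, 3], [2, 1, 4], [3, 4, 6], [4, 2, 8]]
qn = quantile_normalize(expression)
qn_z = z_score_rows(qn)  # QN followed by z-scoring (QN-Z)
```

After quantile normalization every column contains the same set of values, which is the property that makes samples from different platforms comparable under this approach.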
1939

20-
A [version](https://github.com/greenelab/RNAseq_titration_results/releases/tag/v1.0) of this project is detailed in our pre-print
21-
[Cross-Platform Normalization Enables Machine Learning Model Training On Microarray And RNA-Seq Data Simultaneously](https://doi.org/10.1101/118349).
2240

23-
_We are actively making improvements to this codebase; see [#12](https://github.com/greenelab/RNAseq_titration_results/issues/12)._
24-
25-
## Breast Cancer Data
2641

27-
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.58862.svg)](https://doi.org/10.5281/zenodo.58862)
42+
## Requirements
43+
44+
We recommend using the Docker image `envest/rnaseq_titration_results:R-4.1.2` to handle package and dependency installation.
45+
See `docker/R-4.1.2/Dockerfile` for more information.
46+
47+
Our analysis ([v2.0](https://github.com/greenelab/RNAseq_titration_results/releases/tag/v2.0)) was run using 7 cores on an AWS instance with 16 cores, 128 GB memory, and an allocated 1 TB of space.
2848

29-
The Cancer Genome Atlas BRCA data used for these analyses
49+
### Obtaining and running the Docker container
50+
51+
Pull the Docker image using:
52+
53+
```
54+
docker pull envest/rnaseq_titration_results:R-4.1.2
55+
```
56+
57+
Then run the command to start up a container, replacing `[PASSWORD]` with your own password:
58+
59+
```
60+
docker run --mount type=bind,target=/home/rstudio,source=$PWD -e PASSWORD=[PASSWORD] -p 8787:8787 envest/rnaseq_titration_results:R-4.1.2
61+
```
62+
63+
Navigate to <http://localhost:8787/> and log in to the RStudio server with the username `rstudio` and the password you set above.
64+
65+
66+
## Download data from The Cancer Genome Atlas (TCGA)
67+
68+
TCGA data from 520 breast cancer (BRCA) patients used for these analyses
3069
is [available at Zenodo](https://zenodo.org/record/58862).
3170

71+
Data from 150 glioblastoma (GBM) patients is available from the [Genomic Data Commons PanCan Atlas](https://gdc.cancer.gov/about-data/publications/pancanatlas).
72+
73+
To download data, run the data download script in the top directory:
74+
75+
```
76+
bash download_TCGA_data.sh
77+
```
78+
79+
## Recreate manuscript results
80+
81+
After data has been downloaded, running
82+
3283
```
33-
# To download data, run in top directory:
34-
sh brca_data_download.sh
84+
bash run_all_analyses_and_plots.sh
3585
```
3686

37-
## Analysis
87+
with [v2.0](https://github.com/greenelab/RNAseq_titration_results/releases/tag/v2.0) of this repository will reproduce the results presented in our manuscript.
88+
We recommend running all analyses within the project Docker container.
89+
90+
## Methods
3891

3992
### Machine Learning Pipeline
4093

4194
Here's a schematic overview of our machine learning experiments:
4295

43-
![](https://github.com/greenelab/RNAseq_titration_results/blob/master/diagrams/RNA-seq_titration_ML_overview.png)
96+
![](diagrams/RNA-seq_titration_ML_overview.png)
4497

4598
**Overview of supervised and unsupervised machine learning experiments.**
4699

47-
1. 520 TCGA Breast Cancer samples run on both microarray and RNA-seq were split
48-
into a training (2/3) and holdout set (1/3).
49-
2. RNA-seq’d samples were "titrated" into the training set, 10% at a time (0-100%)
50-
resulting in eleven training sets for each normalization method.
51-
3. _Machine learning applications._ Three supervised multi-class (BRCA PAM50 subtype)
52-
classifiers—LASSO, linear SVM, and Random Forest—were trained on each training set
53-
and tested on the microarray and RNA-seq holdout sets. The holdout sets were projected
54-
onto and back out of the training set space using two unsupervised techniques, Independent
55-
and Principal Components Analysis, to obtain reconstructed holdout sets. The
56-
classifiers used in step 4A above were used to predict on the reconstructed holdout
57-
sets.
100+
1. Matched samples run on both microarray and RNA-seq were split into a training (2/3) and holdout set (1/3).
101+
2. RNA-seq samples were "titrated" into the training set, 10% at a time (0-100%), replacing their matched array samples, resulting in eleven training sets for each normalization method.
102+
3. Machine learning applications:
58103

59-
```
60-
# To run the machine learning pipeline, run in top directory:
61-
sh run_machine_learning_experiments.sh
104+
- _Supervised learning_:
105+
We trained three classifiers (LASSO, linear SVM, and Random Forest) on each training set and tested them on the microarray and RNA-seq holdout sets.
106+
The models were trained to predict tumor subtype (both cancer types have 5 subtypes) and the binary mutation status of _TP53_ and _PIK3CA_.
62107

63-
# To run one repeat of the subtype classifier pipeline, use:
64-
Rscript run_experiments.R
65-
```
108+
- _Unsupervised learning_:
109+
We projected holdout sets onto and back out of the training set space using Principal Components Analysis to obtain reconstructed holdout sets.
110+
We then used the trained subtype classifiers to predict on the reconstructed holdout sets.
111+
[PLIER](https://github.com/wgmao/PLIER) (Pathway-Level Information ExtractoR) identified coordinated sets of genes in each cancer type.
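
Steps 1 and 2 of the overview can be sketched as follows. This is a hypothetical pure-Python illustration, not the repository's R implementation: the function names, the fixed seeds, the unstratified split, and drawing the RNA-seq samples independently at each titration level are all assumptions.

```python
import random


def split_train_holdout(sample_ids, train_fraction=2 / 3, seed=0):
    """Randomly split matched samples into training (2/3) and holdout (1/3)."""
    rng = random.Random(seed)
    shuffled = list(sample_ids)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


def titrate(train_ids, seed=0):
    """Build eleven training sets with 0%, 10%, ..., 100% RNA-seq samples.

    Each sample appears exactly once per set: either its RNA-seq profile
    or its matched microarray profile is used, never both.
    """
    rng = random.Random(seed)
    sets = {}
    for pct in range(0, 101, 10):
        n_seq = round(len(train_ids) * pct / 100)
        seq_ids = set(rng.sample(list(train_ids), n_seq))
        sets[pct] = {
            "rnaseq": [s for s in train_ids if s in seq_ids],
            "array": [s for s in train_ids if s not in seq_ids],
        }
    return sets


# Toy example with 30 made-up sample IDs.
samples = [f"S{i}" for i in range(30)]
train, holdout = split_train_holdout(samples)
training_sets = titrate(train)
for pct, platforms in training_sets.items():
    print(pct, len(platforms["rnaseq"]), len(platforms["array"]))
```

Each of the eleven sets would then be normalized with every method before the classifiers are trained.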
66112

67113
### Differential Expression Pipeline
68114

69115
Here's a schematic overview of our main differential expression experiment:
70116

71-
![](https://github.com/greenelab/RNAseq_titration_results/blob/master/diagrams/RNA-seq_titration_diff_expression_overview.png?raw=true)
117+
![](diagrams/RNA-seq_titration_diff_expression_overview.png)
72118

73119
**Overview of differential expression experiment.**
74120

75-
1. All matched TCGA breast cancer samples (n = 520) were considered when building the platform-specific
76-
“silver standards.” These standards are the genes that were differentially
77-
expressed at a specified False Discovery Rate (FDR) using data sets comprised
78-
entirely of one platform and processed in a standard way: log2-transformed
79-
microarray data and “untransformed” RSEM count data (preprocessed using the
80-
`limma::voom` function).
81-
2. RNA-seq’d samples were ‘titrated’ into the data set,
82-
10% at a time (0-100%) resulting in eleven experimental sets for each n
83-
ormalization method.
84-
3. Differentially expressed genes (DEGs) were identified using
85-
the `limma` package. We compared the Her2 and LumA subtypes as well as Basal
86-
v. all other samples.
87-
4. Lists of experimental DEGs were compared to standard gene
88-
sets using Jaccard similarity.
121+
1. All matched samples were considered when building the platform-specific “silver standards.”
122+
These standards are the genes that were differentially expressed at a specified False Discovery Rate (FDR) using data sets composed entirely of one platform and processed in a standard way: log2-transformed
123+
microarray data and “untransformed” RNA-seq data.
124+
2. RNA-seq samples were "titrated" into the data set, 10% at a time (0-100%), replacing their matched array samples, resulting in eleven experimental sets for each normalization method.
125+
3. Differentially expressed genes (DEGs) were identified using the `limma` package.
126+
For BRCA, we compared the Her2 and LumA subtypes as well as Basal v. all other subtypes.
127+
For GBM, we compared the Classical and Mesenchymal subtypes as well as Proneural v. all other subtypes.
128+
4. Lists of experimental DEGs were compared to standard gene sets using Jaccard similarity and Spearman rank correlation.
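
The two comparison measures in step 4 can be illustrated with a small pure-Python sketch. It is not the repository's implementation. The function names are hypothetical, and which per-gene statistic is ranked for the Spearman correlation is not stated here, so the example simply takes two equal-length lists of numbers. Returning 0.0 for two empty gene lists is an arbitrary choice.

```python
import math
from statistics import mean


def jaccard(a, b):
    """Size of the intersection divided by size of the union of two gene lists."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Pearson correlation of the ranks; undefined if either input is constant."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


# Made-up gene lists: 2 shared genes out of 4 distinct genes.
silver_standard = ["GENE_A", "GENE_B", "GENE_C"]
experimental = ["GENE_B", "GENE_C", "GENE_D"]
print(jaccard(silver_standard, experimental))
```

Jaccard similarity only asks whether the same genes were called, while a rank correlation also reflects whether genes are ordered similarly in the two results.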
129+
130+
In the "small n" experiment, between 3 and 50 samples were selected from each subtype for DEG comparison.
131+
132+
133+
## Running individual experiments
134+
135+
#### Machine learning
136+
137+
To run the machine learning pipeline, run the following in the top directory:
89138

90139
```
91-
# Note: This requires the data to be processed to include matched samples only,
92-
# and split into training and test sets (0-expression_data_overlap_and_split.R)
140+
bash run_machine_learning_experiments.sh [cancer type] [prediction task] [n cores]
141+
```
142+
143+
where
93144

94-
# To run the differential expression pipeline, run in top directory:
95-
sh run_differential_expression_experiments.sh
145+
- `[cancer type]` is `BRCA` or `GBM`
146+
- `[prediction task]` is `subtype`, `TP53`, or `PIK3CA`
147+
- `[n cores]` is the number of cores you want to run in parallel
148+
149+
#### Differential expression
150+
151+
⚠️ _This requires the data to be processed to include matched samples only, and split into training and test sets via `0-expression_data_overlap_and_split.R` in the machine learning pipeline._
152+
153+
To run the differential expression pipeline, run the following in the top directory:
154+
155+
```
156+
bash run_differential_expression_experiments.sh [cancer type] [subtype vs others] [subtype vs subtype] [subtype vs subtype small] [n cores]
96157
```
97158

98-
## Requirements
159+
where
160+
161+
- `[cancer type]` is `BRCA` or `GBM`
162+
- `[subtype vs others]` is the subtype to be compared against all other subtypes
163+
- `[subtype vs subtype]` are the two subtypes to be compared (comma-separated, e.g. `Her2,LumA`)
164+
- `[subtype vs subtype small]` are the two subtypes to be compared at small sample sizes (comma-separated, e.g. `Her2,LumA`)
165+
- `[n cores]` is the number of cores you want to run in parallel
166+
167+
#### Other scripts
99168

100-
This analysis was performed in R. It requires R & Bioconductor packages
101-
detailed in `check_installs.R` to be installed.
169+
To search for the number of publicly available microarray and RNA-seq samples from [GEO](https://www.ncbi.nlm.nih.gov/geo/) and [ArrayExpress](https://www.ebi.ac.uk/arrayexpress/), run
102170

103-
One github package (`TDM`) is required. To install, run:
171+
```
172+
python3 search_geo_arrayexpress.py
173+
```
174+
and check the output in `results/array_rnaseq_ratio`.
104175

105-
library(devtools)
106-
devtools::install_github("greenelab/TDM")
176+
## Manuscript versions
107177

108-
**This analysis is [in the process](https://github.com/greenelab/RNAseq_titration_results/issues/18) of being moved to a Docker image.**
178+
| Version | Relevant links |
179+
| :------ | :------------- |
180+
| [v2.0](https://github.com/greenelab/RNAseq_titration_results/releases/tag/v2.0) | [Figshare+ data](https://doi.org/10.25452/figshare.plus.19629864.v1), [Data for plots](https://doi.org/10.6084/m9.figshare.19686453) |
181+
| [v1.1](https://github.com/greenelab/RNAseq_titration_results/releases/tag/v1.1) | [Figshare full results](https://doi.org/10.6084/m9.figshare.5035997.v2) |
182+
| [v1.0](https://github.com/greenelab/RNAseq_titration_results/releases/tag/v1.0) | [Pre-print](https://doi.org/10.1101/118349) |
109183

110184
## Funding
111185

112-
This work was supported the Gordon and Betty Moore Foundation [GBMF 4552] and
113-
the National Institutes of Health [T32-AR007442, U01-TR001263].
186+
This work was supported by the Gordon and Betty Moore Foundation [GBMF 4552], Alex's Lemonade Stand Foundation [GR-000002471], and the National Institutes of Health [T32-AR007442, U01-TR001263, R01-CA237170, K12GM081259].
Binary files changed: -359 KB, -350 KB