|
1 | 1 | # Cross-platform normalization enables machine learning model training on microarray and RNA-seq data simultaneously |
2 | 2 |
|
3 | | -The full output of a [version](https://github.com/greenelab/RNAseq_titration_results/releases/tag/v1.1) of this analysis is available at Figshare under the DOI: [10.6084/m9.figshare.5035997.v2](https://doi.org/10.6084/m9.figshare.5035997.v2) |
| 3 | +<!-- START doctoc generated TOC please keep comment here to allow auto update --> |
| 4 | +<!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE --> |
| 5 | +**Table of Contents** *generated with [DocToc](https://github.com/thlorenz/doctoc)* |
| 6 | + |
| 7 | +- [Summary](#summary) |
| 8 | +- [Requirements](#requirements) |
| 9 | + - [Obtaining and running the Docker container](#obtaining-and-running-the-docker-container) |
| 10 | +- [Download data from The Cancer Genome Atlas (TCGA)](#download-data-from-the-cancer-genome-atlas-tcga) |
| 11 | +- [Recreate manuscript results](#recreate-manuscript-results) |
| 12 | +- [Methods](#methods) |
| 13 | + - [Machine Learning Pipeline](#machine-learning-pipeline) |
| 14 | + - [Differential Expression Pipeline](#differential-expression-pipeline) |
| 15 | +- [Running individual experiments](#running-individual-experiments) |
| 16 | + - [Machine learning](#machine-learning) |
| 17 | + - [Differential expression](#differential-expression) |
| 18 | + - [Other scripts](#other-scripts) |
| 19 | +- [Manuscript versions](#manuscript-versions) |
| 20 | +- [Funding](#funding) |
| 21 | + |
| 22 | +<!-- END doctoc generated TOC please keep comment here to allow auto update --> |
4 | 23 |
|
5 | 24 | ## Summary |
6 | 25 |
|
7 | 26 | We performed a series of supervised and unsupervised machine learning |
8 | | -evaluations, as well as differential expression analyses, to assess which |
| 27 | +evaluations, as well as differential expression and pathway analyses, to assess which |
9 | 28 | normalization methods are best suited for combining data from microarray and |
10 | 29 | RNA-seq platforms. |
11 | 30 |
|
12 | | -We evaluated five normalization approaches for all methods: |
| 31 | +We evaluated six normalization approaches for all methods: |
13 | 32 |
|
14 | 33 | 1. log-transformation (LOG) |
15 | 34 | 2. [non-paranormal transformation](https://arxiv.org/abs/0903.0649) (NPN) |
16 | 35 | 3. [quantile normalization](http://bmbolstad.com/misc/normalize/bolstad_norm_paper.pdf) (QN) |
17 | | -4. [Training Distribution Matching](https://peerj.com/articles/1621/) (TDM) |
18 | | -5. standardizing scores (z-scoring; Z). |
| 36 | +4. quantile normalization followed by z-scoring (QN-Z) |
| 37 | +5. [Training Distribution Matching](https://peerj.com/articles/1621/) (TDM) |
| 38 | +6. z-scoring (Z) |
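
As a rough illustration of what QN-Z (item 4) involves, the sketch below quantile-normalizes a few toy samples and then z-scores each gene. It is a minimal stand-alone example, not the repository's implementation: the function names, the toy values, the position-based tie-breaking, and the choice to z-score per gene with the population standard deviation are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def quantile_normalize(columns):
    """Quantile-normalize samples, each given as a list of per-gene values."""
    n_genes = len(columns[0])
    # Reference distribution: mean of the k-th smallest value across samples.
    sorted_cols = [sorted(col) for col in columns]
    reference = [mean(col[k] for col in sorted_cols) for k in range(n_genes)]
    normalized = []
    for col in columns:
        # Rank genes within the sample; ties are broken by position (a simplification).
        order = sorted(range(n_genes), key=lambda g: col[g])
        out = [0.0] * n_genes
        for rank, gene in enumerate(order):
            out[gene] = reference[rank]
        normalized.append(out)
    return normalized


def z_score_genes(columns):
    """Z-score each gene across samples; genes with no variance map to 0."""
    n_genes = len(columns[0])
    result = [[0.0] * n_genes for _ in columns]
    for g in range(n_genes):
        values = [col[g] for col in columns]
        mu, sd = mean(values), pstdev(values)
        for s, col in enumerate(columns):
            result[s][g] = (col[g] - mu) / sd if sd > 0 else 0.0
    return result


# Toy data: three samples measured on four genes (illustrative values only).
samples = [[5.0, 2.0, 3.0, 4.0], [4.0, 1.0, 4.0, 2.0], [3.0, 4.0, 6.0, 8.0]]
qn = quantile_normalize(samples)
qn_z = z_score_genes(qn)
```

After quantile normalization every sample shares the same set of values, so only the within-sample ordering of genes differs; z-scoring then centers and scales each gene across samples.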
19 | 39 |
|
20 | | -A [version](https://github.com/greenelab/RNAseq_titration_results/releases/tag/v1.0) of this project is detailed in our pre-print |
21 | | -[Cross-Platform Normalization Enables Machine Learning Model Training On Microarray And RNA-Seq Data Simultaneously](https://doi.org/10.1101/118349). |
22 | 40 |
|
23 | | -_We are actively making improvements to this codebase; see [#12](https://github.com/greenelab/RNAseq_titration_results/issues/12)._ |
24 | | - |
25 | | -## Breast Cancer Data |
26 | 41 |
|
27 | | -[](https://doi.org/10.5281/zenodo.58862) |
| 42 | +## Requirements |
| 43 | + |
| 44 | +We recommend using the docker image `envest/rnaseq_titration_results:R-4.1.2` to handle package and dependency installation. |
| 45 | +See `docker/R-4.1.2/Dockerfile` for more information. |
| 46 | + |
| 47 | +Our analysis ([v2.0](https://github.com/greenelab/RNAseq_titration_results/releases/tag/v2.0)) was run using 7 cores on an AWS instance with 16 cores, 128 GB memory, and an allocated 1 TB of space. |
28 | 48 |
|
29 | | -The Cancer Genome Atlas BRCA data used for these analyses |
| 49 | +### Obtaining and running the Docker container |
| 50 | + |
| 51 | +Pull the docker image using: |
| 52 | + |
| 53 | +``` |
| 54 | +docker pull envest/rnaseq_titration_results:R-4.1.2 |
| 55 | +``` |
| 56 | + |
| 57 | +Then run the command to start up a container, replacing `[PASSWORD]` with your own password: |
| 58 | + |
| 59 | +``` |
| 60 | +docker run --mount type=bind,target=/home/rstudio,source=$PWD -e PASSWORD=[PASSWORD] -p 8787:8787 envest/rnaseq_titration_results:R-4.1.2 |
| 61 | +``` |
| 62 | + |
| 63 | +Navigate to <http://localhost:8787/> and log in to the RStudio server with the username `rstudio` and the password you set above. |
| 64 | + |
| 65 | + |
| 66 | +## Download data from The Cancer Genome Atlas (TCGA) |
| 67 | + |
| 68 | +TCGA data from 520 breast cancer (BRCA) patients used for these analyses |
30 | 69 | is [available at zenodo](https://zenodo.org/record/58862). |
31 | 70 |
|
| 71 | +Data from 150 glioblastoma (GBM) patients is available from the [Genomic Data Commons PanCan Atlas](https://gdc.cancer.gov/about-data/publications/pancanatlas). |
| 72 | + |
| 73 | +To download data, run the data download script in the top directory: |
| 74 | + |
| 75 | +``` |
| 76 | +bash download_TCGA_data.sh |
| 77 | +``` |
| 78 | + |
| 79 | +## Recreate manuscript results |
| 80 | + |
| 81 | +After data has been downloaded, running |
| 82 | + |
32 | 83 | ``` |
33 | | -# To download data, run in top directory: |
34 | | -sh brca_data_download.sh |
| 84 | +bash run_all_analyses_and_plots.sh |
35 | 85 | ``` |
36 | 86 |
|
37 | | -## Analysis |
| 87 | +with [v2.0](https://github.com/greenelab/RNAseq_titration_results/releases/tag/v2.0) of this repository will reproduce the results presented in our manuscript. |
| 88 | +We recommend running all analyses within the project Docker container. |
| 89 | + |
| 90 | +## Methods |
38 | 91 |
|
39 | 92 | ### Machine Learning Pipeline |
40 | 93 |
|
41 | 94 | Here's a schematic overview of our machine learning experiments: |
42 | 95 |
|
43 | | - |
| 96 | + |
44 | 97 |
|
45 | 98 | **Overview of supervised and unsupervised machine learning experiments.** |
46 | 99 |
|
47 | | -1. 520 TCGA Breast Cancer samples run on both microarray and RNA-seq were split |
48 | | -into a training (2/3) and holdout set (1/3). |
49 | | -2. RNA-seq’d samples were "titrated" into the training set, 10% at a time (0-100%) |
50 | | -resulting in eleven training sets for each normalization method. |
51 | | -3. _Machine learning applications._ Three supervised multi-class (BRCA PAM50 subtype) |
52 | | -classifiers—LASSO, linear SVM, and Random Forest—were trained on each training set |
53 | | -and tested on the microarray and RNA-seq holdout sets. The holdout sets were projected |
54 | | -onto and back out of the training set space using two unsupervised techniques, Independent |
55 | | -and Principal Components Analysis, to obtain reconstructed holdout sets. The |
56 | | -classifiers used in step 4A above were used to predict on the reconstructed holdout |
57 | | -sets. |
| 100 | +1. Matched samples run on both microarray and RNA-seq were split into a training (2/3) and holdout set (1/3). |
| 101 | +2. RNA-seq samples were "titrated" into the training set, 10% at a time (0-100%), replacing their matched array samples, resulting in eleven training sets for each normalization method. |
| 102 | +3. Machine learning applications: |
58 | 103 |
|
59 | | -``` |
60 | | -# To run the machine learning pipeline, run in top directory: |
61 | | -sh run_machine_learning_experiments.sh |
| 104 | + - _Supervised learning_: |
| 105 | +We trained three classifiers (LASSO, linear SVM, and Random Forest) on each training set and tested them on the microarray and RNA-seq holdout sets. |
| 106 | +The models were trained to predict tumor subtype (both cancer types have 5 subtypes) and the binary mutation status of _TP53_ and _PIK3CA_. |
62 | 107 |
|
63 | | -# To run one repeat of the subtype classifier pipeline, use: |
64 | | -Rscript run_experiments.R |
65 | | -``` |
| 108 | + - _Unsupervised learning_: |
| 109 | +We projected holdout sets onto and back out of the training set space using Principal Components Analysis to obtain reconstructed holdout sets. |
| 110 | +We then used the trained subtype classifiers to predict on the reconstructed holdout sets. |
| 111 | +[PLIER](https://github.com/wgmao/PLIER) (Pathway-Level Information ExtractoR) identified coordinated sets of genes in each cancer type. |
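
Steps 1 and 2 above (the 2/3 training and 1/3 holdout split, followed by titration in 10% increments) can be sketched as follows. This is a simplified illustration under stated assumptions: the sample identifiers are hypothetical, the split is a plain random shuffle without stratification, and each titration level is sampled independently. The repository's scripts may differ on all of these points.

```python
import random


def split_train_holdout(sample_ids, seed=0):
    """Randomly assign roughly 2/3 of matched samples to training, the rest to holdout."""
    rng = random.Random(seed)
    shuffled = sample_ids[:]
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * 2 / 3)
    return shuffled[:n_train], shuffled[n_train:]


def titrate(train_ids, seed=0):
    """Build eleven training sets with 0%, 10%, ..., 100% RNA-seq samples.

    Each sample appears exactly once per set: either as its array
    measurement or as its matched RNA-seq measurement.
    """
    rng = random.Random(seed)
    sets = []
    for pct in range(0, 101, 10):
        n_seq = round(len(train_ids) * pct / 100)
        seq_ids = set(rng.sample(train_ids, n_seq))
        array_ids = [s for s in train_ids if s not in seq_ids]
        sets.append((pct, array_ids, sorted(seq_ids)))
    return sets


# Hypothetical identifiers standing in for matched samples.
all_ids = [f"sample_{i:03d}" for i in range(520)]
train, holdout = split_train_holdout(all_ids)
sets = titrate(train)
```

Because RNA-seq samples replace their matched array samples rather than being added alongside them, the training set size stays constant across all eleven titration levels.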
66 | 112 |
|
67 | 113 | ### Differential Expression Pipeline |
68 | 114 |
|
69 | 115 | Here's a schematic overview of our main differential expression experiment: |
70 | 116 |
|
71 | | - |
| 117 | + |
72 | 118 |
|
73 | 119 | **Overview of differential expression experiment.** |
74 | 120 |
|
75 | | -1. All matched TCGA breast cancer samples (n = 520) were considered when building the platform-specific |
76 | | -“silver standards.” These standards are the genes that were differentially |
77 | | -expressed at a specified False Discovery Rate (FDR) using data sets comprised |
78 | | -entirely of one platform and processed in a standard way: log2-transformed |
79 | | -microarray data and “untransformed” RSEM count data (preprocessed using the |
80 | | -`limma::voom` function). |
81 | | -2. RNA-seq’d samples were ‘titrated’ into the data set, |
82 | | -10% at a time (0-100%) resulting in eleven experimental sets for each |
83 | | -normalization method. |
84 | | -3. Differentially expressed genes (DEGs) were identified using |
85 | | -the `limma` package. We compared the Her2 and LumA subtypes as well as Basal |
86 | | -v. all other samples. |
87 | | -4. Lists of experimental DEGs were compared to standard gene |
88 | | -sets using Jaccard similarity. |
| 121 | +1. All matched samples were considered when building the platform-specific “silver standards.” |
| 122 | +These standards are the genes that were differentially expressed at a specified False Discovery Rate (FDR) using data sets comprised entirely of one platform and processed in a standard way: log2-transformed |
| 123 | +microarray data and “untransformed” RNA-seq data. |
| 124 | +2. RNA-seq samples were "titrated" into the data set, 10% at a time (0-100%), replacing their matched array samples, resulting in eleven experimental sets for each normalization method. |
| 125 | +3. Differentially expressed genes (DEGs) were identified using the `limma` package. |
| 126 | +For BRCA, we compared the Her2 and LumA subtypes as well as Basal v. all other subtypes. |
| 127 | +For GBM, we compared the Classical and Mesenchymal subtypes as well as Proneural v. all other subtypes. |
| 128 | +4. Lists of experimental DEGs were compared to standard gene sets using Jaccard similarity and Spearman rank correlation. |
| 129 | + |
| 130 | +In the "small n" experiment, between 3 and 50 samples were selected from each subtype for DEG comparison. |
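
The Jaccard comparison in step 4 is the size of the intersection of two gene sets divided by the size of their union. A minimal sketch follows; the function name and the two gene lists are made up for illustration and are not results from this analysis. Returning 1.0 when both sets are empty is a convention chosen here, not something taken from the repository.

```python
def jaccard(genes_a, genes_b):
    """Jaccard similarity between two collections of gene identifiers."""
    a, b = set(genes_a), set(genes_b)
    union = a | b
    # Convention (an assumption): two empty sets are treated as identical.
    return len(a & b) / len(union) if union else 1.0


# Illustrative gene lists only.
silver_standard = ["ESR1", "ERBB2", "GATA3", "FOXA1"]
experimental = ["ERBB2", "GATA3", "KRT5"]
score = jaccard(silver_standard, experimental)  # 2 shared genes / 5 total = 0.4
```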
| 131 | + |
| 132 | + |
| 133 | +## Running individual experiments |
| 134 | + |
| 135 | +#### Machine learning |
| 136 | + |
| 137 | +To run the machine learning pipeline, run in top directory: |
89 | 138 |
|
90 | 139 | ``` |
91 | | -# Note: This requires the data to be processed to include matched samples only, |
92 | | -# and split into training and test sets (0-expression_data_overlap_and_split.R) |
| 140 | +bash run_machine_learning_experiments.sh [cancer type] [prediction task] [n cores] |
| 141 | +``` |
| 142 | + |
| 143 | +where |
93 | 144 |
|
94 | | -# To run the differential expression pipeline, run in top directory: |
95 | | -sh run_differential_expression_experiments.sh |
| 145 | +- `[cancer type]` is `BRCA` or `GBM` |
| 146 | +- `[prediction task]` is `subtype`, `TP53`, or `PIK3CA` |
| 147 | +- `[n cores]` is the number of cores you want to run in parallel |
| 148 | + |
| 149 | +#### Differential expression |
| 150 | + |
| 151 | +⚠️ _This requires the data to be processed to include matched samples only, and split into training and test sets via `0-expression_data_overlap_and_split.R` in the machine learning pipeline._ |
| 152 | + |
| 153 | +To run the differential expression pipeline, run in top directory: |
| 154 | + |
| 155 | +``` |
| 156 | +bash run_differential_expression_experiments.sh [cancer type] [subtype vs others] [subtype vs subtype] [subtype vs subtype small] [n cores] |
96 | 157 | ``` |
97 | 158 |
|
98 | | -## Requirements |
| 159 | +where |
| 160 | + |
| 161 | +- `[cancer type]` is `BRCA` or `GBM` |
| 162 | +- `[subtype vs others]` is the subtype to be compared against all other subtypes |
| 163 | +- `[subtype vs subtype]` are the two subtypes to be compared (comma-separated, e.g. `Her2,LumA`) |
| 164 | +- `[subtype vs subtype small]` are the two subtypes to be compared at small sample sizes (comma-separated, e.g. `Her2,LumA`) |
| 165 | +- `[n cores]` is the number of cores you want to run in parallel |
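
To make the argument order concrete, the hypothetical helper below assembles the command line from the five arguments described above. The helper itself (`de_command`) is not part of the repository; the example values are taken from the BRCA comparisons described under Methods, and 7 cores matches the number reported under Requirements.

```python
import shlex


def de_command(cancer_type, vs_others, vs_subtype, vs_subtype_small, n_cores):
    """Build the argument list for run_differential_expression_experiments.sh."""
    if cancer_type not in {"BRCA", "GBM"}:
        raise ValueError("cancer type must be BRCA or GBM")
    for pair in (vs_subtype, vs_subtype_small):
        if len(pair) != 2:
            raise ValueError("subtype comparisons need exactly two subtypes")
    return [
        "bash",
        "run_differential_expression_experiments.sh",
        cancer_type,
        vs_others,
        ",".join(vs_subtype),        # comma-separated, no spaces
        ",".join(vs_subtype_small),  # comma-separated, no spaces
        str(n_cores),
    ]


cmd = de_command("BRCA", "Basal", ("Her2", "LumA"), ("Her2", "LumA"), 7)
print(shlex.join(cmd))
# bash run_differential_expression_experiments.sh BRCA Basal Her2,LumA Her2,LumA 7
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the top directory of the repository.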
| 166 | + |
| 167 | +#### Other scripts |
99 | 168 |
|
100 | | -This analysis was performed in R. It requires R & Bioconductor packages |
101 | | -detailed in `check_installs.R` to be installed. |
| 169 | +To search for the number of publicly available microarray and RNA-seq samples from [GEO](https://www.ncbi.nlm.nih.gov/geo/) and [ArrayExpress](https://www.ebi.ac.uk/arrayexpress/), run |
102 | 170 |
|
103 | | -One github package (`TDM`) is required. To install, run: |
| 171 | +``` |
| 172 | +python3 search_geo_arrayexpress.py |
| 173 | +``` |
| 174 | +and check the output in `results/array_rnaseq_ratio`. |
104 | 175 |
|
105 | | - library(devtools) |
106 | | - devtools::install_github("greenelab/TDM") |
| 176 | +## Manuscript versions |
107 | 177 |
|
108 | | -**This analysis is [in the process](https://github.com/greenelab/RNAseq_titration_results/issues/18) of being moved to a Docker image.** |
| 178 | +| Version | Relevant links | |
| 179 | +| :------ | :------------- | |
| 180 | +| [v2.0](https://github.com/greenelab/RNAseq_titration_results/releases/tag/v2.0) | [Figshare+ data](https://doi.org/10.25452/figshare.plus.19629864.v1), [Data for plots](https://doi.org/10.6084/m9.figshare.19686453) | |
| 181 | +| [v1.1](https://github.com/greenelab/RNAseq_titration_results/releases/tag/v1.1) | [Figshare full results](https://doi.org/10.6084/m9.figshare.5035997.v2) | |
| 182 | +| [v1.0](https://github.com/greenelab/RNAseq_titration_results/releases/tag/v1.0) | [Pre-print](https://doi.org/10.1101/118349) | |
109 | 183 |
|
110 | 184 | ## Funding |
111 | 185 |
|
112 | | -This work was supported the Gordon and Betty Moore Foundation [GBMF 4552] and |
113 | | -the National Institutes of Health [T32-AR007442, U01-TR001263]. |
| 186 | +This work was supported by the Gordon and Betty Moore Foundation [GBMF 4552], Alex's Lemonade Stand Foundation [GR-000002471], and the National Institutes of Health [T32-AR007442, U01-TR001263, R01-CA237170, K12GM081259]. |