Commit 693de19

Merge pull request #156 from asaravia-butler/GLmicroarray
GLmicroarray updates
2 parents 3e4d624 + 0446936 commit 693de19

33 files changed: +303 -148 lines changed

Microarray/Affymetrix/Pipeline_GL-DPPD-7114_Versions/GL-DPPD-7114.md

Lines changed: 70 additions & 29 deletions
Original file line number | Diff line number | Diff line change
@@ -1,7 +1,6 @@
11
# GeneLab bioinformatics processing pipeline for Affymetrix microarray data <!-- omit in toc -->
22

3-
> **This page holds an overview and instructions for how GeneLab processes Affymetrix microarray datasets. Exact processing commands and GL-DPPD-7114 version used for specific GeneLab datasets (GLDS) are provided with their processed data in the [Open Science Data
4-
Repository (OSDR)](https://osdr.nasa.gov/bio/repo).**
3+
> **This page holds an overview and instructions for how GeneLab processes Affymetrix microarray datasets. Exact processing commands and GL-DPPD-7114 version used for specific GeneLab datasets (GLDS) are provided with their processed data in the [Open Science Data Repository (OSDR)](https://osdr.nasa.gov/bio/repo).**
54
>
65
> \* The pipeline detailed below is currently used for animal and Arabidopsis thaliana studies only; it will be updated soon for processing microbe microarray data and other plant data.
76
@@ -74,9 +73,7 @@ Lauren Sanders (acting GeneLab Project Scientist)
7473

7574
# General processing overview with example commands
7675

77-
> Exact processing commands for a specific GLDS that has been released are provided with the processed data in the [OSDR](https://osdr.nasa.gov/bio/repo).
78-
>
79-
> All output files in **bold** are published with the Affymetrix microarray processed data in the [OSDR](https://osdr.nasa.gov/bio/repo).
76+
> Exact processing commands and output files listed in **bold** below are included with each Microarray processed dataset in the [Open Science Data Repository (OSDR)](https://osdr.nasa.gov/bio/repo/).
8077
8178
---
8279

@@ -167,6 +164,36 @@ dir.create(DIR_DGE)
167164
original_par <- par()
168165
options(preferRaster=TRUE) # use Raster when possible to avoid antialiasing artifacts in images
169166

167+
# Utility function to improve robustness of function calls
168+
# Used to remedy intermittent internet issues during runtime
169+
retry_with_delay <- function(func, ...) {
170+
max_attempts = 5
171+
initial_delay = 10
172+
delay_increase = 30
173+
attempt <- 1
174+
current_delay <- initial_delay
175+
while (attempt <= max_attempts) {
176+
result <- tryCatch(
177+
expr = func(...),
178+
error = function(e) e
179+
)
180+
181+
if (!inherits(result, "error")) {
182+
return(result)
183+
} else {
184+
if (attempt < max_attempts) {
185+
message(paste("Retry attempt", attempt, "failed for function with name <", deparse(substitute(func)) ,">. Retrying in", current_delay, "second(s)..."))
186+
Sys.sleep(current_delay)
187+
current_delay <- current_delay + delay_increase
188+
} else {
189+
stop(paste("Max retry attempts reached. Last error:", result$message))
190+
}
191+
}
192+
193+
attempt <- attempt + 1
194+
}
195+
}
196+
170197
df_rs <- read.csv(runsheet, check.names = FALSE) %>%
171198
dplyr::mutate_all(function(x) iconv(x, "latin1", "ASCII", sub="")) # Convert all characters to ascii, when not possible, remove the character
172199
## Determines the organism specific annotation file to use based on the organism in the runsheet
@@ -187,7 +214,7 @@ fetch_organism_specific_annotation_file_path <- function(organism) {
187214

188215
return(annotation_file_path)
189216
}
190-
annotation_file_path <- fetch_organism_specific_annotation_file_path(unique(df_rs$organism))
217+
annotation_file_path <- retry_with_delay(fetch_organism_specific_annotation_file_path, unique(df_rs$organism))
191218

192219
allTrue <- function(i_vector) {
193220
if ( length(i_vector) == 0 ) {
@@ -221,7 +248,7 @@ downloadFilesFromRunsheet <- function(df_runsheet) {
221248

222249
if ( runsheetPathsAreURIs(df_rs) ) {
223250
print("Determined Raw Data Locations are URIS")
224-
local_paths <- downloadFilesFromRunsheet(df_rs)
251+
local_paths <- retry_with_delay(downloadFilesFromRunsheet, df_rs)
225252
} else {
226253
print("Or Determined Raw Data Locations are local paths")
227254
local_paths <- df_rs$`Array Data File Path`
@@ -247,9 +274,12 @@ df_local_paths <- data.frame(`Sample Name` = df_rs$`Sample Name`, `Local Paths`
247274

248275

249276
# Load raw data into R object
250-
raw_data <- oligo::read.celfiles(df_local_paths$`Local Paths`,
251-
sampleNames = df_local_paths$`Sample Name`# Map column names as Sample Names (instead of default filenames)
252-
)
277+
# Retry with delay here to accommodate oligo's automatic loading of annotation packages and occasional internet-related failures to load
278+
raw_data <- retry_with_delay(
279+
oligo::read.celfiles,
280+
df_local_paths$`Local Paths`,
281+
sampleNames = df_local_paths$`Sample Name`# Map column names as Sample Names (instead of default filenames)
282+
)
253283

254284

255285
# Summarize raw data
@@ -355,6 +385,14 @@ if (inherits(raw_data, "GeneFeatureSet")) {
355385
ylim=c(-2, 4),
356386
main="" # This function uses 'main' as a suffix to the sample name. Here we want just the sample name, thus here main is an empty string
357387
)
388+
} else if (inherits(raw_data, "ExpressionFeatureSet")) {
389+
MA_plot <- oligo::MAplot(
390+
raw_data,
391+
ylim=c(-2, 4),
392+
main="" # This function uses 'main' as a suffix to the sample name. Here we want just the sample name, thus here main is an empty string
393+
)
394+
} else {
395+
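stop(glue::glue("No strategy for MA plots for raw data of class: {paste(class(raw_data), collapse = ', ')}")) # Report the class name; interpolating the object itself would not give a readable message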
358396
}
359397
```
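
The hunk above adds an `ExpressionFeatureSet` branch and a final `else` that stops with an error for any other raw data class. A minimal sketch of that dispatch pattern follows. It uses plain S3 toy objects rather than real oligo classes, and the helper name `ma_plot_strategy` and its return labels are hypothetical, chosen only so the example runs without Bioconductor installed.

```r
# Toy stand-ins for oligo raw-data objects (plain S3 lists, illustration only).
gene_set  <- structure(list(), class = "GeneFeatureSet")
expr_set  <- structure(list(), class = "ExpressionFeatureSet")
other_set <- structure(list(), class = "SomeFutureFeatureSet")

# Hypothetical helper: pick a branch by class and fail loudly for anything else,
# so an unhandled class cannot silently skip the plot.
ma_plot_strategy <- function(raw_data) {
  if (inherits(raw_data, "GeneFeatureSet")) {
    "GeneFeatureSet branch"
  } else if (inherits(raw_data, "ExpressionFeatureSet")) {
    "ExpressionFeatureSet branch"
  } else {
    stop("No strategy for MA plots for class: ",
         paste(class(raw_data), collapse = ", "))
  }
}

print(ma_plot_strategy(gene_set))
print(ma_plot_strategy(expr_set))

# The unhandled class raises an error that names the offending class.
unhandled_message <- tryCatch(
  ma_plot_strategy(other_set),
  error = function(e) conditionMessage(e)
)
print(unhandled_message)
```

Naming the class in the error message, rather than the object, keeps the message short and tells a maintainer which branch is missing.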
360398

@@ -677,11 +715,12 @@ if (organism %in% c("athaliana")) {
677715
ensembl_genomes_portal = "plants"
678716
print(glue::glue("Using ensembl genomes ftp to get specific version of probeset id mapping table. Ensembl genomes portal: {ensembl_genomes_portal}, version: {ensembl_genomes_version}"))
679717
expected_attribute_name <- getBioMartAttribute(df_rs)
680-
df_mapping <- get_ensembl_genomes_mappings_from_ftp(
681-
organism = organism,
682-
ensembl_genomes_portal = ensembl_genomes_portal,
683-
ensembl_genomes_version = ensembl_genomes_version,
684-
biomart_attribute = expected_attribute_name
718+
df_mapping <- retry_with_delay(
719+
get_ensembl_genomes_mappings_from_ftp,
720+
organism = organism,
721+
ensembl_genomes_portal = ensembl_genomes_portal,
722+
ensembl_genomes_version = ensembl_genomes_version,
723+
biomart_attribute = expected_attribute_name
685724
)
686725

687726
# TAIR from the mapping tables tend to be in the format 'AT1G01010.1' but the raw data has 'AT1G01010'
@@ -856,8 +895,8 @@ design_data <- runsheetToDesignMatrix(runsheet)
856895
design <- design_data$matrix
857896

858897
# Write SampleTable.csv and contrasts.csv file
859-
write.csv(design_data$groups, file.path(DIR_DGE, "SampleTable.csv"), row.names = FALSE)
860-
write.csv(design_data$contrasts, file.path(DIR_DGE, "contrasts.csv"))
898+
write.csv(design_data$groups, file.path(DIR_DGE, "SampleTable_GLmicroarray.csv"), row.names = FALSE)
899+
write.csv(design_data$contrasts, file.path(DIR_DGE, "contrasts_GLmicroarray.csv"))
861900
```
862901

863902
**Input Data:**
@@ -867,8 +906,8 @@ write.csv(design_data$contrasts, file.path(DIR_DGE, "contrasts.csv"))
867906
**Output Data:**
868907

869908
- `design` (R object containing the limma study design matrix, indicating the group that each sample belongs to)
870-
- **SampleTable.csv** (table containing samples and their respective groups)
871-
- **contrasts.csv** (table containing all pairwise comparisons)
909+
- **SampleTable_GLmicroarray.csv** (table containing samples and their respective groups)
910+
- **contrasts_GLmicroarray.csv** (table containing all pairwise comparisons)
872911

873912
<br>
874913

@@ -1119,7 +1158,7 @@ if (!setequal(FINAL_COLUMN_ORDER, colnames(df_interim))) {
11191158
df_interim <- df_interim %>% dplyr::relocate(dplyr::all_of(FINAL_COLUMN_ORDER))
11201159

11211160
# Save to file
1122-
write.csv(df_interim, file.path(DIR_DGE, "differential_expression.csv"), row.names = FALSE)
1161+
write.csv(df_interim, file.path(DIR_DGE, "differential_expression_GLmicroarray.csv"), row.names = FALSE)
11231162

11241163
## Output column subset file with just normalized probeset level expression values
11251164
write.csv(
@@ -1128,12 +1167,12 @@ write.csv(
11281167
"ProbesetID",
11291168
"count_ENSEMBL_mappings",
11301169
all_samples)
1131-
], file.path(DIR_NORMALIZED_EXPRESSION, "normalized_expression_probeset.csv"), row.names = FALSE)
1170+
], file.path(DIR_NORMALIZED_EXPRESSION, "normalized_expression_probeset_GLmicroarray.csv"), row.names = FALSE)
11321171

11331172
### Generate and export PCA table for GeneLab visualization plots
11341173
PCA_raw <- prcomp(t(exprs(probeset_level_data)), scale = FALSE) # Note: expression at the Probeset level is already log2 transformed
11351174
write.csv(PCA_raw$x,
1136-
file.path(DIR_DGE, "visualization_PCA_table.csv")
1175+
file.path(DIR_DGE, "visualization_PCA_table_GLmicroarray.csv")
11371176
)
11381177

11391178
## Generate raw intensity matrix that includes annotations
@@ -1182,7 +1221,7 @@ background_corrected_data_annotated <- oligo::exprs(background_corrected_data) %
11821221
background_corrected_data_annotated <- background_corrected_data_annotated %>%
11831222
dplyr::relocate(dplyr::all_of(FINAL_COLUMN_ORDER))
11841223

1185-
write.csv(background_corrected_data_annotated, file.path(DIR_RAW_DATA, "raw_intensities_probe.csv"), row.names = FALSE)
1224+
write.csv(background_corrected_data_annotated, file.path(DIR_RAW_DATA, "raw_intensities_probe_GLmicroarray.csv"), row.names = FALSE)
11861225

11871226
## Generate normalized expression matrix that includes annotations
11881227
norm_data_matrix_annotated <- oligo::exprs(norm_data) %>%
@@ -1202,7 +1241,7 @@ norm_data_matrix_annotated <- oligo::exprs(norm_data) %>%
12021241
norm_data_matrix_annotated <- norm_data_matrix_annotated %>%
12031242
dplyr::relocate(dplyr::all_of(FINAL_COLUMN_ORDER))
12041243

1205-
write.csv(norm_data_matrix_annotated, file.path(DIR_NORMALIZED_EXPRESSION, "normalized_intensities_probe.csv"), row.names = FALSE)
1244+
write.csv(norm_data_matrix_annotated, file.path(DIR_NORMALIZED_EXPRESSION, "normalized_intensities_probe_GLmicroarray.csv"), row.names = FALSE)
12061245

12071246
```
12081247

@@ -1216,8 +1255,10 @@ write.csv(norm_data_matrix_annotated, file.path(DIR_NORMALIZED_EXPRESSION, "norm
12161255

12171256
**Output Data:**
12181257

1219-
- **differential_expression.csv** (table containing normalized probeset expression values for each sample, group statistics, Limma probeset DE results for each pairwise comparison, and gene annotations. The ProbesetID is the unique index column.)
1220-
- **normalized_expression_probeset.csv** (table containing the background corrected, normalized probeset expression values for each sample. The ProbesetID is the unique index column.)
1221-
- visualization_PCA_table.csv (file used to generate GeneLab PCA plots)
1222-
- **raw_intensities_probe.csv** (table containing the background corrected, unnormalized probe intensity values for each sample including gene annotations. The ProbeID is the unique index column.)
1223-
- **normalized_intensities_probe.csv** (table containing the background corrected, normalized probe intensity values for each sample including gene annotations. The ProbeID is the unique index column.)
1258+
- **differential_expression_GLmicroarray.csv** (table containing normalized probeset expression values for each sample, group statistics, Limma probeset DE results for each pairwise comparison, and gene annotations. The ProbesetID is the unique index column.)
1259+
- **normalized_expression_probeset_GLmicroarray.csv** (table containing the background corrected, normalized probeset expression values for each sample. The ProbesetID is the unique index column.)
1260+
- visualization_PCA_table_GLmicroarray.csv (file used to generate GeneLab PCA plots)
1261+
- **raw_intensities_probe_GLmicroarray.csv** (table containing the background corrected, unnormalized probe intensity values for each sample including gene annotations. The ProbeID is the unique index column.)
1262+
- **normalized_intensities_probe_GLmicroarray.csv** (table containing the background corrected, normalized probe intensity values for each sample including gene annotations. The ProbeID is the unique index column.)
1263+
1264+
> All steps of the Microarray pipeline are performed using R Markdown, and the completed document is rendered (via Quarto) as an HTML file (**NF_MAAffymetrix_v\*_GLmicroarray.html**) and published in the [Open Science Data Repository (OSDR)](https://osdr.nasa.gov/bio/repo/) for the respective dataset.
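
The `retry_with_delay` helper introduced in this commit hard-codes five attempts, which with its 10-second initial delay and 30-second increase means waits of 10, 40, 70, and 100 seconds between them. The sketch below is a hypothetical variant, not the pipeline's code: the name `retry_sketch`, its keyword parameters, and the `flaky_fetch` stand-in with its example URL are assumptions made so the pattern can be run without real network calls or long sleeps.

```r
# Hypothetical variant of the retry helper with configurable attempts and delays.
# Defaults mirror the constants shown in the diff; everything else is illustrative.
retry_sketch <- function(func, ..., max_attempts = 5, initial_delay = 10, delay_increase = 30) {
  current_delay <- initial_delay
  for (attempt in seq_len(max_attempts)) {
    # Capture an error as a value instead of letting it propagate.
    result <- tryCatch(func(...), error = function(e) e)
    if (!inherits(result, "error")) {
      return(result)
    }
    if (attempt == max_attempts) {
      stop("Max retry attempts reached. Last error: ", conditionMessage(result))
    }
    message("Attempt ", attempt, " failed. Retrying in ", current_delay, " second(s)...")
    Sys.sleep(current_delay)
    current_delay <- current_delay + delay_increase
  }
}

# Stand-in for a network call: fails on the first two calls, then succeeds.
calls <- 0
flaky_fetch <- function(url) {
  calls <<- calls + 1
  if (calls < 3) stop("simulated timeout")
  paste("fetched", url)
}

# Zero delays keep the demonstration fast.
result <- retry_sketch(flaky_fetch, "https://example.org/annotations.csv",
                       initial_delay = 0, delay_increase = 0)
print(result)
```

Because the tuning parameters come after `...`, they must be passed by name, and a wrapped function that itself takes an argument called `max_attempts`, `initial_delay`, or `delay_increase` would not receive it.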

Microarray/Affymetrix/README.md

Lines changed: 4 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -21,5 +21,7 @@
2121
- Contains instructions for installing and running the GeneLab NF_MAAffymetrix workflow
2222

2323
---
24-
**Developed and maintained by:**
25-
Jonathan Oribello
24+
**Developed by:**
25+
Jonathan Oribello
26+
**Maintained by:**
27+
Alexis Torres (alexis.torres@nasa.gov)

Microarray/Affymetrix/Workflow_Documentation/NF_MAAffymetrix/CHANGELOG.md

Lines changed: 12 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -5,7 +5,18 @@ All notable changes to this project will be documented in this file.
55
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
66
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
77

8-
## [1.0.2](https://github.com/asaravia-butler/GeneLab_Data_Processing/tree/NF_MAAffymetrix_1.0.2/Microarray/Affymetrix/Workflow_Documentation/NF_MAAffymetrix)
8+
## [1.0.3](https://github.com/asaravia-butler/GeneLab_Data_Processing/tree/NF_MAAffymetrix_1.0.3/Microarray/Affymetrix/Workflow_Documentation/NF_MAAffymetrix) - 2024-02-26
9+
10+
### Added
11+
12+
- Retry wrapper for functions that utilize internet resources. This aims to reduce failures caused solely by intermittent network issues. (ceb6d9a3)
13+
14+
### Fixed
15+
16+
- Missing Raw Data MA Plots when handling designs that loaded as `ExpressionFeatureSet` objects. (7af7192e)
17+
- Additionally, future unhandled raw data classes will raise an exception rather than fail to plot silently.
18+
19+
## [1.0.2](https://github.com/asaravia-butler/GeneLab_Data_Processing/tree/NF_MAAffymetrix_1.0.2/Microarray/Affymetrix/Workflow_Documentation/NF_MAAffymetrix) - 2023-05-24
920

1021
### Added
1122

Microarray/Affymetrix/Workflow_Documentation/NF_MAAffymetrix/README.md

Lines changed: 9 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -97,9 +97,9 @@ All files required for utilizing the NF_MAAffymetrix GeneLab workflow for proces
9797
copy of latest NF_MAAffymetrix version on to your system, the code can be downloaded as a zip file from the release page then unzipped after downloading by running the following commands:
9898
9999
```bash
100-
wget https://github.com/asaravia-butler/GeneLab_Data_Processing/releases/download/NF_MAAffymetrix_1.0.2/NF_MAAffymetrix_1.0.2.zip
100+
wget https://github.com/asaravia-butler/GeneLab_Data_Processing/releases/download/NF_MAAffymetrix_1.0.3/NF_MAAffymetrix_1.0.3.zip
101101
102-
unzip NF_MAAffymetrix_1.0.2.zip
102+
unzip NF_MAAffymetrix_1.0.3.zip
103103
```
104104
105105
<br>
@@ -108,15 +108,15 @@ unzip NF_MAAffymetrix_1.0.2.zip
108108
109109
### 3. Run the Workflow
110110
111-
While in the location containing the `NF_MAAffymetrix_1.0.2` directory that was downloaded in [step 2](#2-download-the-workflow-files), you are now able to run the workflow. Below are three examples of how to run the NF_MAAffymetrix workflow:
111+
While in the location containing the `NF_MAAffymetrix_1.0.3` directory that was downloaded in [step 2](#2-download-the-workflow-files), you are now able to run the workflow. Below are three examples of how to run the NF_MAAffymetrix workflow:
112112
> Note: Nextflow commands use both single hyphen arguments (e.g. -help) that denote general nextflow arguments and double hyphen arguments (e.g. --ensemblVersion) that denote workflow specific parameters. Take care to use the proper number of hyphens for each argument.
113113
114114
<br>
115115
116116
#### 3a. Approach 1: Run the workflow on a GeneLab Affymetrix Microarray dataset
117117
118118
```bash
119-
nextflow run NF_MAAffymetrix_1.0.2/main.nf \
119+
nextflow run NF_MAAffymetrix_1.0.3/main.nf \
120120
-profile singularity \
121121
--osdAccession OSD-266 \
122122
--gldsAccession GLDS-266
@@ -129,7 +129,7 @@ nextflow run NF_MAAffymetrix_1.0.2/main.nf \
129129
> Note: Specifications for creating a runsheet manually are described [here](examples/runsheet/README.md).
130130
131131
```bash
132-
nextflow run NF_MAAffymetrix_1.0.2/main.nf \
132+
nextflow run NF_MAAffymetrix_1.0.3/main.nf \
133133
-profile singularity \
134134
--runsheetPath </path/to/runsheet>
135135
```
@@ -141,7 +141,7 @@ nextflow run NF_MAAffymetrix_1.0.2/main.nf \
141141
> Note: Specifications for the ISA Tab Archive format can be found [here](https://isa-specs.readthedocs.io/en/latest/isatab.html).
142142
143143
```bash
144-
nextflow run NF_MAAffymetrix_1.0.2/main.nf \
144+
nextflow run NF_MAAffymetrix_1.0.3/main.nf \
145145
-profile singularity \
146146
--isaArchivePath </path/to/isaArchive>
147147
```
@@ -150,7 +150,7 @@ nextflow run NF_MAAffymetrix_1.0.2/main.nf \
150150
151151
**Required Parameters For All Approaches:**
152152
153-
* `NF_MAAffymetrix_1.0.2/main.nf` - Instructs Nextflow to run the NF_MAAffymetrix workflow
153+
* `NF_MAAffymetrix_1.0.3/main.nf` - Instructs Nextflow to run the NF_MAAffymetrix workflow
154154
155155
* `-profile` - Specifies the configuration profile(s) to load; `singularity` instructs Nextflow to set up and use Singularity for all software called in the workflow
156156
@@ -182,7 +182,7 @@ nextflow run NF_MAAffymetrix_1.0.2/main.nf \
182182
All parameters listed above and additional optional arguments for the NF_MAAffymetrix workflow, including debug related options that may not be immediately useful for most users, can be viewed by running the following command:
183183
184184
```bash
185-
nextflow run NF_MAAffymetrix_1.0.2/main.nf --help
185+
nextflow run NF_MAAffymetrix_1.0.3/main.nf --help
186186
```
187187
188188
See `nextflow run -h` and [Nextflow's CLI run command documentation](https://nextflow.io/docs/latest/cli.html#run) for more options and details common to all nextflow workflows.
@@ -196,7 +196,7 @@ See `nextflow run -h` and [Nextflow's CLI run command documentation](https://nex
196196
All R code steps and output are rendered within a Quarto document yielding the following:
197197
198198
- Output:
199-
- NF_MAAffymetrix_1.0.2.html (html report containing executed code and output including QA plots)
199+
- NF_MAAffymetrix_1.0.3.html (html report containing executed code and output including QA plots)
200200
201201
202202
The outputs from the Analysis Staging and V&V Pipeline Subworkflows are described below:
