Merge pull request #124 from cyouh95/DEV_NF_MAAgilent_1ch

asaravia-butler · web-flow · commit 8383bb58000f · 2024-10-22T09:27:19.000-07:00
NF_MAAgilent1ch: Update workflow version from 1.0.3 to 1.0.4
diff --git a/Microarray/Agilent_1-channel/Workflow_Documentation/NF_MAAgilent1ch/CHANGELOG.md b/Microarray/Agilent_1-channel/Workflow_Documentation/NF_MAAgilent1ch/CHANGELOG.md
@@ -5,6 +5,20 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [1.0.4](https://github.com/nasa/GeneLab_Data_Processing/tree/NF_MAAgilent1ch_1.0.4/Microarray/Agilent_1-channel/Workflow_Documentation/NF_MAAgilent1ch) - 2024-10-02
+
+### Added
+
+- Add automatic generation of processed data protocol ([#85](https://github.com/nasa/GeneLab_Data_Processing/issues/85))
+
+### Changed
+
+- Small bug fixes in `Agile1CMP.qmd`
+  - Check if `getBM()` returned results before concatenating it to dataframe to avoid error in `bind_rows()` ([#96](https://github.com/nasa/GeneLab_Data_Processing/issues/96))
+  - When renaming column names, specify which columns to rename to avoid unintentional renaming ([#97](https://github.com/nasa/GeneLab_Data_Processing/issues/97))
+  - When renaming factor names, prevent cases where a factor is partially renamed because it contains a substring that is another factor ([#100](https://github.com/nasa/GeneLab_Data_Processing/issues/100))
+- Update software table generation to exclude `R.utils` from table if data files are not compressed ([#99](https://github.com/nasa/GeneLab_Data_Processing/issues/99))
+
 ## [1.0.3](https://github.com/nasa/GeneLab_Data_Processing/tree/NF_MAAgilent1ch_1.0.3/Microarray/Agilent_1-channel/Workflow_Documentation/NF_MAAgilent1ch) - 2024-05-17
 
 ### Changed
diff --git a/Microarray/Agilent_1-channel/Workflow_Documentation/NF_MAAgilent1ch/README.md b/Microarray/Agilent_1-channel/Workflow_Documentation/NF_MAAgilent1ch/README.md
@@ -93,9 +93,9 @@ We recommend installing Singularity on a system wide level as per the associated
 All files required for utilizing the NF_MAAgilent1ch GeneLab workflow for processing Agilent 1 Channel Microarray data are in the [workflow_code](workflow_code) directory. To get a copy of latest NF_MAAgilent1ch version on to your system, the code can be downloaded as a zip file from the release page then unzipped after downloading by running the following commands: 
 
 ```bash
-wget https://github.com/nasa/GeneLab_Data_Processing/releases/download/NF_MAAgilent1ch_1.0.3/NF_MAAgilent1ch_1.0.3.zip
+wget https://github.com/nasa/GeneLab_Data_Processing/releases/download/NF_MAAgilent1ch_1.0.4/NF_MAAgilent1ch_1.0.4.zip
 
-unzip NF_MAAgilent1ch_1.0.3.zip
+unzip NF_MAAgilent1ch_1.0.4.zip
 ```
 
 <br>
@@ -104,15 +104,15 @@ unzip NF_MAAgilent1ch_1.0.3.zip
 
 ### 3. Run the Workflow
 
-While in the location containing the `NF_MAAgilent1ch_1.0.3` directory that was downloaded in [step 2](#2-download-the-workflow-files), you are now able to run the workflow. Below are three examples of how to run the NF_MAAgilent1ch workflow:
+While in the location containing the `NF_MAAgilent1ch_1.0.4` directory that was downloaded in [step 2](#2-download-the-workflow-files), you are now able to run the workflow. Below are three examples of how to run the NF_MAAgilent1ch workflow:
 > Note: Nextflow commands use both single hyphen arguments (e.g. -help) that denote general nextflow arguments and double hyphen arguments (e.g. --ensemblVersion) that denote workflow specific parameters.  Take care to use the proper number of hyphens for each argument.
 
 <br>
 
 #### 3a. Approach 1: Run the workflow on a GeneLab Agilent 1 Channel Microarray dataset
 
 ```bash
-nextflow run NF_MAAgilent1ch_1.0.3/main.nf \ 
+nextflow run NF_MAAgilent1ch_1.0.4/main.nf \ 
    -profile singularity \
    --osdAccession OSD-548 \
    --gldsAccession GLDS-548 
@@ -125,7 +125,7 @@ nextflow run NF_MAAgilent1ch_1.0.3/main.nf \
 > Note: Specifications for creating a runsheet manually are described [here](examples/runsheet/README.md).
 
 ```bash
-nextflow run NF_MAAgilent1ch_1.0.3/main.nf \ 
+nextflow run NF_MAAgilent1ch_1.0.4/main.nf \ 
    -profile singularity \
    --runsheetPath </path/to/runsheet> 
 ```
@@ -134,7 +134,7 @@ nextflow run NF_MAAgilent1ch_1.0.3/main.nf \
 
 **Required Parameters For All Approaches:**
 
-* `NF_MAAgilent1ch_1.0.3/main.nf` - Instructs Nextflow to run the NF_MAAgilent1ch workflow 
+* `NF_MAAgilent1ch_1.0.4/main.nf` - Instructs Nextflow to run the NF_MAAgilent1ch workflow 
 
 * `-profile` - Specifies the configuration profile(s) to load, `singularity` instructs Nextflow to setup and use singularity for all software called in the workflow
 
@@ -166,7 +166,7 @@ nextflow run NF_MAAgilent1ch_1.0.3/main.nf \
 All parameters listed above and additional optional arguments for the NF_MAAgilent1ch workflow, including debug related options that may not be immediately useful for most users, can be viewed by running the following command:
 
 ```bash
-nextflow run NF_MAAgilent1ch_1.0.3/main.nf --help
+nextflow run NF_MAAgilent1ch_1.0.4/main.nf --help
 ```
 
 See `nextflow run -h` and [Nextflow's CLI run command documentation](https://nextflow.io/docs/latest/cli.html#run) for more options and details common to all nextflow workflows.
@@ -180,7 +180,7 @@ See `nextflow run -h` and [Nextflow's CLI run command documentation](https://nex
 All R code steps and output are rendered within a Quarto document yielding the following:
 
    - Output:
-     - NF_MAAgilent1ch_1.0.3.html (html report containing executed code and output including QA plots)
+     - NF_MAAgilent1ch_1.0.4.html (html report containing executed code and output including QA plots)
   
 
 The outputs from the Analysis Staging and V&V Pipeline Subworkflows are described below:
diff --git a/Microarray/Agilent_1-channel/Workflow_Documentation/NF_MAAgilent1ch/workflow_code/bin/Agile1CMP.qmd b/Microarray/Agilent_1-channel/Workflow_Documentation/NF_MAAgilent1ch/workflow_code/bin/Agile1CMP.qmd
@@ -1,6 +1,6 @@
 ---
 title: "Agilent 1 Channel Processing"
-subtitle: "Workflow Version: NF_MAAgilent1ch_1.0.3"
+subtitle: "Workflow Version: NF_MAAgilent1ch_1.0.4"
 date: now
 title-block-banner: true
 format:
@@ -530,7 +530,10 @@ if (organism %in% c("athaliana")) {
             values = probe_id_chunk, 
             mart = ensembl)
 
-    df_mapping <- df_mapping %>% dplyr::bind_rows(chunk_results)
+    if (nrow(chunk_results) > 0) {
+      df_mapping <- df_mapping %>% dplyr::bind_rows(chunk_results)
+    }
+    
     Sys.sleep(10) # Slight break between requests to prevent back-to-back requests
   }
 }
@@ -712,7 +715,7 @@ reformat_names <- function(colname, group_name_mapping) {
                   stringr::str_replace(pattern = ".condition", replacement = "v")
   
   # remap to group names before make.names was applied
-  unique_group_name_mapping <- unique(group_name_mapping)
+  unique_group_name_mapping <- unique(group_name_mapping) %>% arrange(-nchar(safe_name))
   for ( i in seq(nrow(unique_group_name_mapping)) ) {
     safe_name <- unique_group_name_mapping[i,]$safe_name
     original_name <- unique_group_name_mapping[i,]$original_name
@@ -722,7 +725,7 @@ reformat_names <- function(colname, group_name_mapping) {
   return(new_colname)
 }
 
-df_interim <- df_interim %>% dplyr::rename_with( reformat_names, group_name_mapping = design_data$mapping )
+df_interim <- df_interim %>% dplyr::rename_with(reformat_names, .cols = matches('\\.condition|^Genes\\.'), group_name_mapping = design_data$mapping)
 
 
 # Concatenate expression values for each sample
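
Reviewer note: the `arrange(-nchar(safe_name))` change above orders the mapping so that longer safe names are substituted before shorter ones that they contain. The sketch below illustrates why the order matters. It is a Python illustration, not the workflow's R code: the helper `remap_names`, the factor names, and the column name are all hypothetical, and it assumes every occurrence of a safe name is replaced.

```python
def remap_names(colname, mapping, longest_first=True):
    """Replace each safe name in colname with its original name.

    mapping is a list of (safe_name, original_name) pairs (hypothetical data).
    """
    pairs = list(dict.fromkeys(mapping))  # de-duplicate while keeping order
    if longest_first:
        # Substitute longer safe names first so a shorter name that is a
        # substring of a longer one cannot consume part of it.
        pairs.sort(key=lambda pair: len(pair[0]), reverse=True)
    for safe_name, original_name in pairs:
        colname = colname.replace(safe_name, original_name)
    return colname


# Hypothetical factors: "1G" and "1G & Rad", with make.names-style safe names.
mapping = [("X1G", "1G"), ("X1G...Rad", "1G & Rad")]
column = "Log2fc_(X1G...Rad)v(X1G)"

print(remap_names(column, mapping, longest_first=False))  # Log2fc_(1G...Rad)v(1G)
print(remap_names(column, mapping, longest_first=True))   # Log2fc_(1G & Rad)v(1G)
```

With the unsorted mapping, `X1G` is replaced inside `X1G...Rad` first, so the longer safe name is never found and the factor is left partially renamed.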
diff --git a/Microarray/Agilent_1-channel/Workflow_Documentation/NF_MAAgilent1ch/workflow_code/bin/dp_tools__agilent_1_channel/config.yaml b/Microarray/Agilent_1-channel/Workflow_Documentation/NF_MAAgilent1ch/workflow_code/bin/dp_tools__agilent_1_channel/config.yaml
@@ -75,7 +75,9 @@ Staging:
             Sample name is used as a unique sample identifier during processing
           Example: Atha_Col-0_Root_WT_Ctrl_45min_Rep1_GSM502538
 
-        - ISA Field Name: Label
+        - ISA Field Name:
+            - Label
+            - Parameter Value[label]
           ISA Table Source: Sample
           Runsheet Column Name: Label
           Processing Usage: >-
diff --git a/Microarray/Agilent_1-channel/Workflow_Documentation/NF_MAAgilent1ch/workflow_code/main.nf b/Microarray/Agilent_1-channel/Workflow_Documentation/NF_MAAgilent1ch/workflow_code/main.nf
@@ -97,13 +97,11 @@ workflow {
     ch_software_versions = Channel.value(nf_version)
     AGILE1CH.out.versions | map{ it -> it.text } | mix(ch_software_versions) | set{ch_software_versions}
     VV_AGILE1CH.out.versions | map{ it -> it.text } | mix(ch_software_versions) | set{ch_software_versions}
-    ch_software_versions | unique 
-                         | collectFile(
-                            newLine: true, 
-                            sort: true,
-                            cache: false
-                            )
-                         | GENERATE_SOFTWARE_TABLE
+
+    GENERATE_SOFTWARE_TABLE(
+      ch_software_versions | unique | collectFile(newLine: true, sort: true, cache: false),
+      ch_runsheet | splitCsv(header: true, quote: '"') | first | map{ row -> row['Array Data File Name'] }
+    )
 
     emit:
       meta = ch_meta 
diff --git a/Microarray/Agilent_1-channel/Workflow_Documentation/NF_MAAgilent1ch/workflow_code/modules/GENERATE_SOFTWARE_TABLE/main.nf b/Microarray/Agilent_1-channel/Workflow_Documentation/NF_MAAgilent1ch/workflow_code/modules/GENERATE_SOFTWARE_TABLE/main.nf
@@ -5,12 +5,13 @@ process GENERATE_SOFTWARE_TABLE {
 
   input:
     path("software_versions.yaml")
+    val(filename)
   
   output:
     path("software_versions_GLmicroarray.md")
   
   script:
     """
-    SoftwareYamlToMarkdownTable.py software_versions.yaml
+    SoftwareYamlToMarkdownTable.py software_versions.yaml \"$filename\"
     """
 }
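
Reviewer note: the `filename` value passed to `SoftwareYamlToMarkdownTable.py` above decides whether `R.utils` stays in the software table. In this PR the script drops the entry by calling `.remove('r.utils')` on its module-level `AGILENT_SOFTWARE_DPPD` list. A possible alternative, sketched below under stated assumptions, builds a filtered copy instead. The list contents and the helper name `direct_software_for` are hypothetical and only stand in for the script's real values.

```python
# Hypothetical stand-in for the script's module-level AGILENT_SOFTWARE_DPPD list.
DIRECT_SOFTWARE = ["r", "limma", "r.utils", "ggplot2"]


def direct_software_for(filename, software=DIRECT_SOFTWARE):
    """Return the software names to keep, without modifying the input list."""
    if filename.endswith(".gz"):
        # Compressed inputs: R.utils is used to unzip them, so keep everything.
        return list(software)
    # Uncompressed inputs: leave R.utils out of the table.
    return [name for name in software if name != "r.utils"]


print(direct_software_for("sample_raw.txt"))
print(direct_software_for("sample_raw.txt.gz"))
```

`list.remove` modifies the shared list in place and raises `ValueError` when the item is absent; returning a filtered copy avoids both, which may matter if the function is ever called more than once in a process.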
diff --git a/Microarray/Agilent_1-channel/Workflow_Documentation/NF_MAAgilent1ch/workflow_code/modules/GENERATE_SOFTWARE_TABLE/resources/usr/bin/SoftwareYamlToMarkdownTable.py b/Microarray/Agilent_1-channel/Workflow_Documentation/NF_MAAgilent1ch/workflow_code/modules/GENERATE_SOFTWARE_TABLE/resources/usr/bin/SoftwareYamlToMarkdownTable.py
@@ -41,14 +41,19 @@
 
 @click.command()
 @click.argument("input_yaml", type=click.Path(exists=True))
-def yamlToMarkdown(input_yaml: Path):
+@click.argument("filename")
+def yamlToMarkdown(input_yaml: Path, filename: str):
     """ Using a software versions """
     with open(input_yaml, "r") as f:
         data = yaml.safe_load(f)
 
     data.extend(ASSUMED_SOFTWARE)
     df = pd.DataFrame(data)
 
+    # If data files are not compressed, won't use R.utils to unzip them during processing
+    if not filename.endswith('.gz'):
+        AGILENT_SOFTWARE_DPPD.remove('r.utils')
+
     # Filter to direct software used (i.e. exclude dependencies of the software)
     df = df.loc[df["name"].str.lower().isin(AGILENT_SOFTWARE_DPPD)]
 
diff --git a/Microarray/Agilent_1-channel/Workflow_Documentation/NF_MAAgilent1ch/workflow_code/modules/POST_PROCESSING/GENERATE_PROTOCOL/main.nf b/Microarray/Agilent_1-channel/Workflow_Documentation/NF_MAAgilent1ch/workflow_code/modules/POST_PROCESSING/GENERATE_PROTOCOL/main.nf
@@ -0,0 +1,17 @@
+process GENERATE_PROTOCOL {
+  tag "${ params.gldsAccession }"
+  publishDir "${ params.outputDir }/${ params.gldsAccession }/GeneLab",
+    mode: params.publish_dir_mode
+
+  input:
+    path("software_versions_GLmicroarray.md")
+    val(organism)
+  
+  output:
+    path("PROTOCOL_GLmicroarray.txt")
+  
+  script:
+  """
+  generate_protocol.sh $workflow.manifest.version \"$organism\"
+  """
+}
diff --git a/Microarray/Agilent_1-channel/Workflow_Documentation/NF_MAAgilent1ch/workflow_code/modules/POST_PROCESSING/GENERATE_PROTOCOL/resources/usr/bin/generate_protocol.sh b/Microarray/Agilent_1-channel/Workflow_Documentation/NF_MAAgilent1ch/workflow_code/modules/POST_PROCESSING/GENERATE_PROTOCOL/resources/usr/bin/generate_protocol.sh
@@ -0,0 +1,68 @@
+#!/bin/bash
+set -u
+
+software_versions_file="software_versions_GLmicroarray.md"
+
+# Read the markdown table
+while read -r line; do
+    # Extract program, version, and link
+    program=$(echo "$line" | awk -F'|' '{gsub(/^[[:blank:]]+|[[:blank:]]+$/,"",$1); print $1}')
+    version=$(echo "$line" | awk -F'|' '{gsub(/^[[:blank:]]+|[[:blank:]]+$/,"",$2); print $2}')
+
+    # Skip the header row and rows without version information
+    if [[ $program != "Program" && $version != "Version" && ! -z $version ]]; then
+        # Replace invalid characters in program name with underscores
+        sanitized_program=$(echo "$program" | tr -cd '[:alnum:]_')
+
+        # Create environment variable name
+        env_var_name="${sanitized_program}_VERSION"
+
+        # Set the environment variable
+        export "$env_var_name=$version"
+    fi
+done < <(sed -n '/|/p' "$software_versions_file" | sed 's/^ *|//;s/|$//')
+
+# Print the extracted versions
+env | grep "_VERSION"
+
+# Get organism
+organism=$2
+
+# List of organisms
+organism_list=("Homo sapiens" "Mus musculus" "Rattus norvegicus" "Drosophila melanogaster" "Caenorhabditis elegans" "Danio rerio" "Saccharomyces cerevisiae")
+
+# Check the value of 'organism' variable and set 'GENE_MAPPING_STEP' accordingly
+if [[ $organism == "Arabidopsis thaliana" ]]; then
+    GENE_MAPPING_STEP="Ensembl gene ID mappings were retrieved for each probe using the Plants Ensembl database ftp server (plants.ensembl.org, release 54)."
+elif [[ " ${organism_list[*]} " == *"${organism//\"/}"* ]]; then
+    GENE_MAPPING_STEP="Ensembl gene ID mappings were retrieved for each probe using biomaRt (version ${biomaRt_VERSION}), Ensembl database (ensembl.org, release 107)."
+else
+    GENE_MAPPING_STEP="TBD"
+fi
+
+# Check the value of 'organism' variable and set 'GENE_ANNOTATION_DB' accordingly
+if [[ $organism == "Arabidopsis thaliana" ]]; then
+    GENE_ANNOTATION_DB="org.At.tair.db"
+elif [[ $organism == "Homo sapiens" ]]; then
+    GENE_ANNOTATION_DB="org.Hs.eg.db"
+elif [[ $organism == "Mus musculus" ]]; then
+    GENE_ANNOTATION_DB="org.Mm.eg.db"
+elif [[ $organism == "Rattus norvegicus" ]]; then
+    GENE_ANNOTATION_DB="org.Rn.eg.db"
+elif [[ $organism == "Drosophila melanogaster" ]]; then
+    GENE_ANNOTATION_DB="org.Dm.eg.db"
+elif [[ $organism == "Caenorhabditis elegans" ]]; then
+    GENE_ANNOTATION_DB="org.Ce.eg.db"
+elif [[ $organism == "Danio rerio" ]]; then
+    GENE_ANNOTATION_DB="org.Dr.eg.db"
+elif [[ $organism == "Saccharomyces cerevisiae" ]]; then
+    GENE_ANNOTATION_DB="org.Sc.sgd.db"
+else
+    GENE_ANNOTATION_DB="TBD"
+fi
+
+# Read the template file
+template="Data were processed as described in GL-DPPD-7112 ([https://github.com/nasa/GeneLab_Data_Processing/blob/master/Microarray/Agilent_1-channel/Pipeline_GL-DPPD-7112_Versions/GL-DPPD-7112.md]), using NF_MAAgilent1ch version $1 ([https://github.com/nasa/GeneLab_Data_Processing/tree/NF_MAAgilent1ch_$1/Microarray/Agilent_1-channel/Workflow_Documentation/NF_MAAgilent1ch]). In short, a RunSheet containing raw data file location and processing metadata from the study's *ISA.zip file was generated using dp_tools (version ${dp_tools_VERSION}). The raw array data files were loaded into R (version ${R_VERSION}) using limma (version ${limma_VERSION}). Raw data quality assurance density, pseudo image, MA, and foreground-background plots were generated using limma (version ${limma_VERSION}), and boxplots were generated using ggplot2 (version ${ggplot2_VERSION}). The raw intensity data was background corrected and normalized across arrays via the limma (version ${limma_VERSION}) quantile method. Normalized data quality assurance density, pseudo image, and MA plots were generated using limma (version ${limma_VERSION}), and boxplots were generated using ggplot2 (version ${ggplot2_VERSION}). ${GENE_MAPPING_STEP} Differential expression analysis was performed in R (version ${R_VERSION}) using limma (version ${limma_VERSION}); all groups were compared pairwise for each probe to generate a moderated t-statistic and associated p- and adjusted p-value. Gene annotations were assigned using the custom annotation tables generated in-house as detailed in GL-DPPD-7110 ([https://github.com/nasa/GeneLab_Data_Processing/blob/GL_RefAnnotTable_1.0.0/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110/GL-DPPD-7110.md]), with STRINGdb (version 2.8.4), PANTHER.db (version 1.0.11), and ${GENE_ANNOTATION_DB} (version 3.15.0)."
+
+# Output the filled template
+echo "$template" > PROTOCOL_GLmicroarray.txt
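
Reviewer note: in `generate_protocol.sh` above, the test `[[ " ${organism_list[*]} " == *"${organism//\"/}"* ]]` is a substring match against the space-joined list, so a partial name such as `Homo`, or an empty value, would also take the biomaRt branch. The sketch below shows an exact-match lookup as one possible alternative. It is written in Python for illustration only; the organism names and annotation packages are copied from the script, while the function names `annotation_db_for` and `gene_mapping_source` are hypothetical.

```python
# Organism -> annotation package, copied from the if/elif chain in the script.
GENE_ANNOTATION_DBS = {
    "Arabidopsis thaliana": "org.At.tair.db",
    "Homo sapiens": "org.Hs.eg.db",
    "Mus musculus": "org.Mm.eg.db",
    "Rattus norvegicus": "org.Rn.eg.db",
    "Drosophila melanogaster": "org.Dm.eg.db",
    "Caenorhabditis elegans": "org.Ce.eg.db",
    "Danio rerio": "org.Dr.eg.db",
    "Saccharomyces cerevisiae": "org.Sc.sgd.db",
}

# Organisms the script maps through biomaRt (its organism_list).
BIOMART_ORGANISMS = set(GENE_ANNOTATION_DBS) - {"Arabidopsis thaliana"}


def annotation_db_for(organism):
    """Exact-match lookup; unknown organisms fall back to "TBD"."""
    return GENE_ANNOTATION_DBS.get(organism.strip('"'), "TBD")


def gene_mapping_source(organism):
    """Return the Ensembl host used for probe mapping, or "TBD"."""
    name = organism.strip('"')
    if name == "Arabidopsis thaliana":
        return "plants.ensembl.org"
    if name in BIOMART_ORGANISMS:  # whole-name membership, not substring
        return "ensembl.org"
    return "TBD"


print(annotation_db_for("Mus musculus"), gene_mapping_source("Mus musculus"))
print(annotation_db_for("Homo"), gene_mapping_source("Homo"))
```

Keeping both decisions keyed on the same table also means a newly supported organism only needs to be added in one place.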
diff --git a/Microarray/Agilent_1-channel/Workflow_Documentation/NF_MAAgilent1ch/workflow_code/nextflow.config b/Microarray/Agilent_1-channel/Workflow_Documentation/NF_MAAgilent1ch/workflow_code/nextflow.config
@@ -45,7 +45,7 @@ manifest {
     mainScript = 'main.nf'
     defaultBranch = 'main'
     nextflowVersion = '>=23.10.1'
-    version = '1.0.3'
+    version = '1.0.4'
 }
 
 def trace_timestamp = new java.util.Date().format( 'yyyy-MM-dd_HH-mm-ss')
diff --git a/Microarray/Agilent_1-channel/Workflow_Documentation/NF_MAAgilent1ch/workflow_code/post_processing.nf b/Microarray/Agilent_1-channel/Workflow_Documentation/NF_MAAgilent1ch/workflow_code/post_processing.nf
@@ -7,6 +7,7 @@ c_reset = "\033[0m";
 
 include { GENERATE_MD5SUMS } from './modules/GENERATE_MD5SUMS.nf'
 include { UPDATE_ISA_TABLES } from './modules/UPDATE_ISA_TABLES.nf'
+include { GENERATE_PROTOCOL } from './modules/POST_PROCESSING/GENERATE_PROTOCOL'
 
 /**************************************************
 * HELP MENU  **************************************
@@ -49,6 +50,7 @@ workflow {
   main:
     ch_processed_directory = Channel.fromPath("${ params.outputDir }/${ params.gldsAccession }", checkIfExists: true)
     ch_runsheet = Channel.fromPath("${ params.outputDir }/${ params.gldsAccession }/Metadata/*_runsheet.csv", checkIfExists: true)
+    ch_software_versions = Channel.fromPath("${ params.outputDir }/${ params.gldsAccession }/GeneLab/software_versions_GLmicroarray.md", checkIfExists: true)
     GENERATE_MD5SUMS(      
       ch_processed_directory, 
       ch_runsheet,       
@@ -59,4 +61,8 @@ workflow {
       ch_runsheet,       
       "${ projectDir }/bin/dp_tools__agilent_1_channel" // dp_tools plugin
     )
+    GENERATE_PROTOCOL(
+      ch_software_versions,
+      ch_runsheet | splitCsv(header: true, quote: '"') | first | map{ row -> row['organism'] }
+    )
 }
diff --git a/Microarray/Agilent_1-channel/Workflow_Documentation/README.md b/Microarray/Agilent_1-channel/Workflow_Documentation/README.md
@@ -6,7 +6,7 @@
 
 |Pipeline Version|Current Workflow Version (for respective pipeline version)|Nextflow Version|
 |:---------------|:---------------------------------------------------------|:---------------|
-|*[GL-DPPD-7112.md](../Pipeline_GL-DPPD-7112_Versions/GL-DPPD-7112.md)|[NF_MAAgilent1ch_1.0.3](NF_MAAgilent1ch)|23.10.1|
+|*[GL-DPPD-7112.md](../Pipeline_GL-DPPD-7112_Versions/GL-DPPD-7112.md)|[NF_MAAgilent1ch_1.0.4](NF_MAAgilent1ch)|23.10.1|
 
 *Current GeneLab Pipeline/Workflow Implementation
 