diff --git a/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md b/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md
new file mode 100644
index 00000000..c3acba00
--- /dev/null
+++ b/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md
@@ -0,0 +1,864 @@
+# GeneLab Pipeline for Generating Reference Annotation Tables
+
+> **This page provides an overview and instructions for how GeneLab generates reference annotation tables. The GeneLab reference annotation table used to add annotations to processed data files is indicated in the exact processing scripts provided for each GLDS dataset under the respective omics datatype subdirectory.**
+
+---
+
+**Date:** October XX, 2024
+**Revision:** -A
+**Document Number:** GL-DPPD-7110-A
+
+**Submitted by:**
+Alexis Torres and Crystal Han (GeneLab Data Processing Team)
+
+**Approved by:**
+Samrawit Gebre (OSDR Project Manager)
+Lauren Sanders (OSDR Project Scientist)
+Amanda Saravia-Butler (GeneLab Science Lead)
+Barbara Novak (GeneLab Data Processing Lead)
+
+---
+
+## Updates from Previous Version
+
+- **Updated Software:**
+ - R version updated from 4.1.3 to 4.4.0.
+ - Bioconductor version updated from 3.15.1 to 3.19.1.
+ - tidyverse version updated from 1.3.2 to 2.0.0.
+ - STRINGdb version updated from 2.8.4 to 2.16.0 (DB version: 12.0).
+ - PANTHER.db version updated from 1.0.11 to 1.0.12 (DB version: 18.0).
+ - rtracklayer version updated from 1.56.1 to 1.64.0.
+
+- **Added Software:**
+ - AnnotationForge version 1.46.0.
+ - biomaRt version 2.60.1.
+ - GO.db version 3.19.1 (DB schema version 2.1).
+
+- **Ensembl Releases:**
+ - Animals: Updated from release 107 to 112
+ - Plants: Updated from release 54 to 59
+ - Bacteria: Updated from release 54 to 59
+
+- **New Organism Support:**
+ 1. Bacillus subtilis, subsp. subtilis 168
+ 2. Brachypodium distachyon
+ 3. Escherichia coli, str. K-12 substr. MG1655
+ 4. Oryzias latipes
+ 5. Lactobacillus acidophilus NCFM
+ 6. Mycobacterium marinum M
+ 7. Oryza sativa Japonica
+ 8. Pseudomonas aeruginosa UCBPP-PA14
+ 9. Salmonella enterica subsp. enterica serovar Typhimurium str. LT2
+ 10. Serratia liquefaciens ATCC 27592
+ 11. Staphylococcus aureus MRSA252
+ 12. Streptococcus mutans UA159
+ 13. Vibrio fischeri ES114
+
+- **Added NCBI as a Reference Source:**
+ FASTA and GTF files were sourced from NCBI for the following organisms:
+ 1. Lactobacillus acidophilus NCFM
+ 2. Mycobacterium marinum M
+ 3. Pseudomonas aeruginosa UCBPP-PA14
+ 4. Serratia liquefaciens ATCC 27592
+ 5. Staphylococcus aureus MRSA252
+ 6. Streptococcus mutans UA159
+ 7. Vibrio fischeri ES114
+
+- **org.db Creation:**
+ Added functionality to create an annotation database using `AnnotationForge`. This is applicable to organisms without a maintained annotation database package in Bioconductor (e.g., `org.Hs.eg.db`). This approach was used for the following organisms:
+ 1. Bacillus subtilis, subsp. subtilis 168
+ 2. Brachypodium distachyon
+ 3. Escherichia coli, str. K-12 substr. MG1655
+ 4. Oryzias latipes
+ 5. Salmonella enterica subsp. enterica serovar Typhimurium str. LT2
+
+The pipeline is designed to annotate unique gene IDs in a reference assembly, map them to organism-specific `org.db` databases for additional annotations, integrate STRING DB IDs, and use PANTHER to obtain GO slim IDs based on ENTREZ IDs.
+
+The default columns in the annotation table are:
+- ENSEMBL (or TAIR), SYMBOL, GENENAME, REFSEQ, ENTREZID, STRING_id, GOSLIM_IDS
+
+- For organisms with FASTA and GTF files sourced from NCBI, the LOCUS, OLD_LOCUS, SYMBOL, GENENAME, and GO annotations were derived directly from the GTF file. The `GO` column contains GO terms. `OLD_LOCUS` (`old_locus_tag` in the GTF) was retained when needed to map to STRING IDs.
+- Missing columns indicate the absence of corresponding data for that organism.
+
+1. **Brachypodium distachyon**:
+ - Columns: ENSEMBL, ACCNUM, SYMBOL, GENENAME, REFSEQ, ENTREZID, STRING_id, GOSLIM_IDS
+ > Note: GTF `transcript_id` entries were matched with `ACCNUM` keys in the `org.db` and saved as `ACCNUM`
+
+2. **Caenorhabditis elegans**:
+ - Columns: ENSEMBL, SYMBOL, GENENAME, REFSEQ, ENTREZID, STRING_id
+ > Note: org.db ENTREZ keys did not match PANTHER ENTREZ keys, so the empty `GOSLIM_IDS` column was omitted
+
+3. **Lactobacillus acidophilus**:
+ - Columns: LOCUS, OLD_LOCUS, SYMBOL, GENENAME, STRING_id, GO
+
+4. **Mycobacterium marinum**:
+ - Columns: LOCUS, OLD_LOCUS, SYMBOL, GENENAME, STRING_id, GO
+
+5. **Oryza sativa Japonica**:
+ - Columns: ENSEMBL, STRING_id
+
+6. **Pseudomonas aeruginosa UCBPP-PA14**:
+ - Columns: LOCUS, SYMBOL, GENENAME, GO
+
+7. **Serratia liquefaciens ATCC 27592**:
+ - Columns: LOCUS, OLD_LOCUS, SYMBOL, GENENAME, STRING_id, GO
+
+8. **Staphylococcus aureus MRSA252**:
+ - Columns: LOCUS, SYMBOL, GENENAME, GO
+
+9. **Streptococcus mutans UA159**:
+ - Columns: LOCUS, OLD_LOCUS, SYMBOL, GENENAME, STRING_id, GO
+
+10. **Vibrio fischeri ES114**:
+ - Columns: LOCUS, OLD_LOCUS, SYMBOL, GENENAME, STRING_id, GO
+
+---
+
+# Table of Contents
+
+- [Software Used](#software-used)
+- [Annotation Table Build Overview with Example Commands](#annotation-table-build-overview-with-example-commands)
+ - [0. Set Up Environment](#0-set-up-environment)
+ - [1. Define Variables and Output File Names](#1-define-variables-and-output-file-names)
+ - [2. Create the Organism Package if it is Not Hosted by Bioconductor](#2-create-the-organism-package-if-it-is-not-hosted-by-bioconductor)
+ - [3. Load Annotation Databases](#3-load-annotation-databases)
+ - [4. Build Initial Annotation Table](#4-build-initial-annotation-table)
+ - [5. Add org.db Keys](#5-add-orgdb-keys)
+ - [6. Add STRING IDs](#6-add-string-ids)
+ - [7. Add Gene Ontology (GO) Slim IDs](#7-add-gene-ontology-go-slim-ids)
+ - [8. Export Annotation Table and Build Info](#8-export-annotation-table-and-build-info)
+
+---
+
+# Software Used
+
+| Program | Version | Relevant Links |
+|:----------------|:-------:|:---------------|
+| R | 4.4.0 | [https://www.r-project.org/](https://www.r-project.org/) |
+| Bioconductor | 3.19 | [https://bioconductor.org](https://bioconductor.org) |
+| tidyverse | 2.0.0 | [https://www.tidyverse.org](https://www.tidyverse.org) |
+| STRINGdb | 2.16.4 | [https://www.bioconductor.org/packages/release/bioc/html/STRINGdb.html](https://www.bioconductor.org/packages/release/bioc/html/STRINGdb.html) |
+| PANTHER.db | 1.0.12 | [https://bioconductor.org/packages/release/data/annotation/html/PANTHER.db.html](https://www.bioconductor.org/packages/release/data/annotation/html/PANTHER.db.html) |
+| rtracklayer | 1.64.0 | [https://bioconductor.org/packages/release/bioc/html/rtracklayer.html](https://www.bioconductor.org/packages/release/bioc/html/rtracklayer.html) |
+| org.At.tair.db | 3.19.1 | [https://bioconductor.org/packages/release/data/annotation/html/org.At.tair.db.html](https://www.bioconductor.org/packages/release/data/annotation/html/org.At.tair.db.html) |
+| org.Ce.eg.db | 3.19.1 | [https://bioconductor.org/packages/release/data/annotation/html/org.Ce.eg.db.html](https://www.bioconductor.org/packages/release/data/annotation/html/org.Ce.eg.db.html) |
+| org.Dm.eg.db | 3.19.1 | [https://bioconductor.org/packages/release/data/annotation/html/org.Dm.eg.db.html](https://www.bioconductor.org/packages/release/data/annotation/html/org.Dm.eg.db.html) |
+| org.Dr.eg.db | 3.19.1 | [https://bioconductor.org/packages/release/data/annotation/html/org.Dr.eg.db.html](https://www.bioconductor.org/packages/release/data/annotation/html/org.Dr.eg.db.html) |
+| org.Hs.eg.db | 3.19.1 | [https://bioconductor.org/packages/release/data/annotation/html/org.Hs.eg.db.html](https://www.bioconductor.org/packages/release/data/annotation/html/org.Hs.eg.db.html) |
+| org.Mm.eg.db | 3.19.1 | [https://bioconductor.org/packages/release/data/annotation/html/org.Mm.eg.db.html](https://www.bioconductor.org/packages/release/data/annotation/html/org.Mm.eg.db.html) |
+| org.Rn.eg.db | 3.19.1 | [https://bioconductor.org/packages/release/data/annotation/html/org.Rn.eg.db.html](https://www.bioconductor.org/packages/release/data/annotation/html/org.Rn.eg.db.html) |
+| org.Sc.sgd.db | 3.19.1 | [https://bioconductor.org/packages/release/data/annotation/html/org.Sc.sgd.db.html](https://www.bioconductor.org/packages/release/data/annotation/html/org.Sc.sgd.db.html) |
+| AnnotationForge | 1.46.0 | [https://bioconductor.org/packages/AnnotationForge](https://bioconductor.org/packages/AnnotationForge) |
+| biomaRt | 2.60.1 | [https://bioconductor.org/packages/biomaRt](https://bioconductor.org/packages/biomaRt) |
+| GO.db | 3.19.1 | [https://bioconductor.org/packages/GO.db](https://bioconductor.org/packages/GO.db) |
+
+---
+
+# Annotation Table Build Overview with Example Commands
+
+Current GeneLab annotation tables are available on [figshare](https://figshare.com/); exact links for each reference organism are provided in the [GL-DPPD-7110-A_annotations.csv](GL-DPPD-7110-A_annotations.csv) file.
+
+**[Ensembl Reference Versions](https://www.ensembl.org/index.html):**
+- Animals: Ensembl release 112
+- Plants: Ensembl plants release 59
+- Bacteria: Ensembl bacteria release 59
+
+**Database Versions:**
+- STRINGdb: 12.0
+- PANTHERdb: 18.0
+ > Note: The values in the 'name' column of [GL-DPPD-7110-A_annotations.csv](GL-DPPD-7110-A_annotations.csv) (e.g., HUMAN, MOUSE, RAT) are derived from the short names used in PANTHER. These short names are subject to change.
+- GO.db:
+ - GO ontology file updated on 2024-01-17
+ - Entrez gene data updated on 2024-03-12
+ - DB schema version 2.1
+
+
+
+---
+
+*All code is executed in R.*
+
+## 0. Set Up Environment
+
+```R
+# Set R library path to current working directory
+lib_path <- file.path(getwd())
+.libPaths(lib_path)
+
+# Define variables associated with current pipeline and annotation table versions
+GL_DPPD_ID <- "GL-DPPD-7110-A"
+workflow_version <- ""
+
+ref_tab_path <- "https://raw.githubusercontent.com/nasa/GeneLab_Data_Processing/master/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv"
+readme_path <- "https://github.com/nasa/GeneLab_Data_Processing/tree/master/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md"
+
+# List currently supported organisms
+currently_accepted_orgs <- c("Arabidopsis thaliana", "Bacillus subtilis", "Brachypodium distachyon",
+ "Caenorhabditis elegans", "Danio rerio", "Drosophila melanogaster",
+ "Escherichia coli", "Homo sapiens", "Lactobacillus acidophilus",
+ "Mus musculus", "Mycobacterium marinum", "Oryza sativa",
+ "Oryzias latipes", "Pseudomonas aeruginosa", "Rattus norvegicus",
+ "Saccharomyces cerevisiae", "Salmonella enterica", "Serratia liquefaciens",
+ "Staphylococcus aureus", "Streptococcus mutans", "Vibrio fischeri")
+
+# Import libraries
+library(tidyverse)
+library(STRINGdb)
+library(PANTHER.db)
+library(rtracklayer)
+```
+**Input Data:**
+
+- None (This is an initial setup step using predefined variables)
+
+**Output Data:**
+
+- `GL_DPPD_ID` (variable specifying the GeneLab Data Processing Pipeline Document ID)
+- `workflow_version` (variable specifying the [current version of the workflow](https://github.com/nasa/GeneLab_Data_Processing/tree/DEV_GeneLab_Reference_Annotations_vGL-DPPD-7110-A/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A))
+- `ref_tab_path` (variable specifying the path to the reference table CSV file)
+- `readme_path` (variable specifying the path to the README file)
+- `currently_accepted_orgs` (variable specifying the list of currently supported organisms)
+
+
+
+---
+
+## 1. Define Variables and Output File Names
+
+```R
+# Set timeout time to ensure annotation file downloads will complete
+options(timeout = 600)
+
+ref_table <- tryCatch(
+ read.csv(ref_tab_path),
+ error = function(e) {
+ msg <- paste("Error: Unable to read the reference table from the path provided. Please check the path and try again.\nPath:", ref_tab_path)
+ stop(msg)
+ }
+)
+
+# Get target organism information
+target_info <- ref_table %>%
+ filter(species == target_organism)
+
+# Extract the relevant columns from the reference table
+target_taxid <- target_info$taxon # Taxonomic identifier
+target_org_db <- target_info$bioconductor_annotations # org.eg.db R package
+gtf_link <- target_info$gtf # Path to reference assembly GTF
+target_short_name <- target_info$name # PANTHER / UNIPROT short name; blank if not available
+ref_source <- target_info$ref_source # Reference files source
+
+# Error handling for missing values
+if (is.na(target_taxid) || is.na(target_org_db) || is.na(target_organism) || is.na(gtf_link)) {
+ stop(paste("Error: Missing data for target organism", target_organism, "in reference table."))
+}
+
+# Create output filenames
+base_gtf_filename <- basename(gtf_link)
+base_output_name <- str_replace(base_gtf_filename, "\\.gtf\\.gz$", "") # Escape the dots so only the literal ".gtf.gz" suffix is removed
+
+# Add the species name to base_output_name if the reference source is not ENSEMBL
+if (!(ref_source %in% c("ensembl_plants", "ensembl_bacteria", "ensembl"))) {
+ base_output_name <- paste(str_replace(target_organism, " ", "_"), base_output_name, sep = "_")
+}
+
+out_table_filename <- paste0(base_output_name, "-GL-annotations.tsv")
+out_log_filename <- paste0(base_output_name, "-GL-build-info.txt")
+
+# Check if output file already exists and if it does, exit without overwriting
+if ( file.exists(out_table_filename) ) {
+ cat("\n-------------------------------------------------------------------------------------------------\n")
+ cat(paste0("\n The file that would be created, '", out_table_filename, "', exists already.\n"))
+ cat(paste0(" We don't want to overwrite it accidentally. Move it and run this again if you want to proceed.\n"))
+ cat("\n-------------------------------------------------------------------------------------------------\n")
+ quit()
+}
+```
+**Input Data:**
+
+- `ref_tab_path` (variable specifying the path to the reference table CSV file, output from [step 0](#0-set-up-environment))
+- `target_organism` (variable specifying the full species name of the target organism for which annotations are being generated)
+ > *Note: This is provided as a positional argument when the R script is run.*
+
+**Output Data:**
+
+- `target_taxid` (variable specifying the taxonomic identifier for the target organism)
+- `target_org_db` (variable specifying the name of the org.eg.db R package for the target organism if it is hosted by Bioconductor)
+- `gtf_link` (variable specifying the URL to the GTF file for the target organism)
+- `target_short_name` (variable specifying the PANTHER/UNIPROT short name for the target organism)
+- `ref_source` (variable specifying the source of the reference files, e.g., "ensembl", "ensembl_plants", "ensembl_bacteria", "ncbi")
+- `out_table_filename` (variable specifying the name of the output annotation table file)
+- `out_log_filename` (variable specifying the name of the output log file)
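+
+As an illustration of the naming logic in this step, the minimal sketch below reproduces it in base R. The helper `make_output_names()` and both example URLs are hypothetical and were made up for this illustration; they are not part of the pipeline or real download links.
+
+```R
+# Hypothetical helper (not part of the pipeline) mirroring the filename logic in step 1
+make_output_names <- function(gtf_link, target_organism, ref_source) {
+  # Strip the literal ".gtf.gz" suffix from the GTF file name
+  base_output_name <- sub("\\.gtf\\.gz$", "", basename(gtf_link))
+
+  # Prefix the species name when the reference does not come from Ensembl
+  if (!(ref_source %in% c("ensembl_plants", "ensembl_bacteria", "ensembl"))) {
+    species_prefix <- sub(" ", "_", target_organism, fixed = TRUE)
+    base_output_name <- paste(species_prefix, base_output_name, sep = "_")
+  }
+
+  list(
+    table = paste0(base_output_name, "-GL-annotations.tsv"),
+    log   = paste0(base_output_name, "-GL-build-info.txt")
+  )
+}
+
+# Placeholder inputs chosen only to show the two naming branches
+ensembl_names <- make_output_names("https://example.org/Mus_musculus.GRCm39.112.gtf.gz",
+                                   "Mus musculus", "ensembl")
+ncbi_names <- make_output_names("https://example.org/GCF_000000000.1_genomic.gtf.gz",
+                                "Vibrio fischeri", "ncbi")
+print(ensembl_names$table)
+print(ncbi_names$table)
+```
+
+With these placeholder inputs, the Ensembl-style name is kept as is, while the NCBI-style name gains the `Vibrio_fischeri_` prefix, so NCBI accession-based file names stay recognizable by species.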
+
+
+
+---
+
+## 2. Create the Organism Package if it is Not Hosted by Bioconductor
+
+```R
+# Use AnnotationForge's makeOrgPackageFromNCBI function with default settings to create the organism-specific org.db R package from available NCBI annotations
+
+# Try to download the org.db from Bioconductor, build it locally if installation fails
+BiocManager::install(target_org_db, ask = FALSE)
+if (!requireNamespace(target_org_db, quietly = TRUE)) {
+ tryCatch({
+ # Parse organism's name in the reference table to create the org.db name (target_org_db)
+ genus_species <- strsplit(target_organism, " ")[[1]]
+ if (length(genus_species) < 1) {
+ stop("Species designation is not correctly formatted: ", target_organism)
+ }
+
+ genus <- genus_species[1]
+ species <- ifelse(length(genus_species) > 1, genus_species[2], "")
+ strain <- ref_table %>%
+ filter(species == target_organism) %>%
+ pull(strain) %>%
+ gsub("[^A-Za-z0-9]", "", .)
+
+ if (!is.na(strain) && strain != "") {
+ species <- paste0(species, strain)
+ }
+
+ # Get package name or build it if not provided
+ target_org_db <- ref_table %>%
+ filter(species == target_organism) %>%
+ pull(annotations)
+
+ if (is.na(target_org_db) || target_org_db == "") {
+ cat("\nNo annotation database specified. Constructing package name...\n")
+ target_org_db <- paste0("org.", substr(genus, 1, 1), species, ".eg.db")
+ }
+
+ BiocManager::install(c("AnnotationForge", "biomaRt", "GO.db"), ask = FALSE)
+ library(AnnotationForge)
+ makeOrgPackageFromNCBI(
+ version = "0.1",
+ author = "Your Name ",
+ maintainer = "Your Name ",
+ outputDir = "./",
+ tax_id = target_taxid,
+ genus = genus,
+ species = species
+ )
+ install.packages(file.path("./", target_org_db), repos = NULL, type = "source", quiet = TRUE)
+ cat(paste0("'", target_org_db, "' has been successfully built and installed.\n"))
+ }, error = function(e) {
+ stop("Failed to build and load the package: ", target_org_db, "\nError: ", e$message)
+ })
+}
+```
+
+**Input Data:**
+
+- `target_org_db` (variable specifying the name of the org.db R package for the target organism, output from [step 1](#1-define-variables-and-output-file-names))
+- `ref_table` (variable specifying the reference table containing organism-specific information, output from [step 1](#1-define-variables-and-output-file-names))
+- `target_organism` (variable specifying the full species name of the target organism, output from [step 1](#1-define-variables-and-output-file-names))
+- `target_taxid` (variable specifying the taxonomic identifier for the target organism, output from [step 1](#1-define-variables-and-output-file-names))
+
+**Output Data:**
+
+- `target_org_db` (variable specifying the updated name of the org.db R package, if it was created locally)
+- Locally installed org.db package (if the package is not available on Bioconductor, a new package is created and installed)
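+
+The fallback package name built in this step follows the pattern `org.<first letter of genus><species><strain>.eg.db`. The sketch below isolates that string construction in base R; the helper `build_org_db_name()` and the strain value passed to it are hypothetical and serve only as an illustration.
+
+```R
+# Hypothetical helper (not part of the pipeline) mirroring the fallback naming in step 2
+build_org_db_name <- function(target_organism, strain = NA) {
+  genus_species <- strsplit(target_organism, " ")[[1]]
+  genus <- genus_species[1]
+  species <- if (length(genus_species) > 1) genus_species[2] else ""
+
+  # Drop non-alphanumeric characters from the strain before appending it
+  if (!is.na(strain) && strain != "") {
+    species <- paste0(species, gsub("[^A-Za-z0-9]", "", strain))
+  }
+
+  # org.<first letter of genus><species[strain]>.eg.db
+  paste0("org.", substr(genus, 1, 1), species, ".eg.db")
+}
+
+name_without_strain <- build_org_db_name("Oryzias latipes")
+name_with_strain <- build_org_db_name("Escherichia coli", strain = "K-12 MG1655") # illustrative strain string
+print(name_without_strain)
+print(name_with_strain)
+```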
+
+
+
+---
+
+## 3. Load Annotation Databases
+
+```R
+# Set timeout time to ensure annotation file downloads will complete
+options(timeout = 600)
+
+####### GTF ##########
+
+# Create the GTF data frame from its path; unique gene identities in the reference assembly are under 'gene_id'
+GTF <- rtracklayer::import(gtf_link)
+GTF <- data.frame(GTF)
+
+###### org.db ########
+
+# Load the package into the R session
+library(target_org_db, character.only = TRUE)
+
+# Define list of supported organisms which do not use annotations from an org.db
+no_org_db <- c("Lactobacillus acidophilus", "Mycobacterium marinum", "Oryza sativa", "Pseudomonas aeruginosa",
+ "Serratia liquefaciens", "Staphylococcus aureus", "Streptococcus mutans", "Vibrio fischeri")
+
+# Run the function unless the target_organism is in no_org_db
+if (!(target_organism %in% no_org_db) && (target_organism %in% currently_accepted_orgs)) {
+ install_and_load_org_db(target_organism, target_org_db, ref_tab_path)
+}
+```
+
+**Input Data:**
+
+- `gtf_link` (variable specifying the URL to the GTF file for the target organism, output from [step 1](#1-define-variables-and-output-file-names))
+- `target_org_db` (variable specifying the name of the org.eg.db R package for the target organism, output from [steps 1](#1-define-variables-and-output-file-names) or [2](#2-create-the-organism-package-if-it-is-not-hosted-by-bioconductor))
+- `target_organism` (variable specifying the full species name of the target organism, output from [step 1](#1-define-variables-and-output-file-names))
+- `currently_accepted_orgs` (variable specifying the list of currently supported organisms, output from [step 0](#0-set-up-environment))
+- `ref_tab_path` (variable specifying the path to the reference table CSV file, output from [step 0](#0-set-up-environment))
+
+**Output Data:**
+
+- `GTF` (variable holding the data frame containing the GTF file for the target organism)
+- `no_org_db` (variable specifying the list of organisms that do not use org.db annotations due to inconsistent gene names across GTF and org.db)
+
+
+
+---
+
+## 4. Build Initial Annotation Table
+
+```R
+# Initialize table from GTF
+
+# Define GTF keys based on the target organism; gene_id contains unique gene IDs in the reference assembly. Defaults to ENSEMBL
+
+gtf_keytype_mappings <- list(
+ "Arabidopsis thaliana" = c(gene_id = "TAIR"),
+ "Bacillus subtilis" = c(gene_id = "ENSEMBL", gene_name = "SYMBOL"),
+ "Brachypodium distachyon" = c(gene_id = "ENSEMBL", transcript_id = "ACCNUM"),
+ "Caenorhabditis elegans" = c(gene_id = "ENSEMBL"),
+ "Escherichia coli" = c(gene_id = "ENSEMBL", gene_name = "SYMBOL"),
+ "Lactobacillus acidophilus" = c(gene_id = "LOCUS", old_locus_tag = "OLD_LOCUS", gene = "SYMBOL", product = "GENENAME", Ontology_term = "GO"),
+ "Mycobacterium marinum" = c(gene_id = "LOCUS", old_locus_tag = "OLD_LOCUS", gene = "SYMBOL", product = "GENENAME", Ontology_term = "GO"),
+ "Pseudomonas aeruginosa" = c(gene_id = "LOCUS", gene = "SYMBOL", product = "GENENAME", Ontology_term = "GO"),
+ "Salmonella enterica" = c(gene_id = "ENSEMBL", db_xref = "ENTREZID"),
+ "Serratia liquefaciens" = c(gene_id = "LOCUS", old_locus_tag = "OLD_LOCUS", gene = "SYMBOL", product = "GENENAME", Ontology_term = "GO"),
+ "Staphylococcus aureus" = c(gene_id = "LOCUS", gene = "SYMBOL", product = "GENENAME", Ontology_term = "GO"),
+ "Streptococcus mutans" = c(gene_id = "LOCUS", old_locus_tag = "OLD_LOCUS", gene = "SYMBOL", product = "GENENAME", Ontology_term = "GO"),
+ "Vibrio fischeri" = c(gene_id = "LOCUS", old_locus_tag = "OLD_LOCUS", gene = "SYMBOL", product = "GENENAME", Ontology_term = "GO"),
+ "default" = c(gene_id = "ENSEMBL")
+)
+
+# Get the key types for the target organism or use the default
+wanted_gtf_keytypes <- if (!is.null(gtf_keytype_mappings[[target_organism]])) {
+ gtf_keytype_mappings[[target_organism]]
+} else {
+ c(gene_id = "ENSEMBL")
+}
+
+# Initialize the annotation table from the GTF, keeping only the wanted_gtf_keytypes
+annot_gtf <- GTF[, names(wanted_gtf_keytypes), drop = FALSE]
+annot_gtf <- annot_gtf %>% distinct()
+
+# Rename the columns in the annot_gtf dataframe according to the key types
+colnames(annot_gtf) <- wanted_gtf_keytypes
+
+# Save the name of the primary key type (gene_id) being used
+primary_keytype <- wanted_gtf_keytypes[1]
+
+# Filter out unwanted genes from the GTF
+
+# Define filtering criteria for specific organisms
+filter_criteria <- list(
+ "Bacillus subtilis" = "^BSU",
+ "Drosophila melanogaster" = "^RR",
+ "Saccharomyces cerevisiae" = "^Y[A-Z0-9]{6}-?[A-Z]?$",
+ "Escherichia coli" = "^b[0-9]{4}$"
+)
+
+# Apply the filter if there's a specific criterion for the target organism
+filter_pattern <- filter_criteria[[target_organism]]
+
+if (!is.null(filter_pattern)) {
+ if (target_organism == "Drosophila melanogaster") {
+ annot_gtf <- annot_gtf %>% filter(!grepl(filter_pattern, !!sym(primary_keytype)))
+ } else {
+ annot_gtf <- annot_gtf %>% filter(grepl(filter_pattern, !!sym(primary_keytype)))
+ }
+}
+
+# Remove "GeneID:" labels on ENTREZ IDs
+if (target_organism == "Salmonella enterica") {
+ annot_gtf <- annot_gtf %>% dplyr::mutate(ENTREZID = gsub("^GeneID:", "", ENTREZID)) %>% as.data.frame
+}
+```
+
+**Input Data:**
+
+- `GTF` (variable holding the data frame containing the parsed GTF file for the target organism, output from [step 3](#3-load-annotation-databases))
+- `target_organism` (variable specifying the full species name of the target organism, output from [step 1](#1-define-variables-and-output-file-names))
+- `gtf_keytype_mappings` (variable specifying the list of keys to extract from the GTF, for each organism)
+
+**Output Data:**
+
+- `annot_gtf` (variable holding the initial annotation table derived from the GTF file, containing only the relevant columns for the target organism)
+- `primary_keytype` (variable specifying the name of the primary key type being used, e.g., "ENSEMBL", "TAIR", "LOCUS", based on the GTF gene_id entries)
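+
+To make the select, de-duplicate, rename, and filter sequence in this step concrete, here is a minimal base R sketch on a toy data frame. The gene IDs and names are invented for illustration and do not come from a real GTF; `unique()` stands in for `distinct()`.
+
+```R
+# Toy stand-in for the imported GTF data frame; all values are invented
+GTF <- data.frame(
+  gene_id   = c("BSU_00010", "BSU_00010", "BSU_00020", "EXTRA_1"),
+  gene_name = c("dnaA", "dnaA", "dnaN", "misc"),
+  type      = c("gene", "exon", "gene", "gene"),
+  stringsAsFactors = FALSE
+)
+
+wanted_gtf_keytypes <- c(gene_id = "ENSEMBL", gene_name = "SYMBOL")
+
+# Keep only the wanted attributes and drop duplicate rows
+annot_gtf <- unique(GTF[, names(wanted_gtf_keytypes), drop = FALSE])
+
+# Rename the GTF attribute columns to their key type names
+colnames(annot_gtf) <- unname(wanted_gtf_keytypes)
+primary_keytype <- wanted_gtf_keytypes[[1]]
+
+# Keep only gene IDs matching the organism-specific pattern ("^BSU" for Bacillus subtilis)
+annot_gtf <- annot_gtf[grepl("^BSU", annot_gtf[[primary_keytype]]), , drop = FALSE]
+print(annot_gtf)
+```
+
+The duplicated `BSU_00010` row collapses to one entry and the non-matching `EXTRA_1` row is dropped, leaving one row per retained gene.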
+
+
+
+---
+
+## 5. Add org.db Keys
+
+```R
+annot_orgdb <- annot_gtf
+
+# Define the initial keys to pull from the organism-specific database
+orgdb_keytypes_list <- list(
+ "Brachypodium distachyon" = c("GENENAME", "REFSEQ", "ENTREZID"),
+ "Escherichia coli" = c("GENENAME", "REFSEQ", "ENTREZID"),
+ "Caenorhabditis elegans" = c("SYMBOL", "GENENAME", "REFSEQ", "ENTREZID", "GO"),
+ "Salmonella enterica" = c("SYMBOL", "GENENAME", "REFSEQ"),
+ "Saccharomyces cerevisiae" = c("GENENAME", "ALIAS", "REFSEQ", "ENTREZID"),
+ "default" = c("SYMBOL", "GENENAME", "REFSEQ", "ENTREZID")
+)
+
+# Add entries for organisms in no_org_db as character(0) (no keys wanted from the org.db)
+for (organism in no_org_db) {
+ orgdb_keytypes_list[[organism]] <- character(0)
+}
+
+wanted_org_db_keytypes <- if (target_organism %in% names(orgdb_keytypes_list)) {
+ orgdb_keytypes_list[[target_organism]]
+} else {
+ orgdb_keytypes_list[["default"]]
+}
+
+# Define mappings for query and keytype based on target organism
+orgdb_keytype_mappings <- list(
+ "Bacillus subtilis" = list(query = "SYMBOL", keytype = "SYMBOL"),
+ "Brachypodium distachyon" = list(query = "ACCNUM", keytype = "ACCNUM"),
+ "Caenorhabditis elegans" = list(query = primary_keytype, keytype = "ENSEMBL"),
+ "Escherichia coli" = list(query = "SYMBOL", keytype = "SYMBOL"),
+ "Salmonella enterica" = list(query = "ENTREZID", keytype = "ENTREZID"),
+ "default" = list(query = primary_keytype, keytype = primary_keytype)
+)
+
+# Define the orgdb_query, this is the key type that will be used to map to the org.db
+orgdb_query <- if (!is.null(orgdb_keytype_mappings[[target_organism]])) {
+ orgdb_keytype_mappings[[target_organism]][["query"]]
+} else {
+ orgdb_keytype_mappings[["default"]][["query"]]
+}
+
+# Define the orgdb_keytype, this is the name of the key type in the org.db
+orgdb_keytype <- if (!is.null(orgdb_keytype_mappings[[target_organism]])) {
+ orgdb_keytype_mappings[[target_organism]][["keytype"]]
+} else {
+ orgdb_keytype_mappings[["default"]][["keytype"]]
+}
+
+# Function to remove version numbers from ACCNUM keys and match them for BRADI
+match_accnum <- function(annot_table, org_db, query_col, keytype_col, target_column) {
+ # Remove version numbers from the ACCNUM keys in the GTF annotations
+ cleaned_annot_keys <- sub("\\..*", "", annot_table[[query_col]])
+
+ # Retrieve and remove version numbers from the org.db keys
+ orgdb_keys <- keys(org_db, keytype = keytype_col)
+ cleaned_orgdb_keys <- sub("\\..*", "", orgdb_keys)
+
+ # Create a lookup table for matching cleaned keys to original keys
+ lookup_table <- setNames(orgdb_keys, cleaned_orgdb_keys)
+
+ # Match cleaned GTF keys to original org.db keys
+ matched_keys <- lookup_table[cleaned_annot_keys]
+
+ # Use the matched keys to retrieve the target annotations from org.db
+ mapIds(org_db, keys = matched_keys, keytype = keytype_col, column = target_column, multiVals = "list")
+}
+
+# Loop through the desired key types and add annotations to the GTF table
+for (keytype in wanted_org_db_keytypes) {
+ # Check if keytype is a valid column in the target org.db
+ if (keytype %in% columns(get(target_org_db, envir = .GlobalEnv))) {
+ if (target_organism == "Brachypodium distachyon" && orgdb_query == "ACCNUM") {
+ # For BRADI: use the match_accnum function to map to org.db ACCNUM entries
+ org_matches <- match_accnum(annot_orgdb, get(target_org_db, envir = .GlobalEnv), query_col = orgdb_query, keytype_col = orgdb_keytype, target_column = keytype)
+ } else {
+ # Default mapping for other organisms
+ org_matches <- mapIds(get(target_org_db, envir = .GlobalEnv), keys = annot_orgdb[[orgdb_query]], keytype = orgdb_keytype, column = keytype, multiVals = "list")
+ }
+ # Add the mapped annotations to the GTF table
+ annot_orgdb[[keytype]] <- sapply(org_matches, function(x) paste(x, collapse = "|"))
+ } else {
+ # Set column to NA if keytype is not present in org.db
+ annot_orgdb[[keytype]] <- NA
+ }
+}
+
+# For SALTY, reorder columns to match other tables
+if (target_organism == "Salmonella enterica") { # Reorder columns to match others; was mismatched since ENTREZ came from GTF
+ annot_orgdb <- annot_orgdb[, c("ENSEMBL", "SYMBOL", "GENENAME", "REFSEQ", "ENTREZID")]
+}
+
+# For YEAST, rename GENENAME to SYMBOL and ALIAS to GENENAME
+if (target_organism == "Saccharomyces cerevisiae") {
+ colnames(annot_orgdb) <- c("ENSEMBL", "SYMBOL", "GENENAME", "REFSEQ", "ENTREZID")
+}
+```
+
+**Input Data:**
+
+- `annot_gtf` (variable holding the initial annotation table derived from the GTF file, output from [step 4](#4-build-initial-annotation-table))
+- `target_organism` (variable specifying the full species name of the target organism, output from [step 1](#1-define-variables-and-output-file-names))
+- `no_org_db` (variable specifying the list of organisms that do not use annotations from an org.db, output from [step 3](#3-load-annotation-databases))
+- `primary_keytype` (variable specifying the name of the primary key type being used, output from [step 4](#4-build-initial-annotation-table))
+- `target_org_db` (variable specifying the name of the org.eg.db R package for the target organism, output from [steps 1](#1-define-variables-and-output-file-names) or [2](#2-create-the-organism-package-if-it-is-not-hosted-by-bioconductor))
+
+**Output Data:**
+
+- `annot_orgdb` (variable holding the updated annotation table with GTF and organism-specific org.db annotations)
+- `orgdb_query` (variable specifying the key type used to map to the org.db)
+- `orgdb_keytype` (variable specifying the name of the key type in the org.db)
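+
+The version-insensitive key matching performed by `match_accnum()` can be examined without an org.db. The sketch below applies the same lookup idea to two invented vectors of accession keys; the key values are hypothetical and the final `mapIds()` call is left out.
+
+```R
+# Invented accession keys; the version suffixes after the dot intentionally differ
+gtf_keys   <- c("KQK00001.2", "KQK00002.1", "KQK99999.1")
+orgdb_keys <- c("KQK00001.1", "KQK00002.1", "KQK00003.1")
+
+# Strip everything from the first dot onward, as match_accnum() does
+cleaned_gtf_keys   <- sub("\\..*", "", gtf_keys)
+cleaned_orgdb_keys <- sub("\\..*", "", orgdb_keys)
+
+# Named vector: version-free key -> original org.db key
+lookup_table <- setNames(orgdb_keys, cleaned_orgdb_keys)
+
+# GTF keys without an org.db counterpart come back as NA
+matched_keys <- unname(lookup_table[cleaned_gtf_keys])
+print(matched_keys)
+```
+
+Here the first GTF key matches despite its different version suffix, while the third key has no counterpart and yields `NA`.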
+
+
+
+---
+
+## 6. Add STRING IDs
+
+```R
+# Define organisms that do not use STRING annotations
+no_stringdb <- c("Pseudomonas aeruginosa", "Staphylococcus aureus")
+
+# Define the key type used for mapping to STRING
+stringdb_query_list <- list(
+ "Lactobacillus acidophilus" = "OLD_LOCUS",
+ "Mycobacterium marinum" = "OLD_LOCUS",
+ "Serratia liquefaciens" = "OLD_LOCUS",
+ "Streptococcus mutans" = "OLD_LOCUS",
+ "Vibrio fischeri" = "OLD_LOCUS",
+ "default" = primary_keytype
+)
+
+# Define the key type for mapping in STRING, using the default if necessary
+stringdb_query <- if (!is.null(stringdb_query_list[[target_organism]])) {
+ stringdb_query_list[[target_organism]]
+} else {
+ stringdb_query_list[["default"]]
+}
+
+# Handle organisms which do not use the GTF's gene_id keys to map to STRING
+# These are microbial species for which NCBI references were used rather than ENSEMBL,
+# and for which the STRING accessions match the GTF's old_locus_tag keys, but not the gene_id keys.
+uses_old_locus <- c("Lactobacillus acidophilus", "Mycobacterium marinum", "Serratia liquefaciens", "Streptococcus mutans", "Vibrio fischeri")
+# Handle STRING annotation processing based on the target organism
+if (target_organism %in% uses_old_locus) {
+ annot_stringdb <- annot_orgdb %>%
+ separate_rows(!!sym(stringdb_query), sep = ",", convert = TRUE) %>%
+ distinct() %>%
+ as.data.frame()
+} else {
+ # For other organisms, collapse on the primary key
+ annot_stringdb <- annot_orgdb %>% distinct()
+ annot_stringdb <- annot_stringdb %>%
+ group_by(!!sym(primary_keytype)) %>%
+ summarise(across(everything(), ~paste(unique(na.omit(.))[unique(na.omit(.)) != ""], collapse = "|")), .groups = 'drop') %>%
+ as.data.frame()
+}
+
+# Replace "BSU_" with "BSU" in the primary_keytype column for BACSU before STRING mapping
+if (target_organism == "Bacillus subtilis") {
+ annot_stringdb[[stringdb_query]] <- gsub("^BSU_", "BSU", annot_stringdb[[stringdb_query]])
+}
+
+# Map alternative taxonomy IDs for organisms not directly supported by STRING
+taxid_map <- list(
+ "Saccharomyces cerevisiae" = 4932,
+ "Brassica rapa" = 51351,
+ "Serratia liquefaciens" = 614
+)
+
+# Assign the alternative taxonomy identifier if applicable
+target_taxid <- if (!is.null(taxid_map[[target_organism]])) {
+ taxid_map[[target_organism]]
+} else {
+ target_taxid
+}
+
+# Initialize string_map
+string_map <- NULL
+
+# If the target organism is supported by STRING, get STRING annotations
+if (!(target_organism %in% no_stringdb)) {
+ string_db <- STRINGdb$new(version = "12.0", species = target_taxid, score_threshold = 0)
+ string_map <- string_db$map(annot_stringdb, stringdb_query, removeUnmappedRows = FALSE, takeFirst = FALSE)
+}
+if (!is.null(string_map)) {
+ annot_stringdb <- annot_stringdb %>%
+ group_by(!!sym(primary_keytype)) %>%
+ summarise(across(everything(), ~paste(unique(na.omit(.))[unique(na.omit(.)) != ""], collapse = "|")), .groups = 'drop')
+
+ string_map <- string_map %>%
+ group_by(!!sym(primary_keytype)) %>%
+ summarise(across(everything(), ~paste(unique(na.omit(.))[unique(na.omit(.)) != ""], collapse = "|")), .groups = 'drop')
+}
+
+if (!is.null(string_map)) {
+ # Determine the appropriate join key
+ join_key <- if (target_organism %in% uses_old_locus) {
+ primary_keytype
+ } else {
+ stringdb_query
+ }
+
+ # Add temporary column to add string IDs to annotation table
+ annot_stringdb <- annot_stringdb %>%
+ mutate(join_key = toupper(!!sym(join_key)))
+
+ string_map <- string_map %>%
+ mutate(join_key = toupper(!!sym(join_key)))
+
+ # Join STRING IDs to the annotation table
+ annot_stringdb <- left_join(annot_stringdb, string_map %>% dplyr::select(join_key, STRING_id), by = "join_key") %>%
+ dplyr::select(-join_key)
+}
+
+# Undo the "BSU_" to "BSU" replacement for BACSU after STRING mapping
+if (target_organism == "Bacillus subtilis") {
+ annot_stringdb[[stringdb_query]] <- gsub("^BSU", "BSU_", annot_stringdb[[stringdb_query]])
+}
+
+annot_stringdb <- as.data.frame(annot_stringdb)
+```
+
+**Input Data:**
+
+- `annot_orgdb` (variable holding the annotation table with GTF and organism-specific org.db annotations, output from [step 5](#5-add-orgdb-keys))
+- `target_organism` (variable specifying the full species name of the target organism, output from [step 1](#1-define-variables-and-output-file-names))
+- `primary_keytype` (variable specifying the name of the primary key type being used, output from [step 4](#4-build-initial-annotation-table))
+- `target_taxid` (variable specifying the taxonomic identifier for the target organism, output from [step 1](#1-define-variables-and-output-file-names))
+
+**Output Data:**
+
+- `annot_stringdb` (variable holding the updated annotation table with GTF, organism-specific org.db, and STRING annotations)
+- `no_stringdb` (variable specifying the list of organisms that do not use STRING annotations)
+- `stringdb_query` (variable specifying the key type used for mapping to STRING database)
+- `uses_old_locus` (variable specifying the list of organisms where GTF gene_id entries do not match those in STRING, so entries in OLD_LOCUS are used to query STRING)
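+
+As a minimal sketch of the collapse idiom used in the code above (not part of the pipeline itself), the hypothetical toy table below shows how rows that share a primary key are merged into one row, with unique non-empty values joined by "|". It only assumes that `dplyr` is installed; the column names and values are made up for illustration.
+
+```R
+library(dplyr)
+
+# Hypothetical toy table: gene G1 has two symbols, gene G2 has none
+toy <- data.frame(
+  ENSEMBL = c("G1", "G1", "G2"),
+  SYMBOL = c("a", "b", NA),
+  stringsAsFactors = FALSE
+)
+
+# Same expression as in the step above: drop NA and empty values,
+# keep unique entries, and join them with "|"
+collapsed <- toy %>%
+  group_by(ENSEMBL) %>%
+  summarise(across(everything(), ~paste(unique(na.omit(.))[unique(na.omit(.)) != ""], collapse = "|")), .groups = "drop") %>%
+  as.data.frame()
+
+# G1 now holds "a|b"; G2 holds an empty string, which step 8 later converts to NA
+print(collapsed)
+```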
+
+
+
+---
+
+## 7. Add Gene Ontology (GO) slim IDs
+
+```R
+# Define organisms that do not use PANTHER annotations
+no_panther_db <- c("Caenorhabditis elegans", "Mycobacterium marinum", "Oryza sativa", "Staphylococcus aureus", "Lactobacillus acidophilus", "Serratia liquefaciens", "Streptococcus mutans", "Vibrio fischeri", "Pseudomonas aeruginosa")
+
+annot_pantherdb <- annot_stringdb
+
+if (!(target_organism %in% no_panther_db)) {
+
+ # Define the key type in the annotation table used to map to PANTHER DB
+ pantherdb_query <- "ENTREZID"
+ pantherdb_keytype <- "ENTREZ"
+
+ # Retrieve target organism PANTHER GO slim annotations database using the UNIPROT / PANTHER short name
+ pthOrganisms(PANTHER.db) <- target_short_name
+
+ # Define a function to retrieve GO slim IDs for a given gene's ENTREZIDs, which may include entries separated by a "|"
+ get_go_slim_ids <- function(entrez_id) {
+ if (is.na(entrez_id) || entrez_id == "NA") {
+ return("NA")
+ }
+
+ entrez_ids <- unlist(strsplit(entrez_id, "|", fixed = TRUE))
+ go_ids <- lapply(entrez_ids, function(id) {
+ mapIds(PANTHER.db, keys = id, keytype = pantherdb_keytype, column = "GOSLIM_ID", multiVals = "list")
+ })
+
+ # Flatten the list and remove duplicates
+ go_ids <- unique(unlist(go_ids))
+
+ if (length(go_ids) == 0) {
+ return("NA")
+ } else {
+ return(paste(go_ids, collapse = "|"))
+ }
+ }
+
+ # Apply the GO slim ID mapping function to all valid rows
+ annot_pantherdb <- annot_pantherdb %>%
+ mutate(GOSLIM_IDS = sapply(get(pantherdb_query), get_go_slim_ids))
+}
+```
+
+**Input Data:**
+
+- `annot_stringdb` (variable holding the annotation table with GTF, organism-specific org.db, and STRING annotations, output from [step 6](#6-add-string-ids))
+- `target_organism` (variable specifying the full species name of the target organism, output from [step 1](#1-define-variables-and-output-file-names))
+
+**Output Data:**
+
+- `annot_pantherdb` (variable holding the updated annotation table with GTF, organism-specific org.db, STRING, and PANTHER GO Slim annotations)
+- `no_panther_db` (variable specifying the list of organisms that do not use PANTHER annotations)
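+
+The `fixed = TRUE` argument in `get_go_slim_ids` matters because "|" is a regular-expression metacharacter. The short base R sketch below, using a hypothetical ENTREZID cell, illustrates the difference:
+
+```R
+# Hypothetical ENTREZID cell holding two IDs joined by "|"
+entrez_cell <- "1234|5678"
+
+# fixed = TRUE treats "|" as a literal separator, giving one element per ID
+strsplit(entrez_cell, "|", fixed = TRUE)[[1]]
+
+# Without fixed = TRUE, "|" is parsed as a regular expression (alternation),
+# so the string is not split on the literal "|" character
+strsplit(entrez_cell, "|")[[1]]
+```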
+
+
+
+---
+
+## 8. Export Annotation Table and Build Info
+
+```R
+# Group by primary key to remove any remaining unjoined or duplicate rows
+annot <- annot_pantherdb %>%
+ group_by(!!sym(primary_keytype)) %>%
+ summarise(across(everything(), ~paste(unique(na.omit(.))[unique(na.omit(.)) != ""], collapse = "|")), .groups = 'drop')
+
+# If "GO" column exists, move it to the end to keep columns in consistent order across organisms
+if ("GO" %in% names(annot)) {
+ go_column <- annot$GO
+ annot$GO <- NULL
+ annot$GO <- go_column
+}
+
+# Sort the annotation table based on primary keytype gene IDs
+annot <- annot %>% arrange(.[[1]])
+
+# Replace any blank cells with NA
+annot[annot == "" | annot == "NA"] <- NA
+
+# Export the annotation table
+write.table(annot, out_table_filename, sep = "\t", quote = FALSE, row.names = FALSE)
+
+# Define the date when the annotation table was generated
+date_generated <- format(Sys.time(), "%d-%B-%Y")
+
+# Export annotation build information
+writeLines(paste(c("Based on:\n ", GL_DPPD_ID), collapse = ""), out_log_filename)
+write(paste(c("\nBuild done on:\n ", date_generated), collapse = ""), out_log_filename, append = TRUE)
+write(paste(c("\nUsed gtf file:\n ", gtf_link), collapse = ""), out_log_filename, append = TRUE)
+if (!(target_organism %in% no_org_db)) {
+ write(paste(c("\nUsed ", target_org_db, " version:\n ", packageVersion(target_org_db) %>% as.character()), collapse = ""), out_log_filename, append = TRUE)
+}
+write(paste(c("\nUsed STRINGdb version:\n ", packageVersion("STRINGdb") %>% as.character()), collapse = ""), out_log_filename, append = TRUE)
+write(paste(c("\nUsed PANTHER.db version:\n ", packageVersion("PANTHER.db") %>% as.character()), collapse = ""), out_log_filename, append = TRUE)
+
+write("\n\nAll session info:\n", out_log_filename, append = TRUE)
+write(capture.output(sessionInfo()), out_log_filename, append = TRUE)
+```
+
+**Input Data:**
+
+- `annot_pantherdb` (variable holding the updated annotation table with GTF, organism-specific org.db, STRING, and PANTHER GO Slim annotations, output from [step 7](#7-add-gene-ontology-go-slim-ids))
+- `primary_keytype` (variable specifying the name of the primary key type being used, output from [step 4](#4-build-initial-annotation-table))
+- `out_table_filename` (variable specifying the name of the output annotation table file, output from [step 1](#1-define-variables-and-output-file-names))
+- `out_log_filename` (variable specifying the name of the output log file, output from [step 1](#1-define-variables-and-output-file-names))
+- `GL_DPPD_ID` (variable specifying the GeneLab Data Processing Pipeline Document ID, output from [step 0](#0-set-up-environment))
+- `gtf_link` (variable specifying the URL to the GTF file for the target organism, output from [step 1](#1-define-variables-and-output-file-names))
+- `target_org_db` (variable specifying the name of the org.eg.db R package for the target organism, output from [steps 1](#1-define-variables-and-output-file-names) or [2](#2-create-the-organism-package-if-it-is-not-hosted-by-bioconductor))
+- `no_org_db` (variable specifying the list of organisms that do not use org.db annotations, output from [step 3](#3-load-annotation-databases))
+
+**Output Data:**
+
+- `annot` (variable holding the final annotation table with GTF, organism-specific org.db, STRING, and PANTHER GO Slim annotations)
+- ***-GL-annotations.tsv** (final annotation table saved as a tab-delimited table file)
+- ***-GL-build-info.txt** (annotation table build information log file)
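+
+To illustrate the blank-to-NA replacement and the tab-delimited export above, here is a small base R sketch. The toy table and the temporary file are hypothetical stand-ins for `annot` and `out_table_filename`.
+
+```R
+# Hypothetical toy table standing in for the collapsed annotation table
+toy <- data.frame(
+  ENSEMBL = c("G1", "G2"),
+  SYMBOL = c("a|b", ""),
+  GOSLIM_IDS = c("NA", "GO:0008150"),
+  stringsAsFactors = FALSE
+)
+
+# Same replacement as in the step above: blank cells and literal "NA" strings become NA
+toy[toy == "" | toy == "NA"] <- NA
+
+# Write the table as tab-delimited text, then read it back
+tmp <- tempfile(fileext = ".tsv")
+write.table(toy, tmp, sep = "\t", quote = FALSE, row.names = FALSE)
+roundtrip <- read.delim(tmp, stringsAsFactors = FALSE)
+
+# Logical matrix marking the missing cells after the round trip
+is.na(roundtrip)
+```
+
+Because `write.table` writes missing values as the string `NA` and `read.delim` reads that string back as missing by default, downstream scripts that load the table with default settings should see the same missing cells.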
+
+
+
+---
+
+**Pipeline Input data:**
+
+- No input files required, but a target organism must be specified as a positional command line argument
+
+**Pipeline Output data:**
+
+- ***-GL-annotations.tsv** (Tab-delimited table of gene annotations, used to add gene annotations in other GeneLab processing pipelines)
+- ***-GL-build-info.txt** (Text file containing information used to create the annotation table, including tools, tool versions, and date of creation)
diff --git a/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv b/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv
new file mode 100644
index 00000000..c2a881e2
--- /dev/null
+++ b/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv
@@ -0,0 +1,24 @@
+name,species,strain,ensemblVersion,ref_source,fasta,gtf,taxon,bioconductor_annotations,custom_annotations,genelab_annots_link,genelab_annots_info_link
+ARABIDOPSIS,Arabidopsis thaliana,,59,ensembl_plants,https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-59/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz,https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-59/gtf/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.59.gtf.gz,3702,org.At.tair.db,,https://figshare.com/ndownloader/files/48354355,https://figshare.com/ndownloader/files/48354352
+BACSU,Bacillus subtilis,subsp. subtilis 168,59,ensembl_bacteria,https://ftp.ensemblgenomes.ebi.ac.uk/pub/bacteria/release-59/fasta/bacteria_0_collection/bacillus_subtilis_subsp_subtilis_str_168_gca_000009045/dna/Bacillus_subtilis_subsp_subtilis_str_168_gca_000009045.ASM904v1.dna.toplevel.fa.gz,https://ftp.ensemblgenomes.ebi.ac.uk/pub/bacteria/release-59/gtf/bacteria_0_collection/bacillus_subtilis_subsp_subtilis_str_168_gca_000009045/Bacillus_subtilis_subsp_subtilis_str_168_gca_000009045.ASM904v1.59.gtf.gz,224308,,org.Bsubtilissubspsubtilis168.eg.db,https://figshare.com/ndownloader/files/48354346,https://figshare.com/ndownloader/files/48354349
+BRADI,Brachypodium distachyon,,59,ensembl_plants,https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-59/fasta/brachypodium_distachyon/dna/Brachypodium_distachyon.Brachypodium_distachyon_v3.0.dna.toplevel.fa.gz,https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-59/gtf/brachypodium_distachyon/Brachypodium_distachyon.Brachypodium_distachyon_v3.0.59.gtf.gz,15368,,org.Bdistachyon.eg.db,https://figshare.com/ndownloader/files/48354370,https://figshare.com/ndownloader/files/48354361
+BRARP,Brassica rapa,,59,ensembl_plants,http://ftp.ensemblgenomes.org/pub/plants/release-59/fasta/brassica_rapa/dna/Brassica_rapa.Brapa_1.0.dna.toplevel.fa.gz,http://ftp.ensemblgenomes.org/pub/plants/release-59/gtf/brassica_rapa/Brassica_rapa.Brapa_1.0.59.gtf.gz,,,,,
+WORM,Caenorhabditis elegans,,112,ensembl,https://ftp.ensembl.org/pub/release-112/fasta/caenorhabditis_elegans/dna/Caenorhabditis_elegans.WBcel235.dna.toplevel.fa.gz,https://ftp.ensembl.org/pub/release-112/gtf/caenorhabditis_elegans/Caenorhabditis_elegans.WBcel235.112.gtf.gz,6239,org.Ce.eg.db,,https://figshare.com/ndownloader/files/48354373,https://figshare.com/ndownloader/files/48354364
+ZEBRAFISH,Danio rerio,,112,ensembl,http://ftp.ensembl.org/pub/release-112/fasta/danio_rerio/dna/Danio_rerio.GRCz11.dna.primary_assembly.fa.gz,http://ftp.ensembl.org/pub/release-112/gtf/danio_rerio/Danio_rerio.GRCz11.112.gtf.gz,7955,org.Dr.eg.db,,https://figshare.com/ndownloader/files/48354388,https://figshare.com/ndownloader/files/48354367
+FLY,Drosophila melanogaster,,112,ensembl,http://ftp.ensembl.org/pub/release-112/fasta/drosophila_melanogaster/dna/Drosophila_melanogaster.BDGP6.46.dna.toplevel.fa.gz,http://ftp.ensembl.org/pub/release-112/gtf/drosophila_melanogaster/Drosophila_melanogaster.BDGP6.46.112.gtf.gz,7227,org.Dm.eg.db,,https://figshare.com/ndownloader/files/48354382,https://figshare.com/ndownloader/files/48354376
+ERCC,,,,ThermoFisher,https://assets.thermofisher.com/TFS-Assets/LSG/manuals/ERCC92.zip,https://assets.thermofisher.com/TFS-Assets/LSG/manuals/ERCC92.zip,,,,,
+ECOLI,Escherichia coli,str. K-12 substr. MG1655,59,ensembl_bacteria,https://ftp.ensemblgenomes.ebi.ac.uk/pub/bacteria/release-59/fasta/bacteria_0_collection/escherichia_coli_str_k_12_substr_mg1655_gca_000005845/dna/Escherichia_coli_str_k_12_substr_mg1655_gca_000005845.ASM584v2.dna.toplevel.fa.gz,https://ftp.ensemblgenomes.ebi.ac.uk/pub/bacteria/release-59/gtf/bacteria_0_collection/escherichia_coli_str_k_12_substr_mg1655_gca_000005845/Escherichia_coli_str_k_12_substr_mg1655_gca_000005845.ASM584v2.59.gtf.gz,511145,,org.EcolistrK12substrMG1655.eg.db,https://figshare.com/ndownloader/files/48354379,https://figshare.com/ndownloader/files/48354394
+HUMAN,Homo sapiens,,112,ensembl,https://ftp.ensembl.org/pub/release-112/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz,https://ftp.ensembl.org/pub/release-112/gtf/homo_sapiens/Homo_sapiens.GRCh38.112.gtf.gz,9606,org.Hs.eg.db,,https://figshare.com/ndownloader/files/48354445,https://figshare.com/ndownloader/files/48354448
+,Lactobacillus acidophilus,NCFM,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/011/985/GCF_000011985.1_ASM1198v1/GCF_000011985.1_ASM1198v1_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/011/985/GCF_000011985.1_ASM1198v1/GCF_000011985.1_ASM1198v1_genomic.gtf.gz,272621,,,https://figshare.com/ndownloader/files/49061254,https://figshare.com/ndownloader/files/49061257
+MOUSE,Mus musculus,,112,ensembl,https://ftp.ensembl.org/pub/release-112/fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna.primary_assembly.fa.gz,https://ftp.ensembl.org/pub/release-112/gtf/mus_musculus/Mus_musculus.GRCm39.112.gtf.gz,10090,org.Mm.eg.db,,https://figshare.com/ndownloader/files/48354460,https://figshare.com/ndownloader/files/48354457
+,Mycobacterium marinum,M,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/018/345/GCF_000018345.1_ASM1834v1/GCF_000018345.1_ASM1834v1_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/018/345/GCF_000018345.1_ASM1834v1/GCF_000018345.1_ASM1834v1_genomic.gtf.gz,216594,,,https://figshare.com/ndownloader/files/49061260,https://figshare.com/ndownloader/files/49061263
+ORYSJ,Oryza sativa,Japonica,59,ensembl_plants,https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-59/fasta/oryza_sativa/dna/Oryza_sativa.IRGSP-1.0.dna.toplevel.fa.gz,https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-59/gtf/oryza_sativa/Oryza_sativa.IRGSP-1.0.59.gtf.gz,39947,,,https://figshare.com/ndownloader/files/48354451,https://figshare.com/ndownloader/files/48354454
+ORYLA,Oryzias latipes,,112,ensembl,http://ftp.ensembl.org/pub/release-112/fasta/oryzias_latipes/dna/Oryzias_latipes.ASM223467v1.dna.toplevel.fa.gz,http://ftp.ensembl.org/pub/release-112/gtf/oryzias_latipes/Oryzias_latipes.ASM223467v1.112.gtf.gz,8090,,org.Olatipes.eg.db,https://figshare.com/ndownloader/files/48354463,https://figshare.com/ndownloader/files/48354466
+,Pseudomonas aeruginosa,UCBPP-PA14,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/014/625/GCF_000014625.1_ASM1462v1/GCF_000014625.1_ASM1462v1_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/014/625/GCF_000014625.1_ASM1462v1/GCF_000014625.1_ASM1462v1_genomic.gtf.gz,208963,,,https://figshare.com/ndownloader/files/49061266,https://figshare.com/ndownloader/files/49061269
+RAT,Rattus norvegicus,,112,ensembl,http://ftp.ensembl.org/pub/release-112/fasta/rattus_norvegicus/dna/Rattus_norvegicus.mRatBN7.2.dna.toplevel.fa.gz,http://ftp.ensembl.org/pub/release-112/gtf/rattus_norvegicus/Rattus_norvegicus.mRatBN7.2.112.gtf.gz,10116,org.Rn.eg.db,,https://figshare.com/ndownloader/files/48354472,https://figshare.com/ndownloader/files/48354475
+YEAST,Saccharomyces cerevisiae,S288C,112,ensembl,http://ftp.ensembl.org/pub/release-112/fasta/saccharomyces_cerevisiae/dna/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa.gz,http://ftp.ensembl.org/pub/release-112/gtf/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.112.gtf.gz,559292,org.Sc.sgd.db,,https://figshare.com/ndownloader/files/48354469,https://figshare.com/ndownloader/files/48354478
+SALTY,Salmonella enterica,serovar Typhimurium str. LT2,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/006/945/GCF_000006945.2_ASM694v2/GCF_000006945.2_ASM694v2_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/006/945/GCF_000006945.2_ASM694v2/GCF_000006945.2_ASM694v2_genomic.gtf.gz,99287,,org.SentericaserovarTyphimuriumstrLT2.eg.db,https://figshare.com/ndownloader/files/49061272,https://figshare.com/ndownloader/files/49061275
+,Serratia liquefaciens,ATCC 27592,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/422/085/GCF_000422085.1_ASM42208v1/GCF_000422085.1_ASM42208v1_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/422/085/GCF_000422085.1_ASM42208v1/GCF_000422085.1_ASM42208v1_genomic.gtf.gz,1346614,,,https://figshare.com/ndownloader/files/49061278,https://figshare.com/ndownloader/files/49061281
+,Staphylococcus aureus,MRSA252,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/011/505/GCF_000011505.1_ASM1150v1/GCF_000011505.1_ASM1150v1_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/011/505/GCF_000011505.1_ASM1150v1/GCF_000011505.1_ASM1150v1_genomic.gtf.gz,282458,,,https://figshare.com/ndownloader/files/49061284,https://figshare.com/ndownloader/files/49061287
+,Streptococcus mutans,UA159,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/007/465/GCF_000007465.2_ASM746v2/GCF_000007465.2_ASM746v2_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/007/465/GCF_000007465.2_ASM746v2/GCF_000007465.2_ASM746v2_genomic.gtf.gz,210007,,,https://figshare.com/ndownloader/files/49061290,https://figshare.com/ndownloader/files/49061293
+,Vibrio fischeri,ES114,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/011/805/GCF_000011805.1_ASM1180v1/GCF_000011805.1_ASM1180v1_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/011/805/GCF_000011805.1_ASM1180v1/GCF_000011805.1_ASM1180v1_genomic.gtf.gz,312309,,,https://figshare.com/ndownloader/files/49061296,https://figshare.com/ndownloader/files/49061299
\ No newline at end of file
diff --git a/GeneLab_Reference_Annotations/README.md b/GeneLab_Reference_Annotations/README.md
index e11a15c0..07896e0c 100644
--- a/GeneLab_Reference_Annotations/README.md
+++ b/GeneLab_Reference_Annotations/README.md
@@ -1,6 +1,6 @@
# GeneLab pipeline for generating reference annotation tables
-> **The document [`GL-DPPD-7110.md`](Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110/GL-DPPD-7110.md) holds an overview and example commands for how GeneLab generates reference annotation tables. See the [Repository Links](#repository-links) descriptions below for more information.**
+> **The document [`GL-DPPD-7110-A.md`](Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md) holds an overview and example commands for how GeneLab generates reference annotation tables. See the [Repository Links](#repository-links) descriptions below for more information.**
---
## Repository Links
@@ -17,6 +17,9 @@
---
-**Developed and maintained by:**
+**Developed by:**
Mike Lee
+**Maintained by:**
+Alexis Torres
+Crystal Han
diff --git a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/CHANGELOG.md b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/CHANGELOG.md
new file mode 100644
index 00000000..014cf89a
--- /dev/null
+++ b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/CHANGELOG.md
@@ -0,0 +1,62 @@
+# Changelog
+
+All notable changes to this project will be documented in this file.
+
+The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
+and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+
+## [1.1.0](https://github.com/nasa/GeneLab_Data_Processing/blob/DEV_GeneLab_Reference_Annotations_vGL-DPPD-7110-A/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A)
+
+### Added
+
+- Added software:
+ - AnnotationForge version 1.46.0
+ - biomaRt version 2.60.1
+ - GO.db version 3.19.1
+- Added support for:
+ - Bacillus subtilis, subsp. subtilis 168
+ - Brachypodium distachyon
+ - Escherichia coli, str. K-12 substr. MG1655
+ - Oryzias latipes
+ - Lactobacillus acidophilus NCFM
+ - Mycobacterium marinum M
+ - Oryza sativa Japonica
+ - Pseudomonas aeruginosa UCBPP-PA14
+ - Salmonella enterica subsp. enterica serovar Typhimurium str. LT2
+ - Serratia liquefaciens ATCC 27592
+ - Staphylococcus aureus MRSA252
+ - Streptococcus mutans UA159
+ - Vibrio fischeri ES114
+- Added AnnotationForge helper script install-org-db.R to create
+organism-specific annotation packages (org.*.eg.db) in R if not available on
+Bioconductor. Used for:
+ - Bacillus subtilis, subsp. subtilis 168
+ - Brachypodium distachyon
+ - Escherichia coli, str. K-12 substr. MG1655
+ - Oryzias latipes
+ - Salmonella enterica subsp. enterica serovar Typhimurium str. LT2
+- Added NCBI as a source for FASTA and GTF files
+
+### Fixed
+
+- Fixed processing for ECOLI
+
+### Changed
+
+- Updated Ensembl versions:
+ - Animals: Ensembl release 112
+ - Plants: Ensembl plants release 59
+ - Bacteria: Ensembl bacteria release 59
+- Updated software:
+ - tidyverse version updated from 1.3.2 to 2.0.0
+ - STRINGdb version updated from 2.8.4 to 2.16.4
+ - PANTHER.db version updated from 1.0.11 to 1.0.12
+ - rtracklayer version updated from 1.56.1 to 1.64.0
+ - Bioconductor version updated from 3.15.1 to 3.19
+- Removed org.EcK12.eg.db and replaced it with a locally created annotations
+database, as it is no longer available on Bioconductor
+- Changed the first argument of GL-DPPD-7110-A_build-genome-annots-tab.R from
+the 'name' column value to the 'species' column value (e.g., 'Mus musculus' instead of 'MOUSE')
+
+
+## [1.0.0](https://github.com/nasa/GeneLab_Data_Processing/releases/tag/GL_RefAnnotTable_1.0.0)
diff --git a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md
new file mode 100644
index 00000000..44e5de46
--- /dev/null
+++ b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md
@@ -0,0 +1,256 @@
+# GL_RefAnnotTable-A Workflow Information and Usage Instructions
+
+## Table of Contents
+
+- [General Workflow Information](#general-workflow-information)
+- [Utilizing the Workflow](#utilizing-the-workflow)
+ - [1. Download the Workflow Files](#1-download-the-workflow-files)
+ - [2. Run the Workflow](#2-run-the-workflow)
+ - [Approach 1: Using Singularity](#approach-1-using-singularity)
+ - [Step 1: Install Singularity](#step-1-install-singularity)
+ - [Step 2: Fetch the Singularity Image](#step-2-fetch-the-singularity-image)
+ - [Step 3: Run the Workflow](#step-3-run-the-workflow)
+ - [Approach 2: Using a Local R Environment](#approach-2-using-a-local-r-environment)
+ - [Step 1: Install R and Required R Packages](#step-1-install-r-and-required-r-packages)
+ - [Step 2: Run the Workflow](#step-2-run-the-workflow)
+ - [Workflow Input/Output Data](#workflow-inputoutput-data)
+ - [3. Run the Annotations Database Creation Function as a Stand-Alone Script](#3-run-the-annotations-database-creation-function-as-a-stand-alone-script)
+ - [Using Singularity](#using-singularity)
+ - [Using a Local R Environment](#using-a-local-r-environment)
+
+
+
+---
+
+## General Workflow Information
+
+The current GeneLab Reference Annotation Table (GL_RefAnnotTable-A) pipeline is implemented as an R workflow that can be run from a command line interface (CLI) using bash. The workflow can be executed using either a Singularity container or a local R environment. The workflow can be used even if you are unfamiliar with R, but if you want to learn more about R, visit the [R-project about page here](https://www.r-project.org/about.html). Additionally, an introduction to R along with installation help and information about using R for bioinformatics can be found [here at Happy Belly Bioinformatics](https://astrobiomike.github.io/R/basics).
+
+
+
+---
+
+## Utilizing the Workflow
+
+### 1. Download the Workflow Files
+
+Download the latest version of the GL_RefAnnotTable-A workflow:
+
+```bash
+curl -LO https://github.com/nasa/GeneLab_Data_Processing/releases/download/GL_RefAnnotTable-A_1.1.0/GL_RefAnnotTable-A_1.1.0.zip
+unzip GL_RefAnnotTable-A_1.1.0.zip
+```
+
+
+
+---
+
+### 2. Run the Workflow
+
+The GL_RefAnnotTable-A workflow can be run using one of two approaches:
+
+- **[Approach 1: Using Singularity](#approach-1-using-singularity)**
+- **[Approach 2: Using a Local R Environment](#approach-2-using-a-local-r-environment)**
+
+Please follow the instructions for the approach that best matches your setup and preferences. Each method is explained in detail below.
+
+> **Note**: If you encounter timeout errors, you can increase the default timeout (3600 seconds) by modifying the `options(timeout=3600)` line at the top of the `GL-DPPD-7110-A_build-genome-annots-tab.R` script.
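+
+As a brief sketch of what that setting controls, the base R calls below inspect and raise the download timeout for the current R session. The value 7200 is only an illustrative choice, not a recommendation from the workflow.
+
+```R
+# Show the timeout (in seconds) currently in effect for downloads
+getOption("timeout")
+
+# Raise the limit to two hours; 7200 is an illustrative value
+options(timeout = 7200)
+
+# Confirm the new setting
+getOption("timeout")
+```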
+
+
+
+---
+
+### Approach 1: Using Singularity
+
+This approach allows you to run the workflow within a containerized environment, ensuring consistency and reproducibility.
+
+
+
+#### Step 1: Install Singularity
+
+Singularity is a containerization platform for running applications portably and reproducibly. We use container images hosted on Quay.io to encapsulate all the necessary software and dependencies required by the GL_RefAnnotTable-A workflow. This setup allows you to run the workflow without installing any software directly on your system.
+
+> **Note**: Other containerization tools like Docker or Apptainer can also be used to pull and run these images.
+
+
+We recommend installing Singularity system-wide as per the official [Singularity installation documentation](https://docs.sylabs.io/guides/3.10/admin-guide/admin_quickstart.html).
+
+
+> **Note**: Singularity is also available through [Anaconda](https://anaconda.org/conda-forge/singularity), but the system-wide installation described above is the recommended route.
+
+
+
+#### Step 2: Fetch the Singularity Image
+
+To pull the Singularity image needed for the workflow, you can use the provided script as directed below or pull the image directly.
+
+> **Note**: This command should be run in the location containing the `GL_RefAnnotTable-A_1.1.0` directory that was downloaded in [step 1](#1-download-the-workflow-files). Depending on your network speed, fetching the images will take approximately 20 minutes.
+
+```bash
+bash GL_RefAnnotTable-A_1.1.0/bin/prepull_singularity.sh GL_RefAnnotTable-A_1.1.0/config/software/by_docker_image.config
+```
+
+Once complete, a `singularity` folder containing the Singularity images will be created. Next, set up the required environment variables:
+
+```bash
+# Set R library path to an R_libs directory in the current working directory
+export R_LIBS_USER="$(pwd)/R_libs"
+# Create the specified R library path if it doesn't already exist
+mkdir -p "$R_LIBS_USER"
+
+# Set Singularity cache directory
+export SINGULARITY_CACHEDIR="$(pwd)/singularity"
+```
+
+
+
+#### Step 3: Run the Workflow
+
+> **Note**: The annotation database creation process requires FTP access through port 21. If you encounter connection issues, please verify that port 21 is not blocked by your network/firewall settings or try running the workflow on a system with unrestricted FTP access.
+
+While in the directory containing the `GL_RefAnnotTable-A_1.1.0` folder that was downloaded in [step 1](#1-download-the-workflow-files), you can now run the workflow. Below is an example for generating the annotation table for *Mus musculus* (mouse):
+
+```bash
+singularity exec \
+ --bind $(pwd):$(pwd) \
+ $SINGULARITY_CACHEDIR/quay.io-nasa_genelab-gl-refannottable-a-1.1.0.img \
+ Rscript GL_RefAnnotTable-A_1.1.0/GL-DPPD-7110-A_build-genome-annots-tab.R 'Mus musculus'
+```
+
+
+
+---
+
+### Approach 2: Using a Local R Environment
+
+This approach allows you to run the workflow directly in your local R environment without using containers.
+
+
+
+#### Step 1: Install R and Required R Packages
+
+We recommend installing R via the [Comprehensive R Archive Network (CRAN)](https://cran.r-project.org/):
+
+1. Select the [CRAN Mirror](https://cran.r-project.org/mirrors.html) closest to your location.
+
+2. Navigate to the download page for your operating system.
+
+3. Download and install R (e.g., R-4.4.0).
+
+Once R is installed, install the required R packages as follows:
+
+Open a terminal and start R:
+
+
+```bash
+R
+```
+
+
+Within the R environment, run the following commands to install the required packages:
+
+
+```R
+install.packages("tidyverse")
+install.packages("BiocManager")
+BiocManager::install("STRINGdb")
+BiocManager::install("PANTHER.db")
+BiocManager::install("rtracklayer")
+BiocManager::install("AnnotationForge")
+BiocManager::install("biomaRt")
+BiocManager::install("GO.db")
+```
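+
+Once the installation finishes, a quick check along the following lines can confirm that every package is available. This is a minimal sketch; the `required_pkgs` vector simply repeats the package names from the commands above.
+
+```R
+# Packages installed in the commands above
+required_pkgs <- c("tidyverse", "STRINGdb", "PANTHER.db", "rtracklayer",
+                   "AnnotationForge", "biomaRt", "GO.db")
+
+# TRUE for each package whose namespace can be loaded
+installed <- vapply(required_pkgs, requireNamespace, logical(1), quietly = TRUE)
+
+# Names of any packages that are still missing (empty if all are installed)
+names(installed)[!installed]
+```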
+
+
+
+#### Step 2: Run the Workflow
+
+While in the directory containing the `GL_RefAnnotTable-A_1.1.0` folder that was downloaded in [step 1](#1-download-the-workflow-files), you can now run the workflow. Below is an example of how to run the workflow to build an annotation table for *Mus musculus* (mouse):
+
+
+```bash
+Rscript GL_RefAnnotTable-A_1.1.0/GL-DPPD-7110-A_build-genome-annots-tab.R 'Mus musculus'
+```
+
+
+
+---
+
+### Workflow Input/Output Data
+
+The input and output data are the same for both [Approach 1: Using Singularity](#approach-1-using-singularity) and [Approach 2: Using a Local R Environment](#approach-2-using-a-local-r-environment).
+
+
+
+**Input data:**
+
+- No input files are required. Specify the species name of the target organism using a positional command line argument. `Mus musculus` is used in both the Singularity and the local R environment examples above.
+ > **Notes**:
+ > - To see a list of all available organisms, run `Rscript GL-DPPD-7110-A_build-genome-annots-tab.R` without positional arguments.
+ > - The correct argument for each organism can also be found in the 'species' column of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) file.
+
+- *Optional*: A local reference table CSV file can be supplied as a second positional argument. If not provided, the script will download the current version of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) table by default.
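+
+As a small illustration of looking up the species argument in the reference table, the sketch below parses a three-row, three-column excerpt of the CSV inline. The excerpt is a hypothetical stand-in for the full file, which has more columns and rows.
+
+```R
+# Inline excerpt of the reference table (name, species and taxon columns only)
+ref_csv <- "name,species,taxon
+MOUSE,Mus musculus,10090
+BACSU,Bacillus subtilis,224308
+,Vibrio fischeri,312309"
+
+ref_table <- read.csv(text = ref_csv, stringsAsFactors = FALSE)
+
+# Values accepted as the first positional argument
+ref_table$species
+
+# Row for a single target organism
+ref_table[ref_table$species == "Mus musculus", ]
+```
+
+Replacing `text = ref_csv` with the path to a local copy of the CSV should give the full list of supported species.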
+
+
+**Output data:**
+
+- *-GL-annotations.tsv (Tab-delimited table of gene annotations)
+
+- *-GL-build-info.txt (Text file containing information used to create the annotation table, including tools, tool versions, and date of creation)
+
+
+
+---
+
+### 3. Run the Annotations Database Creation Function as a Stand-Alone Script
+
+If the reference table does not specify an annotations database for the target organism in the 'annotations' column of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) file, the `install_annotations` function (defined in `install-org-db.R`) will be executed by default. This function can also be run as a stand-alone script:
+
+> **Note**: If you encounter timeout errors, you can increase the default timeout (3600 seconds) by modifying the `options(timeout=3600)` line at the top of the `install-org-db.R` script.
+
+
+
+#### Using Singularity
+
+> **Note**: The annotation database creation process requires FTP access through port 21. If you encounter connection issues, please verify that port 21 is not blocked by your network/firewall settings.
+
+```bash
+# Set R library path if not already set
+export R_LIBS_USER=$(pwd)/R_libs
+# Create the specified R library path if it doesn't already exist
+mkdir -p $R_LIBS_USER
+
+# Set Singularity cache directory if not already set
+export SINGULARITY_CACHEDIR=$(pwd)/singularity
+
+singularity exec \
+ --bind $(pwd):$(pwd) \
+ $SINGULARITY_CACHEDIR/quay.io-nasa_genelab-gl-refannottable-a-1.1.0.img \
+ Rscript GL_RefAnnotTable-A_1.1.0/install-org-db.R 'Bacillus subtilis'
+```
+
+
+
+#### Using a Local R Environment
+
+```bash
+Rscript GL_RefAnnotTable-A_1.1.0/install-org-db.R 'Bacillus subtilis'
+```
+
+
+
+**Input data:**
+
+- The species name of the target organism must be specified as the first positional command line argument. `Bacillus subtilis` is used in both the Singularity and local R examples above.
+ > **Note**: The correct argument for each organism can also be found in the 'species' column of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) file.
+
+- *Optional*: A local reference table CSV file can be supplied as a second positional argument. If not provided, the script will download the current version of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) table by default.
+
+
+**Output data:**
+
+- org.*.eg.db/ (Species-specific annotation database, as a local R package)
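+
+One way to confirm the build is sketched below. It assumes the package folder was written to the current directory and installed into the library given by `R_LIBS_USER`, as in the examples above. It lists the `org.*.eg.db` folders and tries to load each one in R. `find_local_org_db` is a hypothetical helper name, not part of the workflow.
+
+```bash
+# Hypothetical helper: print the names of org.*.eg.db package folders in a directory
+find_local_org_db() {
+    local dir="${1:-.}" d
+    for d in "$dir"/org.*.eg.db; do
+        if [ -d "$d" ]; then
+            basename "$d"
+        fi
+    done
+}
+
+# Try to load each package found in the current directory (requires Rscript on the PATH)
+for pkg in $(find_local_org_db .); do
+    Rscript -e "library('$pkg', character.only = TRUE)" && echo "Loaded $pkg"
+done
+```
+
+When working with Singularity, the `Rscript` call would presumably need to go through `singularity exec` with the same image and bind options used in the example above.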
+
+
+
+---
diff --git a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/GL-DPPD-7110-A_build-genome-annots-tab.R b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/GL-DPPD-7110-A_build-genome-annots-tab.R
new file mode 100644
index 00000000..46923217
--- /dev/null
+++ b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/GL-DPPD-7110-A_build-genome-annots-tab.R
@@ -0,0 +1,556 @@
+#!/usr/bin/env Rscript
+# Written by Mike Lee
+# GeneLab script for generating organism-specific gene annotation tables
+# Example usage: Rscript GL-DPPD-7110-A_build-genome-annots-tab.R 'Mus musculus'
+options(timeout = 3600)
+.libPaths(Sys.getenv("R_LIBS_USER"))
+# Define variables associated with current pipeline and annotation table versions
+GL_DPPD_ID <- "GL-DPPD-7110-A"
+workflow_version <- "GL_RefAnnotTable-A_1.1.0"
+ref_tab_path <- "https://raw.githubusercontent.com/nasa/GeneLab_Data_Processing/master/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv"
+readme_path <- "https://github.com/nasa/GeneLab_Data_Processing/tree/master/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md"
+
+# List currently supported organisms
+currently_accepted_orgs <- c("Arabidopsis thaliana", "Bacillus subtilis", "Brachypodium distachyon",
+ "Caenorhabditis elegans", "Danio rerio", "Drosophila melanogaster",
+ "Escherichia coli", "Homo sapiens", "Lactobacillus acidophilus",
+ "Mus musculus", "Mycobacterium marinum", "Oryza sativa",
+ "Oryzias latipes", "Pseudomonas aeruginosa", "Rattus norvegicus",
+ "Saccharomyces cerevisiae", "Salmonella enterica", "Serratia liquefaciens",
+ "Staphylococcus aureus", "Streptococcus mutans", "Vibrio fischeri")
+
+
+#########################################################################
+############### Pull in and check command line arguments ################
+#########################################################################
+
+# Pull in command-line arguments
+args <- commandArgs(trailingOnly = TRUE)
+
+# Get the target organism (CLI argument 1) and check that it is listed in currently_accepted_orgs
+validate_arguments <- function(args, supported_orgs) {
+ if (length(args) < 1) {
+ stop("One positional argument is required that specifies the target organism. Available options are:\n", paste(supported_orgs, collapse = ", "))
+ }
+
+ # Convert the first argument to uppercase
+ target_organism <- toupper(args[1])
+
+ # Check if the uppercased target organism is in the uppercased supported_orgs
+ if (!target_organism %in% sapply(supported_orgs, toupper)) {
+ stop(paste0("'", target_organism, "' is not currently supported."))
+ }
+
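+ return(supported_orgs[toupper(supported_orgs) == target_organism]) # Return the canonical spelling so later exact matches succeed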
+}
+
+target_organism <- validate_arguments(args, currently_accepted_orgs)
+
+# If provided, get the reference table URL from CLI arguments (CLI argument 2) and update ref_tab_path
+ref_tab_path <- if (length(args) >= 2) args[2] else ref_tab_path
+
+
+#########################################################################
+######################## Set up environment #############################
+#########################################################################
+
+required_packages <- c("tidyverse", "STRINGdb", "PANTHER.db", "rtracklayer")
+# Check for required packages other than the org-specific db #
+report_package_needed <- function(package_name) {
+ cat(paste0("\n The package '", package_name, "' is required. Please see:\n"))
+ cat(" https://github.com/nasa/GeneLab_Data_Processing/tree/master/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md\n\n")
+ quit()
+}
+
+# Check and report missing packages other than the org-specific db
+for (pkg in required_packages) {
+ if (!requireNamespace(pkg, quietly = TRUE)) {
+ report_package_needed(pkg)
+ }
+}
+
+# Import libraries
+library(tidyverse)
+library(STRINGdb)
+library(PANTHER.db)
+library(rtracklayer)
+
+
+#########################################################################
+############## Define variables and output file names ###################
+#########################################################################
+
+ref_table <- tryCatch(
+ read.csv(ref_tab_path),
+ error = function(e) {
+ message <- paste("Error: Unable to read the reference table from the path provided. Please check the path and try again.\nPath:", ref_tab_path)
+ stop(message)
+ }
+)
+
+# Get target organism information
+target_info <- ref_table %>%
+ filter(species == target_organism)
+
+# Extract the relevant columns from the reference table
+target_taxid <- target_info$taxon # Taxonomic identifier
+target_org_db <- target_info$bioconductor_annotations # org.eg.db R package
+gtf_link <- target_info$gtf # Path to reference assembly GTF
+target_short_name <- target_info$name # PANTHER / UNIPROT short name; blank if not available
+ref_source <- target_info$ref_source # Reference files source
+
+# Error handling for missing values
+if (is.na(target_taxid) || is.na(target_org_db) || is.na(target_organism) || is.na(gtf_link)) {
+ stop(paste("Error: Missing data for target organism", target_organism, "in reference table."))
+}
+
+# Create output filenames
+base_gtf_filename <- basename(gtf_link)
+base_output_name <- str_replace(base_gtf_filename, ".gtf.gz", "")
+
+# Add the species name to base_output_name if the reference source is not ENSEMBL
+if (!(ref_source %in% c("ensembl_plants", "ensembl_bacteria", "ensembl"))) {
+ base_output_name <- paste(str_replace(target_organism, " ", "_"), base_output_name, sep = "_")
+}
+
+out_table_filename <- paste0(base_output_name, "-GL-annotations.tsv")
+out_log_filename <- paste0(base_output_name, "-GL-build-info.txt")
+
+# Check if output file already exists and if it does, exit without overwriting
+if ( file.exists(out_table_filename) ) {
+ cat("\n-------------------------------------------------------------------------------------------------\n")
+ cat(paste0("\n The file that would be created, '", out_table_filename, "', exists already.\n"))
+ cat(paste0(" We don't want to overwrite it accidentally. Move it and run this again if wanting to proceed.\n"))
+ cat("\n-------------------------------------------------------------------------------------------------\n")
+ quit()
+}
+
+
+#############################################
+######## Load annotation databases #########
+#############################################
+
+####### GTF ##########
+
+# Create the GTF dataframe from its path; unique gene IDs in the reference assembly are under 'gene_id'
+GTF <- rtracklayer::import(gtf_link)
+GTF <- data.frame(GTF)
+
+###### org.db ########
+
+# Define a function to load the specified org.db package for a given target organism
+install_and_load_org_db <- function(target_organism, target_org_db, ref_tab_path) {
+ # Candidate folders, relative to the working directory, that may contain install-org-db.R ("." covers running from the script's own folder)
+ ## The path of the executing R script is not looked up here, so the candidate folders are searched in order
+ possible_folders <- c("workflow_code", workflow_version, ".")
+
+ # Get the current working directory and attempt to locate the correct folder
+ script_dir <- getwd()
+
+ install_script_path <- NULL
+
+ for (folder in possible_folders) {
+ potential_path <- file.path(script_dir, folder, "install-org-db.R")
+ if (file.exists(potential_path)) {
+ install_script_path <- potential_path
+ break
+ }
+ }
+
+ # If the install script path was not found, stop with an error
+ if (is.null(install_script_path)) {
+ stop("Cannot find 'install-org-db.R' in the expected folders: 'workflow_code', 'GL_RefAnnotTable-A_1.1.0', or the current working directory")
+ }
+
+ # If target_org_db is provided, try to install it from Bioconductor
+ if (!is.na(target_org_db) && target_org_db != "") {
+ BiocManager::install(target_org_db, ask = FALSE)
+
+ # Check if the package was successfully installed
+ if (!requireNamespace(target_org_db, quietly = TRUE)) {
+ # Source the install script to create the database locally
+ source(install_script_path)
+ target_org_db <- install_annotations(target_organism, ref_tab_path)
+ }
+ } else {
+ # If target_org_db is NA or empty, create it locally using the helper script
+ source(install_script_path)
+ target_org_db <- install_annotations(target_organism, ref_tab_path)
+ }
+
+ # Load the package into the R session
+ library(target_org_db, character.only = TRUE)
+
+ # Return the target_org_db name
+ return(target_org_db)
+}
+
+# Define list of supported organisms which do not use annotations from an org.db
+no_org_db <- c("Lactobacillus acidophilus", "Mycobacterium marinum", "Oryza sativa", "Pseudomonas aeruginosa",
+ "Serratia liquefaciens", "Staphylococcus aureus", "Streptococcus mutans", "Vibrio fischeri")
+
+# Run the function unless the target_organism is in no_org_db and update target_org_db with the result
+if (!(target_organism %in% no_org_db) && (target_organism %in% currently_accepted_orgs)) {
+ target_org_db <- install_and_load_org_db(target_organism, target_org_db, ref_tab_path)
+}
+
+############################################
+######## Build annotation table ############
+############################################
+
+# Initialize table from GTF
+
+# Define GTF keys based on the target organism; gene_id contains unique gene IDs in the reference assembly. Defaults to ENSEMBL
+
+gtf_keytype_mappings <- list(
+ "Arabidopsis thaliana" = c(gene_id = "TAIR"),
+ "Bacillus subtilis" = c(gene_id = "ENSEMBL", gene_name = "SYMBOL"),
+ "Brachypodium distachyon" = c(gene_id = "ENSEMBL", transcript_id = "ACCNUM"),
+ "Caenorhabditis elegans" = c(gene_id = "ENSEMBL"),
+ "Escherichia coli" = c(gene_id = "ENSEMBL", gene_name = "SYMBOL"),
+ "Lactobacillus acidophilus" = c(gene_id = "LOCUS", old_locus_tag = "OLD_LOCUS", gene = "SYMBOL", product = "GENENAME", Ontology_term = "GO"),
+ "Mycobacterium marinum" = c(gene_id = "LOCUS", old_locus_tag = "OLD_LOCUS", gene = "SYMBOL", product = "GENENAME", Ontology_term = "GO"),
+ "Pseudomonas aeruginosa" = c(gene_id = "LOCUS", gene = "SYMBOL", product = "GENENAME", Ontology_term = "GO"),
+ "Salmonella enterica" = c(gene_id = "ENSEMBL", db_xref = "ENTREZID"),
+ "Serratia liquefaciens" = c(gene_id = "LOCUS", old_locus_tag = "OLD_LOCUS", gene = "SYMBOL", product = "GENENAME", Ontology_term = "GO"),
+ "Staphylococcus aureus" = c(gene_id = "LOCUS", gene = "SYMBOL", product = "GENENAME", Ontology_term = "GO"),
+ "Streptococcus mutans" = c(gene_id = "LOCUS", old_locus_tag = "OLD_LOCUS", gene = "SYMBOL", product = "GENENAME", Ontology_term = "GO"),
+ "Vibrio fischeri" = c(gene_id = "LOCUS", old_locus_tag = "OLD_LOCUS", gene = "SYMBOL", product = "GENENAME", Ontology_term = "GO"),
+ "default" = c(gene_id = "ENSEMBL")
+)
+
+# Get the key types for the target organism or use the default
+wanted_gtf_keytypes <- if (!is.null(gtf_keytype_mappings[[target_organism]])) {
+ gtf_keytype_mappings[[target_organism]]
+} else {
+ c(gene_id = "ENSEMBL")
+}
+
+# Initialize the annotation table from the GTF, keeping only the wanted_gtf_keytypes
+annot_gtf <- GTF[, names(wanted_gtf_keytypes), drop = FALSE]
+annot_gtf <- annot_gtf %>% distinct()
+
+# Rename the columns in the annot_gtf dataframe according to the key types
+colnames(annot_gtf) <- wanted_gtf_keytypes
+
+# Save the name of the primary key type (gene_id) being used
+primary_keytype <- wanted_gtf_keytypes[1]
+
+# Filter out unwanted genes from the GTF
+
+# Define filtering criteria for specific organisms
+filter_criteria <- list(
+ "Bacillus subtilis" = "^BSU",
+ "Drosophila melanogaster" = "^RR",
+ "Saccharomyces cerevisiae" = "^Y[A-Z0-9]{6}-?[A-Z]?$",
+ "Escherichia coli" = "^b[0-9]{4}$"
+)
+
+# Apply the filter if there's a specific criterion for the target organism
+filter_pattern <- filter_criteria[[target_organism]]
+
+if (!is.null(filter_pattern)) {
+ if (target_organism == "Drosophila melanogaster") {
+ annot_gtf <- annot_gtf %>% filter(!grepl(filter_pattern, !!sym(primary_keytype)))
+ } else {
+ annot_gtf <- annot_gtf %>% filter(grepl(filter_pattern, !!sym(primary_keytype)))
+ }
+}
+
+# Remove "GeneID:" labels on ENTREZ IDs
+if (target_organism == "Salmonella enterica") {
+ annot_gtf <- annot_gtf %>% dplyr::mutate(ENTREZID = gsub("^GeneID:", "", ENTREZID)) %>% as.data.frame
+}
+
+#########################################################################
+########################### Add org.db keys #############################
+#########################################################################
+
+annot_orgdb <- annot_gtf
+
+# Define the initial keys to pull from the organism-specific database
+orgdb_keytypes_list <- list(
+ "Brachypodium distachyon" = c("GENENAME", "REFSEQ", "ENTREZID"),
+ "Escherichia coli" = c("GENENAME", "REFSEQ", "ENTREZID"),
+ "Caenorhabditis elegans" = c("SYMBOL", "GENENAME", "REFSEQ", "ENTREZID", "GO"),
+ "Salmonella enterica" = c("SYMBOL", "GENENAME", "REFSEQ"),
+ "Saccharomyces cerevisiae" = c("GENENAME", "ALIAS", "REFSEQ", "ENTREZID"),
+ "default" = c("SYMBOL", "GENENAME", "REFSEQ", "ENTREZID")
+)
+
+# Add entries for organisms in no_org_db as character(0) (no keys wanted from the org.db)
+for (organism in no_org_db) {
+ orgdb_keytypes_list[[organism]] <- character(0)
+}
+
+wanted_org_db_keytypes <- if (target_organism %in% names(orgdb_keytypes_list)) {
+ orgdb_keytypes_list[[target_organism]]
+} else {
+ orgdb_keytypes_list[["default"]]
+}
+
+# Define mappings for query and keytype based on target organism
+orgdb_keytype_mappings <- list(
+ "Bacillus subtilis" = list(query = "SYMBOL", keytype = "SYMBOL"),
+ "Brachypodium distachyon" = list(query = "ACCNUM", keytype = "ACCNUM"),
+ "Caenorhabditis elegans" = list(query = primary_keytype, keytype = "ENSEMBL"),
+ "Escherichia coli" = list(query = "SYMBOL", keytype = "SYMBOL"),
+ "Salmonella enterica" = list(query = "ENTREZID", keytype = "ENTREZID"),
+ "default" = list(query = primary_keytype, keytype = primary_keytype)
+)
+
+# Define the orgdb_query, this is the key type that will be used to map to the org.db
+orgdb_query <- if (!is.null(orgdb_keytype_mappings[[target_organism]])) {
+ orgdb_keytype_mappings[[target_organism]][["query"]]
+} else {
+ orgdb_keytype_mappings[["default"]][["query"]]
+}
+
+# Define the orgdb_keytype, this is the name of the key type in the org.db
+orgdb_keytype <- if (!is.null(orgdb_keytype_mappings[[target_organism]])) {
+ orgdb_keytype_mappings[[target_organism]][["keytype"]]
+} else {
+ orgdb_keytype_mappings[["default"]][["keytype"]]
+}
+
+# Function to remove version numbers from ACCNUM keys and match them for BRADI
+match_accnum <- function(annot_table, org_db, query_col, keytype_col, target_column) {
+ # Remove version numbers from the ACCNUM keys in the GTF annotations
+ cleaned_annot_keys <- sub("\\..*", "", annot_table[[query_col]])
+
+ # Retrieve and remove version numbers from the org.db keys
+ orgdb_keys <- keys(org_db, keytype = keytype_col)
+ cleaned_orgdb_keys <- sub("\\..*", "", orgdb_keys)
+
+ # Create a lookup table for matching cleaned keys to original keys
+ lookup_table <- setNames(orgdb_keys, cleaned_orgdb_keys)
+
+ # Match cleaned GTF keys to original org.db keys
+ matched_keys <- lookup_table[cleaned_annot_keys]
+
+ # Use the matched keys to retrieve the target annotations from org.db
+ mapIds(org_db, keys = matched_keys, keytype = keytype_col, column = target_column, multiVals = "list")
+}
+
+# Loop through the desired key types and add annotations to the GTF table
+for (keytype in wanted_org_db_keytypes) {
+ # Check if keytype is a valid column in the target org.db
+ if (keytype %in% columns(get(target_org_db, envir = .GlobalEnv))) {
+ if (target_organism == "Brachypodium distachyon" && orgdb_query == "ACCNUM") {
+ # For BRADI: use the match_accnum function to map to org.db ACCNUM entries
+ org_matches <- match_accnum(annot_orgdb, get(target_org_db, envir = .GlobalEnv), query_col = orgdb_query, keytype_col = orgdb_keytype, target_column = keytype)
+ } else {
+ # Default mapping for other organisms
+ org_matches <- mapIds(get(target_org_db, envir = .GlobalEnv), keys = annot_orgdb[[orgdb_query]], keytype = orgdb_keytype, column = keytype, multiVals = "list")
+ }
+ # Add the mapped annotations to the GTF table
+ annot_orgdb[[keytype]] <- sapply(org_matches, function(x) paste(x, collapse = "|"))
+ } else {
+ # Set column to NA if keytype is not present in org.db
+ annot_orgdb[[keytype]] <- NA
+ }
+}
+
+# For SALTY, reorder columns to match other tables
+if (target_organism == "Salmonella enterica") { # Reorder columns to match others; was mismatched since ENTREZ came from GTF
+ annot_orgdb <- annot_orgdb[, c("ENSEMBL", "SYMBOL", "GENENAME", "REFSEQ", "ENTREZID")]
+}
+
+# For YEAST, rename GENENAME to SYMBOL and ALIAS to GENENAME
+if (target_organism == "Saccharomyces cerevisiae") {
+ colnames(annot_orgdb) <- c("ENSEMBL", "SYMBOL", "GENENAME", "REFSEQ", "ENTREZID")
+}
+
+#########################################################################
+########################### Add STRING IDs ##############################
+#########################################################################
+
+# Define organisms that do not use STRING annotations
+no_stringdb <- c("Pseudomonas aeruginosa", "Staphylococcus aureus")
+
+# Define the key type used for mapping to STRING
+stringdb_query_list <- list(
+ "Lactobacillus acidophilus" = "OLD_LOCUS",
+ "Mycobacterium marinum" = "OLD_LOCUS",
+ "Serratia liquefaciens" = "OLD_LOCUS",
+ "Streptococcus mutans" = "OLD_LOCUS",
+ "Vibrio fischeri" = "OLD_LOCUS",
+ "default" = primary_keytype
+)
+
+# Define the key type for mapping in STRING, using the default if necessary
+stringdb_query <- if (!is.null(stringdb_query_list[[target_organism]])) {
+ stringdb_query_list[[target_organism]]
+} else {
+ stringdb_query_list[["default"]]
+}
+
+# Handle organisms which do not use the GTF's gene_id keys to map to STRING.
+# These are microbial species for which NCBI references were used rather than ENSEMBL;
+# their STRING accessions match the GTF's old_locus_tag keys, not the gene_id keys.
+uses_old_locus <- c("Lactobacillus acidophilus", "Mycobacterium marinum", "Serratia liquefaciens", "Streptococcus mutans", "Vibrio fischeri")
+# Handle STRING annotation processing based on the target organism
+if (target_organism %in% uses_old_locus) {
+ # If the target organism is listed in uses_old_locus, split comma-separated OLD_LOCUS entries into separate rows
+ annot_stringdb <- annot_orgdb %>%
+ separate_rows(!!sym(stringdb_query), sep = ",", convert = TRUE) %>%
+ distinct() %>%
+ as.data.frame()
+} else {
+ # For other organisms, collapse on the primary key
+ annot_stringdb <- annot_orgdb %>% distinct()
+ annot_stringdb <- annot_stringdb %>%
+ group_by(!!sym(primary_keytype)) %>%
+ summarise(across(everything(), ~paste(unique(na.omit(.))[unique(na.omit(.)) != ""], collapse = "|")), .groups = 'drop') %>%
+ as.data.frame()
+}
+
+# Replace "BSU_" with "BSU" in the primary_keytype column for BACSU before STRING mapping
+if (target_organism == "Bacillus subtilis") {
+ annot_stringdb[[stringdb_query]] <- gsub("^BSU_", "BSU", annot_stringdb[[stringdb_query]])
+}
+
+# Map alternative taxonomy IDs for organisms not directly supported by STRING
+taxid_map <- list(
+ "Saccharomyces cerevisiae" = 4932,
+ "Brassica rapa" = 51351,
+ "Serratia liquefaciens" = 614
+)
+
+# Assign the alternative taxonomy identifier if applicable
+target_taxid <- if (!is.null(taxid_map[[target_organism]])) {
+ taxid_map[[target_organism]]
+} else {
+ target_taxid
+}
+
+# Initialize string_map
+string_map <- NULL
+
+# If the target organism is supported by STRING, get STRING annotations
+if (!(target_organism %in% no_stringdb)) {
+ string_db <- STRINGdb$new(version = "12.0", species = target_taxid, score_threshold = 0)
+ string_map <- string_db$map(annot_stringdb, stringdb_query, removeUnmappedRows = FALSE, takeFirst = FALSE)
+}
+if (!is.null(string_map)) {
+ annot_stringdb <- annot_stringdb %>%
+ group_by(!!sym(primary_keytype)) %>%
+ summarise(across(everything(), ~paste(unique(na.omit(.))[unique(na.omit(.)) != ""], collapse = "|")), .groups = 'drop')
+
+ string_map <- string_map %>%
+ group_by(!!sym(primary_keytype)) %>%
+ summarise(across(everything(), ~paste(unique(na.omit(.))[unique(na.omit(.)) != ""], collapse = "|")), .groups = 'drop')
+}
+
+if (!is.null(string_map)) {
+ # Determine the appropriate join key
+ join_key <- if (target_organism %in% c("Lactobacillus acidophilus", "Mycobacterium marinum", "Serratia liquefaciens", "Streptococcus mutans", "Vibrio fischeri")) {
+ primary_keytype
+ } else {
+ stringdb_query
+ }
+
+ # Add temporary column to add string IDs to annotation table
+ annot_stringdb <- annot_stringdb %>%
+ mutate(join_key = toupper(!!sym(join_key)))
+
+ string_map <- string_map %>%
+ mutate(join_key = toupper(!!sym(join_key)))
+
+ # Join STRING IDs to the annotation table
+ annot_stringdb <- left_join(annot_stringdb, string_map %>% dplyr::select(join_key, STRING_id), by = "join_key") %>%
+ dplyr::select(-join_key)
+}
+
+# Undo the "BSU_" to "BSU" replacement for BACSU after STRING mapping
+if (target_organism == "Bacillus subtilis") {
+ annot_stringdb[[stringdb_query]] <- gsub("^BSU", "BSU_", annot_stringdb[[stringdb_query]])
+}
+
+annot_stringdb <- as.data.frame(annot_stringdb)
+
+#########################################################################
+################ Add Gene Ontology (GO) slim IDs ########################
+#########################################################################
+
+# Define organisms that do not use PANTHER annotations
+no_panther_db <- c("Caenorhabditis elegans", "Mycobacterium marinum", "Oryza sativa", "Staphylococcus aureus", "Lactobacillus acidophilus", "Serratia liquefaciens", "Streptococcus mutans", "Vibrio fischeri", "Pseudomonas aeruginosa")
+
+annot_pantherdb <- annot_stringdb
+
+if (!(target_organism %in% no_panther_db)) {
+
+ # Define the key type in the annotation table used to map to PANTHER DB
+ pantherdb_query = "ENTREZID"
+ pantherdb_keytype = "ENTREZ"
+
+ # Retrieve target organism PANTHER GO slim annotations database using the UNIPROT / PANTHER short name
+ pthOrganisms(PANTHER.db) <- target_short_name
+
+ # Define a function to retrieve GO slim IDs for a given gene's ENTREZIDs, which may include entries separated by a "|"
+ get_go_slim_ids <- function(entrez_id) {
+ if (is.na(entrez_id) || entrez_id == "NA") {
+ return("NA")
+ }
+
+ entrez_ids <- unlist(strsplit(entrez_id, "|", fixed = TRUE))
+ go_ids <- lapply(entrez_ids, function(id) {
+ mapIds(PANTHER.db, keys = id, keytype = pantherdb_keytype, column = "GOSLIM_ID", multiVals = "list")
+ })
+
+ # Flatten the list and remove duplicates
+ go_ids <- unique(unlist(go_ids))
+
+ if (length(go_ids) == 0) {
+ return("NA")
+ } else {
+ return(paste(go_ids, collapse = "|"))
+ }
+ }
+
+ # Apply the GO slim ID mapping function to all valid rows
+ annot_pantherdb <- annot_pantherdb %>%
+ mutate(GOSLIM_IDS = sapply(get(pantherdb_query), get_go_slim_ids))
+}
+
+
+#########################################################################
+############# Export annotation table and build info ####################
+#########################################################################
+
+# Group by primary key to remove any remaining unjoined or duplicate rows
+annot <- annot_pantherdb %>%
+ group_by(!!sym(primary_keytype)) %>%
+ summarise(across(everything(), ~paste(unique(na.omit(.))[unique(na.omit(.)) != ""], collapse = "|")), .groups = 'drop')
+
+# If "GO" column exists, move it to the end to keep columns in consistent order across organisms
+if ("GO" %in% names(annot)) {
+ go_column <- annot$GO
+ annot$GO <- NULL
+ annot$GO <- go_column
+}
+
+# Sort the annotation table based on primary keytype gene IDs
+annot <- annot %>% arrange(.[[1]])
+
+# Replace any blank cells with NA
+annot[annot == "" | annot == "NA"] <- NA
+
+# Export the annotation table
+write.table(annot, out_table_filename, sep = "\t", quote = FALSE, row.names = FALSE)
+
+# Define the date when the annotation table was generated
+date_generated <- format(Sys.time(), "%d-%B-%Y")
+
+# Export annotation build information
+writeLines(paste(c("Based on:\n ", GL_DPPD_ID), collapse = ""), out_log_filename)
+write(paste(c("\nBuild done on:\n ", date_generated), collapse = ""), out_log_filename, append = TRUE)
+write(paste(c("\nUsed gtf file:\n ", gtf_link), collapse = ""), out_log_filename, append = TRUE)
+if (!(target_organism %in% no_org_db)) {
+ write(paste(c("\nUsed ", target_org_db, " version:\n ", packageVersion(target_org_db) %>% as.character()), collapse = ""), out_log_filename, append = TRUE)
+}
+write(paste(c("\nUsed STRINGdb version:\n ", packageVersion("STRINGdb") %>% as.character()), collapse = ""), out_log_filename, append = TRUE)
+write(paste(c("\nUsed PANTHER.db version:\n ", packageVersion("PANTHER.db") %>% as.character()), collapse = ""), out_log_filename, append = TRUE)
+
+write("\n\nAll session info:\n", out_log_filename, append = TRUE)
+write(capture.output(sessionInfo()), out_log_filename, append = TRUE)
diff --git a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/bin/prepull_singularity.sh b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/bin/prepull_singularity.sh
new file mode 100644
index 00000000..de057d1b
--- /dev/null
+++ b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/bin/prepull_singularity.sh
@@ -0,0 +1,31 @@
+#!/usr/bin/env bash
+
+# Addresses issue: https://github.com/nextflow-io/nextflow/issues/1210
+
+CONFILE=${1:-nextflow.config}
+OUTDIR=${2:-./singularity}
+
+if [ ! -e $CONFILE ]; then
+ echo "$CONFILE does not exist"
+ exit 1
+fi
+
+TMPFILE=`mktemp`
+
+CURDIR=$(pwd)
+
+mkdir -p $OUTDIR
+
+cat ${CONFILE}|grep 'container'|perl -lane 'if ( $_=~/container\s*\=\s*\"(\S+)\"/ ) { $_=~/container\s*\=\s*\"(\S+)\"/; print $1 unless ( $1=~/^\s*$/ or $1=~/\.sif/ or $1=~/\.img/ ) ; }' > $TMPFILE
+
+cd ${OUTDIR}
+
+while IFS= read -r line; do
+ name=$line
+ name=${name/:/-}
+ name=${name//\//-}
+ echo $name
+ singularity pull ${name}.img docker://$line
+done < $TMPFILE
+
+cd $CURDIR
diff --git a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/config/software/by_docker_image.config b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/config/software/by_docker_image.config
new file mode 100644
index 00000000..93cc12ba
--- /dev/null
+++ b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/config/software/by_docker_image.config
@@ -0,0 +1,6 @@
+// Config that specifies containers for nextflow processes
+process {
+ withName: 'GL_REFANNOTTABLE_A' {
+ container = "quay.io/nasa_genelab/gl-refannottable-a:1.1.0"
+ }
+}
diff --git a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/install-org-db.R b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/install-org-db.R
new file mode 100644
index 00000000..00f03548
--- /dev/null
+++ b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/install-org-db.R
@@ -0,0 +1,116 @@
+# install-org-db.R
+options(timeout=3600)
+.libPaths(Sys.getenv("R_LIBS_USER"))
+# Load required libraries
+library(tidyverse)
+library(AnnotationForge)
+library(BiocManager)
+
+# Function: Get the annotations db name from the reference table. If none is defined, construct the package name from the genus and species (plus strain for microbes).
+# Try to install the annotations db from Bioconductor. If that fails, build the package with AnnotationForge and install it from the current directory.
+# Requires ~80GB for NCBIFilesDir file caching
+install_annotations <- function(target_organism, refTablePath = NULL) {
+ # Default URL for the specific version of the reference CSV
+ default_url <- "https://raw.githubusercontent.com/nasa/GeneLab_Data_Processing/master/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv"
+
+ # Use the provided path if available, otherwise use the default URL
+ csv_source <- ifelse(is.null(refTablePath), default_url, refTablePath)
+
+ # Attempt to read the CSV file
+ ref_table <- tryCatch({
+ read.csv(csv_source)
+ }, error = function(e) {
+ stop("Failed to read the reference table: ", e$message)
+ })
+
+ target_taxid <- ref_table %>%
+ filter(species == target_organism) %>%
+ pull(taxon)
+
+ # Parse organism's name in the reference table to create the org.db name (target_org_db)
+ target_species_designation <- ref_table %>%
+ filter(species == target_organism) %>%
+ pull(species) %>%
+ gsub("\\s+", " ", .) %>%
+ gsub("[^A-Za-z0-9 ]", "", .)
+
+ genus_species <- strsplit(target_species_designation, " ")[[1]]
+ if (length(genus_species) < 1) {
+ stop("Species designation is not correctly formatted: ", target_species_designation)
+ }
+
+ genus <- genus_species[1]
+ species <- ifelse(length(genus_species) > 1, genus_species[2], "")
+ strain <- ref_table %>%
+ filter(species == target_organism) %>%
+ pull(strain) %>%
+ gsub("[^A-Za-z0-9]", "", .)
+
+ if (!is.na(strain) && strain != "") {
+ species <- paste0(species, strain)
+ }
+
+ # Get package name or build it if not provided
+ target_org_db <- ref_table %>%
+ filter(species == target_organism) %>%
+ pull(bioconductor_annotations)
+
+ if (is.na(target_org_db) || target_org_db == "") {
+ cat("\nNo annotation database specified. Constructing package name...\n")
+ target_org_db <- paste0("org.", substr(genus, 1, 1), species, ".eg.db")
+ }
+
+ cat(paste0("\nChecking Bioconductor for '", target_org_db, "'...\n"))
+ if (requireNamespace(target_org_db, quietly = TRUE)) {
+ cat(paste0("'", target_org_db, "' is already installed.\n"))
+ } else {
+ cat(paste0("\nAttempting to install '", target_org_db, "' from Bioconductor...\n"))
+ BiocManager::install(target_org_db, ask = FALSE)
+
+ if (requireNamespace(target_org_db, quietly = TRUE)) {
+ cat(paste0("'", target_org_db, "' has been successfully installed from Bioconductor.\n"))
+ } else {
+ cat(paste0("\nInstallation from Bioconductor failed, attempting to build '", target_org_db, "'...\n"))
+ if (!dir.exists(target_org_db)) {
+ tryCatch({
+ BiocManager::install(c("AnnotationForge", "biomaRt", "GO.db"), ask = FALSE)
+ library(AnnotationForge)
+ makeOrgPackageFromNCBI(
+ version = "0.1",
+ author = "Your Name ",
+ maintainer = "Your Name ",
+ outputDir = "./",
+ tax_id = target_taxid,
+ genus = genus,
+ species = species
+ )
+ install.packages(file.path("./", target_org_db), repos = NULL, type = "source", quiet = TRUE)
+ cat(paste0("'", target_org_db, "' has been successfully built and installed.\n"))
+ }, error = function(e) {
+ stop("Failed to build and load the package: ", target_org_db, "\nError: ", e$message)
+ })
+ } else {
+ cat(paste0("Local annotation package ", target_org_db, " already exists. This local package will be installed.\n"))
+ install.packages(file.path("./", target_org_db), repos = NULL, type = "source", quiet = TRUE)
+ }
+ }
+ }
+
+ library(target_org_db, character.only = TRUE)
+ cat(paste0("Using Annotation Database '", target_org_db, "'.\n"))
+ return(target_org_db)
+}
+
+if (!interactive()) {
+ # Parse command line arguments
+ args <- commandArgs(trailingOnly = TRUE)
+
+ if (length(args) < 1) {
+ stop("Usage: Rscript install-org-db.R <target_organism> [refTablePath]")
+ }
+
+ target_organism <- args[1]
+ refTablePath <- if (length(args) > 1) args[2] else NULL
+
+ install_annotations(target_organism, refTablePath)
+}
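For reviewers, the fallback package-name construction in `install_annotations()` above can be sketched in isolation. This is a minimal base-R illustration, not part of the patch: the helper name `build_org_db_name` is hypothetical, and the example organism and strain strings are illustrative inputs rather than values read from the reference table.

```r
# Hypothetical helper (not in install-org-db.R): mirrors the fallback naming
# logic used when no `bioconductor_annotations` value is provided.
build_org_db_name <- function(species_designation, strain = NA_character_) {
  # Collapse runs of whitespace, then drop everything except letters, digits, spaces
  cleaned <- gsub("[^A-Za-z0-9 ]", "", gsub("\\s+", " ", species_designation))
  parts <- unlist(strsplit(cleaned, " "))
  if (length(parts) < 1) {
    stop("Species designation is not correctly formatted: ", species_designation)
  }

  genus <- parts[1]
  species <- if (length(parts) > 1) parts[2] else ""

  # Append the strain (stripped to alphanumerics) to the species, if one is given
  if (!is.na(strain) && strain != "") {
    species <- paste0(species, gsub("[^A-Za-z0-9]", "", strain))
  }

  # org.<first letter of genus><species[strain]>.eg.db
  paste0("org.", substr(genus, 1, 1), species, ".eg.db")
}

build_org_db_name("Brachypodium distachyon")   # "org.Bdistachyon.eg.db"
build_org_db_name("Escherichia coli", "K-12")  # "org.EcoliK12.eg.db"
```

The constructed name only matters when the package must be looked up on Bioconductor or built locally with `makeOrgPackageFromNCBI()`; when the reference table already supplies a package name, the script uses that value as is.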
diff --git a/GeneLab_Reference_Annotations/Workflow_Documentation/README.md b/GeneLab_Reference_Annotations/Workflow_Documentation/README.md
index 421b2a10..01b497e0 100644
--- a/GeneLab_Reference_Annotations/Workflow_Documentation/README.md
+++ b/GeneLab_Reference_Annotations/Workflow_Documentation/README.md
@@ -6,8 +6,9 @@
|Pipeline Version|Current Workflow Version (for respective pipeline version)|
|:---------------|:---------------------------------------------------------|
-|*[GL-DPPD-7110.md](../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110/GL-DPPD-7110.md)|[1.0.0](GL_RefAnnotTable)|
+|*[GL-DPPD-7110-A.md](../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md)|[1.1.0](GL_RefAnnotTable-A)|
+|[GL-DPPD-7110.md](../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110/GL-DPPD-7110.md)|[1.0.0](GL_RefAnnotTable)|
*Current GeneLab Pipeline/Workflow Implementation
-> See the [workflow change log](GL_RefAnnotTable/CHANGELOG.md) to access previous workflow versions and view all changes associated with each version update.
+> See the [workflow change log](GL_RefAnnotTable-A/CHANGELOG.md) to access previous workflow versions and view all changes associated with each version update.