From 3f731d57b1c94059755257578232af7d238ece10 Mon Sep 17 00:00:00 2001
From: asaravia-butler <70983120+asaravia-butler@users.noreply.github.com>
Date: Thu, 25 Apr 2024 10:38:20 -0700
Subject: [PATCH 01/58] Create GL-DPPD-7110-A.md

---
 .../GL-DPPD-7110-A/GL-DPPD-7110-A.md | 373 ++++++++++++++++++
 1 file changed, 373 insertions(+)
 create mode 100644 GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md

diff --git a/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md b/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md
new file mode 100644
index 00000000..95dcdd91
--- /dev/null
+++ b/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md
@@ -0,0 +1,373 @@
+# GeneLab pipeline for generating reference annotation tables
+
+> **This page holds an overview and instructions for how GeneLab generates reference annotation tables. The GeneLab reference annotation tables used to add annotations to processed data files are indicated in the exact processing scripts provided for each GLDS dataset under the respective omics datatype subdirectory.**
+
+---
+
+**Date:** Month XX, 2024
+**Revision:** -A
+**Document Number:** GL-DPPD-7110-A
+
+**Submitted by:**
+Alexis Torres and Crystal Han (GeneLab Data Processing Team)
+
+**Approved by:**
+Sylvain Costes (OSDR Project Manager)
+Samrawit Gebre (GeneLab Deputy Project Manager and Acting GeneLab Configuration Manager)
+Lauren Sanders (OSDR Project Scientist)
+Amanda Saravia-Butler (GeneLab Science Lead)
+Barbara Novak (GeneLab Data Processing Lead)
+
+---
+
+## Updates from previous version
+
+
+
+---
+
+# Table of contents
+
+- [GeneLab pipeline for generating reference annotation tables](#genelab-pipeline-for-generating-reference-annotation-tables)
+- [Table of contents](#table-of-contents)
+- [Software used](#software-used)
+- [Annotation table build overview with example
commands](#annotation-table-build-overview-with-example-commands) + - [0. Set Up Environment](#0-set-up-environment) + - [1. Define Variables and Output File Names](#1-define-variables-and-output-file-names) + - [2. Load Annotation Databases and Retrieve Unique Gene IDs](#2-load-annotation-databases-and-retrieve-unique-gene-ids) + - [3. Build Initial Annotation Table](#3-build-initial-annotation-table) + - [4. Add STRING IDs](#4-add-string-ids) + - [5. Add Gene Ontology (GO) slim IDs](#5-add-gene-ontology-go-slim-ids) + - [6. Export Annotation Table and Build Info](#6-export-annotation-table-and-build-info) + + +--- + +# Software used + +|Program|Version|Relevant Links| +|:------|:------:|:-------------| +|R|4.2.1|[https://www.r-project.org/](https://www.r-project.org/)| +|Bioconductor|3.15|[https://bioconductor.org](https://bioconductor.org)| +|tidyverse|1.3.2|[https://www.tidyverse.org](https://www.tidyverse.org)| +|STRINGdb|2.8.4|[https://www.bioconductor.org/packages/release/bioc/html/STRINGdb.html](https://www.bioconductor.org/packages/release/bioc/html/STRINGdb.html)| +|PANTHER.db|1.0.11|[https://bioconductor.org/packages/release/data/annotation/html/PANTHER.db.html](https://bioconductor.org/packages/release/data/annotation/html/PANTHER.db.html)| +|rtracklayer|1.56.1|[https://bioconductor.org/packages/release/bioc/html/rtracklayer.html](https://bioconductor.org/packages/release/bioc/html/rtracklayer.html) +|org.Hs.eg.db|3.15.0|[https://bioconductor.org/packages/release/data/annotation/html/org.Hs.eg.db.html](https://bioconductor.org/packages/release/data/annotation/html/org.Hs.eg.db.html)| +|org.Mm.eg.db|3.15.0|[https://bioconductor.org/packages/release/data/annotation/html/org.Mm.eg.db.html](https://bioconductor.org/packages/release/data/annotation/html/org.Mm.eg.db.html)| +|org.Rn.eg.db|3.15.0|[https://bioconductor.org/packages/release/data/annotation/html/org.Rn.eg.db.html](https://bioconductor.org/packages/release/data/annotation/html/org.Rn.eg.db.html) 
+|org.Dm.eg.db|3.15.0|[https://bioconductor.org/packages/release/data/annotation/html/org.Dm.eg.db.html](https://bioconductor.org/packages/release/data/annotation/html/org.Dm.eg.db.html)|
+|org.Ce.eg.db|3.15.0|[https://bioconductor.org/packages/release/data/annotation/html/org.Ce.eg.db.html](https://bioconductor.org/packages/release/data/annotation/html/org.Ce.eg.db.html)|
+|org.At.tair.db|3.15.0|[https://bioconductor.org/packages/release/data/annotation/html/org.At.tair.db.html](https://bioconductor.org/packages/release/data/annotation/html/org.At.tair.db.html)|
+|org.EcK12.eg.db|3.15.0|[https://bioconductor.org/packages/release/data/annotation/html/org.EcK12.eg.db.html](https://bioconductor.org/packages/release/data/annotation/html/org.EcK12.eg.db.html)|
+|org.Sc.sgd.db|3.15.0|[https://bioconductor.org/packages/release/data/annotation/html/org.Sc.sgd.db.html](https://bioconductor.org/packages/release/data/annotation/html/org.Sc.sgd.db.html)|
+
+---
+
+# Annotation table build overview with example commands
+
+> Current GeneLab annotation tables are available on [figshare](https://figshare.com/); exact links for each reference organism are provided in the [GL-DPPD-7110-A_annotations.csv](GL-DPPD-7110-A_annotations.csv) file.
+>
+> **[Ensembl Reference Files](https://www.ensembl.org/index.html) Used:**
+> - Animals: Ensembl release 111
+> - Plants: Ensembl plants release 58
+> - Bacteria: Ensembl bacteria release 58
+
+
+---
+
+The example below is for *Mus musculus*. All code is executed in R.
+
+## 0. Set Up Environment
+
+```R
+target_organism <- "MOUSE"
+
+GL_DPPD_ID <- "GL-DPPD-7110-A"
+
+## Import libraries ##
+library(tidyverse)
+library(STRINGdb)
+library(PANTHER.db)
+library(rtracklayer)
+
+
+## Set the primary annotation keytype, TAIR for Arabidopsis, ENSEMBL for all other organisms ##
+if ( target_organism == "ARABIDOPSIS" ) {
+
+  primary_keytype <- "TAIR"
+
+} else {
+
+  primary_keytype <- "ENSEMBL"
+
+}
+
+
+## Define annotation keys to retrieve ##
+wanted_keys_vec <- c("SYMBOL", "GENENAME", "REFSEQ", "ENTREZID")
+
+
+## Define links to tables containing species-specific annotation info ##
+ref_tab_link <-
+  "https://raw.githubusercontent.com/nasa/GeneLab_Data_Processing/master/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110/GL-DPPD-7110_annotations.csv"
+
+
+## Set timeout time to allow more time for annotation file downloads to complete ##
+options(timeout = 600)
+```
+
+---
+
+## 1. Define Variables and Output File Names
+
+```R
+## Read in tables containing species-specific annotation info ##
+ref_table <- read.csv(ref_tab_link)
+
+## Retrieve and define target organism taxid, annotation database name, and scientific name ##
+target_taxid <- ref_table %>%
+  filter(name == target_organism) %>%
+  pull(taxon)
+
+target_org_db <- ref_table %>%
+  filter(name == target_organism) %>%
+  pull(annotations)
+
+target_species_designation <- ref_table %>%
+  filter(name == target_organism) %>%
+  pull(species)
+
+## Define link to Ensembl annotation gtf file for the target organism ##
+gtf_link <- ref_table %>%
+  filter(species == target_species_designation) %>%
+  pull(gtf)
+
+## Create output file names ##
+base_gtf_filename <- basename(gtf_link)
+base_output_name <- str_replace(base_gtf_filename, fixed(".gtf.gz"), "")
+
+out_table_filename <- paste0(base_output_name, "-GL-annotations.tsv")
+out_log_filename <- paste0(base_output_name, "-GL-build-info.txt")
+```
+
+
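The lookup and file-naming logic of Step 1 can be tried offline with a small in-memory table. This is a minimal sketch in base R, not the pipeline's own code: the two-row `ref_table` is a hand-copied subset of the reference CSV columns, and `sub(..., fixed = TRUE)` stands in for the tidyverse call.

```r
# Hand-copied subset of the reference table (an assumption for illustration;
# the pipeline reads the full CSV from ref_tab_link)
ref_table <- data.frame(
  name        = c("HUMAN", "MOUSE"),
  species     = c("Homo sapiens", "Mus musculus"),
  taxon       = c(9606, 10090),
  annotations = c("org.Hs.eg.db", "org.Mm.eg.db"),
  gtf         = c(
    "https://ftp.ensembl.org/pub/release-111/gtf/homo_sapiens/Homo_sapiens.GRCh38.111.gtf.gz",
    "https://ftp.ensembl.org/pub/release-111/gtf/mus_musculus/Mus_musculus.GRCm39.111.gtf.gz"
  ),
  stringsAsFactors = FALSE
)

target_organism <- "MOUSE"

# Select the single row describing the target organism
target_row <- ref_table[ref_table$name == target_organism, ]
stopifnot(nrow(target_row) == 1)

target_taxid  <- target_row$taxon
target_org_db <- target_row$annotations
gtf_link      <- target_row$gtf

# Strip the literal ".gtf.gz" suffix; fixed = TRUE avoids treating "." as a regex wildcard
base_output_name   <- sub(".gtf.gz", "", basename(gtf_link), fixed = TRUE)
out_table_filename <- paste0(base_output_name, "-GL-annotations.tsv")
out_log_filename   <- paste0(base_output_name, "-GL-build-info.txt")

print(out_table_filename)
print(out_log_filename)
```

The `stopifnot()` guard is an addition for the sketch: it fails early if an organism name is missing from, or duplicated in, the table.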
+
+---
+
+## 2. Load Annotation Databases and Retrieve Unique Gene IDs
+
+```R
+## Import Ensembl annotation gtf file for the target organism ##
+gtf_obj <- import(gtf_link)
+
+## Define unique Ensembl IDs ##
+unique_IDs <- gtf_obj$gene_id %>% unique()
+
+## Remove gtf object to conserve RAM, since it is no longer needed ##
+rm(gtf_obj)
+
+## Define target organism annotation database ##
+ann.dbi <- target_org_db
+
+## Install target organism annotation database if not already installed, then load the annotation database library ##
+if ( ! require(ann.dbi, character.only = TRUE)) {
+
+  BiocManager::install(ann.dbi, ask = FALSE)
+
+}
+
+library(ann.dbi, character.only = TRUE)
+```
+
+
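To see what the unique gene ID vector from Step 2 looks like without downloading a GTF file, the idea can be sketched in base R on a few hand-written GTF records. This is only an illustration under stated assumptions: the coordinates are made up, and the regular expression is a stand-in for the parsing that `import()` performs in the pipeline.

```r
# Three toy GTF records (tab-separated, attributes in column 9); coordinates are invented
gtf_lines <- c(
  '1\tensembl\tgene\t100\t900\t.\t+\t.\tgene_id "ENSMUSG00000000001"; gene_name "Gnai3";',
  '1\tensembl\ttranscript\t100\t900\t.\t+\t.\tgene_id "ENSMUSG00000000001"; transcript_id "ENSMUST00000000001";',
  '2\tensembl\tgene\t50\t400\t.\t-\t.\tgene_id "ENSMUSG00000000028"; gene_name "Cdc45";'
)

# Pull out the attribute column (field 9) of each record
attrs <- vapply(strsplit(gtf_lines, "\t", fixed = TRUE),
                function(fields) fields[9],
                character(1))

# Extract the quoted value that follows gene_id
gene_ids <- sub('.*gene_id "([^"]+)".*', "\\1", attrs)

# Genes appear once per feature (gene, transcript, exon, ...), so deduplicate
unique_IDs <- unique(gene_ids)
print(unique_IDs)
```

Deduplication matters because every transcript and exon record repeats its parent `gene_id`, and the annotation table needs exactly one row per gene.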
+
+---
+
+## 3. Build Initial Annotation Table
+
+```R
+## Begin annotation table using unique IDs of the primary keytype ##
+annot <- data.frame(unique_IDs)
+colnames(annot) <- primary_keytype
+
+## Retrieve and add additional annotation keys as table columns ##
+for ( key in wanted_keys_vec ) {
+
+  if ( key %in% columns(eval(parse(text = ann.dbi), envir = .GlobalEnv))) {
+
+    new_list <- mapIds(eval(parse(text = ann.dbi), envir = .GlobalEnv), keys = unique_IDs, keytype = primary_keytype, column = key, multiVals = "list")
+
+    # they come as lists when we accept the multiple hits, so converting to character strings here
+    annot[[key]] <- sapply(new_list, paste, collapse = "|")
+
+  } else {
+
+    # if the annotation DB didn't have any of the wanted key types, that column will be missing
+    # adding in here as an empty column
+    annot[key] <- NA
+
+  }
+}
+
+```
+
+
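The list-to-string conversion in Step 3 is worth seeing in isolation, because it explains why later steps compare against the string `"NA"` rather than a missing value. The sketch below uses a hand-built list with hypothetical gene and ENTREZ IDs in place of a real `mapIds()` result.

```r
# Stand-in for a mapIds(..., multiVals = "list") result; all IDs are hypothetical
new_list <- list(
  geneA = c("111", "222"),   # gene with two hits
  geneB = "333",             # gene with one hit
  geneC = NA_character_      # gene with no hit
)

# Collapse each list element into a single "|"-separated string
collapsed <- sapply(new_list, paste, collapse = "|")
print(collapsed)

# paste() converts a missing value into the literal string "NA"
print(is.na(collapsed[["geneC"]]))   # FALSE: it is now text, not a missing value
```

Because `paste()` turns `NA` into text, unmapped genes carry the string `"NA"` until the table is written out in Step 6.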
+
+---
+
+## 4. Add STRING IDs
+
+```R
+## Retrieve target organism STRING protein-protein interaction database and create STRING ID map to the primary keytype ##
+string_db <- STRINGdb$new(version = "11.5", species = target_taxid, score_threshold = 0)
+string_map <- string_db$map(annot, primary_keytype, removeUnmappedRows = FALSE, takeFirst = FALSE)
+
+## Create a table using the gene IDs of the primary keytype as row names and a column containing STRING IDs. ##
+## For genes containing multiple STRING IDs, combine all STRING IDs for each gene into one row and separate each ID with a '|' ##
+tab_with_multiple_STRINGids_combined <-
+  data.frame(row.names = annot[[primary_keytype]])
+
+for ( curr_gene_ID in row.names(tab_with_multiple_STRINGids_combined) ) {
+
+  curr_STRING_ids <- string_map %>%
+    filter(!!rlang::sym(primary_keytype) == curr_gene_ID) %>%
+    pull(STRING_id) %>% paste(collapse = "|")
+
+  tab_with_multiple_STRINGids_combined[curr_gene_ID, "STRING_id"] <- curr_STRING_ids
+
+}
+
+## Move the primary keytype gene IDs back to being a column in the STRING ID table (since they were switched to row names above) ##
+tab_with_multiple_STRINGids_combined <-
+  tab_with_multiple_STRINGids_combined %>%
+  rownames_to_column(primary_keytype)
+
+## Add the STRING ID column to the annotation table ##
+
+annot <- dplyr::left_join(annot,
+                          tab_with_multiple_STRINGids_combined,
+                          by = primary_keytype)
+```
+
+
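The per-gene loop in Step 4 filters the full STRING map once per gene. On a toy table, the same "one row per gene, IDs joined with `|`" result can be reached with a single grouped operation. The sketch below is an alternative in base R, not the pipeline's method; the gene names and STRING IDs are hypothetical, and `string_map` is built by hand instead of coming from `STRINGdb`.

```r
# Toy annotation table and hand-built STRING map (all IDs are hypothetical)
annot <- data.frame(ENSEMBL = c("geneA", "geneB", "geneC"), stringsAsFactors = FALSE)

string_map <- data.frame(
  ENSEMBL   = c("geneA", "geneA", "geneB", "geneC"),
  STRING_id = c("10090.P1", "10090.P2", "10090.P3", NA),
  stringsAsFactors = FALSE
)

# Group STRING IDs by gene and join each group with "|"
ids_by_gene <- tapply(string_map$STRING_id, string_map$ENSEMBL, paste, collapse = "|")

combined <- data.frame(
  ENSEMBL   = names(ids_by_gene),
  STRING_id = as.vector(ids_by_gene),
  stringsAsFactors = FALSE
)

# Left-join onto the annotation table so every gene keeps its row
annot <- merge(annot, combined, by = "ENSEMBL", all.x = TRUE)
print(annot)
```

As in the loop version, a gene with no STRING match ends up with the string `"NA"`, since `paste()` converts the missing value to text. Note that `merge()` sorts rows by the key column, which the loop version does not.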
+
+---
+
+## 5. Add Gene Ontology (GO) slim IDs
+
+```R
+## Retrieve target organism PANTHER GO slim annotations database ##
+pthOrganisms(PANTHER.db) <- target_organism
+
+## Use ENTREZ IDs to map genes to respective PANTHER GO slim annotation(s) ##
+# Note: Since there can be none (indicated in the annotation table as "NA"), one, or
+# multiple ENTREZ IDs for a gene, this section contains 3 distinct parts to handle
+# each of those scenarios and create a new column in the annotation table containing the GO slim IDs
+
+for ( curr_row in 1:dim(annot)[1] ) {
+
+  curr_entry <- annot[curr_row, "ENTREZID"]
+
+  ## For genes without an ENTREZ ID ##
+  if ( curr_entry == "NA" ) {
+
+    annot[curr_row, "GOSLIM_IDS"] <- "NA"
+
+  } else if ( ! grepl("|", curr_entry, fixed = TRUE) ) {
+
+    ## For genes with one ENTREZ ID ##
+    curr_GO_IDs <- mapIds(PANTHER.db, keys = curr_entry, keytype = "ENTREZ", column = "GOSLIM_ID", multiVals = "list") %>% unlist() %>% as.vector()
+
+    ## Add "NA" to the GO slim column for ENTREZ IDs that do not contain a respective GO slim ID ##
+    if ( is.null(curr_GO_IDs) ) {
+
+      curr_GO_IDs <- "NA"
+    }
+
+    annot[curr_row, "GOSLIM_IDS"] <- paste(curr_GO_IDs, collapse = "|")
+
+  } else {
+
+    ## For genes with multiple ENTREZ IDs ##
+    # Note: In this scenario, the ENTREZ IDs for each gene are first split on '|' to
+    # separate the IDs, then the GO slim ID(s) for each ENTREZ ID are collected and
+    # combined, then duplicates are removed, and the final list of GO slim IDs for
+    # each gene is added in a single row, separated with a '|'
+
+    ## Split the ENTREZ IDs ##
+    curr_entry_vec <- strsplit(curr_entry, "|", fixed = TRUE)
+
+    ## Start a vector of current GO slim IDs ##
+    curr_GO_IDs <- vector()
+
+    ## Collect and combine GO slim ID(s) for each ENTREZ ID ##
+    for ( curr_entry in curr_entry_vec ) {
+
+      new_GO_IDs <- mapIds(PANTHER.db, keys = curr_entry, keytype = "ENTREZ", column = "GOSLIM_ID", multiVals = "list") %>% unlist() %>% as.vector()
+
+      ## Add new GO slim IDs to the GO slim IDs vector ##
+      curr_GO_IDs <- c(curr_GO_IDs, new_GO_IDs)
+
+    }
+
+    ## Remove duplicate GO slim IDs ##
+    curr_GO_IDs <- unique(curr_GO_IDs)
+
+    ## Add "NA" to the GO slim vector for ENTREZ IDs that do not contain a respective GO slim ID ##
+    if ( length(curr_GO_IDs) == 0 ) {
+
+      curr_GO_IDs <- "NA"
+    }
+
+    ## Add additional GO slim IDs to the GOSLIM ID column in the annotation table ##
+    annot[curr_row, "GOSLIM_IDS"] <- paste(curr_GO_IDs, collapse = "|")
+
+  }
+
+}
+```
+
+
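The three cases handled in Step 5 (no ENTREZ ID, one ID, several IDs) reduce to the same split, look up, deduplicate, and collapse sequence. The sketch below shows that sequence in base R. It rests on labeled assumptions: `goslim_lookup` is a hand-built named list standing in for PANTHER.db, `collect_goslim()` is a hypothetical helper, and the ENTREZ and GO IDs are placeholders.

```r
# Hypothetical stand-in for the PANTHER GO slim mapping: ENTREZ ID -> GO slim IDs
goslim_lookup <- list(
  "101" = c("GO:0000001", "GO:0000002"),
  "202" = c("GO:0000002", "GO:0000003")
)

# Hypothetical helper: turn one "|"-separated ENTREZ field into one "|"-separated GO slim field
collect_goslim <- function(entrez_field, lookup) {
  # Case 1: gene without an ENTREZ ID (stored as the string "NA")
  if (is.na(entrez_field) || entrez_field == "NA") {
    return("NA")
  }
  # Cases 2 and 3: one or several IDs; splitting a single ID yields a length-1 vector
  ids <- strsplit(entrez_field, "|", fixed = TRUE)[[1]]
  # Look up every ID, flatten the results, and drop duplicates
  go_ids <- unique(unlist(lookup[ids], use.names = FALSE))
  # IDs that are absent from the lookup contribute nothing
  if (length(go_ids) == 0) {
    return("NA")
  }
  paste(go_ids, collapse = "|")
}

print(collect_goslim("NA", goslim_lookup))
print(collect_goslim("101", goslim_lookup))
print(collect_goslim("101|202", goslim_lookup))
print(collect_goslim("999", goslim_lookup))
```

Treating the single-ID case as a length-1 split removes one branch, at the cost of departing from the explicit three-part layout used in the pipeline code.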
+
+---
+
+## 6. Export Annotation Table and Build Info
+
+```R
+## Sort the annotation table based on primary keytype gene IDs ##
+annot <- annot %>% arrange(.[[1]])
+
+## Replace any blank cells with NA ##
+annot[annot == ""] <- NA
+
+## Export the annotation table using the file name defined in Step 1 ##
+write.table(annot, out_table_filename, sep = "\t", quote = FALSE, row.names = FALSE)
+
+## Define the date the annotation table was generated ##
+date_generated <- format(Sys.time(), "%d-%B-%Y")
+
+## Export annotation table build info using the file name defined in Step 1 ##
+writeLines(paste(c("Based on:\n ", GL_DPPD_ID), collapse = ""), out_log_filename)
+write(paste(c("Build done on:\n ", date_generated), collapse = ""), out_log_filename, append = TRUE)
+write(paste(c("\nUsed gtf file:\n ", gtf_link), collapse = ""), out_log_filename, append = TRUE)
+write(paste(c("\nUsed ", ann.dbi, " version:\n ", packageVersion(ann.dbi) %>% as.character()), collapse = ""), out_log_filename, append = TRUE)
+write(paste(c("\nUsed STRINGdb version:\n ", packageVersion("STRINGdb") %>% as.character()), collapse = ""), out_log_filename, append = TRUE)
+write(paste(c("\nUsed PANTHER.db version:\n ", packageVersion("PANTHER.db") %>% as.character()), collapse = ""), out_log_filename, append = TRUE)
+
+write("\n\nAll session info:\n", out_log_filename, append = TRUE)
+write(capture.output(sessionInfo()), out_log_filename, append = TRUE)
+```
+
+
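The sort, blank-to-NA, and export steps of Step 6 can be checked with a round trip through a temporary file. This is a minimal base R sketch on a two-row toy table with hypothetical gene names; `order()` replaces the tidyverse `arrange()` call, and a temporary file stands in for the output file named in Step 1.

```r
# Toy annotation table with one blank cell (gene names are hypothetical)
annot <- data.frame(
  ENSEMBL = c("geneB", "geneA"),
  SYMBOL  = c("", "Abc1"),
  stringsAsFactors = FALSE
)

# Sort by the first column (the primary keytype gene IDs)
annot <- annot[order(annot[[1]]), ]

# Replace blank cells with NA so they are written out as "NA"
annot[annot == ""] <- NA

# Write a tab-separated table without quotes or row names
out_file <- tempfile(fileext = ".tsv")
write.table(annot, out_file, sep = "\t", quote = FALSE, row.names = FALSE)

# Read the file back to confirm what a downstream pipeline would see
round_trip <- read.delim(out_file, stringsAsFactors = FALSE)
print(round_trip)
```

Reading the file back shows that blank cells arrive downstream as missing values: `read.delim()` interprets the written `NA` text as `NA` by default.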
+
+---
+
+**Pipeline Input data:**
+
+- No input files required, but a target organism must be specified as a positional command line argument
+
+**Pipeline Output data:**
+
+- *-GL-annotations.tsv (Tab-delimited table of gene annotations, used to add gene annotations in other GeneLab processing pipelines)
+- *-GL-build-info.txt (Text file containing information used to create the annotation table, including tools, tool versions, and date of creation)

From 98ff3e3c3f09cf19732396ca83c754e8aa9a1c41 Mon Sep 17 00:00:00 2001
From: asaravia-butler <70983120+asaravia-butler@users.noreply.github.com>
Date: Thu, 25 Apr 2024 10:51:51 -0700
Subject: [PATCH 02/58] Add files via upload

---
 .../GL-DPPD-7110-A_annotations.csv | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)
 create mode 100644 GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv

diff --git a/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv b/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv
new file mode 100644
index 00000000..2c9b20ff
--- /dev/null
+++ b/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv
@@ -0,0 +1,19 @@
+name,species,strain,ensemblVersion,ref_source,fasta,gtf,taxon,annotations,genelab_annots_link,genelab_annots_info_link
+ARABIDOPSIS,Arabidopsis thaliana,,58,ensembl_plants,https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-58/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz,https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-58/gtf/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.58.gtf.gz,3702,org.At.tair.db,,
+BACSU,Bacillus subtilis,subsp.
subtilis 168,58,ensembl_bacteria,https://ftp.ensemblgenomes.ebi.ac.uk/pub/bacteria/release-58/fasta/bacteria_0_collection/bacillus_subtilis_subsp_subtilis_str_168_gca_000009045/dna/Bacillus_subtilis_subsp_subtilis_str_168_gca_000009045.ASM904v1.dna.toplevel.fa.gz,https://ftp.ensemblgenomes.ebi.ac.uk/pub/bacteria/release-58/gtf/bacteria_0_collection/bacillus_subtilis_subsp_subtilis_str_168_gca_000009045/Bacillus_subtilis_subsp_subtilis_str_168_gca_000009045.ASM904v1.58.gtf.gz,224308,org.MeSH.Bsu.168.db,, +BRARP,Brassica rapa,,58,ensembl_plants,,,,,, +WORM,Caenorhabditis elegans,,111,ensembl,https://ftp.ensembl.org/pub/release-111/fasta/caenorhabditis_elegans/dna/Caenorhabditis_elegans.WBcel235.dna.toplevel.fa.gz,https://ftp.ensembl.org/pub/release-111/gtf/caenorhabditis_elegans/Caenorhabditis_elegans.WBcel235.111.gtf.gz,6239,org.Ce.eg.db,, +ZEBRAFISH,Danio rerio,,111,ensembl,,,7955,org.Dr.eg.db,, +FLY,Drosophila melanogaster,,111,ensembl,,,7227,org.Dm.eg.db,, +ERCC,,,,ThermoFisher,,,,,, +ECOLI,Escherichia coli,str. K-12 substr. 
MG1655,58,ensembl_bacteria,,,83333,org.EcK12.eg.db,, +HUMAN,Homo sapiens,,111,ensembl,https://ftp.ensembl.org/pub/release-111/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz,https://ftp.ensembl.org/pub/release-111/gtf/homo_sapiens/Homo_sapiens.GRCh38.111.gtf.gz,9606,org.Hs.eg.db,, +MOUSE,Mus musculus,,111,ensembl,https://ftp.ensembl.org/pub/release-111/fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna.primary_assembly.fa.gz,https://ftp.ensembl.org/pub/release-111/gtf/mus_musculus/Mus_musculus.GRCm39.111.gtf.gz,10090,org.Mm.eg.db,, +,Mycobacterium marinum,LHM4,58,ensembl_bacteria,,,,,, +ORYLA,Oryzias latipes,,111,ensembl,,,,,, +RAT,Rattus norvegicus,,111,ensembl,,,10116,org.Rn.eg.db,, +YEAST,Saccharomyces cerevisiae,S288C,111,ensembl,,,559292,org.Sc.sgd.db,, +STAA8,Staphylococcus aureus,UAMS-1,58,ensembl_bacteria,,,,,, +,Streptococcus mutans,UA159,58,ensembl_bacteria,,,,,, +BRADI,Brachypodium distachyon,,58,ensembl_plants,,,15368,,, +ORYSJ,Oryza sativa,Japonica,58,ensembl_plants,,,4530,BSgenome.Osativa.MSU.MSU7,, \ No newline at end of file From b6d408a350b8343a6489009481bc24aecf37058a Mon Sep 17 00:00:00 2001 From: asaravia-butler <70983120+asaravia-butler@users.noreply.github.com> Date: Thu, 25 Apr 2024 10:53:35 -0700 Subject: [PATCH 03/58] Updating to point to pipeline version A --- GeneLab_Reference_Annotations/README.md | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/GeneLab_Reference_Annotations/README.md b/GeneLab_Reference_Annotations/README.md index e11a15c0..755c3348 100644 --- a/GeneLab_Reference_Annotations/README.md +++ b/GeneLab_Reference_Annotations/README.md @@ -1,6 +1,6 @@ # GeneLab pipeline for generating reference annotation tables -> **The document [`GL-DPPD-7110.md`](Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110/GL-DPPD-7110.md) holds an overview and example commands for how GeneLab generates reference annotation tables. 
See the [Repository Links](#repository-links) descriptions below for more information.** +> **The document [`GL-DPPD-7110-A.md`](Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md) holds an overview and example commands for how GeneLab generates reference annotation tables. See the [Repository Links](#repository-links) descriptions below for more information.** --- ## Repository Links @@ -17,6 +17,9 @@ --- -**Developed and maintained by:** +**Developed by:** Mike Lee +**Maintained by:** +Alexis Torres +Crystal Han From 9123f89d261684c6ff2bccd3ca0fd566bc0801af Mon Sep 17 00:00:00 2001 From: torres-alexis Date: Fri, 24 May 2024 10:49:38 -0700 Subject: [PATCH 04/58] [GL_RefAnnotTable] Added rat links and annotation table --- .../GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv b/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv index 2c9b20ff..e29b8771 100644 --- a/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv +++ b/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv @@ -11,7 +11,7 @@ HUMAN,Homo sapiens,,111,ensembl,https://ftp.ensembl.org/pub/release-111/fasta/ho MOUSE,Mus musculus,,111,ensembl,https://ftp.ensembl.org/pub/release-111/fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna.primary_assembly.fa.gz,https://ftp.ensembl.org/pub/release-111/gtf/mus_musculus/Mus_musculus.GRCm39.111.gtf.gz,10090,org.Mm.eg.db,, ,Mycobacterium marinum,LHM4,58,ensembl_bacteria,,,,,, ORYLA,Oryzias latipes,,111,ensembl,,,,,, -RAT,Rattus norvegicus,,111,ensembl,,,10116,org.Rn.eg.db,, +RAT,Rattus 
norvegicus,,112,ensembl,http://ftp.ensembl.org/pub/release-112/fasta/rattus_norvegicus/dna/Rattus_norvegicus.mRatBN7.2.dna.toplevel.fa.gz,http://ftp.ensembl.org/pub/release-112/gtf/rattus_norvegicus/Rattus_norvegicus.mRatBN7.2.112.gtf.gz,10116,org.Rn.eg.db,https://figshare.com/ndownloader/files/46537834,https://figshare.com/ndownloader/files/46537867 YEAST,Saccharomyces cerevisiae,S288C,111,ensembl,,,559292,org.Sc.sgd.db,, STAA8,Staphylococcus aureus,UAMS-1,58,ensembl_bacteria,,,,,, ,Streptococcus mutans,UA159,58,ensembl_bacteria,,,,,, From 8d6f239d752788f9cc834a96af9c96d86922b553 Mon Sep 17 00:00:00 2001 From: torres-alexis Date: Sat, 1 Jun 2024 23:01:55 -0700 Subject: [PATCH 05/58] [GL_RefAnnotTable] Updated Reference annotations CSV --- .../GL-DPPD-7110-A_annotations.csv | 35 ++++++++++--------- 1 file changed, 18 insertions(+), 17 deletions(-) diff --git a/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv b/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv index e29b8771..ac4490a0 100644 --- a/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv +++ b/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv @@ -1,19 +1,20 @@ name,species,strain,ensemblVersion,ref_source,fasta,gtf,taxon,annotations,genelab_annots_link,genelab_annots_info_link -ARABIDOPSIS,Arabidopsis thaliana,,58,ensembl_plants,https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-58/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz,https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-58/gtf/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.58.gtf.gz,3702,org.At.tair.db,, -BACSU,Bacillus subtilis,subsp. 
subtilis 168,58,ensembl_bacteria,https://ftp.ensemblgenomes.ebi.ac.uk/pub/bacteria/release-58/fasta/bacteria_0_collection/bacillus_subtilis_subsp_subtilis_str_168_gca_000009045/dna/Bacillus_subtilis_subsp_subtilis_str_168_gca_000009045.ASM904v1.dna.toplevel.fa.gz,https://ftp.ensemblgenomes.ebi.ac.uk/pub/bacteria/release-58/gtf/bacteria_0_collection/bacillus_subtilis_subsp_subtilis_str_168_gca_000009045/Bacillus_subtilis_subsp_subtilis_str_168_gca_000009045.ASM904v1.58.gtf.gz,224308,org.MeSH.Bsu.168.db,, -BRARP,Brassica rapa,,58,ensembl_plants,,,,,, -WORM,Caenorhabditis elegans,,111,ensembl,https://ftp.ensembl.org/pub/release-111/fasta/caenorhabditis_elegans/dna/Caenorhabditis_elegans.WBcel235.dna.toplevel.fa.gz,https://ftp.ensembl.org/pub/release-111/gtf/caenorhabditis_elegans/Caenorhabditis_elegans.WBcel235.111.gtf.gz,6239,org.Ce.eg.db,, -ZEBRAFISH,Danio rerio,,111,ensembl,,,7955,org.Dr.eg.db,, -FLY,Drosophila melanogaster,,111,ensembl,,,7227,org.Dm.eg.db,, -ERCC,,,,ThermoFisher,,,,,, -ECOLI,Escherichia coli,str. K-12 substr. 
MG1655,58,ensembl_bacteria,,,83333,org.EcK12.eg.db,, -HUMAN,Homo sapiens,,111,ensembl,https://ftp.ensembl.org/pub/release-111/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz,https://ftp.ensembl.org/pub/release-111/gtf/homo_sapiens/Homo_sapiens.GRCh38.111.gtf.gz,9606,org.Hs.eg.db,, -MOUSE,Mus musculus,,111,ensembl,https://ftp.ensembl.org/pub/release-111/fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna.primary_assembly.fa.gz,https://ftp.ensembl.org/pub/release-111/gtf/mus_musculus/Mus_musculus.GRCm39.111.gtf.gz,10090,org.Mm.eg.db,, -,Mycobacterium marinum,LHM4,58,ensembl_bacteria,,,,,, -ORYLA,Oryzias latipes,,111,ensembl,,,,,, +ARABIDOPSIS,Arabidopsis thaliana,,59,ensembl_plants,https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-59/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz,https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-59/gtf/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.59.gtf.gz,3702,org.At.tair.db,https://figshare.com/ndownloader/files/46762531,https://figshare.com/ndownloader/files/46762525 +BACSU,Bacillus subtilis,subsp. 
subtilis 168,59,ensembl_bacteria,https://ftp.ensemblgenomes.ebi.ac.uk/pub/bacteria/release-59/fasta/bacteria_0_collection/bacillus_subtilis_subsp_subtilis_str_168_gca_000009045/dna/Bacillus_subtilis_subsp_subtilis_str_168_gca_000009045.ASM904v1.dna.toplevel.fa.gz,https://ftp.ensemblgenomes.ebi.ac.uk/pub/bacteria/release-59/gtf/bacteria_0_collection/bacillus_subtilis_subsp_subtilis_str_168_gca_000009045/Bacillus_subtilis_subsp_subtilis_str_168_gca_000009045.ASM904v1.59.gtf.gz,224308,,https://figshare.com/ndownloader/files/46762528,https://figshare.com/ndownloader/files/46762534 +BRADI,Brachypodium distachyon,,59,ensembl_plants,https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-59/fasta/brachypodium_distachyon/dna/Brachypodium_distachyon.Brachypodium_distachyon_v3.0.dna.toplevel.fa.gz,https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-59/gtf/brachypodium_distachyon/Brachypodium_distachyon.Brachypodium_distachyon_v3.0.59.gtf.gz,15368,,, +BRARP,Brassica rapa,,59,ensembl_plants,http://ftp.ensemblgenomes.org/pub/plants/release-59/fasta/brassica_rapa/dna/Brassica_rapa.Brapa_1.0.dna.toplevel.fa.gz,http://ftp.ensemblgenomes.org/pub/plants/release-59/gtf/brassica_rapa/Brassica_rapa.Brapa_1.0.59.gtf.gz,3711,,, +WORM,Caenorhabditis elegans,,112,ensembl,https://ftp.ensembl.org/pub/release-112/fasta/caenorhabditis_elegans/dna/Caenorhabditis_elegans.WBcel235.dna.toplevel.fa.gz,https://ftp.ensembl.org/pub/release-112/gtf/caenorhabditis_elegans/Caenorhabditis_elegans.WBcel235.112.gtf.gz,6239,org.Ce.eg.db,https://figshare.com/ndownloader/files/46762537,https://figshare.com/ndownloader/files/46762540 +ZEBRAFISH,Danio rerio,,112,ensembl,http://ftp.ensembl.org/pub/release-112/fasta/danio_rerio/dna/Danio_rerio.GRCz11.dna.primary_assembly.fa.gz,http://ftp.ensembl.org/pub/release-112/gtf/danio_rerio/Danio_rerio.GRCz11.112.gtf.gz,7955,org.Dr.eg.db,https://figshare.com/ndownloader/files/46762546,https://figshare.com/ndownloader/files/46762543 +FLY,Drosophila 
melanogaster,,112,ensembl,http://ftp.ensembl.org/pub/release-112/fasta/drosophila_melanogaster/dna/Drosophila_melanogaster.BDGP6.46.dna.toplevel.fa.gz,http://ftp.ensembl.org/pub/release-112/gtf/drosophila_melanogaster/Drosophila_melanogaster.BDGP6.46.112.gtf.gz,7227,org.Dm.eg.db,https://figshare.com/ndownloader/files/46762549,https://figshare.com/ndownloader/files/46762552 +ERCC,,,,ThermoFisher,https://assets.thermofisher.com/TFS-Assets/LSG/manuals/ERCC92.zip,https://assets.thermofisher.com/TFS-Assets/LSG/manuals/ERCC92.zip,,,, +ECOLI,Escherichia coli,str. K-12 substr. MG1655,59,ensembl_bacteria,https://ftp.ensemblgenomes.ebi.ac.uk/pub/bacteria/release-59/fasta/bacteria_0_collection/escherichia_coli_str_k_12_substr_mg1655_gca_000005845/dna/Escherichia_coli_str_k_12_substr_mg1655_gca_000005845.ASM584v2.dna.toplevel.fa.gz,https://ftp.ensemblgenomes.ebi.ac.uk/pub/bacteria/release-59/gtf/bacteria_0_collection/escherichia_coli_str_k_12_substr_mg1655_gca_000005845/Escherichia_coli_str_k_12_substr_mg1655_gca_000005845.ASM584v2.59.gtf.gz,511145,,https://figshare.com/ndownloader/files/46762555,https://figshare.com/ndownloader/files/46762558 +HUMAN,Homo sapiens,,112,ensembl,https://ftp.ensembl.org/pub/release-112/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz,https://ftp.ensembl.org/pub/release-112/gtf/homo_sapiens/Homo_sapiens.GRCh38.112.gtf.gz,9606,org.Hs.eg.db,https://figshare.com/ndownloader/files/46762501,https://figshare.com/ndownloader/files/46762495 +MOUSE,Mus musculus,,112,ensembl,https://ftp.ensembl.org/pub/release-112/fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna.primary_assembly.fa.gz,https://ftp.ensembl.org/pub/release-112/gtf/mus_musculus/Mus_musculus.GRCm39.112.gtf.gz,10090,org.Mm.eg.db,https://figshare.com/ndownloader/files/46762504,https://figshare.com/ndownloader/files/46762498 +,Mycobacterium marinum,LHM4,59,ensembl_bacteria,coming soon,coming soon,,,, +ORYSJ,Oryza 
sativa,Japonica,59,ensembl_plants,https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-59/fasta/oryza_sativa/dna/Oryza_sativa.IRGSP-1.0.dna.toplevel.fa.gz,https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-59/gtf/oryza_sativa/Oryza_sativa.IRGSP-1.0.59.gtf.gz,4530,BSgenome.Osativa.MSU.MSU7,, +ORYLA,Oryzias latipes,,112,ensembl,http://ftp.ensembl.org/pub/release-112/fasta/oryzias_latipes/dna/Oryzias_latipes.ASM223467v1.dna.toplevel.fa.gz,http://ftp.ensembl.org/pub/release-112/gtf/oryzias_latipes/Oryzias_latipes.ASM223467v1.112.gtf.gz,8090,,https://figshare.com/ndownloader/files/46762510,https://figshare.com/ndownloader/files/46762507 RAT,Rattus norvegicus,,112,ensembl,http://ftp.ensembl.org/pub/release-112/fasta/rattus_norvegicus/dna/Rattus_norvegicus.mRatBN7.2.dna.toplevel.fa.gz,http://ftp.ensembl.org/pub/release-112/gtf/rattus_norvegicus/Rattus_norvegicus.mRatBN7.2.112.gtf.gz,10116,org.Rn.eg.db,https://figshare.com/ndownloader/files/46537834,https://figshare.com/ndownloader/files/46537867 -YEAST,Saccharomyces cerevisiae,S288C,111,ensembl,,,559292,org.Sc.sgd.db,, -STAA8,Staphylococcus aureus,UAMS-1,58,ensembl_bacteria,,,,,, -,Streptococcus mutans,UA159,58,ensembl_bacteria,,,,,, -BRADI,Brachypodium distachyon,,58,ensembl_plants,,,15368,,, -ORYSJ,Oryza sativa,Japonica,58,ensembl_plants,,,4530,BSgenome.Osativa.MSU.MSU7,, \ No newline at end of file +YEAST,Saccharomyces cerevisiae,S288C,112,ensembl,http://ftp.ensembl.org/pub/release-112/fasta/saccharomyces_cerevisiae/dna/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa.gz,http://ftp.ensembl.org/pub/release-112/gtf/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.112.gtf.gz,559292,org.Sc.sgd.db,https://figshare.com/ndownloader/files/46762516,https://figshare.com/ndownloader/files/46762522 +STAA8,Staphylococcus aureus,UAMS-1,59,ensembl_bacteria,coming soon,coming soon,,,, +,Streptococcus mutans,UA159,59,ensembl_bacteria,,,,,, + \ No newline at end of file From e27b4c7ce1c41cde3a699f50db222782edcec602 Mon 
Sep 17 00:00:00 2001 From: torres-alexis Date: Thu, 11 Jul 2024 20:32:17 -0700 Subject: [PATCH 06/58] [GL_RefAnnotTable] GL_RefAnnotTable-A 1.1.0 --- .../GL-DPPD-7110-A/GL-DPPD-7110-A.md | 149 +++--- .../GL_RefAnnotTable-A/CHANGELOG.md | 29 ++ .../GL_RefAnnotTable-A/README.md | 98 ++++ .../GL-DPPD-7110_build-genome-annots-tab.R | 431 ++++++++++++++++++ .../workflow_code/install-annot-dbi.R | 86 ++++ .../Workflow_Documentation/README.md | 3 +- 6 files changed, 742 insertions(+), 54 deletions(-) create mode 100644 GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/CHANGELOG.md create mode 100644 GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md create mode 100644 GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/GL-DPPD-7110_build-genome-annots-tab.R create mode 100644 GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/install-annot-dbi.R diff --git a/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md b/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md index 95dcdd91..f0a38946 100644 --- a/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md +++ b/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md @@ -4,7 +4,7 @@ --- -**Date:** Month XX, 2024 +**Date:** July 11, 2024 **Revision:** -A **Document Number:** GL-DPPD-7110-A @@ -23,6 +23,11 @@ Barbara Novak (GeneLab Data Processing Lead) ## Updates from previous version +- Updated R from version 4.13 to 4.4.0 +- Updated Bioconductor from 3.15.1 to 3.19.1 +- Added functionality to create an annotation database using AnnotationForge. This applies to organisms without a maintained annotation database package in Bioconductor (e.g. org.Hs.eg.db). This is currently in use for Bacillus subtilis, subsp. 
subtilis 168 (BACSU), Escherichia coli,str. K-12 substr. MG1655 (ECOLI), and Oryzias latipes (ORYLA). +- Added support for BACSU, ECOLI, ORYLA + --- @@ -47,20 +52,19 @@ Barbara Novak (GeneLab Data Processing Lead) |Program|Version|Relevant Links| |:------|:------:|:-------------| -|R|4.2.1|[https://www.r-project.org/](https://www.r-project.org/)| -|Bioconductor|3.15|[https://bioconductor.org](https://bioconductor.org)| -|tidyverse|1.3.2|[https://www.tidyverse.org](https://www.tidyverse.org)| -|STRINGdb|2.8.4|[https://www.bioconductor.org/packages/release/bioc/html/STRINGdb.html](https://www.bioconductor.org/packages/release/bioc/html/STRINGdb.html)| -|PANTHER.db|1.0.11|[https://bioconductor.org/packages/release/data/annotation/html/PANTHER.db.html](https://bioconductor.org/packages/release/data/annotation/html/PANTHER.db.html)| -|rtracklayer|1.56.1|[https://bioconductor.org/packages/release/bioc/html/rtracklayer.html](https://bioconductor.org/packages/release/bioc/html/rtracklayer.html) -|org.Hs.eg.db|3.15.0|[https://bioconductor.org/packages/release/data/annotation/html/org.Hs.eg.db.html](https://bioconductor.org/packages/release/data/annotation/html/org.Hs.eg.db.html)| -|org.Mm.eg.db|3.15.0|[https://bioconductor.org/packages/release/data/annotation/html/org.Mm.eg.db.html](https://bioconductor.org/packages/release/data/annotation/html/org.Mm.eg.db.html)| -|org.Rn.eg.db|3.15.0|[https://bioconductor.org/packages/release/data/annotation/html/org.Rn.eg.db.html](https://bioconductor.org/packages/release/data/annotation/html/org.Rn.eg.db.html) -|org.Dm.eg.db|3.15.0|[https://bioconductor.org/packages/release/data/annotation/html/org.Dm.eg.db.html](https://bioconductor.org/packages/release/data/annotation/html/org.Dm.eg.db.html)| -|org.Ce.eg.db|3.15.0|[https://bioconductor.org/packages/release/data/annotation/html/org.Ce.eg.db.html](https://bioconductor.org/packages/release/data/annotation/html/org.Ce.eg.db.html)| 
-|org.At.tair.db|3.15.0|[https://bioconductor.org/packages/release/data/annotation/html/org.At.tair.db.html](https://bioconductor.org/packages/release/data/annotation/html/org.At.tair.db.html)| -|org.EcK12.eg.db|3.15.0|[https://bioconductor.org/packages/release/data/annotation/html/org.EcK12.eg.db.html](https://bioconductor.org/packages/release/data/annotation/html/org.EcK12.eg.db.html)| -|org.Sc.sgd.db|3.15.0|[https://bioconductor.org/packages/release/data/annotation/html/org.Sc.sgd.db.html](https://bioconductor.org/packages/release/data/annotation/html/org.Sc.sgd.db.html)| +|R|4.4.0|[https://www.r-project.org/](https://www.r-project.org/)| +|Bioconductor|3.19.1|[https://bioconductor.org](https://bioconductor.org)| +|tidyverse|2.0.0|[https://www.tidyverse.org](https://www.tidyverse.org)| +|STRINGdb|2.16.0|[https://www.bioconductor.org/packages/release/bioc/html/STRINGdb.html](https://www.bioconductor.org/packages/release/bioc/html/STRINGdb.html)| +|PANTHER.db|1.0.12|[https://bioconductor.org/packages/release/data/annotation/html/PANTHER.db.html](https://bioconductor.org/packages/release/data/annotation/html/PANTHER.db.html)| +|rtracklayer|1.64.0|[https://bioconductor.org/packages/release/bioc/html/rtracklayer.html](https://bioconductor.org/packages/release/bioc/html/rtracklayer.html) +|org.Hs.eg.db|3.19.1|[https://bioconductor.org/packages/release/data/annotation/html/org.Hs.eg.db.html](https://bioconductor.org/packages/release/data/annotation/html/org.Hs.eg.db.html)| +|org.Mm.eg.db|3.19.1|[https://bioconductor.org/packages/release/data/annotation/html/org.Mm.eg.db.html](https://bioconductor.org/packages/release/data/annotation/html/org.Mm.eg.db.html)| +|org.Rn.eg.db|3.19.1|[https://bioconductor.org/packages/release/data/annotation/html/org.Rn.eg.db.html](https://bioconductor.org/packages/release/data/annotation/html/org.Rn.eg.db.html) 
+|org.Dm.eg.db|3.19.1|[https://bioconductor.org/packages/release/data/annotation/html/org.Dm.eg.db.html](https://bioconductor.org/packages/release/data/annotation/html/org.Dm.eg.db.html)| +|org.Ce.eg.db|3.19.1|[https://bioconductor.org/packages/release/data/annotation/html/org.Ce.eg.db.html](https://bioconductor.org/packages/release/data/annotation/html/org.Ce.eg.db.html)| +|org.At.tair.db|3.19.1|[https://bioconductor.org/packages/release/data/annotation/html/org.At.tair.db.html](https://bioconductor.org/packages/release/data/annotation/html/org.At.tair.db.html)| +|org.Sc.sgd.db|3.19.1|[https://bioconductor.org/packages/release/data/annotation/html/org.Sc.sgd.db.html](https://bioconductor.org/packages/release/data/annotation/html/org.Sc.sgd.db.html)| --- @@ -69,9 +73,9 @@ Barbara Novak (GeneLab Data Processing Lead) > Current GeneLab annotation tables are available on [figshare](https://figshare.com/), exact links for each reference organism are provided in the [GL-DPPD-7110-A_annotations.csv](GL-DPPD-7110-A_annotations.csv) file. 
> > **[Ensembl Reference Files](https://www.ensembl.org/index.html) Used:** -> - Animals: Ensembl release 111 -> - Plants: Ensembl plants release 58 -> - Bacteria: Ensembl bacteria release 58 +> - Animals: Ensembl release 112 +> - Plants: Ensembl plants release 59 +> - Bacteria: Ensembl bacteria release 59 --- @@ -93,25 +97,21 @@ library(rtracklayer) ## Set the primary annotation keytype, TAIR for Arabidopsis, ENSEMBL for all other organisms ## -if ( target_organism == "ARABIDOPSIS" ) { - +if (target_organism == "ARABIDOPSIS") { primary_keytype <- "TAIR" - } else { - primary_keytype <- "ENSEMBL" - } - ## Define annotation keys to retrieve ## wanted_keys_vec <- c("SYMBOL", "GENENAME", "REFSEQ", "ENTREZID") - -## Define links to tables containing species-specific annotation info ## -ref_tab_link <- - "https://raw.githubusercontent.com/nasa/GeneLab_Data_Processing/master/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110/GL-DPPD-7110_annotations.csv" - +## Check for ref table input in arg 2, otherwise load GL-DPPD-7110-A_annotations.csv +if (length(args) >= 2) { + ref_tab_link <- args[2] +} else { + ref_tab_link <- "https://raw.githubusercontent.com/nasa/GeneLab_Data_Processing/master/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv" +} ## Set timeout time to allow more time for annotation file downloads to complete ## options(timeout = 600) @@ -170,14 +170,22 @@ rm(gtf_obj) ## Define target organism annotation database ## ann.dbi <- target_org_db -## Install target organism annotation database if not already installed, then load the annotation database library ## -if ( ! 
require(ann.dbi, character.only = TRUE)) { - +## If ann.dbi is not null, try to install the annotations database from bioconductor, otherwise create it with install-annot-dbi.R "" +if (!is.na(ann.dbi) && ann.dbi != "") { BiocManager::install(ann.dbi, ask = FALSE) - + if (!requireNamespace(ann.dbi, quietly = TRUE)) { + source("install-annot-dbi.R") + ann.dbi <- install_annotations(target_organism, ref_tab_link) + } +} else { + source("install-annot-dbi.R") + ann.dbi <- install_annotations(target_organism, ref_tab_link) } + library(ann.dbi, character.only = TRUE) + + ```
@@ -188,28 +196,46 @@ library(ann.dbi, character.only = TRUE) ```R ## Begin annotation table using unique IDs of the primary keytype ## -annot <- data.frame(unique_IDs) -colnames(annot) <- primary_keytype +if (target_organism == "BACSU") { + gtf_df <- as.data.frame(gtf_obj) + # Create a dataframe with unique gene_ids + annot <- gtf_df %>% + dplyr::select(gene_id, gene_name) %>% + distinct(gene_id, .keep_all = TRUE) + colnames(annot) <- c(primary_keytype, "SYMBOL") +} else { + annot <- data.frame(unique_IDs) + colnames(annot) <- primary_keytype +} + +## If organism is BACSU, remove underscores from gene_ids ## +if (target_organism == "BACSU") { + # Create a mapping of original and modified gene IDs + annot$original_IDs <- annot[[primary_keytype]] + annot[[primary_keytype]] <- gsub("_", "", annot[[primary_keytype]]) +} ## Retrieve and add additional annotation keys as table columns ## for ( key in wanted_keys_vec ) { - - if ( key %in% columns(eval(parse(text = ann.dbi), env = .GlobalEnv))) { - - new_list <- mapIds(eval(parse(text = ann.dbi), env = .GlobalEnv), keys = unique_IDs, keytype = primary_keytype, column = key, multiVals = "list") - - # they come as lists when we accept the multiple hits, so converting to character strings here - annot[[key]] <- sapply(new_list, paste, collapse = "|") - - } else { - - # if the annotation DB didn't have any of the wanted key types, that column will be missing - # adding in here as an empty column - annot[key] <- NA - + + if ( key %in% columns(eval(parse(text = ann.dbi), env = .GlobalEnv))) { + + if (target_organism == "BACSU") { + new_list <- mapIds(eval(parse(text = ann.dbi), env = .GlobalEnv), keys = annot[["SYMBOL"]], keytype = "SYMBOL", column = key, multiVals = "list") + } else if (target_organism == "ECOLI") { + new_list <- mapIds(eval(parse(text = ann.dbi), env = .GlobalEnv), keys = unique_IDs, keytype = "ALIAS", column = key, multiVals = "list") + } else { new_list <- mapIds(eval(parse(text = ann.dbi), env = 
.GlobalEnv), keys = unique_IDs, keytype = primary_keytype, column = key, multiVals = "list") } + annot[[key]] <- sapply(new_list, paste, collapse = "|") + + } else { + # if the annotation DB didn't have any of the wanted key types, that column will be missing + # adding in here as an empty column + annot[key] <- NA + } } + ```
@@ -220,7 +246,7 @@ for ( key in wanted_keys_vec ) { ```R ## Retrieve target organism STRING protein-protein interaction database and create STRING ID map to the primary keytype ## -string_db <- STRINGdb$new(version = "11.5", species = target_taxid, score_threshold = 0) +string_db <- STRINGdb$new(version = "12.0", species = target_taxid, score_threshold = 0) string_map <- string_db$map(annot, primary_keytype, removeUnmappedRows = FALSE, takeFirst = FALSE) ## Create a table using the gene IDs of the primary keytype as row names and a column containing STRING IDs. ## @@ -245,9 +271,20 @@ tab_with_multiple_STRINGids_combined <- ## Add the STRING ID column to the annotation table ## -annot <- dplyr::left_join(annot, - tab_with_multiple_STRINGids_combined, - by = primary_keytype) +if (target_organism == "ECOLI") { + # Add a temporary key for joining in both tables + annot <- annot %>% + mutate(join_key = toupper(ENSEMBL)) + string_map <- string_map %>% + mutate(join_key = toupper(ENSEMBL)) + + # Perform the left join using the temporary key and drop the join_key column if no longer needed + annot <- left_join(annot, string_map %>% dplyr::select(join_key, STRING_id), by = "join_key") %>% + dplyr::select(-join_key) +} else{ + annot <- left_join(annot, tab_with_multiple_STRINGids_combined, by = primary_keytype) +} + ```
@@ -335,6 +372,12 @@ for ( curr_row in 1:dim(annot)[1] ) { ## 6. Export Annotation Table and Build Info ```R +## BACSU-specific: revert gene IDs to originals with underscores ## +if (target_organism == "BACSU") { + annot[["ENSEMBL"]] <- annot$original_IDs + annot$original_IDs <- NULL +} + ## Sort the annotation table based on primary keytype gene IDs ## annot <- annot %>% arrange(.[[1]]) diff --git a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/CHANGELOG.md b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/CHANGELOG.md new file mode 100644 index 00000000..252dcc0c --- /dev/null +++ b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/CHANGELOG.md @@ -0,0 +1,29 @@ +# Changelog + +All notable changes to this project will be documented in this file. + +The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/), +and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html). 
+ +## [1.1.0](https://github.com/nasa/GeneLab_Data_Processing/blob/DEV_GeneLab_Reference_Annotations_vGL-DPPD-7110-A/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A) + +### Added + +- Added AnnotationForge helper script to create local annotations databases if not available on Bioconductor + +- Added support for BACSU, ECOLI, and ORYLA via install-annot-dbi.R + +### Fixed + +- Fixed automated processing for ECOLI + +### Changed + +- Updated Ensembl versions + - Animals: Ensembl release 112 + - Plants: Ensembl plants release 59 + - Bacteria: Ensembl bacteria release 59 +- Removed org.EcK12.eg.db and replaced with local annotations database creation since it is no longer on Bioconductor + + +## [1.0.0](https://github.com/nasa/GeneLab_Data_Processing/releases/tag/GL_RefAnnotTable_1.0.0) diff --git a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md new file mode 100644 index 00000000..73143bb9 --- /dev/null +++ b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md @@ -0,0 +1,98 @@ +# GL_RefAnnotTable Workflow Information and Usage Instructions + +## General workflow info +The current GeneLab Reference Annotation Table (GL_RefAnnotTable) pipeline is implemented as an R workflow that can be run from a command line interface (CLI) using bash. The workflow can be used even if you are unfamiliar with R, but if you want to learn more about R, visit the [R-project about page here](https://www.r-project.org/about.html). Additionally, an introduction to R along with installation help and information about using R for bioinformatics can be found [here at Happy Belly Bioinformatics](https://astrobiomike.github.io/R/basics). + +## Utilizing the workflow + +1. [Install R and R packages](#1-install-r-and-r-packages) +2. [Download the workflow files](#2-download-the-workflow-files) +3. 
[Setup Execution Permission for Workflow Scripts](#3-setup-execution-permission-for-workflow-scripts) +4. [Run the workflow](#4-run-the-workflow) +5. [Run the annotations database creation function as a stand-alone script](#5-run-the-annotations-database-creation-function-as-a-stand-alone-script) +
+ +### 1. Install R and R packages + +We recommend installing R via the [Comprehensive R Archive Network (CRAN)](https://cran.r-project.org/) as follows: + +1. Select the [CRAN Mirror](https://cran.r-project.org/mirrors.html) closest to your location. +2. Click the link under the "Download and Install R" section that's consistent with your machine. +3. Click on the R-4.2.1 package consistent with your machine to download. +4. Double click on the R-4.2.1.pkg downloaded in step 3 and follow the installation instructions. + +Once R is installed, open a CLI terminal and run the following command to activate R: + +```bash +R +``` + +Within an active R environment, run the following commands to install the required R packages: + +```R +## install.packages() has no version argument; this installs the current CRAN release (tidyverse 2.0.0 was used for this pipeline) ## +install.packages("tidyverse", repos = "http://cran.us.r-project.org") + +install.packages("BiocManager", repos = "http://cran.us.r-project.org") + +## The Bioconductor release is passed to BiocManager as a character string ## +BiocManager::install("STRINGdb", version = "3.19") +BiocManager::install("PANTHER.db", version = "3.19") +BiocManager::install("rtracklayer", version = "3.19") +``` +
+ +### 2. Download the Workflow Files + +All files required for utilizing the GL_RefAnnotTable workflow for generating reference annotation tables are in the [workflow_code](workflow_code) directory. To get a copy of latest GL_RefAnnotTable version on to your system, run the following command: + +```bash +curl -LO https://github.com/nasa/GeneLab_Data_Processing/releases/download/GL_RefAnnotTable-A_1.1.0/GL_RefAnnotTable-A_1.1.0.zip +``` + +
+ +### 3. Setup Execution Permission for Workflow Scripts + +Once you've downloaded the GL_RefAnnotTable workflow directory as a zip file, unzip the workflow, then `cd` into the GL_RefAnnotTable-A_1.1.0 directory on the CLI. Next, run the following command to set the execution permissions for the R scripts: + +```bash +chmod -R u+x *R +``` +
+ +### 4. Run the Workflow + +While in the GL_RefAnnotTable workflow directory, you are now able to run the workflow. Below is an example of how to run the workflow to build an annotation table for Mus musculus (mouse): + +```bash +Rscript GL-DPPD-7110_build-genome-annots-tab.R MOUSE +``` + +**Input data:** + +- No input files required, but a target organism must be specified as a positional command line argument; `MOUSE` is used in the example above. Run `Rscript GL-DPPD-7110_build-genome-annots-tab.R` with no positional arguments to see the list of currently available organisms. + +- Optional: a reference table CSV can be supplied as a second positional argument instead of using the default [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) + +**Output data:** + +- *-GL-annotations.tsv (Tab-delimited table of gene annotations) +- *-GL-build-info.txt (Text file containing information used to create the annotation table, including tools, tool versions, and date of creation) + +### 5. Run the annotations database creation function as a stand-alone script + +When the workflow is run, if the reference table does not specify an annotations database for the target_organism in the `annotations` column, the `install_annotations` function, defined in the `install-annot-dbi.R` script, will be executed. This function will locally create and install an annotations database R package using AnnotationForge. It can also be run as a stand-alone script from the command line: + +```bash +Rscript install-annot-dbi.R BACSU /path/to/GL-DPPD-7110-A_annotations.csv +``` + +**Input data:** + +- The target organism must be specified as the first positional command line argument; `BACSU` is used in the example above.
+- The path to a local reference table must also be supplied as the second positional argument + +Output data: + +- org.*.eg.db/ (species-specific annotation database, as a local R package) \ No newline at end of file diff --git a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/GL-DPPD-7110_build-genome-annots-tab.R b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/GL-DPPD-7110_build-genome-annots-tab.R new file mode 100644 index 00000000..8f098db4 --- /dev/null +++ b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/GL-DPPD-7110_build-genome-annots-tab.R @@ -0,0 +1,431 @@ +#!/usr/bin/env Rscript + +# Written by Mike Lee +# GeneLab script for generating organism ENSEMBL annotation tables +# Example usage: Rscript GL-DPPD-7110_build-genome-annots-tab.R MOUSE + +GL_DPPD_ID <- "GL-DPPD-7110-A" + +######################################################################### +############### Pull In and Check Command Line Arguments ################ +######################################################################### + + +## Import command line arguments ## + +args <- commandArgs(trailingOnly = TRUE) + +## Define currently acceptable input organisms (matching names in ref organisms.csv table) ## + +currently_accepted_orgs <- c("ARABIDOPSIS", + "FLY", + "HUMAN", + "MOUSE", + "RAT", + "WORM", + "YEAST", + "ZEBRAFISH", + "BACSU", + "ECOLI", + "ORYLA") + +## Check that at least one positional command line argument was provided ## + +if ( length(args) < 1 ) { + cat("\n One positional argument is required that specifies the target organism. 
Currently available include:\n") + + for ( item in currently_accepted_orgs ) { + + cat(paste0("\n ", item)) + } + + cat("\n\n") + + quit() + +} else { + + suppressWarnings(target_organism <- toupper(args[1])) + +} + + +## Check that the positional argument provided is acceptable ## + +if (!target_organism %in% currently_accepted_orgs) { + + cat(paste0("\n '", args[1], "' is not currently supported. \n")) + cat(" Creation of this annotation table will likely involve manual processing.\n\n") + + quit() + +} + + +## checking for required packages other than the org-specific db ## + +# helper function for pointing to GL setup page if missing a package +report_package_needed <- function(package_name) { + cat(paste0("\n The package '", package_name, "' is required. Please see:\n")) + cat(" https://github.com/nasa/GeneLab_Data_Processing/tree/master/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable/README.md\n\n") + quit() +} + +# checking and reporting +if (!requireNamespace("tidyverse", quietly = TRUE)) + report_package_needed("tidyverse") + +if (!requireNamespace("BiocManager", quietly = TRUE)) + report_package_needed("BiocManager") + +if (!requireNamespace("STRINGdb", quietly = TRUE)) + report_package_needed("STRINGdb") + +if (!requireNamespace("PANTHER.db", quietly = TRUE)) + report_package_needed("PANTHER.db") + +if (!requireNamespace("rtracklayer", quietly = TRUE)) + report_package_needed("rtracklayer") + +######################################################################### +######################## Set Up Environment ############################# +######################################################################### + +## Import libraries ## + +library(tidyverse) +library(STRINGdb) +library(PANTHER.db) +library(rtracklayer) + +# Set the primary key type based on the target organism +if (target_organism == "ARABIDOPSIS") { + primary_keytype <- "TAIR" +} else { + primary_keytype <- "ENSEMBL" +} + +## Define annotation keys to retrieve ## 
+ +wanted_keys_vec <- c("SYMBOL", "GENENAME", "REFSEQ", "ENTREZID") + +## Define links to tables containing species-specific annotation info ## + +if (length(args) >= 2) { + ref_tab_link <- args[2] +} else { + ref_tab_link <- "https://raw.githubusercontent.com/nasa/GeneLab_Data_Processing/master/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv" +} + + +######################################################################### +############## Define Variables and Output File Names ################### +######################################################################### + + +## Set timeout time to ensure annotation file downloads will complete ## + +options(timeout = 600) + +## Read in tables containing species-specific annotation info ## + +ref_table <- read.csv(ref_tab_link) + +## Retrieve and define target organism taxid, annotation database name, and scientific name ## + +target_taxid <- ref_table %>% + filter(name == target_organism) %>% + pull(taxon) + +target_org_db <- ref_table %>% + filter(name == target_organism) %>% + pull(annotations) + +target_species_designation <- ref_table %>% + filter(name == target_organism) %>% + pull(species) + +## Define link to Ensembl annotation gtf file for the target organism ## + +gtf_link <- ref_table %>% + filter(species == target_species_designation) %>% + pull(gtf) + +## Create output files names ## + +base_gtf_filename <- basename(gtf_link) +base_output_name <- str_replace(base_gtf_filename, ".gtf.gz", "") + +out_table_filename <- paste0(base_output_name, "-GL-annotations.tsv") +out_log_filename <- paste0(base_output_name, "-GL-build-info.txt") + +## Check if output file already exists and if it does, exit without overwriting ## + +if ( file.exists(out_table_filename) ) { + + cat("\n-------------------------------------------------------------------------------------------------\n") + cat(paste0("\n The file that would be created, '", out_table_filename, "', 
exists already.\n")) + cat(paste0(" We don't want to overwrite it accidentally. Move it and run this again if wanting to proceed.\n")) + cat("\n-------------------------------------------------------------------------------------------------\n") + + quit() + +} + + +######################################################################### +######## Load Annotation Databases and Retrieve Unique Gene IDs ######### +######################################################################### + + +## Import Ensembl annotation gtf file for the target organism ## + +gtf_obj <- import(gtf_link) + +## Define unique Ensembl IDs ## + +unique_IDs <- gtf_obj$gene_id %>% unique() + +## Define target organism annotation database ## +ann.dbi <- target_org_db + + +## If ann.dbi is not null, try to install the annotations database from bioconductor, else create with install-annot-dbi.R +if (!is.na(ann.dbi) && ann.dbi != "") { + BiocManager::install(ann.dbi, ask = FALSE) + if (!requireNamespace(ann.dbi, quietly = TRUE)) { + source("install-annot-dbi.R") + ann.dbi <- install_annotations(target_organism, ref_tab_link) + } +} else { + source("install-annot-dbi.R") + ann.dbi <- install_annotations(target_organism, ref_tab_link) +} + + +library(ann.dbi, character.only = TRUE) + + +######################################################################### +######################## Build Annotation Table ######################### +######################################################################### + +## Begin annotation table using unique IDs of the primary keytype ## + +if (target_organism == "BACSU") { + gtf_df <- as.data.frame(gtf_obj) + # Create a dataframe with unique gene_ids + annot <- gtf_df %>% + dplyr::select(gene_id, gene_name) %>% + distinct(gene_id, .keep_all = TRUE) + colnames(annot) <- c(primary_keytype, "SYMBOL") +} else { + annot <- data.frame(unique_IDs) + colnames(annot) <- primary_keytype +} + +# If organism is BACSU, remove underscores from gene_ids that are present 
in the GTF +if (target_organism == "BACSU") { + # Create a mapping of original and modified gene IDs + annot$original_IDs <- annot[[primary_keytype]] + annot[[primary_keytype]] <- gsub("_", "", annot[[primary_keytype]]) +} + +## Add additional annotation keys as table columns ## + +for ( key in wanted_keys_vec ) { + + if ( key %in% columns(eval(parse(text = ann.dbi), env = .GlobalEnv))) { + + if (target_organism == "BACSU") { + new_list <- mapIds(eval(parse(text = ann.dbi), env = .GlobalEnv), keys = annot[["SYMBOL"]], keytype = "SYMBOL", column = key, multiVals = "list") + } else if (target_organism == "ECOLI") { + new_list <- mapIds(eval(parse(text = ann.dbi), env = .GlobalEnv), keys = unique_IDs, keytype = "ALIAS", column = key, multiVals = "list") + } else { new_list <- mapIds(eval(parse(text = ann.dbi), env = .GlobalEnv), keys = unique_IDs, keytype = primary_keytype, column = key, multiVals = "list") + } + annot[[key]] <- sapply(new_list, paste, collapse = "|") + + } else { + # if the annotation DB didn't have any of the wanted key types, that column will be missing + # adding in here as an empty column + annot[key] <- NA + } +} + + +######################################################################### +########################### Add STRING IDs ############################## +######################################################################### + +## Retrieve target organism STRING protein-protein interaction database and create STRING ID map to the primary keytype ## + +# for some organisms, the taxonid is not supported by STRING. 
+taxid_map <- list( + YEAST = 4932 +) + +# Assign the tax ID based on the target organism +if (target_organism %in% names(taxid_map)) { + target_taxid <- taxid_map[[target_organism]] +} + + +## Remove gtf object to conserve RAM, since it is no longer needed ## +rm(gtf_obj) + +string_db <- STRINGdb$new(version = "12.0", species = target_taxid, score_threshold = 0) +string_map <- string_db$map(annot, primary_keytype, removeUnmappedRows = FALSE, takeFirst = FALSE) + + +## Adding some blank lines just for spacing on print-out ## +cat("\n\n") + +## Create a table using the gene IDs of the primary keytype as row names and a column containing STRING IDs. For genes containing multiple STRING IDs, combine all STRING IDs for each gene into one row and separate each ID with a '|' ## + +tab_with_multiple_STRINGids_combined <- + data.frame(row.names = annot[[primary_keytype]]) + +for ( curr_gene_ID in row.names(tab_with_multiple_STRINGids_combined) ) { + + curr_STRING_ids <- string_map %>% + filter(!!rlang::sym(primary_keytype) == curr_gene_ID) %>% + pull(STRING_id) %>% paste(collapse = "|") + + tab_with_multiple_STRINGids_combined[curr_gene_ID, "STRING_id"] <- curr_STRING_ids + +} + +## Move the primary keytype gene IDs back to being a column in the STRING ID table (since they were switched to row names above) ## + +tab_with_multiple_STRINGids_combined <- + tab_with_multiple_STRINGids_combined %>% + rownames_to_column(primary_keytype) + +## Add the STRING ID column to the annotation table ## + +if (target_organism == "ECOLI") { + # Add a temporary key for joining in both tables + annot <- annot %>% + mutate(join_key = toupper(ENSEMBL)) + string_map <- string_map %>% + mutate(join_key = toupper(ENSEMBL)) + + # Perform the left join using the temporary key and drop the join_key column if no longer needed + annot <- left_join(annot, string_map %>% dplyr::select(join_key, STRING_id), by = "join_key") %>% + dplyr::select(-join_key) +} else{ + annot <- left_join(annot, 
tab_with_multiple_STRINGids_combined, by = primary_keytype) +} + + + + +######################################################################### +################ Add Gene Ontology (GO) slim IDs ######################## +######################################################################### + + +## Retrieve target organism PANTHER GO slim annotations database ## + +pthOrganisms(PANTHER.db) <- target_organism + +## Use ENTREZ IDs to map genes to respective PANTHER GO slim annotation(s) ## + +## Note: Since there can be none (indicated in the annotation table as "NA"), one, or multiple ENTREZ IDs for a gene, this section contains 3 distinct parts to handle each of those scenarios and create a new column in the annotation table containing the GO slim IDs ## + +for ( curr_row in 1:dim(annot)[1] ) { + + curr_entry <- annot[curr_row, "ENTREZID"] + + ## For genes without an ENTREZ ID ## + if ( curr_entry == "NA" ) { + + annot[curr_row, "GOSLIM_IDS"] <- "NA" + + } else if ( ! grepl("|", curr_entry, fixed = TRUE) ) { + + ## For genes with one ENTREZ ID ## + curr_GO_IDs <- mapIds(PANTHER.db, keys = curr_entry, keytype = "ENTREZ", column = "GOSLIM_ID", multiVals = "list") %>% unlist() %>% as.vector() + + ## Add "NA" to the GO slim column for ENTREZ IDs that do not contain a respective GO slim ID ## + if ( is.null(curr_GO_IDs) ) { + + curr_GO_IDs <- "NA" + } + + annot[curr_row, "GOSLIM_IDS"] <- paste(curr_GO_IDs, collapse = "|") + + } else { + + ## For genes with multiple ENTREZ ID ## + ## Note: In this scenario, the ENTREZ IDs for each gene are first split with a '|' to separate the IDs, then the GO slim ID(s) for each ENTREZ ID are collected and combined, then duplicates are removed, and the final list of GO slim IDs for each gene are added in a single row, separated with a '|' ## + + ## Split the ENTREZ IDs ## + curr_entry_vec <- strsplit(curr_entry, "|", fixed = TRUE) + + ## Start a vector of current GO slim IDs ## + curr_GO_IDs <- vector() + + ## Collect and combine 
GO slim ID(s) for each ENTREZ ID ## + for ( curr_entry in curr_entry_vec ) { + + new_GO_IDs <- mapIds(PANTHER.db, keys = curr_entry, keytype = "ENTREZ", column = "GOSLIM_ID", multiVals = "list") %>% unlist() %>% as.vector() + + ## Add new GO slim IDs to the GO slim IDs vector ## + curr_GO_IDs <- c(curr_GO_IDs, new_GO_IDs) + + } + + ## Remove duplicate GO slim IDs ## + curr_GO_IDs <- unique(curr_GO_IDs) + + ## Add "NA" to the GO slim vector for ENTREZ IDs that do not contain a respective GO slim ID ## + if ( length(curr_GO_IDs) == 0 ) { + + curr_GO_IDs <- "NA" + } + + ## Add additional GO slim IDs to the GOSLIM ID column in the annotation table ## + annot[curr_row, "GOSLIM_IDS"] <- paste(curr_GO_IDs, collapse = "|") + + } + +} + + +######################################################################### +############# Export Annotation Table and Build Info #################### +######################################################################### + +## BACSU-specific: revert gene IDs to originals with underscores ## +if (target_organism == "BACSU") { + annot[["ENSEMBL"]] <- annot$original_IDs + annot$original_IDs <- NULL +} + +## Sort the annotation table based on primary keytype gene IDs ## + +annot <- annot %>% arrange(.[[1]]) + +## Replacing any blank cells with NA ## +annot[annot == ""] <- NA + +## Export the annotation table ## + +write.table(annot, out_table_filename, sep = "\t", quote = FALSE, row.names = FALSE) + +## Define the date the annotation table was generated ## + +date_generated <- format(Sys.time(), "%d-%B-%Y") + +## Export annotation table build info ## + +writeLines(paste(c("Based on:\n ", GL_DPPD_ID), collapse = ""), out_log_filename) +write(paste(c("\nBuild done on:\n ", date_generated), collapse = ""), out_log_filename, append = TRUE) +write(paste(c("\nUsed gtf file:\n ", gtf_link), collapse = ""), out_log_filename, append = TRUE) +write(paste(c("\nUsed ", ann.dbi, " version:\n ", packageVersion(ann.dbi) %>% as.character()), collapse = 
""), out_log_filename, append = TRUE) +write(paste(c("\nUsed STRINGdb version:\n ", packageVersion("STRINGdb") %>% as.character()), collapse = ""), out_log_filename, append = TRUE) +write(paste(c("\nUsed PANTHER.db version:\n ", packageVersion("PANTHER.db") %>% as.character()), collapse = ""), out_log_filename, append = TRUE) + +write("\n\nAll session info:\n", out_log_filename, append = TRUE) +write(capture.output(sessionInfo()), out_log_filename, append = TRUE) diff --git a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/install-annot-dbi.R b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/install-annot-dbi.R new file mode 100644 index 00000000..e5af82f6 --- /dev/null +++ b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/install-annot-dbi.R @@ -0,0 +1,86 @@ +# install-annot-dbi.R + +# Function: Get annotations db from ref table. If no annotations db is defined, create the package name from genus, species, (and strain for microbes), +# Try to Bioconductor install annotations db. If fail then build the package using AnnotationForge, install it into the current directory. +# Requires ~80GB for NCBIFilesDir file caching +install_annotations <- function(target_organism, refTablePath) { + if (!file.exists(refTablePath)) { + stop("Reference table file does not exist at the specified path: ", refTablePath) + } + + ref_table <- read.csv(refTablePath) + target_taxid <- ref_table %>% + filter(name == target_organism) %>% + pull(taxon) + + # Get package name or build it if not provided + target_org_db <- ref_table %>% + filter(name == target_organism) %>% + pull(annotations) + + if (is.na(target_org_db) || target_org_db == "") { + cat("\nNo annotation database specified. Constructing package name...\n") + target_species_designation <- ref_table %>% + filter(name == target_organism) %>% + pull(species) %>% + gsub("\\s+", " ", .) %>% + gsub("[^A-Za-z0-9 ]", "", .) 
+ + genus_species <- strsplit(target_species_designation, " ")[[1]] + if (length(genus_species) < 1) { + stop("Species designation is not correctly formatted: ", target_species_designation) + } + + genus <- genus_species[1] + species <- ifelse(length(genus_species) > 1, genus_species[2], "") + strain <- ref_table %>% + filter(name == target_organism) %>% + pull(strain) %>% + gsub("[^A-Za-z0-9]", "", .) + + if (!is.na(strain) && strain != "") { + species <- paste0(species, strain) + } + + target_org_db <- paste0("org.", substr(genus, 1, 1), species, ".eg.db") + } + + cat(paste0("\nChecking Bioconductor for '", target_org_db, "'...\n")) + if (requireNamespace(target_org_db, quietly = TRUE)) { + cat(paste0("'", target_org_db, "' is already installed.\n")) + } else { + cat(paste0("\nAttempting to install '", target_org_db, "' from Bioconductor...\n")) + BiocManager::install(target_org_db, ask = FALSE) + if (requireNamespace(target_org_db, quietly = TRUE)) { + cat(paste0("'", target_org_db, "' has been successfully installed from Bioconductor.\n")) + } else { + cat(paste0("\nInstallation from Bioconductor failed, attempting to build '", target_org_db, "'...\n")) + if (!dir.exists(target_org_db)) { + tryCatch({ + BiocManager::install(c("AnnotationForge", "biomaRt", "GO.db"), ask = FALSE) + library(AnnotationForge) + makeOrgPackageFromNCBI( + version = "0.1", + author = "Your Name ", + maintainer = "Your Name ", + outputDir = "./", + tax_id = target_taxid, + genus = genus, + species = species + ) + install.packages(file.path("./", target_org_db), repos = NULL, type = "source", quiet = TRUE) + cat(paste0("'", target_org_db, "' has been successfully built and installed.\n")) + }, error = function(e) { + stop("Failed to build and load the package: ", target_org_db, "\nError: ", e$message) + }) + } else { + cat(paste0("Local annotation package ", target_org_db, " already exists. 
This local package will be installed.")) + install.packages(file.path("./", target_org_db), repos = NULL, type = "source", quiet = TRUE) + } + } + } + + library(target_org_db, character.only = TRUE) + cat(paste0("Using Annotation Database '", target_org_db, "'.\n")) + return(target_org_db) +} \ No newline at end of file diff --git a/GeneLab_Reference_Annotations/Workflow_Documentation/README.md b/GeneLab_Reference_Annotations/Workflow_Documentation/README.md index 421b2a10..20034465 100644 --- a/GeneLab_Reference_Annotations/Workflow_Documentation/README.md +++ b/GeneLab_Reference_Annotations/Workflow_Documentation/README.md @@ -6,8 +6,9 @@ |Pipeline Version|Current Workflow Version (for respective pipeline version)| |:---------------|:---------------------------------------------------------| +|*[GL-DPPD-7110-A.md](../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md)|[1.1.0](GL_RefAnnotTable-A)| |*[GL-DPPD-7110.md](../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110/GL-DPPD-7110.md)|[1.0.0](GL_RefAnnotTable)| *Current GeneLab Pipeline/Workflow Implementation -> See the [workflow change log](GL_RefAnnotTable/CHANGELOG.md) to access previous workflow versions and view all changes associated with each version update. +> See the [workflow change log](GL_RefAnnotTable-A/CHANGELOG.md) to access previous workflow versions and view all changes associated with each version update. 
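Note on the package naming in `install-annot-dbi.R` above: when the reference table has no `annotations` entry, the script builds the package name from the genus initial, the species, and (for microbes) the strain. The steps can be sketched in base R as below. This is a minimal illustration, not the workflow's code: `build_org_db_name` is a hypothetical helper introduced here, and it assumes the species designation is formatted as "Genus species".

```r
# Hypothetical helper (not part of the workflow): mirrors the name-building
# steps of install_annotations() using base R only.
build_org_db_name <- function(species_designation, strain = "") {
  # Collapse runs of whitespace, then drop everything except letters, digits, spaces
  cleaned <- gsub("[^A-Za-z0-9 ]", "", gsub("\\s+", " ", species_designation))
  parts <- strsplit(cleaned, " ")[[1]]
  if (length(parts) < 1) {
    stop("Species designation is not correctly formatted: ", species_designation)
  }

  genus <- parts[1]
  species <- if (length(parts) > 1) parts[2] else ""

  # Strain text is reduced to letters and digits and appended to the species
  strain <- gsub("[^A-Za-z0-9]", "", strain)
  if (!is.na(strain) && nzchar(strain)) {
    species <- paste0(species, strain)
  }

  # Genus initial + species (+ strain), wrapped in the org.*.eg.db pattern
  paste0("org.", substr(genus, 1, 1), species, ".eg.db")
}

print(build_org_db_name("Bacillus subtilis", "subsp. subtilis 168"))
# [1] "org.Bsubtilissubspsubtilis168.eg.db"
print(build_org_db_name("Oryzias latipes"))
# [1] "org.Olatipes.eg.db"
```

Under these assumptions the results agree with the locally built database names that the later commit records in the `annotations` column of the reference table for BACSU and ORYLA.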
From e22158d88c8ff7cd318f8bfe6e7948b080d917b4 Mon Sep 17 00:00:00 2001 From: torres-alexis Date: Sun, 4 Aug 2024 07:42:04 -0700 Subject: [PATCH 07/58] [GL_RefAnnotTable] Initial microbes updates - Included Ensembl versions in Updates section of DPPD document - Fixed formatting in reference table - Recreated annotation tables, fixed broken columns - Added locally created databases to reference table - Fixed R version in workflow document R installation instructions - Updated reference table directory in workflow document instructions step #3 - Microbes-related changes: Added microbes to reference table, updated changelog --- .../GL-DPPD-7110-A/GL-DPPD-7110-A.md | 39 +++++++++++++++--- .../GL-DPPD-7110-A_annotations.csv | 40 ++++++++++--------- GeneLab_Reference_Annotations/README.md | 4 +- .../GL_RefAnnotTable-A/CHANGELOG.md | 25 ++++++++++-- .../GL_RefAnnotTable-A/README.md | 13 +++--- ... GL-DPPD-7110-A_build-genome-annots-tab.R} | 0 .../Workflow_Documentation/README.md | 2 +- 7 files changed, 86 insertions(+), 37 deletions(-) rename GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/{GL-DPPD-7110_build-genome-annots-tab.R => GL-DPPD-7110-A_build-genome-annots-tab.R} (100%) diff --git a/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md b/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md index f0a38946..ab80f7ab 100644 --- a/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md +++ b/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md @@ -22,12 +22,39 @@ Barbara Novak (GeneLab Data Processing Lead) ## Updates from previous version - -- Updated R from version 4.13 to 4.4.0 -- Updated Bioconductor from 3.15.1 to 3.19.1 -- Added functionality to create an annotation database using AnnotationForge. 
This applies to organisms without a maintained annotation database package in Bioconductor (e.g. org.Hs.eg.db). This is currently in use for Bacillus subtilis, subsp. subtilis 168 (BACSU), Escherichia coli,str. K-12 substr. MG1655 (ECOLI), and Oryzias latipes (ORYLA). -- Added support for BACSU, ECOLI, ORYLA - +Ensembl Releases: +- Animals: Updated from release 107 to 112 +- Plants: Updated from release 54 to 59 +- Bacteria: Updated from release 54 to 59 + + +Added NCBI as a reference source for FASTA and GTF files for bacteria to improve annotations. + +Updated R version from 4.1.3 to 4.4.0. + +Updated Bioconductor version from 3.15.1 to 3.19.1. + +Added support for: +- Bacillus subtilis, subsp. subtilis 168 +- Brachypodium distachyon +- Escherichia coli,str. K-12 substr. MG1655 +- Oryzias latipes +- Lactobacillus acidophilus NCFM +- Mycobacterium marinum M +- Oryza sativa Japonica +- Pseudomonas aeruginosa UCBPP-PA14 +- Salmonella enterica subsp. enterica serovar Typhimurium str. LT2 +- Serratia liquefaciens ATCC 27592 +- Staphylococcus aureus MRSA252 +- Streptococcus mutans UA159 +- Vibrio fischeri ES114 + +Added functionality to create an annotation database using AnnotationForge. This applies to organisms without a maintained annotation database package in Bioconductor (e.g. org.Hs.eg.db). This is currently in use for the following organisms: +- Bacillus subtilis, subsp. subtilis 168 +- Brachypodium distachyon +- Escherichia coli, str. K-12 substr. MG1655 +- Oryzias latipes +- Salmonella enterica subsp. enterica serovar Typhimurium str. 
LT2 --- diff --git a/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv b/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv index ac4490a0..0f33b315 100644 --- a/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv +++ b/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv @@ -1,20 +1,24 @@ name,species,strain,ensemblVersion,ref_source,fasta,gtf,taxon,annotations,genelab_annots_link,genelab_annots_info_link -ARABIDOPSIS,Arabidopsis thaliana,,59,ensembl_plants,https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-59/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz,https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-59/gtf/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.59.gtf.gz,3702,org.At.tair.db,https://figshare.com/ndownloader/files/46762531,https://figshare.com/ndownloader/files/46762525 -BACSU,Bacillus subtilis,subsp. 
subtilis 168,59,ensembl_bacteria,https://ftp.ensemblgenomes.ebi.ac.uk/pub/bacteria/release-59/fasta/bacteria_0_collection/bacillus_subtilis_subsp_subtilis_str_168_gca_000009045/dna/Bacillus_subtilis_subsp_subtilis_str_168_gca_000009045.ASM904v1.dna.toplevel.fa.gz,https://ftp.ensemblgenomes.ebi.ac.uk/pub/bacteria/release-59/gtf/bacteria_0_collection/bacillus_subtilis_subsp_subtilis_str_168_gca_000009045/Bacillus_subtilis_subsp_subtilis_str_168_gca_000009045.ASM904v1.59.gtf.gz,224308,,https://figshare.com/ndownloader/files/46762528,https://figshare.com/ndownloader/files/46762534 -BRADI,Brachypodium distachyon,,59,ensembl_plants,https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-59/fasta/brachypodium_distachyon/dna/Brachypodium_distachyon.Brachypodium_distachyon_v3.0.dna.toplevel.fa.gz,https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-59/gtf/brachypodium_distachyon/Brachypodium_distachyon.Brachypodium_distachyon_v3.0.59.gtf.gz,15368,,, -BRARP,Brassica rapa,,59,ensembl_plants,http://ftp.ensemblgenomes.org/pub/plants/release-59/fasta/brassica_rapa/dna/Brassica_rapa.Brapa_1.0.dna.toplevel.fa.gz,http://ftp.ensemblgenomes.org/pub/plants/release-59/gtf/brassica_rapa/Brassica_rapa.Brapa_1.0.59.gtf.gz,3711,,, -WORM,Caenorhabditis elegans,,112,ensembl,https://ftp.ensembl.org/pub/release-112/fasta/caenorhabditis_elegans/dna/Caenorhabditis_elegans.WBcel235.dna.toplevel.fa.gz,https://ftp.ensembl.org/pub/release-112/gtf/caenorhabditis_elegans/Caenorhabditis_elegans.WBcel235.112.gtf.gz,6239,org.Ce.eg.db,https://figshare.com/ndownloader/files/46762537,https://figshare.com/ndownloader/files/46762540 -ZEBRAFISH,Danio rerio,,112,ensembl,http://ftp.ensembl.org/pub/release-112/fasta/danio_rerio/dna/Danio_rerio.GRCz11.dna.primary_assembly.fa.gz,http://ftp.ensembl.org/pub/release-112/gtf/danio_rerio/Danio_rerio.GRCz11.112.gtf.gz,7955,org.Dr.eg.db,https://figshare.com/ndownloader/files/46762546,https://figshare.com/ndownloader/files/46762543 -FLY,Drosophila 
melanogaster,,112,ensembl,http://ftp.ensembl.org/pub/release-112/fasta/drosophila_melanogaster/dna/Drosophila_melanogaster.BDGP6.46.dna.toplevel.fa.gz,http://ftp.ensembl.org/pub/release-112/gtf/drosophila_melanogaster/Drosophila_melanogaster.BDGP6.46.112.gtf.gz,7227,org.Dm.eg.db,https://figshare.com/ndownloader/files/46762549,https://figshare.com/ndownloader/files/46762552 +ARABIDOPSIS,Arabidopsis thaliana,,59,ensembl_plants,https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-59/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz,https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-59/gtf/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.59.gtf.gz,3702,org.At.tair.db,https://figshare.com/ndownloader/files/48166390,https://figshare.com/ndownloader/files/48166381 +BACSU,Bacillus subtilis,subsp. subtilis 168,59,ensembl_bacteria,https://ftp.ensemblgenomes.ebi.ac.uk/pub/bacteria/release-59/fasta/bacteria_0_collection/bacillus_subtilis_subsp_subtilis_str_168_gca_000009045/dna/Bacillus_subtilis_subsp_subtilis_str_168_gca_000009045.ASM904v1.dna.toplevel.fa.gz,https://ftp.ensemblgenomes.ebi.ac.uk/pub/bacteria/release-59/gtf/bacteria_0_collection/bacillus_subtilis_subsp_subtilis_str_168_gca_000009045/Bacillus_subtilis_subsp_subtilis_str_168_gca_000009045.ASM904v1.59.gtf.gz,224308,org.Bsubtilissubspsubtilis168.eg.db,https://figshare.com/ndownloader/files/48166384,https://figshare.com/ndownloader/files/48166387 +BRADI,Brachypodium distachyon,,59,ensembl_plants,https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-59/fasta/brachypodium_distachyon/dna/Brachypodium_distachyon.Brachypodium_distachyon_v3.0.dna.toplevel.fa.gz,https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-59/gtf/brachypodium_distachyon/Brachypodium_distachyon.Brachypodium_distachyon_v3.0.59.gtf.gz,15368,org.Bdistachyon.eg.db,https://figshare.com/ndownloader/files/48166399,https://figshare.com/ndownloader/files/48166393 +BRARP,Brassica 
rapa,,59,ensembl_plants,http://ftp.ensemblgenomes.org/pub/plants/release-59/fasta/brassica_rapa/dna/Brassica_rapa.Brapa_1.0.dna.toplevel.fa.gz,http://ftp.ensemblgenomes.org/pub/plants/release-59/gtf/brassica_rapa/Brassica_rapa.Brapa_1.0.59.gtf.gz,,,, +WORM,Caenorhabditis elegans,,112,ensembl,https://ftp.ensembl.org/pub/release-112/fasta/caenorhabditis_elegans/dna/Caenorhabditis_elegans.WBcel235.dna.toplevel.fa.gz,https://ftp.ensembl.org/pub/release-112/gtf/caenorhabditis_elegans/Caenorhabditis_elegans.WBcel235.112.gtf.gz,6239,org.Ce.eg.db,https://figshare.com/ndownloader/files/48166402,https://figshare.com/ndownloader/files/48166396 +ZEBRAFISH,Danio rerio,,112,ensembl,http://ftp.ensembl.org/pub/release-112/fasta/danio_rerio/dna/Danio_rerio.GRCz11.dna.primary_assembly.fa.gz,http://ftp.ensembl.org/pub/release-112/gtf/danio_rerio/Danio_rerio.GRCz11.112.gtf.gz,7955,org.Dr.eg.db,https://figshare.com/ndownloader/files/48166414,https://figshare.com/ndownloader/files/48166405 +FLY,Drosophila melanogaster,,112,ensembl,http://ftp.ensembl.org/pub/release-112/fasta/drosophila_melanogaster/dna/Drosophila_melanogaster.BDGP6.46.dna.toplevel.fa.gz,http://ftp.ensembl.org/pub/release-112/gtf/drosophila_melanogaster/Drosophila_melanogaster.BDGP6.46.112.gtf.gz,7227,org.Dm.eg.db,https://figshare.com/ndownloader/files/48166411,https://figshare.com/ndownloader/files/48166408 ERCC,,,,ThermoFisher,https://assets.thermofisher.com/TFS-Assets/LSG/manuals/ERCC92.zip,https://assets.thermofisher.com/TFS-Assets/LSG/manuals/ERCC92.zip,,,, -ECOLI,Escherichia coli,str. K-12 substr. 
MG1655,59,ensembl_bacteria,https://ftp.ensemblgenomes.ebi.ac.uk/pub/bacteria/release-59/fasta/bacteria_0_collection/escherichia_coli_str_k_12_substr_mg1655_gca_000005845/dna/Escherichia_coli_str_k_12_substr_mg1655_gca_000005845.ASM584v2.dna.toplevel.fa.gz,https://ftp.ensemblgenomes.ebi.ac.uk/pub/bacteria/release-59/gtf/bacteria_0_collection/escherichia_coli_str_k_12_substr_mg1655_gca_000005845/Escherichia_coli_str_k_12_substr_mg1655_gca_000005845.ASM584v2.59.gtf.gz,511145,,https://figshare.com/ndownloader/files/46762555,https://figshare.com/ndownloader/files/46762558 -HUMAN,Homo sapiens,,112,ensembl,https://ftp.ensembl.org/pub/release-112/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz,https://ftp.ensembl.org/pub/release-112/gtf/homo_sapiens/Homo_sapiens.GRCh38.112.gtf.gz,9606,org.Hs.eg.db,https://figshare.com/ndownloader/files/46762501,https://figshare.com/ndownloader/files/46762495 -MOUSE,Mus musculus,,112,ensembl,https://ftp.ensembl.org/pub/release-112/fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna.primary_assembly.fa.gz,https://ftp.ensembl.org/pub/release-112/gtf/mus_musculus/Mus_musculus.GRCm39.112.gtf.gz,10090,org.Mm.eg.db,https://figshare.com/ndownloader/files/46762504,https://figshare.com/ndownloader/files/46762498 -,Mycobacterium marinum,LHM4,59,ensembl_bacteria,coming soon,coming soon,,,, -ORYSJ,Oryza sativa,Japonica,59,ensembl_plants,https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-59/fasta/oryza_sativa/dna/Oryza_sativa.IRGSP-1.0.dna.toplevel.fa.gz,https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-59/gtf/oryza_sativa/Oryza_sativa.IRGSP-1.0.59.gtf.gz,4530,BSgenome.Osativa.MSU.MSU7,, -ORYLA,Oryzias 
latipes,,112,ensembl,http://ftp.ensembl.org/pub/release-112/fasta/oryzias_latipes/dna/Oryzias_latipes.ASM223467v1.dna.toplevel.fa.gz,http://ftp.ensembl.org/pub/release-112/gtf/oryzias_latipes/Oryzias_latipes.ASM223467v1.112.gtf.gz,8090,,https://figshare.com/ndownloader/files/46762510,https://figshare.com/ndownloader/files/46762507 -RAT,Rattus norvegicus,,112,ensembl,http://ftp.ensembl.org/pub/release-112/fasta/rattus_norvegicus/dna/Rattus_norvegicus.mRatBN7.2.dna.toplevel.fa.gz,http://ftp.ensembl.org/pub/release-112/gtf/rattus_norvegicus/Rattus_norvegicus.mRatBN7.2.112.gtf.gz,10116,org.Rn.eg.db,https://figshare.com/ndownloader/files/46537834,https://figshare.com/ndownloader/files/46537867 -YEAST,Saccharomyces cerevisiae,S288C,112,ensembl,http://ftp.ensembl.org/pub/release-112/fasta/saccharomyces_cerevisiae/dna/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa.gz,http://ftp.ensembl.org/pub/release-112/gtf/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.112.gtf.gz,559292,org.Sc.sgd.db,https://figshare.com/ndownloader/files/46762516,https://figshare.com/ndownloader/files/46762522 -STAA8,Staphylococcus aureus,UAMS-1,59,ensembl_bacteria,coming soon,coming soon,,,, -,Streptococcus mutans,UA159,59,ensembl_bacteria,,,,,, - \ No newline at end of file +ECOLI,Escherichia coli,str. K-12 substr. 
MG1655,59,ensembl_bacteria,https://ftp.ensemblgenomes.ebi.ac.uk/pub/bacteria/release-59/fasta/bacteria_0_collection/escherichia_coli_str_k_12_substr_mg1655_gca_000005845/dna/Escherichia_coli_str_k_12_substr_mg1655_gca_000005845.ASM584v2.dna.toplevel.fa.gz,https://ftp.ensemblgenomes.ebi.ac.uk/pub/bacteria/release-59/gtf/bacteria_0_collection/escherichia_coli_str_k_12_substr_mg1655_gca_000005845/Escherichia_coli_str_k_12_substr_mg1655_gca_000005845.ASM584v2.59.gtf.gz,511145,org.EcolistrK12substrMG1655.eg.db,https://figshare.com/ndownloader/files/48166417,https://figshare.com/ndownloader/files/48166420 +HUMAN,Homo sapiens,,112,ensembl,https://ftp.ensembl.org/pub/release-112/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz,https://ftp.ensembl.org/pub/release-112/gtf/homo_sapiens/Homo_sapiens.GRCh38.112.gtf.gz,9606,org.Hs.eg.db,https://figshare.com/ndownloader/files/48166477,https://figshare.com/ndownloader/files/48166471 +NOENTRY_LA,Lactobacillus acidophilus,NCFM,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/011/985/GCF_000011985.1_ASM1198v1/GCF_000011985.1_ASM1198v1_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/011/985/GCF_000011985.1_ASM1198v1/GCF_000011985.1_ASM1198v1_genomic.gtf.gz,272621,,https://figshare.com/ndownloader/files/48166447,https://figshare.com/ndownloader/files/48166450 +MOUSE,Mus musculus,,112,ensembl,https://ftp.ensembl.org/pub/release-112/fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna.primary_assembly.fa.gz,https://ftp.ensembl.org/pub/release-112/gtf/mus_musculus/Mus_musculus.GRCm39.112.gtf.gz,10090,org.Mm.eg.db,https://figshare.com/ndownloader/files/48166483,https://figshare.com/ndownloader/files/48166474 +NOENTRY_MM,Mycobacterium 
marinum,M,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/018/345/GCF_000018345.1_ASM1834v1/GCF_000018345.1_ASM1834v1_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/018/345/GCF_000018345.1_ASM1834v1/GCF_000018345.1_ASM1834v1_genomic.gtf.gz,216594,,https://figshare.com/ndownloader/files/48166459,https://figshare.com/ndownloader/files/48166462 +ORYSJ,Oryza sativa,Japonica,59,ensembl_plants,https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-59/fasta/oryza_sativa/dna/Oryza_sativa.IRGSP-1.0.dna.toplevel.fa.gz,https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-59/gtf/oryza_sativa/Oryza_sativa.IRGSP-1.0.59.gtf.gz,39947,,https://figshare.com/ndownloader/files/48166480,https://figshare.com/ndownloader/files/48166486 +ORYLA,Oryzias latipes,,112,ensembl,http://ftp.ensembl.org/pub/release-112/fasta/oryzias_latipes/dna/Oryzias_latipes.ASM223467v1.dna.toplevel.fa.gz,http://ftp.ensembl.org/pub/release-112/gtf/oryzias_latipes/Oryzias_latipes.ASM223467v1.112.gtf.gz,8090,org.Olatipes.eg.db,https://figshare.com/ndownloader/files/48166492,https://figshare.com/ndownloader/files/48166489 +PSEAE,Pseudomonas aeruginosa,UCBPP-PA14,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/014/625/GCF_000014625.1_ASM1462v1/GCF_000014625.1_ASM1462v1_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/014/625/GCF_000014625.1_ASM1462v1/GCF_000014625.1_ASM1462v1_genomic.gtf.gz,208963,,https://figshare.com/ndownloader/files/48166453,https://figshare.com/ndownloader/files/48166456 +RAT,Rattus norvegicus,,112,ensembl,http://ftp.ensembl.org/pub/release-112/fasta/rattus_norvegicus/dna/Rattus_norvegicus.mRatBN7.2.dna.toplevel.fa.gz,http://ftp.ensembl.org/pub/release-112/gtf/rattus_norvegicus/Rattus_norvegicus.mRatBN7.2.112.gtf.gz,10116,org.Rn.eg.db,https://figshare.com/ndownloader/files/48166501,https://figshare.com/ndownloader/files/48166495 +YEAST,Saccharomyces
cerevisiae,S288C,112,ensembl,http://ftp.ensembl.org/pub/release-112/fasta/saccharomyces_cerevisiae/dna/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa.gz,http://ftp.ensembl.org/pub/release-112/gtf/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.112.gtf.gz,559292,org.Sc.sgd.db,https://figshare.com/ndownloader/files/48166498,https://figshare.com/ndownloader/files/48166504 +SALTY,Salmonella enterica,serovar Typhimurium str. LT2,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/006/945/GCF_000006945.2_ASM694v2/GCF_000006945.2_ASM694v2_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/006/945/GCF_000006945.2_ASM694v2/GCF_000006945.2_ASM694v2_genomic.gtf.gz,99287,org.SentericaserovarTyphimuriumstrLT2.eg.db,https://figshare.com/ndownloader/files/48166423,https://figshare.com/ndownloader/files/48166426 +NOENTRY_SL,Serratia liquefaciens,ATCC 27592,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/422/085/GCF_000422085.1_ASM42208v1/GCF_000422085.1_ASM42208v1_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/422/085/GCF_000422085.1_ASM42208v1/GCF_000422085.1_ASM42208v1_genomic.gtf.gz,1346614,,https://figshare.com/ndownloader/files/48166465,https://figshare.com/ndownloader/files/48166468 +NOENTRY_SA,Staphylococcus aureus,MRSA252,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/011/505/GCF_000011505.1_ASM1150v1/GCF_000011505.1_ASM1150v1_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/011/505/GCF_000011505.1_ASM1150v1/GCF_000011505.1_ASM1150v1_genomic.gtf.gz,282458,,https://figshare.com/ndownloader/files/48166435,https://figshare.com/ndownloader/files/48166438 +NOENTRY_SM,Streptococcus 
mutans,UA159,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/007/465/GCF_000007465.2_ASM746v2/GCF_000007465.2_ASM746v2_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/007/465/GCF_000007465.2_ASM746v2/GCF_000007465.2_ASM746v2_genomic.gtf.gz,210007,,https://figshare.com/ndownloader/files/48166429,https://figshare.com/ndownloader/files/48166432 +NOENTRY_VF,Vibrio fischeri,ES114,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/011/805/GCF_000011805.1_ASM1180v1/GCF_000011805.1_ASM1180v1_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/011/805/GCF_000011805.1_ASM1180v1/GCF_000011805.1_ASM1180v1_genomic.gtf.gz,312309,,https://figshare.com/ndownloader/files/48166441,https://figshare.com/ndownloader/files/48166444 \ No newline at end of file diff --git a/GeneLab_Reference_Annotations/README.md b/GeneLab_Reference_Annotations/README.md index 755c3348..07896e0c 100644 --- a/GeneLab_Reference_Annotations/README.md +++ b/GeneLab_Reference_Annotations/README.md @@ -20,6 +20,6 @@ **Developed by:** Mike Lee -**Maintained by:** -Alexis Torres +**Maintained by:** +Alexis Torres Crystal Han diff --git a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/CHANGELOG.md b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/CHANGELOG.md index 252dcc0c..b6da06a8 100644 --- a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/CHANGELOG.md +++ b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/CHANGELOG.md @@ -9,9 +9,26 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ### Added -- Added AnnotationForge helper script to create local annotations databases if not available on Bioconductor - -- Added support for BACSU, ECOLI, and ORYLA via install-annot-dbi.R +- Added support for: + - Bacillus subtilis, subsp. subtilis 168 + - Brachypodium distachyon + - Escherichia coli,str. K-12 substr. 
MG1655 + - Oryzias latipes + - Lactobacillus acidophilus NCFM + - Mycobacterium marinum M + - Oryza sativa Japonica + - Pseudomonas aeruginosa UCBPP-PA14 + - Salmonella enterica subsp. enterica serovar Typhimurium str. LT2 + - Serratia liquefaciens ATCC 27592 + - Staphylococcus aureus MRSA252 + - Streptococcus mutans UA159 + - Vibrio fischeri ES114 +- Added AnnotationForge helper script install-annot-dbi.R to create organism-specific annotation packages (org.*.eg.db) in R if not available on Bioconductor. Used for: + - Bacillus subtilis, subsp. subtilis 168 + - Salmonella enterica subsp. enterica serovar Typhimurium str. LT2 + - Escherichia coli,str. K-12 substr. MG1655 + - Oryzias latipes +- Added NCBI as a source for FASTA and GTF files ### Fixed @@ -23,7 +40,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 - Animals: Ensembl release 112 - Plants: Ensembl plants release 59 - Bacteria: Ensembl bacteria release 59 -- Removed org.EcK12.eg.db and replaced with local annotations database creation since it is no longer on Bioconductor +- Removed org.EcK12.eg.db and replaced it with a locally created annotations database, as it is no longer available on Bioconductor ## [1.0.0](https://github.com/nasa/GeneLab_Data_Processing/releases/tag/GL_RefAnnotTable_1.0.0) diff --git a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md index 73143bb9..ec6c1d89 100644 --- a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md +++ b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md @@ -18,8 +18,8 @@ We recommend installing R via the [Comprehensive R Archive Network (CRAN)](https 1. Select the [CRAN Mirror](https://cran.r-project.org/mirrors.html) closest to your location. 2. Click the link under the "Download and Install R" section that's consistent with your machine. -3.
Clink on the R-4.2.1 package consistent with your machine to download. -4. Double click on the R-4.2.1.pkg downloaded in step 3 and follow the installation instructions. +3. Click on the R-4.4.0 package consistent with your machine to download. +4. Double click on the R-4.4.0.pkg downloaded in step 3 and follow the installation instructions. Once R is installed, open a CLI terminal and run the following command to activate R: @@ -53,7 +53,7 @@ curl -LO https://github.com/nasa/GeneLab_Data_Processing/releases/download/GL_Re ### 3. Setup Execution Permission for Workflow Scripts -Once you've downloaded the GL_RefAnnotTable workflow directory as a zip file, unzip the workflow then `cd` into the GL_RefAnnotTable_1.0.0 directory on the CLI. Next, run the following command to set the execution permissions for the R script: +Once you've downloaded the GL_RefAnnotTable workflow directory as a zip file, unzip the workflow then `cd` into the GL_RefAnnotTable-A_1.1.0 directory on the CLI. Next, run the following command to set the execution permissions for the R script: ```bash chmod -R u+x *R @@ -66,12 +66,12 @@ chmod -R u+x *R While in the GL_RefAnnotTable workflow directory, you are now able to run the workflow. Below is an example of how to run the workflow to build an annotation table for Mus musculus (mouse): ```bash -Rscript GL-DPPD-7110_build-genome-annots-tab.R MOUSE +Rscript GL-DPPD-7110-A_build-genome-annots-tab.R MOUSE ``` **Input data:** -- No input files required, but a target organism must be specified as a positional command line argument, `MOUSE` is used in the example above. Run `Rscript GL-DPPD-7110_build-genome-annots-tab.R` with no positional arguments to see the list of currently available organisms. +- No input files are required. Specify the target organism using a positional command line argument. `MOUSE` is used in the example above. 
To see a list of all available organisms, run `Rscript GL-DPPD-7110-A_build-genome-annots-tab.R` without positional arguments. The correct argument for each organism can also be found in the 'name' column of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) - Optional: a reference table CSV can be supplied as a second positional argument instead of using the default [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) @@ -90,7 +90,8 @@ Rscript install-annot-dbi.R BACSU /path/to/GL-DPPD-7110-A_annotations.csv **Input data:** -- The target organism must be specified as the first positional command line argument, `BACSU` is used in the example above. +- The target organism must be specified as the first positional command line argument, `BACSU` is used in the example above. The correct argument for each organism can be found in the 'name' column of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) + - The path to a local reference table must also be supplied as the second positional argument Output data: diff --git a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/GL-DPPD-7110_build-genome-annots-tab.R b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/GL-DPPD-7110-A_build-genome-annots-tab.R similarity index 100% rename from GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/GL-DPPD-7110_build-genome-annots-tab.R rename to GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/GL-DPPD-7110-A_build-genome-annots-tab.R diff --git a/GeneLab_Reference_Annotations/Workflow_Documentation/README.md b/GeneLab_Reference_Annotations/Workflow_Documentation/README.md index 20034465..01b497e0 100644 --- 
a/GeneLab_Reference_Annotations/Workflow_Documentation/README.md +++ b/GeneLab_Reference_Annotations/Workflow_Documentation/README.md @@ -7,7 +7,7 @@ |Pipeline Version|Current Workflow Version (for respective pipeline version)| |:---------------|:---------------------------------------------------------| |*[GL-DPPD-7110-A.md](../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md)|[1.1.0](GL_RefAnnotTable-A)| -|*[GL-DPPD-7110.md](../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110/GL-DPPD-7110.md)|[1.0.0](GL_RefAnnotTable)| +|[GL-DPPD-7110.md](../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110/GL-DPPD-7110.md)|[1.0.0](GL_RefAnnotTable)| *Current GeneLab Pipeline/Workflow Implementation From f6154f798890090e64fb77e27b4ba231fbe3832f Mon Sep 17 00:00:00 2001 From: torres-alexis Date: Mon, 12 Aug 2024 05:31:47 -0700 Subject: [PATCH 08/58] [GL_RefAnnotTable] GL_RefAnnotTable-A 1.1.0 - Completed refactoring of script and addition of microbial organisms --- .../GL-DPPD-7110-A/GL-DPPD-7110-A.md | 747 ++++++++++++------ .../GL-DPPD-7110-A_annotations.csv | 42 +- .../GL_RefAnnotTable-A/CHANGELOG.md | 9 +- .../GL_RefAnnotTable-A/README.md | 4 +- .../GL-DPPD-7110-A_build-genome-annots-tab.R | 696 +++++++++------- .../{install-annot-dbi.R => install-org-db.R} | 2 +- 6 files changed, 904 insertions(+), 596 deletions(-) rename GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/{install-annot-dbi.R => install-org-db.R} (99%) diff --git a/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md b/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md index ab80f7ab..74d2a6be 100644 --- a/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md +++ b/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md @@ -1,97 +1,148 @@ -# GeneLab pipeline for generating reference annotation tables - -> **This 
page holds an overview and instructions for how GeneLab generates reference annotation tables. The GeneLab reference annotation table used to add annotations to processed data files are indicated in the exact processing scripts provided for each GLDS dataset under the respective omics datatype subdirectory.** +# GeneLab Pipeline for Generating Reference Annotation Tables +> **This page provides an overview and instructions for how GeneLab generates reference annotation tables. The GeneLab reference annotation table used to add annotations to processed data files is indicated in the exact processing scripts provided for each GLDS dataset under the respective omics datatype subdirectory.** + --- -**Date:** July 11, 2024 +**Date:** August 12, 2024 **Revision:** -A **Document Number:** GL-DPPD-7110-A **Submitted by:** -Alexis Torres and Crystal Han (GeneLab Data Processing Team) +Alexis Torres and Crystal Han (GeneLab Data Processing Team) **Approved by:** Sylvain Costes (OSDR Project Manager) -Samrawit Gebre (GeneLab Deputy Project Manager and Acting Genelab Configuration Manager) +Samrawit Gebre (GeneLab Deputy Project Manager and Acting GeneLab Configuration Manager) Lauren Sanders (OSDR Project Scientist) Amanda Saravia-Butler (GeneLab Science Lead) -Barbara Novak (GeneLab Data Processing Lead) +Barbara Novak (GeneLab Data Processing Lead) --- -## Updates from previous version - -Ensembl Releases: -- Animals: Updated from release 107 to 112 -- Plants: Updated from release 54 to 59 -- Bacteria: Updated from release 54 to 59 - - -Added NCBI as a reference source for FASTA and GTF files for bacteria to improve annotations. - -Updated R version from 4.1.3 to 4.4.0. - -Updated Bioconductor version from 3.15.1 to 3.19.1. - -Added support for: -- Bacillus subtilis, subsp. subtilis 168 -- Brachypodium distachyon -- Escherichia coli,str. K-12 substr. 
MG1655 -- Oryzias latipes -- Lactobacillus acidophilus NCFM -- Mycobacterium marinum M -- Oryza sativa Japonica -- Pseudomonas aeruginosa UCBPP-PA14 -- Salmonella enterica subsp. enterica serovar Typhimurium str. LT2 -- Serratia liquefaciens ATCC 27592 -- Staphylococcus aureus MRSA252 -- Streptococcus mutans UA159 -- Vibrio fischeri ES114 - -Added functionality to create an annotation database using AnnotationForge. This applies to organisms without a maintained annotation database package in Bioconductor (e.g. org.Hs.eg.db). This is currently in use for the following organisms: -- Bacillus subtilis, subsp. subtilis 168 -- Brachypodium distachyon -- Escherichia coli, str. K-12 substr. MG1655 -- Oryzias latipes -- Salmonella enterica subsp. enterica serovar Typhimurium str. LT2 +## Updates from Previous Version + +- **Updated Software:** + - R version updated from 4.1.3 to 4.4.0. + - Bioconductor version updated from 3.15.1 to 3.19.1. + +- **Ensembl Releases:** + - Animals: Updated from release 107 to 112 + - Plants: Updated from release 54 to 59 + - Bacteria: Updated from release 54 to 59 + +- **New Organism Support:** + 1. Bacillus subtilis, subsp. subtilis 168 + 2. Brachypodium distachyon + 3. Escherichia coli, str. K-12 substr. MG1655 + 4. Oryzias latipes + 5. Lactobacillus acidophilus NCFM + 6. Mycobacterium marinum M + 7. Oryza sativa Japonica + 8. Pseudomonas aeruginosa UCBPP-PA14 + 9. Salmonella enterica subsp. enterica serovar Typhimurium str. LT2 + 10. Serratia liquefaciens ATCC 27592 + 11. Staphylococcus aureus MRSA252 + 12. Streptococcus mutans UA159 + 13. Vibrio fischeri ES114 + +- **Added NCBI as a Reference Source:** + FASTA and GTF files were sourced from NCBI for the following organisms: + 1. Lactobacillus acidophilus NCFM + 2. Mycobacterium marinum M + 3. Pseudomonas aeruginosa UCBPP-PA14 + 4. Serratia liquefaciens ATCC 27592 + 5. Staphylococcus aureus MRSA252 + 6. Streptococcus mutans UA159 + 7. 
Vibrio fischeri ES114 + +- **org.db Creation:** + Added functionality to create an annotation database using `AnnotationForge`. This is applicable to organisms without a maintained annotation database package in Bioconductor (e.g., `org.Hs.eg.db`). Currently, this approach is in use for the following organisms: + 1. Bacillus subtilis, subsp. subtilis 168 + 2. Brachypodium distachyon + 3. Escherichia coli, str. K-12 substr. MG1655 + 4. Oryzias latipes + 5. Salmonella enterica subsp. enterica serovar Typhimurium str. LT2 + +The pipeline is designed to annotate unique gene IDs in a reference assembly, map them to organism-specific `org.db` databases for additional annotations, integrate STRING DB IDs, and use PANTHER to obtain GO slim IDs based on ENTREZ IDs. + +The default columns in the annotation table are: +- ENSEMBL (or TAIR), SYMBOL, GENENAME, REFSEQ, ENTREZID, STRING_id, GOSLIM_IDS + +- For organisms with FASTA and GTF files sourced from NCBI, the LOCUS, OLD_LOCUS, SYMBOL, GENENAME, and GO annotations were directly derived from the GTF file. The `GO` column contains GO terms. `OLD_LOCUS` (`old_locus_tag` in the GTF) was retained when needed to map to STRING IDs. +- Missing columns indicate the absence of corresponding data for that organism. + +1. **Brachypodium distachyon (BRADI)**: + - Columns: ENSEMBL, ACCNUM, SYMBOL, GENENAME, REFSEQ, ENTREZID, STRING_id, GOSLIM_IDS + > Note: GTF `transcript_id` entries were matched with `ACCNUM` keys in the `org.db` and saved as `ACCNUM` + +2. **Caenorhabditis elegans (WORM)**: + - Columns: ENSEMBL, SYMBOL, GENENAME, REFSEQ, ENTREZID, STRING_id + > Note: org.db ENTREZ keys did not match PANTHER ENTREZ keys, so the empty `GOSLIM_IDS` column was omitted + +3. **Lactobacillus acidophilus (NCFM)**: + - Columns: LOCUS, OLD_LOCUS, SYMBOL, GENENAME, GO, STRING_id + +4. **Mycobacterium marinum (MMARINUMM)**: + - Columns: LOCUS, OLD_LOCUS, SYMBOL, GENENAME, GO, STRING_id + +5.
**Oryza sativa Japonica (ORYSJ)**: + - Columns: ENSEMBL, STRING_id + +6. **Pseudomonas aeruginosa UCBPP-PA14 (PA14)**: + - Columns: LOCUS, SYMBOL, GENENAME, GO + +7. **Serratia liquefaciens ATCC 27592 (ATCC27592)**: + - Columns: LOCUS, OLD_LOCUS, SYMBOL, GENENAME, GO, STRING_id + +8. **Staphylococcus aureus MRSA252 (MRSA252)**: + - Columns: LOCUS, SYMBOL, GENENAME, GO + +9. **Streptococcus mutans UA159 (UA159)**: + - Columns: LOCUS, OLD_LOCUS, SYMBOL, GENENAME, GO, STRING_id + +10. **Vibrio fischeri ES114 (ES114)**: + - Columns: LOCUS, OLD_LOCUS, SYMBOL, GENENAME, GO, STRING_id --- -# Table of contents +# Table of Contents -- [GeneLab pipeline for generating reference annotation tables](#genelab-pipeline-for-generating-reference-annotation-tables) -- [Table of contents](#table-of-contents) -- [Software used](#software-used) -- [Annotation table build overview with example commands](#annotation-table-build-overview-with-example-commands) +- [GeneLab Pipeline for Generating Reference Annotation Tables](#genelab-pipeline-for-generating-reference-annotation-tables) +- [Table of Contents](#table-of-contents) +- [Software Used](#software-used) +- [Annotation Table Build Overview with Example Commands](#annotation-table-build-overview-with-example-commands) - [0. Set Up Environment](#0-set-up-environment) - [1. Define Variables and Output File Names](#1-define-variables-and-output-file-names) - - [2. Load Annotation Databases and Retrieve Unique Gene IDs](#2-load-annotation-databases-and-retrieve-unique-gene-ids) + - [2. Load Annotation Databases](#2-load-annotation-databases) - [3. Build Initial Annotation Table](#3-build-initial-annotation-table) - - [4. Add STRING IDs](#4-add-string-ids) - - [5. Add Gene Ontology (GO) slim IDs](#5-add-gene-ontology-go-slim-ids) - - [6. Export Annotation Table and Build Info](#6-export-annotation-table-and-build-info) + - [4. Add org.db Keys](#4-add-orgdb-keys) + - [5. Add STRING IDs](#5-add-string-ids) + - [6. 
Add Gene Ontology (GO) Slim IDs](#6-add-gene-ontology-go-slim-ids) + - [7. Export Annotation Table and Build Info](#7-export-annotation-table-and-build-info) + --- -# Software used - -|Program|Version|Relevant Links| -|:------|:------:|:-------------| -|R|4.4.0|[https://www.r-project.org/](https://www.r-project.org/)| -|Bioconductor|3.19.1|[https://bioconductor.org](https://bioconductor.org)| -|tidyverse|2.0.0|[https://www.tidyverse.org](https://www.tidyverse.org)| -|STRINGdb|2.16.0|[https://www.bioconductor.org/packages/release/bioc/html/STRINGdb.html](https://www.bioconductor.org/packages/release/bioc/html/STRINGdb.html)| -|PANTHER.db|1.0.12|[https://bioconductor.org/packages/release/data/annotation/html/PANTHER.db.html](https://bioconductor.org/packages/release/data/annotation/html/PANTHER.db.html)| -|rtracklayer|1.64.0|[https://bioconductor.org/packages/release/bioc/html/rtracklayer.html](https://bioconductor.org/packages/release/bioc/html/rtracklayer.html) -|org.Hs.eg.db|3.19.1|[https://bioconductor.org/packages/release/data/annotation/html/org.Hs.eg.db.html](https://bioconductor.org/packages/release/data/annotation/html/org.Hs.eg.db.html)| -|org.Mm.eg.db|3.19.1|[https://bioconductor.org/packages/release/data/annotation/html/org.Mm.eg.db.html](https://bioconductor.org/packages/release/data/annotation/html/org.Mm.eg.db.html)| -|org.Rn.eg.db|3.19.1|[https://bioconductor.org/packages/release/data/annotation/html/org.Rn.eg.db.html](https://bioconductor.org/packages/release/data/annotation/html/org.Rn.eg.db.html) -|org.Dm.eg.db|3.19.1|[https://bioconductor.org/packages/release/data/annotation/html/org.Dm.eg.db.html](https://bioconductor.org/packages/release/data/annotation/html/org.Dm.eg.db.html)| -|org.Ce.eg.db|3.19.1|[https://bioconductor.org/packages/release/data/annotation/html/org.Ce.eg.db.html](https://bioconductor.org/packages/release/data/annotation/html/org.Ce.eg.db.html)| 
-|org.At.tair.db|3.19.1|[https://bioconductor.org/packages/release/data/annotation/html/org.At.tair.db.html](https://bioconductor.org/packages/release/data/annotation/html/org.At.tair.db.html)| -|org.Sc.sgd.db|3.19.1|[https://bioconductor.org/packages/release/data/annotation/html/org.Sc.sgd.db.html](https://bioconductor.org/packages/release/data/annotation/html/org.Sc.sgd.db.html)| +# Software Used + +| Program | Version | Relevant Links | +|:--------------|:-------:|:---------------| +| R | 4.4.0 | [https://www.r-project.org/](https://www.r-project.org/) | +| Bioconductor | 3.19.1 | [https://bioconductor.org](https://bioconductor.org) | +| tidyverse | 2.0.0 | [https://www.tidyverse.org](https://www.tidyverse.org) | +| STRINGdb | 2.16.0 | [https://www.bioconductor.org/packages/release/bioc/html/STRINGdb.html](https://www.bioconductor.org/packages/release/bioc/html/STRINGdb.html) | +| PANTHER.db | 1.0.12 | [https://bioconductor.org/packages/release/data/annotation/html/PANTHER.db.html](https://www.bioconductor.org/packages/release/data/annotation/html/PANTHER.db.html) | +| rtracklayer | 1.64.0 | [https://bioconductor.org/packages/release/bioc/html/rtracklayer.html](https://www.bioconductor.org/packages/release/bioc/html/rtracklayer.html) | +| org.At.tair.db| 3.19.1 | [https://bioconductor.org/packages/release/data/annotation/html/org.At.tair.db.html](https://www.bioconductor.org/packages/release/data/annotation/html/org.At.tair.db.html) | +| org.Ce.eg.db | 3.19.1 | [https://bioconductor.org/packages/release/data/annotation/html/org.Ce.eg.db.html](https://www.bioconductor.org/packages/release/data/annotation/html/org.Ce.eg.db.html) | +| org.Dm.eg.db | 3.19.1 | [https://bioconductor.org/packages/release/data/annotation/html/org.Dm.eg.db.html](https://www.bioconductor.org/packages/release/data/annotation/html/org.Dm.eg.db.html) | +| org.Dr.eg.db | 3.19.1 | 
[https://bioconductor.org/packages/release/data/annotation/html/org.Dr.eg.db.html](https://www.bioconductor.org/packages/release/data/annotation/html/org.Dr.eg.db.html) | +| org.Hs.eg.db | 3.19.1 | [https://bioconductor.org/packages/release/data/annotation/html/org.Hs.eg.db.html](https://www.bioconductor.org/packages/release/data/annotation/html/org.Hs.eg.db.html) | +| org.Mm.eg.db | 3.19.1 | [https://bioconductor.org/packages/release/data/annotation/html/org.Mm.eg.db.html](https://www.bioconductor.org/packages/release/data/annotation/html/org.Mm.eg.db.html) | +| org.Rn.eg.db | 3.19.1 | [https://bioconductor.org/packages/release/data/annotation/html/org.Rn.eg.db.html](https://www.bioconductor.org/packages/release/data/annotation/html/org.Rn.eg.db.html) | +| org.Sc.sgd.db | 3.19.1 | [https://bioconductor.org/packages/release/data/annotation/html/org.Sc.sgd.db.html](https://www.bioconductor.org/packages/release/data/annotation/html/org.Sc.sgd.db.html) | --- @@ -112,36 +163,23 @@ This example below is done for *Mus musculus*. All code is executed in R. ## 0. 
Set Up Environment ```R -target_organism == "MOUSE" - +# Define variables associated with current pipeline and annotation table versions GL_DPPD_ID <- "GL-DPPD-7110-A" +ref_tab_path <- "https://raw.githubusercontent.com/nasa/GeneLab_Data_Processing/master/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv" +readme_path <- "https://github.com/nasa/GeneLab_Data_Processing/tree/master/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md" + +# List currently supported organisms +currently_accepted_orgs <- c("ARABIDOPSIS", "BACSU", "BRADI", "WORM", "ZEBRAFISH", + "FLY", "ECOLI", "HUMAN", "NCFM", "MOUSE", + "MMARINUMM", "ORYSJ", "ORYLA", "PA14", "RAT", + "YEAST", "SALTY", "ATCC27592", "MRSA252", "UA159", + "ES114") -## Import libraries ## +# Import libraries library(tidyverse) library(STRINGdb) library(PANTHER.db) library(rtracklayer) - - -## Set the primary annotation keytype, TAIR for Arabidopsis, ENSEMBL for all other organisms ## -if (target_organism == "ARABIDOPSIS") { - primary_keytype <- "TAIR" -} else { - primary_keytype <- "ENSEMBL" -} - -## Define annotation keys to retrieve ## -wanted_keys_vec <- c("SYMBOL", "GENENAME", "REFSEQ", "ENTREZID") - -## Check for ref table input in arg 2, otherwise load GL-DPPD-7110-A_annotations.csv -if (length(args) >= 2) { - ref_tab_link <- args[2] -} else { - ref_tab_link <- "https://raw.githubusercontent.com/nasa/GeneLab_Data_Processing/master/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv" -} - -## Set timeout time to allow more time for annotation file downloads to complete ## -options(timeout = 600) ``` --- @@ -149,70 +187,96 @@ options(timeout = 600) ## 1. 
Define Variables and Output File Names ```R -## Read in tables containing species-specific annotation info ## -ref_table <- read.csv(ref_tab_link) +# Set timeout time to ensure annotation file downloads will complete +options(timeout = 600) -## Retrieve and define target organism taxid, annotation database name, and scientific name ## -target_taxid <- ref_table %>% - filter(name == target_organism) %>% - pull(taxon) +ref_table <- tryCatch( + read.csv(ref_tab_path), + error = function(e) { + message <- paste("Error: Unable to read the reference table from the path provided. Please check the path and try again.\nPath:", ref_tab_path) + stop(message) + } +) -target_org_db <- ref_table %>% - filter(name == target_organism) %>% - pull(annotations) +# Get target organism information +target_info <- ref_table %>% + filter(name == target_organism) -target_species_designation <- ref_table %>% - filter(name == target_organism) %>% - pull(species) +# Extract the relevant columns from the reference table +target_taxid <- target_info$taxon # Taxonomic identifier +target_org_db <- target_info$annotations # org.eg.db R package +target_species_designation <- target_info$species # Full species name +gtf_link <- target_info$gtf # Path to reference assembly GTF -## Define link to Ensembl annotation gtf file for the target organism ## -gtf_link <- ref_table %>% - filter(species == target_species_designation) %>% - pull(gtf) +# Error handling for missing values +if (is.na(target_taxid) || is.na(target_org_db) || is.na(target_species_designation) || is.na(gtf_link)) { + stop(paste("Error: Missing data for target organism", target_organism, "in reference table.")) +} -## Create output file names ## +# Create output filenames base_gtf_filename <- basename(gtf_link) base_output_name <- str_replace(base_gtf_filename, ".gtf.gz", "") out_table_filename <- paste0(base_output_name, "-GL-annotations.tsv") out_log_filename <- paste0(base_output_name, "-GL-build-info.txt") + +# Check if output 
file already exists and if it does, exit without overwriting +if ( file.exists(out_table_filename) ) { + cat("\n-------------------------------------------------------------------------------------------------\n") + cat(paste0("\n The file that would be created, '", out_table_filename, "', exists already.\n")) + cat(paste0(" We don't want to overwrite it accidentally. Move it and run this again if wanting to proceed.\n")) + cat("\n-------------------------------------------------------------------------------------------------\n") + quit() +} ```
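The output file names in this step are derived from the GTF file name. A minimal base-R sketch of that derivation follows; the GTF URL is a made-up placeholder, and `sub()` with an escaped, end-anchored pattern is an assumption swapped in for the `str_replace()` call so that only the literal `.gtf.gz` suffix is stripped (an unescaped `.` in a regular expression matches any character).

```r
# Hypothetical GTF link, for illustration only (not taken from the reference table)
gtf_link <- "https://example.org/pub/Mus_musculus.GRCm39.112.gtf.gz"

# Keep only the file name portion of the link
base_gtf_filename <- basename(gtf_link)

# Escape the dots and anchor at the end so only the literal suffix is removed
base_output_name <- sub("\\.gtf\\.gz$", "", base_gtf_filename)

# Build the annotation table and build-info file names
out_table_filename <- paste0(base_output_name, "-GL-annotations.tsv")
out_log_filename <- paste0(base_output_name, "-GL-build-info.txt")

print(out_table_filename)
print(out_log_filename)
```

The unescaped pattern produces the same names for file names of this shape; the escaped form simply guards against an accidental match elsewhere in the name.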
--- -## 2. Load Annotation Databases and Retrieve Unique Gene IDs +## 2. Load Annotation Databases ```R -## Import Ensembl annotation gtf file for the target organism ## -gtf_obj <- import(gtf_link) +# Set timeout time to ensure annotation file downloads will complete +options(timeout = 600) -## Define unique Ensembl IDs ## -unique_IDs <- gtf_obj$gene_id %>% unique() +####### GTF ########## -## Remove gtf object to conserve RAM, since it is no longer needed ## -rm(gtf_obj) +# Create the GTF dataframe from its path, unique gene identities in the reference assembly are under 'gene_id' +GTF <- rtracklayer::import(gtf_link) +GTF <- data.frame(GTF) -## Define target organism annotation database ## -ann.dbi <- target_org_db +###### org.db ######## -## If ann.dbi is not null, try to install the annotations database from bioconductor, otherwise create it with install-annot-dbi.R "" -if (!is.na(ann.dbi) && ann.dbi != "") { - BiocManager::install(ann.dbi, ask = FALSE) - if (!requireNamespace(ann.dbi, quietly = TRUE)) { - source("install-annot-dbi.R") - ann.dbi <- install_annotations(target_organism, ref_tab_link) +# Define a function to load the specified org.db package for a given target organism +install_and_load_org_db <- function(target_organism, target_org_db, ref_tab_path) { + if (!is.na(target_org_db) && target_org_db != "") { + # Attempt to install the package from Bioconductor + BiocManager::install(target_org_db, ask = FALSE) + + # Check if the package was successfully loaded + if (!requireNamespace(target_org_db, quietly = TRUE)) { + # If not, attempt to create it locally using a helper script + source("install-org-db.R") + target_org_db <- install_annotations(target_organism, ref_tab_path) } -} else { - source("install-annot-dbi.R") - ann.dbi <- install_annotations(target_organism, ref_tab_link) + } else { + # If target_org_db is NA or empty, create it locally using the helper script + source("install-org-db.R") + target_org_db <- 
install_annotations(target_organism, ref_tab_path) + } + + # Load the package into the R session + library(target_org_db, character.only = TRUE) } +# Define list of supported organisms which do not use annotations from an org.db +no_org_db <- c("NCFM", "MMARINUMM", "ORYSJ", "PA14", "ATCC27592", "MRSA252", "UA159", "ES114") -library(ann.dbi, character.only = TRUE) - - +# Run the function unless the target_organism is in no_org_db +if (!(target_organism %in% no_org_db) && (target_organism %in% currently_accepted_orgs)) { + install_and_load_org_db(target_organism, target_org_db, ref_tab_path) +} ```
@@ -222,173 +286,331 @@ library(ann.dbi, character.only = TRUE) ## 3. Build Initial Annotation Table ```R -## Begin annotation table using unique IDs of the primary keytype ## -if (target_organism == "BACSU") { - gtf_df <- as.data.frame(gtf_obj) - # Create a dataframe with unique gene_ids - annot <- gtf_df %>% - dplyr::select(gene_id, gene_name) %>% - distinct(gene_id, .keep_all = TRUE) - colnames(annot) <- c(primary_keytype, "SYMBOL") +# Initialize table from GTF + +# Define GTF keys based on the target organism; gene_id contains unique gene IDs in the reference assembly. Defaults to ENSEMBL. + +gtf_keytype_mappings <- list( + ARABIDOPSIS = c(gene_id = "TAIR"), + BACSU = c(gene_id = "ENSEMBL", gene_name = "SYMBOL"), + BRADI = c(gene_id = "ENSEMBL", transcript_id = "ACCNUM"), + WORM = c(gene_id = "ENSEMBL"), + ECOLI = c(gene_id = "ENSEMBL", gene_name = "SYMBOL"), + NCFM = c(gene_id = "LOCUS", old_locus_tag = "OLD_LOCUS", gene = "SYMBOL", product = "GENENAME", Ontology_term = "GO"), + MMARINUMM = c(gene_id = "LOCUS", old_locus_tag = "OLD_LOCUS", gene = "SYMBOL", product = "GENENAME", Ontology_term = "GO"), + PA14 = c(gene_id = "LOCUS", gene = "SYMBOL", product = "GENENAME", Ontology_term = "GO"), + SALTY = c(gene_id = "ENSEMBL", db_xref = "ENTREZID"), + ATCC27592 = c(gene_id = "LOCUS", old_locus_tag = "OLD_LOCUS", gene = "SYMBOL", product = "GENENAME", Ontology_term = "GO"), + MRSA252 = c(gene_id = "LOCUS", gene = "SYMBOL", product = "GENENAME", Ontology_term = "GO"), + UA159 = c(gene_id = "LOCUS", old_locus_tag = "OLD_LOCUS", gene = "SYMBOL", product = "GENENAME", Ontology_term = "GO"), + ES114 = c(gene_id = "LOCUS", old_locus_tag = "OLD_LOCUS", gene = "SYMBOL", product = "GENENAME", Ontology_term = "GO"), + default = c(gene_id = "ENSEMBL") +) + +# Get the key types for the target organism or use the default +wanted_gtf_keytypes <- if (!is.null(gtf_keytype_mappings[[target_organism]])) { + gtf_keytype_mappings[[target_organism]] } else { - annot <-
data.frame(unique_IDs) - colnames(annot) <- primary_keytype + c(gene_id = "ENSEMBL") } -## If organism is BACSU, remove underscores from gene_ids ## -if (target_organism == "BACSU") { - # Create a mapping of original and modified gene IDs - annot$original_IDs <- annot[[primary_keytype]] - annot[[primary_keytype]] <- gsub("_", "", annot[[primary_keytype]]) -} +# Initialize the annotation table from the GTF, keeping only the wanted_gtf_keytypes +annot_gtf <- GTF[, names(wanted_gtf_keytypes), drop = FALSE] +annot_gtf <- annot_gtf %>% distinct() -## Retrieve and add additional annotation keys as table columns ## -for ( key in wanted_keys_vec ) { - - if ( key %in% columns(eval(parse(text = ann.dbi), env = .GlobalEnv))) { - - if (target_organism == "BACSU") { - new_list <- mapIds(eval(parse(text = ann.dbi), env = .GlobalEnv), keys = annot[["SYMBOL"]], keytype = "SYMBOL", column = key, multiVals = "list") - } else if (target_organism == "ECOLI") { - new_list <- mapIds(eval(parse(text = ann.dbi), env = .GlobalEnv), keys = unique_IDs, keytype = "ALIAS", column = key, multiVals = "list") - } else { new_list <- mapIds(eval(parse(text = ann.dbi), env = .GlobalEnv), keys = unique_IDs, keytype = primary_keytype, column = key, multiVals = "list") - } - annot[[key]] <- sapply(new_list, paste, collapse = "|") - +# Rename the columns in the annot_gtf dataframe according to the key types +colnames(annot_gtf) <- wanted_gtf_keytypes + +# Save the name of the primary key type (gene_id) being used +primary_keytype <- wanted_gtf_keytypes[1] + +# Filter out unwanted genes from the GTF + +# Define filtering criteria for specific organisms +filter_criteria <- list( + BACSU = "^BSU", + FLY = "^RR", + YEAST = "^Y[A-Z0-9]{6}-?[A-Z]?$", + ECOLI = "^b[0-9]{4}$" +) + +# Apply the filter if there's a specific criterion for the target organism +filter_pattern <- filter_criteria[[target_organism]] + +if (!is.null(filter_pattern)) { + if (target_organism == "FLY") { + annot_gtf <- annot_gtf %>% 
filter(!grepl(filter_pattern, !!sym(primary_keytype))) } else { - # if the annotation DB didn't have any of the wanted key types, that column will be missing - # adding in here as an empty column - annot[key] <- NA + annot_gtf <- annot_gtf %>% filter(grepl(filter_pattern, !!sym(primary_keytype))) } } - +# Remove "GeneID:" prefixes from ENTREZ IDs +if (target_organism == "SALTY") { + annot_gtf <- annot_gtf %>% dplyr::mutate(ENTREZID = gsub("^GeneID:", "", ENTREZID)) %>% as.data.frame() +} ```
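The organism-specific gene ID filter in this step keeps IDs that match a pattern for BACSU, YEAST, and ECOLI, but drops matching IDs for FLY. A minimal base-R sketch of that behaviour is shown here; the patterns are copied from `filter_criteria`, while the helper name `filter_gene_ids` and all example IDs are hypothetical and chosen only to exercise each branch.

```r
# Patterns copied from the filter_criteria list used in this step
filter_criteria <- list(
  BACSU = "^BSU",
  FLY   = "^RR",
  YEAST = "^Y[A-Z0-9]{6}-?[A-Z]?$",
  ECOLI = "^b[0-9]{4}$"
)

# Hypothetical helper (not part of the pipeline script): apply the filter to a vector of IDs
filter_gene_ids <- function(ids, target_organism) {
  pattern <- filter_criteria[[target_organism]]
  # Organisms without an entry are returned unfiltered
  if (is.null(pattern)) {
    return(ids)
  }
  hits <- grepl(pattern, ids)
  # FLY is an exclusion filter; every other listed organism is an inclusion filter
  if (target_organism == "FLY") ids[!hits] else ids[hits]
}

# Made-up IDs for illustration only
yeast_ids <- c("YAL001C", "YAL064C-A", "snR18")
ecoli_ids <- c("b0001", "ECK0001")
fly_ids   <- c("FBgn0000003", "RR12345_example")

print(filter_gene_ids(yeast_ids, "YEAST"))
print(filter_gene_ids(ecoli_ids, "ECOLI"))
print(filter_gene_ids(fly_ids, "FLY"))
print(filter_gene_ids(c("ENSMUSG00000000001"), "MOUSE"))
```

Keeping the inclusion/exclusion decision inside one helper makes the FLY special case explicit rather than implied by a negated `grepl()` call.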
--- -## 4. Add STRING IDs +## 4. Add org.db Keys ```R -## Retrieve target organism STRING protein-protein interaction database and create STRING ID map to the primary keytype ## -string_db <- STRINGdb$new(version = "12.0", species = target_taxid, score_threshold = 0) -string_map <- string_db$map(annot, primary_keytype, removeUnmappedRows = FALSE, takeFirst = FALSE) +annot_orgdb <- annot_gtf + +# Define the initial keys to pull from the organism-specific database +orgdb_keytypes_list <- list( + BRADI = c("GENENAME", "REFSEQ", "ENTREZID"), + ECOLI = c("GENENAME", "REFSEQ", "ENTREZID"), + WORM = c("SYMBOL", "GENENAME", "REFSEQ", "ENTREZID", "GO"), + SALTY = c("SYMBOL", "GENENAME", "REFSEQ"), + YEAST = c("GENENAME", "ALIAS", "REFSEQ", "ENTREZID"), + default = c("SYMBOL", "GENENAME", "REFSEQ", "ENTREZID") +) + +# Add entries for organisms in no_org_db as character(0) (no keys wanted from the org.db) +for (organism in no_org_db) { + orgdb_keytypes_list[[organism]] <- character(0) +} -## Create a table using the gene IDs of the primary keytype as row names and a column containing STRING IDs. 
## -## For genes containing multiple STRING IDs, combine all STRING IDs for each gene into one row and separate each ID with a '|' ## -tab_with_multiple_STRINGids_combined <- - data.frame(row.names = annot[[primary_keytype]]) +wanted_org_db_keytypes <- if (target_organism %in% names(orgdb_keytypes_list)) { + orgdb_keytypes_list[[target_organism]] +} else { + orgdb_keytypes_list[["default"]] +} -for ( curr_gene_ID in row.names(tab_with_multiple_STRINGids_combined) ) { +# Define mappings for query and keytype based on target organism +orgdb_keytype_mappings <- list( + BACSU = list(query = "SYMBOL", keytype = "SYMBOL"), + BRADI = list(query = "ACCNUM", keytype = "ACCNUM"), + WORM = list(query = primary_keytype, keytype = "ENSEMBL"), + ECOLI = list(query = "SYMBOL", keytype = "SYMBOL"), + SALTY = list(query = "ENTREZID", keytype = "ENTREZID"), + default = list(query = primary_keytype, keytype = primary_keytype) +) + +# Define the orgdb_query, this is the key type that will be used to map to the org.db +orgdb_query <- if (!is.null(orgdb_keytype_mappings[[target_organism]])) { + orgdb_keytype_mappings[[target_organism]][["query"]] +} else { + orgdb_keytype_mappings[["default"]][["query"]] +} - curr_STRING_ids <- string_map %>% - filter(!!rlang::sym(primary_keytype) == curr_gene_ID) %>% - pull(STRING_id) %>% paste(collapse = "|") +# Define the orgdb_keytype, this is the name of the key type in the org.db +orgdb_keytype <- if (!is.null(orgdb_keytype_mappings[[target_organism]])) { + orgdb_keytype_mappings[[target_organism]][["keytype"]] +} else { + orgdb_keytype_mappings[["default"]][["keytype"]] +} - tab_with_multiple_STRINGids_combined[curr_gene_ID, "STRING_id"] <- curr_STRING_ids +# Function to clean and match ACCNUM keys for BRADI +clean_and_match_accnum <- function(annot_table, org_db, query_col, keytype_col, target_column) { + # Clean the ACCNUM keys in the GTF annotations + cleaned_annot_keys <- sub("\\..*", "", annot_table[[query_col]]) + + # Retrieve and clean the 
org.db keys + orgdb_keys <- keys(org_db, keytype = keytype_col) + cleaned_orgdb_keys <- sub("\\..*", "", orgdb_keys) + + # Create a lookup table for matching cleaned keys to original keys + lookup_table <- setNames(orgdb_keys, cleaned_orgdb_keys) + + # Match cleaned GTF keys to original org.db keys + matched_keys <- lookup_table[cleaned_annot_keys] + + # Use the matched keys to retrieve the target annotations from org.db + mapIds(org_db, keys = matched_keys, keytype = keytype_col, column = target_column, multiVals = "list") +} +# Loop through the desired key types and add annotations to the GTF table +for (keytype in wanted_org_db_keytypes) { + # Check if keytype is a valid column in the target org.db + if (keytype %in% columns(get(target_org_db, envir = .GlobalEnv))) { + if (target_organism == "BRADI" && orgdb_query == "ACCNUM") { + # For BRADI: use the clean_and_match_accnum function to map to org.db ACCNUM entries + org_matches <- clean_and_match_accnum(annot_orgdb, get(target_org_db, envir = .GlobalEnv), query_col = orgdb_query, keytype_col = orgdb_keytype, target_column = keytype) + } else { + # Default mapping for other organisms + org_matches <- mapIds(get(target_org_db, envir = .GlobalEnv), keys = annot_orgdb[[orgdb_query]], keytype = orgdb_keytype, column = keytype, multiVals = "list") + } + # Add the mapped annotations to the GTF table + annot_orgdb[[keytype]] <- sapply(org_matches, function(x) paste(x, collapse = "|")) + } else { + # Set column to NA if keytype is not present in org.db + annot_orgdb[[keytype]] <- NA + } } -## Move the primary keytype gene IDs back to being a column in the STRING ID table (since they were switched to row names above) ## -tab_with_multiple_STRINGids_combined <- - tab_with_multiple_STRINGids_combined %>% - rownames_to_column(primary_keytype) - -## Add the STRING ID column to the annotation table ## - -if (target_organism == "ECOLI") { - # Add a temporary key for joining in both tables - annot <- annot %>% - mutate(join_key 
= toupper(ENSEMBL)) - string_map <- string_map %>% - mutate(join_key = toupper(ENSEMBL)) - - # Perform the left join using the temporary key and drop the join_key column if no longer needed - annot <- left_join(annot, string_map %>% dplyr::select(join_key, STRING_id), by = "join_key") %>% - dplyr::select(-join_key) -} else{ - annot <- left_join(annot, tab_with_multiple_STRINGids_combined, by = primary_keytype) +# For SALTY, reorder columns to match other tables +if (target_organism == "SALTY") { # Reorder columns to match others; was mismatched since ENTREZ came from GTF + annot_orgdb <- annot_orgdb[, c("ENSEMBL", "SYMBOL", "GENENAME", "REFSEQ", "ENTREZID")] } +# For YEAST, rename GENENAME to SYMBOL and ALIAS to GENENAME +if (target_organism == "YEAST") { + colnames(annot_orgdb) <- c("ENSEMBL", "SYMBOL", "GENENAME", "REFSEQ", "ENTREZID") +} ```
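The BRADI handling in this step matches GTF accessions to `org.db` `ACCNUM` keys after stripping version suffixes from both sides. The lookup-table part of `clean_and_match_accnum` can be sketched in base R without an `org.db` installed; the accession values below are invented for illustration, and the final `mapIds()` call is deliberately left out.

```r
# Invented accession keys standing in for keys(org_db, keytype = "ACCNUM")
orgdb_keys <- c("XM_003557001.4", "XM_003557002.3", "XM_010229001.2")

# Invented GTF transcript IDs: different version, no version, and no counterpart
gtf_ids <- c("XM_003557002.5", "XM_003557001", "XM_999999999.1")

# Drop everything from the first "." onward, as in the pipeline's sub() call
strip_version <- function(x) sub("\\..*", "", x)

# Named vector: names are version-less keys, values are the original org.db keys
lookup_table <- setNames(orgdb_keys, strip_version(orgdb_keys))

# Look up each cleaned GTF ID; IDs without a counterpart come back as NA
matched_keys <- unname(lookup_table[strip_version(gtf_ids)])

print(matched_keys)
```

In this sketch the unmatched ID yields `NA`, so it is worth checking how such entries should be treated before they are passed on as query keys.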
--- -## 5. Add Gene Ontology (GO) slim IDs +## 5. Add STRING IDs ```R -## Retrieve target organism PANTHER GO slim annotations database ## -pthOrganisms(PANTHER.db) <- target_organism - -## Use ENTREZ IDs to map genes to respective PANTHER GO slim annotation(s) ## -# Note: Since there can be none (indicated in the annotation table as "NA"), one, or -# multiple ENTREZ IDs for a gene, this section contains 3 distinct parts to handle -# each of those scenarios and create a new column in the annotation table containing the GO slim IDs - -for ( curr_row in 1:dim(annot)[1] ) { - - curr_entry <- annot[curr_row, "ENTREZID"] - - ## For genes without an ENTREZ ID ## - if ( curr_entry == "NA" ) { - - annot[curr_row, "GOSLIM_IDS"] <- "NA" - - } else if ( ! grepl("|", curr_entry, fixed = TRUE) ) { - - ## For genes with one ENTREZ ID ## - curr_GO_IDs <- mapIds(PANTHER.db, keys = curr_entry, keytype = "ENTREZ", column = "GOSLIM_ID", multiVals = "list") %>% unlist() %>% as.vector() - - ## Add "NA" to the GO slim column for ENTREZ IDs that do not contain a respective GO slim ID ## - if ( is.null(curr_GO_IDs) ) { +# Define organisms that do not use STRING annotations +no_stringdb <- c("PA14", "MRSA252") + +# Define the key type used for mapping to STRING +stringdb_query_list <- list( + NCFM = "OLD_LOCUS", + MMARINUMM = "OLD_LOCUS", + ATCC27592 = "OLD_LOCUS", + UA159 = "OLD_LOCUS", + ES114 = "OLD_LOCUS", + default = primary_keytype +) + +# Define the key type for mapping in STRING, using the default if necessary +stringdb_query <- if (!is.null(stringdb_query_list[[target_organism]])) { + stringdb_query_list[[target_organism]] +} else { + stringdb_query_list[["default"]] +} - curr_GO_IDs <- "NA" - } +# Handle organisms which do not use the GTF's gene_id keys to map to STRING +# These are microbial species for which NCBI references were used rather than ENSEMBL, +# for which the STRING accessions match the GTF's gene_name keys, but not the gene_id keys. 
+uses_old_locus <- c("NCFM", "MMARINUMM", "ATCC27592", "UA159", "ES114") +# Handle STRING annotation processing based on the target organism +if (target_organism %in% uses_old_locus) { + # If the target organism is listed in uses_old_locus, split comma-separated OLD_LOCUS entries into separate rows + annot_stringdb <- annot_orgdb %>% + separate_rows(!!sym(stringdb_query), sep = ",", convert = TRUE) %>% + distinct() %>% + as.data.frame() +} else { + # For other organisms, collapse on the primary key + annot_stringdb <- annot_orgdb %>% distinct() + annot_stringdb <- annot_stringdb %>% + group_by(!!sym(primary_keytype)) %>% + summarise(across(everything(), ~paste(unique(na.omit(.))[unique(na.omit(.)) != ""], collapse = "|")), .groups = 'drop') %>% + as.data.frame() +} - annot[curr_row, "GOSLIM_IDS"] <- paste(curr_GO_IDs, collapse = "|") +# Replace "BSU_" with "BSU" in the primary_keytype column for BACSU before STRING mapping +if (target_organism == "BACSU") { + annot_stringdb[[stringdb_query]] <- gsub("^BSU_", "BSU", annot_stringdb[[stringdb_query]]) +} - } else { +# Map alternative taxonomy IDs for organisms not directly supported by STRING +taxid_map <- list( + YEAST = 4932, + BRARP = 51351, + ATCC27592 = 614 +) - ## For genes with multiple ENTREZ ID ## - # Note: In this scenario, the ENTREZ IDs for each gene are first split with a '|' to - # separate the IDs, then the GO slim ID(s) for each ENTREZ ID are collected and - # combined, then duplicates are removed, and the final list of GO slim IDs for - # each gene are added in a single row, separated with a '|' +# Assign the alternative taxonomy identifier if applicable +target_taxid <- if (!is.null(taxid_map[[target_organism]])) { + taxid_map[[target_organism]] +} else { + target_taxid +} - ## Split the ENTREZ IDs ## - curr_entry_vec <- strsplit(curr_entry, "|", fixed = TRUE) +# Initialize string_map +string_map <- NULL - ## Start a vector of current GO slim IDs ## - curr_GO_IDs <- vector() +# If the target organism is supported by STRING,
get STRING annotations +if (!(target_organism %in% no_stringdb)) { + string_db <- STRINGdb$new(version = "12.0", species = target_taxid, score_threshold = 0) + string_map <- string_db$map(annot_stringdb, stringdb_query, removeUnmappedRows = FALSE, takeFirst = FALSE) +} +if (!is.null(string_map)) { + annot_stringdb <- annot_stringdb %>% + group_by(!!sym(primary_keytype)) %>% + summarise(across(everything(), ~paste(unique(na.omit(.))[unique(na.omit(.)) != ""], collapse = "|")), .groups = 'drop') + + string_map <- string_map %>% + group_by(!!sym(primary_keytype)) %>% + summarise(across(everything(), ~paste(unique(na.omit(.))[unique(na.omit(.)) != ""], collapse = "|")), .groups = 'drop') +} - ## Collect and combine GO slim ID(s) for each ENTREZ ID ## - for ( curr_entry in curr_entry_vec ) { +if (!is.null(string_map)) { + # Determine the appropriate join key + join_key <- if (target_organism %in% c("NCFM", "MMARINUMM", "ATCC27592", "UA159", "ES114")) { + primary_keytype + } else { + stringdb_query + } + + # Add temporary column to add string IDs to annotation table + annot_stringdb <- annot_stringdb %>% + mutate(join_key = toupper(!!sym(join_key))) + + string_map <- string_map %>% + mutate(join_key = toupper(!!sym(join_key))) + + # Join STRING IDs to the annotation table + annot_stringdb <- left_join(annot_stringdb, string_map %>% dplyr::select(join_key, STRING_id), by = "join_key") %>% + dplyr::select(-join_key) +} - new_GO_IDs <- mapIds(PANTHER.db, keys = curr_entry, keytype = "ENTREZ", column = "GOSLIM_ID", multiVals = "list") %>% unlist() %>% as.vector() +# Undo the "BSU_" to "BSU" replacement for BACSU after STRING mapping +if (target_organism == "BACSU") { + annot_stringdb[[stringdb_query]] <- gsub("^BSU", "BSU_", annot_stringdb[[stringdb_query]]) +} - ## Add new GO slim IDs to the GO slim IDs vector ## - curr_GO_IDs <- c(curr_GO_IDs, new_GO_IDs) +annot_stringdb <- as.data.frame(annot_stringdb) +``` - } +
- ## Remove duplicate GO slim IDs ## - curr_GO_IDs <- unique(curr_GO_IDs) +--- - ## Add "NA" to the GO slim vector for ENTREZ IDs that do not contain a respective GO slim ID ## - if ( length(curr_GO_IDs) == 0 ) { +## 6. Add Gene Ontology (GO) slim IDs - curr_GO_IDs <- "NA" - } +```R +# Define organisms that do not use PANTHER annotations +no_panther_db <- c("WORM", "MMARINUMM", "ORYSJ", "MRSA252", "NCFM", "ATCC27592", "UA159", "ES114", "PA14") - ## Add additional GO slim IDs to the GOSLIM ID column in the annotation table ## - annot[curr_row, "GOSLIM_IDS"] <- paste(curr_GO_IDs, collapse = "|") +annot_pantherdb <- annot_stringdb +if (!(target_organism %in% no_panther_db)) { + + # Define the key type in the annotation table used to map to PANTHER DB + pantherdb_query = "ENTREZID" + pantherdb_keytype = "ENTREZ" + + # Retrieve target organism PANTHER GO slim annotations database + pthOrganisms(PANTHER.db) <- target_organism + + # Define a function to retrieve GO slim IDs for a given gene's ENTREZIDs, which may include entries separated by a "|" + get_go_slim_ids <- function(entrez_id) { + if (is.na(entrez_id) || entrez_id == "NA") { + return("NA") } - + + entrez_ids <- unlist(strsplit(entrez_id, "|", fixed = TRUE)) + go_ids <- lapply(entrez_ids, function(id) { + mapIds(PANTHER.db, keys = id, keytype = pantherdb_keytype, column = "GOSLIM_ID", multiVals = "list") + }) + + # Flatten the list and remove duplicates + go_ids <- unique(unlist(go_ids)) + + if (length(go_ids) == 0) { + return("NA") + } else { + return(paste(go_ids, collapse = "|")) + } + } + + # Apply the GO slim ID mapping function to all valid rows + annot_pantherdb <- annot_pantherdb %>% + mutate(GOSLIM_IDS = sapply(get(pantherdb_query), get_go_slim_ids)) } ``` @@ -396,32 +618,33 @@ for ( curr_row in 1:dim(annot)[1] ) { --- -## 6. Export Annotation Table and Build Info +## 7. 
Export Annotation Table and Build Info ```R -## BACSU-specific: revert gene IDs to originals with underscores ## -if (target_organism == "BACSU") { - annot[["ENSEMBL"]] <- annot$original_IDs - annot$original_IDs <- NULL -} +# Group by primary key to remove any remaining unjoined or duplicate rows +annot <- annot_pantherdb %>% + group_by(!!sym(primary_keytype)) %>% + summarise(across(everything(), ~paste(unique(na.omit(.))[unique(na.omit(.)) != ""], collapse = "|")), .groups = 'drop') -## Sort the annotation table based on primary keytype gene IDs ## +# Sort the annotation table based on primary keytype gene IDs annot <- annot %>% arrange(.[[1]]) -## Replacing any blank cells with NA ## -annot[annot == ""] <- NA +# Replace any blank cells with NA +annot[annot == "" | annot == "NA"] <- NA -## Export the annotation table using the file name defined in Step 1 ## +# Export the annotation table write.table(annot, out_table_filename, sep = "\t", quote = FALSE, row.names = FALSE) -## Define the date the annotation table was generated ## +# Define the date when the annotation table was generated date_generated <- format(Sys.time(), "%d-%B-%Y") -## Export annotation table build info using the file name defined in Step 1 ## +# Export annotation build information writeLines(paste(c("Based on:\n ", GL_DPPD_ID), collapse = ""), out_log_filename) -write(paste(c("Build done on:\n ", date_generated), collapse = ""), out_log_filename, append = TRUE) +write(paste(c("\nBuild done on:\n ", date_generated), collapse = ""), out_log_filename, append = TRUE) write(paste(c("\nUsed gtf file:\n ", gtf_link), collapse = ""), out_log_filename, append = TRUE) -write(paste(c("\nUsed ", ann.dbi, " version:\n ", packageVersion(ann.dbi) %>% as.character()), collapse = ""), out_log_filename, append = TRUE) +if (!(target_organism %in% no_org_db)) { + write(paste(c("\nUsed ", target_org_db, " version:\n ", packageVersion(target_org_db) %>% as.character()), collapse = ""), out_log_filename, append = 
TRUE) +} write(paste(c("\nUsed STRINGdb version:\n ", packageVersion("STRINGdb") %>% as.character()), collapse = ""), out_log_filename, append = TRUE) write(paste(c("\nUsed PANTHER.db version:\n ", packageVersion("PANTHER.db") %>% as.character()), collapse = ""), out_log_filename, append = TRUE) diff --git a/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv b/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv index 0f33b315..caa49b23 100644 --- a/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv +++ b/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv @@ -1,24 +1,24 @@ name,species,strain,ensemblVersion,ref_source,fasta,gtf,taxon,annotations,genelab_annots_link,genelab_annots_info_link -ARABIDOPSIS,Arabidopsis thaliana,,59,ensembl_plants,https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-59/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz,https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-59/gtf/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.59.gtf.gz,3702,org.At.tair.db,https://figshare.com/ndownloader/files/48166390,https://figshare.com/ndownloader/files/48166381 -BACSU,Bacillus subtilis,subsp. 
subtilis 168,59,ensembl_bacteria,https://ftp.ensemblgenomes.ebi.ac.uk/pub/bacteria/release-59/fasta/bacteria_0_collection/bacillus_subtilis_subsp_subtilis_str_168_gca_000009045/dna/Bacillus_subtilis_subsp_subtilis_str_168_gca_000009045.ASM904v1.dna.toplevel.fa.gz,https://ftp.ensemblgenomes.ebi.ac.uk/pub/bacteria/release-59/gtf/bacteria_0_collection/bacillus_subtilis_subsp_subtilis_str_168_gca_000009045/Bacillus_subtilis_subsp_subtilis_str_168_gca_000009045.ASM904v1.59.gtf.gz,224308,org.Bsubtilissubspsubtilis168.eg.db,https://figshare.com/ndownloader/files/48166384,https://figshare.com/ndownloader/files/48166387 -BRADI,Brachypodium distachyon,,59,ensembl_plants,https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-59/fasta/brachypodium_distachyon/dna/Brachypodium_distachyon.Brachypodium_distachyon_v3.0.dna.toplevel.fa.gz,https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-59/gtf/brachypodium_distachyon/Brachypodium_distachyon.Brachypodium_distachyon_v3.0.59.gtf.gz,15368,org.Bdistachyon.eg.db,https://figshare.com/ndownloader/files/48166399,https://figshare.com/ndownloader/files/48166393 +ARABIDOPSIS,Arabidopsis thaliana,,59,ensembl_plants,https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-59/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz,https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-59/gtf/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.59.gtf.gz,3702,org.At.tair.db,https://figshare.com/ndownloader/files/48354355,https://figshare.com/ndownloader/files/48354352 +BACSU,Bacillus subtilis,subsp. 
subtilis 168,59,ensembl_bacteria,https://ftp.ensemblgenomes.ebi.ac.uk/pub/bacteria/release-59/fasta/bacteria_0_collection/bacillus_subtilis_subsp_subtilis_str_168_gca_000009045/dna/Bacillus_subtilis_subsp_subtilis_str_168_gca_000009045.ASM904v1.dna.toplevel.fa.gz,https://ftp.ensemblgenomes.ebi.ac.uk/pub/bacteria/release-59/gtf/bacteria_0_collection/bacillus_subtilis_subsp_subtilis_str_168_gca_000009045/Bacillus_subtilis_subsp_subtilis_str_168_gca_000009045.ASM904v1.59.gtf.gz,224308,org.Bsubtilissubspsubtilis168.eg.db,https://figshare.com/ndownloader/files/48354346,https://figshare.com/ndownloader/files/48354349 +BRADI,Brachypodium distachyon,,59,ensembl_plants,https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-59/fasta/brachypodium_distachyon/dna/Brachypodium_distachyon.Brachypodium_distachyon_v3.0.dna.toplevel.fa.gz,https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-59/gtf/brachypodium_distachyon/Brachypodium_distachyon.Brachypodium_distachyon_v3.0.59.gtf.gz,15368,org.Bdistachyon.eg.db,https://figshare.com/ndownloader/files/48354370,https://figshare.com/ndownloader/files/48354361 BRARP,Brassica rapa,,59,ensembl_plants,http://ftp.ensemblgenomes.org/pub/plants/release-59/fasta/brassica_rapa/dna/Brassica_rapa.Brapa_1.0.dna.toplevel.fa.gz,http://ftp.ensemblgenomes.org/pub/plants/release-59/gtf/brassica_rapa/Brassica_rapa.Brapa_1.0.59.gtf.gz,,,, -WORM,Caenorhabditis elegans,,112,ensembl,https://ftp.ensembl.org/pub/release-112/fasta/caenorhabditis_elegans/dna/Caenorhabditis_elegans.WBcel235.dna.toplevel.fa.gz,https://ftp.ensembl.org/pub/release-112/gtf/caenorhabditis_elegans/Caenorhabditis_elegans.WBcel235.112.gtf.gz,6239,org.Ce.eg.db,https://figshare.com/ndownloader/files/48166402,https://figshare.com/ndownloader/files/48166396 -ZEBRAFISH,Danio 
rerio,,112,ensembl,http://ftp.ensembl.org/pub/release-112/fasta/danio_rerio/dna/Danio_rerio.GRCz11.dna.primary_assembly.fa.gz,http://ftp.ensembl.org/pub/release-112/gtf/danio_rerio/Danio_rerio.GRCz11.112.gtf.gz,7955,org.Dr.eg.db,https://figshare.com/ndownloader/files/48166414,https://figshare.com/ndownloader/files/48166405 -FLY,Drosophila melanogaster,,112,ensembl,http://ftp.ensembl.org/pub/release-112/fasta/drosophila_melanogaster/dna/Drosophila_melanogaster.BDGP6.46.dna.toplevel.fa.gz,http://ftp.ensembl.org/pub/release-112/gtf/drosophila_melanogaster/Drosophila_melanogaster.BDGP6.46.112.gtf.gz,7227,org.Dm.eg.db,https://figshare.com/ndownloader/files/48166411,https://figshare.com/ndownloader/files/48166408 +WORM,Caenorhabditis elegans,,112,ensembl,https://ftp.ensembl.org/pub/release-112/fasta/caenorhabditis_elegans/dna/Caenorhabditis_elegans.WBcel235.dna.toplevel.fa.gz,https://ftp.ensembl.org/pub/release-112/gtf/caenorhabditis_elegans/Caenorhabditis_elegans.WBcel235.112.gtf.gz,6239,org.Ce.eg.db,https://figshare.com/ndownloader/files/48354373,https://figshare.com/ndownloader/files/48354364 +ZEBRAFISH,Danio rerio,,112,ensembl,http://ftp.ensembl.org/pub/release-112/fasta/danio_rerio/dna/Danio_rerio.GRCz11.dna.primary_assembly.fa.gz,http://ftp.ensembl.org/pub/release-112/gtf/danio_rerio/Danio_rerio.GRCz11.112.gtf.gz,7955,org.Dr.eg.db,https://figshare.com/ndownloader/files/48354388,https://figshare.com/ndownloader/files/48354367 +FLY,Drosophila melanogaster,,112,ensembl,http://ftp.ensembl.org/pub/release-112/fasta/drosophila_melanogaster/dna/Drosophila_melanogaster.BDGP6.46.dna.toplevel.fa.gz,http://ftp.ensembl.org/pub/release-112/gtf/drosophila_melanogaster/Drosophila_melanogaster.BDGP6.46.112.gtf.gz,7227,org.Dm.eg.db,https://figshare.com/ndownloader/files/48354382,https://figshare.com/ndownloader/files/48354376 ERCC,,,,ThermoFisher,https://assets.thermofisher.com/TFS-Assets/LSG/manuals/ERCC92.zip,https://assets.thermofisher.com/TFS-Assets/LSG/manuals/ERCC92.zip,,,, 
-ECOLI,Escherichia coli,str. K-12 substr. MG1655,59,ensembl_bacteria,https://ftp.ensemblgenomes.ebi.ac.uk/pub/bacteria/release-59/fasta/bacteria_0_collection/escherichia_coli_str_k_12_substr_mg1655_gca_000005845/dna/Escherichia_coli_str_k_12_substr_mg1655_gca_000005845.ASM584v2.dna.toplevel.fa.gz,https://ftp.ensemblgenomes.ebi.ac.uk/pub/bacteria/release-59/gtf/bacteria_0_collection/escherichia_coli_str_k_12_substr_mg1655_gca_000005845/Escherichia_coli_str_k_12_substr_mg1655_gca_000005845.ASM584v2.59.gtf.gz,511145,org.EcolistrK12substrMG1655.eg.db,https://figshare.com/ndownloader/files/48166417,https://figshare.com/ndownloader/files/48166420 -HUMAN,Homo sapiens,,112,ensembl,https://ftp.ensembl.org/pub/release-112/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz,https://ftp.ensembl.org/pub/release-112/gtf/homo_sapiens/Homo_sapiens.GRCh38.112.gtf.gz,9606,org.Hs.eg.db,https://figshare.com/ndownloader/files/48166477,https://figshare.com/ndownloader/files/48166471 -NOENTRY_LA,Lactobacillus acidophilus,NCFM,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/011/985/GCF_000011985.1_ASM1198v1/GCF_000011985.1_ASM1198v1_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/011/985/GCF_000011985.1_ASM1198v1/GCF_000011985.1_ASM1198v1_genomic.gtf.gz,272621,,https://figshare.com/ndownloader/files/48166447,https://figshare.com/ndownloader/files/48166450 -MOUSE,Mus musculus,,112,ensembl,https://ftp.ensembl.org/pub/release-112/fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna.primary_assembly.fa.gz,https://ftp.ensembl.org/pub/release-112/gtf/mus_musculus/Mus_musculus.GRCm39.112.gtf.gz,10090,org.Mm.eg.db,https://figshare.com/ndownloader/files/48166483,https://figshare.com/ndownloader/files/48166474 -NOENTRY_MM,Mycobacterium 
marinum,M,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/018/345/GCF_000018345.1_ASM1834v1/GCF_000018345.1_ASM1834v1_genomic.gtf.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/018/345/GCF_000018345.1_ASM1834v1/GCF_000018345.1_ASM1834v1_genomic.gtf.gz,216594,,https://figshare.com/ndownloader/files/48166459,https://figshare.com/ndownloader/files/48166462 -ORYSJ,Oryza sativa,Japonica,59,ensembl_plants,https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-59/fasta/oryza_sativa/dna/Oryza_sativa.IRGSP-1.0.dna.toplevel.fa.gz,https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-59/gtf/oryza_sativa/Oryza_sativa.IRGSP-1.0.59.gtf.gz,39947,,https://figshare.com/ndownloader/files/48166480,https://figshare.com/ndownloader/files/48166486 -ORYLA,Oryzias latipes,,112,ensembl,http://ftp.ensembl.org/pub/release-112/fasta/oryzias_latipes/dna/Oryzias_latipes.ASM223467v1.dna.toplevel.fa.gz,http://ftp.ensembl.org/pub/release-112/gtf/oryzias_latipes/Oryzias_latipes.ASM223467v1.112.gtf.gz,8090,org.Olatipes.eg.db,https://figshare.com/ndownloader/files/48166492,https://figshare.com/ndownloader/files/48166489 -PSEAE,Pseudomonas aeruginosa,UCBPP-PA14,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/014/625/GCF_000014625.1_ASM1462v1/GCF_000014625.1_ASM1462v1_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/014/625/GCF_000014625.1_ASM1462v1/GCF_000014625.1_ASM1462v1_genomic.gtf.gz,208963,,https://figshare.com/ndownloader/files/48166453,https://figshare.com/ndownloader/files/48166456 -RAT,Rattus norvegicus,,112,ensembl,http://ftp.ensembl.org/pub/release-112/fasta/rattus_norvegicus/dna/Rattus_norvegicus.mRatBN7.2.dna.toplevel.fa.gz,http://ftp.ensembl.org/pub/release-112/gtf/rattus_norvegicus/Rattus_norvegicus.mRatBN7.2.112.gtf.gz,10116,org.Rn.eg.db,https://figshare.com/ndownloader/files/48166501,https://figshare.com/ndownloader/files/48166495 -YEAST,Saccharomyces 
cerevisiae,S288C,112,ensembl,http://ftp.ensembl.org/pub/release-112/fasta/saccharomyces_cerevisiae/dna/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa.gz,http://ftp.ensembl.org/pub/release-112/gtf/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.112.gtf.gz,559292,org.Sc.sgd.db,https://figshare.com/ndownloader/files/48166498,https://figshare.com/ndownloader/files/48166504 -SALTY,Salmonella enterica,serovar Typhimurium str. LT2,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/006/945/GCF_000006945.2_ASM694v2/GCF_000006945.2_ASM694v2_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/006/945/GCF_000006945.2_ASM694v2/GCF_000006945.2_ASM694v2_genomic.gtf.gz,99287,org.SentericaserovarTyphimuriumstrLT2.eg.db,https://figshare.com/ndownloader/files/48166423,https://figshare.com/ndownloader/files/48166426 -NOENTRY_SL,Serratia liquefaciens,ATCC 27592,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/422/085/GCF_000422085.1_ASM42208v1/GCF_000422085.1_ASM42208v1_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/422/085/GCF_000422085.1_ASM42208v1/GCF_000422085.1_ASM42208v1_genomic.gtf.gz,1346614,,https://figshare.com/ndownloader/files/48166465,https://figshare.com/ndownloader/files/48166468 -NOENTRY_SA,Staphylococcus aureus,MRSA252,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/011/505/GCF_000011505.1_ASM1150v1/GCF_000011505.1_ASM1150v1_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/011/505/GCF_000011505.1_ASM1150v1/GCF_000011505.1_ASM1150v1_genomic.gtf.gz,282458,,https://figshare.com/ndownloader/files/48166435,https://figshare.com/ndownloader/files/48166438 -NOENTRY_SM,Streptococcus 
mutans,UA159,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/007/465/GCF_000007465.2_ASM746v2/GCF_000007465.2_ASM746v2_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/007/465/GCF_000007465.2_ASM746v2/GCF_000007465.2_ASM746v2_genomic.gtf.gz,210007,,https://figshare.com/ndownloader/files/48166429,https://figshare.com/ndownloader/files/48166432 -NOENTRY_VF,Vibrio fischeri,ES114,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/011/805/GCF_000011805.1_ASM1180v1/GCF_000011805.1_ASM1180v1_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/011/805/GCF_000011805.1_ASM1180v1/GCF_000011805.1_ASM1180v1_genomic.gtf.gz,312309,,https://figshare.com/ndownloader/files/48166441,https://figshare.com/ndownloader/files/48166444 \ No newline at end of file +ECOLI,Escherichia coli,str. K-12 substr. MG1655,59,ensembl_bacteria,https://ftp.ensemblgenomes.ebi.ac.uk/pub/bacteria/release-59/fasta/bacteria_0_collection/escherichia_coli_str_k_12_substr_mg1655_gca_000005845/dna/Escherichia_coli_str_k_12_substr_mg1655_gca_000005845.ASM584v2.dna.toplevel.fa.gz,https://ftp.ensemblgenomes.ebi.ac.uk/pub/bacteria/release-59/gtf/bacteria_0_collection/escherichia_coli_str_k_12_substr_mg1655_gca_000005845/Escherichia_coli_str_k_12_substr_mg1655_gca_000005845.ASM584v2.59.gtf.gz,511145,org.EcolistrK12substrMG1655.eg.db,https://figshare.com/ndownloader/files/48354379,https://figshare.com/ndownloader/files/48354394 +HUMAN,Homo sapiens,,112,ensembl,https://ftp.ensembl.org/pub/release-112/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz,https://ftp.ensembl.org/pub/release-112/gtf/homo_sapiens/Homo_sapiens.GRCh38.112.gtf.gz,9606,org.Hs.eg.db,https://figshare.com/ndownloader/files/48354445,https://figshare.com/ndownloader/files/48354448 +NCFM,Lactobacillus 
acidophilus,NCFM,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/011/985/GCF_000011985.1_ASM1198v1/GCF_000011985.1_ASM1198v1_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/011/985/GCF_000011985.1_ASM1198v1/GCF_000011985.1_ASM1198v1_genomic.gtf.gz,272621,,https://figshare.com/ndownloader/files/48354424,https://figshare.com/ndownloader/files/48354415 +MOUSE,Mus musculus,,112,ensembl,https://ftp.ensembl.org/pub/release-112/fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna.primary_assembly.fa.gz,https://ftp.ensembl.org/pub/release-112/gtf/mus_musculus/Mus_musculus.GRCm39.112.gtf.gz,10090,org.Mm.eg.db,https://figshare.com/ndownloader/files/48354460,https://figshare.com/ndownloader/files/48354457 +MMARINUMM,Mycobacterium marinum,M,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/018/345/GCF_000018345.1_ASM1834v1/GCF_000018345.1_ASM1834v1_genomic.gtf.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/018/345/GCF_000018345.1_ASM1834v1/GCF_000018345.1_ASM1834v1_genomic.gtf.gz,216594,,https://figshare.com/ndownloader/files/48354433,https://figshare.com/ndownloader/files/48354430 +ORYSJ,Oryza sativa,Japonica,59,ensembl_plants,https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-59/fasta/oryza_sativa/dna/Oryza_sativa.IRGSP-1.0.dna.toplevel.fa.gz,https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-59/gtf/oryza_sativa/Oryza_sativa.IRGSP-1.0.59.gtf.gz,39947,,https://figshare.com/ndownloader/files/48354451,https://figshare.com/ndownloader/files/48354454 +ORYLA,Oryzias latipes,,112,ensembl,http://ftp.ensembl.org/pub/release-112/fasta/oryzias_latipes/dna/Oryzias_latipes.ASM223467v1.dna.toplevel.fa.gz,http://ftp.ensembl.org/pub/release-112/gtf/oryzias_latipes/Oryzias_latipes.ASM223467v1.112.gtf.gz,8090,org.Olatipes.eg.db,https://figshare.com/ndownloader/files/48354463,https://figshare.com/ndownloader/files/48354466 +PA14,Pseudomonas 
aeruginosa,UCBPP-PA14,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/014/625/GCF_000014625.1_ASM1462v1/GCF_000014625.1_ASM1462v1_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/014/625/GCF_000014625.1_ASM1462v1/GCF_000014625.1_ASM1462v1_genomic.gtf.gz,208963,,https://figshare.com/ndownloader/files/48354421,https://figshare.com/ndownloader/files/48354427 +RAT,Rattus norvegicus,,112,ensembl,http://ftp.ensembl.org/pub/release-112/fasta/rattus_norvegicus/dna/Rattus_norvegicus.mRatBN7.2.dna.toplevel.fa.gz,http://ftp.ensembl.org/pub/release-112/gtf/rattus_norvegicus/Rattus_norvegicus.mRatBN7.2.112.gtf.gz,10116,org.Rn.eg.db,https://figshare.com/ndownloader/files/48354472,https://figshare.com/ndownloader/files/48354475 +YEAST,Saccharomyces cerevisiae,S288C,112,ensembl,http://ftp.ensembl.org/pub/release-112/fasta/saccharomyces_cerevisiae/dna/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa.gz,http://ftp.ensembl.org/pub/release-112/gtf/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.112.gtf.gz,559292,org.Sc.sgd.db,https://figshare.com/ndownloader/files/48354469,https://figshare.com/ndownloader/files/48354478 +SALTY,Salmonella enterica,serovar Typhimurium str. 
LT2,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/006/945/GCF_000006945.2_ASM694v2/GCF_000006945.2_ASM694v2_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/006/945/GCF_000006945.2_ASM694v2/GCF_000006945.2_ASM694v2_genomic.gtf.gz,99287,org.SentericaserovarTyphimuriumstrLT2.eg.db,https://figshare.com/ndownloader/files/48354385,https://figshare.com/ndownloader/files/48354391 +ATCC27592,Serratia liquefaciens,ATCC 27592,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/422/085/GCF_000422085.1_ASM42208v1/GCF_000422085.1_ASM42208v1_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/422/085/GCF_000422085.1_ASM42208v1/GCF_000422085.1_ASM42208v1_genomic.gtf.gz,1346614,,https://figshare.com/ndownloader/files/48354436,https://figshare.com/ndownloader/files/48354439 +MRSA252,Staphylococcus aureus,MRSA252,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/011/505/GCF_000011505.1_ASM1150v1/GCF_000011505.1_ASM1150v1_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/011/505/GCF_000011505.1_ASM1150v1/GCF_000011505.1_ASM1150v1_genomic.gtf.gz,282458,,https://figshare.com/ndownloader/files/48354403,https://figshare.com/ndownloader/files/48354409 +UA159,Streptococcus mutans,UA159,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/007/465/GCF_000007465.2_ASM746v2/GCF_000007465.2_ASM746v2_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/007/465/GCF_000007465.2_ASM746v2/GCF_000007465.2_ASM746v2_genomic.gtf.gz,210007,,https://figshare.com/ndownloader/files/48354397,https://figshare.com/ndownloader/files/48354406 +ES114,Vibrio fischeri,ES114,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/011/805/GCF_000011805.1_ASM1180v1/GCF_000011805.1_ASM1180v1_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/011/805/GCF_000011805.1_ASM1180v1/GCF_000011805.1_ASM1180v1_genomic.gtf.gz,312309,,https://figshare.com/ndownloader/files/48354412,https://figshare.com/ndownloader/files/48354418 \ No newline at end 
of file
diff --git a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/CHANGELOG.md b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/CHANGELOG.md
index b6da06a8..87afed3a 100644
--- a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/CHANGELOG.md
+++ b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/CHANGELOG.md
@@ -23,16 +23,17 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
   - Staphylococcus aureus MRSA252
   - Streptococcus mutans UA159
   - Vibrio fischeri ES114
-- Added AnnotationForge helper script install-annot-dbi.R to create organism-specific annotation packages (org.*.eg.db) in R if not available on Bioconductor. Used for:
+- Added AnnotationForge helper script install-org-db.R to create organism-specific annotation packages (org.*.eg.db) in R if not available on Bioconductor. Used for:
   - Bacillus subtilis, subsp. subtilis 168
-  - Salmonella enterica subsp. enterica serovar Typhimurium str. LT2
+  - Brachypodium distachyon
   - Escherichia coli,str. K-12 substr. MG1655
   - Oryzias latipes
-- Added NCBI as a source for GASTA and GTF files
+  - Salmonella enterica subsp. enterica serovar Typhimurium str. LT2
+- Added NCBI as a source for FASTA and GTF files
 
 ### Fixed
 
-- Fixed automated processing for ECOLI
+- Fixed processing for ECOLI
 
 ### Changed
 
diff --git a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md
index ec6c1d89..865a0da9 100644
--- a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md
+++ b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md
@@ -82,10 +82,10 @@ Rscript GL-DPPD-7110-A_build-genome-annots-tab.R MOUSE
 
 ### 5. Run the annotations database creation function as a stand-alone script
 
-When the workflow is run, if the reference table does not specify an annotations database for the target_organism in the `annotations` column, the `install_annotations` function, defined in the `install-annot-dbi.R` script, will be executed. This script will locally create and install an annotations database R package using AnnotationForge. This function can also be run as a stand-alone script from the command line:
+When the workflow is run, if the reference table does not specify an annotations database for the target_organism in the `annotations` column, the `install_annotations` function, defined in the `install-org-db.R` script, will be executed. This script will locally create and install an annotations database R package using AnnotationForge. This function can also be run as a stand-alone script from the command line:
 
 ```bash
-Rscript install-annot-dbi.R BACSU /path/to/GL-DPPD-7110-A_annotations.csv
+Rscript install-org-db.R BACSU /path/to/GL-DPPD-7110-A_annotations.csv
 ```
 
 **Input data:**
diff --git a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/GL-DPPD-7110-A_build-genome-annots-tab.R b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/GL-DPPD-7110-A_build-genome-annots-tab.R
index 8f098db4..134195f0 100644
--- a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/GL-DPPD-7110-A_build-genome-annots-tab.R
+++ b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/GL-DPPD-7110-A_build-genome-annots-tab.R
@@ -1,431 +1,515 @@
 #!/usr/bin/env Rscript
+# Written by Mike Lee
+# GeneLab script for generating organism-specific gene annotation tables
+# Example usage: Rscript GL-DPPD-7110-A_build-genome-annots-tab.R MOUSE
-# Written by Mike Lee
-# GeneLab script for generating organism ENSEMBL annotation tables
-# Example usage: Rscript GL-DPPD-7110_build-genome-annots-tab.R MOUSE
-
+# Define variables associated with current pipeline and annotation table versions
 GL_DPPD_ID <- "GL-DPPD-7110-A"
+ref_tab_path <- "https://raw.githubusercontent.com/nasa/GeneLab_Data_Processing/master/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv"
+readme_path <- "https://github.com/nasa/GeneLab_Data_Processing/tree/master/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md"
-#########################################################################
-############### Pull In and Check Command Line Arguments ################
-#########################################################################
+# List currently supported organisms
+currently_accepted_orgs <- c("ARABIDOPSIS", "BACSU", "BRADI", "WORM", "ZEBRAFISH",
+                             "FLY", "ECOLI", "HUMAN", "NCFM", "MOUSE",
+                             "MMARINUMM", "ORYSJ", "ORYLA", "PA14", "RAT",
+                             "YEAST", "SALTY", "ATCC27592", "MRSA252", "UA159",
+                             "ES114")
-## Import command line arguments ##
+#########################################################################
+############### Pull in and check command line arguments ################
+#########################################################################
+# Pull in command-line arguments
 args <- commandArgs(trailingOnly = TRUE)
-## Define currently acceptable input organisms (matching names in ref organisms.csv table) ##
-
-currently_accepted_orgs <- c("ARABIDOPSIS",
-                             "FLY",
-                             "HUMAN",
-                             "MOUSE",
-                             "RAT",
-                             "WORM",
-                             "YEAST",
-                             "ZEBRAFISH",
-                             "BACSU",
-                             "ECOLI",
-                             "ORYLA")
-
-## Check that at least one positional command line argument was provided ##
-
-if ( length(args) < 1 ) {
-    cat("\n One positional argument is required that specifies the target organism. Currently available include:\n")
-
-    for ( item in currently_accepted_orgs ) {
-
-        cat(paste0("\n ", item))
-    }
-
-    cat("\n\n")
-
-    quit()
-
-} else {
-
-    suppressWarnings(target_organism <- toupper(args[1]))
-
+# Get the target organism (CLI argument 1) and check that it is listed in currently_accepted_orgs
+validate_arguments <- function(args, supported_orgs) {
+  if (length(args) < 1) {
+    stop("One positional argument is required that specifies the target organism. Available options are:\n", paste(supported_orgs, collapse = ", "))
+  }
+  target_organism <- toupper(args[1])
+  if (!target_organism %in% supported_orgs) {
+    stop(paste0("'", target_organism, "' is not currently supported."))
+  }
+  return(target_organism)
 }
+target_organism <- validate_arguments(args, currently_accepted_orgs)
-## Check that the positional argument provided is acceptable ##
-
-if (!target_organism %in% currently_accepted_orgs) {
+# If provided, get the reference table URL from CLI arguments (CLI argument 2) and update ref_tab_path
+ref_tab_path <- if (length(args) >= 2) args[2] else ref_tab_path
-    cat(paste0("\n '", args[1], "' is not currently supported. \n"))
-    cat(" Creation of this annotation table will likely involve manual processing.\n\n")
-    quit()
-
-}
-
-
-## checking for required packages other than the org-specific db ##
+#########################################################################
+######################## Set up environment #############################
+#########################################################################
-# helper function for pointing to GL setup page if missing a package
+required_packages <- c("tidyverse", "STRINGdb", "PANTHER.db", "rtracklayer")
+# Check for required packages other than the org-specific db #
 report_package_needed <- function(package_name) {
-    cat(paste0("\n The package '", package_name, "' is required. Please see:\n"))
-    cat(" https://github.com/nasa/GeneLab_Data_Processing/tree/master/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable/README.md\n\n")
-    quit()
+  cat(paste0("\n The package '", package_name, "' is required. Please see:\n"))
+  cat(" https://github.com/nasa/GeneLab_Data_Processing/tree/master/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md\n\n")
+  quit()
 }
-# checking and reporting
-if (!requireNamespace("tidyverse", quietly = TRUE))
-    report_package_needed("tidyverse")
-
-if (!requireNamespace("BiocManager", quietly = TRUE))
-    report_package_needed("BiocManager")
-
-if (!requireNamespace("STRINGdb", quietly = TRUE))
-    report_package_needed("STRINGdb")
-
-if (!requireNamespace("PANTHER.db", quietly = TRUE))
-    report_package_needed("PANTHER.db")
-
-if (!requireNamespace("rtracklayer", quietly = TRUE))
-    report_package_needed("rtracklayer")
-
-#########################################################################
-######################## Set Up Environment #############################
-#########################################################################
-
-## Import libraries ##
+# Check and report missing packages other than the org-specific db
+for (pkg in required_packages) {
+  if (!requireNamespace(pkg, quietly = TRUE)) {
+    report_package_needed(pkg)
+  }
+}
+# Import libraries
 library(tidyverse)
 library(STRINGdb)
 library(PANTHER.db)
 library(rtracklayer)
-# Set the primary key type based on the target organism
-if (target_organism == "ARABIDOPSIS") {
-    primary_keytype <- "TAIR"
-} else {
-    primary_keytype <- "ENSEMBL"
-}
-
-## Define annotation keys to retrieve ##
-
-wanted_keys_vec <- c("SYMBOL", "GENENAME", "REFSEQ", "ENTREZID")
-
-## Define links to tables containing species-specific annotation info ##
-
-if (length(args) >= 2) {
-    ref_tab_link <- args[2]
-} else {
-    ref_tab_link <- "https://raw.githubusercontent.com/nasa/GeneLab_Data_Processing/master/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv"
-}
-
 #########################################################################
-############## Define Variables and Output File Names ###################
+############## Define variables and output file names ###################
 #########################################################################
-
-## Set timeout time to ensure annotation file downloads will complete ##
-
+# Set timeout time to ensure annotation file downloads will complete
 options(timeout = 600)
-## Read in tables containing species-specific annotation info ##
-
-ref_table <- read.csv(ref_tab_link)
-
-## Retrieve and define target organism taxid, annotation database name, and scientific name ##
-
-target_taxid <- ref_table %>%
-    filter(name == target_organism) %>%
-    pull(taxon)
-
-target_org_db <- ref_table %>%
-    filter(name == target_organism) %>%
-    pull(annotations)
-
-target_species_designation <- ref_table %>%
-    filter(name == target_organism) %>%
-    pull(species)
+ref_table <- tryCatch(
+  read.csv(ref_tab_path),
+  error = function(e) {
+    message <- paste("Error: Unable to read the reference table from the path provided. 
Please check the path and try again.\nPath:", ref_tab_path) + stop(message) + } +) -## Define link to Ensembl annotation gtf file for the target organism ## +# Get target organism information +target_info <- ref_table %>% + filter(name == target_organism) -gtf_link <- ref_table %>% - filter(species == target_species_designation) %>% - pull(gtf) +# Extract the relevant columns from the reference table +target_taxid <- target_info$taxon # Taxonomic identifier +target_org_db <- target_info$annotations # org.eg.db R package +target_species_designation <- target_info$species # Full species name +gtf_link <- target_info$gtf # Path to reference assembly GTF -## Create output files names ## +# Error handling for missing values +if (is.na(target_taxid) || is.na(target_org_db) || is.na(target_species_designation) || is.na(gtf_link)) { + stop(paste("Error: Missing data for target organism", target_organism, "in reference table.")) +} +# Create output filenames base_gtf_filename <- basename(gtf_link) base_output_name <- str_replace(base_gtf_filename, ".gtf.gz", "") out_table_filename <- paste0(base_output_name, "-GL-annotations.tsv") out_log_filename <- paste0(base_output_name, "-GL-build-info.txt") -## Check if output file already exists and if it does, exit without overwriting ## - +# Check if output file already exists and if it does, exit without overwriting if ( file.exists(out_table_filename) ) { + cat("\n-------------------------------------------------------------------------------------------------\n") + cat(paste0("\n The file that would be created, '", out_table_filename, "', exists already.\n")) + cat(paste0(" We don't want to overwrite it accidentally. 
Move it and run this again if wanting to proceed.\n")) + cat("\n-------------------------------------------------------------------------------------------------\n") + quit() +} - cat("\n-------------------------------------------------------------------------------------------------\n") - cat(paste0("\n The file that would be created, '", out_table_filename, "', exists already.\n")) - cat(paste0(" We don't want to overwrite it accidentally. Move it and run this again if wanting to proceed.\n")) - cat("\n-------------------------------------------------------------------------------------------------\n") - quit() +############################################# +######## Load annotation databases ######### +############################################# -} +# Set timeout time to ensure annotation file downloads will complete +options(timeout = 600) +####### GTF ########## -######################################################################### -######## Load Annotation Databases and Retrieve Unique Gene IDs ######### -######################################################################### - +# Create the GTF dataframe from its path, unique gene identities in the reference assembly are under 'gene_id' +GTF <- rtracklayer::import(gtf_link) +GTF <- data.frame(GTF) -## Import Ensembl annotation gtf file for the target organism ## +###### org.db ######## -gtf_obj <- import(gtf_link) +# Define a function to load the specified org.db package for a given target organism +install_and_load_org_db <- function(target_organism, target_org_db, ref_tab_path) { + if (!is.na(target_org_db) && target_org_db != "") { + # Attempt to install the package from Bioconductor + BiocManager::install(target_org_db, ask = FALSE) + + # Check if the package was successfully loaded + if (!requireNamespace(target_org_db, quietly = TRUE)) { + # If not, attempt to create it locally using a helper script + source("install-org-db.R") + target_org_db <- install_annotations(target_organism, 
ref_tab_path) + } + } else { + # If target_org_db is NA or empty, create it locally using the helper script + source("install-org-db.R") + target_org_db <- install_annotations(target_organism, ref_tab_path) + } + + # Load the package into the R session + library(target_org_db, character.only = TRUE) +} -## Define unique Ensembl IDs ## +# Define list of supported organisms which do not use annotations from an org.db +no_org_db <- c("NCFM", "MMARINUMM", "ORYSJ", "PA14", "ATCC27592", "MRSA252", "UA159", "ES114") -unique_IDs <- gtf_obj$gene_id %>% unique() +# Run the function unless the target_organism is in no_org_db +if (!(target_organism %in% no_org_db) && (target_organism %in% currently_accepted_orgs)) { + install_and_load_org_db(target_organism, target_org_db, ref_tab_path) +} -## Define target organism annotation database ## -ann.dbi <- target_org_db +############################################ +######## Build annotation table ############ +############################################ + +# Initialize table from GTF + +# Define GTF keys based on the target organism; gene_id contains unique gene IDs in the reference assembly.
Defaults to ENSEMBL + +gtf_keytype_mappings <- list( + ARABIDOPSIS = c(gene_id = "TAIR"), + BACSU = c(gene_id = "ENSEMBL", gene_name = "SYMBOL"), + BRADI = c(gene_id = "ENSEMBL", transcript_id = "ACCNUM"), + WORM = c(gene_id = "ENSEMBL"), + ECOLI = c(gene_id = "ENSEMBL", gene_name = "SYMBOL"), + NCFM = c(gene_id = "LOCUS", old_locus_tag = "OLD_LOCUS", gene = "SYMBOL", product = "GENENAME", Ontology_term = "GO"), + MMARINUMM = c(gene_id = "LOCUS", old_locus_tag = "OLD_LOCUS", gene = "SYMBOL", product = "GENENAME", Ontology_term = "GO"), + PA14 = c(gene_id = "LOCUS", gene = "SYMBOL", product = "GENENAME", Ontology_term = "GO"), + SALTY = c(gene_id = "ENSEMBL", db_xref = "ENTREZID"), + ATCC27592 = c(gene_id = "LOCUS", old_locus_tag = "OLD_LOCUS", gene = "SYMBOL", product = "GENENAME", Ontology_term = "GO"), + MRSA252 = c(gene_id = "LOCUS", gene = "SYMBOL", product = "GENENAME", Ontology_term = "GO"), + UA159 = c(gene_id = "LOCUS", old_locus_tag = "OLD_LOCUS", gene = "SYMBOL", product = "GENENAME", Ontology_term = "GO"), + ES114 = c(gene_id = "LOCUS", old_locus_tag = "OLD_LOCUS", gene = "SYMBOL", product = "GENENAME", Ontology_term = "GO"), + default = c(gene_id = "ENSEMBL") +) -## If ann.dbi is not null, try to install the annotations database from bioconductor, else create with install-annot-dbi.R -if (!is.na(ann.dbi) && ann.dbi != "") { - BiocManager::install(ann.dbi, ask = FALSE) - if (!requireNamespace(ann.dbi, quietly = TRUE)) { - source("install-annot-dbi.R") - ann.dbi <- install_annotations(target_organism, ref_tab_link) - } +# Get the key types for the target organism or use the default +wanted_gtf_keytypes <- if (!is.null(gtf_keytype_mappings[[target_organism]])) { + gtf_keytype_mappings[[target_organism]] } else { - source("install-annot-dbi.R") - ann.dbi <- install_annotations(target_organism, ref_tab_link) + c(gene_id = "ENSEMBL") } +# Initialize the annotation table from the GTF, keeping only the wanted_gtf_keytypes +annot_gtf <- GTF[, 
names(wanted_gtf_keytypes), drop = FALSE] +annot_gtf <- annot_gtf %>% distinct() -library(ann.dbi, character.only = TRUE) +# Rename the columns in the annot_gtf dataframe according to the key types +colnames(annot_gtf) <- wanted_gtf_keytypes +# Save the name of the primary key type (gene_id) being used +primary_keytype <- wanted_gtf_keytypes[1] -######################################################################### -######################## Build Annotation Table ######################### -######################################################################### - -## Begin annotation table using unique IDs of the primary keytype ## +# Filter out unwanted genes from the GTF -if (target_organism == "BACSU") { - gtf_df <- as.data.frame(gtf_obj) - # Create a dataframe with unique gene_ids - annot <- gtf_df %>% - dplyr::select(gene_id, gene_name) %>% - distinct(gene_id, .keep_all = TRUE) - colnames(annot) <- c(primary_keytype, "SYMBOL") -} else { - annot <- data.frame(unique_IDs) - colnames(annot) <- primary_keytype -} - -# If organism is BACSU, remove underscores from gene_ids that are present in the GTF -if (target_organism == "BACSU") { - # Create a mapping of original and modified gene IDs - annot$original_IDs <- annot[[primary_keytype]] - annot[[primary_keytype]] <- gsub("_", "", annot[[primary_keytype]]) -} +# Define filtering criteria for specific organisms +filter_criteria <- list( + BACSU = "^BSU", + FLY = "^RR", + YEAST = "^Y[A-Z0-9]{6}-?[A-Z]?$", + ECOLI = "^b[0-9]{4}$" +) -## Add additional annotation keys as table columns ## +# Apply the filter if there's a specific criterion for the target organism +filter_pattern <- filter_criteria[[target_organism]] -for ( key in wanted_keys_vec ) { - - if ( key %in% columns(eval(parse(text = ann.dbi), env = .GlobalEnv))) { - - if (target_organism == "BACSU") { - new_list <- mapIds(eval(parse(text = ann.dbi), env = .GlobalEnv), keys = annot[["SYMBOL"]], keytype = "SYMBOL", column = key, multiVals = "list") - } else 
if (target_organism == "ECOLI") { - new_list <- mapIds(eval(parse(text = ann.dbi), env = .GlobalEnv), keys = unique_IDs, keytype = "ALIAS", column = key, multiVals = "list") - } else { new_list <- mapIds(eval(parse(text = ann.dbi), env = .GlobalEnv), keys = unique_IDs, keytype = primary_keytype, column = key, multiVals = "list") - } - annot[[key]] <- sapply(new_list, paste, collapse = "|") - +if (!is.null(filter_pattern)) { + if (target_organism == "FLY") { + annot_gtf <- annot_gtf %>% filter(!grepl(filter_pattern, !!sym(primary_keytype))) } else { - # if the annotation DB didn't have any of the wanted key types, that column will be missing - # adding in here as an empty column - annot[key] <- NA + annot_gtf <- annot_gtf %>% filter(grepl(filter_pattern, !!sym(primary_keytype))) } } +# Remove "GeneID:" prefixes from ENTREZ IDs +if (target_organism == "SALTY") { + annot_gtf <- annot_gtf %>% dplyr::mutate(ENTREZID = gsub("^GeneID:", "", ENTREZID)) %>% as.data.frame +} ######################################################################### -########################### Add STRING IDs ############################## +########################### Add org.db keys ############################# ######################################################################### -## Retrieve target organism STRING protein-protein interaction database and create STRING ID map to the primary keytype ## -# for some organisms, the taxonid is not supported by STRING.
-taxid_map <- list( - YEAST = 4932 +# Define the initial keys to pull from the organism-specific database +orgdb_keytypes_list <- list( + BRADI = c("GENENAME", "REFSEQ", "ENTREZID"), + ECOLI = c("GENENAME", "REFSEQ", "ENTREZID"), + WORM = c("SYMBOL", "GENENAME", "REFSEQ", "ENTREZID", "GO"), + SALTY = c("SYMBOL", "GENENAME", "REFSEQ"), + YEAST = c("GENENAME", "ALIAS", "REFSEQ", "ENTREZID"), + default = c("SYMBOL", "GENENAME", "REFSEQ", "ENTREZID") ) -# Assign the tax ID based on the target organism -if (target_organism %in% names(taxid_map)) { - target_taxid <- taxid_map[[target_organism]] +# Add entries for organisms in no_org_db as character(0) (no keys wanted from the org.db) +for (organism in no_org_db) { + orgdb_keytypes_list[[organism]] <- character(0) } - -## Remove gtf object to conserve RAM, since it is no longer needed ## -rm(gtf_obj) - -string_db <- STRINGdb$new(version = "12.0", species = target_taxid, score_threshold = 0) -string_map <- string_db$map(annot, primary_keytype, removeUnmappedRows = FALSE, takeFirst = FALSE) - - -## Adding some blank lines just for spacing on print-out ## -cat("\n\n") - -## Create a table using the gene IDs of the primary keytype as row names and a column containing STRING IDs. 
For genes containing multiple STRING IDs, combine all STRING IDs for each gene into one row and separate each ID with a '|' ## - -tab_with_multiple_STRINGids_combined <- - data.frame(row.names = annot[[primary_keytype]]) - -for ( curr_gene_ID in row.names(tab_with_multiple_STRINGids_combined) ) { - - curr_STRING_ids <- string_map %>% - filter(!!rlang::sym(primary_keytype) == curr_gene_ID) %>% - pull(STRING_id) %>% paste(collapse = "|") - - tab_with_multiple_STRINGids_combined[curr_gene_ID, "STRING_id"] <- curr_STRING_ids - +wanted_org_db_keytypes <- if (target_organism %in% names(orgdb_keytypes_list)) { + orgdb_keytypes_list[[target_organism]] +} else { + orgdb_keytypes_list[["default"]] } -## Move the primary keytype gene IDs back to being a column in the STRING ID table (since they were switched to row names above) ## - -tab_with_multiple_STRINGids_combined <- - tab_with_multiple_STRINGids_combined %>% - rownames_to_column(primary_keytype) +# Define mappings for query and keytype based on target organism +orgdb_keytype_mappings <- list( + BACSU = list(query = "SYMBOL", keytype = "SYMBOL"), + BRADI = list(query = "ACCNUM", keytype = "ACCNUM"), + WORM = list(query = primary_keytype, keytype = "ENSEMBL"), + ECOLI = list(query = "SYMBOL", keytype = "SYMBOL"), + SALTY = list(query = "ENTREZID", keytype = "ENTREZID"), + default = list(query = primary_keytype, keytype = primary_keytype) +) -## Add the STRING ID column to the annotation table ## +# Define the orgdb_query, this is the key type that will be used to map to the org.db +orgdb_query <- if (!is.null(orgdb_keytype_mappings[[target_organism]])) { + orgdb_keytype_mappings[[target_organism]][["query"]] +} else { + orgdb_keytype_mappings[["default"]][["query"]] +} -if (target_organism == "ECOLI") { - # Add a temporary key for joining in both tables - annot <- annot %>% - mutate(join_key = toupper(ENSEMBL)) - string_map <- string_map %>% - mutate(join_key = toupper(ENSEMBL)) +# Define the orgdb_keytype, this is the 
name of the key type in the org.db +orgdb_keytype <- if (!is.null(orgdb_keytype_mappings[[target_organism]])) { + orgdb_keytype_mappings[[target_organism]][["keytype"]] +} else { + orgdb_keytype_mappings[["default"]][["keytype"]] +} - # Perform the left join using the temporary key and drop the join_key column if no longer needed - annot <- left_join(annot, string_map %>% dplyr::select(join_key, STRING_id), by = "join_key") %>% - dplyr::select(-join_key) -} else{ - annot <- left_join(annot, tab_with_multiple_STRINGids_combined, by = primary_keytype) +# Function to clean and match ACCNUM keys for BRADI +clean_and_match_accnum <- function(annot_table, org_db, query_col, keytype_col, target_column) { + # Clean the ACCNUM keys in the GTF annotations + cleaned_annot_keys <- sub("\\..*", "", annot_table[[query_col]]) + + # Retrieve and clean the org.db keys + orgdb_keys <- keys(org_db, keytype = keytype_col) + cleaned_orgdb_keys <- sub("\\..*", "", orgdb_keys) + + # Create a lookup table for matching cleaned keys to original keys + lookup_table <- setNames(orgdb_keys, cleaned_orgdb_keys) + + # Match cleaned GTF keys to original org.db keys + matched_keys <- lookup_table[cleaned_annot_keys] + + # Use the matched keys to retrieve the target annotations from org.db + mapIds(org_db, keys = matched_keys, keytype = keytype_col, column = target_column, multiVals = "list") } +# Loop through the desired key types and add annotations to the GTF table +for (keytype in wanted_org_db_keytypes) { + # Check if keytype is a valid column in the target org.db + if (keytype %in% columns(get(target_org_db, envir = .GlobalEnv))) { + if (target_organism == "BRADI" && orgdb_query == "ACCNUM") { + # For BRADI: use the clean_and_match_accnum function to map to org.db ACCNUM entries + org_matches <- clean_and_match_accnum(annot_orgdb, get(target_org_db, envir = .GlobalEnv), query_col = orgdb_query, keytype_col = orgdb_keytype, target_column = keytype) + } else { + # Default mapping for other 
organisms + org_matches <- mapIds(get(target_org_db, envir = .GlobalEnv), keys = annot_orgdb[[orgdb_query]], keytype = orgdb_keytype, column = keytype, multiVals = "list") + } + # Add the mapped annotations to the GTF table + annot_orgdb[[keytype]] <- sapply(org_matches, function(x) paste(x, collapse = "|")) + } else { + # Set column to NA if keytype is not present in org.db + annot_orgdb[[keytype]] <- NA + } +} +# For SALTY, reorder columns to match other tables +if (target_organism == "SALTY") { # Reorder columns to match others; was mismatched since ENTREZ came from GTF + annot_orgdb <- annot_orgdb[, c("ENSEMBL", "SYMBOL", "GENENAME", "REFSEQ", "ENTREZID")] +} +# For YEAST, rename GENENAME to SYMBOL and ALIAS to GENENAME +if (target_organism == "YEAST") { + colnames(annot_orgdb) <- c("ENSEMBL", "SYMBOL", "GENENAME", "REFSEQ", "ENTREZID") +} ######################################################################### -################ Add Gene Ontology (GO) slim IDs ######################## +########################### Add STRING IDs ############################## ######################################################################### +# Define organisms that do not use STRING annotations +no_stringdb <- c("PA14", "MRSA252") + +# Define the key type used for mapping to STRING +stringdb_query_list <- list( + NCFM = "OLD_LOCUS", + MMARINUMM = "OLD_LOCUS", + ATCC27592 = "OLD_LOCUS", + UA159 = "OLD_LOCUS", + ES114 = "OLD_LOCUS", + default = primary_keytype +) -## Retrieve target organism PANTHER GO slim annotations database ## +# Define the key type for mapping in STRING, using the default if necessary +stringdb_query <- if (!is.null(stringdb_query_list[[target_organism]])) { + stringdb_query_list[[target_organism]] +} else { + stringdb_query_list[["default"]] +} -pthOrganisms(PANTHER.db) <- target_organism +# Handle organisms which do not use the GTF's gene_id keys to map to STRING +# These are microbial species for which NCBI references were used rather than ENSEMBL, +# for which the
STRING accessions match the GTF's old_locus_tag keys, but not the gene_id keys. +uses_old_locus <- c("NCFM", "MMARINUMM", "ATCC27592", "UA159", "ES114") +# Handle STRING annotation processing based on the target organism +if (target_organism %in% uses_old_locus) { + # If the target organism is listed in uses_old_locus, split comma-separated OLD_LOCUS entries into separate rows + annot_stringdb <- annot_orgdb %>% + separate_rows(!!sym(stringdb_query), sep = ",", convert = TRUE) %>% + distinct() %>% + as.data.frame() +} else { + # For other organisms, collapse on the primary key + annot_stringdb <- annot_orgdb %>% distinct() + annot_stringdb <- annot_stringdb %>% + group_by(!!sym(primary_keytype)) %>% + summarise(across(everything(), ~paste(unique(na.omit(.))[unique(na.omit(.)) != ""], collapse = "|")), .groups = 'drop') %>% + as.data.frame() +} -## Use ENTREZ IDs to map genes to respective PANTHER GO slim annotation(s) ## +# Replace "BSU_" with "BSU" in the primary_keytype column for BACSU before STRING mapping +if (target_organism == "BACSU") { + annot_stringdb[[stringdb_query]] <- gsub("^BSU_", "BSU", annot_stringdb[[stringdb_query]]) +} -## Note: Since there can be none (indicated in the annotation table as "NA"), one, or multiple ENTREZ IDs for a gene, this section contains 3 distinct parts to handle each of those scenarios and create a new column in the annotation table containing the GO slim IDs ## +# Map alternative taxonomy IDs for organisms not directly supported by STRING +taxid_map <- list( + YEAST = 4932, + BRARP = 51351, + ATCC27592 = 614 +) -for ( curr_row in 1:dim(annot)[1] ) { +# Assign the alternative taxonomy identifier if applicable +target_taxid <- if (!is.null(taxid_map[[target_organism]])) { + taxid_map[[target_organism]] +} else { + target_taxid +} - curr_entry <- annot[curr_row, "ENTREZID"] +# Initialize string_map +string_map <- NULL - ## For genes without an ENTREZ ID ## - if ( curr_entry == "NA" ) { +# If the target organism is supported by STRING, get STRING
annotations +if (!(target_organism %in% no_stringdb)) { + string_db <- STRINGdb$new(version = "12.0", species = target_taxid, score_threshold = 0) + string_map <- string_db$map(annot_stringdb, stringdb_query, removeUnmappedRows = FALSE, takeFirst = FALSE) +} +if (!is.null(string_map)) { + annot_stringdb <- annot_stringdb %>% + group_by(!!sym(primary_keytype)) %>% + summarise(across(everything(), ~paste(unique(na.omit(.))[unique(na.omit(.)) != ""], collapse = "|")), .groups = 'drop') + + string_map <- string_map %>% + group_by(!!sym(primary_keytype)) %>% + summarise(across(everything(), ~paste(unique(na.omit(.))[unique(na.omit(.)) != ""], collapse = "|")), .groups = 'drop') +} - annot[curr_row, "GOSLIM_IDS"] <- "NA" +if (!is.null(string_map)) { + # Determine the appropriate join key + join_key <- if (target_organism %in% c("NCFM", "MMARINUMM", "ATCC27592", "UA159", "ES114")) { + primary_keytype + } else { + stringdb_query + } + + # Add temporary column to add string IDs to annotation table + annot_stringdb <- annot_stringdb %>% + mutate(join_key = toupper(!!sym(join_key))) + + string_map <- string_map %>% + mutate(join_key = toupper(!!sym(join_key))) + + # Join STRING IDs to the annotation table + annot_stringdb <- left_join(annot_stringdb, string_map %>% dplyr::select(join_key, STRING_id), by = "join_key") %>% + dplyr::select(-join_key) +} - } else if ( ! 
grepl("|", curr_entry, fixed = TRUE) ) { +# Undo the "BSU_" to "BSU" replacement for BACSU after STRING mapping +if (target_organism == "BACSU") { + annot_stringdb[[stringdb_query]] <- gsub("^BSU", "BSU_", annot_stringdb[[stringdb_query]]) +} - ## For genes with one ENTREZ ID ## - curr_GO_IDs <- mapIds(PANTHER.db, keys = curr_entry, keytype = "ENTREZ", column = "GOSLIM_ID", multiVals = "list") %>% unlist() %>% as.vector() +annot_stringdb <- as.data.frame(annot_stringdb) - ## Add "NA" to the GO slim column for ENTREZ IDs that do not contain a respective GO slim ID ## - if ( is.null(curr_GO_IDs) ) { +######################################################################### +################ Add Gene Ontology (GO) slim IDs ######################## +######################################################################### - curr_GO_IDs <- "NA" - } +# Define organisms that do not use PANTHER annotations +no_panther_db <- c("WORM", "MMARINUMM", "ORYSJ", "MRSA252", "NCFM", "ATCC27592", "UA159", "ES114", "PA14") - annot[curr_row, "GOSLIM_IDS"] <- paste(curr_GO_IDs, collapse = "|") +annot_pantherdb <- annot_stringdb +if (!(target_organism %in% no_panther_db)) { + + # Define the key type in the annotation table used to map to PANTHER DB + pantherdb_query = "ENTREZID" + pantherdb_keytype = "ENTREZ" + + # Retrieve target organism PANTHER GO slim annotations database + pthOrganisms(PANTHER.db) <- target_organism + + # Define a function to retrieve GO slim IDs for a given gene's ENTREZIDs, which may include entries separated by a "|" + get_go_slim_ids <- function(entrez_id) { + if (is.na(entrez_id) || entrez_id == "NA") { + return("NA") + } + + entrez_ids <- unlist(strsplit(entrez_id, "|", fixed = TRUE)) + go_ids <- lapply(entrez_ids, function(id) { + mapIds(PANTHER.db, keys = id, keytype = pantherdb_keytype, column = "GOSLIM_ID", multiVals = "list") + }) + + # Flatten the list and remove duplicates + go_ids <- unique(unlist(go_ids)) + + if (length(go_ids) == 0) { + return("NA") 
} else { - - ## For genes with multiple ENTREZ ID ## - ## Note: In this scenario, the ENTREZ IDs for each gene are first split with a '|' to separate the IDs, then the GO slim ID(s) for each ENTREZ ID are collected and combined, then duplicates are removed, and the final list of GO slim IDs for each gene are added in a single row, separated with a '|' ## - - ## Split the ENTREZ IDs ## - curr_entry_vec <- strsplit(curr_entry, "|", fixed = TRUE) - - ## Start a vector of current GO slim IDs ## - curr_GO_IDs <- vector() - - ## Collect and combine GO slim ID(s) for each ENTREZ ID ## - for ( curr_entry in curr_entry_vec ) { - - new_GO_IDs <- mapIds(PANTHER.db, keys = curr_entry, keytype = "ENTREZ", column = "GOSLIM_ID", multiVals = "list") %>% unlist() %>% as.vector() - - ## Add new GO slim IDs to the GO slim IDs vector ## - curr_GO_IDs <- c(curr_GO_IDs, new_GO_IDs) - - } - - ## Remove duplicate GO slim IDs ## - curr_GO_IDs <- unique(curr_GO_IDs) - - ## Add "NA" to the GO slim vector for ENTREZ IDs that do not contain a respective GO slim ID ## - if ( length(curr_GO_IDs) == 0 ) { - - curr_GO_IDs <- "NA" - } - - ## Add additional GO slim IDs to the GOSLIM ID column in the annotation table ## - annot[curr_row, "GOSLIM_IDS"] <- paste(curr_GO_IDs, collapse = "|") - + return(paste(go_ids, collapse = "|")) } - + } + + # Apply the GO slim ID mapping function to all valid rows + annot_pantherdb <- annot_pantherdb %>% + mutate(GOSLIM_IDS = sapply(get(pantherdb_query), get_go_slim_ids)) } ######################################################################### -############# Export Annotation Table and Build Info #################### +############# Export annotation table and build info #################### ######################################################################### -## BACSU-specific: revert gene IDs to originals with underscores ## -if (target_organism == "BACSU") { - annot[["ENSEMBL"]] <- annot$original_IDs - annot$original_IDs <- NULL -} - -## Sort the 
annotation table based on primary keytype gene IDs ## +# Group by primary key to remove any remaining unjoined or duplicate rows +annot <- annot_pantherdb %>% + group_by(!!sym(primary_keytype)) %>% + summarise(across(everything(), ~paste(unique(na.omit(.))[unique(na.omit(.)) != ""], collapse = "|")), .groups = 'drop') +# Sort the annotation table based on primary keytype gene IDs annot <- annot %>% arrange(.[[1]]) -## Replacing any blank cells with NA ## -annot[annot == ""] <- NA - -## Export the annotation table ## +# Replace any blank cells with NA +annot[annot == "" | annot == "NA"] <- NA +# Export the annotation table write.table(annot, out_table_filename, sep = "\t", quote = FALSE, row.names = FALSE) -## Define the date the annotation table was generated ## - +# Define the date when the annotation table was generated date_generated <- format(Sys.time(), "%d-%B-%Y") -## Export annotation table build info ## - +# Export annotation build information writeLines(paste(c("Based on:\n ", GL_DPPD_ID), collapse = ""), out_log_filename) write(paste(c("\nBuild done on:\n ", date_generated), collapse = ""), out_log_filename, append = TRUE) write(paste(c("\nUsed gtf file:\n ", gtf_link), collapse = ""), out_log_filename, append = TRUE) -write(paste(c("\nUsed ", ann.dbi, " version:\n ", packageVersion(ann.dbi) %>% as.character()), collapse = ""), out_log_filename, append = TRUE) +if (!(target_organism %in% no_org_db)) { + write(paste(c("\nUsed ", target_org_db, " version:\n ", packageVersion(target_org_db) %>% as.character()), collapse = ""), out_log_filename, append = TRUE) +} write(paste(c("\nUsed STRINGdb version:\n ", packageVersion("STRINGdb") %>% as.character()), collapse = ""), out_log_filename, append = TRUE) write(paste(c("\nUsed PANTHER.db version:\n ", packageVersion("PANTHER.db") %>% as.character()), collapse = ""), out_log_filename, append = TRUE) write("\n\nAll session info:\n", out_log_filename, append = TRUE) -write(capture.output(sessionInfo()), 
out_log_filename, append = TRUE) +write(capture.output(sessionInfo()), out_log_filename, append = TRUE) \ No newline at end of file diff --git a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/install-annot-dbi.R b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/install-org-db.R similarity index 99% rename from GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/install-annot-dbi.R rename to GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/install-org-db.R index e5af82f6..3374cd4b 100644 --- a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/install-annot-dbi.R +++ b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/install-org-db.R @@ -1,4 +1,4 @@ -# install-annot-dbi.R +# install-org-db.R # Function: Get annotations db from ref table. If no annotations db is defined, create the package name from genus, species, (and strain for microbes), # Try to Bioconductor install annotations db. If fail then build the package using AnnotationForge, install it into the current directory. 
From 9b2b9439725e3c4a8ebfa377d293b437a6a56c04 Mon Sep 17 00:00:00 2001 From: Alexis Torres Date: Tue, 3 Sep 2024 12:11:21 -0700 Subject: [PATCH 09/58] [GL_RefAnnotTable] Added makeOrgPackageFromNCBI to DPPD doc --- .../GL-DPPD-7110-A/GL-DPPD-7110-A.md | 95 ++++++++++++------- 1 file changed, 62 insertions(+), 33 deletions(-) diff --git a/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md b/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md index 74d2a6be..4787a76a 100644 --- a/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md +++ b/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md @@ -114,12 +114,13 @@ The default columns in the annotation table are: - [Annotation Table Build Overview with Example Commands](#annotation-table-build-overview-with-example-commands) - [0. Set Up Environment](#0-set-up-environment) - [1. Define Variables and Output File Names](#1-define-variables-and-output-file-names) - - [2. Load Annotation Databases](#2-load-annotation-databases) - - [3. Build Initial Annotation Table](#3-build-initial-annotation-table) - - [4. Add org.db Keys](#4-add-orgdb-keys) - - [5. Add STRING IDs](#5-add-string-ids) - - [6. Add Gene Ontology (GO) Slim IDs](#6-add-gene-ontology-go-slim-ids) - - [7. Export Annotation Table and Build Info](#7-export-annotation-table-and-build-info) + - [2. Create the Organism Package if it is Not Hosted by Bioconductor](#2-create-the-organism-package-if-it-is-not-hosted-by-bioconductor) + - [3. Load Annotation Databases](#3-load-annotation-databases) + - [4. Build Initial Annotation Table](#4-build-initial-annotation-table) + - [5. Add org.db Keys](#5-add-orgdb-keys) + - [6. Add STRING IDs](#6-add-string-ids) + - [7. Add Gene Ontology (GO) Slim IDs](#7-add-gene-ontology-go-slim-ids) + - [8. 
Export Annotation Table and Build Info](#8-export-annotation-table-and-build-info) @@ -234,7 +235,54 @@ if ( file.exists(out_table_filename) ) { --- -## 2. Load Annotation Databases +## 2. Create the Organism Package if it is Not Hosted by Bioconductor + +```R +# Use AnnotationForge's makeOrgPackageFromNCBI function with default settings to create the organism-specific org.db R package from available NCBI annotations + +# Try to download the org.db from Bioconductor, build it locally if installation fails +BiocManager::install(target_org_db, ask = FALSE) +if (!requireNamespace(target_org_db, quietly = TRUE)) { + tryCatch({ + # Parse organism's name in the reference table to create the org.db name (target_org_db) + genus_species <- strsplit(target_species_designation, " ")[[1]] + if (length(genus_species) < 1) { + stop("Species designation is not correctly formatted: ", target_species_designation) + } + genus <- genus_species[1] + species <- ifelse(length(genus_species) > 1, genus_species[2], "") + strain <- ref_table %>% + filter(name == target_organism) %>% + pull(strain) %>% + gsub("[^A-Za-z0-9]", "", .) + if (!is.na(strain) && strain != "") { + species <- paste0(species, strain) + } + target_org_db <- paste0("org.", substr(genus, 1, 1), species, ".eg.db") + + BiocManager::install(c("AnnotationForge", "biomaRt", "GO.db"), ask = FALSE) + library(AnnotationForge) + makeOrgPackageFromNCBI( + version = "0.1", + author = "Your Name ", + maintainer = "Your Name ", + outputDir = "./", + tax_id = target_taxid, + genus = genus, + species = species + ) + install.packages(file.path("./", target_org_db), repos = NULL, type = "source", quiet = TRUE) + cat(paste0("'", target_org_db, "' has been successfully built and installed.\n")) + }, error = function(e) { + stop("Failed to build and load the package: ", target_org_db, "\nError: ", e$message) + }) + target_org_db <- install_annotations(target_organism, ref_tab_path) +} +``` + +--- + +## 3. 
Load Annotation Databases ```R # Set timeout time to ensure annotation file downloads will complete @@ -248,27 +296,8 @@ GTF <- data.frame(GTF) ###### org.db ######## -# Define a function to load the specified org.db package for a given target organism -install_and_load_org_db <- function(target_organism, target_org_db, ref_tab_path) { - if (!is.na(target_org_db) && target_org_db != "") { - # Attempt to install the package from Bioconductor - BiocManager::install(target_org_db, ask = FALSE) - - # Check if the package was successfully loaded - if (!requireNamespace(target_org_db, quietly = TRUE)) { - # If not, attempt to create it locally using a helper script - source("install-org-db.R") - target_org_db <- install_annotations(target_organism, ref_tab_path) - } - } else { - # If target_org_db is NA or empty, create it locally using the helper script - source("install-org-db.R") - target_org_db <- install_annotations(target_organism, ref_tab_path) - } - - # Load the package into the R session - library(target_org_db, character.only = TRUE) -} +# Load the package into the R session +library(target_org_db, character.only = TRUE) # Define list of supported organisms which do not use annotations from an org.db no_org_db <- c("NCFM", "MMARINUMM", "ORYSJ", "PA14", "ATCC27592", "MRSA252", "UA159", "ES114") @@ -283,7 +312,7 @@ if (!(target_organism %in% no_org_db) && (target_organism %in% currently_accepte --- -## 3. Build Initial Annotation Table +## 4. Build Initial Annotation Table ```R # Initialize table from GTF @@ -355,7 +384,7 @@ if (target_organism == "SALTY") { --- -## 4. Add org.db Keys +## 5. Add org.db Keys ```R annot_orgdb <- annot_gtf @@ -458,7 +487,7 @@ if (target_organism == "YEAST") { --- -## 5. Add STRING IDs +## 6. Add STRING IDs ```R # Define organisms that do not use STRING annotations @@ -570,7 +599,7 @@ annot_stringdb <- as.data.frame(annot_stringdb) --- -## 6. Add Gene Ontology (GO) slim IDs +## 7. 
Add Gene Ontology (GO) slim IDs ```R # Define organisms that do not use PANTHER annotations @@ -618,7 +647,7 @@ if (!(target_organism %in% no_panther_db)) { --- -## 7. Export Annotation Table and Build Info +## 8. Export Annotation Table and Build Info ```R # Group by primary key to remove any remaining unjoined or duplicate rows From 0020c19a8194f8c2c706f0e6b77fdfc4fc9dfcc5 Mon Sep 17 00:00:00 2001 From: Alexis Torres Date: Tue, 3 Sep 2024 12:51:47 -0700 Subject: [PATCH 10/58] [GL_RefAnnotTable] Adjust DPPD doc --- .../GL-DPPD-7110-A/GL-DPPD-7110-A.md | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md b/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md index 4787a76a..44b2b7f8 100644 --- a/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md +++ b/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md @@ -151,11 +151,14 @@ The default columns in the annotation table are: > Current GeneLab annotation tables are available on [figshare](https://figshare.com/), exact links for each reference organism are provided in the [GL-DPPD-7110-A_annotations.csv](GL-DPPD-7110-A_annotations.csv) file. 
> -> **[Ensembl Reference Files](https://www.ensembl.org/index.html) Used:** +> **[Ensembl Reference Versions](https://www.ensembl.org/index.html):** > - Animals: Ensembl release 112 > - Plants: Ensembl plants release 59 -> - Bacteria: Ensembl bacteria release 59 - +> - Bacteria: Ensembl bacteria release 59 +> +> **PANTHER:** 18.0 +> +> **STRING:** 12.0 --- From 3dc8f2c03a58e8fc08034d25ab2dd7d73d1ffe1d Mon Sep 17 00:00:00 2001 From: Alexis Torres Date: Wed, 4 Sep 2024 11:28:28 -0700 Subject: [PATCH 11/58] [GL_RefAnnotTable] Change input arg to full name --- .../GL-DPPD-7110-A/GL-DPPD-7110-A.md | 185 +++++++++--------- .../GL-DPPD-7110-A_annotations.csv | 14 +- .../GL_RefAnnotTable-A/CHANGELOG.md | 1 + .../GL_RefAnnotTable-A/README.md | 8 +- .../GL-DPPD-7110-A_build-genome-annots-tab.R | 133 +++++++------ .../workflow_code/install-org-db.R | 8 +- 6 files changed, 183 insertions(+), 166 deletions(-) diff --git a/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md b/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md index 44b2b7f8..ec3528b4 100644 --- a/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md +++ b/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md @@ -70,38 +70,38 @@ The default columns in the annotation table are: - ENSEMBL (or TAIR), SYMBOL, GENENAME, REFSEQ, ENTREZID, STRING_id, GOSLIM_IDS - For organisms with FASTA and GTF files sourced from NCBI, the LOCUS, OLD_LOCUS, SYMBOL, GENENAME, and GO annotations were directly derived from the GTF file. The `GO` column contains GO terms. `OLD_LOCUS`, or `old_locus_tag` in the GTF was retained when needed to map to STRING IDs. -- Missing columns indicate the absence of corresponding data for that organism +- Missing columns indicate the absence of corresponding data for that organism. -1. **Brachypodium distachyon (BRADI)**: +1. 
**Brachypodium distachyon**: - Columns: ENSEMBL, ACCNUM, SYMBOL, GENENAME, REFSEQ, ENTREZID, STRING_id, GOSLIM_IDS > Note: GTF `transcript_id` entries were matched with `ACCNUM` keys in the `org.db` and saved as `ACCNUM` -2. **Caenorhabditis elegans (WORM)**: +2. **Caenorhabditis elegans**: - Columns: ENSEMBL, SYMBOL, GENENAME, REFSEQ, ENTREZID, STRING_id > Note: org.db ENTREZ keys did not match PANTHER ENTREZ keys so the empty `GOSLIM_IDS` column was ommitted -3. **Lactobacillus acidophilus (NCFM)**: +3. **Lactobacillus acidophilus**: - Columns: LOCUS, OLD_LOCUS, SYMBOL, GENENAME, GO, STRING_id -4. **Mycobacterium marinum (MMARINUMM)**: +4. **Mycobacterium marinum**: - Columns: LOCUS, OLD_LOCUS, SYMBOL, GENENAME, GO, STRING_id -5. **Oryza sativa Japonica (ORYSJ)**: +5. **Oryza sativa Japonica**: - Columns: ENSEMBL, STRING_id -6. **Pseudomonas aeruginosa UCBPP-PA14 (PA14)**: +6. **Pseudomonas aeruginosa UCBPP-PA14**: - Columns: LOCUS, SYMBOL, GENENAME, GO -7. **Serratia liquefaciens ATCC 27592 (ATCC27592)**: +7. **Serratia liquefaciens ATCC 27592**: - Columns: LOCUS, OLD_LOCUS, SYMBOL, GENENAME, GO, STRING_id -8. **Staphylococcus aureus MRSA252 (MRSA252)**: +8. **Staphylococcus aureus MRSA252**: - Columns: LOCUS, SYMBOL, GENENAME, GO -9. **Streptococcus mutans UA159 (UA159)**: +9. **Streptococcus mutans UA159**: - Columns: LOCUS, OLD_LOCUS, SYMBOL, GENENAME, GO, STRING_id -10. **Vibrio fischeri ES114 (ES114)**: +10. 
**Vibrio fischeri ES114**: - Columns: LOCUS, OLD_LOCUS, SYMBOL, GENENAME, GO, STRING_id --- @@ -128,22 +128,25 @@ The default columns in the annotation table are: # Software Used -| Program | Version | Relevant Links | -|:--------------|:-------:|:---------------| -| R | 4.4.0 | [https://www.r-project.org/](https://www.r-project.org/) | -| Bioconductor | 3.19.1 | [https://bioconductor.org](https://bioconductor.org) | -| tidyverse | 2.0.0 | [https://www.tidyverse.org](https://www.tidyverse.org) | -| STRINGdb | 2.16.0 | [https://www.bioconductor.org/packages/release/bioc/html/STRINGdb.html](https://www.bioconductor.org/packages/release/bioc/html/STRINGdb.html) | -| PANTHER.db | 1.0.12 | [https://bioconductor.org/packages/release/data/annotation/html/PANTHER.db.html](https://www.bioconductor.org/packages/release/data/annotation/html/PANTHER.db.html) | -| rtracklayer | 1.64.0 | [https://bioconductor.org/packages/release/bioc/html/rtracklayer.html](https://www.bioconductor.org/packages/release/bioc/html/rtracklayer.html) | -| org.At.tair.db| 3.19.1 | [https://bioconductor.org/packages/release/data/annotation/html/org.At.tair.db.html](https://www.bioconductor.org/packages/release/data/annotation/html/org.At.tair.db.html) | -| org.Ce.eg.db | 3.19.1 | [https://bioconductor.org/packages/release/data/annotation/html/org.Ce.eg.db.html](https://www.bioconductor.org/packages/release/data/annotation/html/org.Ce.eg.db.html) | -| org.Dm.eg.db | 3.19.1 | [https://bioconductor.org/packages/release/data/annotation/html/org.Dm.eg.db.html](https://www.bioconductor.org/packages/release/data/annotation/html/org.Dm.eg.db.html) | -| org.Dr.eg.db | 3.19.1 | [https://bioconductor.org/packages/release/data/annotation/html/org.Dr.eg.db.html](https://www.bioconductor.org/packages/release/data/annotation/html/org.Dr.eg.db.html) | -| org.Hs.eg.db | 3.19.1 | 
[https://bioconductor.org/packages/release/data/annotation/html/org.Hs.eg.db.html](https://www.bioconductor.org/packages/release/data/annotation/html/org.Hs.eg.db.html) | -| org.Mm.eg.db | 3.19.1 | [https://bioconductor.org/packages/release/data/annotation/html/org.Mm.eg.db.html](https://www.bioconductor.org/packages/release/data/annotation/html/org.Mm.eg.db.html) | -| org.Rn.eg.db | 3.19.1 | [https://bioconductor.org/packages/release/data/annotation/html/org.Rn.eg.db.html](https://www.bioconductor.org/packages/release/data/annotation/html/org.Rn.eg.db.html) | -| org.Sc.sgd.db | 3.19.1 | [https://bioconductor.org/packages/release/data/annotation/html/org.Sc.sgd.db.html](https://www.bioconductor.org/packages/release/data/annotation/html/org.Sc.sgd.db.html) | +| Program | Version | Relevant Links | +|:----------------|:-------:|:---------------| +| R | 4.4.0 | [https://www.r-project.org/](https://www.r-project.org/) | +| Bioconductor | 3.19.1 | [https://bioconductor.org](https://bioconductor.org) | +| tidyverse | 2.0.0 | [https://www.tidyverse.org](https://www.tidyverse.org) | +| STRINGdb | 2.16.0 | [https://www.bioconductor.org/packages/release/bioc/html/STRINGdb.html](https://www.bioconductor.org/packages/release/bioc/html/STRINGdb.html) | +| PANTHER.db | 1.0.12 | [https://bioconductor.org/packages/release/data/annotation/html/PANTHER.db.html](https://www.bioconductor.org/packages/release/data/annotation/html/PANTHER.db.html) | +| rtracklayer | 1.64.0 | [https://bioconductor.org/packages/release/bioc/html/rtracklayer.html](https://www.bioconductor.org/packages/release/bioc/html/rtracklayer.html) | +| org.At.tair.db | 3.19.1 | [https://bioconductor.org/packages/release/data/annotation/html/org.At.tair.db.html](https://www.bioconductor.org/packages/release/data/annotation/html/org.At.tair.db.html) | +| org.Ce.eg.db | 3.19.1 | 
[https://bioconductor.org/packages/release/data/annotation/html/org.Ce.eg.db.html](https://www.bioconductor.org/packages/release/data/annotation/html/org.Ce.eg.db.html) | +| org.Dm.eg.db | 3.19.1 | [https://bioconductor.org/packages/release/data/annotation/html/org.Dm.eg.db.html](https://www.bioconductor.org/packages/release/data/annotation/html/org.Dm.eg.db.html) | +| org.Dr.eg.db | 3.19.1 | [https://bioconductor.org/packages/release/data/annotation/html/org.Dr.eg.db.html](https://www.bioconductor.org/packages/release/data/annotation/html/org.Dr.eg.db.html) | +| org.Hs.eg.db | 3.19.1 | [https://bioconductor.org/packages/release/data/annotation/html/org.Hs.eg.db.html](https://www.bioconductor.org/packages/release/data/annotation/html/org.Hs.eg.db.html) | +| org.Mm.eg.db | 3.19.1 | [https://bioconductor.org/packages/release/data/annotation/html/org.Mm.eg.db.html](https://www.bioconductor.org/packages/release/data/annotation/html/org.Mm.eg.db.html) | +| org.Rn.eg.db | 3.19.1 | [https://bioconductor.org/packages/release/data/annotation/html/org.Rn.eg.db.html](https://www.bioconductor.org/packages/release/data/annotation/html/org.Rn.eg.db.html) | +| org.Sc.sgd.db | 3.19.1 | [https://bioconductor.org/packages/release/data/annotation/html/org.Sc.sgd.db.html](https://www.bioconductor.org/packages/release/data/annotation/html/org.Sc.sgd.db.html) | +| AnnotationForge | 1.46.0 | [https://bioconductor.org/packages/AnnotationForge](https://bioconductor.org/packages/AnnotationForge) | +| biomaRt | 2.60.1 | [https://bioconductor.org/packages/biomaRt](https://bioconductor.org/packages/biomaRt) | +| GO.db | 2.0.0 | [https://bioconductor.org/packages/GO.db](https://bioconductor.org/packages/GO.db) | --- @@ -157,8 +160,7 @@ The default columns in the annotation table are: > - Bacteria: Ensembl bacteria release 59 > > **PANTHER:** 18.0 -> -> **STRING:** 12.0 +> > *Note: The values in the 'name' column of [GL-DPPD-7110-A_annotations.csv](GL-DPPD-7110-A_annotations.csv) (e.g., MOUSE, 
HUMAN, ARABIDOPSIS) are derived from the short names used in PANTHER. These short names are subject to change.* --- @@ -173,11 +175,13 @@ ref_tab_path <- "https://raw.githubusercontent.com/nasa/GeneLab_Data_Processing/ readme_path <- "https://github.com/nasa/GeneLab_Data_Processing/tree/master/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md" # List currently supported organisms -currently_accepted_orgs <- c("ARABIDOPSIS", "BACSU", "BRADI", "WORM", "ZEBRAFISH", - "FLY", "ECOLI", "HUMAN", "NCFM", "MOUSE", - "MMARINUMM", "ORYSJ", "ORYLA", "PA14", "RAT", - "YEAST", "SALTY", "ATCC27592", "MRSA252", "UA159", - "ES114") +currently_accepted_orgs <- c("Arabidopsis thaliana", "Bacillus subtilis", "Brachypodium distachyon", + "Caenorhabditis elegans", "Danio rerio", "Drosophila melanogaster", + "Escherichia coli", "Homo sapiens", "Lactobacillus acidophilus", + "Mus musculus", "Mycobacterium marinum", "Oryza sativa", + "Oryzias latipes", "Pseudomonas aeruginosa", "Rattus norvegicus", + "Saccharomyces cerevisiae", "Salmonella enterica", "Serratia liquefaciens", + "Staphylococcus aureus", "Streptococcus mutans", "Vibrio fischeri") # Import libraries library(tidyverse) @@ -204,13 +208,14 @@ ref_table <- tryCatch( # Get target organism information target_info <- ref_table %>% - filter(name == target_organism) + filter(species == target_organism) # Extract the relevant columns from the reference table target_taxid <- target_info$taxon # Taxonomic identifier target_org_db <- target_info$annotations # org.eg.db R package target_species_designation <- target_info$species # Full species name gtf_link <- target_info$gtf # Path to reference assembly GTF +target_short_name <- target_info$name # PANTHER / UNIPROT short name; blank if not available # Error handling for missing values if (is.na(target_taxid) || is.na(target_org_db) || is.na(target_species_designation) || is.na(gtf_link)) { @@ -255,7 +260,7 @@ if (!requireNamespace(target_org_db, quietly = 
TRUE)) { genus <- genus_species[1] species <- ifelse(length(genus_species) > 1, genus_species[2], "") strain <- ref_table %>% - filter(name == target_organism) %>% + filter(species == target_organism) %>% pull(strain) %>% gsub("[^A-Za-z0-9]", "", .) if (!is.na(strain) && strain != "") { @@ -303,7 +308,8 @@ GTF <- data.frame(GTF) library(target_org_db, character.only = TRUE) # Define list of supported organisms which do not use annotations from an org.db -no_org_db <- c("NCFM", "MMARINUMM", "ORYSJ", "PA14", "ATCC27592", "MRSA252", "UA159", "ES114") +no_org_db <- c("Lactobacillus acidophilus", "Mycobacterium marinum", "Oryza sativa", "Pseudomonas aeruginosa", + "Serratia liquefaciens", "Staphylococcus aureus", "Streptococcus mutans", "Vibrio fischeri") # Run the function unless the target_organism is in no_org_db if (!(target_organism %in% no_org_db) && (target_organism %in% currently_accepted_orgs)) { @@ -323,20 +329,20 @@ if (!(target_organism %in% no_org_db) && (target_organism %in% currently_accepte # Define GTF keys based on the target organism; gene_id conrains unique gene IDs in the reference assembly. 
Defaults to ENSEMBL gtf_keytype_mappings <- list( - ARABIDOPSIS = c(gene_id = "TAIR"), - BACSU = c(gene_id = "ENSEMBL", gene_name = "SYMBOL"), - BRADI = c(gene_id = "ENSEMBL", transcript_id = "ACCNUM"), - WORM = c(gene_id = "ENSEMBL"), - ECOLI = c(gene_id = "ENSEMBL", gene_name = "SYMBOL"), - NCFM = c(gene_id = "LOCUS", old_locus_tag = "OLD_LOCUS", gene = "SYMBOL", product = "GENENAME", Ontology_term = "GO"), - MMARINUMM = c(gene_id = "LOCUS", old_locus_tag = "OLD_LOCUS", gene = "SYMBOL", product = "GENENAME", Ontology_term = "GO"), - PA14 = c(gene_id = "LOCUS", gene = "SYMBOL", product = "GENENAME", Ontology_term = "GO"), - SALTY = c(gene_id = "ENSEMBL", db_xref = "ENTREZID"), - ATCC27592 = c(gene_id = "LOCUS", old_locus_tag = "OLD_LOCUS", gene = "SYMBOL", product = "GENENAME", Ontology_term = "GO"), - MRSA252 = c(gene_id = "LOCUS", gene = "SYMBOL", product = "GENENAME", Ontology_term = "GO"), - UA159 = c(gene_id = "LOCUS", old_locus_tag = "OLD_LOCUS", gene = "SYMBOL", product = "GENENAME", Ontology_term = "GO"), - ES114 = c(gene_id = "LOCUS", old_locus_tag = "OLD_LOCUS", gene = "SYMBOL", product = "GENENAME", Ontology_term = "GO"), - default = c(gene_id = "ENSEMBL") + "Arabidopsis thaliana" = c(gene_id = "TAIR"), + "Bacillus subtilis" = c(gene_id = "ENSEMBL", gene_name = "SYMBOL"), + "Brachypodium distachyon" = c(gene_id = "ENSEMBL", transcript_id = "ACCNUM"), + "Caenorhabditis elegans" = c(gene_id = "ENSEMBL"), + "Escherichia coli" = c(gene_id = "ENSEMBL", gene_name = "SYMBOL"), + "Lactobacillus acidophilus" = c(gene_id = "LOCUS", old_locus_tag = "OLD_LOCUS", gene = "SYMBOL", product = "GENENAME", Ontology_term = "GO"), + "Mycobacterium marinum" = c(gene_id = "LOCUS", old_locus_tag = "OLD_LOCUS", gene = "SYMBOL", product = "GENENAME", Ontology_term = "GO"), + "Pseudomonas aeruginosa" = c(gene_id = "LOCUS", gene = "SYMBOL", product = "GENENAME", Ontology_term = "GO"), + "Salmonella enterica" = c(gene_id = "ENSEMBL", db_xref = "ENTREZID"), + "Serratia 
liquefaciens" = c(gene_id = "LOCUS", old_locus_tag = "OLD_LOCUS", gene = "SYMBOL", product = "GENENAME", Ontology_term = "GO"), + "Staphylococcus aureus" = c(gene_id = "LOCUS", gene = "SYMBOL", product = "GENENAME", Ontology_term = "GO"), + "Streptococcus mutans" = c(gene_id = "LOCUS", old_locus_tag = "OLD_LOCUS", gene = "SYMBOL", product = "GENENAME", Ontology_term = "GO"), + "Vibrio fischeri" = c(gene_id = "LOCUS", old_locus_tag = "OLD_LOCUS", gene = "SYMBOL", product = "GENENAME", Ontology_term = "GO"), + "default" = c(gene_id = "ENSEMBL") ) # Get the key types for the target organism or use the default @@ -360,17 +366,17 @@ primary_keytype <- wanted_gtf_keytypes[1] # Define filtering criteria for specific organisms filter_criteria <- list( - BACSU = "^BSU", - FLY = "^RR", - YEAST = "^Y[A-Z0-9]{6}-?[A-Z]?$", - ECOLI = "^b[0-9]{4}$" + "Bacillus subtilis" = "^BSU", + "Drosophila melanogaster" = "^RR", + "Saccharomyces cerevisiae" = "^Y[A-Z0-9]{6}-?[A-Z]?$", + "Escherichia coli" = "^b[0-9]{4}$" ) # Apply the filter if there's a specific criterion for the target organism filter_pattern <- filter_criteria[[target_organism]] if (!is.null(filter_pattern)) { - if (target_organism == "FLY") { + if (target_organism == "Drosophila melanogaster") { annot_gtf <- annot_gtf %>% filter(!grepl(filter_pattern, !!sym(primary_keytype))) } else { annot_gtf <- annot_gtf %>% filter(grepl(filter_pattern, !!sym(primary_keytype))) @@ -378,7 +384,7 @@ if (!is.null(filter_pattern)) { } # Remove "Gene:" labels on ENTREZ IDs -if (target_organism == "SALTY") { +if (target_organism == "Salmonella enterica") { annot_gtf <- annot_gtf %>% dplyr::mutate(ENTREZID = gsub("^GeneID:", "", ENTREZID)) %>% as.data.frame } ``` @@ -394,12 +400,12 @@ annot_orgdb <- annot_gtf # Define the initial keys to pull from the organism-specific database orgdb_keytypes_list <- list( - BRADI = c("GENENAME", "REFSEQ", "ENTREZID"), - ECOLI = c("GENENAME", "REFSEQ", "ENTREZID"), - WORM = c("SYMBOL", "GENENAME", "REFSEQ", 
"ENTREZID", "GO"), - SALTY = c("SYMBOL", "GENENAME", "REFSEQ"), - YEAST = c("GENENAME", "ALIAS", "REFSEQ", "ENTREZID"), - default = c("SYMBOL", "GENENAME", "REFSEQ", "ENTREZID") + "Brachypodium distachyon" = c("GENENAME", "REFSEQ", "ENTREZID"), + "Escherichia coli" = c("GENENAME", "REFSEQ", "ENTREZID"), + "Caenorhabditis elegans" = c("SYMBOL", "GENENAME", "REFSEQ", "ENTREZID", "GO"), + "Salmonella enterica" = c("SYMBOL", "GENENAME", "REFSEQ"), + "Saccharomyces cerevisiae" = c("GENENAME", "ALIAS", "REFSEQ", "ENTREZID"), + "default" = c("SYMBOL", "GENENAME", "REFSEQ", "ENTREZID") ) # Add entries for organisms in no_org_db as character(0) (no keys wanted from the org.db) @@ -415,12 +421,12 @@ wanted_org_db_keytypes <- if (target_organism %in% names(orgdb_keytypes_list)) { # Define mappings for query and keytype based on target organism orgdb_keytype_mappings <- list( - BACSU = list(query = "SYMBOL", keytype = "SYMBOL"), - BRADI = list(query = "ACCNUM", keytype = "ACCNUM"), - WORM = list(query = primary_keytype, keytype = "ENSEMBL"), - ECOLI = list(query = "SYMBOL", keytype = "SYMBOL"), - SALTY = list(query = "ENTREZID", keytype = "ENTREZID"), - default = list(query = primary_keytype, keytype = primary_keytype) + "Bacillus subtilis" = list(query = "SYMBOL", keytype = "SYMBOL"), + "Brachypodium distachyon" = list(query = "ACCNUM", keytype = "ACCNUM"), + "Caenorhabditis elegans" = list(query = primary_keytype, keytype = "ENSEMBL"), + "Escherichia coli" = list(query = "SYMBOL", keytype = "SYMBOL"), + "Salmonella enterica" = list(query = "ENTREZID", keytype = "ENTREZID"), + "default" = list(query = primary_keytype, keytype = primary_keytype) ) # Define the orgdb_query, this is the key type that will be used to map to the org.db @@ -460,7 +466,7 @@ clean_and_match_accnum <- function(annot_table, org_db, query_col, keytype_col, for (keytype in wanted_org_db_keytypes) { # Check if keytype is a valid column in the target org.db if (keytype %in% columns(get(target_org_db, envir 
= .GlobalEnv))) { - if (target_organism == "BRADI" && orgdb_query == "ACCNUM") { + if (target_organism == "Brachypodium distachyon" && orgdb_query == "ACCNUM") { # For BRADI: use the clean_and_match_accnum function to map to org.db ACCNUM entries org_matches <- clean_and_match_accnum(annot_orgdb, get(target_org_db, envir = .GlobalEnv), query_col = orgdb_query, keytype_col = orgdb_keytype, target_column = keytype) } else { @@ -476,12 +482,12 @@ for (keytype in wanted_org_db_keytypes) { } # For SALTY, reorder columns to mtach other tables -if (target_organism == "SALTY") { # Reorder columns to match others; was mismatched since ENTREZ came from GTF +if (target_organism == "Salmonella enterica") { # Reorder columns to match others; was mismatched since ENTREZ came from GTF annot_orgdb <- annot_orgdb[, c("ENSEMBL", "SYMBOL", "GENENAME", "REFSEQ", "ENTREZID")] } # For YEAST, Rename ALIAS to GENENAME -if (target_organism == "YEAST") { +if (target_organism == "Saccharomyces cerevisiae") { colnames(annot_orgdb) <- c("ENSEMBL", "SYMBOL", "GENENAME", "REFSEQ", "ENTREZID") } ``` @@ -494,16 +500,16 @@ if (target_organism == "YEAST") { ```R # Define organisms that do not use STRING annotations -no_stringdb <- c("PA14", "MRSA252") +no_stringdb <- c("Pseudomonas aeruginosa", "Staphylococcus aureus") # Define the key type used for mapping to STRING stringdb_query_list <- list( - NCFM = "OLD_LOCUS", - MMARINUMM = "OLD_LOCUS", - ATCC27592 = "OLD_LOCUS", - UA159 = "OLD_LOCUS", - ES114 = "OLD_LOCUS", - default = primary_keytype + "Lactobacillus acidophilus" = "OLD_LOCUS", + "Mycobacterium marinum" = "OLD_LOCUS", + "Serratia liquefaciens" = "OLD_LOCUS", + "Streptococcus mutans" = "OLD_LOCUS", + "Vibrio fischeri" = "OLD_LOCUS", + "default" = primary_keytype ) # Define the key type for mapping in STRING, using the default if necessary @@ -516,7 +522,7 @@ stringdb_query <- if (!is.null(stringdb_query_list[[target_organism]])) { # Handle organisms which do not use the GTF's gene_id keys to 
map to STRING # These are microbial species for which NCBI references were used rather than ENSEMBL, # for which the STRING accessions match the GTF's gene_name keys, but not the gene_id keys. -uses_old_locus <- c("NCFM", "MMARINUMM", "ATCC27592", "UA159", "ES114") +uses_old_locus <- c("Lactobacillus acidophilus", "Mycobacterium marinum", "Serratia liquefaciens", "Streptococcus mutans", "Vibrio fischeri") # Handle STRING annotation processing based on the target organism if (target_organism %in% uses_old_locus) { # If the target organism is one of the NOENTRY organisms, handle the OLD_LOCUS splitting @@ -534,15 +540,15 @@ if (target_organism %in% uses_old_locus) { } # Replace "BSU_" with "BSU" in the primary_keytype column for BACSU before STRING mapping -if (target_organism == "BACSU") { +if (target_organism == "Bacillus subtilis") { annot_stringdb[[stringdb_query]] <- gsub("^BSU_", "BSU", annot_stringdb[[stringdb_query]]) } # Map alternative taxonomy IDs for organisms not directly supported by STRING taxid_map <- list( - YEAST = 4932, - BRARP = 51351, - ATCC27592 = 614 + "Saccharomyces cerevisiae" = 4932, + "Brassica rapa" = 51351, + "Serratia liquefaciens" = 614 ) # Assign the alternative taxonomy identifier if applicable @@ -572,7 +578,7 @@ if (!is.null(string_map)) { if (!is.null(string_map)) { # Determine the appropriate join key - join_key <- if (target_organism %in% c("NCFM", "MMARINUMM", "ATCC27592", "UA159", "ES114")) { + join_key <- if (target_organism %in% c("Lactobacillus acidophilus", "Mycobacterium marinum", "Serratia liquefaciens", "Streptococcus mutans", "Vibrio fischeri")) { primary_keytype } else { stringdb_query @@ -591,7 +597,7 @@ if (!is.null(string_map)) { } # Undo the "BSU_" to "BSU" replacement for BACSU after STRING mapping -if (target_organism == "BACSU") { +if (target_organism == "Bacillus subtilis") { annot_stringdb[[stringdb_query]] <- gsub("^BSU", "BSU_", annot_stringdb[[stringdb_query]]) } @@ -606,7 +612,7 @@ annot_stringdb <- 
as.data.frame(annot_stringdb) ```R # Define organisms that do not use PANTHER annotations -no_panther_db <- c("WORM", "MMARINUMM", "ORYSJ", "MRSA252", "NCFM", "ATCC27592", "UA159", "ES114", "PA14") +no_panther_db <- c("Caenorhabditis elegans", "Mycobacterium marinum", "Oryza sativa", "Staphylococcus aureus", "Lactobacillus acidophilus", "Serratia liquefaciens", "Streptococcus mutans", "Vibrio fischeri", "Pseudomonas aeruginosa") annot_pantherdb <- annot_stringdb @@ -616,8 +622,9 @@ if (!(target_organism %in% no_panther_db)) { pantherdb_query = "ENTREZID" pantherdb_keytype = "ENTREZ" - # Retrieve target organism PANTHER GO slim annotations database - pthOrganisms(PANTHER.db) <- target_organism + # Retrieve target organism PANTHER GO slim annotations database using the UNIPROT / PANTHER short name + target_short_name <- target_species_designation + pthOrganisms(PANTHER.db) <- target_short_name # Define a function to retrieve GO slim IDs for a given gene's ENTREZIDs, which may include entries separated by a "|" get_go_slim_ids <- function(entrez_id) { diff --git a/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv b/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv index caa49b23..0beb3696 100644 --- a/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv +++ b/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv @@ -9,16 +9,16 @@ FLY,Drosophila melanogaster,,112,ensembl,http://ftp.ensembl.org/pub/release-112/ ERCC,,,,ThermoFisher,https://assets.thermofisher.com/TFS-Assets/LSG/manuals/ERCC92.zip,https://assets.thermofisher.com/TFS-Assets/LSG/manuals/ERCC92.zip,,,, ECOLI,Escherichia coli,str. K-12 substr. 
MG1655,59,ensembl_bacteria,https://ftp.ensemblgenomes.ebi.ac.uk/pub/bacteria/release-59/fasta/bacteria_0_collection/escherichia_coli_str_k_12_substr_mg1655_gca_000005845/dna/Escherichia_coli_str_k_12_substr_mg1655_gca_000005845.ASM584v2.dna.toplevel.fa.gz,https://ftp.ensemblgenomes.ebi.ac.uk/pub/bacteria/release-59/gtf/bacteria_0_collection/escherichia_coli_str_k_12_substr_mg1655_gca_000005845/Escherichia_coli_str_k_12_substr_mg1655_gca_000005845.ASM584v2.59.gtf.gz,511145,org.EcolistrK12substrMG1655.eg.db,https://figshare.com/ndownloader/files/48354379,https://figshare.com/ndownloader/files/48354394 HUMAN,Homo sapiens,,112,ensembl,https://ftp.ensembl.org/pub/release-112/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz,https://ftp.ensembl.org/pub/release-112/gtf/homo_sapiens/Homo_sapiens.GRCh38.112.gtf.gz,9606,org.Hs.eg.db,https://figshare.com/ndownloader/files/48354445,https://figshare.com/ndownloader/files/48354448 -NCFM,Lactobacillus acidophilus,NCFM,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/011/985/GCF_000011985.1_ASM1198v1/GCF_000011985.1_ASM1198v1_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/011/985/GCF_000011985.1_ASM1198v1/GCF_000011985.1_ASM1198v1_genomic.gtf.gz,272621,,https://figshare.com/ndownloader/files/48354424,https://figshare.com/ndownloader/files/48354415 +,Lactobacillus acidophilus,NCFM,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/011/985/GCF_000011985.1_ASM1198v1/GCF_000011985.1_ASM1198v1_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/011/985/GCF_000011985.1_ASM1198v1/GCF_000011985.1_ASM1198v1_genomic.gtf.gz,272621,,https://figshare.com/ndownloader/files/48354424,https://figshare.com/ndownloader/files/48354415 MOUSE,Mus 
musculus,,112,ensembl,https://ftp.ensembl.org/pub/release-112/fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna.primary_assembly.fa.gz,https://ftp.ensembl.org/pub/release-112/gtf/mus_musculus/Mus_musculus.GRCm39.112.gtf.gz,10090,org.Mm.eg.db,https://figshare.com/ndownloader/files/48354460,https://figshare.com/ndownloader/files/48354457 -MMARINUMM,Mycobacterium marinum,M,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/018/345/GCF_000018345.1_ASM1834v1/GCF_000018345.1_ASM1834v1_genomic.gtf.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/018/345/GCF_000018345.1_ASM1834v1/GCF_000018345.1_ASM1834v1_genomic.gtf.gz,216594,,https://figshare.com/ndownloader/files/48354433,https://figshare.com/ndownloader/files/48354430 +,Mycobacterium marinum,M,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/018/345/GCF_000018345.1_ASM1834v1/GCF_000018345.1_ASM1834v1_genomic.gtf.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/018/345/GCF_000018345.1_ASM1834v1/GCF_000018345.1_ASM1834v1_genomic.gtf.gz,216594,,https://figshare.com/ndownloader/files/48354433,https://figshare.com/ndownloader/files/48354430 ORYSJ,Oryza sativa,Japonica,59,ensembl_plants,https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-59/fasta/oryza_sativa/dna/Oryza_sativa.IRGSP-1.0.dna.toplevel.fa.gz,https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-59/gtf/oryza_sativa/Oryza_sativa.IRGSP-1.0.59.gtf.gz,39947,,https://figshare.com/ndownloader/files/48354451,https://figshare.com/ndownloader/files/48354454 ORYLA,Oryzias latipes,,112,ensembl,http://ftp.ensembl.org/pub/release-112/fasta/oryzias_latipes/dna/Oryzias_latipes.ASM223467v1.dna.toplevel.fa.gz,http://ftp.ensembl.org/pub/release-112/gtf/oryzias_latipes/Oryzias_latipes.ASM223467v1.112.gtf.gz,8090,org.Olatipes.eg.db,https://figshare.com/ndownloader/files/48354463,https://figshare.com/ndownloader/files/48354466 -PA14,Pseudomonas 
aeruginosa,UCBPP-PA14,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/014/625/GCF_000014625.1_ASM1462v1/GCF_000014625.1_ASM1462v1_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/014/625/GCF_000014625.1_ASM1462v1/GCF_000014625.1_ASM1462v1_genomic.gtf.gz,208963,,https://figshare.com/ndownloader/files/48354421,https://figshare.com/ndownloader/files/48354427 +,Pseudomonas aeruginosa,UCBPP-PA14,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/014/625/GCF_000014625.1_ASM1462v1/GCF_000014625.1_ASM1462v1_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/014/625/GCF_000014625.1_ASM1462v1/GCF_000014625.1_ASM1462v1_genomic.gtf.gz,208963,,https://figshare.com/ndownloader/files/48354421,https://figshare.com/ndownloader/files/48354427 RAT,Rattus norvegicus,,112,ensembl,http://ftp.ensembl.org/pub/release-112/fasta/rattus_norvegicus/dna/Rattus_norvegicus.mRatBN7.2.dna.toplevel.fa.gz,http://ftp.ensembl.org/pub/release-112/gtf/rattus_norvegicus/Rattus_norvegicus.mRatBN7.2.112.gtf.gz,10116,org.Rn.eg.db,https://figshare.com/ndownloader/files/48354472,https://figshare.com/ndownloader/files/48354475 YEAST,Saccharomyces cerevisiae,S288C,112,ensembl,http://ftp.ensembl.org/pub/release-112/fasta/saccharomyces_cerevisiae/dna/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa.gz,http://ftp.ensembl.org/pub/release-112/gtf/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.112.gtf.gz,559292,org.Sc.sgd.db,https://figshare.com/ndownloader/files/48354469,https://figshare.com/ndownloader/files/48354478 SALTY,Salmonella enterica,serovar Typhimurium str. 
LT2,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/006/945/GCF_000006945.2_ASM694v2/GCF_000006945.2_ASM694v2_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/006/945/GCF_000006945.2_ASM694v2/GCF_000006945.2_ASM694v2_genomic.gtf.gz,99287,org.SentericaserovarTyphimuriumstrLT2.eg.db,https://figshare.com/ndownloader/files/48354385,https://figshare.com/ndownloader/files/48354391 -ATCC27592,Serratia liquefaciens,ATCC 27592,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/422/085/GCF_000422085.1_ASM42208v1/GCF_000422085.1_ASM42208v1_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/422/085/GCF_000422085.1_ASM42208v1/GCF_000422085.1_ASM42208v1_genomic.gtf.gz,1346614,,https://figshare.com/ndownloader/files/48354436,https://figshare.com/ndownloader/files/48354439 -MRSA252,Staphylococcus aureus,MRSA252,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/011/505/GCF_000011505.1_ASM1150v1/GCF_000011505.1_ASM1150v1_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/011/505/GCF_000011505.1_ASM1150v1/GCF_000011505.1_ASM1150v1_genomic.gtf.gz,282458,,https://figshare.com/ndownloader/files/48354403,https://figshare.com/ndownloader/files/48354409 -UA159,Streptococcus mutans,UA159,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/007/465/GCF_000007465.2_ASM746v2/GCF_000007465.2_ASM746v2_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/007/465/GCF_000007465.2_ASM746v2/GCF_000007465.2_ASM746v2_genomic.gtf.gz,210007,,https://figshare.com/ndownloader/files/48354397,https://figshare.com/ndownloader/files/48354406 -ES114,Vibrio fischeri,ES114,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/011/805/GCF_000011805.1_ASM1180v1/GCF_000011805.1_ASM1180v1_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/011/805/GCF_000011805.1_ASM1180v1/GCF_000011805.1_ASM1180v1_genomic.gtf.gz,312309,,https://figshare.com/ndownloader/files/48354412,https://figshare.com/ndownloader/files/48354418 \ No newline at end 
of file +,Serratia liquefaciens,ATCC 27592,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/422/085/GCF_000422085.1_ASM42208v1/GCF_000422085.1_ASM42208v1_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/422/085/GCF_000422085.1_ASM42208v1/GCF_000422085.1_ASM42208v1_genomic.gtf.gz,1346614,,https://figshare.com/ndownloader/files/48354436,https://figshare.com/ndownloader/files/48354439 +,Staphylococcus aureus,MRSA252,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/011/505/GCF_000011505.1_ASM1150v1/GCF_000011505.1_ASM1150v1_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/011/505/GCF_000011505.1_ASM1150v1/GCF_000011505.1_ASM1150v1_genomic.gtf.gz,282458,,https://figshare.com/ndownloader/files/48354403,https://figshare.com/ndownloader/files/48354409 +,Streptococcus mutans,UA159,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/007/465/GCF_000007465.2_ASM746v2/GCF_000007465.2_ASM746v2_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/007/465/GCF_000007465.2_ASM746v2/GCF_000007465.2_ASM746v2_genomic.gtf.gz,210007,,https://figshare.com/ndownloader/files/48354397,https://figshare.com/ndownloader/files/48354406 +,Vibrio fischeri,ES114,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/011/805/GCF_000011805.1_ASM1180v1/GCF_000011805.1_ASM1180v1_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/011/805/GCF_000011805.1_ASM1180v1/GCF_000011805.1_ASM1180v1_genomic.gtf.gz,312309,,https://figshare.com/ndownloader/files/48354412,https://figshare.com/ndownloader/files/48354418 \ No newline at end of file diff --git a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/CHANGELOG.md b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/CHANGELOG.md index 87afed3a..8f1a8c5f 100644 --- a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/CHANGELOG.md +++ b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/CHANGELOG.md @@ -42,6 
+42,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 - Plants: Ensembl plants release 59 - Bacteria: Ensembl bacteria release 59 - Removed org.EcK12.eg.db and replaced it with a locally created annotations database, as it is no longer available on Bioconductor +- Changed the first argument of GL-DPPD-7110-A_build-genome-annots-tab.R from the 'name' column value to the 'species' column value (e.g., 'Mus musculus' instead of 'MOUSE') ## [1.0.0](https://github.com/nasa/GeneLab_Data_Processing/releases/tag/GL_RefAnnotTable_1.0.0) diff --git a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md index 865a0da9..c19bd34d 100644 --- a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md +++ b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md @@ -66,12 +66,12 @@ chmod -R u+x *R While in the GL_RefAnnotTable workflow directory, you are now able to run the workflow. Below is an example of how to run the workflow to build an annotation table for Mus musculus (mouse): ```bash -Rscript GL-DPPD-7110-A_build-genome-annots-tab.R MOUSE +Rscript GL-DPPD-7110-A_build-genome-annots-tab.R 'Mus musculus' ``` **Input data:** -- No input files are required. Specify the target organism using a positional command line argument. `MOUSE` is used in the example above. To see a list of all available organisms, run `Rscript GL-DPPD-7110-A_build-genome-annots-tab.R` without positional arguments. The correct argument for each organism can also be found in the 'name' column of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) +- No input files are required. Specify the target organism using a positional command line argument. `Mus musculus` is used in the example above. 
To see a list of all available organisms, run `Rscript GL-DPPD-7110-A_build-genome-annots-tab.R` without positional arguments. The correct argument for each organism can also be found in the 'species' column of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) - Optional: a reference table CSV can be supplied as a second positional argument instead of using the default [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) @@ -85,12 +85,12 @@ Rscript GL-DPPD-7110-A_build-genome-annots-tab.R MOUSE When the workflow is run, if the reference table does not specify an annotations database for the target_organism in the `annotations` column, the `install_annotations` function, defined in the `install-org-db.R` script, will be executed. This script will locally create and install an annotations database R package using AnnotationForge. This function can also be run as a stand-alone script from the command line: ```bash -Rscript install-org-db.R BACSU /path/to/GL-DPPD-7110-A_annotations.csv +Rscript install-org-db.R 'Bacillus subtilis' /path/to/GL-DPPD-7110-A_annotations.csv ``` **Input data:** -- The target organism must be specified as the first positional command line argument, `BACSU` is used in the example above. The correct argument for each organism can be found in the 'name' column of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) +- The target organism must be specified as the first positional command line argument, `Bacillus subtilis` is used in the example above. 
The correct argument for each organism can be found in the 'species' column of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) - The path to a local reference table must also be supplied as the second positional argument diff --git a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/GL-DPPD-7110-A_build-genome-annots-tab.R b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/GL-DPPD-7110-A_build-genome-annots-tab.R index 134195f0..48832823 100644 --- a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/GL-DPPD-7110-A_build-genome-annots-tab.R +++ b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/GL-DPPD-7110-A_build-genome-annots-tab.R @@ -1,7 +1,7 @@ #!/usr/bin/env Rscript # Written by Mike Lee # GeneLab script for generating organism-specific gene annotation tables -# Example usage: Rscript GL-DPPD-7110-A_build-genome-annots-tab.R MOUSE +# Example usage: Rscript GL-DPPD-7110-A_build-genome-annots-tab.R 'Mus musculus' # Define variables associated with current pipeline and annotation table versions GL_DPPD_ID <- "GL-DPPD-7110-A" @@ -9,11 +9,13 @@ ref_tab_path <- "https://raw.githubusercontent.com/nasa/GeneLab_Data_Processing/ readme_path <- "https://github.com/nasa/GeneLab_Data_Processing/tree/master/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md" # List currently supported organisms -currently_accepted_orgs <- c("ARABIDOPSIS", "BACSU", "BRADI", "WORM", "ZEBRAFISH", - "FLY", "ECOLI", "HUMAN", "NCFM", "MOUSE", - "MMARINUMM", "ORYSJ", "ORYLA", "PA14", "RAT", - "YEAST", "SALTY", "ATCC27592", "MRSA252", "UA159", - "ES114") +currently_accepted_orgs <- c("Arabidopsis thaliana", "Bacillus subtilis", "Brachypodium distachyon", + "Caenorhabditis elegans", "Danio rerio", "Drosophila melanogaster", + "Escherichia coli", "Homo 
sapiens", "Lactobacillus acidophilus", + "Mus musculus", "Mycobacterium marinum", "Oryza sativa", + "Oryzias latipes", "Pseudomonas aeruginosa", "Rattus norvegicus", + "Saccharomyces cerevisiae", "Salmonella enterica", "Serratia liquefaciens", + "Staphylococcus aureus", "Streptococcus mutans", "Vibrio fischeri") ######################################################################### @@ -28,11 +30,16 @@ validate_arguments <- function(args, supported_orgs) { if (length(args) < 1) { stop("One positional argument is required that specifies the target organism. Available options are:\n", paste(supported_orgs, collapse = ", ")) } + + # Convert the first argument to uppercase target_organism <- toupper(args[1]) - if (!target_organism %in% supported_orgs) { + + # Check if the uppercased target organism is in the uppercased supported_orgs + if (!target_organism %in% sapply(supported_orgs, toupper)) { stop(paste0("'", target_organism, "' is not currently supported.")) } - return(target_organism) + + return(args[1]) } target_organism <- validate_arguments(args, currently_accepted_orgs) @@ -84,13 +91,14 @@ ref_table <- tryCatch( # Get target organism information target_info <- ref_table %>% - filter(name == target_organism) + filter(species == target_organism) # Extract the relevant columns from the reference table target_taxid <- target_info$taxon # Taxonomic identifier target_org_db <- target_info$annotations # org.eg.db R package target_species_designation <- target_info$species # Full species name gtf_link <- target_info$gtf # Path to reference assembly GTF +target_short_name <- target_info$name # PANTHER / UNIPROT short name; blank if not available # Error handling for missing values if (is.na(target_taxid) || is.na(target_org_db) || is.na(target_species_designation) || is.na(gtf_link)) { @@ -152,7 +160,8 @@ install_and_load_org_db <- function(target_organism, target_org_db, ref_tab_path } # Define list of supported organisms which do not use annotations from an org.db 
-no_org_db <- c("NCFM", "MMARINUMM", "ORYSJ", "PA14", "ATCC27592", "MRSA252", "UA159", "ES114") +no_org_db <- c("Lactobacillus acidophilus", "Mycobacterium marinum", "Oryza sativa", "Pseudomonas aeruginosa", + "Serratia liquefaciens", "Staphylococcus aureus", "Streptococcus mutans", "Vibrio fischeri") # Run the function unless the target_organism is in no_org_db if (!(target_organism %in% no_org_db) && (target_organism %in% currently_accepted_orgs)) { @@ -169,20 +178,20 @@ if (!(target_organism %in% no_org_db) && (target_organism %in% currently_accepte # Define GTF keys based on the target organism; gene_id conrains unique gene IDs in the reference assembly. Defaults to ENSEMBL gtf_keytype_mappings <- list( - ARABIDOPSIS = c(gene_id = "TAIR"), - BACSU = c(gene_id = "ENSEMBL", gene_name = "SYMBOL"), - BRADI = c(gene_id = "ENSEMBL", transcript_id = "ACCNUM"), - WORM = c(gene_id = "ENSEMBL"), - ECOLI = c(gene_id = "ENSEMBL", gene_name = "SYMBOL"), - NCFM = c(gene_id = "LOCUS", old_locus_tag = "OLD_LOCUS", gene = "SYMBOL", product = "GENENAME", Ontology_term = "GO"), - MMARINUMM = c(gene_id = "LOCUS", old_locus_tag = "OLD_LOCUS", gene = "SYMBOL", product = "GENENAME", Ontology_term = "GO"), - PA14 = c(gene_id = "LOCUS", gene = "SYMBOL", product = "GENENAME", Ontology_term = "GO"), - SALTY = c(gene_id = "ENSEMBL", db_xref = "ENTREZID"), - ATCC27592 = c(gene_id = "LOCUS", old_locus_tag = "OLD_LOCUS", gene = "SYMBOL", product = "GENENAME", Ontology_term = "GO"), - MRSA252 = c(gene_id = "LOCUS", gene = "SYMBOL", product = "GENENAME", Ontology_term = "GO"), - UA159 = c(gene_id = "LOCUS", old_locus_tag = "OLD_LOCUS", gene = "SYMBOL", product = "GENENAME", Ontology_term = "GO"), - ES114 = c(gene_id = "LOCUS", old_locus_tag = "OLD_LOCUS", gene = "SYMBOL", product = "GENENAME", Ontology_term = "GO"), - default = c(gene_id = "ENSEMBL") + "Arabidopsis thaliana" = c(gene_id = "TAIR"), + "Bacillus subtilis" = c(gene_id = "ENSEMBL", gene_name = "SYMBOL"), + "Brachypodium distachyon" 
= c(gene_id = "ENSEMBL", transcript_id = "ACCNUM"), + "Caenorhabditis elegans" = c(gene_id = "ENSEMBL"), + "Escherichia coli" = c(gene_id = "ENSEMBL", gene_name = "SYMBOL"), + "Lactobacillus acidophilus" = c(gene_id = "LOCUS", old_locus_tag = "OLD_LOCUS", gene = "SYMBOL", product = "GENENAME", Ontology_term = "GO"), + "Mycobacterium marinum" = c(gene_id = "LOCUS", old_locus_tag = "OLD_LOCUS", gene = "SYMBOL", product = "GENENAME", Ontology_term = "GO"), + "Pseudomonas aeruginosa" = c(gene_id = "LOCUS", gene = "SYMBOL", product = "GENENAME", Ontology_term = "GO"), + "Salmonella enterica" = c(gene_id = "ENSEMBL", db_xref = "ENTREZID"), + "Serratia liquefaciens" = c(gene_id = "LOCUS", old_locus_tag = "OLD_LOCUS", gene = "SYMBOL", product = "GENENAME", Ontology_term = "GO"), + "Staphylococcus aureus" = c(gene_id = "LOCUS", gene = "SYMBOL", product = "GENENAME", Ontology_term = "GO"), + "Streptococcus mutans" = c(gene_id = "LOCUS", old_locus_tag = "OLD_LOCUS", gene = "SYMBOL", product = "GENENAME", Ontology_term = "GO"), + "Vibrio fischeri" = c(gene_id = "LOCUS", old_locus_tag = "OLD_LOCUS", gene = "SYMBOL", product = "GENENAME", Ontology_term = "GO"), + "default" = c(gene_id = "ENSEMBL") ) # Get the key types for the target organism or use the default @@ -206,17 +215,17 @@ primary_keytype <- wanted_gtf_keytypes[1] # Define filtering criteria for specific organisms filter_criteria <- list( - BACSU = "^BSU", - FLY = "^RR", - YEAST = "^Y[A-Z0-9]{6}-?[A-Z]?$", - ECOLI = "^b[0-9]{4}$" + "Bacillus subtilis" = "^BSU", + "Drosophila melanogaster" = "^RR", + "Saccharomyces cerevisiae" = "^Y[A-Z0-9]{6}-?[A-Z]?$", + "Escherichia coli" = "^b[0-9]{4}$" ) # Apply the filter if there's a specific criterion for the target organism filter_pattern <- filter_criteria[[target_organism]] if (!is.null(filter_pattern)) { - if (target_organism == "FLY") { + if (target_organism == "Drosophila melanogaster") { annot_gtf <- annot_gtf %>% filter(!grepl(filter_pattern, !!sym(primary_keytype))) } 
else { annot_gtf <- annot_gtf %>% filter(grepl(filter_pattern, !!sym(primary_keytype))) @@ -224,7 +233,7 @@ if (!is.null(filter_pattern)) { } # Remove "Gene:" labels on ENTREZ IDs -if (target_organism == "SALTY") { +if (target_organism == "Salmonella enterica") { annot_gtf <- annot_gtf %>% dplyr::mutate(ENTREZID = gsub("^GeneID:", "", ENTREZID)) %>% as.data.frame } @@ -236,12 +245,12 @@ annot_orgdb <- annot_gtf # Define the initial keys to pull from the organism-specific database orgdb_keytypes_list <- list( - BRADI = c("GENENAME", "REFSEQ", "ENTREZID"), - ECOLI = c("GENENAME", "REFSEQ", "ENTREZID"), - WORM = c("SYMBOL", "GENENAME", "REFSEQ", "ENTREZID", "GO"), - SALTY = c("SYMBOL", "GENENAME", "REFSEQ"), - YEAST = c("GENENAME", "ALIAS", "REFSEQ", "ENTREZID"), - default = c("SYMBOL", "GENENAME", "REFSEQ", "ENTREZID") + "Brachypodium distachyon" = c("GENENAME", "REFSEQ", "ENTREZID"), + "Escherichia coli" = c("GENENAME", "REFSEQ", "ENTREZID"), + "Caenorhabditis elegans" = c("SYMBOL", "GENENAME", "REFSEQ", "ENTREZID", "GO"), + "Salmonella enterica" = c("SYMBOL", "GENENAME", "REFSEQ"), + "Saccharomyces cerevisiae" = c("GENENAME", "ALIAS", "REFSEQ", "ENTREZID"), + "default" = c("SYMBOL", "GENENAME", "REFSEQ", "ENTREZID") ) # Add entries for organisms in no_org_db as character(0) (no keys wanted from the org.db) @@ -257,12 +266,12 @@ wanted_org_db_keytypes <- if (target_organism %in% names(orgdb_keytypes_list)) { # Define mappings for query and keytype based on target organism orgdb_keytype_mappings <- list( - BACSU = list(query = "SYMBOL", keytype = "SYMBOL"), - BRADI = list(query = "ACCNUM", keytype = "ACCNUM"), - WORM = list(query = primary_keytype, keytype = "ENSEMBL"), - ECOLI = list(query = "SYMBOL", keytype = "SYMBOL"), - SALTY = list(query = "ENTREZID", keytype = "ENTREZID"), - default = list(query = primary_keytype, keytype = primary_keytype) + "Bacillus subtilis" = list(query = "SYMBOL", keytype = "SYMBOL"), + "Brachypodium distachyon" = list(query = "ACCNUM", 
keytype = "ACCNUM"), + "Caenorhabditis elegans" = list(query = primary_keytype, keytype = "ENSEMBL"), + "Escherichia coli" = list(query = "SYMBOL", keytype = "SYMBOL"), + "Salmonella enterica" = list(query = "ENTREZID", keytype = "ENTREZID"), + "default" = list(query = primary_keytype, keytype = primary_keytype) ) # Define the orgdb_query, this is the key type that will be used to map to the org.db @@ -302,7 +311,7 @@ clean_and_match_accnum <- function(annot_table, org_db, query_col, keytype_col, for (keytype in wanted_org_db_keytypes) { # Check if keytype is a valid column in the target org.db if (keytype %in% columns(get(target_org_db, envir = .GlobalEnv))) { - if (target_organism == "BRADI" && orgdb_query == "ACCNUM") { + if (target_organism == "Brachypodium distachyon" && orgdb_query == "ACCNUM") { # For BRADI: use the clean_and_match_accnum function to map to org.db ACCNUM entries org_matches <- clean_and_match_accnum(annot_orgdb, get(target_org_db, envir = .GlobalEnv), query_col = orgdb_query, keytype_col = orgdb_keytype, target_column = keytype) } else { @@ -318,12 +327,12 @@ for (keytype in wanted_org_db_keytypes) { } # For SALTY, reorder columns to mtach other tables -if (target_organism == "SALTY") { # Reorder columns to match others; was mismatched since ENTREZ came from GTF +if (target_organism == "Salmonella enterica") { # Reorder columns to match others; was mismatched since ENTREZ came from GTF annot_orgdb <- annot_orgdb[, c("ENSEMBL", "SYMBOL", "GENENAME", "REFSEQ", "ENTREZID")] } # For YEAST, Rename ALIAS to GENENAME -if (target_organism == "YEAST") { +if (target_organism == "Saccharomyces cerevisiae") { colnames(annot_orgdb) <- c("ENSEMBL", "SYMBOL", "GENENAME", "REFSEQ", "ENTREZID") } @@ -332,16 +341,16 @@ if (target_organism == "YEAST") { ######################################################################### # Define organisms that do not use STRING annotations -no_stringdb <- c("PA14", "MRSA252") +no_stringdb <- c("Pseudomonas aeruginosa", 
"Staphylococcus aureus") # Define the key type used for mapping to STRING stringdb_query_list <- list( - NCFM = "OLD_LOCUS", - MMARINUMM = "OLD_LOCUS", - ATCC27592 = "OLD_LOCUS", - UA159 = "OLD_LOCUS", - ES114 = "OLD_LOCUS", - default = primary_keytype + "Lactobacillus acidophilus" = "OLD_LOCUS", + "Mycobacterium marinum" = "OLD_LOCUS", + "Serratia liquefaciens" = "OLD_LOCUS", + "Streptococcus mutans" = "OLD_LOCUS", + "Vibrio fischeri" = "OLD_LOCUS", + "default" = primary_keytype ) # Define the key type for mapping in STRING, using the default if necessary @@ -354,7 +363,7 @@ stringdb_query <- if (!is.null(stringdb_query_list[[target_organism]])) { # Handle organisms which do not use the GTF's gene_id keys to map to STRING # These are microbial species for which NCBI references were used rather than ENSEMBL, # for which the STRING accessions match the GTF's gene_name keys, but not the gene_id keys. -uses_old_locus <- c("NCFM", "MMARINUMM", "ATCC27592", "UA159", "ES114") +uses_old_locus <- c("Lactobacillus acidophilus", "Mycobacterium marinum", "Serratia liquefaciens", "Streptococcus mutans", "Vibrio fischeri") # Handle STRING annotation processing based on the target organism if (target_organism %in% uses_old_locus) { # If the target organism is one of the NOENTRY organisms, handle the OLD_LOCUS splitting @@ -372,15 +381,15 @@ if (target_organism %in% uses_old_locus) { } # Replace "BSU_" with "BSU" in the primary_keytype column for BACSU before STRING mapping -if (target_organism == "BACSU") { +if (target_organism == "Bacillus subtilis") { annot_stringdb[[stringdb_query]] <- gsub("^BSU_", "BSU", annot_stringdb[[stringdb_query]]) } # Map alternative taxonomy IDs for organisms not directly supported by STRING taxid_map <- list( - YEAST = 4932, - BRARP = 51351, - ATCC27592 = 614 + "Saccharomyces cerevisiae" = 4932, + "Brassica rapa" = 51351, + "Serratia liquefaciens" = 614 ) # Assign the alternative taxonomy identifier if applicable @@ -410,7 +419,7 @@ if 
(!is.null(string_map)) { if (!is.null(string_map)) { # Determine the appropriate join key - join_key <- if (target_organism %in% c("NCFM", "MMARINUMM", "ATCC27592", "UA159", "ES114")) { + join_key <- if (target_organism %in% c("Lactobacillus acidophilus", "Mycobacterium marinum", "Serratia liquefaciens", "Streptococcus mutans", "Vibrio fischeri")) { primary_keytype } else { stringdb_query @@ -429,7 +438,7 @@ if (!is.null(string_map)) { } # Undo the "BSU_" to "BSU" replacement for BACSU after STRING mapping -if (target_organism == "BACSU") { +if (target_organism == "Bacillus subtilis") { annot_stringdb[[stringdb_query]] <- gsub("^BSU", "BSU_", annot_stringdb[[stringdb_query]]) } @@ -440,7 +449,7 @@ annot_stringdb <- as.data.frame(annot_stringdb) ######################################################################### # Define organisms that do not use PANTHER annotations -no_panther_db <- c("WORM", "MMARINUMM", "ORYSJ", "MRSA252", "NCFM", "ATCC27592", "UA159", "ES114", "PA14") +no_panther_db <- c("Caenorhabditis elegans", "Mycobacterium marinum", "Oryza sativa", "Staphylococcus aureus", "Lactobacillus acidophilus", "Serratia liquefaciens", "Streptococcus mutans", "Vibrio fischeri", "Pseudomonas aeruginosa") annot_pantherdb <- annot_stringdb @@ -450,8 +459,8 @@ if (!(target_organism %in% no_panther_db)) { pantherdb_query = "ENTREZID" pantherdb_keytype = "ENTREZ" - # Retrieve target organism PANTHER GO slim annotations database - pthOrganisms(PANTHER.db) <- target_organism + # Retrieve target organism PANTHER GO slim annotations database using the UNIPROT / PANTHER short name + pthOrganisms(PANTHER.db) <- target_short_name # Define a function to retrieve GO slim IDs for a given gene's ENTREZIDs, which may include entries separated by a "|" get_go_slim_ids <- function(entrez_id) { diff --git a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/install-org-db.R 
b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/install-org-db.R index 3374cd4b..5ecffc5b 100644 --- a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/install-org-db.R +++ b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/install-org-db.R @@ -10,18 +10,18 @@ install_annotations <- function(target_organism, refTablePath) { ref_table <- read.csv(refTablePath) target_taxid <- ref_table %>% - filter(name == target_organism) %>% + filter(species == target_organism) %>% pull(taxon) # Get package name or build it if not provided target_org_db <- ref_table %>% - filter(name == target_organism) %>% + filter(species == target_organism) %>% pull(annotations) if (is.na(target_org_db) || target_org_db == "") { cat("\nNo annotation database specified. Constructing package name...\n") target_species_designation <- ref_table %>% - filter(name == target_organism) %>% + filter(species == target_organism) %>% pull(species) %>% gsub("\\s+", " ", .) %>% gsub("[^A-Za-z0-9 ]", "", .) @@ -34,7 +34,7 @@ install_annotations <- function(target_organism, refTablePath) { genus <- genus_species[1] species <- ifelse(length(genus_species) > 1, genus_species[2], "") strain <- ref_table %>% - filter(name == target_organism) %>% + filter(species == target_organism) %>% pull(strain) %>% gsub("[^A-Za-z0-9]", "", .) 
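Note on the `validate_arguments` change in this patch: the comparison is now case-insensitive, but the function returns `args[1]` exactly as typed. An argument such as `'mus musculus'` would therefore appear to pass validation and then match no row in the exact `filter(species == target_organism)` lookups. Below is a minimal sketch of one way to return the canonical spelling instead. `resolve_organism` is a hypothetical helper name, and the three-organism vector is illustrative only; neither is part of the workflow.

```R
# Hypothetical helper (not part of the workflow): resolve a user-supplied
# organism name to the exact spelling used in the 'species' column.
resolve_organism <- function(arg, supported_orgs) {
  # Compare case-insensitively, ignoring surrounding whitespace
  idx <- match(toupper(trimws(arg)), toupper(supported_orgs))
  if (is.na(idx)) {
    stop("'", arg, "' is not currently supported. Available options are:\n",
         paste(supported_orgs, collapse = ", "), call. = FALSE)
  }
  # Return the canonical spelling so later exact matches succeed
  supported_orgs[idx]
}

# Illustrative subset of the supported organisms
supported <- c("Mus musculus", "Homo sapiens", "Bacillus subtilis")
resolve_organism("mus musculus", supported)  # "Mus musculus"
```

Returning the entry from `supported_orgs` rather than the raw argument keeps the downstream list lookups (for example `gtf_keytype_mappings[[target_organism]]`) and the `species ==` filters consistent with what was validated.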
From eb93b05b2dc90575ebe0f79a2a0a9b7bc906ca45 Mon Sep 17 00:00:00 2001 From: asaravia-butler <70983120+asaravia-butler@users.noreply.github.com> Date: Thu, 5 Sep 2024 07:44:11 -0700 Subject: [PATCH 12/58] Adding missing updates --- .../GL-DPPD-7110-A/GL-DPPD-7110-A.md | 27 +++++++++++-------- 1 file changed, 16 insertions(+), 11 deletions(-) diff --git a/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md b/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md index ec3528b4..42308b8a 100644 --- a/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md +++ b/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md @@ -24,7 +24,16 @@ Barbara Novak (GeneLab Data Processing Lead) - **Updated Software:** - R version updated from 4.1.3 to 4.4.0. - - Bioconductor version updated from 3.15.1 to 3.19.1. + - Bioconductor version updated from 3.15.1 to 3.19.1. + - tidyverse version updated from 1.3.2 to 2.0.0. + - STRINGdb version updated from 2.8.4 to 2.16.0. + - PANTHER.db version updated from 1.0.11 to 1.0.12. + - rtracklayer version updated from 1.56.1 to 1.64.0. + +- **Added Software:** + - AnnotationForge version 1.46.0. + - biomaRt version 2.60.1. + - GO.db version 2.0.0. - **Ensembl Releases:** - Animals: Updated from release 107 to 112 @@ -57,7 +66,7 @@ Barbara Novak (GeneLab Data Processing Lead) 7. Vibrio fischeri ES114 - **org.db Creation:** - Added functionality to create an annotation database using `AnnotationForge`. This is applicable to organisms without a maintained annotation database package in Bioconductor (e.g., `org.Hs.eg.db`). Currently, this approach is in use for the following organisms: + Added functionality to create an annotation database using `AnnotationForge`. This is applicable to organisms without a maintained annotation database package in Bioconductor (e.g., `org.Hs.eg.db`). 
This approach was used for the following organisms: 1. Bacillus subtilis, subsp. subtilis 168 2. Brachypodium distachyon 3. Escherichia coli, str. K-12 substr. MG1655 @@ -81,10 +90,10 @@ The default columns in the annotation table are: > Note: org.db ENTREZ keys did not match PANTHER ENTREZ keys so the empty `GOSLIM_IDS` column was ommitted 3. **Lactobacillus acidophilus**: - - Columns: LOCUS, OLD_LOCUS, SYMBOL, GENENAME, GO, STRING_id + - Columns: LOCUS, OLD_LOCUS, SYMBOL, GENENAME, STRING_id, GO 4. **Mycobacterium marinum**: - - Columns: LOCUS, OLD_LOCUS, SYMBOL, GENENAME, GO, STRING_id + - Columns: LOCUS, OLD_LOCUS, SYMBOL, GENENAME, STRING_id, GO 5. **Oryza sativa Japonica**: - Columns: ENSEMBL, STRING_id @@ -93,23 +102,21 @@ The default columns in the annotation table are: - Columns: LOCUS, SYMBOL, GENENAME, GO 7. **Serratia liquefaciens ATCC 27592**: - - Columns: LOCUS, OLD_LOCUS, SYMBOL, GENENAME, GO, STRING_id + - Columns: LOCUS, OLD_LOCUS, SYMBOL, GENENAME, STRING_id, GO 8. **Staphylococcus aureus MRSA252**: - Columns: LOCUS, SYMBOL, GENENAME, GO 9. **Streptococcus mutans UA159**: - - Columns: LOCUS, OLD_LOCUS, SYMBOL, GENENAME, GO, STRING_id + - Columns: LOCUS, OLD_LOCUS, SYMBOL, GENENAME, STRING_id, GO 10. **Vibrio fischeri ES114**: - - Columns: LOCUS, OLD_LOCUS, SYMBOL, GENENAME, GO, STRING_id + - Columns: LOCUS, OLD_LOCUS, SYMBOL, GENENAME, STRING_id, GO --- # Table of Contents -- [GeneLab Pipeline for Generating Reference Annotation Tables](#genelab-pipeline-for-generating-reference-annotation-tables) -- [Table of Contents](#table-of-contents) - [Software Used](#software-used) - [Annotation Table Build Overview with Example Commands](#annotation-table-build-overview-with-example-commands) - [0. Set Up Environment](#0-set-up-environment) @@ -122,8 +129,6 @@ The default columns in the annotation table are: - [7. Add Gene Ontology (GO) Slim IDs](#7-add-gene-ontology-go-slim-ids) - [8. 
Export Annotation Table and Build Info](#8-export-annotation-table-and-build-info) - - --- # Software Used From bbf7a78323ba11b06c2b6a4c94d3c429b08be742 Mon Sep 17 00:00:00 2001 From: asaravia-butler <70983120+asaravia-butler@users.noreply.github.com> Date: Thu, 5 Sep 2024 07:53:10 -0700 Subject: [PATCH 13/58] Updating install and run instructions. --- .../GL_RefAnnotTable-A/README.md | 23 +++++++++++-------- 1 file changed, 14 insertions(+), 9 deletions(-) diff --git a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md index c19bd34d..de1462a5 100644 --- a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md +++ b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md @@ -1,7 +1,7 @@ # GL_RefAnnotTable Workflow Information and Usage Instructions ## General workflow info -The current GeneLab Reference Annotation Table (GL_RefAnnotTable) pipeline is implemented as an R workflow that can be run from a command line interface (CLI) using bash. The workflow can be used even if you are unfamiliar with R, but if you want to learn more about R, visit the [R-project about page here](https://www.r-project.org/about.html). Additionally, an introduction to R along with installation help and information about using R for bioinformatics can be found [here at Happy Belly Bioinformatics](https://astrobiomike.github.io/R/basics). +The current GeneLab Reference Annotation Table (GL_RefAnnotTable-A) pipeline is implemented as an R workflow that can be run from a command line interface (CLI) using bash. The workflow can be used even if you are unfamiliar with R, but if you want to learn more about R, visit the [R-project about page here](https://www.r-project.org/about.html). 
Additionally, an introduction to R along with installation help and information about using R for bioinformatics can be found [here at Happy Belly Bioinformatics](https://astrobiomike.github.io/R/basics). ## Utilizing the workflow @@ -32,18 +32,21 @@ Within an active R environment, run the following commands to install the requir ```R install.packages("tidyverse", version = 2.0.0, repos = "http://cran.us.r-project.org") -install.packages("BiocManager", version = 3.19, repos = "http://cran.us.r-project.org") +install.packages("BiocManager", version = 3.19.1, repos = "http://cran.us.r-project.org") -BiocManager::install("STRINGdb", version = 3.19) -BiocManager::install("PANTHER.db", version = 3.19) -BiocManager::install("rtracklayer", version = 3.19) +BiocManager::install("STRINGdb", version = 3.19.1) +BiocManager::install("PANTHER.db", version = 3.19.1) +BiocManager::install("rtracklayer", version = 3.19.1) +BiocManager::install("AnnotationForge", version = 1.46.0) +BiocManager::install("biomaRt", version = 2.60.1) +BiocManager::install("GO.db", version = 3.19.1) ```
### 2. Download the Workflow Files -All files required for utilizing the GL_RefAnnotTable workflow for generating reference annotation tables are in the [workflow_code](workflow_code) directory. To get a copy of latest GL_RefAnnotTable version on to your system, run the following command: +All files required for utilizing the GL_RefAnnotTable-A workflow for generating reference annotation tables are in the [workflow_code](workflow_code) directory. To get a copy of the latest GL_RefAnnotTable version onto your system, run the following command: ```bash curl -LO https://github.com/nasa/GeneLab_Data_Processing/releases/download/GL_RefAnnotTable-A_1.1.0/GL_RefAnnotTable-A_1.1.0.zip @@ -53,9 +56,11 @@ curl -LO https://github.com/nasa/GeneLab_Data_Processing/releases/download/GL_Re ### 3. Setup Execution Permission for Workflow Scripts -Once you've downloaded the GL_RefAnnotTable workflow directory as a zip file, unzip the workflow then `cd` into the GL_RefAnnotTable-A_1.1.0 directory on the CLI. Next, run the following command to set the execution permissions for the R script: +Once you've downloaded the GL_RefAnnotTable-A workflow directory as a zip file, unzip the workflow, then `cd` into the GL_RefAnnotTable-A_1.1.0 directory on the CLI. 
Next, run the following command to set the execution permissions for the R script: ```bash +unzip GL_RefAnnotTable-A_1.1.0.zip +cd GL_RefAnnotTable-A_1.1.0 chmod -R u+x *R ``` @@ -94,6 +99,6 @@ Rscript install-org-db.R 'Bacillus subtilis' /path/to/GL-DPPD-7110-A_annotations - The path to a local reference table must also be supplied as the second positional argument -Output data: +**Output data:** -- org.*.eg.db/ (species-specific annotation database, as a local R package) \ No newline at end of file +- org.*.eg.db/ (species-specific annotation database, as a local R package) From 7c011a228898ea12de90713e87b107cac945d480 Mon Sep 17 00:00:00 2001 From: Alexis Torres Date: Fri, 6 Sep 2024 14:45:58 -0700 Subject: [PATCH 14/58] [GL_RefAnnotTable] Misc fixes - Add software updates to CHANGELOG - Add input and output variables to DPPD document - Prepend species name to output files for non-ENSEMBL reference organisms to make sure it is in the file names - Fix unclear variable names and wording in some functions in the script - Move GO column to the end of the annotation tables when applicable --- .../GL-DPPD-7110-A/GL-DPPD-7110-A.md | 143 +++++++++++++++++- .../GL-DPPD-7110-A_annotations.csv | 16 +- .../GL_RefAnnotTable-A/CHANGELOG.md | 22 ++- .../GL-DPPD-7110-A_build-genome-annots-tab.R | 25 ++- 4 files changed, 180 insertions(+), 26 deletions(-) diff --git a/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md b/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md index 42308b8a..db12ffd2 100644 --- a/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md +++ b/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md @@ -221,6 +221,7 @@ target_org_db <- target_info$annotations # org.eg.db R package target_species_designation <- target_info$species # Full species name gtf_link <- target_info$gtf # Path to reference 
assembly GTF target_short_name <- target_info$name # PANTHER / UNIPROT short name; blank if not available +ref_source <- target_info$ref_source # Reference files source # Error handling for missing values if (is.na(target_taxid) || is.na(target_org_db) || is.na(target_species_designation) || is.na(gtf_link)) { @@ -231,6 +232,11 @@ if (is.na(target_taxid) || is.na(target_org_db) || is.na(target_species_designat base_gtf_filename <- basename(gtf_link) base_output_name <- str_replace(base_gtf_filename, ".gtf.gz", "") +# Add the species name to base_output_name if the reference source is not ENSEMBL +if (!(ref_source %in% c("ensembl_plants", "ensembl_bacteria", "ensembl"))) { + base_output_name <- paste(str_replace(target_species_designation, " ", "_"), base_output_name, sep = "_") +} + out_table_filename <- paste0(base_output_name, "-GL-annotations.tsv") out_log_filename <- paste0(base_output_name, "-GL-build-info.txt") @@ -243,6 +249,21 @@ if ( file.exists(out_table_filename) ) { quit() } ``` +**Input Data:** + +- ref_tab_path (path to the reference table CSV file containing organism-specific information) +- target_organism (name of the target organism for which annotations are being generated) + +**Output Data:** + +- target_taxid (taxonomic identifier for the target organism) +- target_org_db (name of the org.db R package for the target organism) +- target_species_designation (full species name of the target organism) +- gtf_link (URL to the GTF file for the target organism) +- target_short_name (PANTHER/UNIPROT short name for the target organism) +- ref_source (source of the reference files, e.g., "ensembl", "ensembl_plants", "ensembl_bacteria", "ncbi") +- out_table_filename (name of the output annotation table file) +- out_log_filename (name of the output log file)
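
The naming rule in this step can be sketched in base R as follows. This is only an illustrative sketch, not the pipeline code: `build_output_names` is a hypothetical helper, and it assumes two small deviations from the script shown above, namely matching the `.gtf.gz` suffix literally (`fixed = TRUE`) and replacing every space in the species name rather than only the first.

```R
# Hypothetical helper (not part of the GeneLab script): derive the output
# file names from the GTF link, species name, and reference source.
build_output_names <- function(gtf_link, species, ref_source) {
  # Strip the literal ".gtf.gz" suffix; fixed = TRUE avoids treating "." as a regex wildcard
  base_output_name <- sub(".gtf.gz", "", basename(gtf_link), fixed = TRUE)

  # Prepend the species name only when the reference is not from Ensembl
  ensembl_sources <- c("ensembl", "ensembl_plants", "ensembl_bacteria")
  if (!(ref_source %in% ensembl_sources)) {
    species_tag <- gsub(" ", "_", species, fixed = TRUE)
    base_output_name <- paste(species_tag, base_output_name, sep = "_")
  }

  list(
    table = paste0(base_output_name, "-GL-annotations.tsv"),
    log   = paste0(base_output_name, "-GL-build-info.txt")
  )
}

# NCBI-sourced reference: species name is prepended
ncbi <- build_output_names(
  "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/011/985/GCF_000011985.1_ASM1198v1/GCF_000011985.1_ASM1198v1_genomic.gtf.gz",
  "Lactobacillus acidophilus", "ncbi"
)

# Ensembl-sourced reference: file name is used as-is
ens <- build_output_names(
  "https://ftp.ensembl.org/pub/release-112/gtf/mus_musculus/Mus_musculus.GRCm39.112.gtf.gz",
  "Mus musculus", "ensembl"
)

print(ncbi$table)
print(ens$table)
```

Prepending the species name matters for NCBI references because their GTF file names carry only an assembly accession, so the organism would otherwise not be identifiable from the output file name.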
@@ -293,6 +314,21 @@ if (!requireNamespace(target_org_db, quietly = TRUE)) { } ``` +**Input Data:** + +- target_org_db (name of the org.db R package for the target organism, output from [step 1](#1-define-variables-and-output-file-names)) +- target_species_designation (full species name of the target organism, output from [step 1](#1-define-variables-and-output-file-names)) +- ref_table (reference table containing organism-specific information, output from [step 1](#1-define-variables-and-output-file-names)) +- target_organism (name of the target organism, output from [step 1](#1-define-variables-and-output-file-names)) +- target_taxid (taxonomic identifier for the target organism, output from [step 1](#1-define-variables-and-output-file-names)) + +**Output Data:** + +- target_org_db (updated name of the org.db R package, if it was created locally) +- Locally installed org.db package (if the package is not available on Bioconductor, a new package is created and installed) + +
+ --- ## 3. Load Annotation Databases @@ -322,6 +358,20 @@ if (!(target_organism %in% no_org_db) && (target_organism %in% currently_accepte } ``` +**Input Data:** + +- gtf_link (URL to the GTF file for the target organism, output from [step 1](#1-define-variables-and-output-file-names)) +- target_org_db (name of the org.eg.db R package for the target organism, output from [steps 1](#1-define-variables-and-output-file-names) and [2](#2-create-the-organism-package-if-it-is-not-hosted-by-bioconductor)) +- target_organism (name of the target organism, output from [step 1](#1-define-variables-and-output-file-names)) +- currently_accepted_orgs (list of currently supported organisms, defined at the beginning of the script) +- ref_tab_path ([path to the reference table CSV](GL-DPPD-7110-A_annotations.csv)) + +**Output Data:** + +- GTF (data frame containing the GTF file for the target organism) +- no_org_db (list of organisms that do not use org.db annotations due to inconsistent gene names across GTF and org.db) +- Loaded org.db package (the organism-specific annotation package is loaded into the R session, if applicable) +
--- @@ -394,6 +444,17 @@ if (target_organism == "Salmonella enterica") { } ``` +**Input Data:** + +- GTF (data frame containing the parsed GTF file for the target organism, output from [step 3](#3-load-annotation-databases)) +- target_organism (target organism's full species name, output from [step 1](#1-define-variables-and-output-file-names)) +- gtf_keytype_mappings (list of keys to extract from the GTF, for each organism) + +**Output Data:** + +- annot_gtf (initial annotation table derived from the GTF file, containing only the relevant columns for the target organism) +- primary_keytype (the name of the primary key type being used, e.g., "ENSEMBL", "TAIR", "LOCUS", based on the GTF gene_id entries) +
--- @@ -448,12 +509,12 @@ orgdb_keytype <- if (!is.null(orgdb_keytype_mappings[[target_organism]])) { orgdb_keytype_mappings[["default"]][["keytype"]] } -# Function to clean and match ACCNUM keys for BRADI -clean_and_match_accnum <- function(annot_table, org_db, query_col, keytype_col, target_column) { - # Clean the ACCNUM keys in the GTF annotations +# Function to remove version numbers from ACCNUM keys and match them for BRADI +match_accnum <- function(annot_table, org_db, query_col, keytype_col, target_column) { + # Remove version numbers from the ACCNUM keys in the GTF annotations cleaned_annot_keys <- sub("\\..*", "", annot_table[[query_col]]) - # Retrieve and clean the org.db keys + # Retrieve and remove version numbers from the org.db keys orgdb_keys <- keys(org_db, keytype = keytype_col) cleaned_orgdb_keys <- sub("\\..*", "", orgdb_keys) @@ -472,8 +533,8 @@ for (keytype in wanted_org_db_keytypes) { # Check if keytype is a valid column in the target org.db if (keytype %in% columns(get(target_org_db, envir = .GlobalEnv))) { if (target_organism == "Brachypodium distachyon" && orgdb_query == "ACCNUM") { - # For BRADI: use the clean_and_match_accnum function to map to org.db ACCNUM entries - org_matches <- clean_and_match_accnum(annot_orgdb, get(target_org_db, envir = .GlobalEnv), query_col = orgdb_query, keytype_col = orgdb_keytype, target_column = keytype) + # For BRADI: use the match_accnum function to map to org.db ACCNUM entries + org_matches <- match_accnum(annot_orgdb, get(target_org_db, envir = .GlobalEnv), query_col = orgdb_query, keytype_col = orgdb_keytype, target_column = keytype) } else { # Default mapping for other organisms org_matches <- mapIds(get(target_org_db, envir = .GlobalEnv), keys = annot_orgdb[[orgdb_query]], keytype = orgdb_keytype, column = keytype, multiVals = "list") @@ -497,6 +558,20 @@ if (target_organism == "Saccharomyces cerevisiae") { } ``` +**Input Data:** + +- annot_gtf (initial annotation table derived from the GTF file, 
output from [step 4](#4-build-initial-annotation-table)) +- target_organism (target organism's full species name, output from [step 1](#1-define-variables-and-output-file-names)) +- no_org_db (list of organisms that do not use annotations from an org.db, output from [step 3](#3-load-annotation-databases)) +- primary_keytype (the name of the primary key type being used, output from [step 4](#4-build-initial-annotation-table)) +- target_org_db (name of the org.eg.db R package for the target organism, output from [steps 1](#1-define-variables-and-output-file-names) and [2](#2-create-the-organism-package-if-it-is-not-hosted-by-bioconductor)) + +**Output Data:** + +- annot_orgdb (updated annotation table with additional keys from the organism-specific org.db) +- orgdb_query (the key type used to map to the org.db) +- orgdb_keytype (the name of the key type in the org.db) +
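
The version-stripping idea behind `match_accnum` can be illustrated on its own. The full function body is not visible in this diff, so the following is a minimal sketch under stated assumptions: the accession strings are invented for illustration, `strip_version` is a hypothetical helper name, and only the `sub("\\..*", "", ...)` pattern is taken from the script.

```R
# Hypothetical helper: drop everything from the first "." onward,
# e.g. "XM_003557123.4" becomes "XM_003557123"
strip_version <- function(ids) sub("\\..*", "", ids)

# Invented example keys; real keys come from the GTF and from keys(org_db, keytype = ...)
gtf_accnums   <- c("XM_003557123.4", "XM_003557999.2", "XM_000000001.1")
orgdb_accnums <- c("XM_003557123", "XM_003557999.3")

# For each GTF key, find the position of the first org.db key that is
# identical once version suffixes are removed (NA when there is none)
hits <- match(strip_version(gtf_accnums), strip_version(orgdb_accnums))

# Recover the original, unmodified org.db keys to use in the lookup
matched_orgdb_keys <- orgdb_accnums[hits]
print(matched_orgdb_keys)
```

Comparing the stripped forms while keeping the original org.db keys for the lookup lets a GTF accession match its org.db counterpart even when the two sources record different version suffixes.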
--- @@ -609,6 +684,20 @@ if (target_organism == "Bacillus subtilis") { annot_stringdb <- as.data.frame(annot_stringdb) ``` +**Input Data:** + +- annot_orgdb (annotation table with GTF and org.db annotations, output from [step 5](#5-add-orgdb-keys)) +- target_organism (target organism's full species name, output from [step 1](#1-define-variables-and-output-file-names)) +- primary_keytype (the name of the primary key type being used, output from [step 4](#4-build-initial-annotation-table)) +- target_taxid (taxonomic identifier for the target organism, output from [step 1](#1-define-variables-and-output-file-names)) + +**Output Data:** + +- annot_stringdb (updated annotation table with added STRING IDs) +- no_stringdb (list of organisms that do not use STRING annotations) +- stringdb_query (the key type used for mapping to STRING database) +- uses_old_locus (list of organisms where GTF gene_id entries do not match those in STRING, so entries in OLD_LOCUS are used to query STRING) +
--- @@ -658,6 +747,20 @@ if (!(target_organism %in% no_panther_db)) { } ``` +**Input Data:** + +- annot_stringdb (annotation table with GTF, org.db, and STRING annotations, output from step 6) +- target_organism (target organism's full species name, output from [step 1](#1-define-variables-and-output-file-names)) +- primary_keytype (the name of the primary key type being used, output from [step 4](#4-build-initial-annotation-table)) +- target_taxid (taxonomic identifier for the target organism, output from [step 1](#1-define-variables-and-output-file-names)) + +**Output Data:** + +- annot_pantherdb (updated annotation table with added PANTHER GO slim IDs) +- no_panther_db (list of organisms that do not use PANTHER annotations) +
--- @@ -670,6 +773,13 @@ annot <- annot_pantherdb %>% group_by(!!sym(primary_keytype)) %>% summarise(across(everything(), ~paste(unique(na.omit(.))[unique(na.omit(.)) != ""], collapse = "|")), .groups = 'drop') +# If "GO" column exists, move it to the end to keep columns in consistent order across organisms +if ("GO" %in% names(annot)) { + go_column <- annot$GO + annot$GO <- NULL + annot$GO <- go_column +} + # Sort the annotation table based on primary keytype gene IDs annot <- annot %>% arrange(.[[1]]) @@ -696,6 +806,23 @@ write("\n\nAll session info:\n", out_log_filename, append = TRUE) write(capture.output(sessionInfo()), out_log_filename, append = TRUE) ``` +**Input Data:** + +- annot_pantherdb (annotation table with GTF, org.db, STRING, and PANTHER annotations, output from [step 7](#7-add-gene-ontology-go-slim-ids)) +- primary_keytype (the name of the primary key type being used, output from [step 4](#4-build-initial-annotation-table)) +- out_table_filename (name of the output annotation table file, output from [step 1](#1-define-variables-and-output-file-names)) +- out_log_filename (name of the output log file, output from [step 1](#1-define-variables-and-output-file-names)) +- GL_DPPD_ID (GeneLab Data Processing Pipeline Document ID, from step 0) +- gtf_link (URL to the GTF file for the target organism, output from [step 1](#1-define-variables-and-output-file-names)) +- target_org_db (name of the org.eg.db R package for the target organism, output from [steps 1](#1-define-variables-and-output-file-names) and [2](#2-create-the-organism-package-if-it-is-not-hosted-by-bioconductor)) +- no_org_db (list of organisms that do not use org.db annotations, output from [step 3](#3-load-annotation-databases)) + +**Output Data:** + +- annot (final annotation table with annotations from the GTF, org.db, STRING, and PANTHER) +- ***-GL-annotations.tsv** (annot saved as a tab-delimited table file) +- ***-GL-build-info.txt** (annotation table build information log file) +
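
The collapse-and-reorder logic of this step can be sketched without tidyverse. This is an illustrative base R approximation, not the pipeline code: the toy table, the `collapse_values` helper, and the use of `aggregate()` in place of `group_by()`/`summarise()` are all assumptions made for the example.

```R
# Toy stand-in for annot_pantherdb; the values are invented for illustration
annot_pantherdb <- data.frame(
  ENSEMBL   = c("G2", "G1", "G1", "G2"),
  SYMBOL    = c("b", "a", "a", "b"),
  GO        = c("GO:3", "GO:1", "GO:2", NA),
  STRING_id = c("", "s1", "s1", "s2"),
  stringsAsFactors = FALSE
)
primary_keytype <- "ENSEMBL"

# Hypothetical helper: drop NA and empty strings, de-duplicate, join with "|"
collapse_values <- function(x) {
  paste(unique(x[!is.na(x) & x != ""]), collapse = "|")
}

# One row per primary key, each remaining column collapsed to a single string
value_cols <- setdiff(names(annot_pantherdb), primary_keytype)
annot <- aggregate(
  annot_pantherdb[value_cols],
  by  = annot_pantherdb[primary_keytype],
  FUN = collapse_values
)

# Move GO to the end so the column order is consistent across organisms
if ("GO" %in% names(annot)) {
  annot <- annot[c(setdiff(names(annot), "GO"), "GO")]
}

# Sort by the primary key column
annot <- annot[order(annot[[1]]), ]
print(annot)
```

Reordering by column name rather than by position keeps the result stable regardless of which optional columns (for example `STRING_id` or `GOSLIM_IDS`) an organism happens to have.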
--- @@ -706,5 +833,5 @@ write(capture.output(sessionInfo()), out_log_filename, append = TRUE) **Pipeline Output data:** -- *-GL-annotations.tsv (Tab delineated table of gene annotations, used to add gene annotations in other GeneLab processing pipelines) -- *-GL-build-info.txt (Text file containing information used to create the annotation table, including tool and tool versions and date of creation) +- ***-GL-annotations.tsv** (Tab-delimited table of gene annotations, used to add gene annotations in other GeneLab processing pipelines) +- ***-GL-build-info.txt** (Text file containing information used to create the annotation table, including tools, tool versions, and date of creation) diff --git a/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv b/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv index 0beb3696..5ce006d9 100644 --- a/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv +++ b/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv @@ -9,16 +9,16 @@ FLY,Drosophila melanogaster,,112,ensembl,http://ftp.ensembl.org/pub/release-112/ ERCC,,,,ThermoFisher,https://assets.thermofisher.com/TFS-Assets/LSG/manuals/ERCC92.zip,https://assets.thermofisher.com/TFS-Assets/LSG/manuals/ERCC92.zip,,,, ECOLI,Escherichia coli,str. K-12 substr. 
MG1655,59,ensembl_bacteria,https://ftp.ensemblgenomes.ebi.ac.uk/pub/bacteria/release-59/fasta/bacteria_0_collection/escherichia_coli_str_k_12_substr_mg1655_gca_000005845/dna/Escherichia_coli_str_k_12_substr_mg1655_gca_000005845.ASM584v2.dna.toplevel.fa.gz,https://ftp.ensemblgenomes.ebi.ac.uk/pub/bacteria/release-59/gtf/bacteria_0_collection/escherichia_coli_str_k_12_substr_mg1655_gca_000005845/Escherichia_coli_str_k_12_substr_mg1655_gca_000005845.ASM584v2.59.gtf.gz,511145,org.EcolistrK12substrMG1655.eg.db,https://figshare.com/ndownloader/files/48354379,https://figshare.com/ndownloader/files/48354394 HUMAN,Homo sapiens,,112,ensembl,https://ftp.ensembl.org/pub/release-112/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz,https://ftp.ensembl.org/pub/release-112/gtf/homo_sapiens/Homo_sapiens.GRCh38.112.gtf.gz,9606,org.Hs.eg.db,https://figshare.com/ndownloader/files/48354445,https://figshare.com/ndownloader/files/48354448 -,Lactobacillus acidophilus,NCFM,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/011/985/GCF_000011985.1_ASM1198v1/GCF_000011985.1_ASM1198v1_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/011/985/GCF_000011985.1_ASM1198v1/GCF_000011985.1_ASM1198v1_genomic.gtf.gz,272621,,https://figshare.com/ndownloader/files/48354424,https://figshare.com/ndownloader/files/48354415 +,Lactobacillus acidophilus,NCFM,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/011/985/GCF_000011985.1_ASM1198v1/GCF_000011985.1_ASM1198v1_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/011/985/GCF_000011985.1_ASM1198v1/GCF_000011985.1_ASM1198v1_genomic.gtf.gz,272621,,https://figshare.com/ndownloader/files/49061254,https://figshare.com/ndownloader/files/49061257 MOUSE,Mus 
musculus,,112,ensembl,https://ftp.ensembl.org/pub/release-112/fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna.primary_assembly.fa.gz,https://ftp.ensembl.org/pub/release-112/gtf/mus_musculus/Mus_musculus.GRCm39.112.gtf.gz,10090,org.Mm.eg.db,https://figshare.com/ndownloader/files/48354460,https://figshare.com/ndownloader/files/48354457 -,Mycobacterium marinum,M,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/018/345/GCF_000018345.1_ASM1834v1/GCF_000018345.1_ASM1834v1_genomic.gtf.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/018/345/GCF_000018345.1_ASM1834v1/GCF_000018345.1_ASM1834v1_genomic.gtf.gz,216594,,https://figshare.com/ndownloader/files/48354433,https://figshare.com/ndownloader/files/48354430 +,Mycobacterium marinum,M,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/018/345/GCF_000018345.1_ASM1834v1/GCF_000018345.1_ASM1834v1_genomic.gtf.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/018/345/GCF_000018345.1_ASM1834v1/GCF_000018345.1_ASM1834v1_genomic.gtf.gz,216594,,https://figshare.com/ndownloader/files/49061260,https://figshare.com/ndownloader/files/49061263 ORYSJ,Oryza sativa,Japonica,59,ensembl_plants,https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-59/fasta/oryza_sativa/dna/Oryza_sativa.IRGSP-1.0.dna.toplevel.fa.gz,https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-59/gtf/oryza_sativa/Oryza_sativa.IRGSP-1.0.59.gtf.gz,39947,,https://figshare.com/ndownloader/files/48354451,https://figshare.com/ndownloader/files/48354454 ORYLA,Oryzias latipes,,112,ensembl,http://ftp.ensembl.org/pub/release-112/fasta/oryzias_latipes/dna/Oryzias_latipes.ASM223467v1.dna.toplevel.fa.gz,http://ftp.ensembl.org/pub/release-112/gtf/oryzias_latipes/Oryzias_latipes.ASM223467v1.112.gtf.gz,8090,org.Olatipes.eg.db,https://figshare.com/ndownloader/files/48354463,https://figshare.com/ndownloader/files/48354466 -,Pseudomonas 
aeruginosa,UCBPP-PA14,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/014/625/GCF_000014625.1_ASM1462v1/GCF_000014625.1_ASM1462v1_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/014/625/GCF_000014625.1_ASM1462v1/GCF_000014625.1_ASM1462v1_genomic.gtf.gz,208963,,https://figshare.com/ndownloader/files/48354421,https://figshare.com/ndownloader/files/48354427 +,Pseudomonas aeruginosa,UCBPP-PA14,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/014/625/GCF_000014625.1_ASM1462v1/GCF_000014625.1_ASM1462v1_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/014/625/GCF_000014625.1_ASM1462v1/GCF_000014625.1_ASM1462v1_genomic.gtf.gz,208963,,https://figshare.com/ndownloader/files/49061266,https://figshare.com/ndownloader/files/49061269 RAT,Rattus norvegicus,,112,ensembl,http://ftp.ensembl.org/pub/release-112/fasta/rattus_norvegicus/dna/Rattus_norvegicus.mRatBN7.2.dna.toplevel.fa.gz,http://ftp.ensembl.org/pub/release-112/gtf/rattus_norvegicus/Rattus_norvegicus.mRatBN7.2.112.gtf.gz,10116,org.Rn.eg.db,https://figshare.com/ndownloader/files/48354472,https://figshare.com/ndownloader/files/48354475 YEAST,Saccharomyces cerevisiae,S288C,112,ensembl,http://ftp.ensembl.org/pub/release-112/fasta/saccharomyces_cerevisiae/dna/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa.gz,http://ftp.ensembl.org/pub/release-112/gtf/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.112.gtf.gz,559292,org.Sc.sgd.db,https://figshare.com/ndownloader/files/48354469,https://figshare.com/ndownloader/files/48354478 -SALTY,Salmonella enterica,serovar Typhimurium str. 
LT2,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/006/945/GCF_000006945.2_ASM694v2/GCF_000006945.2_ASM694v2_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/006/945/GCF_000006945.2_ASM694v2/GCF_000006945.2_ASM694v2_genomic.gtf.gz,99287,org.SentericaserovarTyphimuriumstrLT2.eg.db,https://figshare.com/ndownloader/files/48354385,https://figshare.com/ndownloader/files/48354391 -,Serratia liquefaciens,ATCC 27592,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/422/085/GCF_000422085.1_ASM42208v1/GCF_000422085.1_ASM42208v1_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/422/085/GCF_000422085.1_ASM42208v1/GCF_000422085.1_ASM42208v1_genomic.gtf.gz,1346614,,https://figshare.com/ndownloader/files/48354436,https://figshare.com/ndownloader/files/48354439 -,Staphylococcus aureus,MRSA252,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/011/505/GCF_000011505.1_ASM1150v1/GCF_000011505.1_ASM1150v1_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/011/505/GCF_000011505.1_ASM1150v1/GCF_000011505.1_ASM1150v1_genomic.gtf.gz,282458,,https://figshare.com/ndownloader/files/48354403,https://figshare.com/ndownloader/files/48354409 -,Streptococcus mutans,UA159,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/007/465/GCF_000007465.2_ASM746v2/GCF_000007465.2_ASM746v2_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/007/465/GCF_000007465.2_ASM746v2/GCF_000007465.2_ASM746v2_genomic.gtf.gz,210007,,https://figshare.com/ndownloader/files/48354397,https://figshare.com/ndownloader/files/48354406 -,Vibrio fischeri,ES114,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/011/805/GCF_000011805.1_ASM1180v1/GCF_000011805.1_ASM1180v1_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/011/805/GCF_000011805.1_ASM1180v1/GCF_000011805.1_ASM1180v1_genomic.gtf.gz,312309,,https://figshare.com/ndownloader/files/48354412,https://figshare.com/ndownloader/files/48354418 \ No newline at end of file +SALTY,Salmonella 
enterica,serovar Typhimurium str. LT2,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/006/945/GCF_000006945.2_ASM694v2/GCF_000006945.2_ASM694v2_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/006/945/GCF_000006945.2_ASM694v2/GCF_000006945.2_ASM694v2_genomic.gtf.gz,99287,org.SentericaserovarTyphimuriumstrLT2.eg.db,https://figshare.com/ndownloader/files/49061272,https://figshare.com/ndownloader/files/49061275 +,Serratia liquefaciens,ATCC 27592,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/422/085/GCF_000422085.1_ASM42208v1/GCF_000422085.1_ASM42208v1_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/422/085/GCF_000422085.1_ASM42208v1/GCF_000422085.1_ASM42208v1_genomic.gtf.gz,1346614,,https://figshare.com/ndownloader/files/49061278,https://figshare.com/ndownloader/files/49061281 +,Staphylococcus aureus,MRSA252,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/011/505/GCF_000011505.1_ASM1150v1/GCF_000011505.1_ASM1150v1_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/011/505/GCF_000011505.1_ASM1150v1/GCF_000011505.1_ASM1150v1_genomic.gtf.gz,282458,,https://figshare.com/ndownloader/files/49061284,https://figshare.com/ndownloader/files/49061287 +,Streptococcus mutans,UA159,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/007/465/GCF_000007465.2_ASM746v2/GCF_000007465.2_ASM746v2_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/007/465/GCF_000007465.2_ASM746v2/GCF_000007465.2_ASM746v2_genomic.gtf.gz,210007,,https://figshare.com/ndownloader/files/49061290,https://figshare.com/ndownloader/files/49061293 +,Vibrio fischeri,ES114,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/011/805/GCF_000011805.1_ASM1180v1/GCF_000011805.1_ASM1180v1_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/011/805/GCF_000011805.1_ASM1180v1/GCF_000011805.1_ASM1180v1_genomic.gtf.gz,312309,,https://figshare.com/ndownloader/files/49061296,https://figshare.com/ndownloader/files/49061299 \ No newline 
at end of file diff --git a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/CHANGELOG.md b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/CHANGELOG.md index 8f1a8c5f..e4d1e0c5 100644 --- a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/CHANGELOG.md +++ b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/CHANGELOG.md @@ -9,6 +9,10 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ### Added +- Added software: + - AnnotationForge version 1.46.0. + - biomaRt version 2.60.1. + - GO.db version 3.19.1. - Added support for: - Bacillus subtilis, subsp. subtilis 168 - Brachypodium distachyon @@ -23,7 +27,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 - Staphylococcus aureus MRSA252 - Streptococcus mutans UA159 - Vibrio fischeri ES114 -- Added AnnotationForge helper script install-org-db.R to create organism-specific annotation packages (org.*.eg.db) in R if not available on Bioconductor. Used for: +- Added AnnotationForge helper script install-org-db.R to create +organism-specific annotation packages (org.*.eg.db) in R if not available on +Bioconductor. Used for: - Bacillus subtilis, subsp. subtilis 168 - Brachypodium distachyon - Escherichia coli, str. K-12 substr. 
MG1655 @@ -37,12 +43,20 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ### Changed -- Updated Ensembl versions +- Updated Ensembl versions: - Animals: Ensembl release 112 - Plants: Ensembl plants release 59 - Bacteria: Ensembl bacteria release 59 -- Removed org.EcK12.eg.db and replaced it with a locally created annotations database, as it is no longer available on Bioconductor -- Changed the first argument of GL-DPPD-7110-A_build-genome-annots-tab.R from the 'name' column value to the 'species' column value (e.g., 'Mus musculus' instead of 'MOUSE') +- Updated software: + - tidyverse version updated from 1.3.2 to 2.0.0. + - STRINGdb version updated from 2.8.4 to 2.16.0. + - PANTHER.db version updated from 1.0.11 to 1.0.12. + - rtracklayer version updated from 1.56.1 to 1.64.0. + - Bioconductor version updated from 3.15.1 to 3.19.1. +- Removed org.EcK12.eg.db and replaced it with a locally created annotations +database, as it is no longer available on Bioconductor +- Changed the first argument of GL-DPPD-7110-A_build-genome-annots-tab.R from +the 'name' column value to the 'species' column value (e.g., 'Mus musculus' instead of 'MOUSE') ## [1.0.0](https://github.com/nasa/GeneLab_Data_Processing/releases/tag/GL_RefAnnotTable_1.0.0) diff --git a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/GL-DPPD-7110-A_build-genome-annots-tab.R b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/GL-DPPD-7110-A_build-genome-annots-tab.R index 48832823..53148ca4 100644 --- a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/GL-DPPD-7110-A_build-genome-annots-tab.R +++ b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/GL-DPPD-7110-A_build-genome-annots-tab.R @@ -99,6 +99,7 @@ target_org_db <- target_info$annotations # org.eg.db R package target_species_designation <- target_info$species # Full species 
name gtf_link <- target_info$gtf # Path to reference assembly GTF target_short_name <- target_info$name # PANTHER / UNIPROT short name; blank if not available +ref_source <- target_info$ref_source # Reference files source # Error handling for missing values if (is.na(target_taxid) || is.na(target_org_db) || is.na(target_species_designation) || is.na(gtf_link)) { @@ -109,6 +110,11 @@ if (is.na(target_taxid) || is.na(target_org_db) || is.na(target_species_designat base_gtf_filename <- basename(gtf_link) base_output_name <- str_replace(base_gtf_filename, ".gtf.gz", "") +# Add the species name to base_output_name if the reference source is not ENSEMBL +if (!(ref_source %in% c("ensembl_plants", "ensembl_bacteria", "ensembl"))) { + base_output_name <- paste(str_replace(target_species_designation, " ", "_"), base_output_name, sep = "_") +} + out_table_filename <- paste0(base_output_name, "-GL-annotations.tsv") out_log_filename <- paste0(base_output_name, "-GL-build-info.txt") @@ -288,12 +294,12 @@ orgdb_keytype <- if (!is.null(orgdb_keytype_mappings[[target_organism]])) { orgdb_keytype_mappings[["default"]][["keytype"]] } -# Function to clean and match ACCNUM keys for BRADI -clean_and_match_accnum <- function(annot_table, org_db, query_col, keytype_col, target_column) { - # Clean the ACCNUM keys in the GTF annotations +# Function to remove version numbers from ACCNUM keys and match them for BRADI +match_accnum <- function(annot_table, org_db, query_col, keytype_col, target_column) { + # Remove version numbers from the ACCNUM keys in the GTF annotations cleaned_annot_keys <- sub("\\..*", "", annot_table[[query_col]]) - # Retrieve and clean the org.db keys + # Retrieve and remove version numbers from the org.db keys orgdb_keys <- keys(org_db, keytype = keytype_col) cleaned_orgdb_keys <- sub("\\..*", "", orgdb_keys) @@ -312,8 +318,8 @@ for (keytype in wanted_org_db_keytypes) { # Check if keytype is a valid column in the target org.db if (keytype %in% 
columns(get(target_org_db, envir = .GlobalEnv))) { if (target_organism == "Brachypodium distachyon" && orgdb_query == "ACCNUM") { - # For BRADI: use the clean_and_match_accnum function to map to org.db ACCNUM entries - org_matches <- clean_and_match_accnum(annot_orgdb, get(target_org_db, envir = .GlobalEnv), query_col = orgdb_query, keytype_col = orgdb_keytype, target_column = keytype) + # For BRADI: use the match_accnum function to map to org.db ACCNUM entries + org_matches <- match_accnum(annot_orgdb, get(target_org_db, envir = .GlobalEnv), query_col = orgdb_query, keytype_col = orgdb_keytype, target_column = keytype) } else { # Default mapping for other organisms org_matches <- mapIds(get(target_org_db, envir = .GlobalEnv), keys = annot_orgdb[[orgdb_query]], keytype = orgdb_keytype, column = keytype, multiVals = "list") @@ -498,6 +504,13 @@ annot <- annot_pantherdb %>% group_by(!!sym(primary_keytype)) %>% summarise(across(everything(), ~paste(unique(na.omit(.))[unique(na.omit(.)) != ""], collapse = "|")), .groups = 'drop') +# If "GO" column exists, move it to the end to keep columns in consistent order across organisms +if ("GO" %in% names(annot)) { + go_column <- annot$GO + annot$GO <- NULL + annot$GO <- go_column +} + # Sort the annotation table based on primary keytype gene IDs annot <- annot %>% arrange(.[[1]]) From 9fd9fb76c9f824bc34bfa4ac85f2567675f98401 Mon Sep 17 00:00:00 2001 From: Alexis Torres Date: Fri, 6 Sep 2024 14:57:35 -0700 Subject: [PATCH 15/58] [GL_RefAnnotTable] Typo fixes --- .../GL-DPPD-7110-A/GL-DPPD-7110-A.md | 21 ++++++++++++++----- 1 file changed, 16 insertions(+), 5 deletions(-) diff --git a/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md b/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md index db12ffd2..4f69cf2e 100644 --- a/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md +++ 
b/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md @@ -194,6 +194,18 @@ library(STRINGdb) library(PANTHER.db) library(rtracklayer) ``` +**Input Data:** + +- None (This is an initial setup step using predefined variables) + +**Output Data:** + +- GL_DPPD_ID (GeneLab Data Processing Pipeline Document ID) +- ref_tab_path (path to the reference table CSV file) +- readme_path (path to the README file) +- currently_accepted_orgs (list of currently supported organisms) + +
--- @@ -251,7 +263,7 @@ if ( file.exists(out_table_filename) ) { ``` **Input Data:** -- ref_tab_path (path to the reference table CSV file containing organism-specific information) +- ref_tab_path (path to the reference table CSV file, output from [step 0](#0-set-up-environment)) - target_organism (name of the target organism for which annotations are being generated) **Output Data:** @@ -363,14 +375,13 @@ if (!(target_organism %in% no_org_db) && (target_organism %in% currently_accepte - gtf_link (URL to the GTF file for the target organism, output from [step 1](#1-define-variables-and-output-file-names)) - target_org_db (name of the org.eg.db R package for the target organism, output from [steps 1](#1-define-variables-and-output-file-names) and [2](#2-create-the-organism-package-if-it-is-not-hosted-by-bioconductor)) - target_organism (name of the target organism, output from [step 1](#1-define-variables-and-output-file-names)) -- currently_accepted_orgs (list of currently supported organisms, defined at the beginning of the script) -- ref_tab_path ([path to the reference table CSV](GL-DPPD-7110-A_annotations.csv)) +- currently_accepted_orgs (list of currently supported organisms, output from [step 0](#0-set-up-environment)) +- ref_tab_path (path to the reference table CSV file, output from [step 0](#0-set-up-environment)) **Output Data:** - GTF (data frame containing the GTF file for the target organism) - no_org_db (list of organisms that do not use org.db annotations due to inconsistent gene names across GTF and org.db) -- Loaded org.db package (the organism-specific annotation package is loaded into the R session, if applicable)
@@ -812,7 +823,7 @@ write(capture.output(sessionInfo()), out_log_filename, append = TRUE) - primary_keytype (the name of the primary key type being used, output from [step 4](#4-build-initial-annotation-table)) - out_table_filename (name of the output annotation table file, output from [step 1](#1-define-variables-and-output-file-names)) - out_log_filename (name of the output log file, output from [step 1](#1-define-variables-and-output-file-names)) -- GL_DPPD_ID (GeneLab Data Processing Pipeline Document ID, from step 0) +- GL_DPPD_ID (GeneLab Data Processing Pipeline Document ID, output from [step 0](#0-set-up-environment)) - gtf_link (URL to the GTF file for the target organism, output from [step 1](#1-define-variables-and-output-file-names)) - target_org_db (name of the org.eg.db R package for the target organism, output from [steps 1](#1-define-variables-and-output-file-names) and [2](#2-create-the-organism-package-if-it-is-not-hosted-by-bioconductor)) - no_org_db (list of organisms that do not use org.db annotations, output from [step 3](#3-load-annotation-databases)) From 8050e326daf2d9c7b94ffb8e9ef24ea506822dae Mon Sep 17 00:00:00 2001 From: Alexis Torres Date: Fri, 6 Sep 2024 15:36:59 -0700 Subject: [PATCH 16/58] [GL_RefAnnotTable] Add database versions, fix go.db version --- .../GL-DPPD-7110-A/GL-DPPD-7110-A.md | 10 +++++++--- .../GL_RefAnnotTable-A/CHANGELOG.md | 16 ++++++++-------- 2 files changed, 15 insertions(+), 11 deletions(-) diff --git a/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md b/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md index 4f69cf2e..0f5206cf 100644 --- a/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md +++ b/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md @@ -151,7 +151,7 @@ The default columns in the annotation table are: | org.Sc.sgd.db | 3.19.1 | 
[https://bioconductor.org/packages/release/data/annotation/html/org.Sc.sgd.db.html](https://www.bioconductor.org/packages/release/data/annotation/html/org.Sc.sgd.db.html) | | AnnotationForge | 1.46.0 | [https://bioconductor.org/packages/AnnotationForge](https://bioconductor.org/packages/AnnotationForge) | | biomaRt | 2.60.1 | [https://bioconductor.org/packages/biomaRt](https://bioconductor.org/packages/biomaRt) | -| GO.db | 2.0.0 | [https://bioconductor.org/packages/GO.db](https://bioconductor.org/packages/GO.db) | +| GO.db | 3.19.1 | [https://bioconductor.org/packages/GO.db](https://bioconductor.org/packages/GO.db) | --- @@ -164,8 +164,12 @@ The default columns in the annotation table are: > - Plants: Ensembl plants release 59 > - Bacteria: Ensembl bacteria release 59 > -> **PANTHER:** 18.0 -> > *Note: The values in the 'name' column of [GL-DPPD-7110-A_annotations.csv](GL-DPPD-7110-A_annotations.csv) (e.g., MOUSE, HUMAN, ARABIDOPSIS) are derived from the short names used in PANTHER. These short names are subject to change.* +> **Database Versions:** +> - STRINGdb: 12.0 +> - PANTHERdb: 18.0 +> - GO.db: 2.1 (used only when creating a local org.db R package) +> +> > *Note: The values in the 'name' column of [GL-DPPD-7110-A_annotations.csv](GL-DPPD-7110-A_annotations.csv) (e.g., HUMAN, MOUSE, RAT) are derived from the short names used in PANTHER. These short names are subject to change.* --- diff --git a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/CHANGELOG.md b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/CHANGELOG.md index e4d1e0c5..a7c5ee8c 100644 --- a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/CHANGELOG.md +++ b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/CHANGELOG.md @@ -10,9 +10,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ### Added - Added software: - - AnnotationForge version 1.46.0. 
- - biomaRt version 2.60.1. - - GO.db version 2.0.0. + - AnnotationForge version 1.46.0 + - biomaRt version 2.60.1 + - GO.db version 3.19.1 - Added support for: - Bacillus subtilis, subsp. subtilis 168 - Brachypodium distachyon @@ -48,11 +48,11 @@ Bioconductor. Used for: - Plants: Ensembl plants release 59 - Bacteria: Ensembl bacteria release 59 - Updated software: - - tidyverse version updated from 1.3.2 to 2.0.0. - - STRINGdb version updated from 2.8.4 to 2.16.0. - - PANTHER.db version updated from 1.0.11 to 1.0.12. - - rtracklayer version updated from 1.56.1 to 1.64.0. - - Bioconductor version updated from 3.15.1 to 3.19.1. + - tidyverse version updated from 1.3.2 to 2.0.0 + - STRINGdb version updated from 2.8.4 to 2.16.0 + - PANTHER.db version updated from 1.0.11 to 1.0.12 + - rtracklayer version updated from 1.56.1 to 1.64.0 + - Bioconductor version updated from 3.15.1 to 3.19.1 - Removed org.EcK12.eg.db and replaced it with a locally created annotations database, as it is no longer available on Bioconductor - Changed the first argument of GL-DPPD-7110-A_build-genome-annots-tab.R from From d910e55258fb5da253b56d5cac58b7e085ea0ec0 Mon Sep 17 00:00:00 2001 From: Alexis Torres Date: Fri, 6 Sep 2024 15:49:19 -0700 Subject: [PATCH 17/58] [GL_RefAnnotTable] Add go.db info --- .../GL-DPPD-7110-A/GL-DPPD-7110-A.md | 29 ++++++++++--------- 1 file changed, 16 insertions(+), 13 deletions(-) diff --git a/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md b/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md index 0f5206cf..8f603412 100644 --- a/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md +++ b/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md @@ -157,19 +157,22 @@ The default columns in the annotation table are: # Annotation table build overview with example commands -> Current GeneLab annotation 
tables are available on [figshare](https://figshare.com/), exact links for each reference organism are provided in the [GL-DPPD-7110-A_annotations.csv](GL-DPPD-7110-A_annotations.csv) file. -> -> **[Ensembl Reference Versions](https://www.ensembl.org/index.html):** -> - Animals: Ensembl release 112 -> - Plants: Ensembl plants release 59 -> - Bacteria: Ensembl bacteria release 59 -> -> **Database Versions:** -> - STRINGdb: 12.0 -> - PANTHERdb: 18.0 -> - GO.db: 2.1 (used only when creating a local org.db R package) -> -> > *Note: The values in the 'name' column of [GL-DPPD-7110-A_annotations.csv](GL-DPPD-7110-A_annotations.csv) (e.g., HUMAN, MOUSE, RAT) are derived from the short names used in PANTHER. These short names are subject to change.* +Current GeneLab annotation tables are available on [figshare](https://figshare.com/), exact links for each reference organism are provided in the [GL-DPPD-7110-A_annotations.csv](GL-DPPD-7110-A_annotations.csv) file. + +**[Ensembl Reference Versions](https://www.ensembl.org/index.html):** +- Animals: Ensembl release 112 +- Plants: Ensembl plants release 59 +- Bacteria: Ensembl bacteria release 59 + +**Database Versions:** +- STRINGdb: 12.0 +- PANTHERdb: 18.0 +- GO.db: + - GO ontology file updated on 2024-01-17 + - Entrez gene data updated on 2024-03-12 + - DB schema version 2.1 + + > *Note: The values in the 'name' column of [GL-DPPD-7110-A_annotations.csv](GL-DPPD-7110-A_annotations.csv) (e.g., HUMAN, MOUSE, RAT) are derived from the short names used in PANTHER. 
These short names are subject to change.* --- From 81d06dd45d976b54f681ef776946f9ca6ffae0c4 Mon Sep 17 00:00:00 2001 From: Alexis Torres Date: Fri, 6 Sep 2024 15:51:13 -0700 Subject: [PATCH 18/58] [GL_RefAnnotTable] Move panther note line --- .../GL-DPPD-7110-A/GL-DPPD-7110-A.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md b/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md index 8f603412..98366dec 100644 --- a/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md +++ b/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md @@ -167,12 +167,13 @@ Current GeneLab annotation tables are available on [figshare](https://figshare.c **Database Versions:** - STRINGdb: 12.0 - PANTHERdb: 18.0 + > Note: The values in the 'name' column of [GL-DPPD-7110-A_annotations.csv](GL-DPPD-7110-A_annotations.csv) (e.g., HUMAN, MOUSE, RAT) are derived from the short names used in PANTHER. These short names are subject to change. - GO.db: - GO ontology file updated on 2024-01-17 - Entrez gene data updated on 2024-03-12 - DB schema version 2.1 - > *Note: The values in the 'name' column of [GL-DPPD-7110-A_annotations.csv](GL-DPPD-7110-A_annotations.csv) (e.g., HUMAN, MOUSE, RAT) are derived from the short names used in PANTHER. 
These short names are subject to change.* + --- From cc11ff9e6515e86760af7cc0fb9a2cd554eb5265 Mon Sep 17 00:00:00 2001 From: asaravia-butler <70983120+asaravia-butler@users.noreply.github.com> Date: Wed, 11 Sep 2024 10:49:49 -0700 Subject: [PATCH 19/58] Specify DB versions used --- .../GL-DPPD-7110-A/GL-DPPD-7110-A.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md b/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md index 98366dec..e556904e 100644 --- a/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md +++ b/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md @@ -26,14 +26,14 @@ Barbara Novak (GeneLab Data Processing Lead) - R version updated from 4.1.3 to 4.4.0. - Bioconductor version updated from 3.15.1 to 3.19.1. - tidyverse version updated from 1.3.2 to 2.0.0. - - STRINGdb version updated from 2.8.4 to 2.16.0. - - PANTHER.db version updated from 1.0.11 to 1.0.12. + - STRINGdb version updated from 2.8.4 to 2.16.0 (DB version: 12.0). + - PANTHER.db version updated from 1.0.11 to 1.0.12 (DB version: 18.0). - rtracklayer version updated from 1.56.1 to 1.64.0. - **Added Software:** - AnnotationForge version 1.46.0. - biomaRt version 2.60.1. - - GO.db version 2.0.0. 
+ - GO.db version 3.19.1 (DB schema version 2.1) - **Ensembl Releases:** - Animals: Updated from release 107 to 112 From 6299719f85b7c71f1342cd5b99815babadc0c528 Mon Sep 17 00:00:00 2001 From: asaravia-butler <70983120+asaravia-butler@users.noreply.github.com> Date: Wed, 11 Sep 2024 12:47:50 -0700 Subject: [PATCH 20/58] Input output updates, remove unnecessary variables --- .../GL-DPPD-7110-A/GL-DPPD-7110-A.md | 136 +++++++++--------- 1 file changed, 64 insertions(+), 72 deletions(-) diff --git a/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md b/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md index e556904e..e83eb46a 100644 --- a/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md +++ b/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md @@ -208,10 +208,10 @@ library(rtracklayer) **Output Data:** -- GL_DPPD_ID (GeneLab Data Processing Pipeline Document ID) -- ref_tab_path (path to the reference table CSV file) -- readme_path (path to the README file) -- currently_accepted_orgs (list of currently supported organisms) +- `GL_DPPD_ID` (variable specifying the GeneLab Data Processing Pipeline Document ID) +- `ref_tab_path` (variable specifying the path to the reference table CSV file) +- `readme_path` (variable specifying the path to the README file) +- `currently_accepted_orgs` (variable specifying the list of currently supported organisms)
@@ -238,13 +238,12 @@ target_info <- ref_table %>% # Extract the relevant columns from the reference table target_taxid <- target_info$taxon # Taxonomic identifier target_org_db <- target_info$annotations # org.eg.db R package -target_species_designation <- target_info$species # Full species name gtf_link <- target_info$gtf # Path to reference assembly GTF target_short_name <- target_info$name # PANTHER / UNIPROT short name; blank if not available ref_source <- target_info$ref_source # Reference files source # Error handling for missing values -if (is.na(target_taxid) || is.na(target_org_db) || is.na(target_species_designation) || is.na(gtf_link)) { +if (is.na(target_taxid) || is.na(target_org_db) || is.na(target_organism) || is.na(gtf_link)) { stop(paste("Error: Missing data for target organism", target_organism, "in reference table.")) } @@ -271,19 +270,19 @@ if ( file.exists(out_table_filename) ) { ``` **Input Data:** -- ref_tab_path (path to the reference table CSV file, output from [step 0](#0-set-up-environment)) -- target_organism (name of the target organism for which annotations are being generated) +- `ref_tab_path` (variable specifying the path to the reference table CSV file, output from [step 0](#0-set-up-environment)) +- `target_organism` (variable specifying the full species name of the target organism for which annotations are being generated) +- > *Note: This is provided as a positional argument when the R script is run.* **Output Data:** -- target_taxid (taxonomic identifier for the target organism) -- target_org_db (name of the org.db R package for the target organism) -- target_species_designation (full species name of the target organism) -- gtf_link (URL to the GTF file for the target organism) -- target_short_name (PANTHER/UNIPROT short name for the target organism) -- ref_source (source of the reference files, e.g., "ensembl", "ensembl_plants", "ensembl_bacteria", "ncbi") -- out_table_filename (name of the output annotation table file) -- 
out_log_filename (name of the output log file) +- `target_taxid` (variable specifying the taxonomic identifier for the target organism) +- `target_org_db` (variable specifying the name of the org.db R package for the target organism) +- `gtf_link` (variable specifying the URL to the GTF file for the target organism) +- `target_short_name` (variable specifying the PANTHER/UNIPROT short name for the target organism) +- `ref_source` (variable specifying the source of the reference files, e.g., "ensembl", "ensembl_plants", "ensembl_bacteria", "ncbi") +- `out_table_filename` (variable specifying the name of the output annotation table file) +- `out_log_filename` (variable specifying the name of the output log file)
@@ -299,9 +298,9 @@ BiocManager::install(target_org_db, ask = FALSE) if (!requireNamespace(target_org_db, quietly = TRUE)) { tryCatch({ # Parse organism's name in the reference table to create the org.db name (target_org_db) - genus_species <- strsplit(target_species_designation, " ")[[1]] + genus_species <- strsplit(target_organism, " ")[[1]] if (length(genus_species) < 1) { - stop("Species designation is not correctly formatted: ", target_species_designation) + stop("Species designation is not correctly formatted: ", target_organism) } genus <- genus_species[1] species <- ifelse(length(genus_species) > 1, genus_species[2], "") @@ -336,15 +335,14 @@ if (!requireNamespace(target_org_db, quietly = TRUE)) { **Input Data:** -- target_org_db (name of the org.db R package for the target organism, output from [step 1](#1-define-variables-and-output-file-names)) -- target_species_designation (full species name of the target organism, output from [step 1](#1-define-variables-and-output-file-names)) -- ref_table (reference table containing organism-specific information, output from [step 1](#1-define-variables-and-output-file-names)) -- target_organism (name of the target organism, output from [step 1](#1-define-variables-and-output-file-names)) -- target_taxid (taxonomic identifier for the target organism, output from [step 1](#1-define-variables-and-output-file-names)) +- `target_org_db` (variable specifying the name of the org.db R package for the target organism, output from [step 1](#1-define-variables-and-output-file-names)) +- `ref_table` (variable specifying the reference table containing organism-specific information, output from [step 1](#1-define-variables-and-output-file-names)) +- `target_organism` (variable specifying the full species name of the target organism, output from [step 1](#1-define-variables-and-output-file-names)) +- `target_taxid` (variable specifying the taxonomic identifier for the target organism, output from [step 
1](#1-define-variables-and-output-file-names)) **Output Data:** -- target_org_db (updated name of the org.db R package, if it was created locally) +- `target_org_db` (variable specifying the updated name of the org.db R package, if it was created locally) - Locally installed org.db package (if the package is not available on Bioconductor, a new package is created and installed)
@@ -380,16 +378,16 @@ if (!(target_organism %in% no_org_db) && (target_organism %in% currently_accepte **Input Data:** -- gtf_link (URL to the GTF file for the target organism, output from [step 1](#1-define-variables-and-output-file-names)) -- target_org_db (name of the org.eg.db R package for the target organism, output from [steps 1](#1-define-variables-and-output-file-names) and [2](#2-create-the-organism-package-if-it-is-not-hosted-by-bioconductor)) -- target_organism (name of the target organism, output from [step 1](#1-define-variables-and-output-file-names)) -- currently_accepted_orgs (list of currently supported organisms, output from [step 0](#0-set-up-environment)) -- ref_tab_path (path to the reference table CSV file, output from [step 0](#0-set-up-environment)) +- `gtf_link` (variable specifying the URL to the GTF file for the target organism, output from [step 1](#1-define-variables-and-output-file-names)) +- `target_org_db` (variable specifying the name of the org.eg.db R package for the target organism, output from [steps 1](#1-define-variables-and-output-file-names) or [2](#2-create-the-organism-package-if-it-is-not-hosted-by-bioconductor)) +- `target_organism` (variable specifying the full species name of the target organism, output from [step 1](#1-define-variables-and-output-file-names)) +- `currently_accepted_orgs` (variable specifying the list of currently supported organisms, output from [step 0](#0-set-up-environment)) +- `ref_tab_path` (variable specifying the path to the reference table CSV file, output from [step 0](#0-set-up-environment)) **Output Data:** -- GTF (data frame containing the GTF file for the target organism) -- no_org_db (list of organisms that do not use org.db annotations due to inconsistent gene names across GTF and org.db) +- `GTF` (variable holding the data frame containing the GTF file for the target organism) +- `no_org_db` (variable specifying the list of organisms that do not use org.db annotations due to 
inconsistent gene names across GTF and org.db)
@@ -465,14 +463,14 @@ if (target_organism == "Salmonella enterica") { **Input Data:** -- GTF (data frame containing the parsed GTF file for the target organism, output from [step 3](#3-load-annotation-databases)) -- target_organism (target organism's full species name, output from [step 1](#1-define-variables-and-output-file-names)) -- gtf_keytype_mappings (list of keys to extract from the GTF, for each organism) +- `GTF` (variable holding the data frame containing the parsed GTF file for the target organism, output from [step 3](#3-load-annotation-databases)) +- `target_organism` (variable specifying the full species name of the target organism, output from [step 1](#1-define-variables-and-output-file-names)) +- `gtf_keytype_mappings` (variable specifying the list of keys to extract from the GTF, for each organism) **Output Data:** -- annot_gtf (initial annotation table derived from the GTF file, containing only the relevant columns for the target organism) -- primary_keytype (the name of the primary key type being used, e.g., "ENSEMBL", "TAIR", "LOCUS", based on the GTF gene_id entries) +- `annot_gtf` (variable holding the initial annotation table derived from the GTF file, containing only the relevant columns for the target organism) +- `primary_keytype` (variable specifying the name of the primary key type being used, e.g., "ENSEMBL", "TAIR", "LOCUS", based on the GTF gene_id entries)
@@ -579,17 +577,17 @@ if (target_organism == "Saccharomyces cerevisiae") { **Input Data:** -- annot_gtf (initial annotation table derived from the GTF file, output from [step 4](#4-build-initial-annotation-table)) -- target_organism (target organism's full species name, output from [step 1](#1-define-variables-and-output-file-names)) -- no_org_db (list of organisms that do not use annotations from an org.db, output from [step 3](#3-load-annotation-databases)) -- primary_keytype (the name of the primary key type being used, output from [step 4](#4-build-initial-annotation-table)) -- target_org_db (name of the org.eg.db R package for the target organism, output from [steps 1](#1-define-variables-and-output-file-names) and [2](#2-create-the-organism-package-if-it-is-not-hosted-by-bioconductor)) +- `annot_gtf` (variable holding the initial annotation table derived from the GTF file, output from [step 4](#4-build-initial-annotation-table)) +- `target_organism` (variable specifying the full species name of the target organism, output from [step 1](#1-define-variables-and-output-file-names)) +- `no_org_db` (variable specifying the list of organisms that do not use annotations from an org.db, output from [step 3](#3-load-annotation-databases)) +- `primary_keytype` (variable specifying the name of the primary key type being used, output from [step 4](#4-build-initial-annotation-table)) +- `target_org_db` (variable specifying the name of the org.eg.db R package for the target organism, output from [steps 1](#1-define-variables-and-output-file-names) or [2](#2-create-the-organism-package-if-it-is-not-hosted-by-bioconductor)) **Output Data:** -- annot_orgdb (updated annotation table with additional keys from the organism-specific org.db) -- orgdb_query (the key type used to map to the org.db) -- orgdb_keytype (the name of the key type in the org.db) +- `annot_orgdb` (variable holding the updated annotation table with GTF and organism-specific org.db annotations) +- 
`orgdb_query` (variable specifying the key type used to map to the org.db) +- `orgdb_keytype` (variable specifying the name of the key type in the org.db)
@@ -624,7 +622,6 @@ stringdb_query <- if (!is.null(stringdb_query_list[[target_organism]])) { uses_old_locus <- c("Lactobacillus acidophilus", "Mycobacterium marinum", "Serratia liquefaciens", "Streptococcus mutans", "Vibrio fischeri") # Handle STRING annotation processing based on the target organism if (target_organism %in% uses_old_locus) { - # If the target organism is one of the NOENTRY organisms, handle the OLD_LOCUS splitting annot_stringdb <- annot_orgdb %>% separate_rows(!!sym(stringdb_query), sep = ",", convert = TRUE) %>% distinct() %>% @@ -705,17 +702,17 @@ annot_stringdb <- as.data.frame(annot_stringdb) **Input Data:** -- annot_orgdb (annotation table with GTF and org.db annotations, output from [step 5](#5-add-orgdb-keys)) -- target_organism (target organism's full species name, output from [step 1](#1-define-variables-and-output-file-names)) -- primary_keytype (the name of the primary key type being used, output from [step 4](#4-build-initial-annotation-table)) -- target_taxid (taxonomic identifier for the target organism, output from [step 1](#1-define-variables-and-output-file-names)) +- `annot_orgdb` (variable holding the annotation table with GTF and organism-specific org.db annotations, output from [step 5](#5-add-orgdb-keys)) +- `target_organism` (variable specifying the full species name of the target organism, output from [step 1](#1-define-variables-and-output-file-names)) +- `primary_keytype` (variable specifying the name of the primary key type being used, output from [step 4](#4-build-initial-annotation-table)) +- `target_taxid` (variable specifying the taxonomic identifier for the target organism, output from [step 1](#1-define-variables-and-output-file-names)) **Output Data:** -- annot_stringdb (updated annotation table with added STRING IDs) -- no_stringdb (list of organisms that do not use STRING annotations) -- stringdb_query (the key type used for mapping to STRING database) -- uses_old_locus (list of organisms where GTF gene_id 
entries do not match those in STRING, so entries in OLD_LOCUS are used to query STRING) +- `annot_stringdb` (variable holding the updated annotation table with GTF, organism-specific org.db, and STRING annotations) +- `no_stringdb` (variable specifying the list of organisms that do not use STRING annotations) +- `stringdb_query` (variable specifying the key type used for mapping to STRING database) +- `uses_old_locus` (variable specifying the list of organisms where GTF gene_id entries do not match those in STRING, so entries in OLD_LOCUS are used to query STRING)
@@ -736,7 +733,6 @@ if (!(target_organism %in% no_panther_db)) { pantherdb_keytype = "ENTREZ" # Retrieve target organism PANTHER GO slim annotations database using the UNIPROT / PANTHER short name - target_short_name <- target_species_designation pthOrganisms(PANTHER.db) <- target_short_name # Define a function to retrieve GO slim IDs for a given gene's ENTREZIDs, which may include entries separated by a "|" @@ -768,17 +764,13 @@ if (!(target_organism %in% no_panther_db)) { **Input Data:** -- annot_orgdb (annotation table with GTF and org.db annotations, output from [step 5](#5-add-orgdb-keys)) -- target_organism (target organism's full species name, output from [step 1](#1-define-variables-and-output-file-names)) -- primary_keytype (the name of the primary key type being used, output from [step 4](#4-build-initial-annotation-table)) -- target_taxid (taxonomic identifier for the target organism, output from [step 1](#1-define-variables-and-output-file-names)) +- `annot_stringdb` (variable holding the annotation table with GTF, organism-specific org.db, and STRING annotations, output from [step 6](#6-add-string-ids)) +- `target_organism` (variable specifying the full species name of the target organism, output from [step 1](#1-define-variables-and-output-file-names)) **Output Data:** -- annot_stringdb (updated annotation table with added STRING IDs) -- no_stringdb (list of organisms that do not use STRING annotations) -- stringdb_query (the key type used for mapping to STRING database) -- uses_old_locus (list of organisms where the 'gene_id' column in the GTF dataframe does not match STRING identifiers, so the 'old_locus_tag' column from the GTF dataframe is used to query STRING instead) +- `annot_pantherdb` (variable holding the updated annotation table with GTF, organism-specific org.db, STRING, and PANTHER GO Slim annotations) +- `no_panther_db` (variable specifying the list of organisms that do not use PANTHER annotations)
@@ -827,19 +819,19 @@ write(capture.output(sessionInfo()), out_log_filename, append = TRUE) **Input Data:** -- annot_pantherdb (annotation table with GTF, org.db, STRING, and PANTHER annotations, output from [step 7](#7-add-gene-ontology-go-slim-ids)) -- primary_keytype (the name of the primary key type being used, output from [step 4](#4-build-initial-annotation-table)) -- out_table_filename (name of the output annotation table file, output from [step 1](#1-define-variables-and-output-file-names)) -- out_log_filename (name of the output log file, output from [step 1](#1-define-variables-and-output-file-names)) -- GL_DPPD_ID (GeneLab Data Processing Pipeline Document ID, output from [step 0](#0-set-up-environment)) -- gtf_link (URL to the GTF file for the target organism, output from [step 1](#1-define-variables-and-output-file-names)) -- target_org_db (name of the org.eg.db R package for the target organism, output from [steps 1](#1-define-variables-and-output-file-names) and [2](#2-create-the-organism-package-if-it-is-not-hosted-by-bioconductor)) -- no_org_db (list of organisms that do not use org.db annotations, output from [step 3](#3-load-annotation-databases)) +- `annot_pantherdb` (variable holding the updated annotation table with GTF, organism-specific org.db, STRING, and PANTHER GO Slim annotations, output from [step 7](#7-add-gene-ontology-go-slim-ids)) +- `primary_keytype` (variable specifying the name of the primary key type being used, output from [step 4](#4-build-initial-annotation-table)) +- `out_table_filename` (variable specifying the name of the output annotation table file, output from [step 1](#1-define-variables-and-output-file-names)) +- `out_log_filename` (variable specifying the name of the output log file, output from [step 1](#1-define-variables-and-output-file-names)) +- `GL_DPPD_ID` (variable specifying the GeneLab Data Processing Pipeline Document ID, output from [step 0](#0-set-up-environment)) +- `gtf_link` (variable specifying the 
URL to the GTF file for the target organism, output from [step 1](#1-define-variables-and-output-file-names)) +- `target_org_db` (variable specifying the name of the org.eg.db R package for the target organism, output from [steps 1](#1-define-variables-and-output-file-names) or [2](#2-create-the-organism-package-if-it-is-not-hosted-by-bioconductor)) +- `no_org_db` (variable specifying the list of organisms that do not use org.db annotations, output from [step 3](#3-load-annotation-databases)) **Output Data:** -- annot (final annotation table with annotations from the GTF, org.db, STRING, and PANTHER) -- ***-GL-annotations.tsv** (annot saved as a tab-delimited table file) +- `annot` (variable holding the final annotation table with GTF, organism-specific org.db, STRING, and PANTHER GO Slim annotations) +- ***-GL-annotations.tsv** (final annotation table saved as a tab-delimited table file) - ***-GL-build-info.txt** (annotation table build information log file)
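
The `*-GL-annotations.tsv` and `*-GL-build-info.txt` names listed above follow from the GTF file name and the reference source set in step 1. The following is a minimal sketch of that naming logic in base R, not the pipeline script itself. The helper `build_output_names()` and the example URLs are hypothetical and used only for illustration. It assumes `target_organism` takes the place of the removed `target_species_designation` variable, and it uses `sub(..., fixed = TRUE)` where the script uses `stringr::str_replace()`, so the `.gtf.gz` suffix is matched literally rather than as a regular expression.

```r
# Hypothetical helper: derive the output file names from the step 1 variables.
build_output_names <- function(gtf_link, target_organism, ref_source) {
  # Strip the literal ".gtf.gz" suffix from the GTF file name
  base_output_name <- sub(".gtf.gz", "", basename(gtf_link), fixed = TRUE)

  # Prefix the species name when the reference source is not Ensembl.
  # As in the script, only the first space is replaced with an underscore.
  if (!(ref_source %in% c("ensembl_plants", "ensembl_bacteria", "ensembl"))) {
    species_prefix <- sub(" ", "_", target_organism, fixed = TRUE)
    base_output_name <- paste(species_prefix, base_output_name, sep = "_")
  }

  list(
    table = paste0(base_output_name, "-GL-annotations.tsv"),
    log   = paste0(base_output_name, "-GL-build-info.txt")
  )
}

# Ensembl source: the GTF base name is used as-is
ens <- build_output_names(
  "https://example.org/Mus_musculus.GRCm39.112.gtf.gz",  # placeholder URL
  "Mus musculus", "ensembl"
)

# Non-Ensembl source: the species name is prepended
ncbi <- build_output_names(
  "https://example.org/GCF_000000000.1_genomic.gtf.gz",  # placeholder URL
  "Genus species", "ncbi"
)
```

Because only the first space is replaced, an organism name with more than two words would keep its remaining spaces in the output file name; whether that is intended is not stated in the patch.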
From 8bae3a572280a3ddff8fcd78b59f979beedf98a0 Mon Sep 17 00:00:00 2001 From: asaravia-butler <70983120+asaravia-butler@users.noreply.github.com> Date: Wed, 11 Sep 2024 12:52:02 -0700 Subject: [PATCH 21/58] Removed target_species_designation variable This variable was the same as target_organism, so it was replaced with that variable when necessary. --- .../workflow_code/GL-DPPD-7110-A_build-genome-annots-tab.R | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/GL-DPPD-7110-A_build-genome-annots-tab.R b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/GL-DPPD-7110-A_build-genome-annots-tab.R index 53148ca4..7eebaee1 100644 --- a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/GL-DPPD-7110-A_build-genome-annots-tab.R +++ b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/GL-DPPD-7110-A_build-genome-annots-tab.R @@ -96,13 +96,12 @@ target_info <- ref_table %>% # Extract the relevant columns from the reference table target_taxid <- target_info$taxon # Taxonomic identifier target_org_db <- target_info$annotations # org.eg.db R package -target_species_designation <- target_info$species # Full species name gtf_link <- target_info$gtf # Path to reference assembly GTF target_short_name <- target_info$name # PANTHER / UNIPROT short name; blank if not available ref_source <- target_info$ref_source # Reference files source # Error handling for missing values -if (is.na(target_taxid) || is.na(target_org_db) || is.na(target_species_designation) || is.na(gtf_link)) { +if (is.na(target_taxid) || is.na(target_org_db) || is.na(target_organism) || is.na(gtf_link)) { stop(paste("Error: Missing data for target organism", target_organism, "in reference table.")) } @@ -112,7 +111,7 @@ base_output_name <- str_replace(base_gtf_filename, ".gtf.gz", "") # Add the species name 
to base_output_name if the reference source is not ENSEMBL if (!(ref_source %in% c("ensembl_plants", "ensembl_bacteria", "ensembl"))) { - base_output_name <- paste(str_replace(target_species_designation, " ", "_"), base_output_name, sep = "_") + base_output_name <- paste(str_replace(target_organism, " ", "_"), base_output_name, sep = "_") } out_table_filename <- paste0(base_output_name, "-GL-annotations.tsv") @@ -534,4 +533,4 @@ write(paste(c("\nUsed STRINGdb version:\n ", packageVersion("STRINGdb") %>% a write(paste(c("\nUsed PANTHER.db version:\n ", packageVersion("PANTHER.db") %>% as.character()), collapse = ""), out_log_filename, append = TRUE) write("\n\nAll session info:\n", out_log_filename, append = TRUE) -write(capture.output(sessionInfo()), out_log_filename, append = TRUE) \ No newline at end of file +write(capture.output(sessionInfo()), out_log_filename, append = TRUE) From c3f621bff8c8875cb5cf8a7403440b9f773aa8ff Mon Sep 17 00:00:00 2001 From: Alexis Torres Date: Wed, 11 Sep 2024 14:12:59 -0700 Subject: [PATCH 22/58] [GL_RefAnnotTable] Typo fixes --- .../GL-DPPD-7110-A/GL-DPPD-7110-A.md | 4 ++-- .../workflow_code/GL-DPPD-7110-A_build-genome-annots-tab.R | 2 +- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md b/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md index e83eb46a..aebbe888 100644 --- a/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md +++ b/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md @@ -87,7 +87,7 @@ The default columns in the annotation table are: 2. 
**Caenorhabditis elegans**: - Columns: ENSEMBL, SYMBOL, GENENAME, REFSEQ, ENTREZID, STRING_id - > Note: org.db ENTREZ keys did not match PANTHER ENTREZ keys so the empty `GOSLIM_IDS` column was ommitted + > Note: org.db ENTREZ keys did not match PANTHER ENTREZ keys so the empty `GOSLIM_IDS` column was omitted 3. **Lactobacillus acidophilus**: - Columns: LOCUS, OLD_LOCUS, SYMBOL, GENENAME, STRING_id, GO @@ -564,7 +564,7 @@ for (keytype in wanted_org_db_keytypes) { } } -# For SALTY, reorder columns to mtach other tables +# For SALTY, reorder columns to match other tables if (target_organism == "Salmonella enterica") { # Reorder columns to match others; was mismatched since ENTREZ came from GTF annot_orgdb <- annot_orgdb[, c("ENSEMBL", "SYMBOL", "GENENAME", "REFSEQ", "ENTREZID")] } diff --git a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/GL-DPPD-7110-A_build-genome-annots-tab.R b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/GL-DPPD-7110-A_build-genome-annots-tab.R index 7eebaee1..dd0236d2 100644 --- a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/GL-DPPD-7110-A_build-genome-annots-tab.R +++ b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/GL-DPPD-7110-A_build-genome-annots-tab.R @@ -331,7 +331,7 @@ for (keytype in wanted_org_db_keytypes) { } } -# For SALTY, reorder columns to mtach other tables +# For SALTY, reorder columns to match other tables if (target_organism == "Salmonella enterica") { # Reorder columns to match others; was mismatched since ENTREZ came from GTF annot_orgdb <- annot_orgdb[, c("ENSEMBL", "SYMBOL", "GENENAME", "REFSEQ", "ENTREZID")] } From 722888064622f7d1ccbf0ed593a24de4357053cc Mon Sep 17 00:00:00 2001 From: asaravia-butler <70983120+asaravia-butler@users.noreply.github.com> Date: Mon, 16 Sep 2024 07:26:05 -0700 Subject: [PATCH 23/58] Typo fix --- .../GL-DPPD-7110-A/GL-DPPD-7110-A.md | 
2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md b/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md index aebbe888..2eadbb1d 100644 --- a/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md +++ b/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md @@ -177,7 +177,7 @@ Current GeneLab annotation tables are available on [figshare](https://figshare.c --- -This example below is done for *Mus musculus*. All code is executed in R. +*All code is executed in R.* ## 0. Set Up Environment From 2749fe5fe5d02a0d0314d14a2728621e6ac3db50 Mon Sep 17 00:00:00 2001 From: torres-alexis Date: Mon, 16 Sep 2024 13:47:39 -0700 Subject: [PATCH 24/58] [GL_RefAnnotTable] Fix R packages, add docker instructions --- .../GL-DPPD-7110-A/GL-DPPD-7110-A.md | 56 +++++++----- .../GL_RefAnnotTable-A/README.md | 52 +++++++++-- .../workflow_code/install-org-db.R | 88 +++++++++---------- 3 files changed, 121 insertions(+), 75 deletions(-) diff --git a/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md b/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md index 2eadbb1d..05afbd01 100644 --- a/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md +++ b/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md @@ -136,9 +136,9 @@ The default columns in the annotation table are: | Program | Version | Relevant Links | |:----------------|:-------:|:---------------| | R | 4.4.0 | [https://www.r-project.org/](https://www.r-project.org/) | -| Bioconductor | 3.19.1 | [https://bioconductor.org](https://bioconductor.org) | +| Bioconductor | 3.19 | [https://bioconductor.org](https://bioconductor.org) | | tidyverse | 2.0.0 | 
[https://www.tidyverse.org](https://www.tidyverse.org) | -| STRINGdb | 2.16.0 | [https://www.bioconductor.org/packages/release/bioc/html/STRINGdb.html](https://www.bioconductor.org/packages/release/bioc/html/STRINGdb.html) | +| STRINGdb | 2.16.4 | [https://www.bioconductor.org/packages/release/bioc/html/STRINGdb.html](https://www.bioconductor.org/packages/release/bioc/html/STRINGdb.html) | | PANTHER.db | 1.0.12 | [https://bioconductor.org/packages/release/data/annotation/html/PANTHER.db.html](https://www.bioconductor.org/packages/release/data/annotation/html/PANTHER.db.html) | | rtracklayer | 1.64.0 | [https://bioconductor.org/packages/release/bioc/html/rtracklayer.html](https://www.bioconductor.org/packages/release/bioc/html/rtracklayer.html) | | org.At.tair.db | 3.19.1 | [https://bioconductor.org/packages/release/data/annotation/html/org.At.tair.db.html](https://www.bioconductor.org/packages/release/data/annotation/html/org.At.tair.db.html) | @@ -294,42 +294,58 @@ if ( file.exists(out_table_filename) ) { # Use AnnotationForge's makeOrgPackageFromNCBI function with default settings to create the organism-specific org.db R package from available NCBI annotations # Try to download the org.db from Bioconductor, build it locally if installation fails -BiocManager::install(target_org_db, ask = FALSE) -if (!requireNamespace(target_org_db, quietly = TRUE)) { +BiocManager::install(target_org_db, ask = FALSE) +if (!requireNamespace(target_org_db, quietly = TRUE)) { tryCatch({ - # Parse organism's name in the reference table to create the org.db name (target_org_db) - genus_species <- strsplit(target_organism, " ")[[1]] + # Define genus and species regardless of target_org_db + target_species_designation <- ref_table %>% + filter(species == target_organism) %>% + pull(species) %>% + gsub("\\s+", " ", .) %>% + gsub("[^A-Za-z0-9 ]", "", .) 
+ + genus_species <- strsplit(target_species_designation, " ")[[1]] if (length(genus_species) < 1) { - stop("Species designation is not correctly formatted: ", target_organism) + stop("Species designation is not correctly formatted: ", target_species_designation) } + genus <- genus_species[1] species <- ifelse(length(genus_species) > 1, genus_species[2], "") strain <- ref_table %>% filter(species == target_organism) %>% pull(strain) %>% gsub("[^A-Za-z0-9]", "", .) + if (!is.na(strain) && strain != "") { - species <- paste0(species, strain) + species <- paste0(species, strain) + } + + # Get package name or build it if not provided + target_org_db <- ref_table %>% + filter(species == target_organism) %>% + pull(annotations) + + if (is.na(target_org_db) || target_org_db == "") { + cat("\nNo annotation database specified. Constructing package name...\n") + target_org_db <- paste0("org.", substr(genus, 1, 1), species, ".eg.db") } - target_org_db <- paste0("org.", substr(genus, 1, 1), species, ".eg.db") - BiocManager::install(c("AnnotationForge", "biomaRt", "GO.db"), ask = FALSE) + BiocManager::install(c("AnnotationForge", "biomaRt", "GO.db"), ask = FALSE) library(AnnotationForge) - makeOrgPackageFromNCBI( - version = "0.1", - author = "Your Name ", - maintainer = "Your Name ", - outputDir = "./", - tax_id = target_taxid, - genus = genus, - species = species + makeOrgPackageFromNCBI( + version = "0.1", + author = "Your Name ", + maintainer = "Your Name ", + outputDir = "./", + tax_id = target_taxid, + genus = genus, + species = species ) install.packages(file.path("./", target_org_db), repos = NULL, type = "source", quiet = TRUE) cat(paste0("'", target_org_db, "' has been successfully built and installed.\n")) }, error = function(e) { - stop("Failed to build and load the package: ", target_org_db, "\nError: ", e$message) + stop("Failed to build and load the package: ", target_org_db, "\nError: ", e$message) }) - target_org_db <- install_annotations(target_organism, 
ref_tab_path) } ``` diff --git a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md index de1462a5..a95c1ac3 100644 --- a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md +++ b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md @@ -26,20 +26,20 @@ Once R is installed, open a CLI terminal and run the following command to activa ```bash R ``` - +` Within an active R environment, run the following commands to install the required R packages: ```R -install.packages("tidyverse", version = 2.0.0, repos = "http://cran.us.r-project.org") +install.packages("tidyverse") -install.packages("BiocManager", version = 3.19.1, repos = "http://cran.us.r-project.org") +install.packages("BiocManager") -BiocManager::install("STRINGdb", version = 3.19.1) -BiocManager::install("PANTHER.db", version = 3.19.1) -BiocManager::install("rtracklayer", version = 3.19.1) -BiocManager::install("AnnotationForge", version = 1.46.0) -BiocManager::install("biomaRt", version = 2.60.1) -BiocManager::install("GO.db", version = 3.19.1) +BiocManager::install("STRINGdb") +BiocManager::install("PANTHER.db") +BiocManager::install("rtracklayer") +BiocManager::install("AnnotationForge") +BiocManager::install("biomaRt") +BiocManager::install("GO.db") ```
@@ -102,3 +102,37 @@ Rscript install-org-db.R 'Bacillus subtilis' /path/to/GL-DPPD-7110-A_annotations **Output data:** - org.*.eg.db/ (species-specific annotation database, as a local R package) + +### 6. Run the Workflow Using Docker + +Rather than running the workflow in your local environment, you can use a Docker image. This method ensures that all dependencies are correctly installed. + +1. **Pull the Docker image:** + + ```bash + docker pull quay.io/torres-alexis/gl_images:GL_RefAnnotTable_v1.1.0-rc.1 + ``` + +2. **Download the workflow files:** + + ```bash + curl -LO https://github.com/nasa/GeneLab_Data_Processing/releases/download/GL_RefAnnotTable-A_1.1.0/GL_RefAnnotTable-A_1.1.0.zip + unzip GL_RefAnnotTable-A_1.1.0.zip + ``` + +3. **Run the workflow using Docker:** + + ```bash + docker run -it -v $(pwd)/GL_RefAnnotTable-A_1.1.0:/home/rstudio/work quay.io/torres-alexis/gl_images:GL_RefAnnotTable_v1.1.0-rc.1 bash -c "cd /home/rstudio/work && Rscript GL-DPPD-7110-A_build-genome-annots-tab.R 'Mus musculus'" + ``` + +**Input data:** + +- No input files are required. Specify the target organism using a positional command line argument. `Mus musculus` is used in the example above. To see a list of all available organisms, run `Rscript GL-DPPD-7110-A_build-genome-annots-tab.R` without positional arguments. 
The correct argument for each organism can also be found in the 'species' column of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) + +- Optional: a reference table CSV can be supplied as a second positional argument instead of using the default [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) + +**Output data:** + +- *-GL-annotations.tsv (Tab delineated table of gene annotations) +- *-GL-build-info.txt (Text file containing information used to create the annotation table, including tool and tool versions and date of creation) diff --git a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/install-org-db.R b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/install-org-db.R index 5ecffc5b..606616b1 100644 --- a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/install-org-db.R +++ b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/install-org-db.R @@ -1,8 +1,3 @@ -# install-org-db.R - -# Function: Get annotations db from ref table. If no annotations db is defined, create the package name from genus, species, (and strain for microbes), -# Try to Bioconductor install annotations db. If fail then build the package using AnnotationForge, install it into the current directory. 
-# Requires ~80GB for NCBIFilesDir file caching install_annotations <- function(target_organism, refTablePath) { if (!file.exists(refTablePath)) { stop("Reference table file does not exist at the specified path: ", refTablePath) @@ -13,6 +8,29 @@ install_annotations <- function(target_organism, refTablePath) { filter(species == target_organism) %>% pull(taxon) + # Define genus and species regardless of target_org_db + target_species_designation <- ref_table %>% + filter(species == target_organism) %>% + pull(species) %>% + gsub("\\s+", " ", .) %>% + gsub("[^A-Za-z0-9 ]", "", .) + + genus_species <- strsplit(target_species_designation, " ")[[1]] + if (length(genus_species) < 1) { + stop("Species designation is not correctly formatted: ", target_species_designation) + } + + genus <- genus_species[1] + species <- ifelse(length(genus_species) > 1, genus_species[2], "") + strain <- ref_table %>% + filter(species == target_organism) %>% + pull(strain) %>% + gsub("[^A-Za-z0-9]", "", .) + + if (!is.na(strain) && strain != "") { + species <- paste0(species, strain) + } + # Get package name or build it if not provided target_org_db <- ref_table %>% filter(species == target_organism) %>% @@ -20,28 +38,6 @@ install_annotations <- function(target_organism, refTablePath) { if (is.na(target_org_db) || target_org_db == "") { cat("\nNo annotation database specified. Constructing package name...\n") - target_species_designation <- ref_table %>% - filter(species == target_organism) %>% - pull(species) %>% - gsub("\\s+", " ", .) %>% - gsub("[^A-Za-z0-9 ]", "", .) - - genus_species <- strsplit(target_species_designation, " ")[[1]] - if (length(genus_species) < 1) { - stop("Species designation is not correctly formatted: ", target_species_designation) - } - - genus <- genus_species[1] - species <- ifelse(length(genus_species) > 1, genus_species[2], "") - strain <- ref_table %>% - filter(species == target_organism) %>% - pull(strain) %>% - gsub("[^A-Za-z0-9]", "", .) 
- - if (!is.na(strain) && strain != "") { - species <- paste0(species, strain) - } - target_org_db <- paste0("org.", substr(genus, 1, 1), species, ".eg.db") } @@ -56,25 +52,25 @@ install_annotations <- function(target_organism, refTablePath) { } else { cat(paste0("\nInstallation from Bioconductor failed, attempting to build '", target_org_db, "'...\n")) if (!dir.exists(target_org_db)) { - tryCatch({ - BiocManager::install(c("AnnotationForge", "biomaRt", "GO.db"), ask = FALSE) - library(AnnotationForge) - makeOrgPackageFromNCBI( - version = "0.1", - author = "Your Name ", - maintainer = "Your Name ", - outputDir = "./", - tax_id = target_taxid, - genus = genus, - species = species - ) - install.packages(file.path("./", target_org_db), repos = NULL, type = "source", quiet = TRUE) - cat(paste0("'", target_org_db, "' has been successfully built and installed.\n")) - }, error = function(e) { - stop("Failed to build and load the package: ", target_org_db, "\nError: ", e$message) - }) + tryCatch({ + BiocManager::install(c("AnnotationForge", "biomaRt", "GO.db"), ask = FALSE) + library(AnnotationForge) + makeOrgPackageFromNCBI( + version = "0.1", + author = "Your Name ", + maintainer = "Your Name ", + outputDir = "./", + tax_id = target_taxid, + genus = genus, + species = species + ) + install.packages(file.path("./", target_org_db), repos = NULL, type = "source", quiet = TRUE) + cat(paste0("'", target_org_db, "' has been successfully built and installed.\n")) + }, error = function(e) { + stop("Failed to build and load the package: ", target_org_db, "\nError: ", e$message) + }) } else { - cat(paste0("Local annotation package ", target_org_db, " already exists. This local package will be installed.")) + cat(paste0("Local annotation package ", target_org_db, " already exists. 
This local package will be installed.\n")) install.packages(file.path("./", target_org_db), repos = NULL, type = "source", quiet = TRUE) } } @@ -83,4 +79,4 @@ install_annotations <- function(target_organism, refTablePath) { library(target_org_db, character.only = TRUE) cat(paste0("Using Annotation Database '", target_org_db, "'.\n")) return(target_org_db) -} \ No newline at end of file +} From c72d704aa25d3f19c8a662884c75654777fa0c8b Mon Sep 17 00:00:00 2001 From: torres-alexis Date: Mon, 16 Sep 2024 13:56:21 -0700 Subject: [PATCH 25/58] [GL_RefAnnotTable] Typo fixes --- .../GL-DPPD-7110-A/GL-DPPD-7110-A.md | 12 +++--------- 1 file changed, 3 insertions(+), 9 deletions(-) diff --git a/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md b/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md index 05afbd01..b82ab94c 100644 --- a/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md +++ b/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md @@ -253,7 +253,7 @@ base_output_name <- str_replace(base_gtf_filename, ".gtf.gz", "") # Add the species name to base_output_name if the reference source is not ENSEMBL if (!(ref_source %in% c("ensembl_plants", "ensembl_bacteria", "ensembl"))) { - base_output_name <- paste(str_replace(target_species_designation, " ", "_"), base_output_name, sep = "_") + base_output_name <- paste(str_replace(target_organism, " ", "_"), base_output_name, sep = "_") } out_table_filename <- paste0(base_output_name, "-GL-annotations.tsv") @@ -298,15 +298,9 @@ BiocManager::install(target_org_db, ask = FALSE) if (!requireNamespace(target_org_db, quietly = TRUE)) { tryCatch({ # Define genus and species regardless of target_org_db - target_species_designation <- ref_table %>% - filter(species == target_organism) %>% - pull(species) %>% - gsub("\\s+", " ", .) %>% - gsub("[^A-Za-z0-9 ]", "", .) 
- - genus_species <- strsplit(target_species_designation, " ")[[1]] + genus_species <- strsplit(target_organism, " ")[[1]] if (length(genus_species) < 1) { - stop("Species designation is not correctly formatted: ", target_species_designation) + stop("Species designation is not correctly formatted: ", target_organism) } genus <- genus_species[1] From fab25b4c2bb2661d53d70edcb9dd16889160d760 Mon Sep 17 00:00:00 2001 From: torres-alexis Date: Mon, 16 Sep 2024 14:59:49 -0700 Subject: [PATCH 26/58] [GL_RefAnnotTable] Typo fixes --- .../Workflow_Documentation/GL_RefAnnotTable-A/README.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md index a95c1ac3..a939e33e 100644 --- a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md +++ b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md @@ -123,7 +123,9 @@ Rather than running the workflow in your local environment, you can use a Docker 3. 
**Run the workflow using Docker:** ```bash - docker run -it -v $(pwd)/GL_RefAnnotTable-A_1.1.0:/home/rstudio/work quay.io/torres-alexis/gl_images:GL_RefAnnotTable_v1.1.0-rc.1 bash -c "cd /home/rstudio/work && Rscript GL-DPPD-7110-A_build-genome-annots-tab.R 'Mus musculus'" + docker run -it -v $(pwd)/GL_RefAnnotTable-A_1.1.0:/work \ + quay.io/torres-alexis/gl_images:GL_RefAnnotTable_v1.1.0-rc.1 \ + bash -c "cd /work && Rscript GL-DPPD-7110-A_build-genome-annots-tab.R 'Mus musculus'" ``` **Input data:** From 3e3dec6fb67fe2dc50bfa69ecfa1fb2b40d2a2e4 Mon Sep 17 00:00:00 2001 From: torres-alexis Date: Mon, 16 Sep 2024 16:38:25 -0700 Subject: [PATCH 27/58] [GL_RefAnnotTable] Add Docker/Singularity, fix R lib --- .../GL-DPPD-7110-A/GL-DPPD-7110-A.md | 4 ++++ .../GL_RefAnnotTable-A/README.md | 23 +++++++++++++++---- .../GL-DPPD-7110-A_build-genome-annots-tab.R | 4 ++++ 3 files changed, 27 insertions(+), 4 deletions(-) diff --git a/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md b/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md index b82ab94c..91a14fb2 100644 --- a/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md +++ b/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md @@ -182,6 +182,10 @@ Current GeneLab annotation tables are available on [figshare](https://figshare.c ## 0. 
Set Up Environment ```R +# Set R library path to current working directory +lib_path <- file.path(getwd()) +.libPaths(lib_path) + # Define variables associated with current pipeline and annotation table versions GL_DPPD_ID <- "GL-DPPD-7110-A" ref_tab_path <- "https://raw.githubusercontent.com/nasa/GeneLab_Data_Processing/master/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv" diff --git a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md index a939e33e..05485f89 100644 --- a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md +++ b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md @@ -10,6 +10,7 @@ The current GeneLab Reference Annotation Table (GL_RefAnnotTable-A) pipeline is 3. [Setup Execution Permission for Workflow Scripts](#3-setup-execution-permission-for-workflow-scripts) 4. [Run the workflow](#4-run-the-workflow) 5. [Run the annotations database creation function as a stand-alone script](#5-run-the-annotations-database-creation-function-as-a-stand-alone-script) +6. [Run the Workflow Using Docker or Singularity](#6-run-the-workflow-using-docker-or-singularity)
### 1. Install R and R packages @@ -103,16 +104,22 @@ Rscript install-org-db.R 'Bacillus subtilis' /path/to/GL-DPPD-7110-A_annotations - org.*.eg.db/ (species-specific annotation database, as a local R package) -### 6. Run the Workflow Using Docker +### 6. Run the Workflow Using Docker or Singularity -Rather than running the workflow in your local environment, you can use a Docker image. This method ensures that all dependencies are correctly installed. +Rather than running the workflow in your local environment, you can use a Docker or Singularity container. This method ensures that all dependencies are correctly installed. -1. **Pull the Docker image:** +1. **Pull the container image:** + Docker: ```bash docker pull quay.io/torres-alexis/gl_images:GL_RefAnnotTable_v1.1.0-rc.1 ``` + Singularity: + ```bash + singularity pull docker://quay.io/torres-alexis/gl_images:GL_RefAnnotTable_v1.1.0-rc.1 + ``` + 2. **Download the workflow files:** ```bash @@ -120,14 +127,22 @@ Rather than running the workflow in your local environment, you can use a Docker unzip GL_RefAnnotTable-A_1.1.0.zip ``` -3. **Run the workflow using Docker:** +3. **Run the workflow:** + Docker: ```bash docker run -it -v $(pwd)/GL_RefAnnotTable-A_1.1.0:/work \ quay.io/torres-alexis/gl_images:GL_RefAnnotTable_v1.1.0-rc.1 \ bash -c "cd /work && Rscript GL-DPPD-7110-A_build-genome-annots-tab.R 'Mus musculus'" ``` + Singularity: + ```bash + singularity exec -B $(pwd)/GL_RefAnnotTable-A_1.1.0:/work \ + gl_images_GL_RefAnnotTable_v1.1.0-rc.1.sif \ + bash -c "cd /work && Rscript GL-DPPD-7110-A_build-genome-annots-tab.R 'Mus musculus'" + ``` + **Input data:** - No input files are required. Specify the target organism using a positional command line argument. `Mus musculus` is used in the example above. To see a list of all available organisms, run `Rscript GL-DPPD-7110-A_build-genome-annots-tab.R` without positional arguments. 
The correct argument for each organism can also be found in the 'species' column of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) diff --git a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/GL-DPPD-7110-A_build-genome-annots-tab.R b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/GL-DPPD-7110-A_build-genome-annots-tab.R index dd0236d2..f6b043a7 100644 --- a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/GL-DPPD-7110-A_build-genome-annots-tab.R +++ b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/GL-DPPD-7110-A_build-genome-annots-tab.R @@ -3,6 +3,10 @@ # GeneLab script for generating organism-specific gene annotation tables # Example usage: Rscript GL-DPPD-7110-A_build-genome-annots-tab.R 'Mus musculus' +# Set R library path to current working directory +lib_path <- file.path(getwd()) +.libPaths(lib_path) + # Define variables associated with current pipeline and annotation table versions GL_DPPD_ID <- "GL-DPPD-7110-A" ref_tab_path <- "https://raw.githubusercontent.com/nasa/GeneLab_Data_Processing/master/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv" From f9c4f0327380a2154d2bf2e2b430ccad5342351c Mon Sep 17 00:00:00 2001 From: torres-alexis Date: Mon, 16 Sep 2024 17:52:23 -0700 Subject: [PATCH 28/58] [GL_RefAnnotTable] Update docker image --- .../Workflow_Documentation/GL_RefAnnotTable-A/README.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md index 05485f89..ad76b0fa 100644 --- a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md +++ 
b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md @@ -112,12 +112,12 @@ Rather than running the workflow in your local environment, you can use a Docker Docker: ```bash - docker pull quay.io/torres-alexis/gl_images:GL_RefAnnotTable_v1.1.0-rc.1 + docker pull quay.io/nasa_genelab/gl-refannottable:v1.0.0 ``` Singularity: ```bash - singularity pull docker://quay.io/torres-alexis/gl_images:GL_RefAnnotTable_v1.1.0-rc.1 + singularity pull docker://quay.io/nasa_genelab/gl-refannottable:v1.0.0 ``` 2. **Download the workflow files:** @@ -132,14 +132,14 @@ Rather than running the workflow in your local environment, you can use a Docker Docker: ```bash docker run -it -v $(pwd)/GL_RefAnnotTable-A_1.1.0:/work \ - quay.io/torres-alexis/gl_images:GL_RefAnnotTable_v1.1.0-rc.1 \ + quay.io/nasa_genelab/gl-refannottable:v1.0.0 \ bash -c "cd /work && Rscript GL-DPPD-7110-A_build-genome-annots-tab.R 'Mus musculus'" ``` Singularity: ```bash singularity exec -B $(pwd)/GL_RefAnnotTable-A_1.1.0:/work \ - gl_images_GL_RefAnnotTable_v1.1.0-rc.1.sif \ + gl-refannottable_v1.0.0.sif \ bash -c "cd /work && Rscript GL-DPPD-7110-A_build-genome-annots-tab.R 'Mus musculus'" ``` From ffec6ae65236842f50a0a7a5eb3e5ce40c1b8a55 Mon Sep 17 00:00:00 2001 From: torres-alexis Date: Mon, 16 Sep 2024 18:02:31 -0700 Subject: [PATCH 29/58] [GL_RefAnnotTable] Typo fixes --- .../GL-DPPD-7110-A/GL-DPPD-7110-A.md | 2 +- .../GL_RefAnnotTable-A/workflow_code/install-org-db.R | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md b/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md index 91a14fb2..bf6ffa64 100644 --- a/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md +++ b/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md @@ -301,7 +301,7 @@ if ( 
file.exists(out_table_filename) ) { BiocManager::install(target_org_db, ask = FALSE) if (!requireNamespace(target_org_db, quietly = TRUE)) { tryCatch({ - # Define genus and species regardless of target_org_db + # Parse organism's name in the reference table to create the org.db name (target_org_db) genus_species <- strsplit(target_organism, " ")[[1]] if (length(genus_species) < 1) { stop("Species designation is not correctly formatted: ", target_organism) diff --git a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/install-org-db.R b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/install-org-db.R index 606616b1..037a07a5 100644 --- a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/install-org-db.R +++ b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/install-org-db.R @@ -8,7 +8,7 @@ install_annotations <- function(target_organism, refTablePath) { filter(species == target_organism) %>% pull(taxon) - # Define genus and species regardless of target_org_db + # Parse organism's name in the reference table to create the org.db name (target_org_db) target_species_designation <- ref_table %>% filter(species == target_organism) %>% pull(species) %>% From c4acfad3f27b0ee0b0b630f915232d49809b3e02 Mon Sep 17 00:00:00 2001 From: torres-alexis Date: Mon, 16 Sep 2024 18:03:39 -0700 Subject: [PATCH 30/58] [GL_RefAnnotTable] Readd comment --- .../GL_RefAnnotTable-A/workflow_code/install-org-db.R | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/install-org-db.R b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/install-org-db.R index 037a07a5..7873f214 100644 --- a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/install-org-db.R +++ 
b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/install-org-db.R @@ -1,3 +1,8 @@ +# install-org-db.R + +# Function: Get annotations db from ref table. If no annotations db is defined, create the package name from genus, species, (and strain for microbes), +# Try to Bioconductor install annotations db. If fail then build the package using AnnotationForge, install it into the current directory. +# Requires ~80GB for NCBIFilesDir file caching install_annotations <- function(target_organism, refTablePath) { if (!file.exists(refTablePath)) { stop("Reference table file does not exist at the specified path: ", refTablePath) From 51570c476c90e60098f4a93c8bafada706e43df0 Mon Sep 17 00:00:00 2001 From: torres-alexis Date: Tue, 17 Sep 2024 00:32:38 -0700 Subject: [PATCH 31/58] [GL_RefAnnotTable] Typo fix --- .../Workflow_Documentation/GL_RefAnnotTable-A/CHANGELOG.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/CHANGELOG.md b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/CHANGELOG.md index a7c5ee8c..014cf89a 100644 --- a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/CHANGELOG.md +++ b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/CHANGELOG.md @@ -49,10 +49,10 @@ Bioconductor. 
Used for: - Bacteria: Ensembl bacteria release 59 - Updated software: - tidyverse version updated from 1.3.2 to 2.0.0 - - STRINGdb version updated from 2.8.4 to 2.16.0 + - STRINGdb version updated from 2.8.4 to 2.16.4 - PANTHER.db version updated from 1.0.11 to 1.0.12 - rtracklayer version updated from 1.56.1 to 1.64.0 - - Bioconductor version updated from 3.15.1 to 3.19.1 + - Bioconductor version updated from 3.15.1 to 3.19 - Removed org.EcK12.eg.db and replaced it with a locally created annotations database, as it is no longer available on Bioconductor - Changed the first argument of GL-DPPD-7110-A_build-genome-annots-tab.R from From 4f181bf305b56c191d5f4d3a58d066a30d9062c9 Mon Sep 17 00:00:00 2001 From: torres-alexis Date: Tue, 1 Oct 2024 15:05:47 -0700 Subject: [PATCH 32/58] Refactor instructions for singularity use --- .../GL_RefAnnotTable-A/README.md | 144 +++++++----------- .../workflow_code/bin/prepull_singularity.sh | 32 ++++ 2 files changed, 84 insertions(+), 92 deletions(-) create mode 100644 GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/bin/prepull_singularity.sh diff --git a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md index ad76b0fa..bf582001 100644 --- a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md +++ b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md @@ -1,83 +1,85 @@ -# GL_RefAnnotTable Workflow Information and Usage Instructions +# GL_RefAnnotTable-A Workflow Information and Usage Instructions -## General workflow info -The current GeneLab Reference Annotation Table (GL_RefAnnotTable-A) pipeline is implemented as an R workflow that can be run from a command line interface (CLI) using bash. 
The workflow can be used even if you are unfamiliar with R, but if you want to learn more about R, visit the [R-project about page here](https://www.r-project.org/about.html). Additionally, an introduction to R along with installation help and information about using R for bioinformatics can be found [here at Happy Belly Bioinformatics](https://astrobiomike.github.io/R/basics). +## General Workflow Info -## Utilizing the workflow +### Implementation Tools + +The current GeneLab Reference Annotation Table (GL_RefAnnotTable-A) pipeline is implemented as an R workflow and utilizes [Singularity](https://docs.sylabs.io/guides/3.10/user-guide/introduction.html) to run all tools in a containerized environment. This workflow is run using the command line interface (CLI) of any unix-based system. -1. [Install R and R packages](#1-install-r-and-r-packages) -2. [Download the workflow files](#2-download-the-workflow-files) -3. [Setup Execution Permission for Workflow Scripts](#3-setup-execution-permission-for-workflow-scripts) -4. [Run the workflow](#4-run-the-workflow) -5. [Run the annotations database creation function as a stand-alone script](#5-run-the-annotations-database-creation-function-as-a-stand-alone-script) -6. [Run the Workflow Using Docker or Singularity](#6-run-the-workflow-using-docker-or-singularity)
-### 1. Install R and R packages +--- +## Utilizing the Workflow -We recommend installing R via the [Comprehensive R Archive Network (CRAN)](https://cran.r-project.org/) as follows: +1. [Install Singularity](#1-install-singularity) +2. [Download the Workflow Files](#2-download-the-workflow-files) +3. [Fetch Singularity Images](#3-fetch-singularity-images) +4. [Run the Workflow](#4-run-the-workflow) +5. [Run the annotations database creation function as a stand-alone script](#5-run-the-annotations-database-creation-function-as-a-stand-alone-script) -1. Select the [CRAN Mirror](https://cran.r-project.org/mirrors.html) closest to your location. -2. Click the link under the "Download and Install R" section that's consistent with your machine. -3. Click on the R-4.4.0 package consistent with your machine to download. -4. Double click on the R-4.4.0.pkg downloaded in step 3 and follow the installation instructions. +
-Once R is installed, open a CLI terminal and run the following command to activate R: +--- -```bash -R -``` -` -Within an active R environment, run the following commands to install the required R packages: +### 1. Install Singularity -```R -install.packages("tidyverse") +Singularity is a container platform that allows usage of containerized software. This enables the GL_RefAnnotTable-A workflow to retrieve and use all software required for processing without the need to install the software directly on the user's system. -install.packages("BiocManager") +We recommend installing Singularity on a system wide level as per the associated [documentation](https://docs.sylabs.io/guides/3.10/admin-guide/admin_quickstart.html). -BiocManager::install("STRINGdb") -BiocManager::install("PANTHER.db") -BiocManager::install("rtracklayer") -BiocManager::install("AnnotationForge") -BiocManager::install("biomaRt") -BiocManager::install("GO.db") -``` +> Note: Singularity is also available through [Anaconda](https://anaconda.org/conda-forge/singularity).
+--- + ### 2. Download the Workflow Files -All files required for utilizing the GL_RefAnnotTable-A workflow for generating reference annotation tables are in the [workflow_code](workflow_code) directory. To get a copy of latest GL_RefAnnotTable version on to your system, run the following command: +All files required for utilizing the GL_RefAnnotTable-A workflow for generating reference annotation tables are in the [workflow_code](workflow_code) directory. To get a copy of latest GL_RefAnnotTable-A version on to your system, run the following commands: ```bash curl -LO https://github.com/nasa/GeneLab_Data_Processing/releases/download/GL_RefAnnotTable-A_1.1.0/GL_RefAnnotTable-A_1.1.0.zip -``` +unzip GL_RefAnnotTable-A_1.1.0.zip +```
-### 3. Setup Execution Permission for Workflow Scripts +--- -Once you've downloaded the GL_RefAnnotTable-A workflow directory as a zip file, unzip the workflow then `cd` into the GL_RefAnnotTable-A_1.1.0 directory on the CLI. Next, run the following command to set the execution permissions for the R script: +### 3. Fetch Singularity Images + +Although Singularity can fetch images from a url, doing so may cause issues as detailed [here](https://github.com/nextflow-io/nextflow/issues/1210). + +To avoid this issue, run the following command to fetch the Singularity images prior to running the GL_RefAnnotTable-A workflow: +> Note: This command should be run in the location containing the `GL_RefAnnotTable-A_1.1.0` directory that was downloaded in [step 2](#2-download-the-workflow-files) above. Depending on your network speed, fetching the images will take ~20 minutes. ```bash -unzip GL_RefAnnotTable-A_1.1.0.zip -cd GL_RefAnnotTable-A_1.1.0 -chmod -R u+x *R +bash GL_RefAnnotTable-A_1.1.0/bin/prepull_singularity.sh GL_RefAnnotTable-A_1.1.0/config/software/by_docker_image.config +``` + +Once complete, a `singularity` folder containing the Singularity images will be created. Run the following command to export this folder as a Singularity configuration environment variable: + +```bash +export SINGULARITY_CACHEDIR=$(pwd)/singularity ```
+--- + ### 4. Run the Workflow -While in the GL_RefAnnotTable workflow directory, you are now able to run the workflow. Below is an example of how to run the workflow to build an annotation table for Mus musculus (mouse): +While in the location containing the `GL_RefAnnotTable-A_1.1.0` directory that was downloaded in [step 2](#2-download-the-workflow-files), you are now able to run the workflow. Below is an example of how to run the workflow to build an annotation table for Mus musculus (mouse): ```bash -Rscript GL-DPPD-7110-A_build-genome-annots-tab.R 'Mus musculus' +singularity exec -B $(pwd)/GL_RefAnnotTable-A_1.1.0:/work \ + $SINGULARITY_CACHEDIR/gl-refannottable_v1.0.0.sif \ + bash -c "cd /work && Rscript GL-DPPD-7110-A_build-genome-annots-tab.R 'Mus musculus'" ``` **Input data:** -- No input files are required. Specify the target organism using a positional command line argument. `Mus musculus` is used in the example above. To see a list of all available organisms, run `Rscript GL-DPPD-7110-A_build-genome-annots-tab.R` without positional arguments. The correct argument for each organism can also be found in the 'species' column of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) +- No input files are required. Specify the target organism using a positional command line argument. `Mus musculus` is used in the example above. To see a list of all available organisms, run the command without positional arguments. 
The correct argument for each organism can also be found in the 'species' column of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) - Optional: a reference table CSV can be supplied as a second positional argument instead of using the default [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) @@ -86,12 +88,18 @@ Rscript GL-DPPD-7110-A_build-genome-annots-tab.R 'Mus musculus' - *-GL-annotations.tsv (Tab delineated table of gene annotations) - *-GL-build-info.txt (Text file containing information used to create the annotation table, including tool and tool versions and date of creation) +
+ +--- + ### 5. Run the annotations database creation function as a stand-alone script When the workflow is run, if the reference table does not specify an annotations database for the target_organism in the `annotations` column, the `install_annotations` function, defined in the `install-org-db.R` script, will be executed. This script will locally create and install an annotations database R package using AnnotationForge. This function can also be run as a stand-alone script from the command line: ```bash -Rscript install-org-db.R 'Bacillus subtilis' /path/to/GL-DPPD-7110-A_annotations.csv +singularity exec -B $(pwd)/GL_RefAnnotTable-A_1.1.0:/work \ + $SINGULARITY_CACHEDIR/gl-refannottable_v1.0.0.sif \ + bash -c "cd /work && Rscript install-org-db.R 'Bacillus subtilis' /path/to/GL-DPPD-7110-A_annotations.csv" ``` **Input data:** @@ -104,52 +112,4 @@ Rscript install-org-db.R 'Bacillus subtilis' /path/to/GL-DPPD-7110-A_annotations - org.*.eg.db/ (species-specific annotation database, as a local R package) -### 6. Run the Workflow Using Docker or Singularity - -Rather than running the workflow in your local environment, you can use a Docker or Singularity container. This method ensures that all dependencies are correctly installed. - -1. **Pull the container image:** - - Docker: - ```bash - docker pull quay.io/nasa_genelab/gl-refannottable:v1.0.0 - ``` - - Singularity: - ```bash - singularity pull docker://quay.io/nasa_genelab/gl-refannottable:v1.0.0 - ``` - -2. **Download the workflow files:** - - ```bash - curl -LO https://github.com/nasa/GeneLab_Data_Processing/releases/download/GL_RefAnnotTable-A_1.1.0/GL_RefAnnotTable-A_1.1.0.zip - unzip GL_RefAnnotTable-A_1.1.0.zip - ``` - -3. 
**Run the workflow:** - - Docker: - ```bash - docker run -it -v $(pwd)/GL_RefAnnotTable-A_1.1.0:/work \ - quay.io/nasa_genelab/gl-refannottable:v1.0.0 \ - bash -c "cd /work && Rscript GL-DPPD-7110-A_build-genome-annots-tab.R 'Mus musculus'" - ``` - - Singularity: - ```bash - singularity exec -B $(pwd)/GL_RefAnnotTable-A_1.1.0:/work \ - gl-refannottable_v1.0.0.sif \ - bash -c "cd /work && Rscript GL-DPPD-7110-A_build-genome-annots-tab.R 'Mus musculus'" - ``` - -**Input data:** - -- No input files are required. Specify the target organism using a positional command line argument. `Mus musculus` is used in the example above. To see a list of all available organisms, run `Rscript GL-DPPD-7110-A_build-genome-annots-tab.R` without positional arguments. The correct argument for each organism can also be found in the 'species' column of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) - -- Optional: a reference table CSV can be supplied as a second positional argument instead of using the default [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) - -**Output data:** - -- *-GL-annotations.tsv (Tab delineated table of gene annotations) -- *-GL-build-info.txt (Text file containing information used to create the annotation table, including tool and tool versions and date of creation) +
\ No newline at end of file diff --git a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/bin/prepull_singularity.sh b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/bin/prepull_singularity.sh new file mode 100644 index 00000000..de057d1b --- /dev/null +++ b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/bin/prepull_singularity.sh @@ -0,0 +1,32 @@ + +#!/usr/bin/env bash + +# Addresses issue: https://github.com/nextflow-io/nextflow/issues/1210 + +CONFILE=${1:-nextflow.config} +OUTDIR=${2:-./singularity} + +if [ ! -e $CONFILE ]; then + echo "$CONFILE does not exist" + exit +fi + +TMPFILE=`mktemp` + +CURDIR=$(pwd) + +mkdir -p $OUTDIR + +cat ${CONFILE}|grep 'container'|perl -lane 'if ( $_=~/container\s*\=\s*\"(\S+)\"/ ) { $_=~/container\s*\=\s*\"(\S+)\"/; print $1 unless ( $1=~/^\s*$/ or $1=~/\.sif/ or $1=~/\.img/ ) ; }' > $TMPFILE + +cd ${OUTDIR} + +while IFS= read -r line; do + name=$line + name=${name/:/-} + name=${name//\//-} + echo $name + singularity pull ${name}.img docker://$line +done < $TMPFILE + +cd $CURDIR From 232e421d8ea0de465f412ba7a0e19bb1f2b4659c Mon Sep 17 00:00:00 2001 From: torres-alexis Date: Tue, 1 Oct 2024 18:55:30 -0700 Subject: [PATCH 33/58] [GL_RefAnnotTable] add container + local instructions --- .../GL_RefAnnotTable-A/README.md | 215 +++++++++++++++--- ...ll_singularity.sh => prepull_apptainer.sh} | 4 +- .../config/software/by_docker_image.config | 6 + .../workflow_code/install-org-db.R | 18 +- 4 files changed, 200 insertions(+), 43 deletions(-) rename GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/bin/{prepull_singularity.sh => prepull_apptainer.sh} (88%) create mode 100644 GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/config/software/by_docker_image.config diff --git a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md 
b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md index bf582001..4aa7f8b5 100644 --- a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md +++ b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md @@ -1,85 +1,231 @@ # GL_RefAnnotTable-A Workflow Information and Usage Instructions -## General Workflow Info +## Table of Contents +- [General Workflow Info](#general-workflow-info) +- [Utilizing the Workflow](#utilizing-the-workflow) + - [Approach 1: Using Apptainer](#approach-1-using-apptainer) + - [1. Install Apptainer](#1-install-apptainer) + - [2. Download the Workflow Files](#2-download-the-workflow-files) + - [3. Fetch Apptainer Image](#3-fetch-apptainer-image) + - [4. Run the Workflow](#4-run-the-workflow) + - [5. Run the Annotations Database Creation Function as a Stand-Alone Script](#5-run-the-annotations-database-creation-function-as-a-stand-alone-script) + - [Approach 2: Using a Local R Environment](#approach-2-using-a-local-r-environment) + - [1. Install R and Required R Packages](#1-install-r-and-required-r-packages) + - [2. Download the Workflow Files](#2-download-the-workflow-files-1) + - [3. Set Execution Permissions for Workflow Scripts](#3-set-execution-permissions-for-workflow-scripts) + - [4. Run the Workflow](#4-run-the-workflow-1) + - [5. Run the Annotations Database Creation Function as a Stand-Alone Script](#5-run-the-annotations-database-creation-function-as-a-stand-alone-script-1) -### Implementation Tools +
+ +--- -The current GeneLab Reference Annotation Table (GL_RefAnnotTable-A) pipeline is implemented as an R workflow and utilizes [Singularity](https://docs.sylabs.io/guides/3.10/user-guide/introduction.html) to run all tools in a containerized environment. This workflow is run using the command line interface (CLI) of any unix-based system. +## General Workflow Info + +The current GeneLab Reference Annotation Table (GL_RefAnnotTable-A) pipeline is implemented as an R workflow that can be run from a command line interface (CLI) using bash. The workflow can be executed using either an Apptainer (formerly Singularity) container or a local R environment. The workflow can be used even if you are unfamiliar with R, but if you want to learn more about R, visit the [R-project about page here](https://www.r-project.org/about.html). Additionally, an introduction to R along with installation help and information about using R for bioinformatics can be found [here at Happy Belly Bioinformatics](https://astrobiomike.github.io/R/basics).
--- + ## Utilizing the Workflow -1. [Install Singularity](#1-install-singularity) -2. [Download the Workflow Files](#2-download-the-workflow-files) -3. [Fetch Singularity Images](#3-fetch-singularity-images) -4. [Run the Workflow](#4-run-the-workflow) -5. [Run the annotations database creation function as a stand-alone script](#5-run-the-annotations-database-creation-function-as-a-stand-alone-script) +The GL_RefAnnotTable-A workflow can be run using two approaches: + +1. **[Using Apptainer](#approach-1-using-apptainer)**. + +2. **[Using a local R environment](#approach-2-using-a-local-r-environment)**. + +Please follow the instructions for the approach that best matches your setup and preferences. Each method is explained in the sections below.
--- -### 1. Install Singularity +### Approach 1: Using Apptainer + +This approach allows you to run the workflow within a containerized environment, ensuring consistency and reproducibility. + +
+ +--- -Singularity is a container platform that allows usage of containerized software. This enables the GL_RefAnnotTable-A workflow to retrieve and use all software required for processing without the need to install the software directly on the user's system. +#### 1. Install Apptainer -We recommend installing Singularity on a system wide level as per the associated [documentation](https://docs.sylabs.io/guides/3.10/admin-guide/admin_quickstart.html). +Apptainer can be installed either through [Anaconda](https://anaconda.org/conda-forge/singularity) or as documented on the [Apptainer documentation page](https://apptainer.org/docs/admin/main/installation.html). -> Note: Singularity is also available through [Anaconda](https://anaconda.org/conda-forge/singularity). +> **Note**: If you prefer to use Anaconda, we recommend installing Miniconda for your system, as instructed by [Happy Belly Bioinformatics](https://astrobiomike.github.io/unix/conda-intro#getting-and-installing-conda). +> +> Once conda is installed on your system, you can install Apptainer by running: +> +> ```bash +> conda install -c conda-forge apptainer +> ```
--- -### 2. Download the Workflow Files +#### 2. Download the Workflow Files -All files required for utilizing the GL_RefAnnotTable-A workflow for generating reference annotation tables are in the [workflow_code](workflow_code) directory. To get a copy of latest GL_RefAnnotTable-A version on to your system, run the following commands: +Download the latest version of the GL_RefAnnotTable-A workflow: ```bash curl -LO https://github.com/nasa/GeneLab_Data_Processing/releases/download/GL_RefAnnotTable-A_1.1.0/GL_RefAnnotTable-A_1.1.0.zip unzip GL_RefAnnotTable-A_1.1.0.zip +cd GL_RefAnnotTable-A_1.1.0 +``` + +
+ +--- + +#### 3. Fetch Apptainer Image + +To fetch the Apptainer images needed for the workflow, run: + +```bash +bash bin/prepull_apptainer.sh config/software/by_docker_image.config +``` +> Note: This command should be run in the directory containing the GL_RefAnnotTable-A_1.1.0 folder downloaded in [step 2](#2-download-the-workflow-files). Depending on your network speed, this may take approximately 20 minutes. + +Once complete, an apptainer folder containing the Apptainer images will be created. Export this folder as an Apptainer configuration environment variable: + +```bash +export APPTAINER_CACHEDIR=$(pwd)/apptainer +``` + +
+ +--- + +#### 4. Run the Workflow + +While in the `GL_RefAnnotTable-A_1.1.0` directory, you can now run the workflow. Below is an example for generating an annotation table for Mus musculus (mouse): + +```bash +apptainer exec -B $(pwd):/work \ +$APPTAINER_CACHEDIR/quay.io-nasa_genelab-gl-refannottable-a-1.1.0.img \ +bash -c "cd /work && Rscript GL-DPPD-7110-A_build-genome-annots-tab.R 'Mus musculus'" +``` + +**Input data:** + +- No input files are required. +- Specify the target organism using a positional command line argument. `Mus musculus` is used in the example above. +- To see a list of all available organisms, run `Rscript GL-DPPD-7110-A_build-genome-annots-tab.R` without positional arguments. The correct argument for each organism can also be found in the 'species' column of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) +- Optional: a reference table CSV can be supplied as a second positional argument instead of using the default [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) + +**Output data:** + +- *-GL-annotations.tsv (Tab-delimited table of gene annotations) +- *-GL-build-info.txt (Text file containing information used to create the annotation table, including tool and tool versions and date of creation) + +
+ +--- + +#### 5. Run the Annotations Database Creation Function as a Stand-Alone Script + +If the reference table does not specify an annotations database for the target organism in the 'annotations' column, the `install_annotations` function (defined in `install-org-db.R`) will be executed. This function can also be run as a stand-alone script: + +```bash +apptainer exec -B $(pwd):/work \ + $APPTAINER_CACHEDIR/quay.io-nasa_genelab-gl-refannottable-a-1.1.0.img \ + bash -c "cd /work && Rscript install-org-db.R 'Bacillus subtilis'" ``` +**Input data:** + +- The target organism must be specified as the first positional command line argument. `Bacillus subtilis` is used in the example above. The correct argument for each organism can be found in the 'species' column of [GL-DPPD-7110-A_annotations.csv](https://raw.githubusercontent.com/nasa/GeneLab_Data_Processing/master/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) +- Optional: A local reference table can be supplied as a second positional argument. If not provided, the script will download the current version of GL-DPPD-7110-A_annotations.csv from GitHub by default. + +**Output data:** + +- org.*.eg.db/ (Species-specific annotation database, as a local R package) +
--- -### 3. Fetch Singularity Images +### Approach 2: Using a Local R Environment + +This approach allows you to run the workflow directly in your local R environment without using Apptainer containers. + +
+ +--- + +#### 1. Install R and Required R Packages + +We recommend installing R via the [Comprehensive R Archive Network (CRAN)](https://cran.r-project.org/): -Although Singularity can fetch images from a url, doing so may cause issues as detailed [here](https://github.com/nextflow-io/nextflow/issues/1210). +1. Select the [CRAN Mirror](https://cran.r-project.org/mirrors.html) closest to your location. +2. Navigate to the download page for your operating system. +3. Download and install R (e.g., R-4.4.0). -To avoid this issue, run the following command to fetch the Singularity images prior to running the GL_RefAnnotTable-A workflow: -> Note: This command should be run in the location containing the `GL_RefAnnotTable-A_1.1.0` directory that was downloaded in [step 2](#2-download-the-workflow-files) above. Depending on your network speed, fetching the images will take ~20 minutes. +Once R is installed, open a terminal and start R: ```bash -bash GL_RefAnnotTable-A_1.1.0/bin/prepull_singularity.sh GL_RefAnnotTable-A_1.1.0/config/software/by_docker_image.config +R ``` -Once complete, a `singularity` folder containing the Singularity images will be created. Run the following command to export this folder as a Singularity configuration environment variable: +Within an active R environment, run the following commands to install the required R packages: + +```R +install.packages("tidyverse") + +install.packages("BiocManager") + +BiocManager::install("STRINGdb") +BiocManager::install("PANTHER.db") +BiocManager::install("rtracklayer") +BiocManager::install("AnnotationForge") +BiocManager::install("biomaRt") +BiocManager::install("GO.db") +``` + +
+ +--- + +#### 2. Download the Workflow Files + +All files required for utilizing the GL_RefAnnotTable-A workflow for generating reference annotation tables are in the [workflow_code](workflow_code) directory. To get a copy of the latest GL_RefAnnotTable-A version onto your system, run the following command: + +```bash +curl -LO https://github.com/nasa/GeneLab_Data_Processing/releases/download/GL_RefAnnotTable-A_1.1.0/GL_RefAnnotTable-A_1.1.0.zip +``` + +
+ +--- + +#### 3. Set Execution Permissions for Workflow Scripts + +Once you've downloaded the GL_RefAnnotTable-A workflow directory as a zip file, unzip the workflow then `cd` into the GL_RefAnnotTable-A_1.1.0 directory on the CLI. Next, run the following command to set the execution permissions for the R script: ```bash -export SINGULARITY_CACHEDIR=$(pwd)/singularity +unzip GL_RefAnnotTable-A_1.1.0.zip +cd GL_RefAnnotTable-A_1.1.0 +chmod -R u+x *R ```
--- -### 4. Run the Workflow +#### 4. Run the Workflow -While in the location containing the `GL_RefAnnotTable-A_1.1.0` directory that was downloaded in [step 2](#2-download-the-workflow-files), you are now able to run the workflow. Below is an example of how to run the workflow to build an annotation table for Mus musculus (mouse): +While in the GL_RefAnnotTable workflow directory, you are now able to run the workflow. Below is an example of how to run the workflow to build an annotation table for Mus musculus (mouse): ```bash -singularity exec -B $(pwd)/GL_RefAnnotTable-A_1.1.0:/work \ - $SINGULARITY_CACHEDIR/gl-refannottable_v1.0.0.sif \ - bash -c "cd /work && Rscript GL-DPPD-7110-A_build-genome-annots-tab.R 'Mus musculus'" +Rscript GL-DPPD-7110-A_build-genome-annots-tab.R 'Mus musculus' ``` **Input data:** -- No input files are required. Specify the target organism using a positional command line argument. `Mus musculus` is used in the example above. To see a list of all available organisms, run the command without positional arguments. The correct argument for each organism can also be found in the 'species' column of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) +- No input files are required. Specify the target organism using a positional command line argument. `Mus musculus` is used in the example above. To see a list of all available organisms, run `Rscript GL-DPPD-7110-A_build-genome-annots-tab.R` without positional arguments. 
The correct argument for each organism can also be found in the 'species' column of [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) - Optional: a reference table CSV can be supplied as a second positional argument instead of using the default [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) @@ -92,24 +238,21 @@ singularity exec -B $(pwd)/GL_RefAnnotTable-A_1.1.0:/work \ --- -### 5. Run the annotations database creation function as a stand-alone script +#### 5. Run the Annotations Database Creation Function as a Stand-Alone Script -When the workflow is run, if the reference table does not specify an annotations database for the target_organism in the `annotations` column, the `install_annotations` function, defined in the `install-org-db.R` script, will be executed. This script will locally create and install an annotations database R package using AnnotationForge. This function can also be run as a stand-alone script from the command line: +If the reference table does not specify an annotations database for the target organism in the 'annotations' column, the `install_annotations` function (defined in `install-org-db.R`) will be executed. This function can also be run as a stand-alone script: ```bash -singularity exec -B $(pwd)/GL_RefAnnotTable-A_1.1.0:/work \ - $SINGULARITY_CACHEDIR/gl-refannottable_v1.0.0.sif \ - bash -c "cd /work && Rscript install-org-db.R 'Bacillus subtilis' /path/to/GL-DPPD-7110-A_annotations.csv" +Rscript install-org-db.R 'Bacillus subtilis' ``` **Input data:** -- The target organism must be specified as the first positional command line argument, `Bacillus subtilis` is used in the example above. 
The correct argument for each organism can be found in the 'species' column of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) - -- The path to a local reference table must also be supplied as the second positional argument +- The target organism must be specified as the first positional command line argument. `Bacillus subtilis` is used in the example above. The correct argument for each organism can be found in the 'species' column of [GL-DPPD-7110-A_annotations.csv](https://raw.githubusercontent.com/nasa/GeneLab_Data_Processing/master/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) +- Optional: A local reference table can be supplied as a second positional argument. If not provided, the script will download the current version of GL-DPPD-7110-A_annotations.csv from Github by default. **Output data:** - org.*.eg.db/ (species-specific annotation database, as a local R package) -
\ No newline at end of file +
diff --git a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/bin/prepull_singularity.sh b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/bin/prepull_apptainer.sh similarity index 88% rename from GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/bin/prepull_singularity.sh rename to GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/bin/prepull_apptainer.sh index de057d1b..b378dc2a 100644 --- a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/bin/prepull_singularity.sh +++ b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/bin/prepull_apptainer.sh @@ -4,7 +4,7 @@ # Addresses issue: https://github.com/nextflow-io/nextflow/issues/1210 CONFILE=${1:-nextflow.config} -OUTDIR=${2:-./singularity} +OUTDIR=${2:-./apptainer} if [ ! -e $CONFILE ]; then echo "$CONFILE does not exist" @@ -26,7 +26,7 @@ while IFS= read -r line; do name=${name/:/-} name=${name//\//-} echo $name - singularity pull ${name}.img docker://$line + apptainer pull ${name}.img docker://$line done < $TMPFILE cd $CURDIR diff --git a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/config/software/by_docker_image.config b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/config/software/by_docker_image.config new file mode 100644 index 00000000..93cc12ba --- /dev/null +++ b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/config/software/by_docker_image.config @@ -0,0 +1,6 @@ +// Config that specifies containers for nextflow processes +process { + withName: 'GL_REFANNOTTABLE_A' { + container = "quay.io/nasa_genelab/gl-refannottable-a:1.1.0" + } +} diff --git a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/install-org-db.R 
b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/install-org-db.R index 7873f214..698d5d5e 100644 --- a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/install-org-db.R +++ b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/install-org-db.R @@ -3,12 +3,20 @@ # Function: Get annotations db from ref table. If no annotations db is defined, create the package name from genus, species, (and strain for microbes), # Try to Bioconductor install annotations db. If fail then build the package using AnnotationForge, install it into the current directory. # Requires ~80GB for NCBIFilesDir file caching -install_annotations <- function(target_organism, refTablePath) { - if (!file.exists(refTablePath)) { - stop("Reference table file does not exist at the specified path: ", refTablePath) - } +install_annotations <- function(target_organism, refTablePath = NULL) { + # Default URL for the specific version of the reference CSV + default_url <- "https://raw.githubusercontent.com/nasa/GeneLab_Data_Processing/master/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv" + + # Use the provided path if available, otherwise use the default URL + csv_source <- ifelse(is.null(refTablePath), default_url, refTablePath) + + # Attempt to read the CSV file + tryCatch({ + ref_table <- read.csv(csv_source) + }, error = function(e) { + stop("Failed to read the reference table: ", e$message) + }) - ref_table <- read.csv(refTablePath) target_taxid <- ref_table %>% filter(species == target_organism) %>% pull(taxon) From 99daa554703709977baf07b636c63f25ee6a614f Mon Sep 17 00:00:00 2001 From: torres-alexis Date: Tue, 1 Oct 2024 19:08:36 -0700 Subject: [PATCH 34/58] [GL_RefAnnotTable] Fix typos --- .../GL_RefAnnotTable-A/README.md | 11 ++++------- 1 file changed, 4 insertions(+), 7 deletions(-) diff --git 
a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md index 4aa7f8b5..a38a4398 100644 --- a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md +++ b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md @@ -82,14 +82,14 @@ cd GL_RefAnnotTable-A_1.1.0 #### 3. Fetch Apptainer Image -To fetch the Apptainer images needed for the workflow, run: +To fetch the Apptainer image needed for the workflow, run: ```bash bash bin/prepull_apptainer.sh config/software/by_docker_image.config ``` > Note: This command should be run in the directory containing the GL_RefAnnotTable-A_1.1.0 folder downloaded in [step 2](#2-download-the-workflow-files). Depending on your network speed, this may take approximately 20 minutes. -Once complete, an apptainer folder containing the Apptainer images will be created. Export this folder as an Apptainer configuration environment variable: +Once complete, an apptainer folder containing the Apptainer image will be created. Export this folder as an Apptainer configuration environment variable: ```bash export APPTAINER_CACHEDIR=$(pwd)/apptainer @@ -111,9 +111,7 @@ bash -c "cd /work && Rscript GL-DPPD-7110-A_build-genome-annots-tab.R 'Mus muscu **Input data:** -- No input files are required. -- Specify the target organism using a positional command line argument. `Mus musculus` is used in the example above. -- To see a list of all available organisms, run `Rscript GL-DPPD-7110-A_build-genome-annots-tab.R` without positional arguments. The correct argument for each organism can also be found in the 'species' column of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) +- No input files are required. Specify the target organism using a positional command line argument. `Mus musculus` is used in the example above. 
To see a list of all available organisms, run `Rscript GL-DPPD-7110-A_build-genome-annots-tab.R` without positional arguments. The correct argument for each organism can also be found in the 'species' column of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) - Optional: a reference table CSV can be supplied as a second positional argument instead of using the default [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) **Output data:** @@ -225,8 +223,7 @@ Rscript GL-DPPD-7110-A_build-genome-annots-tab.R 'Mus musculus' **Input data:** -- No input files are required. Specify the target organism using a positional command line argument. `Mus musculus` is used in the example above. To see a list of all available organisms, run `Rscript GL-DPPD-7110-A_build-genome-annots-tab.R` without positional arguments. The correct argument for each organism can also be found in the 'species' column of [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) - +- No input files are required. Specify the target organism using a positional command line argument. `Mus musculus` is used in the example above. To see a list of all available organisms, run `Rscript GL-DPPD-7110-A_build-genome-annots-tab.R` without positional arguments. 
The correct argument for each organism can also be found in the 'species' column of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) - Optional: a reference table CSV can be supplied as a second positional argument instead of using the default [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) **Output data:** From 12b0587197a46b8a68dfa296f3593913116727be Mon Sep 17 00:00:00 2001 From: torres-alexis Date: Tue, 1 Oct 2024 22:20:32 -0700 Subject: [PATCH 35/58] [GL_RefAnnotTable] Fix interactive install-org-db --- .../workflow_code/install-org-db.R | 28 +++++++++++++++++-- 1 file changed, 26 insertions(+), 2 deletions(-) diff --git a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/install-org-db.R b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/install-org-db.R index 698d5d5e..f7e6f459 100644 --- a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/install-org-db.R +++ b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/install-org-db.R @@ -1,5 +1,14 @@ # install-org-db.R +# Set R library path to current working directory +lib_path <- file.path(getwd()) +.libPaths(lib_path) + +# Load required libraries +library(tidyverse) +library(AnnotationForge) +library(BiocManager) + # Function: Get annotations db from ref table. If no annotations db is defined, create the package name from genus, species, (and strain for microbes), # Try to Bioconductor install annotations db. If fail then build the package using AnnotationForge, install it into the current directory. 
# Requires ~80GB for NCBIFilesDir file caching @@ -11,8 +20,8 @@ install_annotations <- function(target_organism, refTablePath = NULL) { csv_source <- ifelse(is.null(refTablePath), default_url, refTablePath) # Attempt to read the CSV file - tryCatch({ - ref_table <- read.csv(csv_source) + ref_table <- tryCatch({ + read.csv(csv_source) }, error = function(e) { stop("Failed to read the reference table: ", e$message) }) @@ -60,6 +69,7 @@ install_annotations <- function(target_organism, refTablePath = NULL) { } else { cat(paste0("\nAttempting to install '", target_org_db, "' from Bioconductor...\n")) BiocManager::install(target_org_db, ask = FALSE) + if (requireNamespace(target_org_db, quietly = TRUE)) { cat(paste0("'", target_org_db, "' has been successfully installed from Bioconductor.\n")) } else { @@ -93,3 +103,17 @@ install_annotations <- function(target_organism, refTablePath = NULL) { cat(paste0("Using Annotation Database '", target_org_db, "'.\n")) return(target_org_db) } + +if (!interactive()) { + # Parse command line arguments + args <- commandArgs(trailingOnly = TRUE) + + if (length(args) < 1) { + stop("Usage: Rscript install-org-db.R [refTablePath]") + } + + target_organism <- args[1] + refTablePath <- if (length(args) > 1) args[2] else NULL + + install_annotations(target_organism, refTablePath) +} \ No newline at end of file From 86814bd32e6b377d366c25db1312a1d710cba2a2 Mon Sep 17 00:00:00 2001 From: torres-alexis Date: Thu, 10 Oct 2024 12:49:34 -0700 Subject: [PATCH 36/58] [GL_RefAnnotTable] switch from apptainer to singularity --- .../GL_RefAnnotTable-A/README.md | 198 ++++++------------ ...ll_apptainer.sh => prepull_singularity.sh} | 4 +- 2 files changed, 71 insertions(+), 131 deletions(-) rename GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/bin/{prepull_apptainer.sh => prepull_singularity.sh} (88%) diff --git a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md 
b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md index a38a4398..ae2c9b0e 100644 --- a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md +++ b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md @@ -1,112 +1,96 @@ # GL_RefAnnotTable-A Workflow Information and Usage Instructions ## Table of Contents -- [General Workflow Info](#general-workflow-info) + +- [General Workflow Information](#general-workflow-information) - [Utilizing the Workflow](#utilizing-the-workflow) - - [Approach 1: Using Apptainer](#approach-1-using-apptainer) - - [1. Install Apptainer](#1-install-apptainer) - - [2. Download the Workflow Files](#2-download-the-workflow-files) - - [3. Fetch Apptainer Image](#3-fetch-apptainer-image) - - [4. Run the Workflow](#4-run-the-workflow) - - [5. Run the Annotations Database Creation Function as a Stand-Alone Script](#5-run-the-annotations-database-creation-function-as-a-stand-alone-script) - - [Approach 2: Using a Local R Environment](#approach-2-using-a-local-r-environment) - - [1. Install R and Required R Packages](#1-install-r-and-required-r-packages) - - [2. Download the Workflow Files](#2-download-the-workflow-files-1) - - [3. Set Execution Permissions for Workflow Scripts](#3-set-execution-permissions-for-workflow-scripts) - - [4. Run the Workflow](#4-run-the-workflow-1) - - [5. Run the Annotations Database Creation Function as a Stand-Alone Script](#5-run-the-annotations-database-creation-function-as-a-stand-alone-script-1) - -
+ - [1. Download the Workflow Files](#1-download-the-workflow-files) + - [2. Run the Workflow](#2-run-the-workflow) + - [Approach 1: Using Singularity](#approach-1-using-singularity) + - [Step 1: Install Singularity](#step-1-install-singularity) + - [Step 2: Fetch the Singularity Image](#step-2-fetch-the-singularity-image) + - [Step 3: Run the Workflow](#step-3-run-the-workflow) + - [Step 4: Run the Annotations Database Creation Function as a Stand-Alone Script](#step-4-run-the-annotations-database-creation-function-as-a-stand-alone-script) + - [Approach 2: Using a Local R Environment](#approach-2-using-a-local-r-environment) + - [Step 1: Install R and Required R Packages](#step-1-install-r-and-required-r-packages) + - [Step 2: Run the Workflow](#step-2-run-the-workflow) + - [Step 3: Run the Annotations Database Creation Function as a Stand-Alone Script](#step-3-run-the-annotations-database-creation-function-as-a-stand-alone-script) --- -## General Workflow Info - -The current GeneLab Reference Annotation Table (GL_RefAnnotTable-A) pipeline is implemented as an R workflow that can be run from a command line interface (CLI) using bash. The workflow can be executed using either a Apptainer (formerly Singularity) container or a local R environment. The workflow can be used even if you are unfamiliar with R, but if you want to learn more about R, visit the [R-project about page here](https://www.r-project.org/about.html). Additionally, an introduction to R along with installation help and information about using R for bioinformatics can be found [here at Happy Belly Bioinformatics](https://astrobiomike.github.io/R/basics). +## General Workflow Information -
+The current GeneLab Reference Annotation Table (GL_RefAnnotTable-A) pipeline is implemented as an R workflow that can be run from a command line interface (CLI) using bash. The workflow can be executed using either a Singularity container or a local R environment. The workflow can be used even if you are unfamiliar with R, but if you want to learn more about R, visit the [R-project about page here](https://www.r-project.org/about.html). Additionally, an introduction to R along with installation help and information about using R for bioinformatics can be found [here at Happy Belly Bioinformatics](https://astrobiomike.github.io/R/basics). --- ## Utilizing the Workflow -The GL_RefAnnotTable-A workflow can be run using two approaches: - -1. **[Using Apptainer](#approach-1-using-apptainer)**. +To utilize the GL_RefAnnotTable-A workflow, follow the instructions below to download the necessary workflow files. Once downloaded, the workflow can be executed using two approaches: -2. **[Using a local R environment](#approach-2-using-a-local-r-environment)**. +1. **[Using Singularity](#approach-1-using-singularity)** +2. **[Using a Local R Environment](#approach-2-using-a-local-r-environment)** -Please follow the instructions for the approach that best matches your setup and preferences. Each method is explained in the sections below. - -
+Please follow the instructions for the approach that best matches your setup and preferences. Each method is explained in detail below. --- -### Approach 1: Using Apptainer +### 1. Download the Workflow Files -This approach allows you to run the workflow within a containerized environment, ensuring consistency and reproducibility. +Download the latest version of the GL_RefAnnotTable-A workflow: -
+```bash +curl -LO https://github.com/nasa/GeneLab_Data_Processing/releases/download/GL_RefAnnotTable-A_1.1.0/GL_RefAnnotTable-A_1.1.0.zip +unzip GL_RefAnnotTable-A_1.1.0.zip +``` --- -#### 1. Install Apptainer - -Apptainer can be installed either through [Anaconda](https://anaconda.org/conda-forge/singularity) or as documented on the [Apptainer documentation page](https://apptainer.org/docs/admin/main/installation.html). +### 2. Run the Workflow -> **Note**: If you prefer to use Anaconda, we recommend installing Miniconda for your system, as instructed by [Happy Belly Bioinformatics](https://astrobiomike.github.io/unix/conda-intro#getting-and-installing-conda). -> -> Once conda is installed on your system, you can install Apptainer by running: -> -> ```bash -> conda install -c conda-forge apptainer -> ``` +The GL_RefAnnotTable-A workflow can be run using two approaches: -
+- **[Approach 1: Using Singularity](#approach-1-using-singularity)** +- **[Approach 2: Using a Local R Environment](#approach-2-using-a-local-r-environment)** --- -#### 2. Download the Workflow Files +#### Approach 1: Using Singularity -Download the latest version of the GL_RefAnnotTable-A workflow: +This approach allows you to run the workflow within a containerized environment, ensuring consistency and reproducibility. -```bash -curl -LO https://github.com/nasa/GeneLab_Data_Processing/releases/download/GL_RefAnnotTable-A_1.1.0/GL_RefAnnotTable-A_1.1.0.zip -unzip GL_RefAnnotTable-A_1.1.0.zip -cd GL_RefAnnotTable-A_1.1.0 -``` +##### Step 1: Install Singularity -
+Singularity is a containerization platform for running applications portably and reproducibly. We use container images hosted on Quay.io to encapsulate all the necessary software and dependencies required by the GL_RefAnnotTable-A workflow. This setup allows you to run the workflow without installing any software directly on your system. Other containerization tools like Docker or Apptainer can also be used to pull and run these images. ---- +We recommend installing Singularity system-wide as per the official [Singularity installation documentation](https://docs.sylabs.io/guides/3.10/admin-guide/admin_quickstart.html). -#### 3. Fetch Apptainer Image +> **Note**: While Singularity is also available through [Anaconda](https://anaconda.org/conda-forge/singularity), we recommend installing Singularity system-wide following the official installation documentation. -To fetch the Apptainer image needed for the workflow, run: +##### Step 2: Fetch the Singularity Image -```bash -bash bin/prepull_apptainer.sh config/software/by_docker_image.config -``` -> Note: This command should be run in the directory containing the GL_RefAnnotTable-A_1.1.0 folder downloaded in [step 2](#2-download-the-workflow-files). Depending on your network speed, this may take approximately 20 minutes. +To pull the Singularity image needed for the workflow, you can use the provided script as directed below or pull the image directly. -Once complete, an apptainer folder containing the Apptainer image will be created. Export this folder as an Apptainer configuration environment variable: +> **Note**: This command should be run in the location containing the `GL_RefAnnotTable-A_1.1.0` directory that was downloaded in [step 1](#1-download-the-workflow-files). Depending on your network speed, fetching the images will take approximately 20 minutes. 
```bash -export APPTAINER_CACHEDIR=$(pwd)/apptainer +bash GL_RefAnnotTable-A_1.1.0/bin/prepull_singularity.sh GL_RefAnnotTable-A_1.1.0/config/software/by_docker_image.config ``` -
+Once complete, a `singularity` folder containing the Singularity images will be created. Run the following command to export this folder as an environment variable: ---- +```bash +export SINGULARITY_CACHEDIR=$(pwd)/singularity +``` -#### 4. Run the Workflow +##### Step 3: Run the Workflow -While in the `GL_RefAnnotTable-A_1.1.0` directory, you can now run the workflow. Below is an example for generating an annotation table for Mus musculus (mouse): +While in the directory containing the `GL_RefAnnotTable-A_1.1.0` folder, you can now run the workflow. Below is an example for generating the annotation table for *Mus musculus* (mouse): ```bash -apptainer exec -B $(pwd):/work \ -$APPTAINER_CACHEDIR/quay.io-nasa_genelab-gl-refannottable-a-1.1.0.img \ -bash -c "cd /work && Rscript GL-DPPD-7110-A_build-genome-annots-tab.R 'Mus musculus'" +singularity exec -B $(pwd)/GL_RefAnnotTable-A_1.1.0:/work \ +$SINGULARITY_CACHEDIR/quay.io-nasa_genelab-gl-refannottable-a-1.1.0.sif \ +Rscript /work/GL-DPPD-7110-A_build-genome-annots-tab.R 'Mus musculus' ``` **Input data:** @@ -119,18 +103,14 @@ bash -c "cd /work && Rscript GL-DPPD-7110-A_build-genome-annots-tab.R 'Mus muscu - *-GL-annotations.tsv (Tab delineated table of gene annotations) - *-GL-build-info.txt (Text file containing information used to create the annotation table, including tool and tool versions and date of creation) -
- ---- +##### Step 4: Run the Annotations Database Creation Function as a Stand-Alone Script -#### 5. Run the Annotations Database Creation Function as a Stand-Alone Script - -If the reference table does not specify an annotations database for the target organism in the annotations column, the `install_annotations` function (defined in `install-org-db.R`) will be executed. This function can also be run as a stand-alone script: +If the reference table does not specify an annotations database for the target organism in the 'annotations' column, the `install_annotations` function (defined in `install-org-db.R`) will be executed. This function can also be run as a stand-alone script: ```bash -apptainer exec -B $(pwd):/work \ - $APPTAINER_CACHEDIR/quay.io-nasa_genelab-gl-refannottable-a-1.1.0.img \ - bash -c "cd /work && Rscript install-org-db.R 'Bacillus subtilis'" +singularity exec -B $(pwd)/GL_RefAnnotTable-A_1.1.0:/work \ +$SINGULARITY_CACHEDIR/quay.io-nasa_genelab-gl-refannottable-a-1.1.0.sif \ +Rscript /work/install-org-db.R 'Bacillus subtilis' ``` **Input data:** @@ -142,39 +122,33 @@ apptainer exec -B $(pwd):/work \ - org.*.eg.db/ (Species-specific annotation database, as a local R package) -
- --- -### Approach 2: Using a Local R Environment +#### Approach 2: Using a Local R Environment -This approach allows you to run the workflow directly in your local R environment without using Apptainer containers. +This approach allows you to run the workflow directly in your local R environment without using containers. -
- ---- - -#### 1. Install R and Required R Packages +##### Step 1: Install R and Required R Packages We recommend installing R via the [Comprehensive R Archive Network (CRAN)](https://cran.r-project.org/): -1. Select the [CRAN Mirror](https://cran.r-project.org/mirrors.html) closest to your location. -2. Navigate to the download page for your operating system. -3. Download and install R (e.g., R-4.4.0). +1. Select the [CRAN Mirror](https://cran.r-project.org/mirrors.html) closest to your location. +2. Navigate to the download page for your operating system. +3. Download and install R (e.g., R-4.4.0). + +Once R is installed, you need to install the required R packages. -Once R is installed, open a terminal and start R: +Open a terminal and start R: ```bash R ``` -Within an active R environment, run the following commands to install the required R packages: +Within the R environment, run the following commands to install the required packages: ```R install.packages("tidyverse") - install.packages("BiocManager") - BiocManager::install("STRINGdb") BiocManager::install("PANTHER.db") BiocManager::install("rtracklayer") @@ -183,42 +157,12 @@ BiocManager::install("biomaRt") BiocManager::install("GO.db") ``` -
- ---- +##### Step 2: Run the Workflow -#### 2. Download the Workflow Files - -All files required for utilizing the GL_RefAnnotTable-A workflow for generating reference annotation tables are in the [workflow_code](workflow_code) directory. To get a copy of latest GL_RefAnnotTable version on to your system, run the following command: +While in the directory containing the `GL_RefAnnotTable-A_1.1.0` folder, you can now run the workflow. Below is an example of how to run the workflow to build an annotation table for *Mus musculus* (mouse): ```bash -curl -LO https://github.com/nasa/GeneLab_Data_Processing/releases/download/GL_RefAnnotTable-A_1.1.0/GL_RefAnnotTable-A_1.1.0.zip -``` - -
- ---- - -#### 3. Set Execution Permissions for Workflow Scripts - -Once you've downloaded the GL_RefAnnotTable-A workflow directory as a zip file, unzip the workflow then `cd` into the GL_RefAnnotTable-A_1.1.0 directory on the CLI. Next, run the following command to set the execution permissions for the R script: - -```bash -unzip GL_RefAnnotTable-A_1.1.0.zip -cd GL_RefAnnotTable-A_1.1.0 -chmod -R u+x *R -``` - -
- ---- - -#### 4. Run the Workflow - -While in the GL_RefAnnotTable workflow directory, you are now able to run the workflow. Below is an example of how to run the workflow to build an annotation table for Mus musculus (mouse): - -```bash -Rscript GL-DPPD-7110-A_build-genome-annots-tab.R 'Mus musculus' +Rscript GL_RefAnnotTable-A_1.1.0/GL-DPPD-7110-A_build-genome-annots-tab.R 'Mus musculus' ``` **Input data:** @@ -231,16 +175,12 @@ Rscript GL-DPPD-7110-A_build-genome-annots-tab.R 'Mus musculus' - *-GL-annotations.tsv (Tab delineated table of gene annotations) - *-GL-build-info.txt (Text file containing information used to create the annotation table, including tool and tool versions and date of creation) -
- ---- - -#### 5. Run the Annotations Database Creation Function as a Stand-Alone Script +##### Step 3: Run the Annotations Database Creation Function as a Stand-Alone Script If the reference table does not specify an annotations database for the target organism in the 'annotations' column, the `install_annotations` function (defined in `install-org-db.R`) will be executed. This function can also be run as a stand-alone script: ```bash -Rscript install-org-db.R 'Bacillus subtilis' +Rscript GL_RefAnnotTable-A_1.1.0/install-org-db.R 'Bacillus subtilis' ``` **Input data:** @@ -252,4 +192,4 @@ Rscript install-org-db.R 'Bacillus subtilis' - org.*.eg.db/ (species-specific annotation database, as a local R package) -
+--- \ No newline at end of file diff --git a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/bin/prepull_apptainer.sh b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/bin/prepull_singularity.sh similarity index 88% rename from GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/bin/prepull_apptainer.sh rename to GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/bin/prepull_singularity.sh index b378dc2a..e3150750 100644 --- a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/bin/prepull_apptainer.sh +++ b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/bin/prepull_singularity.sh @@ -4,7 +4,7 @@ # Addresses issue: https://github.com/nextflow-io/nextflow/issues/1210 CONFILE=${1:-nextflow.config} -OUTDIR=${2:-./apptainer} +OUTDIR=${2:-./singularity} if [ ! -e $CONFILE ]; then echo "$CONFILE does not exist" @@ -26,7 +26,7 @@ while IFS= read -r line; do name=${name/:/-} name=${name//\//-} echo $name - apptainer pull ${name}.img docker://$line + singulairty pull ${name}.img docker://$line done < $TMPFILE cd $CURDIR From d4b1c09fee25001939a99ae0b059231cde589c3c Mon Sep 17 00:00:00 2001 From: torres-alexis Date: Thu, 10 Oct 2024 16:35:30 -0700 Subject: [PATCH 37/58] fix typo --- .../GL_RefAnnotTable-A/workflow_code/bin/prepull_singularity.sh | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/bin/prepull_singularity.sh b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/bin/prepull_singularity.sh index e3150750..de057d1b 100644 --- a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/bin/prepull_singularity.sh +++ 
b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/bin/prepull_singularity.sh @@ -26,7 +26,7 @@ while IFS= read -r line; do name=${name/:/-} name=${name//\//-} echo $name - singulairty pull ${name}.img docker://$line + singularity pull ${name}.img docker://$line done < $TMPFILE cd $CURDIR From f8392223dc3ef58792516903086429be349e3389 Mon Sep 17 00:00:00 2001 From: torres-alexis Date: Thu, 10 Oct 2024 16:44:26 -0700 Subject: [PATCH 38/58] [GL_RefAnnotTable] fix .img image name --- .../Workflow_Documentation/GL_RefAnnotTable-A/README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md index ae2c9b0e..5bd34418 100644 --- a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md +++ b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md @@ -89,7 +89,7 @@ While in the directory containing the `GL_RefAnnotTable-A_1.1.0` folder, you can ```bash singularity exec -B $(pwd)/GL_RefAnnotTable-A_1.1.0:/work \ -$SINGULARITY_CACHEDIR/quay.io-nasa_genelab-gl-refannottable-a-1.1.0.sif \ +$SINGULARITY_CACHEDIR/quay.io-nasa_genelab-gl-refannottable-a-1.1.0.img \ Rscript /work/GL-DPPD-7110-A_build-genome-annots-tab.R 'Mus musculus' ``` @@ -109,7 +109,7 @@ If the reference table does not specify an annotations database for the target o ```bash singularity exec -B $(pwd)/GL_RefAnnotTable-A_1.1.0:/work \ -$SINGULARITY_CACHEDIR/quay.io-nasa_genelab-gl-refannottable-a-1.1.0.sif \ +$SINGULARITY_CACHEDIR/quay.io-nasa_genelab-gl-refannottable-a-1.1.0.img \ Rscript /work/install-org-db.R 'Bacillus subtilis' ``` From bd917b481b51a44b42ca40f11984da0c69bbff0c Mon Sep 17 00:00:00 2001 From: asaravia-butler <70983120+asaravia-butler@users.noreply.github.com> Date: Tue, 22 Oct 2024 11:55:45 -0700 Subject: [PATCH 39/58] 
Formatting updates --- .../GL_RefAnnotTable-A/README.md | 124 +++++++++++------- 1 file changed, 80 insertions(+), 44 deletions(-) diff --git a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md index 5bd34418..44e5b78d 100644 --- a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md +++ b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md @@ -16,25 +16,20 @@ - [Step 2: Run the Workflow](#step-2-run-the-workflow) - [Step 3: Run the Annotations Database Creation Function as a Stand-Alone Script](#step-3-run-the-annotations-database-creation-function-as-a-stand-alone-script) +
+ --- ## General Workflow Information The current GeneLab Reference Annotation Table (GL_RefAnnotTable-A) pipeline is implemented as an R workflow that can be run from a command line interface (CLI) using bash. The workflow can be executed using either a Singularity container or a local R environment. The workflow can be used even if you are unfamiliar with R, but if you want to learn more about R, visit the [R-project about page here](https://www.r-project.org/about.html). Additionally, an introduction to R along with installation help and information about using R for bioinformatics can be found [here at Happy Belly Bioinformatics](https://astrobiomike.github.io/R/basics). +
+ --- ## Utilizing the Workflow -To utilize the GL_RefAnnotTable-A workflow, follow the instructions below to download the necessary workflow files. Once downloaded, the workflow can be executed using two approaches: - -1. **[Using Singularity](#approach-1-using-singularity)** -2. **[Using a Local R Environment](#approach-2-using-a-local-r-environment)** - -Please follow the instructions for the approach that best matches your setup and preferences. Each method is explained in detail below. - ---- - ### 1. Download the Workflow Files Download the latest version of the GL_RefAnnotTable-A workflow: @@ -44,107 +39,135 @@ curl -LO https://github.com/nasa/GeneLab_Data_Processing/releases/download/GL_Re unzip GL_RefAnnotTable-A_1.1.0.zip ``` +
+ --- ### 2. Run the Workflow -The GL_RefAnnotTable-A workflow can be run using two approaches: +The GL_RefAnnotTable-A workflow can be run using one of two approaches: - **[Approach 1: Using Singularity](#approach-1-using-singularity)** - **[Approach 2: Using a Local R Environment](#approach-2-using-a-local-r-environment)** +Please follow the instructions for the approach that best matches your setup and preferences. Each method is explained in detail below. + --- -#### Approach 1: Using Singularity +### Approach 1: Using Singularity This approach allows you to run the workflow within a containerized environment, ensuring consistency and reproducibility. -##### Step 1: Install Singularity +#### Step 1: Install Singularity -Singularity is a containerization platform for running applications portably and reproducibly. We use container images hosted on Quay.io to encapsulate all the necessary software and dependencies required by the GL_RefAnnotTable-A workflow. This setup allows you to run the workflow without installing any software directly on your system. Other containerization tools like Docker or Apptainer can also be used to pull and run these images. +Singularity is a containerization platform for running applications portably and reproducibly. We use container images hosted on Quay.io to encapsulate all the necessary software and dependencies required by the GL_RefAnnotTable-A workflow. This setup allows you to run the workflow without installing any software directly on your system. +> ***Note**: Other containerization tools like Docker or Apptainer can also be used to pull and run these images.* We recommend installing Singularity system-wide as per the official [Singularity installation documentation](https://docs.sylabs.io/guides/3.10/admin-guide/admin_quickstart.html). 
-> **Note**: While Singularity is also available through [Anaconda](https://anaconda.org/conda-forge/singularity), we recommend installing Singularity system-wide following the official installation documentation. +> ***Note**: While Singularity is also available through [Anaconda](https://anaconda.org/conda-forge/singularity), we recommend installing Singularity system-wide following the official installation documentation.* -##### Step 2: Fetch the Singularity Image +
+ +#### Step 2: Fetch the Singularity Image To pull the Singularity image needed for the workflow, you can use the provided script as directed below or pull the image directly. -> **Note**: This command should be run in the location containing the `GL_RefAnnotTable-A_1.1.0` directory that was downloaded in [step 1](#1-download-the-workflow-files). Depending on your network speed, fetching the images will take approximately 20 minutes. +> ***Note**: This command should be run in the location containing the `GL_RefAnnotTable-A_1.1.0` directory that was downloaded in [step 1](#1-download-the-workflow-files). Depending on your network speed, fetching the images will take approximately 20 minutes.* + ```bash bash GL_RefAnnotTable-A_1.1.0/bin/prepull_singularity.sh GL_RefAnnotTable-A_1.1.0/config/software/by_docker_image.config ``` - -Once complete, a `singularity` folder containing the Singularity images will be created. Run the following command to export this folder as an environment variable: + +Once complete, a `singularity` folder containing the Singularity images will be created. Run the following command to export this folder as an environment variable: + ```bash export SINGULARITY_CACHEDIR=$(pwd)/singularity ``` +
-##### Step 3: Run the Workflow +#### Step 3: Run the Workflow -While in the directory containing the `GL_RefAnnotTable-A_1.1.0` folder, you can now run the workflow. Below is an example for generating the annotation table for *Mus musculus* (mouse): +While in the directory containing the `GL_RefAnnotTable-A_1.1.0` folder, you can now run the workflow. Below is an example for generating the annotation table for *Mus musculus* (mouse): + ```bash singularity exec -B $(pwd)/GL_RefAnnotTable-A_1.1.0:/work \ $SINGULARITY_CACHEDIR/quay.io-nasa_genelab-gl-refannottable-a-1.1.0.img \ Rscript /work/GL-DPPD-7110-A_build-genome-annots-tab.R 'Mus musculus' ``` - + **Input data:** -- No input files are required. Specify the target organism using a positional command line argument. `Mus musculus` is used in the example above. To see a list of all available organisms, run `Rscript GL-DPPD-7110-A_build-genome-annots-tab.R` without positional arguments. The correct argument for each organism can also be found in the 'species' column of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) -- Optional: a reference table CSV can be supplied as a second positional argument instead of using the default [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) +- No input files are required. Specify the species name of the target organism using a positional command line argument. `Mus musculus` is used in the example above. + > **Notes**: + > To see a list of all available organisms, run `Rscript GL-DPPD-7110-A_build-genome-annots-tab.R` without positional arguments. 
+ > The correct argument for each organism can also be found in the 'species' column of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) +- *Optional*: A local reference table CSV file can be supplied as a second positional argument. If not provided, the script will download the current version of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) table by default. + **Output data:** - *-GL-annotations.tsv (Tab delineated table of gene annotations) - *-GL-build-info.txt (Text file containing information used to create the annotation table, including tool and tool versions and date of creation) -##### Step 4: Run the Annotations Database Creation Function as a Stand-Alone Script +
+ +#### *Optional*: Run the Annotations Database Creation Function as a Stand-Alone Script -If the reference table does not specify an annotations database for the target organism in the 'annotations' column, the `install_annotations` function (defined in `install-org-db.R`) will be executed. This function can also be run as a stand-alone script: +If the reference table does not specify an annotations database for the target organism in the 'annotations' column of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) file, the `install_annotations` function (defined in `install-org-db.R`) will be executed by default. This function can also be run as a stand-alone script: + ```bash singularity exec -B $(pwd)/GL_RefAnnotTable-A_1.1.0:/work \ $SINGULARITY_CACHEDIR/quay.io-nasa_genelab-gl-refannottable-a-1.1.0.img \ Rscript /work/install-org-db.R 'Bacillus subtilis' ``` + **Input data:** -- The target organism must be specified as the first positional command line argument. `Bacillus subtilis` is used in the example above. The correct argument for each organism can be found in the 'species' column of [GL-DPPD-7110-A_annotations.csv](https://raw.githubusercontent.com/nasa/GeneLab_Data_Processing/master/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) -- Optional: A local reference table can be supplied as a second positional argument. If not provided, the script will download the current version of GL-DPPD-7110-A_annotations.csv from Github by default. +- The species name of the target organism must be specified as the first positional command line argument. `Bacillus subtilis` is used in the example above. 
+ > **Note**: The correct argument for each organism can also be found in the 'species' column of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) +- *Optional*: A local reference table CSV file can be supplied as a second positional argument. If not provided, the script will download the current version of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) table by default. + **Output data:** - org.*.eg.db/ (Species-specific annotation database, as a local R package) +
+ --- -#### Approach 2: Using a Local R Environment +### Approach 2: Using a Local R Environment This approach allows you to run the workflow directly in your local R environment without using containers. -##### Step 1: Install R and Required R Packages +#### Step 1: Install R and Required R Packages We recommend installing R via the [Comprehensive R Archive Network (CRAN)](https://cran.r-project.org/): 1. Select the [CRAN Mirror](https://cran.r-project.org/mirrors.html) closest to your location. + 2. Navigate to the download page for your operating system. -3. Download and install R (e.g., R-4.4.0). + +3. Download and install R (e.g., R-4.4.0). -Once R is installed, you need to install the required R packages. +Once R is installed, install the required R packages as follows: -Open a terminal and start R: +Open a terminal and start R: + ```bash R -``` +``` -Within the R environment, run the following commands to install the required packages: + +Within the R environment, run the following commands to install the required packages: + ```R install.packages("tidyverse") @@ -157,27 +180,38 @@ BiocManager::install("biomaRt") BiocManager::install("GO.db") ``` -##### Step 2: Run the Workflow +
-While in the directory containing the `GL_RefAnnotTable-A_1.1.0` folder, you can now run the workflow. Below is an example of how to run the workflow to build an annotation table for *Mus musculus* (mouse): +#### Step 2: Run the Workflow + +While in the directory containing the `GL_RefAnnotTable-A_1.1.0` folder, you can now run the workflow. Below is an example of how to run the workflow to build an annotation table for *Mus musculus* (mouse): + ```bash Rscript GL_RefAnnotTable-A_1.1.0/GL-DPPD-7110-A_build-genome-annots-tab.R 'Mus musculus' ``` + **Input data:** -- No input files are required. Specify the target organism using a positional command line argument. `Mus musculus` is used in the example above. To see a list of all available organisms, run `Rscript GL-DPPD-7110-A_build-genome-annots-tab.R` without positional arguments. The correct argument for each organism can also be found in the 'species' column of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) -- Optional: a reference table CSV can be supplied as a second positional argument instead of using the default [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) +- No input files are required. Specify the species name of the target organism using a positional command line argument. `Mus musculus` is used in the example above. + > **Notes**: + > To see a list of all available organisms, run `Rscript GL-DPPD-7110-A_build-genome-annots-tab.R` without positional arguments. + > The correct argument for each organism can also be found in the 'species' column of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) +- *Optional*: A local reference table CSV file can be supplied as a second positional argument. 
If not provided, the script will download the current version of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) table by default. + **Output data:** - *-GL-annotations.tsv (Tab delineated table of gene annotations) - *-GL-build-info.txt (Text file containing information used to create the annotation table, including tool and tool versions and date of creation) -##### Step 3: Run the Annotations Database Creation Function as a Stand-Alone Script +
+ +#### *Optional*: Run the Annotations Database Creation Function as a Stand-Alone Script -If the reference table does not specify an annotations database for the target organism in the 'annotations' column, the `install_annotations` function (defined in `install-org-db.R`) will be executed. This function can also be run as a stand-alone script: +If the reference table does not specify an annotations database for the target organism in the 'annotations' column of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) file, the `install_annotations` function (defined in `install-org-db.R`) will be executed by default. This function can also be run as a stand-alone script: + ```bash Rscript GL_RefAnnotTable-A_1.1.0/install-org-db.R 'Bacillus subtilis' @@ -185,11 +219,13 @@ Rscript GL_RefAnnotTable-A_1.1.0/install-org-db.R 'Bacillus subtilis' **Input data:** -- The target organism must be specified as the first positional command line argument. `Bacillus subtilis` is used in the example above. The correct argument for each organism can be found in the 'species' column of [GL-DPPD-7110-A_annotations.csv](https://raw.githubusercontent.com/nasa/GeneLab_Data_Processing/master/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) -- Optional: A local reference table can be supplied as a second positional argument. If not provided, the script will download the current version of GL-DPPD-7110-A_annotations.csv from Github by default. +- The species name of the target organism must be specified as the first positional command line argument. `Bacillus subtilis` is used in the example above. 
+ > **Note**: The correct argument for each organism can also be found in the 'species' column of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) +- *Optional*: A local reference table CSV file can be supplied as a second positional argument. If not provided, the script will download the current version of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) table by default. + **Output data:** -- org.*.eg.db/ (species-specific annotation database, as a local R package) +- org.*.eg.db/ (Species-specific annotation database, as a local R package) ---- \ No newline at end of file +--- From 499538d8f9880fa9ff1b55ef766331115eb459cd Mon Sep 17 00:00:00 2001 From: asaravia-butler <70983120+asaravia-butler@users.noreply.github.com> Date: Tue, 22 Oct 2024 12:25:48 -0700 Subject: [PATCH 40/58] Formatting updates --- .../GL_RefAnnotTable-A/README.md | 54 ++++++++++++------- 1 file changed, 34 insertions(+), 20 deletions(-) diff --git a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md index 44e5b78d..747b49ad 100644 --- a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md +++ b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md @@ -10,11 +10,11 @@ - [Step 1: Install Singularity](#step-1-install-singularity) - [Step 2: Fetch the Singularity Image](#step-2-fetch-the-singularity-image) - [Step 3: Run the Workflow](#step-3-run-the-workflow) - - [Step 4: Run the Annotations Database Creation Function as a Stand-Alone Script](#step-4-run-the-annotations-database-creation-function-as-a-stand-alone-script) + - [Optional: Run the Annotations Database Creation Function as a Stand-Alone 
Script](#optional-run-the-annotations-database-creation-function-as-a-stand-alone-script) - [Approach 2: Using a Local R Environment](#approach-2-using-a-local-r-environment) - [Step 1: Install R and Required R Packages](#step-1-install-r-and-required-r-packages) - [Step 2: Run the Workflow](#step-2-run-the-workflow) - - [Step 3: Run the Annotations Database Creation Function as a Stand-Alone Script](#step-3-run-the-annotations-database-creation-function-as-a-stand-alone-script) + - [Optional: Run the Annotations Database Creation Function as a Stand-Alone Script](#optional-run-the-annotations-database-creation-function-as-a-stand-alone-script)
@@ -52,18 +52,25 @@ The GL_RefAnnotTable-A workflow can be run using one of two approaches: Please follow the instructions for the approach that best matches your setup and preferences. Each method is explained in detail below. +
+ --- ### Approach 1: Using Singularity This approach allows you to run the workflow within a containerized environment, ensuring consistency and reproducibility. +
+ #### Step 1: Install Singularity Singularity is a containerization platform for running applications portably and reproducibly. We use container images hosted on Quay.io to encapsulate all the necessary software and dependencies required by the GL_RefAnnotTable-A workflow. This setup allows you to run the workflow without installing any software directly on your system. + > ***Note**: Other containerization tools like Docker or Apptainer can also be used to pull and run these images.* + -We recommend installing Singularity system-wide as per the official [Singularity installation documentation](https://docs.sylabs.io/guides/3.10/admin-guide/admin_quickstart.html). +We recommend installing Singularity system-wide as per the official [Singularity installation documentation](https://docs.sylabs.io/guides/3.10/admin-guide/admin_quickstart.html). + > ***Note**: While Singularity is also available through [Anaconda](https://anaconda.org/conda-forge/singularity), we recommend installing Singularity system-wide following the official installation documentation.* @@ -71,16 +78,16 @@ We recommend installing Singularity system-wide as per the official [Singularity #### Step 2: Fetch the Singularity Image -To pull the Singularity image needed for the workflow, you can use the provided script as directed below or pull the image directly. +To pull the Singularity image needed for the workflow, you can use the provided script as directed below or pull the image directly. -> ***Note**: This command should be run in the location containing the `GL_RefAnnotTable-A_1.1.0` directory that was downloaded in [step 1](#1-download-the-workflow-files). Depending on your network speed, fetching the images will take approximately 20 minutes.* +> ***Note**: This command should be run in the location containing the `GL_RefAnnotTable-A_1.1.0` directory that was downloaded in [step 1](#1-download-the-workflow-files). 
Depending on your network speed, fetching the images will take approximately 20 minutes.* ```bash bash GL_RefAnnotTable-A_1.1.0/bin/prepull_singularity.sh GL_RefAnnotTable-A_1.1.0/config/software/by_docker_image.config ``` -Once complete, a `singularity` folder containing the Singularity images will be created. Run the following command to export this folder as an environment variable: +Once complete, a `singularity` folder containing the Singularity images will be created. Run the following command to export this folder as an environment variable: ```bash @@ -98,13 +105,14 @@ singularity exec -B $(pwd)/GL_RefAnnotTable-A_1.1.0:/work \ $SINGULARITY_CACHEDIR/quay.io-nasa_genelab-gl-refannottable-a-1.1.0.img \ Rscript /work/GL-DPPD-7110-A_build-genome-annots-tab.R 'Mus musculus' ``` - +
+ **Input data:** - No input files are required. Specify the species name of the target organism using a positional command line argument. `Mus musculus` is used in the example above. - > **Notes**: - > To see a list of all available organisms, run `Rscript GL-DPPD-7110-A_build-genome-annots-tab.R` without positional arguments. - > The correct argument for each organism can also be found in the 'species' column of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) + > **Notes**: + > - To see a list of all available organisms, run `Rscript GL-DPPD-7110-A_build-genome-annots-tab.R` without positional arguments. + > - The correct argument for each organism can also be found in the 'species' column of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) - *Optional*: A local reference table CSV file can be supplied as a second positional argument. If not provided, the script will download the current version of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) table by default. @@ -117,7 +125,7 @@ Rscript /work/GL-DPPD-7110-A_build-genome-annots-tab.R 'Mus musculus' #### *Optional*: Run the Annotations Database Creation Function as a Stand-Alone Script -If the reference table does not specify an annotations database for the target organism in the 'annotations' column of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) file, the `install_annotations` function (defined in `install-org-db.R`) will be executed by default. 
This function can also be run as a stand-alone script: +If the reference table does not specify an annotations database for the target organism in the 'annotations' column of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) file, the `install_annotations` function (defined in `install-org-db.R`) will be executed by default. This function can also be run as a stand-alone script: ```bash @@ -126,11 +134,12 @@ $SINGULARITY_CACHEDIR/quay.io-nasa_genelab-gl-refannottable-a-1.1.0.img \ Rscript /work/install-org-db.R 'Bacillus subtilis' ``` +
**Input data:** -- The species name of the target organism must be specified as the first positional command line argument. `Bacillus subtilis` is used in the example above. - > **Note**: The correct argument for each organism can also be found in the 'species' column of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) +- The species name of the target organism must be specified as the first positional command line argument. `Bacillus subtilis` is used in the example above. + > **Note**: The correct argument for each organism can also be found in the 'species' column of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) - *Optional*: A local reference table CSV file can be supplied as a second positional argument. If not provided, the script will download the current version of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) table by default. @@ -146,6 +155,8 @@ Rscript /work/install-org-db.R 'Bacillus subtilis' This approach allows you to run the workflow directly in your local R environment without using containers. +
+ #### Step 1: Install R and Required R Packages We recommend installing R via the [Comprehensive R Archive Network (CRAN)](https://cran.r-project.org/): @@ -184,20 +195,21 @@ BiocManager::install("GO.db") #### Step 2: Run the Workflow -While in the directory containing the `GL_RefAnnotTable-A_1.1.0` folder, you can now run the workflow. Below is an example of how to run the workflow to build an annotation table for *Mus musculus* (mouse): +While in the directory containing the `GL_RefAnnotTable-A_1.1.0` folder, you can now run the workflow. Below is an example of how to run the workflow to build an annotation table for *Mus musculus* (mouse): ```bash Rscript GL_RefAnnotTable-A_1.1.0/GL-DPPD-7110-A_build-genome-annots-tab.R 'Mus musculus' ``` - + +
**Input data:** - No input files are required. Specify the species name of the target organism using a positional command line argument. `Mus musculus` is used in the example above. - > **Notes**: - > To see a list of all available organisms, run `Rscript GL-DPPD-7110-A_build-genome-annots-tab.R` without positional arguments. - > The correct argument for each organism can also be found in the 'species' column of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) + > **Notes**: + > - To see a list of all available organisms, run `Rscript GL-DPPD-7110-A_build-genome-annots-tab.R` without positional arguments. + > - The correct argument for each organism can also be found in the 'species' column of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) - *Optional*: A local reference table CSV file can be supplied as a second positional argument. If not provided, the script will download the current version of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) table by default. @@ -210,17 +222,19 @@ Rscript GL_RefAnnotTable-A_1.1.0/GL-DPPD-7110-A_build-genome-annots-tab.R 'Mus m #### *Optional*: Run the Annotations Database Creation Function as a Stand-Alone Script -If the reference table does not specify an annotations database for the target organism in the 'annotations' column of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) file, the `install_annotations` function (defined in `install-org-db.R`) will be executed by default. 
This function can also be run as a stand-alone script: +If the reference table does not specify an annotations database for the target organism in the 'annotations' column of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) file, the `install_annotations` function (defined in `install-org-db.R`) will be executed by default. This function can also be run as a stand-alone script: ```bash Rscript GL_RefAnnotTable-A_1.1.0/install-org-db.R 'Bacillus subtilis' ``` +
+ **Input data:** - The species name of the target organism must be specified as the first positional command line argument. `Bacillus subtilis` is used in the example above. - > **Note**: The correct argument for each organism can also be found in the 'species' column of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) + > **Note**: The correct argument for each organism can also be found in the 'species' column of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) - *Optional*: A local reference table CSV file can be supplied as a second positional argument. If not provided, the script will download the current version of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) table by default. From 75dd660e1d188f1a1ceb7fe75cb64689f925100c Mon Sep 17 00:00:00 2001 From: asaravia-butler <70983120+asaravia-butler@users.noreply.github.com> Date: Tue, 22 Oct 2024 12:31:18 -0700 Subject: [PATCH 41/58] Typo and link fixes --- .../GL_RefAnnotTable-A/README.md | 26 ++++++++++++------- 1 file changed, 16 insertions(+), 10 deletions(-) diff --git a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md index 747b49ad..b8f284aa 100644 --- a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md +++ b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md @@ -10,11 +10,11 @@ - [Step 1: Install Singularity](#step-1-install-singularity) - [Step 2: Fetch the Singularity Image](#step-2-fetch-the-singularity-image) - [Step 3: Run the Workflow](#step-3-run-the-workflow) - - [Optional: Run the Annotations Database Creation Function as a Stand-Alone 
Script](#optional-run-the-annotations-database-creation-function-as-a-stand-alone-script) + - [Optional: Run the Annotations Database Creation Function as a Stand-Alone Script via Singularity](#optional-run-the-annotations-database-creation-function-as-a-stand-alone-script-via-singularity) - [Approach 2: Using a Local R Environment](#approach-2-using-a-local-r-environment) - [Step 1: Install R and Required R Packages](#step-1-install-r-and-required-r-packages) - [Step 2: Run the Workflow](#step-2-run-the-workflow) - - [Optional: Run the Annotations Database Creation Function as a Stand-Alone Script](#optional-run-the-annotations-database-creation-function-as-a-stand-alone-script) + - [Optional: Run the Annotations Database Creation Function as a Stand-Alone Script via R](#optional-run-the-annotations-database-creation-function-as-a-stand-alone-script-via-r)
@@ -111,19 +111,21 @@ Rscript /work/GL-DPPD-7110-A_build-genome-annots-tab.R 'Mus musculus' - No input files are required. Specify the species name of the target organism using a positional command line argument. `Mus musculus` is used in the example above. > **Notes**: - > - To see a list of all available organisms, run `Rscript GL-DPPD-7110-A_build-genome-annots-tab.R` without positional arguments. - > - The correct argument for each organism can also be found in the 'species' column of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) + > - To see a list of all available organisms, run `Rscript GL-DPPD-7110-A_build-genome-annots-tab.R` without positional arguments. + > - The correct argument for each organism can also be found in the 'species' column of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) + - *Optional*: A local reference table CSV file can be supplied as a second positional argument. If not provided, the script will download the current version of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) table by default. **Output data:** - *-GL-annotations.tsv (Tab delineated table of gene annotations) + - *-GL-build-info.txt (Text file containing information used to create the annotation table, including tool and tool versions and date of creation)
-#### *Optional*: Run the Annotations Database Creation Function as a Stand-Alone Script +#### *Optional*: Run the Annotations Database Creation Function as a Stand-Alone Script via Singularity If the reference table does not specify an annotations database for the target organism in the 'annotations' column of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) file, the `install_annotations` function (defined in `install-org-db.R`) will be executed by default. This function can also be run as a stand-alone script: @@ -139,7 +141,8 @@ Rscript /work/install-org-db.R 'Bacillus subtilis' **Input data:** - The species name of the target organism must be specified as the first positional command line argument. `Bacillus subtilis` is used in the example above. - > **Note**: The correct argument for each organism can also be found in the 'species' column of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) + > **Note**: The correct argument for each organism can also be found in the 'species' column of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) + - *Optional*: A local reference table CSV file can be supplied as a second positional argument. If not provided, the script will download the current version of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) table by default. @@ -208,19 +211,21 @@ Rscript GL_RefAnnotTable-A_1.1.0/GL-DPPD-7110-A_build-genome-annots-tab.R 'Mus m - No input files are required. Specify the species name of the target organism using a positional command line argument. `Mus musculus` is used in the example above. > **Notes**: - > - To see a list of all available organisms, run `Rscript GL-DPPD-7110-A_build-genome-annots-tab.R` without positional arguments. 
- > - The correct argument for each organism can also be found in the 'species' column of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) + > - To see a list of all available organisms, run `Rscript GL-DPPD-7110-A_build-genome-annots-tab.R` without positional arguments. + > - The correct argument for each organism can also be found in the 'species' column of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) + - *Optional*: A local reference table CSV file can be supplied as a second positional argument. If not provided, the script will download the current version of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) table by default. **Output data:** - *-GL-annotations.tsv (Tab delineated table of gene annotations) + - *-GL-build-info.txt (Text file containing information used to create the annotation table, including tool and tool versions and date of creation)
-#### *Optional*: Run the Annotations Database Creation Function as a Stand-Alone Script +#### *Optional*: Run the Annotations Database Creation Function as a Stand-Alone Script via R If the reference table does not specify an annotations database for the target organism in the 'annotations' column of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) file, the `install_annotations` function (defined in `install-org-db.R`) will be executed by default. This function can also be run as a stand-alone script: @@ -234,7 +239,8 @@ Rscript GL_RefAnnotTable-A_1.1.0/install-org-db.R 'Bacillus subtilis' **Input data:** - The species name of the target organism must be specified as the first positional command line argument. `Bacillus subtilis` is used in the example above. - > **Note**: The correct argument for each organism can also be found in the 'species' column of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) + > **Note**: The correct argument for each organism can also be found in the 'species' column of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) + - *Optional*: A local reference table CSV file can be supplied as a second positional argument. If not provided, the script will download the current version of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) table by default. 
From f54a529378afcb35381664b7bcf00ce03f350b91 Mon Sep 17 00:00:00 2001 From: asaravia-butler <70983120+asaravia-butler@users.noreply.github.com> Date: Tue, 22 Oct 2024 21:20:02 -0700 Subject: [PATCH 42/58] Formatting updates --- .../GL_RefAnnotTable-A/README.md | 90 ++++++++----------- 1 file changed, 37 insertions(+), 53 deletions(-) diff --git a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md index b8f284aa..e3990af5 100644 --- a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md +++ b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md @@ -10,11 +10,13 @@ - [Step 1: Install Singularity](#step-1-install-singularity) - [Step 2: Fetch the Singularity Image](#step-2-fetch-the-singularity-image) - [Step 3: Run the Workflow](#step-3-run-the-workflow) - - [Optional: Run the Annotations Database Creation Function as a Stand-Alone Script via Singularity](#optional-run-the-annotations-database-creation-function-as-a-stand-alone-script-via-singularity) - [Approach 2: Using a Local R Environment](#approach-2-using-a-local-r-environment) - [Step 1: Install R and Required R Packages](#step-1-install-r-and-required-r-packages) - [Step 2: Run the Workflow](#step-2-run-the-workflow) - - [Optional: Run the Annotations Database Creation Function as a Stand-Alone Script via R](#optional-run-the-annotations-database-creation-function-as-a-stand-alone-script-via-r) + - [Workflow Input/Output Data](#workflow-input-output-data) + - [3. Run the Annotations Database Creation Function as a Stand-Alone Script](#3-run-the-annotations-database-creation-function-as-a-stand-alone-script) + - [Using Singularity](#using-singularity) + - [Using a Local R Environment](#using-a-local-r-environment)
@@ -93,11 +95,12 @@ Once complete, a `singularity` folder containing the Singularity images will be ```bash export SINGULARITY_CACHEDIR=$(pwd)/singularity ``` +
#### Step 3: Run the Workflow -While in the directory containing the `GL_RefAnnotTable-A_1.1.0` folder, you can now run the workflow. Below is an example for generating the annotation table for *Mus musculus* (mouse): +While in the directory containing the `GL_RefAnnotTable-A_1.1.0` folder that was downloaded in [step 1](#1-download-the-workflow-files), you can now run the workflow. Below is an example for generating the annotation table for *Mus musculus* (mouse): ```bash @@ -105,50 +108,6 @@ singularity exec -B $(pwd)/GL_RefAnnotTable-A_1.1.0:/work \ $SINGULARITY_CACHEDIR/quay.io-nasa_genelab-gl-refannottable-a-1.1.0.img \ Rscript /work/GL-DPPD-7110-A_build-genome-annots-tab.R 'Mus musculus' ``` -
- -**Input data:** - -- No input files are required. Specify the species name of the target organism using a positional command line argument. `Mus musculus` is used in the example above. - > **Notes**: - > - To see a list of all available organisms, run `Rscript GL-DPPD-7110-A_build-genome-annots-tab.R` without positional arguments. - > - The correct argument for each organism can also be found in the 'species' column of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) - -- *Optional*: A local reference table CSV file can be supplied as a second positional argument. If not provided, the script will download the current version of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) table by default. - - -**Output data:** - -- *-GL-annotations.tsv (Tab delineated table of gene annotations) - -- *-GL-build-info.txt (Text file containing information used to create the annotation table, including tool and tool versions and date of creation) - -
- -#### *Optional*: Run the Annotations Database Creation Function as a Stand-Alone Script via Singularity - -If the reference table does not specify an annotations database for the target organism in the 'annotations' column of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) file, the `install_annotations` function (defined in `install-org-db.R`) will be executed by default. This function can also be run as a stand-alone script: - - -```bash -singularity exec -B $(pwd)/GL_RefAnnotTable-A_1.1.0:/work \ -$SINGULARITY_CACHEDIR/quay.io-nasa_genelab-gl-refannottable-a-1.1.0.img \ -Rscript /work/install-org-db.R 'Bacillus subtilis' -``` - -
- -**Input data:** - -- The species name of the target organism must be specified as the first positional command line argument. `Bacillus subtilis` is used in the example above. - > **Note**: The correct argument for each organism can also be found in the 'species' column of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) - -- *Optional*: A local reference table CSV file can be supplied as a second positional argument. If not provided, the script will download the current version of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) table by default. - - -**Output data:** - -- org.*.eg.db/ (Species-specific annotation database, as a local R package)
@@ -198,7 +157,7 @@ BiocManager::install("GO.db") #### Step 2: Run the Workflow -While in the directory containing the `GL_RefAnnotTable-A_1.1.0` folder, you can now run the workflow. Below is an example of how to run the workflow to build an annotation table for *Mus musculus* (mouse): +While in the directory containing the `GL_RefAnnotTable-A_1.1.0` folder that was downloaded in [step 1](#1-download-the-workflow-files), you can now run the workflow. Below is an example of how to run the workflow to build an annotation table for *Mus musculus* (mouse): ```bash @@ -207,9 +166,17 @@ Rscript GL_RefAnnotTable-A_1.1.0/GL-DPPD-7110-A_build-genome-annots-tab.R 'Mus m
+ --- + + ### Workflow Input/Output Data + +The input and output data are the same for both [Approach 1: Using Singularity](#approach-1-using-singularity) and [Approach 2: Using a Local R Environment](#approach-2-using-a-local-r-environment). + +
+ **Input data:** -- No input files are required. Specify the species name of the target organism using a positional command line argument. `Mus musculus` is used in the example above. +- No input files are required. Specify the species name of the target organism using a positional command line argument. `Mus musculus` is used in both the Singularity and the local R environment examples above. > **Notes**: > - To see a list of all available organisms, run `Rscript GL-DPPD-7110-A_build-genome-annots-tab.R` without positional arguments. > - The correct argument for each organism can also be found in the 'species' column of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) @@ -223,12 +190,27 @@ Rscript GL_RefAnnotTable-A_1.1.0/GL-DPPD-7110-A_build-genome-annots-tab.R 'Mus m - *-GL-build-info.txt (Text file containing information used to create the annotation table, including tool and tool versions and date of creation) -
+
-#### *Optional*: Run the Annotations Database Creation Function as a Stand-Alone Script via R +--- + +### 3. Run the Annotations Database Creation Function as a Stand-Alone Script + +If the reference table does not specify an annotations database for the target organism in the 'annotations' column of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) file, the `install_annotations` function (defined in `install-org-db.R`) will be executed by default. This function can also be run as a stand-alone script: + +
+ +#### Using Singularity -If the reference table does not specify an annotations database for the target organism in the 'annotations' column of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) file, the `install_annotations` function (defined in `install-org-db.R`) will be executed by default. This function can also be run as a stand-alone script: +```bash +singularity exec -B $(pwd)/GL_RefAnnotTable-A_1.1.0:/work \ +$SINGULARITY_CACHEDIR/quay.io-nasa_genelab-gl-refannottable-a-1.1.0.img \ +Rscript /work/install-org-db.R 'Bacillus subtilis' +``` +
+ +#### Using a Local R Environment ```bash Rscript GL_RefAnnotTable-A_1.1.0/install-org-db.R 'Bacillus subtilis' @@ -238,7 +220,7 @@ Rscript GL_RefAnnotTable-A_1.1.0/install-org-db.R 'Bacillus subtilis' **Input data:** -- The species name of the target organism must be specified as the first positional command line argument. `Bacillus subtilis` is used in the example above. +- The species name of the target organism must be specified as the first positional command line argument. `Bacillus subtilis` is used in both the Singularity and local R examples above. > **Note**: The correct argument for each organism can also be found in the 'species' column of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) - *Optional*: A local reference table CSV file can be supplied as a second positional argument. If not provided, the script will download the current version of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) table by default. @@ -248,4 +230,6 @@ Rscript GL_RefAnnotTable-A_1.1.0/install-org-db.R 'Bacillus subtilis' - org.*.eg.db/ (Species-specific annotation database, as a local R package) +
+ --- From f015225fcee1d2743d9bc6e78892dc90a738df70 Mon Sep 17 00:00:00 2001 From: Alexis <71944751+torres-alexis@users.noreply.github.com> Date: Wed, 23 Oct 2024 13:42:34 -0700 Subject: [PATCH 43/58] remove lib path --- .../GL_RefAnnotTable-A/workflow_code/install-org-db.R | 6 +----- 1 file changed, 1 insertion(+), 5 deletions(-) diff --git a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/install-org-db.R b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/install-org-db.R index f7e6f459..c1ad5613 100644 --- a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/install-org-db.R +++ b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/install-org-db.R @@ -1,9 +1,5 @@ # install-org-db.R -# Set R library path to current working directory -lib_path <- file.path(getwd()) -.libPaths(lib_path) - # Load required libraries library(tidyverse) library(AnnotationForge) @@ -116,4 +112,4 @@ if (!interactive()) { refTablePath <- if (length(args) > 1) args[2] else NULL install_annotations(target_organism, refTablePath) -} \ No newline at end of file +} From fce6a73a7389c12d1115c1b117533d478c3e6d31 Mon Sep 17 00:00:00 2001 From: Alexis <71944751+torres-alexis@users.noreply.github.com> Date: Wed, 23 Oct 2024 13:42:49 -0700 Subject: [PATCH 44/58] Update GL-DPPD-7110-A_build-genome-annots-tab.R remove lib path --- .../workflow_code/GL-DPPD-7110-A_build-genome-annots-tab.R | 3 --- 1 file changed, 3 deletions(-) diff --git a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/GL-DPPD-7110-A_build-genome-annots-tab.R b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/GL-DPPD-7110-A_build-genome-annots-tab.R index f6b043a7..f390d4b0 100644 --- a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/GL-DPPD-7110-A_build-genome-annots-tab.R +++ 
b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/GL-DPPD-7110-A_build-genome-annots-tab.R @@ -3,9 +3,6 @@ # GeneLab script for generating organism-specific gene annotation tables # Example usage: Rscript GL-DPPD-7110-A_build-genome-annots-tab.R 'Mus musculus' -# Set R library path to current working directory -lib_path <- file.path(getwd()) -.libPaths(lib_path) # Define variables associated with current pipeline and annotation table versions GL_DPPD_ID <- "GL-DPPD-7110-A" From 3cc61cffd8096840a89cb6cd44af6fca8bd24942 Mon Sep 17 00:00:00 2001 From: Alexis <71944751+torres-alexis@users.noreply.github.com> Date: Wed, 23 Oct 2024 14:27:30 -0700 Subject: [PATCH 45/58] Add possible paths to install-org-db execution function --- .../GL-DPPD-7110-A_build-genome-annots-tab.R | 39 ++++++++++++------- 1 file changed, 26 insertions(+), 13 deletions(-) diff --git a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/GL-DPPD-7110-A_build-genome-annots-tab.R b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/GL-DPPD-7110-A_build-genome-annots-tab.R index f390d4b0..38e6f08c 100644 --- a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/GL-DPPD-7110-A_build-genome-annots-tab.R +++ b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/GL-DPPD-7110-A_build-genome-annots-tab.R @@ -145,19 +145,41 @@ GTF <- data.frame(GTF) # Define a function to load the specified org.db package for a given target organism install_and_load_org_db <- function(target_organism, target_org_db, ref_tab_path) { + # Folder names for the script location: Parent directories or . for executing from parent dir or cd. 
+ ## No functionality to pull in the path of an executing R script is available + possible_folders <- c("workflow_code", "GL_RefAnnotTable-A_1.1.0", ".") + + # Get the current working directory and attempt to locate the correct folder + script_dir <- getwd() + + install_script_path <- NULL + + for (folder in possible_folders) { + potential_path <- file.path(script_dir, folder, "install-org-db.R") + if (file.exists(potential_path)) { + install_script_path <- potential_path + break + } + } + + # If the install script path was not found, stop with an error + if (is.null(install_script_path)) { + stop("Cannot find 'install-org-db.R' in the expected folders: 'workflow_code' or 'GL_RefAnnotTable-A_1.1.0'") + } + + # If target_org_db is provided, try to install it from Bioconductor if (!is.na(target_org_db) && target_org_db != "") { - # Attempt to install the package from Bioconductor BiocManager::install(target_org_db, ask = FALSE) # Check if the package was successfully loaded if (!requireNamespace(target_org_db, quietly = TRUE)) { - # If not, attempt to create it locally using a helper script - source("install-org-db.R") + # Source the install script to create the database locally + source(install_script_path) target_org_db <- install_annotations(target_organism, ref_tab_path) } } else { # If target_org_db is NA or empty, create it locally using the helper script - source("install-org-db.R") + source(install_script_path) target_org_db <- install_annotations(target_organism, ref_tab_path) } @@ -165,15 +187,6 @@ install_and_load_org_db <- function(target_organism, target_org_db, ref_tab_path library(target_org_db, character.only = TRUE) } -# Define list of supported organisms which do not use annotations from an org.db -no_org_db <- c("Lactobacillus acidophilus", "Mycobacterium marinum", "Oryza sativa", "Pseudomonas aeruginosa", - "Serratia liquefaciens", "Staphylococcus aureus", "Streptococcus mutans", "Vibrio fischeri") - -# Run the function unless the target_organism 
is in no_org_db -if (!(target_organism %in% no_org_db) && (target_organism %in% currently_accepted_orgs)) { - install_and_load_org_db(target_organism, target_org_db, ref_tab_path) -} - ############################################ ######## Build annotation table ############ From 8619121e019a0e12d8894c0847f75151aeaf3711 Mon Sep 17 00:00:00 2001 From: Alexis <71944751+torres-alexis@users.noreply.github.com> Date: Wed, 23 Oct 2024 14:30:43 -0700 Subject: [PATCH 46/58] Update GL-DPPD-7110-A_build-genome-annots-tab.R Store workflow version workflow_code folder name as a variable --- .../workflow_code/GL-DPPD-7110-A_build-genome-annots-tab.R | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/GL-DPPD-7110-A_build-genome-annots-tab.R b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/GL-DPPD-7110-A_build-genome-annots-tab.R index 38e6f08c..5c62dab1 100644 --- a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/GL-DPPD-7110-A_build-genome-annots-tab.R +++ b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/GL-DPPD-7110-A_build-genome-annots-tab.R @@ -6,6 +6,7 @@ # Define variables associated with current pipeline and annotation table versions GL_DPPD_ID <- "GL-DPPD-7110-A" +workflow_version <- "GL_RefAnnotTable-A_1.1.0" ref_tab_path <- "https://raw.githubusercontent.com/nasa/GeneLab_Data_Processing/master/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv" readme_path <- "https://github.com/nasa/GeneLab_Data_Processing/tree/master/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md" @@ -147,7 +148,7 @@ GTF <- data.frame(GTF) install_and_load_org_db <- function(target_organism, target_org_db, ref_tab_path) { # Folder names for the script location: Parent directories or . 
for executing from parent dir or cd. ## No functionality to pull in the path of an executing R script is available - possible_folders <- c("workflow_code", "GL_RefAnnotTable-A_1.1.0", ".") + possible_folders <- c("workflow_code", workflow_version, ".") # Get the current working directory and attempt to locate the correct folder script_dir <- getwd() From 8bbf66d43a274adf33132e3497289d91a3d140e6 Mon Sep 17 00:00:00 2001 From: Alexis <71944751+torres-alexis@users.noreply.github.com> Date: Wed, 23 Oct 2024 14:37:08 -0700 Subject: [PATCH 47/58] Update GL-DPPD-7110-A.md Add workflow version variable (for finding install-org-db path) to pipeline documentation --- .../GL-DPPD-7110-A/GL-DPPD-7110-A.md | 3 +++ 1 file changed, 3 insertions(+) diff --git a/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md b/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md index bf6ffa64..561fe596 100644 --- a/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md +++ b/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md @@ -188,6 +188,8 @@ lib_path <- file.path(getwd()) # Define variables associated with current pipeline and annotation table versions GL_DPPD_ID <- "GL-DPPD-7110-A" +workflow_version <- "GL_RefAnnotTable-A_1.1.0" + ref_tab_path <- "https://raw.githubusercontent.com/nasa/GeneLab_Data_Processing/master/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv" readme_path <- "https://github.com/nasa/GeneLab_Data_Processing/tree/master/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md" @@ -213,6 +215,7 @@ library(rtracklayer) **Output Data:** - `GL_DPPD_ID` (variable specifying the GeneLab Data Processing Pipeline Document ID) +- `workflow_version` (variable specifying the current version of the workflow) - `ref_tab_path` (variable 
specifying the path to the reference table CSV file) - `readme_path` (variable specifying the path to the README file) - `currently_accepted_orgs` (variable specifying the list of currently supported organisms) From 4bec1931ace9529019f86859f843e3c9a3150a4f Mon Sep 17 00:00:00 2001 From: Alexis <71944751+torres-alexis@users.noreply.github.com> Date: Wed, 23 Oct 2024 20:13:52 -0700 Subject: [PATCH 48/58] Update GL-DPPD-7110-A_build-genome-annots-tab.R readd line --- .../GL-DPPD-7110-A_build-genome-annots-tab.R | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/GL-DPPD-7110-A_build-genome-annots-tab.R b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/GL-DPPD-7110-A_build-genome-annots-tab.R index 5c62dab1..afbccce8 100644 --- a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/GL-DPPD-7110-A_build-genome-annots-tab.R +++ b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/GL-DPPD-7110-A_build-genome-annots-tab.R @@ -188,6 +188,14 @@ install_and_load_org_db <- function(target_organism, target_org_db, ref_tab_path library(target_org_db, character.only = TRUE) } +# Define list of supported organisms which do not use annotations from an org.db +no_org_db <- c("Lactobacillus acidophilus", "Mycobacterium marinum", "Oryza sativa", "Pseudomonas aeruginosa", + "Serratia liquefaciens", "Staphylococcus aureus", "Streptococcus mutans", "Vibrio fischeri") + +# Run the function unless the target_organism is in no_org_db +if (!(target_organism %in% no_org_db) && (target_organism %in% currently_accepted_orgs)) { + install_and_load_org_db(target_organism, target_org_db, ref_tab_path) +} ############################################ ######## Build annotation table ############ From 6da57fa4f3fe8121cb6f4ba45e4b5f0a5ae2f084 Mon Sep 17 00:00:00 2001 From: asaravia-butler 
<70983120+asaravia-butler@users.noreply.github.com> Date: Thu, 24 Oct 2024 11:36:04 -0700 Subject: [PATCH 49/58] Updating signature matrix --- .../GL-DPPD-7110-A/GL-DPPD-7110-A.md | 11 +++++------ 1 file changed, 5 insertions(+), 6 deletions(-) diff --git a/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md b/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md index bf6ffa64..670b49e2 100644 --- a/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md +++ b/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md @@ -4,7 +4,7 @@ --- -**Date:** August 12, 2024 +**Date:** October XX, 2024 **Revision:** -A **Document Number:** GL-DPPD-7110-A @@ -12,11 +12,10 @@ Alexis Torres and Crystal Han (GeneLab Data Processing Team) **Approved by:** -Sylvain Costes (OSDR Project Manager) -Samrawit Gebre (GeneLab Deputy Project Manager and Acting GeneLab Configuration Manager) -Lauren Sanders (OSDR Project Scientist) -Amanda Saravia-Butler (GeneLab Science Lead) -Barbara Novak (GeneLab Data Processing Lead) +Samrawit Gebre (OSDR Project Manager) +Lauren Sanders (OSDR Project Scientist) +Amanda Saravia-Butler (GeneLab Science Lead) +Barbara Novak (GeneLab Data Processing Lead) --- From e3dfb4b10e0dbdd094170a73969ec4eec87302f8 Mon Sep 17 00:00:00 2001 From: asaravia-butler <70983120+asaravia-butler@users.noreply.github.com> Date: Thu, 24 Oct 2024 11:36:24 -0700 Subject: [PATCH 50/58] Formatting updates --- .../GL-DPPD-7110-A/GL-DPPD-7110-A.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md b/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md index 670b49e2..48829be4 100644 --- 
a/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md +++ b/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md @@ -12,9 +12,9 @@ Alexis Torres and Crystal Han (GeneLab Data Processing Team) **Approved by:** -Samrawit Gebre (OSDR Project Manager) -Lauren Sanders (OSDR Project Scientist) -Amanda Saravia-Butler (GeneLab Science Lead) +Samrawit Gebre (OSDR Project Manager) +Lauren Sanders (OSDR Project Scientist) +Amanda Saravia-Butler (GeneLab Science Lead) Barbara Novak (GeneLab Data Processing Lead) --- From d1ea649bde96b2e317adf3bc4ddbdc88a480227b Mon Sep 17 00:00:00 2001 From: torres-alexis Date: Mon, 28 Oct 2024 23:44:04 -0700 Subject: [PATCH 51/58] remove custom org dbs from annotation table --- .../GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv b/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv index 5ce006d9..12a2c8b8 100644 --- a/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv +++ b/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv @@ -1,23 +1,23 @@ name,species,strain,ensemblVersion,ref_source,fasta,gtf,taxon,annotations,genelab_annots_link,genelab_annots_info_link ARABIDOPSIS,Arabidopsis thaliana,,59,ensembl_plants,https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-59/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz,https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-59/gtf/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.59.gtf.gz,3702,org.At.tair.db,https://figshare.com/ndownloader/files/48354355,https://figshare.com/ndownloader/files/48354352 -BACSU,Bacillus subtilis,subsp. 
subtilis 168,59,ensembl_bacteria,https://ftp.ensemblgenomes.ebi.ac.uk/pub/bacteria/release-59/fasta/bacteria_0_collection/bacillus_subtilis_subsp_subtilis_str_168_gca_000009045/dna/Bacillus_subtilis_subsp_subtilis_str_168_gca_000009045.ASM904v1.dna.toplevel.fa.gz,https://ftp.ensemblgenomes.ebi.ac.uk/pub/bacteria/release-59/gtf/bacteria_0_collection/bacillus_subtilis_subsp_subtilis_str_168_gca_000009045/Bacillus_subtilis_subsp_subtilis_str_168_gca_000009045.ASM904v1.59.gtf.gz,224308,org.Bsubtilissubspsubtilis168.eg.db,https://figshare.com/ndownloader/files/48354346,https://figshare.com/ndownloader/files/48354349 -BRADI,Brachypodium distachyon,,59,ensembl_plants,https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-59/fasta/brachypodium_distachyon/dna/Brachypodium_distachyon.Brachypodium_distachyon_v3.0.dna.toplevel.fa.gz,https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-59/gtf/brachypodium_distachyon/Brachypodium_distachyon.Brachypodium_distachyon_v3.0.59.gtf.gz,15368,org.Bdistachyon.eg.db,https://figshare.com/ndownloader/files/48354370,https://figshare.com/ndownloader/files/48354361 +BACSU,Bacillus subtilis,subsp. 
subtilis 168,59,ensembl_bacteria,https://ftp.ensemblgenomes.ebi.ac.uk/pub/bacteria/release-59/fasta/bacteria_0_collection/bacillus_subtilis_subsp_subtilis_str_168_gca_000009045/dna/Bacillus_subtilis_subsp_subtilis_str_168_gca_000009045.ASM904v1.dna.toplevel.fa.gz,https://ftp.ensemblgenomes.ebi.ac.uk/pub/bacteria/release-59/gtf/bacteria_0_collection/bacillus_subtilis_subsp_subtilis_str_168_gca_000009045/Bacillus_subtilis_subsp_subtilis_str_168_gca_000009045.ASM904v1.59.gtf.gz,224308,,https://figshare.com/ndownloader/files/48354346,https://figshare.com/ndownloader/files/48354349 +BRADI,Brachypodium distachyon,,59,ensembl_plants,https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-59/fasta/brachypodium_distachyon/dna/Brachypodium_distachyon.Brachypodium_distachyon_v3.0.dna.toplevel.fa.gz,https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-59/gtf/brachypodium_distachyon/Brachypodium_distachyon.Brachypodium_distachyon_v3.0.59.gtf.gz,15368,,https://figshare.com/ndownloader/files/48354370,https://figshare.com/ndownloader/files/48354361 BRARP,Brassica rapa,,59,ensembl_plants,http://ftp.ensemblgenomes.org/pub/plants/release-59/fasta/brassica_rapa/dna/Brassica_rapa.Brapa_1.0.dna.toplevel.fa.gz,http://ftp.ensemblgenomes.org/pub/plants/release-59/gtf/brassica_rapa/Brassica_rapa.Brapa_1.0.59.gtf.gz,,,, WORM,Caenorhabditis elegans,,112,ensembl,https://ftp.ensembl.org/pub/release-112/fasta/caenorhabditis_elegans/dna/Caenorhabditis_elegans.WBcel235.dna.toplevel.fa.gz,https://ftp.ensembl.org/pub/release-112/gtf/caenorhabditis_elegans/Caenorhabditis_elegans.WBcel235.112.gtf.gz,6239,org.Ce.eg.db,https://figshare.com/ndownloader/files/48354373,https://figshare.com/ndownloader/files/48354364 ZEBRAFISH,Danio 
rerio,,112,ensembl,http://ftp.ensembl.org/pub/release-112/fasta/danio_rerio/dna/Danio_rerio.GRCz11.dna.primary_assembly.fa.gz,http://ftp.ensembl.org/pub/release-112/gtf/danio_rerio/Danio_rerio.GRCz11.112.gtf.gz,7955,org.Dr.eg.db,https://figshare.com/ndownloader/files/48354388,https://figshare.com/ndownloader/files/48354367 FLY,Drosophila melanogaster,,112,ensembl,http://ftp.ensembl.org/pub/release-112/fasta/drosophila_melanogaster/dna/Drosophila_melanogaster.BDGP6.46.dna.toplevel.fa.gz,http://ftp.ensembl.org/pub/release-112/gtf/drosophila_melanogaster/Drosophila_melanogaster.BDGP6.46.112.gtf.gz,7227,org.Dm.eg.db,https://figshare.com/ndownloader/files/48354382,https://figshare.com/ndownloader/files/48354376 ERCC,,,,ThermoFisher,https://assets.thermofisher.com/TFS-Assets/LSG/manuals/ERCC92.zip,https://assets.thermofisher.com/TFS-Assets/LSG/manuals/ERCC92.zip,,,, -ECOLI,Escherichia coli,str. K-12 substr. MG1655,59,ensembl_bacteria,https://ftp.ensemblgenomes.ebi.ac.uk/pub/bacteria/release-59/fasta/bacteria_0_collection/escherichia_coli_str_k_12_substr_mg1655_gca_000005845/dna/Escherichia_coli_str_k_12_substr_mg1655_gca_000005845.ASM584v2.dna.toplevel.fa.gz,https://ftp.ensemblgenomes.ebi.ac.uk/pub/bacteria/release-59/gtf/bacteria_0_collection/escherichia_coli_str_k_12_substr_mg1655_gca_000005845/Escherichia_coli_str_k_12_substr_mg1655_gca_000005845.ASM584v2.59.gtf.gz,511145,org.EcolistrK12substrMG1655.eg.db,https://figshare.com/ndownloader/files/48354379,https://figshare.com/ndownloader/files/48354394 +ECOLI,Escherichia coli,str. K-12 substr. 
MG1655,59,ensembl_bacteria,https://ftp.ensemblgenomes.ebi.ac.uk/pub/bacteria/release-59/fasta/bacteria_0_collection/escherichia_coli_str_k_12_substr_mg1655_gca_000005845/dna/Escherichia_coli_str_k_12_substr_mg1655_gca_000005845.ASM584v2.dna.toplevel.fa.gz,https://ftp.ensemblgenomes.ebi.ac.uk/pub/bacteria/release-59/gtf/bacteria_0_collection/escherichia_coli_str_k_12_substr_mg1655_gca_000005845/Escherichia_coli_str_k_12_substr_mg1655_gca_000005845.ASM584v2.59.gtf.gz,511145,,https://figshare.com/ndownloader/files/48354379,https://figshare.com/ndownloader/files/48354394 HUMAN,Homo sapiens,,112,ensembl,https://ftp.ensembl.org/pub/release-112/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz,https://ftp.ensembl.org/pub/release-112/gtf/homo_sapiens/Homo_sapiens.GRCh38.112.gtf.gz,9606,org.Hs.eg.db,https://figshare.com/ndownloader/files/48354445,https://figshare.com/ndownloader/files/48354448 ,Lactobacillus acidophilus,NCFM,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/011/985/GCF_000011985.1_ASM1198v1/GCF_000011985.1_ASM1198v1_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/011/985/GCF_000011985.1_ASM1198v1/GCF_000011985.1_ASM1198v1_genomic.gtf.gz,272621,,https://figshare.com/ndownloader/files/49061254,https://figshare.com/ndownloader/files/49061257 MOUSE,Mus musculus,,112,ensembl,https://ftp.ensembl.org/pub/release-112/fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna.primary_assembly.fa.gz,https://ftp.ensembl.org/pub/release-112/gtf/mus_musculus/Mus_musculus.GRCm39.112.gtf.gz,10090,org.Mm.eg.db,https://figshare.com/ndownloader/files/48354460,https://figshare.com/ndownloader/files/48354457 ,Mycobacterium 
marinum,M,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/018/345/GCF_000018345.1_ASM1834v1/GCF_000018345.1_ASM1834v1_genomic.gtf.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/018/345/GCF_000018345.1_ASM1834v1/GCF_000018345.1_ASM1834v1_genomic.gtf.gz,216594,,https://figshare.com/ndownloader/files/49061260,https://figshare.com/ndownloader/files/49061263 ORYSJ,Oryza sativa,Japonica,59,ensembl_plants,https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-59/fasta/oryza_sativa/dna/Oryza_sativa.IRGSP-1.0.dna.toplevel.fa.gz,https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-59/gtf/oryza_sativa/Oryza_sativa.IRGSP-1.0.59.gtf.gz,39947,,https://figshare.com/ndownloader/files/48354451,https://figshare.com/ndownloader/files/48354454 -ORYLA,Oryzias latipes,,112,ensembl,http://ftp.ensembl.org/pub/release-112/fasta/oryzias_latipes/dna/Oryzias_latipes.ASM223467v1.dna.toplevel.fa.gz,http://ftp.ensembl.org/pub/release-112/gtf/oryzias_latipes/Oryzias_latipes.ASM223467v1.112.gtf.gz,8090,org.Olatipes.eg.db,https://figshare.com/ndownloader/files/48354463,https://figshare.com/ndownloader/files/48354466 +ORYLA,Oryzias latipes,,112,ensembl,http://ftp.ensembl.org/pub/release-112/fasta/oryzias_latipes/dna/Oryzias_latipes.ASM223467v1.dna.toplevel.fa.gz,http://ftp.ensembl.org/pub/release-112/gtf/oryzias_latipes/Oryzias_latipes.ASM223467v1.112.gtf.gz,8090,,https://figshare.com/ndownloader/files/48354463,https://figshare.com/ndownloader/files/48354466 ,Pseudomonas aeruginosa,UCBPP-PA14,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/014/625/GCF_000014625.1_ASM1462v1/GCF_000014625.1_ASM1462v1_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/014/625/GCF_000014625.1_ASM1462v1/GCF_000014625.1_ASM1462v1_genomic.gtf.gz,208963,,https://figshare.com/ndownloader/files/49061266,https://figshare.com/ndownloader/files/49061269 RAT,Rattus 
norvegicus,,112,ensembl,http://ftp.ensembl.org/pub/release-112/fasta/rattus_norvegicus/dna/Rattus_norvegicus.mRatBN7.2.dna.toplevel.fa.gz,http://ftp.ensembl.org/pub/release-112/gtf/rattus_norvegicus/Rattus_norvegicus.mRatBN7.2.112.gtf.gz,10116,org.Rn.eg.db,https://figshare.com/ndownloader/files/48354472,https://figshare.com/ndownloader/files/48354475 YEAST,Saccharomyces cerevisiae,S288C,112,ensembl,http://ftp.ensembl.org/pub/release-112/fasta/saccharomyces_cerevisiae/dna/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa.gz,http://ftp.ensembl.org/pub/release-112/gtf/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.112.gtf.gz,559292,org.Sc.sgd.db,https://figshare.com/ndownloader/files/48354469,https://figshare.com/ndownloader/files/48354478 -SALTY,Salmonella enterica,serovar Typhimurium str. LT2,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/006/945/GCF_000006945.2_ASM694v2/GCF_000006945.2_ASM694v2_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/006/945/GCF_000006945.2_ASM694v2/GCF_000006945.2_ASM694v2_genomic.gtf.gz,99287,org.SentericaserovarTyphimuriumstrLT2.eg.db,https://figshare.com/ndownloader/files/49061272,https://figshare.com/ndownloader/files/49061275 +SALTY,Salmonella enterica,serovar Typhimurium str. 
LT2,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/006/945/GCF_000006945.2_ASM694v2/GCF_000006945.2_ASM694v2_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/006/945/GCF_000006945.2_ASM694v2/GCF_000006945.2_ASM694v2_genomic.gtf.gz,99287,,https://figshare.com/ndownloader/files/49061272,https://figshare.com/ndownloader/files/49061275 ,Serratia liquefaciens,ATCC 27592,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/422/085/GCF_000422085.1_ASM42208v1/GCF_000422085.1_ASM42208v1_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/422/085/GCF_000422085.1_ASM42208v1/GCF_000422085.1_ASM42208v1_genomic.gtf.gz,1346614,,https://figshare.com/ndownloader/files/49061278,https://figshare.com/ndownloader/files/49061281 ,Staphylococcus aureus,MRSA252,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/011/505/GCF_000011505.1_ASM1150v1/GCF_000011505.1_ASM1150v1_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/011/505/GCF_000011505.1_ASM1150v1/GCF_000011505.1_ASM1150v1_genomic.gtf.gz,282458,,https://figshare.com/ndownloader/files/49061284,https://figshare.com/ndownloader/files/49061287 ,Streptococcus mutans,UA159,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/007/465/GCF_000007465.2_ASM746v2/GCF_000007465.2_ASM746v2_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/007/465/GCF_000007465.2_ASM746v2/GCF_000007465.2_ASM746v2_genomic.gtf.gz,210007,,https://figshare.com/ndownloader/files/49061290,https://figshare.com/ndownloader/files/49061293 From f06cf149b9048c2eb6935c8398802ef3ef49e52f Mon Sep 17 00:00:00 2001 From: torres-alexis Date: Wed, 30 Oct 2024 13:38:21 -0700 Subject: [PATCH 52/58] move timeout to top of scripts, add to readme --- .../GL_RefAnnotTable-A/README.md | 4 ++++ .../GL-DPPD-7110-A_build-genome-annots-tab.R | 15 ++++++--------- .../workflow_code/install-org-db.R | 2 +- 3 files changed, 11 insertions(+), 10 deletions(-) diff --git 
a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md index e3990af5..2b84e200 100644 --- a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md +++ b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md @@ -54,6 +54,8 @@ The GL_RefAnnotTable-A workflow can be run using one of two approaches: Please follow the instructions for the approach that best matches your setup and preferences. Each method is explained in detail below. +> **Note**: If you encounter timeout errors, you can increase the default timeout (3600 seconds) by modifying the `options(timeout=3600)` line at the top of the `GL-DPPD-7110-A_build-genome-annots-tab.R` script. +
--- @@ -198,6 +200,8 @@ The input and output data are the same for both [Approach 1: Using Singularity]( If the reference table does not specify an annotations database for the target organism in the 'annotations' column of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) file, the `install_annotations` function (defined in `install-org-db.R`) will be executed by default. This function can also be run as a stand-alone script: +> **Note**: If you encounter timeout errors, you can increase the default timeout (3600 seconds) by modifying the `options(timeout=3600)` line at the top of the `install-org-db.R` script. +
#### Using Singularity diff --git a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/GL-DPPD-7110-A_build-genome-annots-tab.R b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/GL-DPPD-7110-A_build-genome-annots-tab.R index afbccce8..1254920a 100644 --- a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/GL-DPPD-7110-A_build-genome-annots-tab.R +++ b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/GL-DPPD-7110-A_build-genome-annots-tab.R @@ -2,7 +2,7 @@ # Written by Mike Lee # GeneLab script for generating organism-specific gene annotation tables # Example usage: Rscript GL-DPPD-7110-A_build-genome-annots-tab.R 'Mus musculus' - +options(timeout = 3600) # Define variables associated with current pipeline and annotation table versions GL_DPPD_ID <- "GL-DPPD-7110-A" @@ -80,9 +80,6 @@ library(rtracklayer) ############## Define variables and output file names ################### ######################################################################### -# Set timeout time to ensure annotation file downloads will complete -options(timeout = 600) - ref_table <- tryCatch( read.csv(ref_tab_path), error = function(e) { @@ -133,9 +130,6 @@ if ( file.exists(out_table_filename) ) { ######## Load annotation databases ######### ############################################# -# Set timeout time to ensure annotation file downloads will complete -options(timeout = 600) - ####### GTF ########## # Create the GTF dataframe from its path, unique gene identities in the reference assembly are under 'gene_id' @@ -186,15 +180,18 @@ install_and_load_org_db <- function(target_organism, target_org_db, ref_tab_path # Load the package into the R session library(target_org_db, character.only = TRUE) + + # Return the target_org_db name + return(target_org_db) } # Define list of supported organisms which do not use annotations from an org.db no_org_db <- 
c("Lactobacillus acidophilus", "Mycobacterium marinum", "Oryza sativa", "Pseudomonas aeruginosa", "Serratia liquefaciens", "Staphylococcus aureus", "Streptococcus mutans", "Vibrio fischeri") -# Run the function unless the target_organism is in no_org_db +# Run the function unless the target_organism is in no_org_db and update target_org_db with the result if (!(target_organism %in% no_org_db) && (target_organism %in% currently_accepted_orgs)) { - install_and_load_org_db(target_organism, target_org_db, ref_tab_path) + target_org_db <- install_and_load_org_db(target_organism, target_org_db, ref_tab_path) } ############################################ diff --git a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/install-org-db.R b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/install-org-db.R index c1ad5613..fb8fe1a2 100644 --- a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/install-org-db.R +++ b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/install-org-db.R @@ -1,5 +1,5 @@ # install-org-db.R - +options(timeout=3600) # Load required libraries library(tidyverse) library(AnnotationForge) From 5088539f028bfe75a466474b488ef86e56e83e2c Mon Sep 17 00:00:00 2001 From: torres-alexis Date: Wed, 30 Oct 2024 13:41:59 -0700 Subject: [PATCH 53/58] add no-home + bind local path to same container path --- .../Workflow_Documentation/GL_RefAnnotTable-A/README.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md index 2b84e200..fdbe9f11 100644 --- a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md +++ b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md @@ -106,9 +106,9 @@ While in 
the directory containing the `GL_RefAnnotTable-A_1.1.0` folder that was ```bash -singularity exec -B $(pwd)/GL_RefAnnotTable-A_1.1.0:/work \ +singularity exec --no-home -B $(pwd)/GL_RefAnnotTable-A_1.1.0:$(pwd)/GL_RefAnnotTable-A_1.1.0 \ $SINGULARITY_CACHEDIR/quay.io-nasa_genelab-gl-refannottable-a-1.1.0.img \ -Rscript /work/GL-DPPD-7110-A_build-genome-annots-tab.R 'Mus musculus' +Rscript GL_RefAnnotTable-A_1.1.0/GL-DPPD-7110-A_build-genome-annots-tab.R 'Mus musculus' ```
@@ -207,9 +207,9 @@ If the reference table does not specify an annotations database for the target o #### Using Singularity ```bash -singularity exec -B $(pwd)/GL_RefAnnotTable-A_1.1.0:/work \ +singularity exec --no-home -B $(pwd)/GL_RefAnnotTable-A_1.1.0:$(pwd)/GL_RefAnnotTable-A_1.1.0 \ $SINGULARITY_CACHEDIR/quay.io-nasa_genelab-gl-refannottable-a-1.1.0.img \ -Rscript /work/install-org-db.R 'Bacillus subtilis' +Rscript GL_RefAnnotTable-A_1.1.0/install-org-db.R 'Bacillus subtilis' ```
From 63683814c083e4ac7fab50e5735f6165a6b2ddbf Mon Sep 17 00:00:00 2001 From: torres-alexis Date: Wed, 30 Oct 2024 19:08:22 -0700 Subject: [PATCH 54/58] add cols bioconductor_annotations, custom_annotations, change dppd var workflow_version --- .../GL-DPPD-7110-A/GL-DPPD-7110-A.md | 8 ++-- .../GL-DPPD-7110-A_annotations.csv | 48 +++++++++---------- .../GL-DPPD-7110-A_build-genome-annots-tab.R | 2 +- .../workflow_code/install-org-db.R | 2 +- 4 files changed, 30 insertions(+), 30 deletions(-) diff --git a/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md b/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md index 561fe596..0fba4029 100644 --- a/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md +++ b/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md @@ -188,7 +188,7 @@ lib_path <- file.path(getwd()) # Define variables associated with current pipeline and annotation table versions GL_DPPD_ID <- "GL-DPPD-7110-A" -workflow_version <- "GL_RefAnnotTable-A_1.1.0" +workflow_version <- "" ref_tab_path <- "https://raw.githubusercontent.com/nasa/GeneLab_Data_Processing/master/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv" readme_path <- "https://github.com/nasa/GeneLab_Data_Processing/tree/master/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md" @@ -215,7 +215,7 @@ library(rtracklayer) **Output Data:** - `GL_DPPD_ID` (variable specifying the GeneLab Data Processing Pipeline Document ID) -- `workflow_version (variable specifying the current version of the workflow) +- `workflow_version` (variable specifying the [current version of the 
workflow](https://github.com/nasa/GeneLab_Data_Processing/tree/DEV_GeneLab_Reference_Annotations_vGL-DPPD-7110-A/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A)) - `ref_tab_path` (variable specifying the path to the reference table CSV file) - `readme_path` (variable specifying the path to the README file) - `currently_accepted_orgs` (variable specifying the list of currently supported organisms) @@ -244,7 +244,7 @@ target_info <- ref_table %>% # Extract the relevant columns from the reference table target_taxid <- target_info$taxon # Taxonomic identifier -target_org_db <- target_info$annotations # org.eg.db R package +target_org_db <- target_info$bioconductor_annotations # org.eg.db R package gtf_link <- target_info$gtf # Path to reference assembly GTF target_short_name <- target_info$name # PANTHER / UNIPROT short name; blank if not available ref_source <- target_info$ref_source # Reference files source @@ -284,7 +284,7 @@ if ( file.exists(out_table_filename) ) { **Output Data:** - `target_taxid` (variable specifying the taxonomic identifier for the target organism) -- `target_org_db` (variable specifying the name of the org.db R package for the target organism) +- `target_org_db` (variable specifying the name of the org.eg.db R package for the target organism if it is hosted by Bioconductor) - `gtf_link` (variable specifying the URL to the GTF file for the target organism) - `target_short_name` (variable specifying the PANTHER/UNIPROT short name for the target organism) - `ref_source` (variable specifying the source of the reference files, e.g., "ensembl", "ensembl_plants", "ensembl_bacteria", "ncbi") diff --git a/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv b/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv index 12a2c8b8..c2a881e2 100644 --- 
a/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv +++ b/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv @@ -1,24 +1,24 @@ -name,species,strain,ensemblVersion,ref_source,fasta,gtf,taxon,annotations,genelab_annots_link,genelab_annots_info_link -ARABIDOPSIS,Arabidopsis thaliana,,59,ensembl_plants,https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-59/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz,https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-59/gtf/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.59.gtf.gz,3702,org.At.tair.db,https://figshare.com/ndownloader/files/48354355,https://figshare.com/ndownloader/files/48354352 -BACSU,Bacillus subtilis,subsp. subtilis 168,59,ensembl_bacteria,https://ftp.ensemblgenomes.ebi.ac.uk/pub/bacteria/release-59/fasta/bacteria_0_collection/bacillus_subtilis_subsp_subtilis_str_168_gca_000009045/dna/Bacillus_subtilis_subsp_subtilis_str_168_gca_000009045.ASM904v1.dna.toplevel.fa.gz,https://ftp.ensemblgenomes.ebi.ac.uk/pub/bacteria/release-59/gtf/bacteria_0_collection/bacillus_subtilis_subsp_subtilis_str_168_gca_000009045/Bacillus_subtilis_subsp_subtilis_str_168_gca_000009045.ASM904v1.59.gtf.gz,224308,,https://figshare.com/ndownloader/files/48354346,https://figshare.com/ndownloader/files/48354349 -BRADI,Brachypodium distachyon,,59,ensembl_plants,https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-59/fasta/brachypodium_distachyon/dna/Brachypodium_distachyon.Brachypodium_distachyon_v3.0.dna.toplevel.fa.gz,https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-59/gtf/brachypodium_distachyon/Brachypodium_distachyon.Brachypodium_distachyon_v3.0.59.gtf.gz,15368,,https://figshare.com/ndownloader/files/48354370,https://figshare.com/ndownloader/files/48354361 -BRARP,Brassica 
rapa,,59,ensembl_plants,http://ftp.ensemblgenomes.org/pub/plants/release-59/fasta/brassica_rapa/dna/Brassica_rapa.Brapa_1.0.dna.toplevel.fa.gz,http://ftp.ensemblgenomes.org/pub/plants/release-59/gtf/brassica_rapa/Brassica_rapa.Brapa_1.0.59.gtf.gz,,,, -WORM,Caenorhabditis elegans,,112,ensembl,https://ftp.ensembl.org/pub/release-112/fasta/caenorhabditis_elegans/dna/Caenorhabditis_elegans.WBcel235.dna.toplevel.fa.gz,https://ftp.ensembl.org/pub/release-112/gtf/caenorhabditis_elegans/Caenorhabditis_elegans.WBcel235.112.gtf.gz,6239,org.Ce.eg.db,https://figshare.com/ndownloader/files/48354373,https://figshare.com/ndownloader/files/48354364 -ZEBRAFISH,Danio rerio,,112,ensembl,http://ftp.ensembl.org/pub/release-112/fasta/danio_rerio/dna/Danio_rerio.GRCz11.dna.primary_assembly.fa.gz,http://ftp.ensembl.org/pub/release-112/gtf/danio_rerio/Danio_rerio.GRCz11.112.gtf.gz,7955,org.Dr.eg.db,https://figshare.com/ndownloader/files/48354388,https://figshare.com/ndownloader/files/48354367 -FLY,Drosophila melanogaster,,112,ensembl,http://ftp.ensembl.org/pub/release-112/fasta/drosophila_melanogaster/dna/Drosophila_melanogaster.BDGP6.46.dna.toplevel.fa.gz,http://ftp.ensembl.org/pub/release-112/gtf/drosophila_melanogaster/Drosophila_melanogaster.BDGP6.46.112.gtf.gz,7227,org.Dm.eg.db,https://figshare.com/ndownloader/files/48354382,https://figshare.com/ndownloader/files/48354376 -ERCC,,,,ThermoFisher,https://assets.thermofisher.com/TFS-Assets/LSG/manuals/ERCC92.zip,https://assets.thermofisher.com/TFS-Assets/LSG/manuals/ERCC92.zip,,,, -ECOLI,Escherichia coli,str. K-12 substr. 
MG1655,59,ensembl_bacteria,https://ftp.ensemblgenomes.ebi.ac.uk/pub/bacteria/release-59/fasta/bacteria_0_collection/escherichia_coli_str_k_12_substr_mg1655_gca_000005845/dna/Escherichia_coli_str_k_12_substr_mg1655_gca_000005845.ASM584v2.dna.toplevel.fa.gz,https://ftp.ensemblgenomes.ebi.ac.uk/pub/bacteria/release-59/gtf/bacteria_0_collection/escherichia_coli_str_k_12_substr_mg1655_gca_000005845/Escherichia_coli_str_k_12_substr_mg1655_gca_000005845.ASM584v2.59.gtf.gz,511145,,https://figshare.com/ndownloader/files/48354379,https://figshare.com/ndownloader/files/48354394 -HUMAN,Homo sapiens,,112,ensembl,https://ftp.ensembl.org/pub/release-112/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz,https://ftp.ensembl.org/pub/release-112/gtf/homo_sapiens/Homo_sapiens.GRCh38.112.gtf.gz,9606,org.Hs.eg.db,https://figshare.com/ndownloader/files/48354445,https://figshare.com/ndownloader/files/48354448 -,Lactobacillus acidophilus,NCFM,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/011/985/GCF_000011985.1_ASM1198v1/GCF_000011985.1_ASM1198v1_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/011/985/GCF_000011985.1_ASM1198v1/GCF_000011985.1_ASM1198v1_genomic.gtf.gz,272621,,https://figshare.com/ndownloader/files/49061254,https://figshare.com/ndownloader/files/49061257 -MOUSE,Mus musculus,,112,ensembl,https://ftp.ensembl.org/pub/release-112/fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna.primary_assembly.fa.gz,https://ftp.ensembl.org/pub/release-112/gtf/mus_musculus/Mus_musculus.GRCm39.112.gtf.gz,10090,org.Mm.eg.db,https://figshare.com/ndownloader/files/48354460,https://figshare.com/ndownloader/files/48354457 -,Mycobacterium 
marinum,M,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/018/345/GCF_000018345.1_ASM1834v1/GCF_000018345.1_ASM1834v1_genomic.gtf.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/018/345/GCF_000018345.1_ASM1834v1/GCF_000018345.1_ASM1834v1_genomic.gtf.gz,216594,,https://figshare.com/ndownloader/files/49061260,https://figshare.com/ndownloader/files/49061263 -ORYSJ,Oryza sativa,Japonica,59,ensembl_plants,https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-59/fasta/oryza_sativa/dna/Oryza_sativa.IRGSP-1.0.dna.toplevel.fa.gz,https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-59/gtf/oryza_sativa/Oryza_sativa.IRGSP-1.0.59.gtf.gz,39947,,https://figshare.com/ndownloader/files/48354451,https://figshare.com/ndownloader/files/48354454 -ORYLA,Oryzias latipes,,112,ensembl,http://ftp.ensembl.org/pub/release-112/fasta/oryzias_latipes/dna/Oryzias_latipes.ASM223467v1.dna.toplevel.fa.gz,http://ftp.ensembl.org/pub/release-112/gtf/oryzias_latipes/Oryzias_latipes.ASM223467v1.112.gtf.gz,8090,,https://figshare.com/ndownloader/files/48354463,https://figshare.com/ndownloader/files/48354466 -,Pseudomonas aeruginosa,UCBPP-PA14,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/014/625/GCF_000014625.1_ASM1462v1/GCF_000014625.1_ASM1462v1_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/014/625/GCF_000014625.1_ASM1462v1/GCF_000014625.1_ASM1462v1_genomic.gtf.gz,208963,,https://figshare.com/ndownloader/files/49061266,https://figshare.com/ndownloader/files/49061269 -RAT,Rattus norvegicus,,112,ensembl,http://ftp.ensembl.org/pub/release-112/fasta/rattus_norvegicus/dna/Rattus_norvegicus.mRatBN7.2.dna.toplevel.fa.gz,http://ftp.ensembl.org/pub/release-112/gtf/rattus_norvegicus/Rattus_norvegicus.mRatBN7.2.112.gtf.gz,10116,org.Rn.eg.db,https://figshare.com/ndownloader/files/48354472,https://figshare.com/ndownloader/files/48354475 -YEAST,Saccharomyces 
cerevisiae,S288C,112,ensembl,http://ftp.ensembl.org/pub/release-112/fasta/saccharomyces_cerevisiae/dna/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa.gz,http://ftp.ensembl.org/pub/release-112/gtf/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.112.gtf.gz,559292,org.Sc.sgd.db,https://figshare.com/ndownloader/files/48354469,https://figshare.com/ndownloader/files/48354478 -SALTY,Salmonella enterica,serovar Typhimurium str. LT2,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/006/945/GCF_000006945.2_ASM694v2/GCF_000006945.2_ASM694v2_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/006/945/GCF_000006945.2_ASM694v2/GCF_000006945.2_ASM694v2_genomic.gtf.gz,99287,,https://figshare.com/ndownloader/files/49061272,https://figshare.com/ndownloader/files/49061275 -,Serratia liquefaciens,ATCC 27592,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/422/085/GCF_000422085.1_ASM42208v1/GCF_000422085.1_ASM42208v1_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/422/085/GCF_000422085.1_ASM42208v1/GCF_000422085.1_ASM42208v1_genomic.gtf.gz,1346614,,https://figshare.com/ndownloader/files/49061278,https://figshare.com/ndownloader/files/49061281 -,Staphylococcus aureus,MRSA252,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/011/505/GCF_000011505.1_ASM1150v1/GCF_000011505.1_ASM1150v1_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/011/505/GCF_000011505.1_ASM1150v1/GCF_000011505.1_ASM1150v1_genomic.gtf.gz,282458,,https://figshare.com/ndownloader/files/49061284,https://figshare.com/ndownloader/files/49061287 -,Streptococcus mutans,UA159,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/007/465/GCF_000007465.2_ASM746v2/GCF_000007465.2_ASM746v2_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/007/465/GCF_000007465.2_ASM746v2/GCF_000007465.2_ASM746v2_genomic.gtf.gz,210007,,https://figshare.com/ndownloader/files/49061290,https://figshare.com/ndownloader/files/49061293 -,Vibrio 
fischeri,ES114,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/011/805/GCF_000011805.1_ASM1180v1/GCF_000011805.1_ASM1180v1_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/011/805/GCF_000011805.1_ASM1180v1/GCF_000011805.1_ASM1180v1_genomic.gtf.gz,312309,,https://figshare.com/ndownloader/files/49061296,https://figshare.com/ndownloader/files/49061299 \ No newline at end of file +name,species,strain,ensemblVersion,ref_source,fasta,gtf,taxon,bioconductor_annotations,custom_annotations,genelab_annots_link,genelab_annots_info_link +ARABIDOPSIS,Arabidopsis thaliana,,59,ensembl_plants,https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-59/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz,https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-59/gtf/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.59.gtf.gz,3702,org.At.tair.db,,https://figshare.com/ndownloader/files/48354355,https://figshare.com/ndownloader/files/48354352 +BACSU,Bacillus subtilis,subsp. 
subtilis 168,59,ensembl_bacteria,https://ftp.ensemblgenomes.ebi.ac.uk/pub/bacteria/release-59/fasta/bacteria_0_collection/bacillus_subtilis_subsp_subtilis_str_168_gca_000009045/dna/Bacillus_subtilis_subsp_subtilis_str_168_gca_000009045.ASM904v1.dna.toplevel.fa.gz,https://ftp.ensemblgenomes.ebi.ac.uk/pub/bacteria/release-59/gtf/bacteria_0_collection/bacillus_subtilis_subsp_subtilis_str_168_gca_000009045/Bacillus_subtilis_subsp_subtilis_str_168_gca_000009045.ASM904v1.59.gtf.gz,224308,,org.Bsubtilissubspsubtilis168.eg.db,https://figshare.com/ndownloader/files/48354346,https://figshare.com/ndownloader/files/48354349 +BRADI,Brachypodium distachyon,,59,ensembl_plants,https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-59/fasta/brachypodium_distachyon/dna/Brachypodium_distachyon.Brachypodium_distachyon_v3.0.dna.toplevel.fa.gz,https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-59/gtf/brachypodium_distachyon/Brachypodium_distachyon.Brachypodium_distachyon_v3.0.59.gtf.gz,15368,,org.Bdistachyon.eg.db,https://figshare.com/ndownloader/files/48354370,https://figshare.com/ndownloader/files/48354361 +BRARP,Brassica rapa,,59,ensembl_plants,http://ftp.ensemblgenomes.org/pub/plants/release-59/fasta/brassica_rapa/dna/Brassica_rapa.Brapa_1.0.dna.toplevel.fa.gz,http://ftp.ensemblgenomes.org/pub/plants/release-59/gtf/brassica_rapa/Brassica_rapa.Brapa_1.0.59.gtf.gz,,,,, +WORM,Caenorhabditis elegans,,112,ensembl,https://ftp.ensembl.org/pub/release-112/fasta/caenorhabditis_elegans/dna/Caenorhabditis_elegans.WBcel235.dna.toplevel.fa.gz,https://ftp.ensembl.org/pub/release-112/gtf/caenorhabditis_elegans/Caenorhabditis_elegans.WBcel235.112.gtf.gz,6239,org.Ce.eg.db,,https://figshare.com/ndownloader/files/48354373,https://figshare.com/ndownloader/files/48354364 +ZEBRAFISH,Danio 
rerio,,112,ensembl,http://ftp.ensembl.org/pub/release-112/fasta/danio_rerio/dna/Danio_rerio.GRCz11.dna.primary_assembly.fa.gz,http://ftp.ensembl.org/pub/release-112/gtf/danio_rerio/Danio_rerio.GRCz11.112.gtf.gz,7955,org.Dr.eg.db,,https://figshare.com/ndownloader/files/48354388,https://figshare.com/ndownloader/files/48354367 +FLY,Drosophila melanogaster,,112,ensembl,http://ftp.ensembl.org/pub/release-112/fasta/drosophila_melanogaster/dna/Drosophila_melanogaster.BDGP6.46.dna.toplevel.fa.gz,http://ftp.ensembl.org/pub/release-112/gtf/drosophila_melanogaster/Drosophila_melanogaster.BDGP6.46.112.gtf.gz,7227,org.Dm.eg.db,,https://figshare.com/ndownloader/files/48354382,https://figshare.com/ndownloader/files/48354376 +ERCC,,,,ThermoFisher,https://assets.thermofisher.com/TFS-Assets/LSG/manuals/ERCC92.zip,https://assets.thermofisher.com/TFS-Assets/LSG/manuals/ERCC92.zip,,,,, +ECOLI,Escherichia coli,str. K-12 substr. MG1655,59,ensembl_bacteria,https://ftp.ensemblgenomes.ebi.ac.uk/pub/bacteria/release-59/fasta/bacteria_0_collection/escherichia_coli_str_k_12_substr_mg1655_gca_000005845/dna/Escherichia_coli_str_k_12_substr_mg1655_gca_000005845.ASM584v2.dna.toplevel.fa.gz,https://ftp.ensemblgenomes.ebi.ac.uk/pub/bacteria/release-59/gtf/bacteria_0_collection/escherichia_coli_str_k_12_substr_mg1655_gca_000005845/Escherichia_coli_str_k_12_substr_mg1655_gca_000005845.ASM584v2.59.gtf.gz,511145,,org.EcolistrK12substrMG1655.eg.db,https://figshare.com/ndownloader/files/48354379,https://figshare.com/ndownloader/files/48354394 +HUMAN,Homo sapiens,,112,ensembl,https://ftp.ensembl.org/pub/release-112/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz,https://ftp.ensembl.org/pub/release-112/gtf/homo_sapiens/Homo_sapiens.GRCh38.112.gtf.gz,9606,org.Hs.eg.db,,https://figshare.com/ndownloader/files/48354445,https://figshare.com/ndownloader/files/48354448 +,Lactobacillus 
acidophilus,NCFM,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/011/985/GCF_000011985.1_ASM1198v1/GCF_000011985.1_ASM1198v1_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/011/985/GCF_000011985.1_ASM1198v1/GCF_000011985.1_ASM1198v1_genomic.gtf.gz,272621,,,https://figshare.com/ndownloader/files/49061254,https://figshare.com/ndownloader/files/49061257 +MOUSE,Mus musculus,,112,ensembl,https://ftp.ensembl.org/pub/release-112/fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna.primary_assembly.fa.gz,https://ftp.ensembl.org/pub/release-112/gtf/mus_musculus/Mus_musculus.GRCm39.112.gtf.gz,10090,org.Mm.eg.db,,https://figshare.com/ndownloader/files/48354460,https://figshare.com/ndownloader/files/48354457 +,Mycobacterium marinum,M,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/018/345/GCF_000018345.1_ASM1834v1/GCF_000018345.1_ASM1834v1_genomic.gtf.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/018/345/GCF_000018345.1_ASM1834v1/GCF_000018345.1_ASM1834v1_genomic.gtf.gz,216594,,,https://figshare.com/ndownloader/files/49061260,https://figshare.com/ndownloader/files/49061263 +ORYSJ,Oryza sativa,Japonica,59,ensembl_plants,https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-59/fasta/oryza_sativa/dna/Oryza_sativa.IRGSP-1.0.dna.toplevel.fa.gz,https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-59/gtf/oryza_sativa/Oryza_sativa.IRGSP-1.0.59.gtf.gz,39947,,,https://figshare.com/ndownloader/files/48354451,https://figshare.com/ndownloader/files/48354454 +ORYLA,Oryzias latipes,,112,ensembl,http://ftp.ensembl.org/pub/release-112/fasta/oryzias_latipes/dna/Oryzias_latipes.ASM223467v1.dna.toplevel.fa.gz,http://ftp.ensembl.org/pub/release-112/gtf/oryzias_latipes/Oryzias_latipes.ASM223467v1.112.gtf.gz,8090,,org.Olatipes.eg.db,https://figshare.com/ndownloader/files/48354463,https://figshare.com/ndownloader/files/48354466 +,Pseudomonas 
aeruginosa,UCBPP-PA14,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/014/625/GCF_000014625.1_ASM1462v1/GCF_000014625.1_ASM1462v1_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/014/625/GCF_000014625.1_ASM1462v1/GCF_000014625.1_ASM1462v1_genomic.gtf.gz,208963,,,https://figshare.com/ndownloader/files/49061266,https://figshare.com/ndownloader/files/49061269 +RAT,Rattus norvegicus,,112,ensembl,http://ftp.ensembl.org/pub/release-112/fasta/rattus_norvegicus/dna/Rattus_norvegicus.mRatBN7.2.dna.toplevel.fa.gz,http://ftp.ensembl.org/pub/release-112/gtf/rattus_norvegicus/Rattus_norvegicus.mRatBN7.2.112.gtf.gz,10116,org.Rn.eg.db,,https://figshare.com/ndownloader/files/48354472,https://figshare.com/ndownloader/files/48354475 +YEAST,Saccharomyces cerevisiae,S288C,112,ensembl,http://ftp.ensembl.org/pub/release-112/fasta/saccharomyces_cerevisiae/dna/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa.gz,http://ftp.ensembl.org/pub/release-112/gtf/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.112.gtf.gz,559292,org.Sc.sgd.db,,https://figshare.com/ndownloader/files/48354469,https://figshare.com/ndownloader/files/48354478 +SALTY,Salmonella enterica,serovar Typhimurium str. 
LT2,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/006/945/GCF_000006945.2_ASM694v2/GCF_000006945.2_ASM694v2_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/006/945/GCF_000006945.2_ASM694v2/GCF_000006945.2_ASM694v2_genomic.gtf.gz,99287,,org.SentericaserovarTyphimuriumstrLT2.eg.db,https://figshare.com/ndownloader/files/49061272,https://figshare.com/ndownloader/files/49061275 +,Serratia liquefaciens,ATCC 27592,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/422/085/GCF_000422085.1_ASM42208v1/GCF_000422085.1_ASM42208v1_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/422/085/GCF_000422085.1_ASM42208v1/GCF_000422085.1_ASM42208v1_genomic.gtf.gz,1346614,,,https://figshare.com/ndownloader/files/49061278,https://figshare.com/ndownloader/files/49061281 +,Staphylococcus aureus,MRSA252,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/011/505/GCF_000011505.1_ASM1150v1/GCF_000011505.1_ASM1150v1_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/011/505/GCF_000011505.1_ASM1150v1/GCF_000011505.1_ASM1150v1_genomic.gtf.gz,282458,,,https://figshare.com/ndownloader/files/49061284,https://figshare.com/ndownloader/files/49061287 +,Streptococcus mutans,UA159,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/007/465/GCF_000007465.2_ASM746v2/GCF_000007465.2_ASM746v2_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/007/465/GCF_000007465.2_ASM746v2/GCF_000007465.2_ASM746v2_genomic.gtf.gz,210007,,,https://figshare.com/ndownloader/files/49061290,https://figshare.com/ndownloader/files/49061293 +,Vibrio fischeri,ES114,,ncbi,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/011/805/GCF_000011805.1_ASM1180v1/GCF_000011805.1_ASM1180v1_genomic.fna.gz,https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/011/805/GCF_000011805.1_ASM1180v1/GCF_000011805.1_ASM1180v1_genomic.gtf.gz,312309,,,https://figshare.com/ndownloader/files/49061296,https://figshare.com/ndownloader/files/49061299 \ No newline at end of file diff --git 
a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/GL-DPPD-7110-A_build-genome-annots-tab.R b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/GL-DPPD-7110-A_build-genome-annots-tab.R index 1254920a..058aea71 100644 --- a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/GL-DPPD-7110-A_build-genome-annots-tab.R +++ b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/GL-DPPD-7110-A_build-genome-annots-tab.R @@ -94,7 +94,7 @@ target_info <- ref_table %>% # Extract the relevant columns from the reference table target_taxid <- target_info$taxon # Taxonomic identifier -target_org_db <- target_info$annotations # org.eg.db R package +target_org_db <- target_info$bioconductor_annotations # org.eg.db R package gtf_link <- target_info$gtf # Path to reference assembly GTF target_short_name <- target_info$name # PANTHER / UNIPROT short name; blank if not available ref_source <- target_info$ref_source # Reference files source diff --git a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/install-org-db.R b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/install-org-db.R index fb8fe1a2..72421811 100644 --- a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/install-org-db.R +++ b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/install-org-db.R @@ -52,7 +52,7 @@ install_annotations <- function(target_organism, refTablePath = NULL) { # Get package name or build it if not provided target_org_db <- ref_table %>% filter(species == target_organism) %>% - pull(annotations) + pull(bioconductor_annotations) if (is.na(target_org_db) || target_org_db == "") { cat("\nNo annotation database specified. 
Constructing package name...\n") From b39c63c147cb5ef30f7d255d2a4f65c8a4cdd8d4 Mon Sep 17 00:00:00 2001 From: torres-alexis Date: Wed, 30 Oct 2024 21:05:53 -0700 Subject: [PATCH 55/58] remove --no-home from readme --- .../Workflow_Documentation/GL_RefAnnotTable-A/README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md index fdbe9f11..92ff4dbd 100644 --- a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md +++ b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md @@ -106,7 +106,7 @@ While in the directory containing the `GL_RefAnnotTable-A_1.1.0` folder that was ```bash -singularity exec --no-home -B $(pwd)/GL_RefAnnotTable-A_1.1.0:$(pwd)/GL_RefAnnotTable-A_1.1.0 \ +singularity exec -B $(pwd)/GL_RefAnnotTable-A_1.1.0:$(pwd)/GL_RefAnnotTable-A_1.1.0 \ $SINGULARITY_CACHEDIR/quay.io-nasa_genelab-gl-refannottable-a-1.1.0.img \ Rscript GL_RefAnnotTable-A_1.1.0/GL-DPPD-7110-A_build-genome-annots-tab.R 'Mus musculus' ``` @@ -207,7 +207,7 @@ If the reference table does not specify an annotations database for the target o #### Using Singularity ```bash -singularity exec --no-home -B $(pwd)/GL_RefAnnotTable-A_1.1.0:$(pwd)/GL_RefAnnotTable-A_1.1.0 \ +singularity exec -B $(pwd)/GL_RefAnnotTable-A_1.1.0:$(pwd)/GL_RefAnnotTable-A_1.1.0 \ $SINGULARITY_CACHEDIR/quay.io-nasa_genelab-gl-refannottable-a-1.1.0.img \ Rscript GL_RefAnnotTable-A_1.1.0/install-org-db.R 'Bacillus subtilis' ``` From dcdf589d00ed4e417c88a2392684d69590cb34c0 Mon Sep 17 00:00:00 2001 From: Alexis Torres Date: Thu, 31 Oct 2024 14:32:42 -0700 Subject: [PATCH 56/58] Add r_libs to scripts, readme, standardize notes --- .../GL_RefAnnotTable-A/README.md | 40 ++++++++++++------- .../GL-DPPD-7110-A_build-genome-annots-tab.R | 2 +- .../workflow_code/install-org-db.R | 1 + 3
files changed, 27 insertions(+), 16 deletions(-) diff --git a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md index 92ff4dbd..aefe1a3b 100644 --- a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md +++ b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md @@ -70,13 +70,13 @@ This approach allows you to run the workflow within a containerized environment, Singularity is a containerization platform for running applications portably and reproducibly. We use container images hosted on Quay.io to encapsulate all the necessary software and dependencies required by the GL_RefAnnotTable-A workflow. This setup allows you to run the workflow without installing any software directly on your system. -> ***Note**: Other containerization tools like Docker or Apptainer can also be used to pull and run these images.* +> **Note**: Other containerization tools like Docker or Apptainer can also be used to pull and run these images. We recommend installing Singularity system-wide as per the official [Singularity installation documentation](https://docs.sylabs.io/guides/3.10/admin-guide/admin_quickstart.html). -> ***Note**: While Singularity is also available through [Anaconda](https://anaconda.org/conda-forge/singularity), we recommend installing Singularity system-wide following the official installation documentation.* +> **Note**: While Singularity is also available through [Anaconda](https://anaconda.org/conda-forge/singularity), we recommend installing Singularity system-wide following the official installation documentation.
@@ -84,17 +84,19 @@ We recommend installing Singularity system-wide as per the official [Singularity To pull the Singularity image needed for the workflow, you can use the provided script as directed below or pull the image directly. -> ***Note**: This command should be run in the location containing the `GL_RefAnnotTable-A_1.1.0` directory that was downloaded in [step 1](#1-download-the-workflow-files). Depending on your network speed, fetching the images will take approximately 20 minutes.* - +> **Note**: This command should be run in the location containing the `GL_RefAnnotTable-A_1.1.0` directory that was downloaded in [step 1](#1-download-the-workflow-files). Depending on your network speed, fetching the images will take approximately 20 minutes. ```bash bash GL_RefAnnotTable-A_1.1.0/bin/prepull_singularity.sh GL_RefAnnotTable-A_1.1.0/config/software/by_docker_image.config ``` -Once complete, a `singularity` folder containing the Singularity images will be created. Run the following command to export this folder as an environment variable: - +Once complete, a `singularity` folder containing the Singularity images will be created. Next, set up the required environment variables: ```bash +# Set R library path to current working directory +export R_LIBS_USER=$(pwd)/R_libs + +# Set Singularity cache directory export SINGULARITY_CACHEDIR=$(pwd)/singularity ``` @@ -102,13 +104,15 @@ export SINGULARITY_CACHEDIR=$(pwd)/singularity #### Step 3: Run the Workflow -While in the directory containing the `GL_RefAnnotTable-A_1.1.0` folder that was downloaded in [step 1](#1-download-the-workflow-files), you can now run the workflow. Below is an example for generating the annotation table for *Mus musculus* (mouse): - +> **Note**: The annotation database creation process requires FTP access through port 21. 
If you encounter connection issues, please verify that port 21 is not blocked by your network/firewall settings or try running the workflow on a system with unrestricted FTP access. + +While in the directory containing the `GL_RefAnnotTable-A_1.1.0` folder that was downloaded in [step 1](#1-download-the-workflow-files), you can now run the workflow. Below is an example for generating the annotation table for *Mus musculus* (mouse): ```bash -singularity exec -B $(pwd)/GL_RefAnnotTable-A_1.1.0:$(pwd)/GL_RefAnnotTable-A_1.1.0 \ -$SINGULARITY_CACHEDIR/quay.io-nasa_genelab-gl-refannottable-a-1.1.0.img \ -Rscript GL_RefAnnotTable-A_1.1.0/GL-DPPD-7110-A_build-genome-annots-tab.R 'Mus musculus' +singularity exec \ + --bind $(pwd):$(pwd) \ + $SINGULARITY_CACHEDIR/quay.io-nasa_genelab-gl-refannottable-a-1.1.0.img \ + Rscript GL_RefAnnotTable-A_1.1.0/GL-DPPD-7110-A_build-genome-annots-tab.R 'Mus musculus' ```
@@ -206,12 +210,18 @@ If the reference table does not specify an annotations database for the target o #### Using Singularity +> **Note**: The annotation database creation process requires FTP access through port 21. If you encounter connection issues, please verify that port 21 is not blocked by your network/firewall settings. + ```bash -singularity exec -B $(pwd)/GL_RefAnnotTable-A_1.1.0:$(pwd)/GL_RefAnnotTable-A_1.1.0 \ -$SINGULARITY_CACHEDIR/quay.io-nasa_genelab-gl-refannottable-a-1.1.0.img \ -Rscript GL_RefAnnotTable-A_1.1.0/install-org-db.R 'Bacillus subtilis' +# Set R library path if not already set +export R_LIBS_USER=$(pwd)/R_libs + +singularity exec \ + --bind $(pwd):$(pwd) \ + $SINGULARITY_CACHEDIR/quay.io-nasa_genelab-gl-refannottable-a-1.1.0.img \ + Rscript GL_RefAnnotTable-A_1.1.0/install-org-db.R 'Bacillus subtilis' ``` - +
#### Using a Local R Environment diff --git a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/GL-DPPD-7110-A_build-genome-annots-tab.R b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/GL-DPPD-7110-A_build-genome-annots-tab.R index 058aea71..46923217 100644 --- a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/GL-DPPD-7110-A_build-genome-annots-tab.R +++ b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/GL-DPPD-7110-A_build-genome-annots-tab.R @@ -3,7 +3,7 @@ # GeneLab script for generating organism-specific gene annotation tables # Example usage: Rscript GL-DPPD-7110-A_build-genome-annots-tab.R 'Mus musculus' options(timeout = 3600) - +.libPaths(Sys.getenv("R_LIBS_USER")) # Define variables associated with current pipeline and annotation table versions GL_DPPD_ID <- "GL-DPPD-7110-A" workflow_version <- "GL_RefAnnotTable-A_1.1.0" diff --git a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/install-org-db.R b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/install-org-db.R index 72421811..00f03548 100644 --- a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/install-org-db.R +++ b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/install-org-db.R @@ -1,5 +1,6 @@ # install-org-db.R options(timeout=3600) +.libPaths(Sys.getenv("R_LIBS_USER")) # Load required libraries library(tidyverse) library(AnnotationForge) From 283c0ab489f5a2cb492c5deea5cf62d0d8537af5 Mon Sep 17 00:00:00 2001 From: Barbara Novak <19824106+bnovak32@users.noreply.github.com> Date: Thu, 7 Nov 2024 17:30:18 -0800 Subject: [PATCH 57/58] Update README.md Added command to create the R_libs folder if it doesn't already exist. 
--- .../Workflow_Documentation/GL_RefAnnotTable-A/README.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md index aefe1a3b..9bfea792 100644 --- a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md +++ b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md @@ -95,6 +95,8 @@ Once complete, a `singularity` folder containing the Singularity images will be ```bash # Set R library path to current working directory export R_LIBS_USER=$(pwd)/R_libs +# Create the specified R library path if it doesn't already exist +mkdir -p $R_LIBS # Set Singularity cache directory export SINGULARITY_CACHEDIR=$(pwd)/singularity From c4b100e05b0a2ee0943d6da594088a0e1ffb5fc3 Mon Sep 17 00:00:00 2001 From: Barbara Novak <19824106+bnovak32@users.noreply.github.com> Date: Thu, 7 Nov 2024 17:35:34 -0800 Subject: [PATCH 58/58] Update README.md Added R_LIBS_USER folder creation and SINGULARITY_CACHEDIR setting --- .../Workflow_Documentation/GL_RefAnnotTable-A/README.md | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md index 9bfea792..44e5de46 100644 --- a/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md +++ b/GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md @@ -96,7 +96,7 @@ Once complete, a `singularity` folder containing the Singularity images will be # Set R library path to current working directory export R_LIBS_USER=$(pwd)/R_libs # Create the specified R library path if it doesn't already exist -mkdir -p $R_LIBS +mkdir -p $R_LIBS_USER # Set Singularity cache directory export SINGULARITY_CACHEDIR=$(pwd)/singularity @@
-217,6 +217,11 @@ If the reference table does not specify an annotations database for the target o ```bash # Set R library path if not already set export R_LIBS_USER=$(pwd)/R_libs +# Create the specified R library path if it doesn't already exist +mkdir -p $R_LIBS_USER + +# Set Singularity cache directory if not already set +export SINGULARITY_CACHEDIR=$(pwd)/singularity singularity exec \ --bind $(pwd):$(pwd) \