Commit 7c011a2

[GL_RefAnnotTable] Misc fixes

- Add software updates to CHANGELOG
- Add input and output variables to DPPD document
- Prepend species name to output files for non-ENSEMBL reference organisms to make sure it is in the file names
- Fix unclear variable names and wording in some functions in the script
- Move GO column to the end of the annotation tables when applicable
1 parent bbf7a78 commit 7c011a2

4 files changed: +180 -26 lines changed

GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md

Lines changed: 135 additions & 8 deletions
@@ -221,6 +221,7 @@ target_org_db <- target_info$annotations # org.eg.db R package
 target_species_designation <- target_info$species # Full species name
 gtf_link <- target_info$gtf # Path to reference assembly GTF
 target_short_name <- target_info$name # PANTHER / UNIPROT short name; blank if not available
+ref_source <- target_info$ref_source # Reference files source
 
 # Error handling for missing values
 if (is.na(target_taxid) || is.na(target_org_db) || is.na(target_species_designation) || is.na(gtf_link)) {
@@ -231,6 +232,11 @@ if (is.na(target_taxid) || is.na(target_org_db) || is.na(target_species_designat
 base_gtf_filename <- basename(gtf_link)
 base_output_name <- str_replace(base_gtf_filename, ".gtf.gz", "")
 
+# Add the species name to base_output_name if the reference source is not ENSEMBL
+if (!(ref_source %in% c("ensembl_plants", "ensembl_bacteria", "ensembl"))) {
+  base_output_name <- paste(str_replace(target_species_designation, " ", "_"), base_output_name, sep = "_")
+}
+
 out_table_filename <- paste0(base_output_name, "-GL-annotations.tsv")
 out_log_filename <- paste0(base_output_name, "-GL-build-info.txt")
 
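Review note: the added block prefixes the species name to the output file names when the reference source is not one of the Ensembl sources. One detail that may be worth checking is that stringr's `str_replace()` replaces only the first match, so a species designation containing more than one space would keep its later spaces. The sketch below reproduces the naming logic in base R under stated assumptions: the helper name `build_base_output_name`, the example URLs, and the species strings are all hypothetical, and the suffix pattern is anchored and escaped here, which differs from the committed `".gtf.gz"` pattern.

```r
# Hypothetical helper mirroring the file-name logic in the hunk above (base R only)
build_base_output_name <- function(gtf_link, species, ref_source) {
  # Strip the literal ".gtf.gz" suffix; dots are escaped so they match literally
  base_output_name <- sub("\\.gtf\\.gz$", "", basename(gtf_link))

  # Prefix the species name only for non-Ensembl reference sources
  ensembl_sources <- c("ensembl_plants", "ensembl_bacteria", "ensembl")
  if (!(ref_source %in% ensembl_sources)) {
    # gsub() replaces every space; sub() and stringr::str_replace() replace only the first
    species_tag <- gsub(" ", "_", species, fixed = TRUE)
    base_output_name <- paste(species_tag, base_output_name, sep = "_")
  }
  base_output_name
}

# Illustrative inputs only; these URLs and names are made up
ensembl_name <- build_base_output_name(
  "https://example.org/Genus_species.ASM1.gtf.gz", "Genus species", "ensembl"
)
ncbi_name <- build_base_output_name(
  "https://example.org/GCF_000000000.1_ASM1_genomic.gtf.gz",
  "Genus species subsp. example", "ncbi"
)
print(c(ensembl_name, ncbi_name))
```

With two-word species names the committed code and this sketch should produce the same prefix; they diverge only when the designation has additional spaces.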
@@ -243,6 +249,21 @@ if ( file.exists(out_table_filename) ) {
   quit()
 }
 ```
+**Input Data:**
+
+- ref_tab_path (path to the reference table CSV file containing organism-specific information)
+- target_organism (name of the target organism for which annotations are being generated)
+
+**Output Data:**
+
+- target_taxid (taxonomic identifier for the target organism)
+- target_org_db (name of the org.db R package for the target organism)
+- target_species_designation (full species name of the target organism)
+- gtf_link (URL to the GTF file for the target organism)
+- target_short_name (PANTHER/UNIPROT short name for the target organism)
+- ref_source (source of the reference files, e.g., "ensembl", "ensembl_plants", "ensembl_bacteria", "ncbi")
+- out_table_filename (name of the output annotation table file)
+- out_log_filename (name of the output log file)
 
 <br>
 
@@ -293,6 +314,21 @@ if (!requireNamespace(target_org_db, quietly = TRUE)) {
 }
 ```
 
+**Input Data:**
+
+- target_org_db (name of the org.db R package for the target organism, output from [step 1](#1-define-variables-and-output-file-names))
+- target_species_designation (full species name of the target organism, output from [step 1](#1-define-variables-and-output-file-names))
+- ref_table (reference table containing organism-specific information, output from [step 1](#1-define-variables-and-output-file-names))
+- target_organism (name of the target organism, output from [step 1](#1-define-variables-and-output-file-names))
+- target_taxid (taxonomic identifier for the target organism, output from [step 1](#1-define-variables-and-output-file-names))
+
+**Output Data:**
+
+- target_org_db (updated name of the org.db R package, if it was created locally)
+- Locally installed org.db package (if the package is not available on Bioconductor, a new package is created and installed)
+
+<br>
+
 ---
 
 ## 3. Load Annotation Databases
@@ -322,6 +358,20 @@ if (!(target_organism %in% no_org_db) && (target_organism %in% currently_accepte
 }
 ```
 
+**Input Data:**
+
+- gtf_link (URL to the GTF file for the target organism, output from [step 1](#1-define-variables-and-output-file-names))
+- target_org_db (name of the org.eg.db R package for the target organism, output from [steps 1](#1-define-variables-and-output-file-names) and [2](#2-create-the-organism-package-if-it-is-not-hosted-by-bioconductor))
+- target_organism (name of the target organism, output from [step 1](#1-define-variables-and-output-file-names))
+- currently_accepted_orgs (list of currently supported organisms, defined at the beginning of the script)
+- ref_tab_path ([path to the reference table CSV](GL-DPPD-7110-A_annotations.csv))
+
+**Output Data:**
+
+- GTF (data frame containing the GTF file for the target organism)
+- no_org_db (list of organisms that do not use org.db annotations due to inconsistent gene names across GTF and org.db)
+- Loaded org.db package (the organism-specific annotation package is loaded into the R session, if applicable)
+
 <br>
 
 ---
@@ -394,6 +444,17 @@ if (target_organism == "Salmonella enterica") {
 }
 ```
 
+**Input Data:**
+
+- GTF (data frame containing the parsed GTF file for the target organism, output from [step 3](#3-load-annotation-databases))
+- target_organism (target organism's full species name, output from [step 1](#1-define-variables-and-output-file-names))
+- gtf_keytype_mappings (list of keys to extract from the GTF, for each organism)
+
+**Output Data:**
+
+- annot_gtf (initial annotation table derived from the GTF file, containing only the relevant columns for the target organism)
+- primary_keytype (the name of the primary key type being used, e.g., "ENSEMBL", "TAIR", "LOCUS", based on the GTF gene_id entries)
+
 <br>
 
 ---
@@ -448,12 +509,12 @@ orgdb_keytype <- if (!is.null(orgdb_keytype_mappings[[target_organism]])) {
   orgdb_keytype_mappings[["default"]][["keytype"]]
 }
 
-# Function to clean and match ACCNUM keys for BRADI
-clean_and_match_accnum <- function(annot_table, org_db, query_col, keytype_col, target_column) {
-  # Clean the ACCNUM keys in the GTF annotations
+# Function to remove version numbers from ACCNUM keys and match them for BRADI
+match_accnum <- function(annot_table, org_db, query_col, keytype_col, target_column) {
+  # Remove version numbers from the ACCNUM keys in the GTF annotations
   cleaned_annot_keys <- sub("\\..*", "", annot_table[[query_col]])
 
-  # Retrieve and clean the org.db keys
+  # Retrieve and remove version numbers from the org.db keys
   orgdb_keys <- keys(org_db, keytype = keytype_col)
   cleaned_orgdb_keys <- sub("\\..*", "", orgdb_keys)
 
@@ -472,8 +533,8 @@ for (keytype in wanted_org_db_keytypes) {
   # Check if keytype is a valid column in the target org.db
   if (keytype %in% columns(get(target_org_db, envir = .GlobalEnv))) {
     if (target_organism == "Brachypodium distachyon" && orgdb_query == "ACCNUM") {
-      # For BRADI: use the clean_and_match_accnum function to map to org.db ACCNUM entries
-      org_matches <- clean_and_match_accnum(annot_orgdb, get(target_org_db, envir = .GlobalEnv), query_col = orgdb_query, keytype_col = orgdb_keytype, target_column = keytype)
+      # For BRADI: use the match_accnum function to map to org.db ACCNUM entries
+      org_matches <- match_accnum(annot_orgdb, get(target_org_db, envir = .GlobalEnv), query_col = orgdb_query, keytype_col = orgdb_keytype, target_column = keytype)
     } else {
       # Default mapping for other organisms
       org_matches <- mapIds(get(target_org_db, envir = .GlobalEnv), keys = annot_orgdb[[orgdb_query]], keytype = orgdb_keytype, column = keytype, multiVals = "list")
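Review note: the renamed `match_accnum` function strips version suffixes from accession keys on both sides before matching. The diff shows only the `sub("\\..*", "", ...)` cleaning calls, not the matching step itself, so the sketch below is an assumption about how such a lookup could work rather than the function's actual body. The accession values and the helper name `strip_version` are hypothetical, and `match()` is swapped in as a plain base R way to align the two key sets.

```r
# Hypothetical accession keys; the real values come from the GTF and the org.db
gtf_keys   <- c("XM_000001.2", "XM_000002.1", "XM_000009.3")
orgdb_keys <- c("XM_000002.4", "XM_000001.1", "XM_000005.1")

# Same pattern as the script: drop everything from the first dot onward
strip_version <- function(x) sub("\\..*", "", x)

# For each GTF key, find the position of the first org.db key that shares
# the same version-less accession (NA when there is no counterpart)
idx <- match(strip_version(gtf_keys), strip_version(orgdb_keys))

# Recover the original, versioned org.db keys for downstream lookups
matched_orgdb_keys <- orgdb_keys[idx]
print(data.frame(gtf_keys, matched_orgdb_keys))
```

Because the pattern removes everything after the first dot, it assumes the accession itself contains no dots.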
@@ -497,6 +558,20 @@ if (target_organism == "Saccharomyces cerevisiae") {
 }
 ```
 
+**Input Data:**
+
+- annot_gtf (initial annotation table derived from the GTF file, output from [step 4](#4-build-initial-annotation-table))
+- target_organism (target organism's full species name, output from [step 1](#1-define-variables-and-output-file-names))
+- no_org_db (list of organisms that do not use annotations from an org.db, output from [step 3](#3-load-annotation-databases))
+- primary_keytype (the name of the primary key type being used, output from [step 4](#4-build-initial-annotation-table))
+- target_org_db (name of the org.eg.db R package for the target organism, output from [steps 1](#1-define-variables-and-output-file-names) and [2](#2-create-the-organism-package-if-it-is-not-hosted-by-bioconductor))
+
+**Output Data:**
+
+- annot_orgdb (updated annotation table with additional keys from the organism-specific org.db)
+- orgdb_query (the key type used to map to the org.db)
+- orgdb_keytype (the name of the key type in the org.db)
+
 <br>
 
 ---
@@ -609,6 +684,20 @@ if (target_organism == "Bacillus subtilis") {
 annot_stringdb <- as.data.frame(annot_stringdb)
 ```
 
+**Input Data:**
+
+- annot_orgdb (annotation table with GTF and org.db annotations, output from [step 5](#5-add-orgdb-keys))
+- target_organism (target organism's full species name, output from [step 1](#1-define-variables-and-output-file-names))
+- primary_keytype (the name of the primary key type being used, output from [step 4](#4-build-initial-annotation-table))
+- target_taxid (taxonomic identifier for the target organism, output from [step 1](#1-define-variables-and-output-file-names))
+
+**Output Data:**
+
+- annot_stringdb (updated annotation table with added STRING IDs)
+- no_stringdb (list of organisms that do not use STRING annotations)
+- stringdb_query (the key type used for mapping to STRING database)
+- uses_old_locus (list of organisms where GTF gene_id entries do not match those in STRING, so entries in OLD_LOCUS are used to query STRING)
+
 <br>
 
 ---
@@ -658,6 +747,20 @@ if (!(target_organism %in% no_panther_db)) {
 }
 ```
 
+**Input Data:**
+
+- annot_orgdb (annotation table with GTF and org.db annotations, output from [step 5](#5-add-orgdb-keys))
+- target_organism (target organism's full species name, output from [step 1](#1-define-variables-and-output-file-names))
+- primary_keytype (the name of the primary key type being used, output from [step 4](#4-build-initial-annotation-table))
+- target_taxid (taxonomic identifier for the target organism, output from [step 1](#1-define-variables-and-output-file-names))
+
+**Output Data:**
+
+- annot_stringdb (updated annotation table with added STRING IDs)
+- no_stringdb (list of organisms that do not use STRING annotations)
+- stringdb_query (the key type used for mapping to STRING database)
+- uses_old_locus (list of organisms where the 'gene_id' column in the GTF dataframe does not match STRING identifiers, so the 'old_locus_tag' column from the GTF dataframe is used to query STRING instead)
+
 <br>
 
 ---
@@ -670,6 +773,13 @@ annot <- annot_pantherdb %>%
   group_by(!!sym(primary_keytype)) %>%
   summarise(across(everything(), ~paste(unique(na.omit(.))[unique(na.omit(.)) != ""], collapse = "|")), .groups = 'drop')
 
+# If "GO" column exists, move it to the end to keep columns in consistent order across organisms
+if ("GO" %in% names(annot)) {
+  go_column <- annot$GO
+  annot$GO <- NULL
+  annot$GO <- go_column
+}
+
 # Sort the annotation table based on primary keytype gene IDs
 annot <- annot %>% arrange(.[[1]])
 
@@ -696,6 +806,23 @@ write("\n\nAll session info:\n", out_log_filename, append = TRUE)
 write(capture.output(sessionInfo()), out_log_filename, append = TRUE)
 ```
 
+**Input Data:**
+
+- annot_pantherdb (annotation table with GTF, org.db, STRING, and PANTHER annotations, output from [step 7](#7-add-gene-ontology-go-slim-ids))
+- primary_keytype (the name of the primary key type being used, output from [step 4](#4-build-initial-annotation-table))
+- out_table_filename (name of the output annotation table file, output from [step 1](#1-define-variables-and-output-file-names))
+- out_log_filename (name of the output log file, output from [step 1](#1-define-variables-and-output-file-names))
+- GL_DPPD_ID (GeneLab Data Processing Pipeline Document ID, from step 0)
+- gtf_link (URL to the GTF file for the target organism, output from [step 1](#1-define-variables-and-output-file-names))
+- target_org_db (name of the org.eg.db R package for the target organism, output from [steps 1](#1-define-variables-and-output-file-names) and [2](#2-create-the-organism-package-if-it-is-not-hosted-by-bioconductor))
+- no_org_db (list of organisms that do not use org.db annotations, output from [step 3](#3-load-annotation-databases))
+
+**Output Data:**
+
+- annot (final annotation table with annotations from the GTF, org.db, STRING, and PANTHER)
+- ***-GL-annotations.tsv** (annot saved as a tab-delimited table file)
+- ***-GL-build-info.txt** (annotation table build information log file)
+
 <br>
 
 ---
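Review note: the preceding hunks collapse duplicate rows into `|`-separated values and then move the `GO` column to the end of `annot`. The base R sketch below runs both ideas on a toy table to make the resulting column order and collapsed values easy to inspect. The toy data, the `SYMBOL` column, and the helper name `collapse_unique` are hypothetical; the drop-and-re-append step is copied from the commit, and the `setdiff()` reordering is shown only as an equivalent alternative.

```r
# Toy annotation table; values and the SYMBOL column are placeholders
annot <- data.frame(
  ENSEMBL = c("gene1", "gene2"),
  GO      = c("GO:0000001|GO:0000002", ""),
  SYMBOL  = c("abc", "def"),
  stringsAsFactors = FALSE
)

# Same approach as the commit: drop the column, then re-append it at the end
if ("GO" %in% names(annot)) {
  go_column <- annot$GO
  annot$GO <- NULL
  annot$GO <- go_column
}
column_order <- names(annot)

# Equivalent reordering by column name, without removing the column first
reordered <- annot[, c(setdiff(names(annot), "GO"), "GO"), drop = FALSE]

# Collapse helper mirroring the summarise() lambda: keep unique, non-missing,
# non-empty values and join them with "|"
collapse_unique <- function(x) {
  vals <- unique(na.omit(x))
  paste(vals[vals != ""], collapse = "|")
}
collapsed <- collapse_unique(c("GO:1", NA, "", "GO:1", "GO:2"))
print(column_order)
print(collapsed)
```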
@@ -706,5 +833,5 @@ write(capture.output(sessionInfo()), out_log_filename, append = TRUE)
 
 **Pipeline Output data:**
 
-- *-GL-annotations.tsv (Tab delineated table of gene annotations, used to add gene annotations in other GeneLab processing pipelines)
-- *-GL-build-info.txt (Text file containing information used to create the annotation table, including tool and tool versions and date of creation)
+- ***-GL-annotations.tsv** (Tab-delineated table of gene annotations, used to add gene annotations in other GeneLab processing pipelines)
+- ***-GL-build-info.txt** (Text file containing information used to create the annotation table, including tool and tool versions and date of creation)
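
Review note: since the annotation table is tab-separated and multi-valued fields are joined with `|` (see the `summarise()` call earlier in the diff), a downstream consumer might read and split it roughly as follows. This is a sketch under assumptions: the toy table stands in for a real `*-GL-annotations.tsv` file, the column names are placeholders, and the exact write options used by the pipeline script are not shown in this diff.

```r
# Toy table standing in for a *-GL-annotations.tsv file; values are made up
toy <- data.frame(
  ENSEMBL = c("gene1", "gene2"),
  SYMBOL  = c("abc", "def"),
  GO      = c("GO:0000001|GO:0000002", ""),
  stringsAsFactors = FALSE
)
tsv_path <- tempfile(fileext = ".tsv")
write.table(toy, tsv_path, sep = "\t", quote = FALSE, row.names = FALSE)

# Read the tab-separated table back as character columns
annot <- read.delim(tsv_path, stringsAsFactors = FALSE)

# Split multi-valued fields on the literal "|" separator;
# fixed = TRUE prevents "|" from being treated as regex alternation
go_terms <- strsplit(annot$GO, "|", fixed = TRUE)
n_terms <- lengths(go_terms)
print(n_terms)
```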
