Commit c72d4bb

Merge pull request #118 from torres-alexis/DEV_GeneLab_Reference_Annotations_vGL-DPPD-7110-A
Software version updates, input output variables defined, updated annotation tables
2 parents bbf7a78 + 81d06dd commit c72d4bb

4 files changed

+209
-36
lines changed

GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md

Lines changed: 164 additions & 18 deletions
@@ -151,21 +151,29 @@ The default columns in the annotation table are:
151151
| org.Sc.sgd.db | 3.19.1 | [https://bioconductor.org/packages/release/data/annotation/html/org.Sc.sgd.db.html](https://www.bioconductor.org/packages/release/data/annotation/html/org.Sc.sgd.db.html) |
152152
| AnnotationForge | 1.46.0 | [https://bioconductor.org/packages/AnnotationForge](https://bioconductor.org/packages/AnnotationForge) |
153153
| biomaRt | 2.60.1 | [https://bioconductor.org/packages/biomaRt](https://bioconductor.org/packages/biomaRt) |
154-
| GO.db | 2.0.0 | [https://bioconductor.org/packages/GO.db](https://bioconductor.org/packages/GO.db) |
154+
| GO.db | 3.19.1 | [https://bioconductor.org/packages/GO.db](https://bioconductor.org/packages/GO.db) |
155155

156156
---
157157

158158
# Annotation table build overview with example commands
159159

160-
> Current GeneLab annotation tables are available on [figshare](https://figshare.com/), exact links for each reference organism are provided in the [GL-DPPD-7110-A_annotations.csv](GL-DPPD-7110-A_annotations.csv) file.
161-
>
162-
> **[Ensembl Reference Versions](https://www.ensembl.org/index.html):**
163-
> - Animals: Ensembl release 112
164-
> - Plants: Ensembl plants release 59
165-
> - Bacteria: Ensembl bacteria release 59
166-
>
167-
> **PANTHER:** 18.0
168-
> > *Note: The values in the 'name' column of [GL-DPPD-7110-A_annotations.csv](GL-DPPD-7110-A_annotations.csv) (e.g., MOUSE, HUMAN, ARABIDOPSIS) are derived from the short names used in PANTHER. These short names are subject to change.*
160+
Current GeneLab annotation tables are available on [figshare](https://figshare.com/); exact links for each reference organism are provided in the [GL-DPPD-7110-A_annotations.csv](GL-DPPD-7110-A_annotations.csv) file.
161+
162+
**[Ensembl Reference Versions](https://www.ensembl.org/index.html):**
163+
- Animals: Ensembl release 112
164+
- Plants: Ensembl plants release 59
165+
- Bacteria: Ensembl bacteria release 59
166+
167+
**Database Versions:**
168+
- STRINGdb: 12.0
169+
- PANTHERdb: 18.0
170+
> Note: The values in the 'name' column of [GL-DPPD-7110-A_annotations.csv](GL-DPPD-7110-A_annotations.csv) (e.g., HUMAN, MOUSE, RAT) are derived from the short names used in PANTHER. These short names are subject to change.
171+
- GO.db:
172+
  - GO ontology file updated on 2024-01-17
173+
  - Entrez gene data updated on 2024-03-12
174+
  - DB schema version 2.1
175+
176+
169177

170178
---
171179

@@ -194,6 +202,18 @@ library(STRINGdb)
194202
library(PANTHER.db)
195203
library(rtracklayer)
196204
```
205+
**Input Data:**
206+
207+
- None (This is an initial setup step using predefined variables)
208+
209+
**Output Data:**
210+
211+
- GL_DPPD_ID (GeneLab Data Processing Pipeline Document ID)
212+
- ref_tab_path (path to the reference table CSV file)
213+
- readme_path (path to the README file)
214+
- currently_accepted_orgs (list of currently supported organisms)
215+
216+
<br>
197217

198218
---
199219

@@ -221,6 +241,7 @@ target_org_db <- target_info$annotations # org.eg.db R package
221241
target_species_designation <- target_info$species # Full species name
222242
gtf_link <- target_info$gtf # Path to reference assembly GTF
223243
target_short_name <- target_info$name # PANTHER / UNIPROT short name; blank if not available
244+
ref_source <- target_info$ref_source # Reference files source
224245

225246
# Error handling for missing values
226247
if (is.na(target_taxid) || is.na(target_org_db) || is.na(target_species_designation) || is.na(gtf_link)) {
@@ -231,6 +252,11 @@ if (is.na(target_taxid) || is.na(target_org_db) || is.na(target_species_designat
231252
base_gtf_filename <- basename(gtf_link)
232253
base_output_name <- str_replace(base_gtf_filename, ".gtf.gz", "")
233254

255+
# Add the species name to base_output_name if the reference source is not ENSEMBL
256+
if (!(ref_source %in% c("ensembl_plants", "ensembl_bacteria", "ensembl"))) {
257+
  base_output_name <- paste(str_replace(target_species_designation, " ", "_"), base_output_name, sep = "_")
258+
}
259+
234260
out_table_filename <- paste0(base_output_name, "-GL-annotations.tsv")
235261
out_log_filename <- paste0(base_output_name, "-GL-build-info.txt")
236262

@@ -243,6 +269,21 @@ if ( file.exists(out_table_filename) ) {
243269
quit()
244270
}
245271
```
272+
**Input Data:**
273+
274+
- ref_tab_path (path to the reference table CSV file, output from [step 0](#0-set-up-environment))
275+
- target_organism (name of the target organism for which annotations are being generated)
276+
277+
**Output Data:**
278+
279+
- target_taxid (taxonomic identifier for the target organism)
280+
- target_org_db (name of the org.db R package for the target organism)
281+
- target_species_designation (full species name of the target organism)
282+
- gtf_link (URL to the GTF file for the target organism)
283+
- target_short_name (PANTHER/UNIPROT short name for the target organism)
284+
- ref_source (source of the reference files, e.g., "ensembl", "ensembl_plants", "ensembl_bacteria", "ncbi")
285+
- out_table_filename (name of the output annotation table file)
286+
- out_log_filename (name of the output log file)
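
The file-naming rule in this step can be sketched in base R. This is a minimal illustration under stated assumptions, not the pipeline's own code: `build_output_names` and the example URLs are hypothetical, `sub()` and `gsub()` with `fixed = TRUE` stand in for the `str_replace()` calls, and the sketch replaces every space in the species name, whereas `str_replace()` replaces only the first.

```r
# Hypothetical helper: derive the output file names from the GTF link,
# the species name, and the reference source.
build_output_names <- function(gtf_link, species, ref_source) {
  # Strip the literal ".gtf.gz" suffix (fixed = TRUE avoids regex matching on ".")
  base_name <- sub(".gtf.gz", "", basename(gtf_link), fixed = TRUE)

  # Prefix the species name only when the reference is not from Ensembl
  ensembl_sources <- c("ensembl", "ensembl_plants", "ensembl_bacteria")
  if (!(ref_source %in% ensembl_sources)) {
    base_name <- paste(gsub(" ", "_", species, fixed = TRUE), base_name, sep = "_")
  }

  list(
    table = paste0(base_name, "-GL-annotations.tsv"),
    log   = paste0(base_name, "-GL-build-info.txt")
  )
}

# Example URLs are placeholders, not real reference locations
ens  <- build_output_names("https://example.org/Mus_musculus.GRCm39.112.gtf.gz",
                           "Mus musculus", "ensembl")
ncbi <- build_output_names("https://example.org/GCF_000000000.1_genomic.gtf.gz",
                           "Bacillus subtilis", "ncbi")
print(ens$table)   # "Mus_musculus.GRCm39.112-GL-annotations.tsv"
print(ncbi$table)  # "Bacillus_subtilis_GCF_000000000.1_genomic-GL-annotations.tsv"
```

Prefixing the species name keeps non-Ensembl outputs identifiable, since accession-style GTF file names do not carry the organism name.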
246287

247288
<br>
248289

@@ -293,6 +334,21 @@ if (!requireNamespace(target_org_db, quietly = TRUE)) {
293334
}
294335
```
295336

337+
**Input Data:**
338+
339+
- target_org_db (name of the org.db R package for the target organism, output from [step 1](#1-define-variables-and-output-file-names))
340+
- target_species_designation (full species name of the target organism, output from [step 1](#1-define-variables-and-output-file-names))
341+
- ref_table (reference table containing organism-specific information, output from [step 1](#1-define-variables-and-output-file-names))
342+
- target_organism (name of the target organism, output from [step 1](#1-define-variables-and-output-file-names))
343+
- target_taxid (taxonomic identifier for the target organism, output from [step 1](#1-define-variables-and-output-file-names))
344+
345+
**Output Data:**
346+
347+
- target_org_db (updated name of the org.db R package, if it was created locally)
348+
- Locally installed org.db package (if the package is not available on Bioconductor, a new package is created and installed)
349+
350+
<br>
351+
296352
---
297353

298354
## 3. Load Annotation Databases
@@ -322,6 +378,19 @@ if (!(target_organism %in% no_org_db) && (target_organism %in% currently_accepte
322378
}
323379
```
324380

381+
**Input Data:**
382+
383+
- gtf_link (URL to the GTF file for the target organism, output from [step 1](#1-define-variables-and-output-file-names))
384+
- target_org_db (name of the org.eg.db R package for the target organism, output from [steps 1](#1-define-variables-and-output-file-names) and [2](#2-create-the-organism-package-if-it-is-not-hosted-by-bioconductor))
385+
- target_organism (name of the target organism, output from [step 1](#1-define-variables-and-output-file-names))
386+
- currently_accepted_orgs (list of currently supported organisms, output from [step 0](#0-set-up-environment))
387+
- ref_tab_path (path to the reference table CSV file, output from [step 0](#0-set-up-environment))
388+
389+
**Output Data:**
390+
391+
- GTF (data frame containing the GTF file for the target organism)
392+
- no_org_db (list of organisms that do not use org.db annotations due to inconsistent gene names across GTF and org.db)
393+
325394
<br>
326395

327396
---
@@ -394,6 +463,17 @@ if (target_organism == "Salmonella enterica") {
394463
}
395464
```
396465

466+
**Input Data:**
467+
468+
- GTF (data frame containing the parsed GTF file for the target organism, output from [step 3](#3-load-annotation-databases))
469+
- target_organism (target organism's full species name, output from [step 1](#1-define-variables-and-output-file-names))
470+
- gtf_keytype_mappings (list of keys to extract from the GTF, for each organism)
471+
472+
**Output Data:**
473+
474+
- annot_gtf (initial annotation table derived from the GTF file, containing only the relevant columns for the target organism)
475+
- primary_keytype (the name of the primary key type being used, e.g., "ENSEMBL", "TAIR", "LOCUS", based on the GTF gene_id entries)
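
As a rough illustration of a per-organism mapping with a fallback, the sketch below mirrors the lookup-with-default pattern that step 5 applies to `orgdb_keytype_mappings`. It assumes `gtf_keytype_mappings` is organized the same way, which this section does not confirm; the mapping values and the `lookup_keytype` helper are hypothetical.

```r
# Hypothetical mapping: one entry per organism plus a "default" entry.
# The values are illustrative, not the pipeline's actual configuration.
keytype_mappings <- list(
  "Arabidopsis thaliana" = list(keytype = "TAIR"),
  "default"              = list(keytype = "ENSEMBL")
)

# Return the organism-specific key type, or the default when none is defined
lookup_keytype <- function(organism, mappings) {
  entry <- mappings[[organism]]  # NULL when the organism has no entry
  if (!is.null(entry)) {
    entry[["keytype"]]
  } else {
    mappings[["default"]][["keytype"]]
  }
}

print(lookup_keytype("Arabidopsis thaliana", keytype_mappings))  # "TAIR"
print(lookup_keytype("Mus musculus", keytype_mappings))          # "ENSEMBL"
```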
476+
397477
<br>
398478

399479
---
@@ -448,12 +528,12 @@ orgdb_keytype <- if (!is.null(orgdb_keytype_mappings[[target_organism]])) {
448528
orgdb_keytype_mappings[["default"]][["keytype"]]
449529
}
450530

451-
# Function to clean and match ACCNUM keys for BRADI
452-
clean_and_match_accnum <- function(annot_table, org_db, query_col, keytype_col, target_column) {
453-
  # Clean the ACCNUM keys in the GTF annotations
531+
# Function to remove version numbers from ACCNUM keys and match them for BRADI
532+
match_accnum <- function(annot_table, org_db, query_col, keytype_col, target_column) {
533+
  # Remove version numbers from the ACCNUM keys in the GTF annotations
454534
  cleaned_annot_keys <- sub("\\..*", "", annot_table[[query_col]])
455535

456-
  # Retrieve and clean the org.db keys
536+
  # Retrieve and remove version numbers from the org.db keys
457537
  orgdb_keys <- keys(org_db, keytype = keytype_col)
458538
  cleaned_orgdb_keys <- sub("\\..*", "", orgdb_keys)
459539

@@ -472,8 +552,8 @@ for (keytype in wanted_org_db_keytypes) {
472552
  # Check if keytype is a valid column in the target org.db
473553
  if (keytype %in% columns(get(target_org_db, envir = .GlobalEnv))) {
474554
    if (target_organism == "Brachypodium distachyon" && orgdb_query == "ACCNUM") {
475-
      # For BRADI: use the clean_and_match_accnum function to map to org.db ACCNUM entries
476-
      org_matches <- clean_and_match_accnum(annot_orgdb, get(target_org_db, envir = .GlobalEnv), query_col = orgdb_query, keytype_col = orgdb_keytype, target_column = keytype)
555+
      # For BRADI: use the match_accnum function to map to org.db ACCNUM entries
556+
      org_matches <- match_accnum(annot_orgdb, get(target_org_db, envir = .GlobalEnv), query_col = orgdb_query, keytype_col = orgdb_keytype, target_column = keytype)
477557
    } else {
478558
      # Default mapping for other organisms
479559
      org_matches <- mapIds(get(target_org_db, envir = .GlobalEnv), keys = annot_orgdb[[orgdb_query]], keytype = orgdb_keytype, column = keytype, multiVals = "list")
@@ -497,6 +577,20 @@ if (target_organism == "Saccharomyces cerevisiae") {
497577
}
498578
```
499579

580+
**Input Data:**
581+
582+
- annot_gtf (initial annotation table derived from the GTF file, output from [step 4](#4-build-initial-annotation-table))
583+
- target_organism (target organism's full species name, output from [step 1](#1-define-variables-and-output-file-names))
584+
- no_org_db (list of organisms that do not use annotations from an org.db, output from [step 3](#3-load-annotation-databases))
585+
- primary_keytype (the name of the primary key type being used, output from [step 4](#4-build-initial-annotation-table))
586+
- target_org_db (name of the org.eg.db R package for the target organism, output from [steps 1](#1-define-variables-and-output-file-names) and [2](#2-create-the-organism-package-if-it-is-not-hosted-by-bioconductor))
587+
588+
**Output Data:**
589+
590+
- annot_orgdb (updated annotation table with additional keys from the organism-specific org.db)
591+
- orgdb_query (the key type used to map to the org.db)
592+
- orgdb_keytype (the name of the key type in the org.db)
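
The version-stripping idea behind `match_accnum` can be tried on plain character vectors. The sketch below shows only the two `sub()` calls visible in this diff plus a `match()` step that is an assumption about how the cleaned keys might be paired; the accession-style keys are made up, and no org.db package is needed to run it.

```r
# Drop everything from the first "." onward, e.g. "ABC123.2" -> "ABC123"
strip_version <- function(ids) sub("\\..*", "", ids)

# Hypothetical keys; in the pipeline these come from the GTF and the org.db
gtf_keys   <- c("XM_000001.4", "XM_000002.1", NA)
orgdb_keys <- c("XM_000001.2", "XM_000003.1")

# Position of each version-less GTF key among the version-less org.db keys
idx <- match(strip_version(gtf_keys), strip_version(orgdb_keys))

# Map back to the original (versioned) org.db keys; unmatched entries stay NA
matched_orgdb_keys <- orgdb_keys[idx]
print(matched_orgdb_keys)  # "XM_000001.2" NA NA
```

Comparing version-less keys lets a GTF entry match an org.db entry even when the two sources carry different revisions of the same accession.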
593+
500594
<br>
501595

502596
---
@@ -609,6 +703,20 @@ if (target_organism == "Bacillus subtilis") {
609703
annot_stringdb <- as.data.frame(annot_stringdb)
610704
```
611705

706+
**Input Data:**
707+
708+
- annot_orgdb (annotation table with GTF and org.db annotations, output from [step 5](#5-add-orgdb-keys))
709+
- target_organism (target organism's full species name, output from [step 1](#1-define-variables-and-output-file-names))
710+
- primary_keytype (the name of the primary key type being used, output from [step 4](#4-build-initial-annotation-table))
711+
- target_taxid (taxonomic identifier for the target organism, output from [step 1](#1-define-variables-and-output-file-names))
712+
713+
**Output Data:**
714+
715+
- annot_stringdb (updated annotation table with added STRING IDs)
716+
- no_stringdb (list of organisms that do not use STRING annotations)
717+
- stringdb_query (the key type used for mapping to STRING database)
718+
- uses_old_locus (list of organisms where GTF gene_id entries do not match those in STRING, so entries in OLD_LOCUS are used to query STRING)
719+
612720
<br>
613721

614722
---
@@ -658,6 +766,20 @@ if (!(target_organism %in% no_panther_db)) {
658766
}
659767
```
660768

769+
**Input Data:**
770+
771+
- annot_stringdb (annotation table with GTF, org.db, and STRING annotations, output from step 6)
772+
- target_organism (target organism's full species name, output from [step 1](#1-define-variables-and-output-file-names))
773+
- primary_keytype (the name of the primary key type being used, output from [step 4](#4-build-initial-annotation-table))
774+
- target_taxid (taxonomic identifier for the target organism, output from [step 1](#1-define-variables-and-output-file-names))
775+
776+
**Output Data:**
777+
778+
- annot_pantherdb (updated annotation table with added GO slim IDs from PANTHER)
779+
- no_panther_db (list of organisms that do not use PANTHER annotations)
782+
661783
<br>
662784

663785
---
@@ -670,6 +792,13 @@ annot <- annot_pantherdb %>%
670792
group_by(!!sym(primary_keytype)) %>%
671793
summarise(across(everything(), ~paste(unique(na.omit(.))[unique(na.omit(.)) != ""], collapse = "|")), .groups = 'drop')
672794

795+
# If "GO" column exists, move it to the end to keep columns in consistent order across organisms
796+
if ("GO" %in% names(annot)) {
797+
  go_column <- annot$GO
798+
  annot$GO <- NULL
799+
  annot$GO <- go_column
800+
}
801+
673802
# Sort the annotation table based on primary keytype gene IDs
674803
annot <- annot %>% arrange(.[[1]])
675804

@@ -696,6 +825,23 @@ write("\n\nAll session info:\n", out_log_filename, append = TRUE)
696825
write(capture.output(sessionInfo()), out_log_filename, append = TRUE)
697826
```
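
The collapse-and-reorder logic of this step can be reproduced on a toy table without dplyr. This is a sketch under stated assumptions: the data frame, its column names, and the `collapse_values` helper are hypothetical, and base `aggregate()` stands in for the `group_by()`/`summarise(across(...))` chain.

```r
# Join the unique, non-missing, non-empty values of one column with "|"
collapse_values <- function(x) {
  vals <- unique(x[!is.na(x)])
  paste(vals[vals != ""], collapse = "|")
}

# Hypothetical long-format table: several rows per primary key
annot_long <- data.frame(
  ENSEMBL = c("G1", "G1", "G2", "G2"),
  GO      = c("GO:1", "GO:2", "", NA),
  SYMBOL  = c("a", "a", "b", NA),
  stringsAsFactors = FALSE
)

# One row per primary key, with every other column collapsed
value_cols <- setdiff(names(annot_long), "ENSEMBL")
annot <- aggregate(annot_long[value_cols],
                   by = list(ENSEMBL = annot_long$ENSEMBL),
                   FUN = collapse_values)

# Move "GO" to the end so the column order is consistent across organisms
if ("GO" %in% names(annot)) {
  annot <- annot[c(setdiff(names(annot), "GO"), "GO")]
}

print(annot)
```

In this toy input, `G1` ends up with `GO:1|GO:2` in the `GO` column, while `G2` gets an empty string because its only values are empty or missing.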
698827

828+
**Input Data:**
829+
830+
- annot_pantherdb (annotation table with GTF, org.db, STRING, and PANTHER annotations, output from [step 7](#7-add-gene-ontology-go-slim-ids))
831+
- primary_keytype (the name of the primary key type being used, output from [step 4](#4-build-initial-annotation-table))
832+
- out_table_filename (name of the output annotation table file, output from [step 1](#1-define-variables-and-output-file-names))
833+
- out_log_filename (name of the output log file, output from [step 1](#1-define-variables-and-output-file-names))
834+
- GL_DPPD_ID (GeneLab Data Processing Pipeline Document ID, output from [step 0](#0-set-up-environment))
835+
- gtf_link (URL to the GTF file for the target organism, output from [step 1](#1-define-variables-and-output-file-names))
836+
- target_org_db (name of the org.eg.db R package for the target organism, output from [steps 1](#1-define-variables-and-output-file-names) and [2](#2-create-the-organism-package-if-it-is-not-hosted-by-bioconductor))
837+
- no_org_db (list of organisms that do not use org.db annotations, output from [step 3](#3-load-annotation-databases))
838+
839+
**Output Data:**
840+
841+
- annot (final annotation table with annotations from the GTF, org.db, STRING, and PANTHER)
842+
- ***-GL-annotations.tsv** (annot saved as a tab-delimited table file)
843+
- ***-GL-build-info.txt** (annotation table build information log file)
844+
699845
<br>
700846

701847
---
@@ -706,5 +852,5 @@ write(capture.output(sessionInfo()), out_log_filename, append = TRUE)
706852

707853
**Pipeline Output data:**
708854

709-
- *-GL-annotations.tsv (Tab delineated table of gene annotations, used to add gene annotations in other GeneLab processing pipelines)
710-
- *-GL-build-info.txt (Text file containing information used to create the annotation table, including tool and tool versions and date of creation)
855+
- ***-GL-annotations.tsv** (Tab-delimited table of gene annotations, used to add gene annotations in other GeneLab processing pipelines)
856+
- ***-GL-build-info.txt** (Text file containing the information used to create the annotation table, including tools, tool versions, and date of creation)
