Skip to content

Commit 3f731d5

Browse files
Create GL-DPPD-7110-A.md
1 parent 6945252 commit 3f731d5

File tree

1 file changed

+373
-0
lines changed
  • GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A

1 file changed

+373
-0
lines changed
Lines changed: 373 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,373 @@
1+
# GeneLab pipeline for generating reference annotation tables
2+
3+
> **This page holds an overview and instructions for how GeneLab generates reference annotation tables. The GeneLab reference annotation table used to add annotations to processed data files are indicated in the exact processing scripts provided for each GLDS dataset under the respective omics datatype subdirectory.**
4+
5+
---
6+
7+
**Date:** Month XX, 2024
8+
**Revision:** -A
9+
**Document Number:** GL-DPPD-7110-A
10+
11+
**Submitted by:**
12+
Alexis Torres and Crystal Han (GeneLab Data Processing Team)
13+
14+
**Approved by:**
15+
Sylvain Costes (OSDR Project Manager)
16+
Samrawit Gebre (GeneLab Deputy Project Manager and Acting Genelab Configuration Manager)
17+
Lauren Sanders (OSDR Project Scientist)
18+
Amanda Saravia-Butler (GeneLab Science Lead)
19+
Barbara Novak (GeneLab Data Processing Lead)
20+
21+
---
22+
23+
## Updates from previous version
24+
25+
26+
27+
---
28+
29+
# Table of contents
30+
31+
- [GeneLab pipeline for generating reference annotation tables](#genelab-pipeline-for-generating-reference-annotation-tables)
32+
- [Table of contents](#table-of-contents)
33+
- [Software used](#software-used)
34+
- [Annotation table build overview with example commands](#annotation-table-build-overview-with-example-commands)
35+
- [0. Set Up Environment](#0-set-up-environment)
36+
- [1. Define Variables and Output File Names](#1-define-variables-and-output-file-names)
37+
- [2. Load Annotation Databases and Retrieve Unique Gene IDs](#2-load-annotation-databases-and-retrieve-unique-gene-ids)
38+
- [3. Build Initial Annotation Table](#3-build-initial-annotation-table)
39+
- [4. Add STRING IDs](#4-add-string-ids)
40+
- [5. Add Gene Ontology (GO) slim IDs](#5-add-gene-ontology-go-slim-ids)
41+
- [6. Export Annotation Table and Build Info](#6-export-annotation-table-and-build-info)
42+
43+
44+
---
45+
46+
# Software used
47+
48+
|Program|Version|Relevant Links|
49+
|:------|:------:|:-------------|
50+
|R|4.2.1|[https://www.r-project.org/](https://www.r-project.org/)|
51+
|Bioconductor|3.15|[https://bioconductor.org](https://bioconductor.org)|
52+
|tidyverse|1.3.2|[https://www.tidyverse.org](https://www.tidyverse.org)|
53+
|STRINGdb|2.8.4|[https://www.bioconductor.org/packages/release/bioc/html/STRINGdb.html](https://www.bioconductor.org/packages/release/bioc/html/STRINGdb.html)|
54+
|PANTHER.db|1.0.11|[https://bioconductor.org/packages/release/data/annotation/html/PANTHER.db.html](https://bioconductor.org/packages/release/data/annotation/html/PANTHER.db.html)|
55+
|rtracklayer|1.56.1|[https://bioconductor.org/packages/release/bioc/html/rtracklayer.html](https://bioconductor.org/packages/release/bioc/html/rtracklayer.html)
56+
|org.Hs.eg.db|3.15.0|[https://bioconductor.org/packages/release/data/annotation/html/org.Hs.eg.db.html](https://bioconductor.org/packages/release/data/annotation/html/org.Hs.eg.db.html)|
57+
|org.Mm.eg.db|3.15.0|[https://bioconductor.org/packages/release/data/annotation/html/org.Mm.eg.db.html](https://bioconductor.org/packages/release/data/annotation/html/org.Mm.eg.db.html)|
58+
|org.Rn.eg.db|3.15.0|[https://bioconductor.org/packages/release/data/annotation/html/org.Rn.eg.db.html](https://bioconductor.org/packages/release/data/annotation/html/org.Rn.eg.db.html)
59+
|org.Dm.eg.db|3.15.0|[https://bioconductor.org/packages/release/data/annotation/html/org.Dm.eg.db.html](https://bioconductor.org/packages/release/data/annotation/html/org.Dm.eg.db.html)|
60+
|org.Ce.eg.db|3.15.0|[https://bioconductor.org/packages/release/data/annotation/html/org.Ce.eg.db.html](https://bioconductor.org/packages/release/data/annotation/html/org.Ce.eg.db.html)|
61+
|org.At.tair.db|3.15.0|[https://bioconductor.org/packages/release/data/annotation/html/org.At.tair.db.html](https://bioconductor.org/packages/release/data/annotation/html/org.At.tair.db.html)|
62+
|org.EcK12.eg.db|3.15.0|[https://bioconductor.org/packages/release/data/annotation/html/org.EcK12.eg.db.html](https://bioconductor.org/packages/release/data/annotation/html/org.EcK12.eg.db.html)|
63+
|org.Sc.sgd.db|3.15.0|[https://bioconductor.org/packages/release/data/annotation/html/org.Sc.sgd.db.html](https://bioconductor.org/packages/release/data/annotation/html/org.Sc.sgd.db.html)|
64+
65+
---
66+
67+
# Annotation table build overview with example commands
68+
69+
> Current GeneLab annotation tables are available on [figshare](https://figshare.com/), exact links for each reference organism are provided in the [GL-DPPD-7110-A_annotations.csv](GL-DPPD-7110-A_annotations.csv) file.
70+
>
71+
> **[Ensembl Reference Files](https://www.ensembl.org/index.html) Used:**
72+
> - Animals: Ensembl release 111
73+
> - Plants: Ensembl plants release 58
74+
> - Bacteria: Ensembl bacteria release 58
75+
76+
77+
---
78+
79+
This example below is done for *Mus musculus*. All code is executed in R.
80+
81+
## 0. Set Up Environment
82+
83+
```R
84+
target_organism == "MOUSE"
85+
86+
GL_DPPD_ID <- "GL-DPPD-7110-A"
87+
88+
## Import libraries ##
89+
library(tidyverse)
90+
library(STRINGdb)
91+
library(PANTHER.db)
92+
library(rtracklayer)
93+
94+
95+
## Set the primary annotation keytype, TAIR for Arabidopsis, ENSEMBL for all other organisms ##
96+
if ( target_organism == "ARABIDOPSIS" ) {
97+
98+
primary_keytype <- "TAIR"
99+
100+
} else {
101+
102+
primary_keytype <- "ENSEMBL"
103+
104+
}
105+
106+
107+
## Define annotation keys to retrieve ##
108+
wanted_keys_vec <- c("SYMBOL", "GENENAME", "REFSEQ", "ENTREZID")
109+
110+
111+
## Define links to tables containing species-specific annotation info ##
112+
ref_tab_link <-
113+
"https://raw.githubusercontent.com/nasa/GeneLab_Data_Processing/master/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110/GL-DPPD-7110_annotations.csv"
114+
115+
116+
## Set timeout time to allow more time for annotation file downloads to complete ##
117+
options(timeout = 600)
118+
```
119+
120+
---
121+
122+
## 1. Define Variables and Output File Names
123+
124+
```R
125+
## Read in tables containing species-specific annotation info ##
126+
ref_table <- read.csv(ref_tab_link)
127+
128+
## Retrieve and define target organism taxid, annotation database name, and scientific name ##
129+
target_taxid <- ref_table %>%
130+
filter(name == target_organism) %>%
131+
pull(taxon)
132+
133+
target_org_db <- ref_table %>%
134+
filter(name == target_organism) %>%
135+
pull(annotations)
136+
137+
target_species_designation <- ref_table %>%
138+
filter(name == target_organism) %>%
139+
pull(species)
140+
141+
## Define link to Ensembl annotation gtf file for the target organism ##
142+
gtf_link <- ref_table %>%
143+
filter(species == target_species_designation) %>%
144+
pull(gtf)
145+
146+
## Create output file names ##
147+
base_gtf_filename <- basename(gtf_link)
148+
base_output_name <- str_replace(base_gtf_filename, ".gtf.gz", "")
149+
150+
out_table_filename <- paste0(base_output_name, "-GL-annotations.tsv")
151+
out_log_filename <- paste0(base_output_name, "-GL-build-info.txt")
152+
```
153+
154+
<br>
155+
156+
---
157+
158+
## 2. Load Annotation Databases and Retrieve Unique Gene IDs
159+
160+
```R
161+
## Import Ensembl annotation gtf file for the target organism ##
162+
gtf_obj <- import(gtf_link)
163+
164+
## Define unique Ensembl IDs ##
165+
unique_IDs <- gtf_obj$gene_id %>% unique()
166+
167+
## Remove gtf object to conserve RAM, since it is no longer needed ##
168+
rm(gtf_obj)
169+
170+
## Define target organism annotation database ##
171+
ann.dbi <- target_org_db
172+
173+
## Install target organism annotation database if not already installed, then load the annotation database library ##
174+
if ( ! require(ann.dbi, character.only = TRUE)) {
175+
176+
BiocManager::install(ann.dbi, ask = FALSE)
177+
178+
}
179+
180+
library(ann.dbi, character.only = TRUE)
181+
```
182+
183+
<br>
184+
185+
---
186+
187+
## 3. Build Initial Annotation Table
188+
189+
```R
190+
## Begin annotation table using unique IDs of the primary keytype ##
191+
annot <- data.frame(unique_IDs)
192+
colnames(annot) <- primary_keytype
193+
194+
## Retrieve and add additional annotation keys as table columns ##
195+
for ( key in wanted_keys_vec ) {
196+
197+
if ( key %in% columns(eval(parse(text = ann.dbi), env = .GlobalEnv))) {
198+
199+
new_list <- mapIds(eval(parse(text = ann.dbi), env = .GlobalEnv), keys = unique_IDs, keytype = primary_keytype, column = key, multiVals = "list")
200+
201+
# they come as lists when we accept the multiple hits, so converting to character strings here
202+
annot[[key]] <- sapply(new_list, paste, collapse = "|")
203+
204+
} else {
205+
206+
# if the annotation DB didn't have any of the wanted key types, that column will be missing
207+
# adding in here as an empty column
208+
annot[key] <- NA
209+
210+
}
211+
}
212+
213+
```
214+
215+
<br>
216+
217+
---
218+
219+
## 4. Add STRING IDs
220+
221+
```R
222+
## Retrieve target organism STRING protein-protein interaction database and create STRING ID map to the primary keytype ##
223+
string_db <- STRINGdb$new(version = "11.5", species = target_taxid, score_threshold = 0)
224+
string_map <- string_db$map(annot, primary_keytype, removeUnmappedRows = FALSE, takeFirst = FALSE)
225+
226+
## Create a table using the gene IDs of the primary keytype as row names and a column containing STRING IDs. ##
227+
## For genes containing multiple STRING IDs, combine all STRING IDs for each gene into one row and separate each ID with a '|' ##
228+
tab_with_multiple_STRINGids_combined <-
229+
data.frame(row.names = annot[[primary_keytype]])
230+
231+
for ( curr_gene_ID in row.names(tab_with_multiple_STRINGids_combined) ) {
232+
233+
curr_STRING_ids <- string_map %>%
234+
filter(!!rlang::sym(primary_keytype) == curr_gene_ID) %>%
235+
pull(STRING_id) %>% paste(collapse = "|")
236+
237+
tab_with_multiple_STRINGids_combined[curr_gene_ID, "STRING_id"] <- curr_STRING_ids
238+
239+
}
240+
241+
## Move the primary keytype gene IDs back to being a column in the STRING ID table (since they were switched to row names above) ##
242+
tab_with_multiple_STRINGids_combined <-
243+
tab_with_multiple_STRINGids_combined %>%
244+
rownames_to_column(primary_keytype)
245+
246+
## Add the STRING ID column to the annotation table ##
247+
248+
annot <- dplyr::left_join(annot,
249+
tab_with_multiple_STRINGids_combined,
250+
by = primary_keytype)
251+
```
252+
253+
<br>
254+
255+
---
256+
257+
## 5. Add Gene Ontology (GO) slim IDs
258+
259+
```R
260+
## Retrieve target organism PANTHER GO slim annotations database ##
261+
pthOrganisms(PANTHER.db) <- target_organism
262+
263+
## Use ENTREZ IDs to map genes to respective PANTHER GO slim annotation(s) ##
264+
# Note: Since there can be none (indicated in the annotation table as "NA"), one, or
265+
# multiple ENTREZ IDs for a gene, this section contains 3 distinct parts to handle
266+
# each of those scenarios and create a new column in the annotation table containing the GO slim IDs
267+
268+
for ( curr_row in 1:dim(annot)[1] ) {
269+
270+
curr_entry <- annot[curr_row, "ENTREZID"]
271+
272+
## For genes without an ENTREZ ID ##
273+
if ( curr_entry == "NA" ) {
274+
275+
annot[curr_row, "GOSLIM_IDS"] <- "NA"
276+
277+
} else if ( ! grepl("|", curr_entry, fixed = TRUE) ) {
278+
279+
## For genes with one ENTREZ ID ##
280+
curr_GO_IDs <- mapIds(PANTHER.db, keys = curr_entry, keytype = "ENTREZ", column = "GOSLIM_ID", multiVals = "list") %>% unlist() %>% as.vector()
281+
282+
## Add "NA" to the GO slim column for ENTREZ IDs that do not contain a respective GO slim ID ##
283+
if ( is.null(curr_GO_IDs) ) {
284+
285+
curr_GO_IDs <- "NA"
286+
}
287+
288+
annot[curr_row, "GOSLIM_IDS"] <- paste(curr_GO_IDs, collapse = "|")
289+
290+
} else {
291+
292+
## For genes with multiple ENTREZ ID ##
293+
# Note: In this scenario, the ENTREZ IDs for each gene are first split with a '|' to
294+
# separate the IDs, then the GO slim ID(s) for each ENTREZ ID are collected and
295+
# combined, then duplicates are removed, and the final list of GO slim IDs for
296+
# each gene are added in a single row, separated with a '|'
297+
298+
## Split the ENTREZ IDs ##
299+
curr_entry_vec <- strsplit(curr_entry, "|", fixed = TRUE)
300+
301+
## Start a vector of current GO slim IDs ##
302+
curr_GO_IDs <- vector()
303+
304+
## Collect and combine GO slim ID(s) for each ENTREZ ID ##
305+
for ( curr_entry in curr_entry_vec ) {
306+
307+
new_GO_IDs <- mapIds(PANTHER.db, keys = curr_entry, keytype = "ENTREZ", column = "GOSLIM_ID", multiVals = "list") %>% unlist() %>% as.vector()
308+
309+
## Add new GO slim IDs to the GO slim IDs vector ##
310+
curr_GO_IDs <- c(curr_GO_IDs, new_GO_IDs)
311+
312+
}
313+
314+
## Remove duplicate GO slim IDs ##
315+
curr_GO_IDs <- unique(curr_GO_IDs)
316+
317+
## Add "NA" to the GO slim vector for ENTREZ IDs that do not contain a respective GO slim ID ##
318+
if ( length(curr_GO_IDs) == 0 ) {
319+
320+
curr_GO_IDs <- "NA"
321+
}
322+
323+
## Add additional GO slim IDs to the GOSLIM ID column in the annotation table ##
324+
annot[curr_row, "GOSLIM_IDS"] <- paste(curr_GO_IDs, collapse = "|")
325+
326+
}
327+
328+
}
329+
```
330+
331+
<br>
332+
333+
---
334+
335+
## 6. Export Annotation Table and Build Info
336+
337+
```R
338+
## Sort the annotation table based on primary keytype gene IDs ##
339+
annot <- annot %>% arrange(.[[1]])
340+
341+
## Replacing any blank cells with NA ##
342+
annot[annot == ""] <- NA
343+
344+
## Export the annotation table using the file name defined in Step 1 ##
345+
write.table(annot, out_table_filename, sep = "\t", quote = FALSE, row.names = FALSE)
346+
347+
## Define the date the annotation table was generated ##
348+
date_generated <- format(Sys.time(), "%d-%B-%Y")
349+
350+
## Export annotation table build info using the file name defined in Step 1 ##
351+
writeLines(paste(c("Based on:\n ", GL_DPPD_ID), collapse = ""), out_log_filename)
352+
write(paste(c("Build done on:\n ", date_generated), collapse = ""), out_log_filename, append = TRUE)
353+
write(paste(c("\nUsed gtf file:\n ", gtf_link), collapse = ""), out_log_filename, append = TRUE)
354+
write(paste(c("\nUsed ", ann.dbi, " version:\n ", packageVersion(ann.dbi) %>% as.character()), collapse = ""), out_log_filename, append = TRUE)
355+
write(paste(c("\nUsed STRINGdb version:\n ", packageVersion("STRINGdb") %>% as.character()), collapse = ""), out_log_filename, append = TRUE)
356+
write(paste(c("\nUsed PANTHER.db version:\n ", packageVersion("PANTHER.db") %>% as.character()), collapse = ""), out_log_filename, append = TRUE)
357+
358+
write("\n\nAll session info:\n", out_log_filename, append = TRUE)
359+
write(capture.output(sessionInfo()), out_log_filename, append = TRUE)
360+
```
361+
362+
<br>
363+
364+
---
365+
366+
**Pipeline Input data:**
367+
368+
- No input files required, but a target organism must be specified as a positional command line argument
369+
370+
**Pipeline Output data:**
371+
372+
- *-GL-annotations.tsv (Tab delineated table of gene annotations, used to add gene annotations in other GeneLab processing pipelines)
373+
- *-GL-build-info.txt (Text file containing information used to create the annotation table, including tool and tool versions and date of creation)

0 commit comments

Comments
 (0)