# Annotation table build overview with example commands
159 159
160 - > Current GeneLab annotation tables are available on [figshare](https://figshare.com/), exact links for each reference organism are provided in the [GL-DPPD-7110-A_annotations.csv](GL-DPPD-7110-A_annotations.csv) file.
- > > *Note: The values in the 'name' column of [GL-DPPD-7110-A_annotations.csv](GL-DPPD-7110-A_annotations.csv) (e.g., MOUSE, HUMAN, ARABIDOPSIS) are derived from the short names used in PANTHER. These short names are subject to change.*
160 + Current GeneLab annotation tables are available on [figshare](https://figshare.com/), exact links for each reference organism are provided in the [GL-DPPD-7110-A_annotations.csv](GL-DPPD-7110-A_annotations.csv) file.
+ > Note: The values in the 'name' column of [GL-DPPD-7110-A_annotations.csv](GL-DPPD-7110-A_annotations.csv) (e.g., HUMAN, MOUSE, RAT) are derived from the short names used in PANTHER. These short names are subject to change.
171 + - GO.db:
172 + - GO ontology file updated on 2024-01-17
173 + - Entrez gene data updated on 2024-03-12
174 + - DB schema version 2.1
175 +
176 +
169 177
170 178 ---
171 179
@@ -194,6 +202,18 @@ library(STRINGdb)
194 202 library(PANTHER.db)
195 203 library(rtracklayer)
196 204 ```
205 + **Input Data:**
206 +
207 + - None (This is an initial setup step using predefined variables)
208 +
209 + **Output Data:**
210 +
211 + - GL_DPPD_ID (GeneLab Data Processing Pipeline Document ID)
212 + - ref_tab_path (path to the reference table CSV file)
213 + - readme_path (path to the README file)
214 + - currently_accepted_orgs (list of currently supported organisms)
+ - gtf_link (URL to the GTF file for the target organism, output from [step 1](#1-define-variables-and-output-file-names))
384 + - target_org_db (name of the org.eg.db R package for the target organism, output from [steps 1](#1-define-variables-and-output-file-names) and [2](#2-create-the-organism-package-if-it-is-not-hosted-by-bioconductor))
385 + - target_organism (name of the target organism, output from [step 1](#1-define-variables-and-output-file-names))
386 + - currently_accepted_orgs (list of currently supported organisms, output from [step 0](#0-set-up-environment))
387 + - ref_tab_path (path to the reference table CSV file, output from [step 0](#0-set-up-environment))
388 +
389 + **Output Data:**
390 +
391 + - GTF (data frame containing the GTF file for the target organism)
392 + - no_org_db (list of organisms that do not use org.db annotations due to inconsistent gene names across GTF and org.db)
393 +
325 394 <br>
326 395
327 396 ---
@@ -394,6 +463,17 @@ if (target_organism == "Salmonella enterica") {
394 463 }
395 464 ```
396 465
466 + **Input Data:**
467 +
468 + - GTF (data frame containing the parsed GTF file for the target organism, output from [step 3](#3-load-annotation-databases))
469 + - target_organism (target organism's full species name, output from [step 1](#1-define-variables-and-output-file-names))
470 + - gtf_keytype_mappings (list of keys to extract from the GTF, for each organism)
471 +
472 + **Output Data:**
473 +
474 + - annot_gtf (initial annotation table derived from the GTF file, containing only the relevant columns for the target organism)
475 + - primary_keytype (the name of the primary key type being used, e.g., "ENSEMBL", "TAIR", "LOCUS", based on the GTF gene_id entries)
476 +
397 477 <br>
398 478
399 479 ---
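The step 4 inputs and outputs documented in the hunk above can be illustrated with a minimal sketch. This is not the pipeline's actual code: it assumes `GTF` is a data frame with `type` and `gene_id` columns, and that `gtf_keytype_mappings` maps each organism to the GTF columns to keep, with a `default` fallback. The toy data and the mapping contents are hypothetical.

```r
# Toy stand-in for the parsed GTF data frame (hypothetical values)
GTF <- data.frame(
  type    = c("gene", "exon", "gene"),
  gene_id = c("ENSMUSG00000000001", "ENSMUSG00000000001", "ENSMUSG00000000028"),
  source  = c("ensembl", "ensembl", "ensembl"),
  stringsAsFactors = FALSE
)
target_organism <- "Mus musculus"

# Hypothetical mapping: organism -> named vector (GTF column -> key type)
gtf_keytype_mappings <- list(
  default = c(gene_id = "ENSEMBL")
)

# Use the organism-specific entry if present, otherwise the default
keys <- gtf_keytype_mappings[[target_organism]]
if (is.null(keys)) keys <- gtf_keytype_mappings[["default"]]

# Keep gene-level rows and only the mapped columns, then rename to key types
annot_gtf <- GTF[GTF$type == "gene", names(keys), drop = FALSE]
annot_gtf <- unique(annot_gtf)
colnames(annot_gtf) <- unname(keys)
rownames(annot_gtf) <- NULL

# In this sketch the first mapped key type serves as the primary key
primary_keytype <- unname(keys)[1]
```

Here `annot_gtf` ends up with one row per gene and a single `ENSEMBL` column, and `primary_keytype` is `"ENSEMBL"`.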
@@ -448,12 +528,12 @@ orgdb_keytype <- if (!is.null(orgdb_keytype_mappings[[target_organism]])) {
448 528 orgdb_keytype_mappings[["default"]][["keytype"]]
449 529 }
450 530
451 - # Function to clean and match ACCNUM keys for BRADI
@@ -497,6 +577,20 @@ if (target_organism == "Saccharomyces cerevisiae") {
497 577 }
498 578 ```
499 579
580 + **Input Data:**
581 +
582 + - annot_gtf (initial annotation table derived from the GTF file, output from [step 4](#4-build-initial-annotation-table))
583 + - target_organism (target organism's full species name, output from [step 1](#1-define-variables-and-output-file-names))
584 + - no_org_db (list of organisms that do not use annotations from an org.db, output from [step 3](#3-load-annotation-databases))
585 + - primary_keytype (the name of the primary key type being used, output from [step 4](#4-build-initial-annotation-table))
586 + - target_org_db (name of the org.eg.db R package for the target organism, output from [steps 1](#1-define-variables-and-output-file-names) and [2](#2-create-the-organism-package-if-it-is-not-hosted-by-bioconductor))
587 +
588 + **Output Data:**
589 +
590 + - annot_orgdb (updated annotation table with additional keys from the organism-specific org.db)
591 + - orgdb_query (the key type used to map to the org.db)
592 + - orgdb_keytype (the name of the key type in the org.db)
593 +
500 594 <br>
501 595
502 596 ---
@@ -609,6 +703,20 @@ if (target_organism == "Bacillus subtilis") {
609 703 annot_stringdb<- as.data.frame(annot_stringdb)
610 704 ```
611 705
706 + **Input Data:**
707 +
708 + - annot_orgdb (annotation table with GTF and org.db annotations, output from [step 5](#5-add-orgdb-keys))
709 + - target_organism (target organism's full species name, output from [step 1](#1-define-variables-and-output-file-names))
710 + - primary_keytype (the name of the primary key type being used, output from [step 4](#4-build-initial-annotation-table))
711 + - target_taxid (taxonomic identifier for the target organism, output from [step 1](#1-define-variables-and-output-file-names))
712 +
713 + **Output Data:**
714 +
715 + - annot_stringdb (updated annotation table with added STRING IDs)
716 + - no_stringdb (list of organisms that do not use STRING annotations)
717 + - stringdb_query (the key type used for mapping to STRING database)
718 + - uses_old_locus (list of organisms where GTF gene_id entries do not match those in STRING, so entries in OLD_LOCUS are used to query STRING)
719 +
612 720 <br>
613 721
614 722 ---
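The relationship between `primary_keytype`, `uses_old_locus`, `no_stringdb`, and `stringdb_query` described in the hunk above can be sketched as follows. This is an illustration only and does not call STRINGdb: the organism names, the identifiers, and the contents of both lists are hypothetical, and the pipeline's real selection logic may differ.

```r
# Toy annotation table (hypothetical IDs) standing in for annot_orgdb
annot_orgdb <- data.frame(
  LOCUS     = c("ABC_00010", "ABC_00020"),
  OLD_LOCUS = c("ABC00010", "ABC00020"),
  stringsAsFactors = FALSE
)
target_organism <- "Example organism"  # hypothetical organism name
primary_keytype <- "LOCUS"

# Hypothetical membership lists; the real contents are set by the pipeline
no_stringdb    <- c("Another organism")
uses_old_locus <- c("Example organism")

# Query STRING with OLD_LOCUS when GTF gene_id entries do not match STRING
stringdb_query <- if (target_organism %in% uses_old_locus) {
  "OLD_LOCUS"
} else {
  primary_keytype
}

# Identifiers that would be sent to STRING (empty if STRING is not used)
string_query_ids <- if (target_organism %in% no_stringdb) {
  character(0)
} else {
  annot_orgdb[[stringdb_query]]
}
```

Keeping the key selection separate from the database lookup makes it easy to check which column is being used before any STRING mapping is attempted.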
@@ -658,6 +766,20 @@ if (!(target_organism %in% no_panther_db)) {
658 766 }
659 767 ```
660 768
769 + **Input Data:**
770 +
771 + - annot_orgdb (annotation table with GTF and org.db annotations, output from [step 5](#5-add-orgdb-keys))
772 + - target_organism (target organism's full species name, output from [step 1](#1-define-variables-and-output-file-names))
773 + - primary_keytype (the name of the primary key type being used, output from [step 4](#4-build-initial-annotation-table))
774 + - target_taxid (taxonomic identifier for the target organism, output from [step 1](#1-define-variables-and-output-file-names))
775 +
776 + **Output Data:**
777 +
778 + - annot_stringdb (updated annotation table with added STRING IDs)
779 + - no_stringdb (list of organisms that do not use STRING annotations)
780 + - stringdb_query (the key type used for mapping to STRING database)
781 + - uses_old_locus (list of organisms where the 'gene_id' column in the GTF dataframe does not match STRING identifiers, so the 'old_locus_tag' column from the GTF dataframe is used to query STRING instead)
+ - annot_pantherdb (annotation table with GTF, org.db, STRING, and PANTHER annotations, output from [step 7](#7-add-gene-ontology-go-slim-ids))
831 + - primary_keytype (the name of the primary key type being used, output from [step 4](#4-build-initial-annotation-table))
832 + - out_table_filename (name of the output annotation table file, output from [step 1](#1-define-variables-and-output-file-names))
833 + - out_log_filename (name of the output log file, output from [step 1](#1-define-variables-and-output-file-names))
834 + - GL_DPPD_ID (GeneLab Data Processing Pipeline Document ID, output from [step 0](#0-set-up-environment))
835 + - gtf_link (URL to the GTF file for the target organism, output from [step 1](#1-define-variables-and-output-file-names))
836 + - target_org_db (name of the org.eg.db R package for the target organism, output from [steps 1](#1-define-variables-and-output-file-names) and [2](#2-create-the-organism-package-if-it-is-not-hosted-by-bioconductor))
837 + - no_org_db (list of organisms that do not use org.db annotations, output from [step 3](#3-load-annotation-databases))
838 +
839 + **Output Data:**
840 +
841 + - annot (final annotation table with annotations from the GTF, org.db, STRING, and PANTHER)
842 + -***-GL-annotations.tsv** (annot saved as a tab-delimited table file)
843 + -***-GL-build-info.txt** (annotation table build information log file)
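The two output files listed in the hunk above could be produced along the following lines. This is a hedged sketch using base R only: the toy `annot` table, the temporary output directory, the `Example-` file name prefix, the example GTF URL, and the layout of the log file are all assumptions made for illustration, and the `GL_DPPD_ID` value is inferred from the reference table file name rather than taken from the pipeline code.

```r
# Toy final annotation table (illustrative values only)
annot <- data.frame(
  ENSEMBL = c("ENSMUSG00000000001", "ENSMUSG00000000028"),
  SYMBOL  = c("Gnai3", "Cdc45"),
  stringsAsFactors = FALSE
)

# Hypothetical file names following the *-GL-annotations.tsv and
# *-GL-build-info.txt patterns, written to a temporary directory
out_dir            <- tempdir()
out_table_filename <- file.path(out_dir, "Example-GL-annotations.tsv")
out_log_filename   <- file.path(out_dir, "Example-GL-build-info.txt")

GL_DPPD_ID <- "GL-DPPD-7110-A"                      # assumed from the CSV name
gtf_link   <- "https://example.org/example.gtf.gz"  # placeholder URL

# Save the annotation table as a tab-delimited file
write.table(annot, out_table_filename,
            sep = "\t", quote = FALSE, row.names = FALSE)

# Record basic build information (log layout is an assumption)
writeLines(c(
  paste("Based on:", GL_DPPD_ID),
  paste("GTF file:", gtf_link),
  paste("Build date:", format(Sys.Date()))
), out_log_filename)

# Read the table back to confirm it round-trips
annot_check <- read.delim(out_table_filename, stringsAsFactors = FALSE)
```

Reading the table back with `read.delim` is a cheap way to confirm that the column names and row count survived the write.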