- Add software updates to CHANGELOG
- Add input and output variables to DPPD document
- Prepend species name to output files for non-ENSEMBL reference organisms to make sure it is in the file names
- Fix unclear variable names and wording in some functions in the script
- Move GO column to the end of the annotation tables when applicable
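
The last changelog entry, moving the GO column to the end of the annotation tables when applicable, can be illustrated with a minimal base R sketch. The helper name `move_column_to_end` and the toy table are hypothetical and are not taken from the script; the actual implementation in the PR may differ.

```r
# Hypothetical helper (not from the script): move a named column to the end
# of a data frame. If the column is absent, the table is returned unchanged,
# which mirrors the "when applicable" wording in the changelog.
move_column_to_end <- function(df, col) {
  if (!(col %in% colnames(df))) {
    return(df)
  }
  other_cols <- setdiff(colnames(df), col)
  df[, c(other_cols, col), drop = FALSE]
}

# Toy annotation table with made-up values, for illustration only
annot_example <- data.frame(
  ENSEMBL = c("gene1", "gene2"),
  GO      = c("GO:0000001", "GO:0000002"),
  SYMBOL  = c("abc", "def"),
  stringsAsFactors = FALSE
)

annot_example <- move_column_to_end(annot_example, "GO")
print(colnames(annot_example))
```

Reordering by column name rather than by position keeps the sketch independent of how many annotation columns were added in earlier steps.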
- gtf_link (URL to the GTF file for the target organism, output from [step 1](#1-define-variables-and-output-file-names))
- target_org_db (name of the org.eg.db R package for the target organism, output from [steps 1](#1-define-variables-and-output-file-names) and [2](#2-create-the-organism-package-if-it-is-not-hosted-by-bioconductor))
- target_organism (name of the target organism, output from [step 1](#1-define-variables-and-output-file-names))
- currently_accepted_orgs (list of currently supported organisms, defined at the beginning of the script)
- ref_tab_path ([path to the reference table CSV](GL-DPPD-7110-A_annotations.csv))

**Output Data:**

- GTF (data frame containing the GTF file for the target organism)
- no_org_db (list of organisms that do not use org.db annotations due to inconsistent gene names across GTF and org.db)
- Loaded org.db package (the organism-specific annotation package is loaded into the R session, if applicable)

<br>

---
@@ -394,6 +444,17 @@ if (target_organism == "Salmonella enterica") {
}
```

**Input Data:**

- GTF (data frame containing the parsed GTF file for the target organism, output from [step 3](#3-load-annotation-databases))
- target_organism (target organism's full species name, output from [step 1](#1-define-variables-and-output-file-names))
- gtf_keytype_mappings (list of keys to extract from the GTF, for each organism)

**Output Data:**

- annot_gtf (initial annotation table derived from the GTF file, containing only the relevant columns for the target organism)
- primary_keytype (the name of the primary key type being used, e.g., "ENSEMBL", "TAIR", "LOCUS", based on the GTF gene_id entries)

<br>

---
@@ -448,12 +509,12 @@ orgdb_keytype <- if (!is.null(orgdb_keytype_mappings[[target_organism]])) {
orgdb_keytype_mappings[["default"]][["keytype"]]
}

# Function to clean and match ACCNUM keys for BRADI
@@ -497,6 +558,20 @@ if (target_organism == "Saccharomyces cerevisiae") {
}
```

**Input Data:**

- annot_gtf (initial annotation table derived from the GTF file, output from [step 4](#4-build-initial-annotation-table))
- target_organism (target organism's full species name, output from [step 1](#1-define-variables-and-output-file-names))
- no_org_db (list of organisms that do not use annotations from an org.db, output from [step 3](#3-load-annotation-databases))
- primary_keytype (the name of the primary key type being used, output from [step 4](#4-build-initial-annotation-table))
- target_org_db (name of the org.eg.db R package for the target organism, output from [steps 1](#1-define-variables-and-output-file-names) and [2](#2-create-the-organism-package-if-it-is-not-hosted-by-bioconductor))

**Output Data:**

- annot_orgdb (updated annotation table with additional keys from the organism-specific org.db)
- orgdb_query (the key type used to map to the org.db)
- orgdb_keytype (the name of the key type in the org.db)

<br>

---
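
The `orgdb_keytype` lookup visible in the hunk above falls back to a `"default"` entry when the target organism has no entry of its own. A minimal sketch of that pattern follows. The mapping contents, the organism name, and the wrapper function `get_orgdb_keytype` are hypothetical, and the organism-specific branch (which the diff does not show) is assumed to mirror the default branch.

```r
# Hypothetical mapping: the real entries are defined in the script and are
# not shown in this diff. Only the "default" fallback is taken from the hunk.
orgdb_keytype_mappings <- list(
  "Example organism" = list(keytype = "SYMBOL"),
  "default"          = list(keytype = "ENSEMBL")
)

# Return the organism-specific key type if one is defined, otherwise the
# default. `[[` on a list returns NULL for a name that is not present.
get_orgdb_keytype <- function(target_organism,
                              mappings = orgdb_keytype_mappings) {
  if (!is.null(mappings[[target_organism]])) {
    # Assumed form of the branch that is not visible in the diff
    mappings[[target_organism]][["keytype"]]
  } else {
    mappings[["default"]][["keytype"]]
  }
}

print(get_orgdb_keytype("Example organism"))
print(get_orgdb_keytype("Organism without an entry"))
```

Keeping the per-organism exceptions in one list means supporting a new organism only requires adding an entry, not another conditional.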
@@ -609,6 +684,20 @@ if (target_organism == "Bacillus subtilis") {
annot_stringdb<- as.data.frame(annot_stringdb)
```

**Input Data:**

- annot_orgdb (annotation table with GTF and org.db annotations, output from [step 5](#5-add-orgdb-keys))
- target_organism (target organism's full species name, output from [step 1](#1-define-variables-and-output-file-names))
- primary_keytype (the name of the primary key type being used, output from [step 4](#4-build-initial-annotation-table))
- target_taxid (taxonomic identifier for the target organism, output from [step 1](#1-define-variables-and-output-file-names))

**Output Data:**

- annot_stringdb (updated annotation table with added STRING IDs)
- no_stringdb (list of organisms that do not use STRING annotations)
- stringdb_query (the key type used for mapping to STRING database)
- uses_old_locus (list of organisms where GTF gene_id entries do not match those in STRING, so entries in OLD_LOCUS are used to query STRING)

<br>

---
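
The role of `uses_old_locus` described above can be sketched as a simple column choice. This is an illustration under stated assumptions, not the script's code: the organism name and the function `choose_stringdb_query` are hypothetical, and falling back to `primary_keytype` for all other organisms is an assumption based on the listed inputs.

```r
# Hypothetical organism list: the real contents of uses_old_locus are
# defined in the script and are not shown in this diff.
uses_old_locus <- c("Example organism A")

# Choose which annotation column is used to query STRING.
# Organisms in uses_old_locus query with OLD_LOCUS; for all others this
# sketch assumes the primary key type is used.
choose_stringdb_query <- function(target_organism, primary_keytype) {
  if (target_organism %in% uses_old_locus) {
    "OLD_LOCUS"
  } else {
    primary_keytype
  }
}

print(choose_stringdb_query("Example organism A", "LOCUS"))
print(choose_stringdb_query("Example organism B", "ENSEMBL"))
```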
@@ -658,6 +747,20 @@ if (!(target_organism %in% no_panther_db)) {
}
```

**Input Data:**

- annot_orgdb (annotation table with GTF and org.db annotations, output from [step 5](#5-add-orgdb-keys))
- target_organism (target organism's full species name, output from [step 1](#1-define-variables-and-output-file-names))
- primary_keytype (the name of the primary key type being used, output from [step 4](#4-build-initial-annotation-table))
- target_taxid (taxonomic identifier for the target organism, output from [step 1](#1-define-variables-and-output-file-names))

**Output Data:**

- annot_stringdb (updated annotation table with added STRING IDs)
- no_stringdb (list of organisms that do not use STRING annotations)
- stringdb_query (the key type used for mapping to STRING database)
- uses_old_locus (list of organisms where the 'gene_id' column in the GTF dataframe does not match STRING identifiers, so the 'old_locus_tag' column from the GTF dataframe is used to query STRING instead)
- annot_pantherdb (annotation table with GTF, org.db, STRING, and PANTHER annotations, output from [step 7](#7-add-gene-ontology-go-slim-ids))
- primary_keytype (the name of the primary key type being used, output from [step 4](#4-build-initial-annotation-table))
- out_table_filename (name of the output annotation table file, output from [step 1](#1-define-variables-and-output-file-names))
- out_log_filename (name of the output log file, output from [step 1](#1-define-variables-and-output-file-names))
- GL_DPPD_ID (GeneLab Data Processing Pipeline Document ID, from step 0)
- gtf_link (URL to the GTF file for the target organism, output from [step 1](#1-define-variables-and-output-file-names))
- target_org_db (name of the org.eg.db R package for the target organism, output from [steps 1](#1-define-variables-and-output-file-names) and [2](#2-create-the-organism-package-if-it-is-not-hosted-by-bioconductor))
- no_org_db (list of organisms that do not use org.db annotations, output from [step 3](#3-load-annotation-databases))

**Output Data:**

- annot (final annotation table with annotations from the GTF, org.db, STRING, and PANTHER)
- **\*-GL-annotations.tsv** (annot saved as a tab-delimited table file)
- **\*-GL-build-info.txt** (annotation table build information log file)