Commit 6299719

Input output updates, remove unnecessary variables
1 parent cc11ff9 commit 6299719

1 file changed: +64 -72 lines changed

GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A.md

Lines changed: 64 additions & 72 deletions
@@ -208,10 +208,10 @@ library(rtracklayer)

 **Output Data:**

-- GL_DPPD_ID (GeneLab Data Processing Pipeline Document ID)
-- ref_tab_path (path to the reference table CSV file)
-- readme_path (path to the README file)
-- currently_accepted_orgs (list of currently supported organisms)
+- `GL_DPPD_ID` (variable specifying the GeneLab Data Processing Pipeline Document ID)
+- `ref_tab_path` (variable specifying the path to the reference table CSV file)
+- `readme_path` (variable specifying the path to the README file)
+- `currently_accepted_orgs` (variable specifying the list of currently supported organisms)

 <br>

@@ -238,13 +238,12 @@ target_info <- ref_table %>%
 # Extract the relevant columns from the reference table
 target_taxid <- target_info$taxon # Taxonomic identifier
 target_org_db <- target_info$annotations # org.eg.db R package
-target_species_designation <- target_info$species # Full species name
 gtf_link <- target_info$gtf # Path to reference assembly GTF
 target_short_name <- target_info$name # PANTHER / UNIPROT short name; blank if not available
 ref_source <- target_info$ref_source # Reference files source

 # Error handling for missing values
-if (is.na(target_taxid) || is.na(target_org_db) || is.na(target_species_designation) || is.na(gtf_link)) {
+if (is.na(target_taxid) || is.na(target_org_db) || is.na(target_organism) || is.na(gtf_link)) {
 stop(paste("Error: Missing data for target organism", target_organism, "in reference table."))
 }
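Note on the hunk above: the missing-value guard chains several `is.na()` calls and does not report which field was empty. A minimal sketch of the same check that names the offending fields is shown here. The helper `check_required_fields` and the sample values are hypothetical and are not part of the commit.

```r
# Hypothetical helper (not in the pipeline): report which required fields are NA.
check_required_fields <- function(fields, organism) {
  # A field counts as missing if it is not a single value or if it is NA
  is_missing <- vapply(fields, function(v) length(v) != 1 || is.na(v), logical(1))
  if (any(is_missing)) {
    stop("Error: Missing data for target organism ", organism,
         " in reference table: ",
         paste(names(fields)[is_missing], collapse = ", "),
         call. = FALSE)
  }
  invisible(TRUE)
}

# Illustrative values only; gtf_link is deliberately left as NA
fields <- list(target_taxid = 10090, target_org_db = "org.Mm.eg.db", gtf_link = NA)
result <- tryCatch(
  check_required_fields(fields, "Mus musculus"),
  error = function(e) conditionMessage(e)
)
print(result)
```

The check is written against a named list so that the error message can list every missing field at once rather than stopping at the first one.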

@@ -271,19 +270,19 @@ if ( file.exists(out_table_filename) ) {
 ```
 **Input Data:**

-- ref_tab_path (path to the reference table CSV file, output from [step 0](#0-set-up-environment))
-- target_organism (name of the target organism for which annotations are being generated)
+- `ref_tab_path` (variable specifying the path to the reference table CSV file, output from [step 0](#0-set-up-environment))
+- `target_organism` (variable specifying the full species name of the target organism for which annotations are being generated)
+- > *Note: This is provided as a positional argument when the R script is run.*

 **Output Data:**

-- target_taxid (taxonomic identifier for the target organism)
-- target_org_db (name of the org.db R package for the target organism)
-- target_species_designation (full species name of the target organism)
-- gtf_link (URL to the GTF file for the target organism)
-- target_short_name (PANTHER/UNIPROT short name for the target organism)
-- ref_source (source of the reference files, e.g., "ensembl", "ensembl_plants", "ensembl_bacteria", "ncbi")
-- out_table_filename (name of the output annotation table file)
-- out_log_filename (name of the output log file)
+- `target_taxid` (variable specifying the taxonomic identifier for the target organism)
+- `target_org_db` (variable specifying the name of the org.db R package for the target organism)
+- `gtf_link` (variable specifying the URL to the GTF file for the target organism)
+- `target_short_name` (variable specifying the PANTHER/UNIPROT short name for the target organism)
+- `ref_source` (variable specifying the source of the reference files, e.g., "ensembl", "ensembl_plants", "ensembl_bacteria", "ncbi")
+- `out_table_filename` (variable specifying the name of the output annotation table file)
+- `out_log_filename` (variable specifying the name of the output log file)

 <br>
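Note on the hunk above: the added note says `target_organism` is supplied as a positional argument when the R script is run. One way this could be read is sketched below, using base R's `commandArgs(trailingOnly = TRUE)`. The function name `get_target_organism` and the usage message are assumptions for illustration; the hunk does not show how the pipeline script parses its arguments.

```r
# Hypothetical helper: take the first positional argument as the species name.
get_target_organism <- function(args = commandArgs(trailingOnly = TRUE)) {
  if (length(args) < 1 || !nzchar(trimws(args[1]))) {
    stop("Usage: Rscript <script.R> '<Genus species>'", call. = FALSE)
  }
  trimws(args[1])
}

# Passing the arguments explicitly keeps the example runnable outside Rscript
target_organism <- get_target_organism(c("Mus musculus"))
print(target_organism)
```

Because the species name contains a space, it would need to be quoted on the command line so that it arrives as a single argument.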

@@ -299,9 +298,9 @@ BiocManager::install(target_org_db, ask = FALSE)
 if (!requireNamespace(target_org_db, quietly = TRUE)) {
 tryCatch({
 # Parse organism's name in the reference table to create the org.db name (target_org_db)
-genus_species <- strsplit(target_species_designation, " ")[[1]]
+genus_species <- strsplit(target_organism, " ")[[1]]
 if (length(genus_species) < 1) {
-stop("Species designation is not correctly formatted: ", target_species_designation)
+stop("Species designation is not correctly formatted: ", target_organism)
 }
 genus <- genus_species[1]
 species <- ifelse(length(genus_species) > 1, genus_species[2], "")
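Note on the hunk above: the code splits `target_organism` on a space to obtain the genus and species before building the org.db package name. A self-contained sketch of that parsing step follows. The `org.<G><species>.eg.db` naming pattern used in `build_org_db_name` is an assumption based on the usual Bioconductor convention; the hunk does not show how the pipeline assembles the name.

```r
# Hypothetical helper: split "Genus species" into its two parts.
parse_genus_species <- function(target_organism) {
  parts <- strsplit(trimws(target_organism), "\\s+")[[1]]
  if (length(parts) < 1 || !nzchar(parts[1])) {
    stop("Species designation is not correctly formatted: ", target_organism)
  }
  list(
    genus = parts[1],
    species = if (length(parts) > 1) parts[2] else ""
  )
}

# Assumed naming pattern: first letter of the genus followed by the species
build_org_db_name <- function(target_organism) {
  gs <- parse_genus_species(target_organism)
  paste0("org.", substr(gs$genus, 1, 1), gs$species, ".eg.db")
}

org_db_name <- build_org_db_name("Bacillus subtilis")
print(org_db_name)
```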
@@ -336,15 +335,14 @@ if (!requireNamespace(target_org_db, quietly = TRUE)) {

 **Input Data:**

-- target_org_db (name of the org.db R package for the target organism, output from [step 1](#1-define-variables-and-output-file-names))
-- target_species_designation (full species name of the target organism, output from [step 1](#1-define-variables-and-output-file-names))
-- ref_table (reference table containing organism-specific information, output from [step 1](#1-define-variables-and-output-file-names))
-- target_organism (name of the target organism, output from [step 1](#1-define-variables-and-output-file-names))
-- target_taxid (taxonomic identifier for the target organism, output from [step 1](#1-define-variables-and-output-file-names))
+- `target_org_db` (variable specifying the name of the org.db R package for the target organism, output from [step 1](#1-define-variables-and-output-file-names))
+- `ref_table` (variable specifying the reference table containing organism-specific information, output from [step 1](#1-define-variables-and-output-file-names))
+- `target_organism` (variable specifying the full species name of the target organism, output from [step 1](#1-define-variables-and-output-file-names))
+- `target_taxid` (variable specifying the taxonomic identifier for the target organism, output from [step 1](#1-define-variables-and-output-file-names))

 **Output Data:**

-- target_org_db (updated name of the org.db R package, if it was created locally)
+- `target_org_db` (variable specifying the updated name of the org.db R package, if it was created locally)
 - Locally installed org.db package (if the package is not available on Bioconductor, a new package is created and installed)

 <br>
@@ -380,16 +378,16 @@ if (!(target_organism %in% no_org_db) && (target_organism %in% currently_accepte

 **Input Data:**

-- gtf_link (URL to the GTF file for the target organism, output from [step 1](#1-define-variables-and-output-file-names))
-- target_org_db (name of the org.eg.db R package for the target organism, output from [steps 1](#1-define-variables-and-output-file-names) and [2](#2-create-the-organism-package-if-it-is-not-hosted-by-bioconductor))
-- target_organism (name of the target organism, output from [step 1](#1-define-variables-and-output-file-names))
-- currently_accepted_orgs (list of currently supported organisms, output from [step 0](#0-set-up-environment))
-- ref_tab_path (path to the reference table CSV file, output from [step 0](#0-set-up-environment))
+- `gtf_link` (variable specifying the URL to the GTF file for the target organism, output from [step 1](#1-define-variables-and-output-file-names))
+- `target_org_db` (variable specifying the name of the org.eg.db R package for the target organism, output from [steps 1](#1-define-variables-and-output-file-names) or [2](#2-create-the-organism-package-if-it-is-not-hosted-by-bioconductor))
+- `target_organism` (variable specifying the full species name of the target organism, output from [step 1](#1-define-variables-and-output-file-names))
+- `currently_accepted_orgs` (variable specifying the list of currently supported organisms, output from [step 0](#0-set-up-environment))
+- `ref_tab_path` (variable specifying the path to the reference table CSV file, output from [step 0](#0-set-up-environment))

 **Output Data:**

-- GTF (data frame containing the GTF file for the target organism)
-- no_org_db (list of organisms that do not use org.db annotations due to inconsistent gene names across GTF and org.db)
+- `GTF` (variable holding the data frame containing the GTF file for the target organism)
+- `no_org_db` (variable specifying the list of organisms that do not use org.db annotations due to inconsistent gene names across GTF and org.db)

 <br>

@@ -465,14 +463,14 @@ if (target_organism == "Salmonella enterica") {

 **Input Data:**

-- GTF (data frame containing the parsed GTF file for the target organism, output from [step 3](#3-load-annotation-databases))
-- target_organism (target organism's full species name, output from [step 1](#1-define-variables-and-output-file-names))
-- gtf_keytype_mappings (list of keys to extract from the GTF, for each organism)
+- `GTF` (variable holding the data frame containing the parsed GTF file for the target organism, output from [step 3](#3-load-annotation-databases))
+- `target_organism` (variable specifying the full species name of the target organism, output from [step 1](#1-define-variables-and-output-file-names))
+- `gtf_keytype_mappings` (variable specifying the list of keys to extract from the GTF, for each organism)

 **Output Data:**

-- annot_gtf (initial annotation table derived from the GTF file, containing only the relevant columns for the target organism)
-- primary_keytype (the name of the primary key type being used, e.g., "ENSEMBL", "TAIR", "LOCUS", based on the GTF gene_id entries)
+- `annot_gtf` (variable holding the initial annotation table derived from the GTF file, containing only the relevant columns for the target organism)
+- `primary_keytype` (variable specifying the name of the primary key type being used, e.g., "ENSEMBL", "TAIR", "LOCUS", based on the GTF gene_id entries)

 <br>

@@ -579,17 +577,17 @@ if (target_organism == "Saccharomyces cerevisiae") {

 **Input Data:**

-- annot_gtf (initial annotation table derived from the GTF file, output from [step 4](#4-build-initial-annotation-table))
-- target_organism (target organism's full species name, output from [step 1](#1-define-variables-and-output-file-names))
-- no_org_db (list of organisms that do not use annotations from an org.db, output from [step 3](#3-load-annotation-databases))
-- primary_keytype (the name of the primary key type being used, output from [step 4](#4-build-initial-annotation-table))
-- target_org_db (name of the org.eg.db R package for the target organism, output from [steps 1](#1-define-variables-and-output-file-names) and [2](#2-create-the-organism-package-if-it-is-not-hosted-by-bioconductor))
+- `annot_gtf` (variable holding the initial annotation table derived from the GTF file, output from [step 4](#4-build-initial-annotation-table))
+- `target_organism` (variable specifying the full species name of the target organism, output from [step 1](#1-define-variables-and-output-file-names))
+- `no_org_db` (variable specifying the list of organisms that do not use annotations from an org.db, output from [step 3](#3-load-annotation-databases))
+- `primary_keytype` (variable specifying the name of the primary key type being used, output from [step 4](#4-build-initial-annotation-table))
+- `target_org_db` (variable specifying the name of the org.eg.db R package for the target organism, output from [steps 1](#1-define-variables-and-output-file-names) or [2](#2-create-the-organism-package-if-it-is-not-hosted-by-bioconductor))

 **Output Data:**

-- annot_orgdb (updated annotation table with additional keys from the organism-specific org.db)
-- orgdb_query (the key type used to map to the org.db)
-- orgdb_keytype (the name of the key type in the org.db)
+- `annot_orgdb` (variable holding the updated annotation table with GTF and organism-specific org.db annotations)
+- `orgdb_query` (variable specifying the key type used to map to the org.db)
+- `orgdb_keytype` (variable specifying the name of the key type in the org.db)

 <br>

@@ -624,7 +622,6 @@ stringdb_query <- if (!is.null(stringdb_query_list[[target_organism]])) {
 uses_old_locus <- c("Lactobacillus acidophilus", "Mycobacterium marinum", "Serratia liquefaciens", "Streptococcus mutans", "Vibrio fischeri")
 # Handle STRING annotation processing based on the target organism
 if (target_organism %in% uses_old_locus) {
-# If the target organism is one of the NOENTRY organisms, handle the OLD_LOCUS splitting
 annot_stringdb <- annot_orgdb %>%
 separate_rows(!!sym(stringdb_query), sep = ",", convert = TRUE) %>%
 distinct() %>%
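Note on the hunk above: for organisms in `uses_old_locus`, comma-separated `OLD_LOCUS` entries are expanded to one row per entry before STRING is queried. A toy sketch of that expansion follows, assuming the `dplyr` and `tidyr` packages are installed. The two-row table and its identifiers are invented for illustration, and `all_of()` is used in place of the `!!sym()` form from the hunk to select the column by name.

```r
library(dplyr)
library(tidyr)

# Invented stand-in for annot_orgdb; identifiers are not real locus tags
annot_orgdb <- data.frame(
  LOCUS = c("gene_a", "gene_b"),
  OLD_LOCUS = c("old1,old2", "old3"),
  stringsAsFactors = FALSE
)
stringdb_query <- "OLD_LOCUS"

# One row per comma-separated entry, with exact duplicate rows dropped
annot_long <- annot_orgdb %>%
  separate_rows(all_of(stringdb_query), sep = ",") %>%
  distinct()

print(annot_long)
```

After the expansion, `gene_a` occupies two rows (one per old locus tag), so each tag can be matched against STRING on its own.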
@@ -705,17 +702,17 @@ annot_stringdb <- as.data.frame(annot_stringdb)

 **Input Data:**

-- annot_orgdb (annotation table with GTF and org.db annotations, output from [step 5](#5-add-orgdb-keys))
-- target_organism (target organism's full species name, output from [step 1](#1-define-variables-and-output-file-names))
-- primary_keytype (the name of the primary key type being used, output from [step 4](#4-build-initial-annotation-table))
-- target_taxid (taxonomic identifier for the target organism, output from [step 1](#1-define-variables-and-output-file-names))
+- `annot_orgdb` (variable holding the annotation table with GTF and organism-specific org.db annotations, output from [step 5](#5-add-orgdb-keys))
+- `target_organism` (variable specifying the full species name of the target organism, output from [step 1](#1-define-variables-and-output-file-names))
+- `primary_keytype` (variable specifying the name of the primary key type being used, output from [step 4](#4-build-initial-annotation-table))
+- `target_taxid` (variable specifying the taxonomic identifier for the target organism, output from [step 1](#1-define-variables-and-output-file-names))

 **Output Data:**

-- annot_stringdb (updated annotation table with added STRING IDs)
-- no_stringdb (list of organisms that do not use STRING annotations)
-- stringdb_query (the key type used for mapping to STRING database)
-- uses_old_locus (list of organisms where GTF gene_id entries do not match those in STRING, so entries in OLD_LOCUS are used to query STRING)
+- `annot_stringdb` (variable holding the updated annotation table with GTF, organism-specific org.db, and STRING annotations)
+- `no_stringdb` (variable specifying the list of organisms that do not use STRING annotations)
+- `stringdb_query` (variable specifying the key type used for mapping to STRING database)
+- `uses_old_locus` (variable specifying the list of organisms where GTF gene_id entries do not match those in STRING, so entries in OLD_LOCUS are used to query STRING)

 <br>

@@ -736,7 +733,6 @@ if (!(target_organism %in% no_panther_db)) {
 pantherdb_keytype = "ENTREZ"

 # Retrieve target organism PANTHER GO slim annotations database using the UNIPROT / PANTHER short name
-target_short_name <- target_species_designation
 pthOrganisms(PANTHER.db) <- target_short_name

 # Define a function to retrieve GO slim IDs for a given gene's ENTREZIDs, which may include entries separated by a "|"
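Note on the hunk above: the last context line describes a function that handles ENTREZ IDs joined by "|". The sketch below shows only the splitting and merging logic in base R, with a hypothetical named list (`go_lookup`) standing in for the PANTHER.db query. The IDs and GO terms are invented, and the real function in the pipeline is not shown in this hunk.

```r
# Hypothetical lookup standing in for a PANTHER.db query; values are invented
go_lookup <- list(
  "101" = c("GO:0000001", "GO:0000002"),
  "202" = c("GO:0000002")
)

# Split a "|"-separated ENTREZ field, collect GO slim IDs, and re-join them
get_go_slim_ids <- function(entrez_field, lookup) {
  if (is.na(entrez_field) || !nzchar(entrez_field)) {
    return(NA_character_)
  }
  # fixed = TRUE treats "|" literally instead of as regex alternation
  ids <- strsplit(entrez_field, "|", fixed = TRUE)[[1]]
  hits <- unique(unlist(lookup[ids], use.names = FALSE))
  if (length(hits) == 0) {
    return(NA_character_)
  }
  paste(sort(hits), collapse = "|")
}

merged <- get_go_slim_ids("101|202", go_lookup)
print(merged)
```

Passing `fixed = TRUE` to `strsplit()` matters here, because an unescaped "|" in a regular expression would split the string between every character.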
@@ -768,17 +764,13 @@ if (!(target_organism %in% no_panther_db)) {

 **Input Data:**

-- annot_orgdb (annotation table with GTF and org.db annotations, output from [step 5](#5-add-orgdb-keys))
-- target_organism (target organism's full species name, output from [step 1](#1-define-variables-and-output-file-names))
-- primary_keytype (the name of the primary key type being used, output from [step 4](#4-build-initial-annotation-table))
-- target_taxid (taxonomic identifier for the target organism, output from [step 1](#1-define-variables-and-output-file-names))
+- `annot_stringdb` (variable holding the annotation table with GTF, organism-specific org.db, and STRING annotations, output from [step 6](#6-add-string-ids))
+- `target_organism` (variable specifying the full species name of the target organism, output from [step 1](#1-define-variables-and-output-file-names))

 **Output Data:**

-- annot_stringdb (updated annotation table with added STRING IDs)
-- no_stringdb (list of organisms that do not use STRING annotations)
-- stringdb_query (the key type used for mapping to STRING database)
-- uses_old_locus (list of organisms where the 'gene_id' column in the GTF dataframe does not match STRING identifiers, so the 'old_locus_tag' column from the GTF dataframe is used to query STRING instead)
+- `annot_pantherdb` (variable holding the updated annotation table with GTF, organism-specific org.db, STRING, and PANTHER GO Slim annotations)
+- `no_panther_db` (variable specifying the list of organisms that do not use PANTHER annotations)

 <br>

@@ -827,19 +819,19 @@ write(capture.output(sessionInfo()), out_log_filename, append = TRUE)

 **Input Data:**

-- annot_pantherdb (annotation table with GTF, org.db, STRING, and PANTHER annotations, output from [step 7](#7-add-gene-ontology-go-slim-ids))
-- primary_keytype (the name of the primary key type being used, output from [step 4](#4-build-initial-annotation-table))
-- out_table_filename (name of the output annotation table file, output from [step 1](#1-define-variables-and-output-file-names))
-- out_log_filename (name of the output log file, output from [step 1](#1-define-variables-and-output-file-names))
-- GL_DPPD_ID (GeneLab Data Processing Pipeline Document ID, output from [step 0](#0-set-up-environment))
-- gtf_link (URL to the GTF file for the target organism, output from [step 1](#1-define-variables-and-output-file-names))
-- target_org_db (name of the org.eg.db R package for the target organism, output from [steps 1](#1-define-variables-and-output-file-names) and [2](#2-create-the-organism-package-if-it-is-not-hosted-by-bioconductor))
-- no_org_db (list of organisms that do not use org.db annotations, output from [step 3](#3-load-annotation-databases))
+- `annot_pantherdb` (variable holding the updated annotation table with GTF, organism-specific org.db, STRING, and PANTHER GO Slim annotations, output from [step 7](#7-add-gene-ontology-go-slim-ids))
+- `primary_keytype` (variable specifying the name of the primary key type being used, output from [step 4](#4-build-initial-annotation-table))
+- `out_table_filename` (variable specifying the name of the output annotation table file, output from [step 1](#1-define-variables-and-output-file-names))
+- `out_log_filename` (variable specifying the name of the output log file, output from [step 1](#1-define-variables-and-output-file-names))
+- `GL_DPPD_ID` (variable specifying the GeneLab Data Processing Pipeline Document ID, output from [step 0](#0-set-up-environment))
+- `gtf_link` (variable specifying the URL to the GTF file for the target organism, output from [step 1](#1-define-variables-and-output-file-names))
+- `target_org_db` (variable specifying the name of the org.eg.db R package for the target organism, output from [steps 1](#1-define-variables-and-output-file-names) or [2](#2-create-the-organism-package-if-it-is-not-hosted-by-bioconductor))
+- `no_org_db` (variable specifying the list of organisms that do not use org.db annotations, output from [step 3](#3-load-annotation-databases))

 **Output Data:**

-- annot (final annotation table with annotations from the GTF, org.db, STRING, and PANTHER)
-- ***-GL-annotations.tsv** (annot saved as a tab-delimited table file)
+- `annot` (variable holding the final annotation table with GTF, organism-specific org.db, STRING, and PANTHER GO Slim annotations)
+- ***-GL-annotations.tsv** (final annotation table saved as a tab-delimited table file)
 - ***-GL-build-info.txt** (annotation table build information log file)

 <br>
