Commit 0338c93

Merge pull request #125 from torres-alexis/DEV_GeneLab_Reference_Annotations_vGL-DPPD-7110-A
[GL_RefAnnotTable] Installation updates
2 parents 40e3652 + f839222 commit 0338c93

4 files changed

+200
-90
lines changed

Lines changed: 125 additions & 85 deletions
@@ -1,155 +1,195 @@
1-
# GL_RefAnnotTable Workflow Information and Usage Instructions
1+
# GL_RefAnnotTable-A Workflow Information and Usage Instructions <!-- omit in toc -->
22

3-
## General workflow info
4-
The current GeneLab Reference Annotation Table (GL_RefAnnotTable-A) pipeline is implemented as an R workflow that can be run from a command line interface (CLI) using bash. The workflow can be used even if you are unfamiliar with R, but if you want to learn more about R, visit the [R-project about page here](https://www.r-project.org/about.html). Additionally, an introduction to R along with installation help and information about using R for bioinformatics can be found [here at Happy Belly Bioinformatics](https://astrobiomike.github.io/R/basics).
3+
## Table of Contents <!-- omit in toc -->
54

6-
## Utilizing the workflow
5+
- [General Workflow Information](#general-workflow-information)
6+
- [Utilizing the Workflow](#utilizing-the-workflow)
7+
- [1. Download the Workflow Files](#1-download-the-workflow-files)
8+
- [2. Run the Workflow](#2-run-the-workflow)
9+
- [Approach 1: Using Singularity](#approach-1-using-singularity)
10+
- [Step 1: Install Singularity](#step-1-install-singularity)
11+
- [Step 2: Fetch the Singularity Image](#step-2-fetch-the-singularity-image)
12+
- [Step 3: Run the Workflow](#step-3-run-the-workflow)
13+
- [Step 4: Run the Annotations Database Creation Function as a Stand-Alone Script](#step-4-run-the-annotations-database-creation-function-as-a-stand-alone-script)
14+
- [Approach 2: Using a Local R Environment](#approach-2-using-a-local-r-environment)
15+
- [Step 1: Install R and Required R Packages](#step-1-install-r-and-required-r-packages)
16+
- [Step 2: Run the Workflow](#step-2-run-the-workflow)
17+
- [Step 3: Run the Annotations Database Creation Function as a Stand-Alone Script](#step-3-run-the-annotations-database-creation-function-as-a-stand-alone-script)
718

8-
1. [Install R and R packages](#1-install-r-and-r-packages)
9-
2. [Download the workflow files](#2-download-the-workflow-files)
10-
3. [Setup Execution Permission for Workflow Scripts](#3-setup-execution-permission-for-workflow-scripts)
11-
4. [Run the workflow](#4-run-the-workflow)
12-
5. [Run the annotations database creation function as a stand-alone script](#5-run-the-annotations-database-creation-function-as-a-stand-alone-script)
13-
6. [Run the Workflow Using Docker or Singularity](#6-run-the-workflow-using-docker-or-singularity)
14-
<br>
19+
---
1520

16-
### 1. Install R and R packages
21+
## General Workflow Information
1722

18-
We recommend installing R via the [Comprehensive R Archive Network (CRAN)](https://cran.r-project.org/) as follows:
23+
The current GeneLab Reference Annotation Table (GL_RefAnnotTable-A) pipeline is implemented as an R workflow that can be run from a command line interface (CLI) using bash. The workflow can be executed using either a Singularity container or a local R environment. The workflow can be used even if you are unfamiliar with R, but if you want to learn more about R, visit the [R-project about page here](https://www.r-project.org/about.html). Additionally, an introduction to R along with installation help and information about using R for bioinformatics can be found [here at Happy Belly Bioinformatics](https://astrobiomike.github.io/R/basics).
1924

20-
1. Select the [CRAN Mirror](https://cran.r-project.org/mirrors.html) closest to your location.
21-
2. Click the link under the "Download and Install R" section that's consistent with your machine.
22-
3. Click on the R-4.4.0 package consistent with your machine to download.
23-
4. Double click on the R-4.4.0.pkg downloaded in step 3 and follow the installation instructions.
25+
---
26+
27+
## Utilizing the Workflow
28+
29+
To utilize the GL_RefAnnotTable-A workflow, follow the instructions below to download the necessary workflow files. Once downloaded, the workflow can be executed using two approaches:
30+
31+
1. **[Using Singularity](#approach-1-using-singularity)**
32+
2. **[Using a Local R Environment](#approach-2-using-a-local-r-environment)**
33+
34+
Please follow the instructions for the approach that best matches your setup and preferences. Each method is explained in detail below.
2435

25-
Once R is installed, open a CLI terminal and run the following command to activate R:
36+
---
37+
38+
### 1. Download the Workflow Files
39+
40+
Download the latest version of the GL_RefAnnotTable-A workflow:
2641

2742
```bash
28-
R
43+
curl -LO https://github.com/nasa/GeneLab_Data_Processing/releases/download/GL_RefAnnotTable-A_1.1.0/GL_RefAnnotTable-A_1.1.0.zip
44+
unzip GL_RefAnnotTable-A_1.1.0.zip
2945
```
30-
`
31-
Within an active R environment, run the following commands to install the required R packages:
3246

33-
```R
34-
install.packages("tidyverse")
47+
---
3548

36-
install.packages("BiocManager")
49+
### 2. Run the Workflow
3750

38-
BiocManager::install("STRINGdb")
39-
BiocManager::install("PANTHER.db")
40-
BiocManager::install("rtracklayer")
41-
BiocManager::install("AnnotationForge")
42-
BiocManager::install("biomaRt")
43-
BiocManager::install("GO.db")
44-
```
51+
The GL_RefAnnotTable-A workflow can be run using two approaches:
4552

46-
<br>
53+
- **[Approach 1: Using Singularity](#approach-1-using-singularity)**
54+
- **[Approach 2: Using a Local R Environment](#approach-2-using-a-local-r-environment)**
4755

48-
### 2. Download the Workflow Files
56+
---
4957

50-
All files required for utilizing the GL_RefAnnotTable-A workflow for generating reference annotation tables are in the [workflow_code](workflow_code) directory. To get a copy of latest GL_RefAnnotTable version on to your system, run the following command:
58+
#### Approach 1: Using Singularity
5159

52-
```bash
53-
curl -LO https://github.com/nasa/GeneLab_Data_Processing/releases/download/GL_RefAnnotTable-A_1.1.0/GL_RefAnnotTable-A_1.1.0.zip
54-
```
60+
This approach allows you to run the workflow within a containerized environment, ensuring consistency and reproducibility.
61+
62+
##### Step 1: Install Singularity
63+
64+
Singularity is a containerization platform for running applications portably and reproducibly. We use container images hosted on Quay.io to encapsulate all the necessary software and dependencies required by the GL_RefAnnotTable-A workflow. This setup allows you to run the workflow without installing any software directly on your system. Other containerization tools like Docker or Apptainer can also be used to pull and run these images.
65+
66+
We recommend installing Singularity system-wide as per the official [Singularity installation documentation](https://docs.sylabs.io/guides/3.10/admin-guide/admin_quickstart.html).
67+
68+
> **Note**: While Singularity is also available through [Anaconda](https://anaconda.org/conda-forge/singularity), we recommend installing Singularity system-wide following the official installation documentation.
5569
56-
<br>
70+
##### Step 2: Fetch the Singularity Image
5771

58-
### 3. Setup Execution Permission for Workflow Scripts
72+
To pull the Singularity image needed for the workflow, you can use the provided script as directed below or pull the image directly.
5973

60-
Once you've downloaded the GL_RefAnnotTable-A workflow directory as a zip file, unzip the workflow then `cd` into the GL_RefAnnotTable-A_1.1.0 directory on the CLI. Next, run the following command to set the execution permissions for the R script:
74+
> **Note**: This command should be run in the location containing the `GL_RefAnnotTable-A_1.1.0` directory that was downloaded in [step 1](#1-download-the-workflow-files). Depending on your network speed, fetching the images will take approximately 20 minutes.
6175
6276
```bash
63-
unzip GL_RefAnnotTable-A_1.1.0.zip
64-
cd GL_RefAnnotTable-A_1.1.0
65-
chmod -R u+x *R
77+
bash GL_RefAnnotTable-A_1.1.0/bin/prepull_singularity.sh GL_RefAnnotTable-A_1.1.0/config/software/by_docker_image.config
6678
```
6779

68-
<br>
80+
Once complete, a `singularity` folder containing the Singularity images will be created. Run the following command to export this folder as an environment variable:
81+
82+
```bash
83+
export SINGULARITY_CACHEDIR=$(pwd)/singularity
84+
```
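Before moving on, it can help to confirm that the pre-pull step actually produced the image file that the later `singularity exec` commands point to. The following is a minimal sketch, not part of the workflow: `image_ready` is a hypothetical helper name, and the image file name is assumed to match the one used in the run commands in this README.

```bash
# Hypothetical helper (not shipped with the workflow): report whether the
# expected image file is present in the given cache directory.
# Arguments: cache directory, image file name.
image_ready() {
    local cache_dir="$1" image_name="$2"
    if [ -f "${cache_dir}/${image_name}" ]; then
        echo "found: ${cache_dir}/${image_name}"
    else
        echo "missing: ${cache_dir}/${image_name}" >&2
        return 1
    fi
}

# Example call. Falls back to ./singularity if SINGULARITY_CACHEDIR is unset;
# "|| true" keeps an interactive shell going if the image is not there yet.
image_ready "${SINGULARITY_CACHEDIR:-$(pwd)/singularity}" \
    "quay.io-nasa_genelab-gl-refannottable-a-1.1.0.img" || true
```

If the helper reports the file as missing, re-running the pre-pull script from the directory that contains `GL_RefAnnotTable-A_1.1.0` is the likely fix.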
6985

70-
### 4. Run the Workflow
86+
##### Step 3: Run the Workflow
7187

72-
While in the GL_RefAnnotTable workflow directory, you are now able to run the workflow. Below is an example of how to run the workflow to build an annotation table for Mus musculus (mouse):
88+
While in the directory containing the `GL_RefAnnotTable-A_1.1.0` folder, you can now run the workflow. Below is an example for generating the annotation table for *Mus musculus* (mouse):
7389

7490
```bash
75-
Rscript GL-DPPD-7110-A_build-genome-annots-tab.R 'Mus musculus'
91+
singularity exec -B $(pwd)/GL_RefAnnotTable-A_1.1.0:/work \
92+
$SINGULARITY_CACHEDIR/quay.io-nasa_genelab-gl-refannottable-a-1.1.0.img \
93+
Rscript /work/GL-DPPD-7110-A_build-genome-annots-tab.R 'Mus musculus'
7694
```
7795

7896
**Input data:**
7997

8098
- No input files are required. Specify the target organism using a positional command line argument. `Mus musculus` is used in the example above. To see a list of all available organisms, run `Rscript GL-DPPD-7110-A_build-genome-annots-tab.R` without positional arguments. The correct argument for each organism can also be found in the 'species' column of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv)
81-
8299
- Optional: a reference table CSV can be supplied as a second positional argument instead of using the default [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv)
83100

84101
**Output data:**
85102

86103
- *-GL-annotations.tsv (Tab-delimited table of gene annotations)
87104
- *-GL-build-info.txt (Text file containing information used to create the annotation table, including tool and tool versions and date of creation)
88105

89-
### 5. Run the annotations database creation function as a stand-alone script
106+
##### Step 4: Run the Annotations Database Creation Function as a Stand-Alone Script
90107

91-
When the workflow is run, if the reference table does not specify an annotations database for the target_organism in the `annotations` column, the `install_annotations` function, defined in the `install-org-db.R` script, will be executed. This script will locally create and install an annotations database R package using AnnotationForge. This function can also be run as a stand-alone script from the command line:
108+
If the reference table does not specify an annotations database for the target organism in the 'annotations' column, the `install_annotations` function (defined in `install-org-db.R`) will be executed. This function can also be run as a stand-alone script:
92109

93110
```bash
94-
Rscript install-org-db.R 'Bacillus subtilis' /path/to/GL-DPPD-7110-A_annotations.csv
111+
singularity exec -B $(pwd)/GL_RefAnnotTable-A_1.1.0:/work \
112+
$SINGULARITY_CACHEDIR/quay.io-nasa_genelab-gl-refannottable-a-1.1.0.img \
113+
Rscript /work/install-org-db.R 'Bacillus subtilis'
95114
```
96115

97116
**Input data:**
98117

99-
- The target organism must be specified as the first positional command line argument, `Bacillus subtilis` is used in the example above. The correct argument for each organism can be found in the 'species' column of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv)
100-
101-
- The path to a local reference table must also be supplied as the second positional argument
118+
- The target organism must be specified as the first positional command line argument. `Bacillus subtilis` is used in the example above. The correct argument for each organism can be found in the 'species' column of [GL-DPPD-7110-A_annotations.csv](https://raw.githubusercontent.com/nasa/GeneLab_Data_Processing/master/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv)
119+
- Optional: A local reference table can be supplied as a second positional argument. If not provided, the script will download the current version of GL-DPPD-7110-A_annotations.csv from GitHub by default.
102120

103121
**Output data:**
104122

105-
- org.*.eg.db/ (species-specific annotation database, as a local R package)
123+
- org.*.eg.db/ (Species-specific annotation database, as a local R package)
106124

107-
### 6. Run the Workflow Using Docker or Singularity
125+
---
108126

109-
Rather than running the workflow in your local environment, you can use a Docker or Singularity container. This method ensures that all dependencies are correctly installed.
127+
#### Approach 2: Using a Local R Environment
110128

111-
1. **Pull the container image:**
129+
This approach allows you to run the workflow directly in your local R environment without using containers.
112130

113-
Docker:
114-
```bash
115-
docker pull quay.io/nasa_genelab/gl-refannottable:v1.0.0
116-
```
131+
##### Step 1: Install R and Required R Packages
117132

118-
Singularity:
119-
```bash
120-
singularity pull docker://quay.io/nasa_genelab/gl-refannottable:v1.0.0
121-
```
133+
We recommend installing R via the [Comprehensive R Archive Network (CRAN)](https://cran.r-project.org/):
122134

123-
2. **Download the workflow files:**
135+
1. Select the [CRAN Mirror](https://cran.r-project.org/mirrors.html) closest to your location.
136+
2. Navigate to the download page for your operating system.
137+
3. Download and install R (e.g., R-4.4.0).
124138

125-
```bash
126-
curl -LO https://github.com/nasa/GeneLab_Data_Processing/releases/download/GL_RefAnnotTable-A_1.1.0/GL_RefAnnotTable-A_1.1.0.zip
127-
unzip GL_RefAnnotTable-A_1.1.0.zip
128-
```
139+
Once R is installed, you need to install the required R packages.
129140

130-
3. **Run the workflow:**
141+
Open a terminal and start R:
131142

132-
Docker:
133-
```bash
134-
docker run -it -v $(pwd)/GL_RefAnnotTable-A_1.1.0:/work \
135-
quay.io/nasa_genelab/gl-refannottable:v1.0.0 \
136-
bash -c "cd /work && Rscript GL-DPPD-7110-A_build-genome-annots-tab.R 'Mus musculus'"
137-
```
143+
```bash
144+
R
145+
```
138146

139-
Singularity:
140-
```bash
141-
singularity exec -B $(pwd)/GL_RefAnnotTable-A_1.1.0:/work \
142-
gl-refannottable_v1.0.0.sif \
143-
bash -c "cd /work && Rscript GL-DPPD-7110-A_build-genome-annots-tab.R 'Mus musculus'"
144-
```
147+
Within the R environment, run the following commands to install the required packages:
148+
149+
```R
150+
install.packages("tidyverse")
151+
install.packages("BiocManager")
152+
BiocManager::install("STRINGdb")
153+
BiocManager::install("PANTHER.db")
154+
BiocManager::install("rtracklayer")
155+
BiocManager::install("AnnotationForge")
156+
BiocManager::install("biomaRt")
157+
BiocManager::install("GO.db")
158+
```
159+
160+
##### Step 2: Run the Workflow
161+
162+
While in the directory containing the `GL_RefAnnotTable-A_1.1.0` folder, you can now run the workflow. Below is an example of how to run the workflow to build an annotation table for *Mus musculus* (mouse):
163+
164+
```bash
165+
Rscript GL_RefAnnotTable-A_1.1.0/GL-DPPD-7110-A_build-genome-annots-tab.R 'Mus musculus'
166+
```
145167

146168
**Input data:**
147169

148170
- No input files are required. Specify the target organism using a positional command line argument. `Mus musculus` is used in the example above. To see a list of all available organisms, run `Rscript GL-DPPD-7110-A_build-genome-annots-tab.R` without positional arguments. The correct argument for each organism can also be found in the 'species' column of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv)
149-
150171
- Optional: a reference table CSV can be supplied as a second positional argument instead of using the default [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv)
151172

152173
**Output data:**
153174

154175
- *-GL-annotations.tsv (Tab-delimited table of gene annotations)
155176
- *-GL-build-info.txt (Text file containing information used to create the annotation table, including tool and tool versions and date of creation)
177+
178+
##### Step 3: Run the Annotations Database Creation Function as a Stand-Alone Script
179+
180+
If the reference table does not specify an annotations database for the target organism in the 'annotations' column, the `install_annotations` function (defined in `install-org-db.R`) will be executed. This function can also be run as a stand-alone script:
181+
182+
```bash
183+
Rscript GL_RefAnnotTable-A_1.1.0/install-org-db.R 'Bacillus subtilis'
184+
```
185+
186+
**Input data:**
187+
188+
- The target organism must be specified as the first positional command line argument. `Bacillus subtilis` is used in the example above. The correct argument for each organism can be found in the 'species' column of [GL-DPPD-7110-A_annotations.csv](https://raw.githubusercontent.com/nasa/GeneLab_Data_Processing/master/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv)
189+
- Optional: A local reference table can be supplied as a second positional argument. If not provided, the script will download the current version of GL-DPPD-7110-A_annotations.csv from GitHub by default.
190+
191+
**Output data:**
192+
193+
- org.*.eg.db/ (species-specific annotation database, as a local R package)
194+
195+
---
@@ -0,0 +1,32 @@
1+
2+
#!/usr/bin/env bash
3+
4+
# Addresses issue: https://github.com/nextflow-io/nextflow/issues/1210
5+
6+
CONFILE=${1:-nextflow.config}
7+
OUTDIR=${2:-./singularity}
8+
9+
if [ ! -e $CONFILE ]; then
10+
echo "$CONFILE does not exist"
11+
exit 1
12+
fi
13+
14+
TMPFILE=`mktemp`
15+
16+
CURDIR=$(pwd)
17+
18+
mkdir -p $OUTDIR
19+
20+
cat ${CONFILE}|grep 'container'|perl -lane 'if ( $_=~/container\s*\=\s*\"(\S+)\"/ ) { $_=~/container\s*\=\s*\"(\S+)\"/; print $1 unless ( $1=~/^\s*$/ or $1=~/\.sif/ or $1=~/\.img/ ) ; }' > $TMPFILE
21+
22+
cd ${OUTDIR}
23+
24+
while IFS= read -r line; do
25+
name=$line
26+
name=${name/:/-}
27+
name=${name//\//-}
28+
echo $name
29+
singularity pull ${name}.img docker://$line
30+
done < $TMPFILE
31+
32+
cd $CURDIR
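The loop in this script derives each image file name from the container reference with two bash parameter expansions. The sketch below isolates that step so the resulting name can be checked against the path used in the README; `image_file_name` is a hypothetical function name introduced only for illustration.

```bash
# Mirror the two substitutions used in the pull loop.
# image_file_name is a hypothetical name; the script does this inline.
image_file_name() {
    local name="$1"
    name=${name/:/-}      # replace the first ":" (the tag separator) with "-"
    name=${name//\//-}    # replace every "/" with "-"
    echo "${name}.img"
}

image_file_name "quay.io/nasa_genelab/gl-refannottable-a:1.1.0"
# prints: quay.io-nasa_genelab-gl-refannottable-a-1.1.0.img
```

Note that only the first `:` is replaced, so a reference with a registry port (for example `host:5000/image:tag`) would keep its second colon in the file name.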
@@ -0,0 +1,6 @@
1+
// Config that specifies containers for nextflow processes
2+
process {
3+
withName: 'GL_REFANNOTTABLE_A' {
4+
container = "quay.io/nasa_genelab/gl-refannottable-a:1.1.0"
5+
}
6+
}
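The pre-pull script reads `container = "..."` lines from this config to decide which images to fetch. As a rough illustration, the same value can be listed with `sed` instead of the script's `perl` one-liner. This is a simplified sketch under stated assumptions: `list_containers` is a hypothetical helper, it works on a throwaway copy of the config, and unlike the script it does not skip values that already point to `.sif` or `.img` files.

```bash
# Write a throwaway copy of the config shown above.
config_file=$(mktemp)
cat > "$config_file" <<'EOF'
// Config that specifies containers for nextflow processes
process {
    withName: 'GL_REFANNOTTABLE_A' {
        container = "quay.io/nasa_genelab/gl-refannottable-a:1.1.0"
    }
}
EOF

# Hypothetical helper: print the quoted value of every `container = "..."` line.
list_containers() {
    sed -n 's/.*container[[:space:]]*=[[:space:]]*"\([^"]*\)".*/\1/p' "$1"
}

list_containers "$config_file"
# prints: quay.io/nasa_genelab/gl-refannottable-a:1.1.0

rm -f "$config_file"
```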

GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/install-org-db.R

Lines changed: 37 additions & 5 deletions
@@ -1,14 +1,31 @@
11
# install-org-db.R
22

3+
# Set R library path to current working directory
4+
lib_path <- file.path(getwd())
5+
.libPaths(lib_path)
6+
7+
# Load required libraries
8+
library(tidyverse)
9+
library(AnnotationForge)
10+
library(BiocManager)
11+
312
# Function: Get annotations db from ref table. If no annotations db is defined, create the package name from genus, species, (and strain for microbes),
413
# Try to Bioconductor install annotations db. If fail then build the package using AnnotationForge, install it into the current directory.
514
# Requires ~80GB for NCBIFilesDir file caching
6-
install_annotations <- function(target_organism, refTablePath) {
7-
if (!file.exists(refTablePath)) {
8-
stop("Reference table file does not exist at the specified path: ", refTablePath)
9-
}
15+
install_annotations <- function(target_organism, refTablePath = NULL) {
16+
# Default URL for the specific version of the reference CSV
17+
default_url <- "https://raw.githubusercontent.com/nasa/GeneLab_Data_Processing/master/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv"
18+
19+
# Use the provided path if available, otherwise use the default URL
20+
csv_source <- ifelse(is.null(refTablePath), default_url, refTablePath)
21+
22+
# Attempt to read the CSV file
23+
ref_table <- tryCatch({
24+
read.csv(csv_source)
25+
}, error = function(e) {
26+
stop("Failed to read the reference table: ", e$message)
27+
})
1028

11-
ref_table <- read.csv(refTablePath)
1229
target_taxid <- ref_table %>%
1330
filter(species == target_organism) %>%
1431
pull(taxon)
@@ -52,6 +69,7 @@ install_annotations <- function(target_organism, refTablePath) {
5269
} else {
5370
cat(paste0("\nAttempting to install '", target_org_db, "' from Bioconductor...\n"))
5471
BiocManager::install(target_org_db, ask = FALSE)
72+
5573
if (requireNamespace(target_org_db, quietly = TRUE)) {
5674
cat(paste0("'", target_org_db, "' has been successfully installed from Bioconductor.\n"))
5775
} else {
@@ -85,3 +103,17 @@ install_annotations <- function(target_organism, refTablePath) {
85103
cat(paste0("Using Annotation Database '", target_org_db, "'.\n"))
86104
return(target_org_db)
87105
}
106+
107+
if (!interactive()) {
108+
# Parse command line arguments
109+
args <- commandArgs(trailingOnly = TRUE)
110+
111+
if (length(args) < 1) {
112+
stop("Usage: Rscript install-org-db.R <target_organism> [refTablePath]")
113+
}
114+
115+
target_organism <- args[1]
116+
refTablePath <- if (length(args) > 1) args[2] else NULL
117+
118+
install_annotations(target_organism, refTablePath)
119+
}
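With this change the script takes one required argument (the organism) and one optional argument (a local reference table). A minimal shell sketch of a caller that follows the same convention is shown below. `org_db_command` is a hypothetical wrapper, not part of the workflow, and it only prints the command it would run rather than invoking `Rscript`.

```bash
# Hypothetical wrapper: validate the arguments, then print the Rscript
# command that would be run (printing keeps the sketch free of side effects).
org_db_command() {
    if [ "$#" -lt 1 ]; then
        echo "Usage: org_db_command <target_organism> [refTablePath]" >&2
        return 1
    fi
    local organism="$1" ref_table="${2:-}"
    if [ -n "$ref_table" ]; then
        # Optional second argument: path to a local reference table.
        echo "Rscript install-org-db.R '$organism' '$ref_table'"
    else
        # No table given: the R script falls back to its default URL.
        echo "Rscript install-org-db.R '$organism'"
    fi
}

org_db_command 'Bacillus subtilis'
# prints: Rscript install-org-db.R 'Bacillus subtilis'
```

Quoting the organism matters because species names contain a space and must reach the R script as a single positional argument.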
