Commit 232e421

[GL_RefAnnotTable] add container + local instructions
1 parent 4f181bf commit 232e421

4 files changed: 200 additions and 43 deletions
Lines changed: 179 additions & 36 deletions
@@ -1,85 +1,231 @@
 # GL_RefAnnotTable-A Workflow Information and Usage Instructions <!-- omit in toc -->
 
-## General Workflow Info <!-- omit in toc -->
+## Table of Contents <!-- omit in toc -->
+- [General Workflow Info](#general-workflow-info)
+- [Utilizing the Workflow](#utilizing-the-workflow)
+  - [Approach 1: Using Apptainer](#approach-1-using-apptainer)
+    - [1. Install Apptainer](#1-install-apptainer)
+    - [2. Download the Workflow Files](#2-download-the-workflow-files)
+    - [3. Fetch Apptainer Image](#3-fetch-apptainer-image)
+    - [4. Run the Workflow](#4-run-the-workflow)
+    - [5. Run the Annotations Database Creation Function as a Stand-Alone Script](#5-run-the-annotations-database-creation-function-as-a-stand-alone-script)
+  - [Approach 2: Using a Local R Environment](#approach-2-using-a-local-r-environment)
+    - [1. Install R and Required R Packages](#1-install-r-and-required-r-packages)
+    - [2. Download the Workflow Files](#2-download-the-workflow-files-1)
+    - [3. Set Execution Permissions for Workflow Scripts](#3-set-execution-permissions-for-workflow-scripts)
+    - [4. Run the Workflow](#4-run-the-workflow-1)
+    - [5. Run the Annotations Database Creation Function as a Stand-Alone Script](#5-run-the-annotations-database-creation-function-as-a-stand-alone-script-1)
 
-### Implementation Tools <!-- omit in toc -->
+<br>
+
+---
 
-The current GeneLab Reference Annotation Table (GL_RefAnnotTable-A) pipeline is implemented as an R workflow and utilizes [Singularity](https://docs.sylabs.io/guides/3.10/user-guide/introduction.html) to run all tools in a containerized environment. This workflow is run using the command line interface (CLI) of any unix-based system.
+## General Workflow Info
+
+The current GeneLab Reference Annotation Table (GL_RefAnnotTable-A) pipeline is implemented as an R workflow that can be run from a command line interface (CLI) using bash. The workflow can be executed using either an Apptainer (formerly Singularity) container or a local R environment. The workflow can be used even if you are unfamiliar with R, but if you want to learn more about R, visit the [R-project about page here](https://www.r-project.org/about.html). Additionally, an introduction to R along with installation help and information about using R for bioinformatics can be found [here at Happy Belly Bioinformatics](https://astrobiomike.github.io/R/basics).
 
 <br>
 
 ---
+
 ## Utilizing the Workflow
 
-1. [Install Singularity](#1-install-singularity)
-2. [Download the Workflow Files](#2-download-the-workflow-files)
-3. [Fetch Singularity Images](#3-fetch-singularity-images)
-4. [Run the Workflow](#4-run-the-workflow)
-5. [Run the annotations database creation function as a stand-alone script](#5-run-the-annotations-database-creation-function-as-a-stand-alone-script)
+The GL_RefAnnotTable-A workflow can be run using two approaches:
+
+1. **[Using Apptainer](#approach-1-using-apptainer)**.
+
+2. **[Using a local R environment](#approach-2-using-a-local-r-environment)**.
+
+Please follow the instructions for the approach that best matches your setup and preferences. Each method is explained in the sections below.
 
 <br>
 
 ---
 
-### 1. Install Singularity
+### Approach 1: Using Apptainer
+
+This approach allows you to run the workflow within a containerized environment, ensuring consistency and reproducibility.
+
+<br>
+
+---
 
-Singularity is a container platform that allows usage of containerized software. This enables the GL_RefAnnotTable-A workflow to retrieve and use all software required for processing without the need to install the software directly on the user's system.
+#### 1. Install Apptainer
 
-We recommend installing Singularity on a system wide level as per the associated [documentation](https://docs.sylabs.io/guides/3.10/admin-guide/admin_quickstart.html).
+Apptainer can be installed either through [Anaconda](https://anaconda.org/conda-forge/singularity) or as documented on the [Apptainer documentation page](https://apptainer.org/docs/admin/main/installation.html).
 
-> Note: Singularity is also available through [Anaconda](https://anaconda.org/conda-forge/singularity).
+> **Note**: If you prefer to use Anaconda, we recommend installing Miniconda for your system, as instructed by [Happy Belly Bioinformatics](https://astrobiomike.github.io/unix/conda-intro#getting-and-installing-conda).
+>
+> Once conda is installed on your system, you can install Apptainer by running:
+>
+> ```bash
+> conda install -c conda-forge apptainer
+> ```
 
 <br>
 
 ---
 
-### 2. Download the Workflow Files
+#### 2. Download the Workflow Files
 
-All files required for utilizing the GL_RefAnnotTable-A workflow for generating reference annotation tables are in the [workflow_code](workflow_code) directory. To get a copy of latest GL_RefAnnotTable-A version on to your system, run the following commands:
+Download the latest version of the GL_RefAnnotTable-A workflow:
 
 ```bash
 curl -LO https://github.com/nasa/GeneLab_Data_Processing/releases/download/GL_RefAnnotTable-A_1.1.0/GL_RefAnnotTable-A_1.1.0.zip
 unzip GL_RefAnnotTable-A_1.1.0.zip
+cd GL_RefAnnotTable-A_1.1.0
+```
+
+<br>
+
+---
+
+#### 3. Fetch Apptainer Image
+
+To fetch the Apptainer images needed for the workflow, run:
+
+```bash
+bash bin/prepull_apptainer.sh config/software/by_docker_image.config
+```
+> Note: This command should be run from within the `GL_RefAnnotTable-A_1.1.0` directory entered in [step 2](#2-download-the-workflow-files), because the script and config paths are relative to it. Depending on your network speed, this may take approximately 20 minutes.
+
+Once complete, an `apptainer` folder containing the Apptainer images will be created. Export this folder as an Apptainer configuration environment variable:
+
+```bash
+export APPTAINER_CACHEDIR=$(pwd)/apptainer
+```
+
+<br>
+
+---
+
+#### 4. Run the Workflow
+
+While in the `GL_RefAnnotTable-A_1.1.0` directory, you can now run the workflow. Below is an example for generating an annotation table for Mus musculus (mouse):
+
+```bash
+apptainer exec -B $(pwd):/work \
+  $APPTAINER_CACHEDIR/quay.io-nasa_genelab-gl-refannottable-a-1.1.0.img \
+  bash -c "cd /work && Rscript GL-DPPD-7110-A_build-genome-annots-tab.R 'Mus musculus'"
+```
+
+**Input data:**
+
+- No input files are required.
+- Specify the target organism using a positional command line argument. `Mus musculus` is used in the example above.
+- To see a list of all available organisms, run `Rscript GL-DPPD-7110-A_build-genome-annots-tab.R` without positional arguments. The correct argument for each organism can also be found in the 'species' column of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv)
+- Optional: a reference table CSV can be supplied as a second positional argument instead of using the default [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv)
+
+**Output data:**
+
+- *-GL-annotations.tsv (Tab-delimited table of gene annotations)
+- *-GL-build-info.txt (Text file containing information used to create the annotation table, including tools, tool versions, and date of creation)
+
+<br>
+
+---
+
+#### 5. Run the Annotations Database Creation Function as a Stand-Alone Script
+
+If the reference table does not specify an annotations database for the target organism in the annotations column, the `install_annotations` function (defined in `install-org-db.R`) will be executed. This function can also be run as a stand-alone script:
+
+```bash
+apptainer exec -B $(pwd):/work \
+  $APPTAINER_CACHEDIR/quay.io-nasa_genelab-gl-refannottable-a-1.1.0.img \
+  bash -c "cd /work && Rscript install-org-db.R 'Bacillus subtilis'"
 ```
 
+**Input data:**
+
+- The target organism must be specified as the first positional command line argument. `Bacillus subtilis` is used in the example above. The correct argument for each organism can be found in the 'species' column of [GL-DPPD-7110-A_annotations.csv](https://raw.githubusercontent.com/nasa/GeneLab_Data_Processing/master/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv)
+- Optional: A local reference table can be supplied as a second positional argument. If not provided, the script will download the current version of GL-DPPD-7110-A_annotations.csv from GitHub by default.
+
+**Output data:**
+
+- org.*.eg.db/ (Species-specific annotation database, as a local R package)
+
 <br>
 
 ---
 
-### 3. Fetch Singularity Images
+### Approach 2: Using a Local R Environment
+
+This approach allows you to run the workflow directly in your local R environment without using Apptainer containers.
+
+<br>
+
+---
+
+#### 1. Install R and Required R Packages
+
+We recommend installing R via the [Comprehensive R Archive Network (CRAN)](https://cran.r-project.org/):
 
-Although Singularity can fetch images from a url, doing so may cause issues as detailed [here](https://github.com/nextflow-io/nextflow/issues/1210).
+1. Select the [CRAN Mirror](https://cran.r-project.org/mirrors.html) closest to your location.
+2. Navigate to the download page for your operating system.
+3. Download and install R (e.g., R-4.4.0).
 
-To avoid this issue, run the following command to fetch the Singularity images prior to running the GL_RefAnnotTable-A workflow:
-> Note: This command should be run in the location containing the `GL_RefAnnotTable-A_1.1.0` directory that was downloaded in [step 2](#2-download-the-workflow-files) above. Depending on your network speed, fetching the images will take ~20 minutes.
+Once R is installed, open a terminal and start R:
 
 ```bash
-bash GL_RefAnnotTable-A_1.1.0/bin/prepull_singularity.sh GL_RefAnnotTable-A_1.1.0/config/software/by_docker_image.config
+R
 ```
 
-Once complete, a `singularity` folder containing the Singularity images will be created. Run the following command to export this folder as a Singularity configuration environment variable:
+Within an active R environment, run the following commands to install the required R packages:
+
+```R
+install.packages("tidyverse")
+
+install.packages("BiocManager")
+
+BiocManager::install("STRINGdb")
+BiocManager::install("PANTHER.db")
+BiocManager::install("rtracklayer")
+BiocManager::install("AnnotationForge")
+BiocManager::install("biomaRt")
+BiocManager::install("GO.db")
+```
+
+<br>
+
+---
+
+#### 2. Download the Workflow Files
+
+All files required for utilizing the GL_RefAnnotTable-A workflow for generating reference annotation tables are in the [workflow_code](workflow_code) directory. To get a copy of the latest GL_RefAnnotTable-A version onto your system, run the following command:
+
+```bash
+curl -LO https://github.com/nasa/GeneLab_Data_Processing/releases/download/GL_RefAnnotTable-A_1.1.0/GL_RefAnnotTable-A_1.1.0.zip
+```
+
+<br>
+
+---
+
+#### 3. Set Execution Permissions for Workflow Scripts
+
+Once you've downloaded the GL_RefAnnotTable-A workflow directory as a zip file, unzip the workflow, then `cd` into the `GL_RefAnnotTable-A_1.1.0` directory on the CLI. Next, run the following command to set the execution permissions for the R scripts:
 
 ```bash
-export SINGULARITY_CACHEDIR=$(pwd)/singularity
+unzip GL_RefAnnotTable-A_1.1.0.zip
+cd GL_RefAnnotTable-A_1.1.0
+chmod -R u+x *R
 ```
 
 <br>
 
 ---
 
-### 4. Run the Workflow
+#### 4. Run the Workflow
 
-While in the location containing the `GL_RefAnnotTable-A_1.1.0` directory that was downloaded in [step 2](#2-download-the-workflow-files), you are now able to run the workflow. Below is an example of how to run the workflow to build an annotation table for Mus musculus (mouse):
+While in the GL_RefAnnotTable workflow directory, you are now able to run the workflow. Below is an example of how to run the workflow to build an annotation table for Mus musculus (mouse):
 
 ```bash
-singularity exec -B $(pwd)/GL_RefAnnotTable-A_1.1.0:/work \
-  $SINGULARITY_CACHEDIR/gl-refannottable_v1.0.0.sif \
-  bash -c "cd /work && Rscript GL-DPPD-7110-A_build-genome-annots-tab.R 'Mus musculus'"
+Rscript GL-DPPD-7110-A_build-genome-annots-tab.R 'Mus musculus'
 ```
 
 **Input data:**
 
-- No input files are required. Specify the target organism using a positional command line argument. `Mus musculus` is used in the example above. To see a list of all available organisms, run the command without positional arguments. The correct argument for each organism can also be found in the 'species' column of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv)
+- No input files are required. Specify the target organism using a positional command line argument. `Mus musculus` is used in the example above. To see a list of all available organisms, run `Rscript GL-DPPD-7110-A_build-genome-annots-tab.R` without positional arguments. The correct argument for each organism can also be found in the 'species' column of [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv)
 
 - Optional: a reference table CSV can be supplied as a second positional argument instead of using the default [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv)
 
@@ -92,24 +238,21 @@ singularity exec -B $(pwd)/GL_RefAnnotTable-A_1.1.0:/work \
 
 ---
 
-### 5. Run the annotations database creation function as a stand-alone script
+#### 5. Run the Annotations Database Creation Function as a Stand-Alone Script
 
-When the workflow is run, if the reference table does not specify an annotations database for the target_organism in the `annotations` column, the `install_annotations` function, defined in the `install-org-db.R` script, will be executed. This script will locally create and install an annotations database R package using AnnotationForge. This function can also be run as a stand-alone script from the command line:
+If the reference table does not specify an annotations database for the target organism in the 'annotations' column, the `install_annotations` function (defined in `install-org-db.R`) will be executed. This function can also be run as a stand-alone script:
 
 ```bash
-singularity exec -B $(pwd)/GL_RefAnnotTable-A_1.1.0:/work \
-  $SINGULARITY_CACHEDIR/gl-refannottable_v1.0.0.sif \
-  bash -c "cd /work && Rscript install-org-db.R 'Bacillus subtilis' /path/to/GL-DPPD-7110-A_annotations.csv"
+Rscript install-org-db.R 'Bacillus subtilis'
 ```
 
 **Input data:**
 
-- The target organism must be specified as the first positional command line argument, `Bacillus subtilis` is used in the example above. The correct argument for each organism can be found in the 'species' column of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv)
-
-- The path to a local reference table must also be supplied as the second positional argument
+- The target organism must be specified as the first positional command line argument. `Bacillus subtilis` is used in the example above. The correct argument for each organism can be found in the 'species' column of [GL-DPPD-7110-A_annotations.csv](https://raw.githubusercontent.com/nasa/GeneLab_Data_Processing/master/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv)
+- Optional: A local reference table can be supplied as a second positional argument. If not provided, the script will download the current version of GL-DPPD-7110-A_annotations.csv from GitHub by default.
 
 **Output data:**
 
 - org.*.eg.db/ (species-specific annotation database, as a local R package)
 
-<br>
+<br>
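
As a rough illustration of how the Approach 1 run command from the README generalizes to several organisms, the sketch below only prints the `apptainer exec` command for each organism instead of running it. The helper name `annot_table_cmd` is hypothetical, `'Homo sapiens'` is assumed (not confirmed by this commit) to be a valid entry in the 'species' column, and the image file name is copied from the README example.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the workflow): prints, without running it,
# the README's Approach 1 step 4 container command for one organism.
IMAGE_FILE="quay.io-nasa_genelab-gl-refannottable-a-1.1.0.img"
BUILD_SCRIPT="GL-DPPD-7110-A_build-genome-annots-tab.R"

annot_table_cmd() {
    local organism="$1"
    # Fall back to ./apptainer when APPTAINER_CACHEDIR has not been exported.
    local cache_dir="${APPTAINER_CACHEDIR:-$(pwd)/apptainer}"
    printf "apptainer exec -B %s:/work %s/%s bash -c \"cd /work && Rscript %s '%s'\"\n" \
        "$(pwd)" "$cache_dir" "$IMAGE_FILE" "$BUILD_SCRIPT" "$organism"
}

# 'Homo sapiens' is an assumed species name; check the 'species' column first.
for organism in 'Mus musculus' 'Homo sapiens'; do
    annot_table_cmd "$organism"
done
```

Printing first makes it easy to review each command before running it from inside the `GL_RefAnnotTable-A_1.1.0` directory.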
Lines changed: 2 additions & 2 deletions
@@ -4,7 +4,7 @@
 # Addresses issue: https://github.com/nextflow-io/nextflow/issues/1210
 
 CONFILE=${1:-nextflow.config}
-OUTDIR=${2:-./singularity}
+OUTDIR=${2:-./apptainer}
 
 if [ ! -e $CONFILE ]; then
     echo "$CONFILE does not exist"
@@ -26,7 +26,7 @@ while IFS= read -r line; do
     name=${name/:/-}
     name=${name//\//-}
     echo $name
-    singularity pull ${name}.img docker://$line
+    apptainer pull ${name}.img docker://$line
 done < $TMPFILE
 
 cd $CURDIR
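
The two parameter expansions in the loop account for the image file name used in the README's `apptainer exec` example. A minimal sketch, assuming `name` starts out as the Docker image reference read from the config (the function name `image_file_name` is hypothetical and not part of the script):

```bash
#!/usr/bin/env bash
# Hypothetical helper reproducing the renaming steps shown in the hunk above.
image_file_name() {
    local name="$1"
    name=${name/:/-}      # replace the first ':' (the tag separator) with '-'
    name=${name//\//-}    # replace every '/' with '-'
    echo "${name}.img"
}

image_file_name "quay.io/nasa_genelab/gl-refannottable-a:1.1.0"
# prints: quay.io-nasa_genelab-gl-refannottable-a-1.1.0.img
```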
@@ -0,0 +1,6 @@
+// Config that specifies containers for nextflow processes
+process {
+    withName: 'GL_REFANNOTTABLE_A' {
+        container = "quay.io/nasa_genelab/gl-refannottable-a:1.1.0"
+    }
+}

GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/workflow_code/install-org-db.R

Lines changed: 13 additions & 5 deletions
@@ -3,12 +3,20 @@
 # Function: Get annotations db from ref table. If no annotations db is defined, create the package name from genus, species, (and strain for microbes),
 # Try to Bioconductor install annotations db. If fail then build the package using AnnotationForge, install it into the current directory.
 # Requires ~80GB for NCBIFilesDir file caching
-install_annotations <- function(target_organism, refTablePath) {
-    if (!file.exists(refTablePath)) {
-        stop("Reference table file does not exist at the specified path: ", refTablePath)
-    }
+install_annotations <- function(target_organism, refTablePath = NULL) {
+    # Default URL for the specific version of the reference CSV
+    default_url <- "https://raw.githubusercontent.com/nasa/GeneLab_Data_Processing/master/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv"
+
+    # Use the provided path if available, otherwise use the default URL
+    csv_source <- ifelse(is.null(refTablePath), default_url, refTablePath)
+
+    # Attempt to read the CSV file
+    tryCatch({
+        ref_table <- read.csv(csv_source)
+    }, error = function(e) {
+        stop("Failed to read the reference table: ", e$message)
+    })
 
-    ref_table <- read.csv(refTablePath)
     target_taxid <- ref_table %>%
         filter(species == target_organism) %>%
         pull(taxon)
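
The `filter(species == target_organism) %>% pull(taxon)` step can be mirrored on the command line to check a local reference table before passing it to the script. The sketch below is a naive lookup that assumes a plain comma-separated file with `species` and `taxon` header columns and no quoted commas. The function name `lookup_taxon` is hypothetical, and the two sample rows are written for illustration (NCBI taxonomy IDs for mouse and human) rather than copied from GL-DPPD-7110-A_annotations.csv.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the 'taxon' value for rows whose 'species'
# column equals the requested organism. Naive CSV parsing (no quoted commas).
lookup_taxon() {
    local csv="$1" organism="$2"
    awk -F',' -v org="$organism" '
        NR == 1 {
            # Locate the two columns by header name.
            for (i = 1; i <= NF; i++) {
                if ($i == "species") s = i
                if ($i == "taxon") t = i
            }
            if (!s || !t) exit 2
            next
        }
        $s == org { print $t }
    ' "$csv"
}

# Illustrative sample table, not the real reference CSV.
sample_csv="$(mktemp)"
printf 'species,taxon\nMus musculus,10090\nHomo sapiens,9606\n' > "$sample_csv"
lookup_taxon "$sample_csv" 'Mus musculus'   # prints: 10090
rm -f "$sample_csv"
```

An empty result suggests the organism string does not match the 'species' column exactly, which is the same condition under which the R `filter` call would return no rows.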
