Commit bd917b4

Formatting updates
1 parent 0338c93 commit bd917b4

1 file changed: 80 additions, 44 deletions
GeneLab_Reference_Annotations/Workflow_Documentation/GL_RefAnnotTable-A/README.md

Lines changed: 80 additions & 44 deletions
@@ -16,25 +16,20 @@
1616
- [Step 2: Run the Workflow](#step-2-run-the-workflow)
1717
- [Step 3: Run the Annotations Database Creation Function as a Stand-Alone Script](#step-3-run-the-annotations-database-creation-function-as-a-stand-alone-script)
1818

19+
<br>
20+
1921
---
2022

2123
## General Workflow Information
2224

2325
The current GeneLab Reference Annotation Table (GL_RefAnnotTable-A) pipeline is implemented as an R workflow that can be run from a command line interface (CLI) using bash. The workflow can be executed using either a Singularity container or a local R environment. The workflow can be used even if you are unfamiliar with R, but if you want to learn more about R, visit the [R-project about page here](https://www.r-project.org/about.html). Additionally, an introduction to R along with installation help and information about using R for bioinformatics can be found [here at Happy Belly Bioinformatics](https://astrobiomike.github.io/R/basics).
2426

27+
<br>
28+
2529
---
2630

2731
## Utilizing the Workflow
2832

29-
To utilize the GL_RefAnnotTable-A workflow, follow the instructions below to download the necessary workflow files. Once downloaded, the workflow can be executed using two approaches:
30-
31-
1. **[Using Singularity](#approach-1-using-singularity)**
32-
2. **[Using a Local R Environment](#approach-2-using-a-local-r-environment)**
33-
34-
Please follow the instructions for the approach that best matches your setup and preferences. Each method is explained in detail below.
35-
36-
---
37-
3833
### 1. Download the Workflow Files
3934

4035
Download the latest version of the GL_RefAnnotTable-A workflow:
@@ -44,107 +39,135 @@ curl -LO https://github.com/nasa/GeneLab_Data_Processing/releases/download/GL_Re
4439
unzip GL_RefAnnotTable-A_1.1.0.zip
4540
```
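
As a quick sanity check after unzipping (an illustrative sketch, not part of this commit), the helper below verifies that the files referenced by later commands in this README are present. The function name `check_workflow_files` is hypothetical; the relative paths are the ones that appear in the commands shown in this document.

```bash
# Hypothetical helper: confirm the unzipped workflow directory contains
# the files that later commands in this README refer to.
check_workflow_files() {
    local dir="${1:-GL_RefAnnotTable-A_1.1.0}"
    local missing=0
    local f
    for f in \
        bin/prepull_singularity.sh \
        config/software/by_docker_image.config \
        GL-DPPD-7110-A_build-genome-annots-tab.R \
        install-org-db.R
    do
        if [ ! -f "$dir/$f" ]; then
            # Report each missing file on stderr and remember the failure.
            echo "missing: $dir/$f" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For example, `check_workflow_files GL_RefAnnotTable-A_1.1.0 && echo "workflow files found"` prints the message only when all four files exist.
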
4641

42+
<br>
43+
4744
---
4845

4946
### 2. Run the Workflow
5047

51-
The GL_RefAnnotTable-A workflow can be run using two approaches:
48+
The GL_RefAnnotTable-A workflow can be run using one of two approaches:
5249

5350
- **[Approach 1: Using Singularity](#approach-1-using-singularity)**
5451
- **[Approach 2: Using a Local R Environment](#approach-2-using-a-local-r-environment)**
5552

53+
Please follow the instructions for the approach that best matches your setup and preferences. Each method is explained in detail below.
54+
5655
---
5756

58-
#### Approach 1: Using Singularity
57+
### Approach 1: Using Singularity
5958

6059
This approach allows you to run the workflow within a containerized environment, ensuring consistency and reproducibility.
6160

62-
##### Step 1: Install Singularity
61+
#### Step 1: Install Singularity
6362

64-
Singularity is a containerization platform for running applications portably and reproducibly. We use container images hosted on Quay.io to encapsulate all the necessary software and dependencies required by the GL_RefAnnotTable-A workflow. This setup allows you to run the workflow without installing any software directly on your system. Other containerization tools like Docker or Apptainer can also be used to pull and run these images.
63+
Singularity is a containerization platform for running applications portably and reproducibly. We use container images hosted on Quay.io to encapsulate all the necessary software and dependencies required by the GL_RefAnnotTable-A workflow. This setup allows you to run the workflow without installing any software directly on your system.
64+
> ***Note**: Other containerization tools like Docker or Apptainer can also be used to pull and run these images.*
6565
6666
We recommend installing Singularity system-wide as per the official [Singularity installation documentation](https://docs.sylabs.io/guides/3.10/admin-guide/admin_quickstart.html).
6767

68-
> **Note**: While Singularity is also available through [Anaconda](https://anaconda.org/conda-forge/singularity), we recommend installing Singularity system-wide following the official installation documentation.
68+
> ***Note**: While Singularity is also available through [Anaconda](https://anaconda.org/conda-forge/singularity), we recommend installing Singularity system-wide following the official installation documentation.*
6969
70-
##### Step 2: Fetch the Singularity Image
70+
<br>
71+
72+
#### Step 2: Fetch the Singularity Image
7173

7274
To pull the Singularity image needed for the workflow, you can use the provided script as directed below or pull the image directly.
7375

74-
> **Note**: This command should be run in the location containing the `GL_RefAnnotTable-A_1.1.0` directory that was downloaded in [step 1](#1-download-the-workflow-files). Depending on your network speed, fetching the images will take approximately 20 minutes.
76+
> ***Note**: This command should be run in the location containing the `GL_RefAnnotTable-A_1.1.0` directory that was downloaded in [step 1](#1-download-the-workflow-files). Depending on your network speed, fetching the images will take approximately 20 minutes.*
77+
7578

7679
```bash
7780
bash GL_RefAnnotTable-A_1.1.0/bin/prepull_singularity.sh GL_RefAnnotTable-A_1.1.0/config/software/by_docker_image.config
7881
```
79-
80-
Once complete, a `singularity` folder containing the Singularity images will be created. Run the following command to export this folder as an environment variable:
82+
83+
Once complete, a `singularity` folder containing the Singularity images will be created. Run the following command to export this folder as an environment variable:
84+
8185

8286
```bash
8387
export SINGULARITY_CACHEDIR=$(pwd)/singularity
8488
```
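
Before moving on, it can help to confirm that the image was actually fetched. The following is a minimal sketch, assuming the image file name matches the one used in the `singularity exec` commands in this README; `require_image` is a hypothetical helper name and is not part of the workflow.

```bash
# Hypothetical helper: fail early if the expected image file is absent
# from the directory that SINGULARITY_CACHEDIR points to.
require_image() {
    local cache_dir="${1:-${SINGULARITY_CACHEDIR:-}}"
    local image="quay.io-nasa_genelab-gl-refannottable-a-1.1.0.img"
    if [ -z "$cache_dir" ]; then
        echo "SINGULARITY_CACHEDIR is not set" >&2
        return 1
    fi
    if [ ! -e "$cache_dir/$image" ]; then
        echo "image not found: $cache_dir/$image" >&2
        return 1
    fi
    # Print the full path so callers can reuse it.
    echo "$cache_dir/$image"
}
```

Calling `require_image` with no argument falls back to the exported `SINGULARITY_CACHEDIR` value.
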
89+
<br>
8590

86-
##### Step 3: Run the Workflow
91+
#### Step 3: Run the Workflow
8792

88-
While in the directory containing the `GL_RefAnnotTable-A_1.1.0` folder, you can now run the workflow. Below is an example for generating the annotation table for *Mus musculus* (mouse):
93+
While in the directory containing the `GL_RefAnnotTable-A_1.1.0` folder, you can now run the workflow. Below is an example for generating the annotation table for *Mus musculus* (mouse):
94+
8995

9096
```bash
9197
singularity exec -B $(pwd)/GL_RefAnnotTable-A_1.1.0:/work \
9298
$SINGULARITY_CACHEDIR/quay.io-nasa_genelab-gl-refannottable-a-1.1.0.img \
9399
Rscript /work/GL-DPPD-7110-A_build-genome-annots-tab.R 'Mus musculus'
94100
```
95-
101+
96102
**Input data:**
97103

98-
- No input files are required. Specify the target organism using a positional command line argument. `Mus musculus` is used in the example above. To see a list of all available organisms, run `Rscript GL-DPPD-7110-A_build-genome-annots-tab.R` without positional arguments. The correct argument for each organism can also be found in the 'species' column of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv)
99-
- Optional: a reference table CSV can be supplied as a second positional argument instead of using the default [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv)
104+
- No input files are required. Specify the species name of the target organism using a positional command line argument. `Mus musculus` is used in the example above.
105+
> **Notes**:
106+
> To see a list of all available organisms, run `Rscript GL-DPPD-7110-A_build-genome-annots-tab.R` without positional arguments.
107+
> The correct argument for each organism can also be found in the 'species' column of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv).
108+
- *Optional*: A local reference table CSV file can be supplied as a second positional argument. If not provided, the script will download the current version of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) table by default.
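
If a local copy of the reference table is at hand, the valid species arguments can also be listed from its 'species' column. This is a rough sketch, assuming the CSV has a header row and that no field up to and including 'species' contains embedded commas; `list_species` is a hypothetical helper, not part of the workflow.

```bash
# Hypothetical helper: print the values of the 'species' column of a CSV file.
# Assumes simple comma-separated fields (no commas inside quoted values).
list_species() {
    local csv="$1"
    awk -F',' '
        { sub(/\r$/, "") }                 # tolerate Windows line endings
        NR == 1 {
            # Locate the species column by its header name.
            for (i = 1; i <= NF; i++) {
                name = $i
                gsub(/^"|"$/, "", name)
                if (name == "species") col = i
            }
            if (!col) { print "no species column found" > "/dev/stderr"; exit 1 }
            next
        }
        {
            value = $col
            gsub(/^"|"$/, "", value)
            if (value != "") print value
        }
    ' "$csv"
}
```

For example, `list_species GL-DPPD-7110-A_annotations.csv` prints one species name per line; each printed name is a candidate for the first positional argument.
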
109+
100110

101111
**Output data:**
102112

103113
- *-GL-annotations.tsv (Tab-delimited table of gene annotations)
104114
- *-GL-build-info.txt (Text file containing information used to create the annotation table, including tool and tool versions and date of creation)
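
Once the run finishes, the outputs can be inspected from the shell. This is a small sketch, assuming the annotation table starts with a header row; `tsv_columns` is a hypothetical helper, and the column names it prints depend on the organism and workflow version.

```bash
# Hypothetical helper: print the column names of a tab-separated file,
# one per line, by splitting its first row on tabs.
tsv_columns() {
    head -n 1 "$1" | tr '\t' '\n'
}
```

For example, `for f in *-GL-annotations.tsv; do tsv_columns "$f"; done` lists the columns of each annotation table in the current directory.
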
105115

106-
##### Step 4: Run the Annotations Database Creation Function as a Stand-Alone Script
116+
<br>
117+
118+
#### *Optional*: Run the Annotations Database Creation Function as a Stand-Alone Script
107119

108-
If the reference table does not specify an annotations database for the target organism in the 'annotations' column, the `install_annotations` function (defined in `install-org-db.R`) will be executed. This function can also be run as a stand-alone script:
120+
If the reference table does not specify an annotations database for the target organism in the 'annotations' column of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) file, the `install_annotations` function (defined in `install-org-db.R`) will be executed by default. This function can also be run as a stand-alone script:
121+
109122

110123
```bash
111124
singularity exec -B $(pwd)/GL_RefAnnotTable-A_1.1.0:/work \
112125
$SINGULARITY_CACHEDIR/quay.io-nasa_genelab-gl-refannottable-a-1.1.0.img \
113126
Rscript /work/install-org-db.R 'Bacillus subtilis'
114127
```
128+
115129

116130
**Input data:**
117131

118-
- The target organism must be specified as the first positional command line argument. `Bacillus subtilis` is used in the example above. The correct argument for each organism can be found in the 'species' column of [GL-DPPD-7110-A_annotations.csv](https://raw.githubusercontent.com/nasa/GeneLab_Data_Processing/master/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv)
119-
- Optional: A local reference table can be supplied as a second positional argument. If not provided, the script will download the current version of GL-DPPD-7110-A_annotations.csv from Github by default.
132+
- The species name of the target organism must be specified as the first positional command line argument. `Bacillus subtilis` is used in the example above.
133+
> **Note**: The correct argument for each organism can also be found in the 'species' column of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv).
134+
- *Optional*: A local reference table CSV file can be supplied as a second positional argument. If not provided, the script will download the current version of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) table by default.
135+
120136

121137
**Output data:**
122138

123139
- org.*.eg.db/ (Species-specific annotation database, as a local R package)
124140

141+
<br>
142+
125143
---
126144

127-
#### Approach 2: Using a Local R Environment
145+
### Approach 2: Using a Local R Environment
128146

129147
This approach allows you to run the workflow directly in your local R environment without using containers.
130148

131-
##### Step 1: Install R and Required R Packages
149+
#### Step 1: Install R and Required R Packages
132150

133151
We recommend installing R via the [Comprehensive R Archive Network (CRAN)](https://cran.r-project.org/):
134152

135153
1. Select the [CRAN Mirror](https://cran.r-project.org/mirrors.html) closest to your location.
154+
136155
2. Navigate to the download page for your operating system.
137-
3. Download and install R (e.g., R-4.4.0).
156+
157+
3. Download and install R (e.g., R-4.4.0).
138158

139-
Once R is installed, you need to install the required R packages.
159+
Once R is installed, install the required R packages as follows:
140160

141-
Open a terminal and start R:
161+
Open a terminal and start R:
162+
142163

143164
```bash
144165
R
145-
```
166+
```
146167

147-
Within the R environment, run the following commands to install the required packages:
168+
169+
Within the R environment, run the following commands to install the required packages:
170+
148171

149172
```R
150173
install.packages("tidyverse")
@@ -157,39 +180,52 @@ BiocManager::install("biomaRt")
157180
BiocManager::install("GO.db")
158181
```
159182

160-
##### Step 2: Run the Workflow
183+
<br>
161184

162-
While in the directory containing the `GL_RefAnnotTable-A_1.1.0` folder, you can now run the workflow. Below is an example of how to run the workflow to build an annotation table for *Mus musculus* (mouse):
185+
#### Step 2: Run the Workflow
186+
187+
While in the directory containing the `GL_RefAnnotTable-A_1.1.0` folder, you can now run the workflow. Below is an example of how to run the workflow to build an annotation table for *Mus musculus* (mouse):
188+
163189

164190
```bash
165191
Rscript GL_RefAnnotTable-A_1.1.0/GL-DPPD-7110-A_build-genome-annots-tab.R 'Mus musculus'
166192
```
193+
167194

168195
**Input data:**
169196

170-
- No input files are required. Specify the target organism using a positional command line argument. `Mus musculus` is used in the example above. To see a list of all available organisms, run `Rscript GL-DPPD-7110-A_build-genome-annots-tab.R` without positional arguments. The correct argument for each organism can also be found in the 'species' column of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv)
171-
- Optional: a reference table CSV can be supplied as a second positional argument instead of using the default [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv)
197+
- No input files are required. Specify the species name of the target organism using a positional command line argument. `Mus musculus` is used in the example above.
198+
> **Notes**:
199+
> To see a list of all available organisms, run `Rscript GL-DPPD-7110-A_build-genome-annots-tab.R` without positional arguments.
200+
> The correct argument for each organism can also be found in the 'species' column of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv).
201+
- *Optional*: A local reference table CSV file can be supplied as a second positional argument. If not provided, the script will download the current version of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) table by default.
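
The two positional arguments described above can be wrapped in a small function. This is a minimal sketch for the local R approach, assuming it is run from the directory containing `GL_RefAnnotTable-A_1.1.0`; `run_annot_table` and the `DRY_RUN` variable are hypothetical names introduced here for illustration.

```bash
# Hypothetical wrapper: run the annotation-table script for a species,
# optionally passing a local reference table CSV as the second argument.
run_annot_table() {
    local species="${1:-}"
    local ref_csv="${2:-}"
    local script="GL_RefAnnotTable-A_1.1.0/GL-DPPD-7110-A_build-genome-annots-tab.R"

    if [ -z "$species" ]; then
        echo "usage: run_annot_table 'Species name' [local_reference_table.csv]" >&2
        return 2
    fi
    if [ -n "$ref_csv" ] && [ ! -f "$ref_csv" ]; then
        echo "reference table not found: $ref_csv" >&2
        return 1
    fi

    # Build the argument list; the CSV is appended only when supplied.
    set -- "$script" "$species"
    if [ -n "$ref_csv" ]; then
        set -- "$@" "$ref_csv"
    fi

    if [ -n "${DRY_RUN:-}" ]; then
        # Print the command instead of running it.
        printf 'Rscript'
        printf " '%s'" "$@"
        printf '\n'
    else
        Rscript "$@"
    fi
}
```

Setting `DRY_RUN=1` prints the `Rscript` command without executing it, which is a cheap way to check quoting of species names that contain spaces.
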
202+
172203

173204
**Output data:**
174205

175206
- *-GL-annotations.tsv (Tab-delimited table of gene annotations)
176207
- *-GL-build-info.txt (Text file containing information used to create the annotation table, including tool and tool versions and date of creation)
177208

178-
##### Step 3: Run the Annotations Database Creation Function as a Stand-Alone Script
209+
<br>
210+
211+
#### *Optional*: Run the Annotations Database Creation Function as a Stand-Alone Script
179212

180-
If the reference table does not specify an annotations database for the target organism in the 'annotations' column, the `install_annotations` function (defined in `install-org-db.R`) will be executed. This function can also be run as a stand-alone script:
213+
If the reference table does not specify an annotations database for the target organism in the 'annotations' column of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) file, the `install_annotations` function (defined in `install-org-db.R`) will be executed by default. This function can also be run as a stand-alone script:
214+
181215

182216
```bash
183217
Rscript GL_RefAnnotTable-A_1.1.0/install-org-db.R 'Bacillus subtilis'
184218
```
185219

186220
**Input data:**
187221

188-
- The target organism must be specified as the first positional command line argument. `Bacillus subtilis` is used in the example above. The correct argument for each organism can be found in the 'species' column of [GL-DPPD-7110-A_annotations.csv](https://raw.githubusercontent.com/nasa/GeneLab_Data_Processing/master/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv)
189-
- Optional: A local reference table can be supplied as a second positional argument. If not provided, the script will download the current version of GL-DPPD-7110-A_annotations.csv from Github by default.
222+
- The species name of the target organism must be specified as the first positional command line argument. `Bacillus subtilis` is used in the example above.
223+
> **Note**: The correct argument for each organism can also be found in the 'species' column of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv).
224+
- *Optional*: A local reference table CSV file can be supplied as a second positional argument. If not provided, the script will download the current version of the [GL-DPPD-7110-A_annotations.csv](../../Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110-A/GL-DPPD-7110-A_annotations.csv) table by default.
225+
190226

191227
**Output data:**
192228

193-
- org.*.eg.db/ (species-specific annotation database, as a local R package)
229+
- org.*.eg.db/ (Species-specific annotation database, as a local R package)
194230

195-
---
231+
---
