RNAseq/Workflow_Documentation/NF_RCP/README.md
Lines changed: 25 additions & 30 deletions
@@ -4,7 +4,7 @@
4
4
5
5
### Implementation Tools <!-- omit in toc -->
6
6
7
-
The current GeneLab RNAseq consensus processing pipeline (RCP) for eukaryotic organisms, [GL-DPPD-7101-G](../../Pipeline_GL-DPPD-7101_Versions/GL-DPPD-7101-G.md) and the GeneLab RNAseq consensus pipeline [GL-DPPD-7115](../../Pipeline_GL-DPPD-7115_Versions/GL-DPPD-7115.md), are implemented as a single [Nextflow](https://nextflow.io/) DSL2 workflow that utilizes [Singularity](https://docs.sylabs.io/guides/3.10/user-guide/introduction.html) to run all tools in containers. This workflow (NF_RCP) is run using the command line interface (CLI) of any unix-based system. While knowledge of creating workflows in Nextflow is not required to run the workflow as is, [the Nextflow documentation](https://nextflow.io/docs/latest/index.html) is a useful resource for users who want to modify and/or extend this workflow.
7
+
The current GeneLab RNAseq consensus processing pipeline (RCP) for eukaryotic organisms ([GL-DPPD-7101-G](../../Pipeline_GL-DPPD-7101_Versions/GL-DPPD-7101-G.md)) and prokaryotic organisms ([GL-DPPD-7115](../../Pipeline_GL-DPPD-7115_Versions/GL-DPPD-7115.md)) are implemented as a single [Nextflow](https://nextflow.io/) DSL2 workflow that utilizes [Singularity](https://docs.sylabs.io/guides/3.10/user-guide/introduction.html) to run all tools in containers. This workflow (NF_RCP) is run using the command line interface (CLI) of any unix-based system. While knowledge of creating workflows in Nextflow is not required to run the workflow as is, [the Nextflow documentation](https://nextflow.io/docs/latest/index.html) is a useful resource for users who want to modify and/or extend this workflow.
8
8
9
9
### Workflow & Subworkflows <!-- omit in toc -->
10
10
@@ -28,14 +28,13 @@ The current GeneLab RNAseq consensus processing pipeline (RCP) for eukaryotic or
28
28
29
29
---
30
30
The NF_RCP workflow is composed of three subworkflows as shown in the image above.
31
-
Below is a description of each subworkflow and the additional output files generated that are not already indicated in the [GL-DPPD-7101-G pipeline
Below is a description of each subworkflow and the additional output files generated that are not already indicated in the [GL-DPPD-7101-G](../../Pipeline_GL-DPPD-7101_Versions/GL-DPPD-7101-G.md) and [GL-DPPD-7115](../../Pipeline_GL-DPPD-7115_Versions/GL-DPPD-7115.md) pipeline documents:
33
32
34
33
1. **Analysis Staging Subworkflow**
35
34
36
35
- Description:
37
36
- This subworkflow extracts the metadata parameters (e.g. organism, library layout) needed for processing from the OSD/GLDS ISA archive and retrieves the raw reads files hosted on the [Open Science Data Repository (OSDR)](https://osdr.nasa.gov/bio/repo/).
38
-
> *OSD/GLDS ISA archive*: ISA directory containing Investigation, Study, and Assay (ISA) metadata files for a respective GLDS dataset - the *ISA.zip file is located in the [OSDR](https://osdr.nasa.gov/bio/repo/)under 'Files' -> 'Study Metadata Files' for any GeneLab Data Set (GLDS) in the OSDR.
37
+
> *OSD/GLDS ISA archive*: ISA directory containing Investigation, Study, and Assay (ISA) metadata files for a respective GLDS dataset - the *ISA.zip file is located under 'Files' -> 'Study Metadata Files' for any GeneLab Data Set (GLDS) in the [OSDR](https://osdr.nasa.gov/bio/repo/).
4a. [Approach 1: Run the workflow on a GeneLab RNAseq dataset with automatic retrieval of Ensembl reference fasta and gtf files](#4a-approach-1-run-the-workflow-on-a-genelab-rnaseq-dataset-with-automatic-retrieval-of-ensembl-reference-fasta-and-gtf-files)
75
-
4b. [Approach 2: Run the workflow on a GeneLab RNAseq dataset using local Ensembl reference fasta and gtf files](#4b-approach-2-run-the-workflow-on-a-genelab-rnaseq-dataset-using-local-reference-fasta-and-gtf-files)
76
-
4c. [Approach 3: Run the workflow on a non-GLDS dataset using a user-created runsheet](#4c-approach-3-run-the-workflow-on-a-non-glds-dataset-using-a-user-created-runsheet)
77
-
4d. [Approach 4: Run the workflow on a GeneLab prokaryotic RNAseq dataset](#4d-approach-4-run-the-workflow-on-a-genelab-prokaryotic-rnaseq-dataset)
73
+
4a. [Approach 1: Run the workflow on a GeneLab RNAseq dataset with automatic retrieval of reference fasta and gtf files](#4a-approach-1-run-the-workflow-on-a-genelab-rnaseq-dataset-with-automatic-retrieval-of-reference-fasta-and-gtf-files)
74
+
4b. [Approach 2: Run the workflow on a GeneLab RNAseq dataset using local reference fasta and gtf files](#4b-approach-2-run-the-workflow-on-a-genelab-rnaseq-dataset-using-local-reference-fasta-and-gtf-files)
75
+
4c. [Approach 3: Run the workflow on a non-GeneLab dataset using a user-created runsheet](#4c-approach-3-run-the-workflow-on-a-non-genelab-dataset-using-a-user-created-runsheet)
While in the location containing the `NF_RCP_2.0.0` directory that was downloaded in [step 2](#2-download-the-workflow-files), you are now able to run the workflow. Below are examples of how to run the NF_RCP workflow:
153
-
> Note: Nextflow commands use both single hyphen arguments (e.g. -help) that denote general nextflow arguments and double hyphen arguments (e.g. --ensemblVersion) that denote workflow specific parameters. Take care to use the proper number of hyphens for each argument.
151
+
> Note: Nextflow commands use both single hyphen arguments (e.g. -help) that denote general nextflow arguments and double hyphen arguments (e.g. --reference_version) that denote workflow specific parameters. Take care to use the proper number of hyphens for each argument.
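The hyphen convention in the note above can be illustrated with a small shell sketch. `classify_arg` is a hypothetical helper written for this example only; it is not part of NF_RCP or Nextflow, and it simply labels each argument of the Approach 1 command by its leading hyphens.

```bash
# Hypothetical helper (not part of NF_RCP): label an argument by its leading hyphens.
classify_arg() {
    case "$1" in
        --*) echo "workflow parameter" ;;   # double hyphen, e.g. --accession
        -*)  echo "Nextflow option" ;;      # single hyphen, e.g. -profile
        *)   echo "value" ;;                # anything else, e.g. singularity
    esac
}

# Label each argument used in the Approach 1 command
for arg in -profile singularity --accession OSD-194; do
    printf '%s\t%s\n' "$arg" "$(classify_arg "$arg")"
done
```

The double-hyphen pattern is checked first because an argument such as `--accession` also matches the single-hyphen pattern.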
154
152
155
153
<br>
156
154
157
-
#### 4a. Approach 1: Run the workflow on a GeneLab RNAseq dataset with automatic retrieval of Ensembl reference fasta and gtf files
155
+
#### 4a. Approach 1: Run the workflow on a GeneLab RNAseq dataset with automatic retrieval of reference fasta and gtf files
158
156
159
157
```bash
nextflow run NF_RCP_2.0.0/main.nf \
   -profile singularity \
   --accession OSD-194
```
164
162
163
+
> Note: For prokaryotic RNAseq datasets, add the parameter `--mode microbes` to run the workflow using the prokaryotic pipeline ([GL-DPPD-7115](../../Pipeline_GL-DPPD-7115_Versions/GL-DPPD-7115.md)). The default value of this parameter is `default`, which will use the eukaryotic pipeline ([GL-DPPD-7101-G](../../Pipeline_GL-DPPD-7101_Versions/GL-DPPD-7101-G.md)).
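As a sketch of how the `--mode` parameter from the note above fits into the launch command, the snippet below assembles the command string for either pipeline and prints it for inspection instead of running it. `build_rcp_cmd` is a hypothetical helper, not part of the workflow; the accession `OSD-185` is taken from the prokaryotic example this change removes.

```bash
# Hypothetical helper (not part of NF_RCP): assemble the launch command as text.
# Usage: build_rcp_cmd <accession> [mode]
build_rcp_cmd() {
    local accession="$1"
    local mode="${2:-default}"   # `default` selects the eukaryotic pipeline
    local cmd="nextflow run NF_RCP_2.0.0/main.nf -profile singularity"
    # --mode only needs to be passed when leaving the default pipeline
    if [ "$mode" != "default" ]; then
        cmd="$cmd --mode $mode"
    fi
    echo "$cmd --accession $accession"
}

# Print the prokaryotic command for review before running it manually
build_rcp_cmd OSD-185 microbes
```

Printing the command first makes it easy to confirm that `--mode microbes` is present before committing to a long run.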
164
+
165
165
<br>
166
166
167
167
#### 4b. Approach 2: Run the workflow on a GeneLab RNAseq dataset using local reference fasta and gtf files
@@ -172,15 +172,15 @@ nextflow run NF_RCP_2.0.0/main.nf \
172
172
nextflow run NF_RCP_2.0.0/main.nf \
173
173
-profile singularity \
174
174
--accession OSD-194 \
175
-
--reference_version 107 \
175
+
--reference_version 112 \
176
176
--reference_source ensembl \
177
177
--reference_fasta </path/to/fasta> \
178
178
--reference_gtf </path/to/gtf>
179
179
```
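Because Approach 2 depends on local files, it can help to confirm that both paths exist before launching. The following is a minimal pre-flight sketch; `check_reference_files` is a hypothetical helper and the check (file exists and is non-empty) is an assumption about what is worth verifying, not something the workflow itself requires.

```bash
# Hypothetical pre-flight check (not part of NF_RCP): verify that the local
# reference fasta and gtf files exist and are non-empty.
# Usage: check_reference_files <fasta> <gtf>
check_reference_files() {
    local fasta="$1" gtf="$2" status=0 f
    for f in "$fasta" "$gtf"; do
        if [ ! -s "$f" ]; then
            echo "Missing or empty reference file: $f" >&2
            status=1
        fi
    done
    return "$status"
}

# Example (placeholders as in the Approach 2 command):
# check_reference_files </path/to/fasta> </path/to/gtf> && nextflow run NF_RCP_2.0.0/main.nf ...
```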
180
180
181
181
<br>
182
182
183
-
#### 4c. Approach 3: Run the workflow on a non-OSD dataset using a user-created runsheet
183
+
#### 4c. Approach 3: Run the workflow on a non-GeneLab dataset using a user-created runsheet
184
184
185
185
> Note: Specifications for creating a runsheet manually are described [here](examples/runsheet/README.md).
186
186
@@ -192,30 +192,25 @@ nextflow run NF_RCP_2.0.0/main.nf \
192
192
193
193
<br>
194
194
195
-
#### 4d. Approach 4: Run the workflow on a GeneLab prokaryotic RNAseq dataset
196
-
197
-
```bash
198
-
nextflow run NF_RCP_2.0.0/main.nf \
199
-
-profile singularity \
200
-
--mode microbes \
201
-
--accession OSD-185
202
-
```
203
-
204
-
<br>
205
-
206
195
**Required Parameters For All Approaches:**
207
196
208
197
* `NF_RCP_2.0.0/main.nf` - Instructs Nextflow to run the NF_RCP workflow
209
198
210
199
* `-profile` - Specifies the configuration profile(s) to load; `singularity` instructs Nextflow to set up and use Singularity for all software called in the workflow
211
-
> Note: The output directory will be named `GLDS-#` when using a OSDR or GLDS accession as input, or `results` when running the workflow with only a runsheet as input.
200
+
> Note: The output directory will be named `GLDS-#` when using an OSD or GLDS accession as input, or `results` when running the workflow with only a runsheet as input.
201
+
202
+
203
+
<br>
204
+
205
+
**Additional Required Parameters For [Approach 1](#4a-approach-1-run-the-workflow-on-a-genelab-rnaseq-dataset-with-automatic-retrieval-of-reference-fasta-and-gtf-files):**
212
206
207
+
* `--accession` - The OSD or GLDS ID for the dataset to be processed, e.g. `GLDS-194` or `OSD-194`
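
A quick format check on the accession can catch typos before the workflow starts. This is only a sketch: `is_valid_accession` is a hypothetical helper, and the accepted pattern (`OSD-` or `GLDS-` followed by digits) is an assumption inferred from the two examples above rather than a documented rule.

```bash
# Hypothetical helper (not part of NF_RCP): accept IDs shaped like OSD-194 or GLDS-194.
is_valid_accession() {
    # Require one of the two known prefixes
    case "$1" in
        OSD-*|GLDS-*) ;;
        *) return 1 ;;
    esac
    # Everything after the first hyphen must be one or more digits
    local digits="${1#*-}"
    case "$digits" in
        ''|*[!0-9]*) return 1 ;;
    esac
    return 0
}

# Example: report whether an ID looks usable
if is_valid_accession "OSD-194"; then
    echo "OSD-194 looks like a valid accession"
fi
```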
213
208
214
209
<br>
215
210
216
-
**Additional Required Parameters For [Approach 2](#4b-approach-2-run-the-workflow-on-a-genelab-rnaseq-dataset-using-local-ensembl-reference-fasta-and-gtf-files):**
211
+
**Additional Required Parameters For [Approach 2](#4b-approach-2-run-the-workflow-on-a-genelab-rnaseq-dataset-using-local-reference-fasta-and-gtf-files):**
217
212
218
-
* `--reference_version` - specifies the Ensembl version to use for the reference genome (Ensembl release `107` is used in this example); only needed when using Ensembl as the reference source
213
+
* `--reference_version` - specifies the reference source version to use for the reference genome (Ensembl release `112` is used in this example); only needed when using Ensembl as the reference source
219
214
220
215
* `--reference_source` - specifies the source of the reference files used (the source indicated in the Approach 2 example is `ensembl`)
221
216
@@ -235,12 +230,12 @@ nextflow run NF_RCP_2.0.0/main.nf \
235
230
236
231
* `--force_single_end` - forces the analysis to use single end processing; for paired end datasets, this means only R1 is used; for single end datasets, this should have no effect
237
232
238
-
* `--reference_store_path` - specifies the directory to store the Ensembl fasta and gtf files (Default: within the directory structure created by default in the launch directory)
233
+
* `--reference_store_path` - specifies the directory to store the reference fasta and gtf files (Default: within the directory structure created by default in the launch directory)
239
234
240
-
* `--derived_store_path` - specifies the directory to store the tool-specific indices created during processing (Default: within the directory structure created by default in the launch directory)
235
+
* `--derived_store_path` - specifies the directory to store the tool-specific indices created during processing (Default: within the directory structure created by default in the launch directory)
241
236
242
-
* `--runsheet_path` - specifies the path to a local runsheet (Default: a runsheet is automatically generated using the metadata on the GeneLab Repository for the OSD dataset being processed)
243
-
> This is required when processing a non-OSD dataset as indicated in [Approach 3 above](#4c-approach-3-run-the-workflow-on-a-non-glds-dataset-using-a-user-created-runsheet)
237
+
* `--runsheet_path` - specifies the path to a local runsheet (Default: a runsheet is automatically generated using the metadata on the OSDR for the dataset being processed)
238
+
> This is required when processing a non-OSDR dataset as indicated in [Approach 3 above](#4c-approach-3-run-the-workflow-on-a-non-genelab-dataset-using-a-user-created-runsheet)
244
239
245
240
* `--mode` - specifies which pipeline to use: set to `default` to run GL-DPPD-7101-G pipeline or set to `microbes` for the GL-DPPD-7115 prokaryotic pipeline (Default value: `default`)
246
241
> This allows the workflow to process either eukaryotic (default) or prokaryotic RNAseq data using the appropriate pipeline