Commit f61a43d

readd NF_RCP-F folder, readd envs subfolder, update docs reference to this folder
1 parent 2176cf4 commit f61a43d

File tree

104 files changed

+13063
-1
lines changed

Lines changed: 70 additions & 0 deletions
@@ -0,0 +1,70 @@
1+
# Changelog
2+
3+
All notable changes to this project will be documented in this file.
4+
5+
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
6+
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
7+
8+
## [1.0.4](https://github.com/nasa/GeneLab_Data_Processing/tree/NF_RCP-F_1.0.4/RNAseq/Workflow_Documentation/NF_RCP-F) - 2024-02-08
9+
10+
### Fixed
11+
12+
- Workflow usage files will all follow output directory set by workflow user
13+
- ERCC Notebook:
14+
- Moved gene prefix definition to start of notebook
15+
- Added fallback for scenarios where every gene has at least one zero count: use the "poscounts" estimator, which calculates a modified geometric mean
16+
- Reordered box-whisker plots from descending to ascending reference concentration order, ordered bar plots similarly
17+
18+
### Changed
19+
20+
- TrimGalore! will now use autodetect for adaptor type
21+
- V&V migrated from dp_tools version 1.1.8 to 1.3.4 including:
22+
- Migration of V&V protocol code to this codebase instead of dp_tools
23+
- Fix for sample-wise checks reusing the same sample
24+
- Added '_GLbulkRNAseq' to output file names
25+
26+
## [1.0.3](https://github.com/nasa/GeneLab_Data_Processing/tree/NF_RCP-F_1.0.3/RNAseq/Workflow_Documentation/NF_RCP-F) - 2023-01-25
27+
28+
### Added
29+
30+
- Test coverage using [nf-test](https://github.com/askimed/nf-test) approach
31+
32+
### Changed
33+
34+
- Updated software versions (via container update)
35+
- tximport == 1.27.1
36+
37+
### Fixed
38+
39+
- 'ERCC Non detection causes non-silent error' #65
40+
- 'This function is not compatible with certain updated ISA archive metadata filenaming' #56
41+
- 'Groups can become misassigned during group statistic calculation' #55
42+
- 'sample to filename mapping fails when sample names are prefix substrings of other sample names' #60
43+
- Fixed Singularity specific container issue related to DESeq2 steps
44+
45+
## [1.0.2](https://github.com/nasa/GeneLab_Data_Processing/tree/NF_RCP-F_1.0.2/RNAseq/Workflow_Documentation/NF_RCP-F) - 2022-11-30
46+
47+
### Added
48+
49+
- Manual tool version reporting functionality for [script](https://github.com/nasa/GeneLab_Data_Processing/tree/NF_RCP-F_1.0.2/RNAseq/Workflow_Documentation/NF_RCP-F/workflow_code/bin/format_software_versions.py) that consolidates tool versions for full workflow.
50+
- Currently includes manual version reporting for gtfToGenePred and genePredToBed
51+
52+
### Fixed
53+
54+
- Updated Cutadapt version in workflow from 3.4 to 3.7 in accordance with pipeline [specification](https://github.com/nasa/GeneLab_Data_Processing/tree/NF_RCP-F_1.0.2/RNAseq/Pipeline_GL-DPPD-7101_Versions/GL-DPPD-7101-F.md)
55+
56+
## [1.0.1](https://github.com/nasa/GeneLab_Data_Processing/tree/NF_RCP-F_1.0.1/RNAseq/Workflow_Documentation/NF_RCP-F) - 2022-11-17
57+
58+
### Changed
59+
60+
- Updated to dp_tools version 1.1.8 from 1.1.7: this addresses API changes from the release of the [OSDR](https://osdr.nasa.gov/bio/)
61+
62+
### Removed
63+
64+
- Docs: Removed the recommendation to use Nextflow version 21.10.6, as newer stable releases address the original issue that had merited the recommendation
65+
66+
## [1.0.0](https://github.com/nasa/GeneLab_Data_Processing/tree/NF_RCP-F_1.0.0/RNAseq/Workflow_Documentation/NF_RCP-F) - 2022-11-04
67+
68+
### Added
69+
70+
- First internal production-ready release of the RNASeq Consensus Pipeline Nextflow Workflow
Lines changed: 279 additions & 0 deletions
@@ -0,0 +1,279 @@
1+
# NF_RCP-F Workflow Information and Usage Instructions <!-- omit in toc -->
2+
3+
## General Workflow Info <!-- omit in toc -->
4+
5+
### Implementation Tools <!-- omit in toc -->
6+
7+
The current GeneLab RNAseq consensus processing pipeline (RCP), [GL-DPPD-7101-F](../../Pipeline_GL-DPPD-7101_Versions/GL-DPPD-7101-F.md), is implemented as a [Nextflow](https://nextflow.io/) DSL2 workflow and utilizes [Singularity](https://docs.sylabs.io/guides/3.10/user-guide/introduction.html) to run all tools in containers. This workflow (NF_RCP-F) is run using the command line interface (CLI) of any unix-based system. While knowledge of creating workflows in Nextflow is not required to run the workflow as is, [the Nextflow documentation](https://nextflow.io/docs/latest/index.html) is a useful resource for users who want to modify and/or extend this workflow.
8+
9+
### Workflow & Subworkflows <!-- omit in toc -->
10+
11+
---
12+
13+
- **Click image to expand**
14+
15+
<p align="center">
16+
<a href="../../images/NF_RCP-F_rnaseq_workflow.png"><img src="../../images/NF_RCP-F_rnaseq_workflow.png"></a>
17+
</p>
18+
19+
---
20+
The NF_RCP-F workflow is composed of three subworkflows as shown in the image above.
21+
Below is a description of each subworkflow and the additional output files generated that are not already indicated in the [GL-DPPD-7101-F pipeline
22+
document](../../Pipeline_GL-DPPD-7101_Versions/GL-DPPD-7101-F.md):
23+
24+
1. **Analysis Staging Subworkflow**
25+
26+
- Description:
27+
- This subworkflow extracts the metadata parameters (e.g. organism, library layout) needed for processing from the OSD/GLDS ISA archive and retrieves the raw reads files hosted on the [Open Science Data Repository (OSDR)](https://osdr.nasa.gov/bio/repo/).
28+
> *OSD/GLDS ISA archive*: ISA directory containing Investigation, Study, and Assay (ISA) metadata files for a respective GLDS dataset - the *ISA.zip file is located in the [OSDR](https://osdr.nasa.gov/bio/repo/) under 'Files' -> 'Study Metadata Files' for any GeneLab Data Set (GLDS) in the OSDR.
29+
30+
2. **RNASeq Consensus Pipeline Subworkflow**
31+
32+
- Description:
33+
- This subworkflow uses the staged raw data and metadata parameters from the Analysis Staging Subworkflow to generate processed data using [version F of the GeneLab RCP](../../Pipeline_GL-DPPD-7101_Versions/GL-DPPD-7101-F.md).
34+
35+
3. **V&V Pipeline Subworkflow**
36+
37+
- Description:
38+
- This subworkflow performs validation and verification (V&V) on the raw and processed data files as the workflow runs. It performs a series of checks on the generated output files and assigns each check one of the flag codes listed in the table below. The flagged results are written to a series of log files.
39+
40+
**V&V Flags**:
41+
42+
|Flag Codes|Flag Name|Interpretation|
43+
|:---------|:--------|:-------------|
44+
| 20 | GREEN | Indicates the check passed all validation conditions |
45+
| 30 | YELLOW | Indicates the check was flagged for minor issues (e.g. slight outliers) |
46+
| 50 | RED | Indicates the check was flagged for moderate issues (e.g. major outliers) |
47+
| 80 | HALT | Indicates the check was flagged for severe issues that trigger a processing halt (e.g. missing data) |
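
When scanning the V&V log files by hand, it can help to translate the numeric codes in the table above back into flag names. The following is a minimal shell sketch of that lookup; the function name `vv_flag_name` and the `UNKNOWN` fallback are illustrative assumptions and are not part of the workflow.

```bash
# Translate a V&V flag code into its flag name (codes taken from the table above).
# Hypothetical helper: not shipped with NF_RCP-F.
vv_flag_name() {
    case "$1" in
        20) echo GREEN ;;
        30) echo YELLOW ;;
        50) echo RED ;;
        80) echo HALT ;;
        *)  echo UNKNOWN ;;   # assumption: any other value is treated as unknown
    esac
}

# Example call:
# vv_flag_name 50
```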
48+
49+
<br>
50+
51+
---
52+
## Utilizing the Workflow
53+
54+
1. [Install Nextflow and Singularity](#1-install-nextflow-and-singularity)
55+
1a. [Install Nextflow](#1a-install-nextflow)
56+
1b. [Install Singularity](#1b-install-singularity)
57+
2. [Download the Workflow Files](#2-download-the-workflow-files)
58+
3. [Fetch Singularity Images](#3-fetch-singularity-images)
59+
4. [Run the Workflow](#4-run-the-workflow)
60+
4a. [Approach 1: Run the workflow on a GeneLab RNAseq dataset with automatic retrieval of Ensembl reference fasta and gtf files](#4a-approach-1-run-the-workflow-on-a-genelab-rnaseq-dataset-with-automatic-retrieval-of-ensembl-reference-fasta-and-gtf-files)
61+
4b. [Approach 2: Run the workflow on a GeneLab RNAseq dataset using local reference fasta and gtf files](#4b-approach-2-run-the-workflow-on-a-genelab-rnaseq-dataset-using-local-reference-fasta-and-gtf-files)
62+
4c. [Approach 3: Run the workflow on a non-OSD dataset using a user-created runsheet](#4c-approach-3-run-the-workflow-on-a-non-osd-dataset-using-a-user-created-runsheet)
63+
5. [Additional Output Files](#5-additional-output-files)
64+
65+
<br>
66+
67+
---
68+
69+
### 1. Install Nextflow and Singularity
70+
71+
#### 1a. Install Nextflow
72+
73+
Nextflow can be installed either through [Anaconda](https://anaconda.org/bioconda/nextflow) or as documented on the [Nextflow documentation page](https://www.nextflow.io/docs/latest/getstarted.html).
74+
75+
> Note: If you want to install Anaconda, we recommend installing the Miniconda (Python 3) version appropriate for your system, as instructed by [Happy Belly Bioinformatics](https://astrobiomike.github.io/unix/conda-intro#getting-and-installing-conda).
76+
>
77+
> Once conda is installed on your system, you can install the latest version of Nextflow by running the following commands:
78+
>
79+
> ```bash
80+
> conda install -c bioconda nextflow
81+
> nextflow self-update
82+
> ```
83+
84+
<br>
85+
86+
#### 1b. Install Singularity
87+
88+
Singularity is a container platform that allows usage of containerized software. This enables the GeneLab RCP workflow to retrieve and use all software required for processing without the need to install the software directly on the user's system.
89+
90+
We recommend installing Singularity system-wide as described in the associated [documentation](https://docs.sylabs.io/guides/3.10/admin-guide/admin_quickstart.html).
91+
92+
> Note: Singularity is also available through [Anaconda](https://anaconda.org/conda-forge/singularity).
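
After both installations, it is worth confirming that the executables are visible on your `PATH` before continuing. The sketch below is one way to do that; the helper name `check_tools` is an assumption made for illustration, and it only tests whether a command can be found, not which version is installed.

```bash
# Report whether each named tool can be found on PATH.
# Returns non-zero if at least one tool is missing.
# Hypothetical helper: not shipped with NF_RCP-F.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Example call for this workflow's two prerequisites:
# check_tools nextflow singularity
```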
93+
94+
<br>
95+
96+
---
97+
98+
### 2. Download the Workflow Files
99+
100+
All files required for utilizing the NF_RCP-F GeneLab workflow for processing RNASeq data are in the [workflow_code](workflow_code) directory. To get a
101+
copy of the latest NF_RCP-F version onto your system, download the code as a zip file from the release page and then unzip it by running the following commands:
102+
103+
```bash
104+
wget https://github.com/nasa/GeneLab_Data_Processing/releases/download/NF_RCP-F_1.0.4/NF_RCP-F_1.0.4.zip
105+
106+
unzip NF_RCP-F_1.0.4.zip
107+
```
108+
109+
<br>
110+
111+
---
112+
113+
### 3. Fetch Singularity Images
114+
115+
Although Nextflow can fetch Singularity images from a URL, doing so may cause issues as detailed [here](https://github.com/nextflow-io/nextflow/issues/1210).
116+
117+
To avoid this issue, run the following command to fetch the Singularity images prior to running the NF_RCP-F workflow:
118+
> Note: This command should be run in the location containing the `NF_RCP-F_1.0.4` directory that was downloaded in [step 2](#2-download-the-workflow-files) above. Depending on your network speed, fetching the images will take ~20 minutes.
119+
120+
```bash
121+
bash NF_RCP-F_1.0.4/bin/prepull_singularity.sh NF_RCP-F_1.0.4/config/software/by_docker_image.config
122+
```
123+
124+
125+
Once complete, a `singularity` folder containing the Singularity images will be created. Run the following command to export this folder as a Nextflow configuration environment variable to ensure Nextflow can locate the fetched images:
126+
127+
```bash
128+
export NXF_SINGULARITY_CACHEDIR=$(pwd)/singularity
129+
```
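
Before launching the workflow, you may want to confirm that the exported directory exists and is not empty. The following is a minimal sketch under stated assumptions: the helper name `count_cached_images` is hypothetical, and because the image file naming is not described here, it simply counts every regular file at the top level of the directory.

```bash
# Count regular files in the Singularity cache directory.
# Uses the first argument if given, otherwise $NXF_SINGULARITY_CACHEDIR.
# Hypothetical helper: not shipped with NF_RCP-F.
count_cached_images() {
    dir="${1:-${NXF_SINGULARITY_CACHEDIR:-}}"
    if [ -z "$dir" ] || [ ! -d "$dir" ]; then
        echo "cache directory not set or missing: '$dir'" >&2
        return 1
    fi
    # Assumption: every top-level regular file is a fetched image.
    find "$dir" -maxdepth 1 -type f | wc -l | tr -d ' '
}

# Example call after the export above:
# count_cached_images
```

A count of zero, or an error, suggests that the fetch step did not complete or that the variable was exported from a different location.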
130+
131+
<br>
132+
133+
---
134+
135+
### 4. Run the Workflow
136+
137+
While in the location containing the `NF_RCP-F_1.0.4` directory that was downloaded in [step 2](#2-download-the-workflow-files), you are now able to run the workflow. Below are three examples of how to run the NF_RCP-F workflow:
138+
> Note: Nextflow commands use both single hyphen arguments (e.g. -help) that denote general nextflow arguments and double hyphen arguments (e.g. --ensemblVersion) that denote workflow specific parameters. Take care to use the proper number of hyphens for each argument.
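
The hyphen rule in the note above can be illustrated with a small shell sketch that classifies each argument by its prefix. The function name `arg_kind` is an assumption made for illustration; Nextflow does this parsing itself, so the snippet is only a reading aid.

```bash
# Classify a command-line token by its leading hyphens.
# Hypothetical helper: not shipped with NF_RCP-F.
arg_kind() {
    case "$1" in
        --*) echo "workflow parameter" ;;   # e.g. --gldsAccession
        -*)  echo "nextflow option" ;;      # e.g. -profile
        *)   echo "value" ;;                # e.g. singularity, OSD-194
    esac
}

# Walk through the arguments used in Approach 1.
for arg in -profile singularity --gldsAccession OSD-194; do
    printf '%-16s %s\n' "$arg" "$(arg_kind "$arg")"
done
```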
139+
140+
<br>
141+
142+
#### 4a. Approach 1: Run the workflow on a GeneLab RNAseq dataset with automatic retrieval of Ensembl reference fasta and gtf files
143+
144+
```bash
145+
nextflow run NF_RCP-F_1.0.4/main.nf \
146+
-profile singularity \
147+
--gldsAccession OSD-194
148+
```
149+
150+
<br>
151+
152+
#### 4b. Approach 2: Run the workflow on a GeneLab RNAseq dataset using local reference fasta and gtf files
153+
154+
> Note: The `--ref_source` and `--ensemblVersion` parameters should match the reference source and version number of the local reference fasta and gtf files used
155+
156+
```bash
157+
nextflow run NF_RCP-F_1.0.4/main.nf \
158+
-profile singularity \
159+
--gldsAccession OSD-194 \
160+
--ensemblVersion 107 \
161+
--ref_source ensembl \
162+
--ref_fasta </path/to/fasta> \
163+
--ref_gtf </path/to/gtf>
164+
```
165+
166+
<br>
167+
168+
#### 4c. Approach 3: Run the workflow on a non-OSD dataset using a user-created runsheet
169+
170+
> Note: Specifications for creating a runsheet manually are described [here](examples/runsheet/README.md).
171+
172+
```bash
173+
nextflow run NF_RCP-F_1.0.4/main.nf \
174+
-profile singularity \
175+
--gldsAccession output_directory \
176+
--runsheetPath </path/to/runsheet>
177+
```
178+
179+
<br>
180+
181+
**Required Parameters For All Approaches:**
182+
183+
* `NF_RCP-F_1.0.4/main.nf` - Instructs Nextflow to run the NF_RCP-F workflow
184+
185+
* `-profile` - Specifies the configuration profile(s) to load; `singularity` instructs Nextflow to set up and use Singularity for all software called in the workflow
186+
187+
* `--gldsAccession OSD-###` – specifies the OSD dataset to process through the RCP workflow (replace ### with the OSD number)
188+
> Note: The primary output directory will be titled "OSD-###"
189+
190+
* `--gldsAccession output_directory` – specifies the output directory name to use when processing a non-OSD dataset, as indicated in [Approach 3 above](#4c-approach-3-run-the-workflow-on-a-non-osd-dataset-using-a-user-created-runsheet)
191+
192+
193+
<br>
194+
195+
**Additional Required Parameters For [Approach 2](#4b-approach-2-run-the-workflow-on-a-genelab-rnaseq-dataset-using-local-reference-fasta-and-gtf-files):**
196+
197+
* `--ensemblVersion` - specifies the Ensembl version to use for the reference genome (Ensembl release `107` is used in this example)
198+
199+
* `--ref_source` - specifies the source of the reference files used (the source indicated in the Approach 2 example is `ensembl`)
200+
201+
* `--ref_fasta` - specifies the path to a local fasta file
202+
203+
* `--ref_gtf` - specifies the path to a local gtf file
204+
205+
> Note: If the local reference files specified are different than the Ensembl reference files used to create the [GeneLab annotations table](https://github.com/nasa/GeneLab_Data_Processing/blob/master/GeneLab_Reference_Annotations/Pipeline_GL-DPPD-7110_Versions/GL-DPPD-7110/GL-DPPD-7110_annotations.csv), additional gene annotations associated with any Ensembl/TAIR IDs from the specified files that are not shared in the GeneLab annotations will not be added to the DGE output table(s).
206+
207+
<br>
208+
209+
**Optional Parameters:**
210+
211+
* `--skipVV` - skip the automated V&V processes (Default: the automated V&V processes are active)
212+
213+
* `--outputDir` - specifies the directory to save the raw and processed data files (Default: files are saved in the launch directory)
214+
215+
* `--force_single_end` - forces the analysis to use single end processing; for paired end datasets, this means only R1 is used; for single end datasets, this should have no effect
216+
217+
* `--stageLocal TRUE|FALSE` - TRUE = download the raw reads files for the OSD dataset indicated, FALSE = disable raw reads download and processing (Default: TRUE)
218+
219+
* `--referenceStorePath` - specifies the directory to store the Ensembl fasta and gtf files (Default: within the directory structure created by default in the launch directory)
220+
221+
* `--derivedStorePath` - specifies the directory to store the tool-specific indices created during processing (Default: within the directory structure created by default in the launch directory)
222+
223+
* `--runsheetPath` - specifies the path to a local runsheet (Default: a runsheet is automatically generated using the metadata on the GeneLab Repository for the OSD dataset being processed)
224+
> This is required when processing a non-OSD dataset as indicated in [Approach 3 above](#4c-approach-3-run-the-workflow-on-a-non-osd-dataset-using-a-user-created-runsheet)
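
The optional parameters above are appended to the same command used in the three approaches. As a hedged illustration, the sketch below assembles and prints such a command without running it, so it can be reviewed first. The helper name `build_run_cmd` and the example output path are assumptions, and passing `--skipVV` as a bare flag is assumed from its description above.

```bash
# Print (do not run) a workflow command that adds --outputDir and, optionally, --skipVV.
#   $1 = OSD accession, $2 = output directory, $3 = "skip" to disable V&V
# Hypothetical helper: not shipped with NF_RCP-F.
# Note: paths containing spaces are not handled by this simple string join.
build_run_cmd() {
    cmd="nextflow run NF_RCP-F_1.0.4/main.nf -profile singularity --gldsAccession $1 --outputDir $2"
    if [ "${3:-}" = "skip" ]; then
        cmd="$cmd --skipVV"
    fi
    echo "$cmd"
}

# Example call (the path /data/out is a placeholder):
# build_run_cmd OSD-194 /data/out skip
```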
225+
226+
<br>
227+
228+
**Additional Optional Parameters:**
229+
230+
All parameters listed above and additional optional arguments for the RCP workflow, including debug related options that may not be immediately useful for most users, can be viewed by running the following command:
231+
232+
```bash
233+
nextflow run NF_RCP-F_1.0.4/main.nf --help
234+
```
235+
236+
See `nextflow run -h` and [Nextflow's CLI run command documentation](https://nextflow.io/docs/latest/cli.html#run) for more options and details common to all Nextflow workflows.
237+
238+
<br>
239+
240+
---
241+
242+
### 5. Additional Output Files
243+
244+
The outputs from the Analysis Staging and V&V Pipeline Subworkflows are described below:
245+
> Note: The outputs from the RNASeq Consensus Pipeline Subworkflow are documented in the [GL-DPPD-7101-F](../../Pipeline_GL-DPPD-7101_Versions/GL-DPPD-7101-F.md) processing protocol.
246+
247+
**Analysis Staging Subworkflow**
248+
249+
- Output:
250+
- \*_bulkRNASeq_v1_runsheet.csv (table containing metadata required for processing, including the raw reads files location)
251+
- \*-ISA.zip (the ISA archive of the OSD dataset to be processed, downloaded from the OSDR)
252+
- \*_metadata_table.txt (table that includes additional information about the OSD dataset, not used for processing)
253+
254+
255+
**V&V Pipeline Subworkflow**
256+
257+
- Output:
258+
- VV_Logs/VV_log_final_GLbulkRNAseq.tsv (table containing V&V flags for all checks performed)
259+
- VV_Logs/VV_log_final_only_issues_GLbulkRNAseq.tsv (table containing V&V flags ONLY for checks that produced a flag code >= 30)
260+
- VV_Logs/VV_log_VV_RAW_READS_GLbulkRNAseq.tsv (table containing V&V flags ONLY for raw reads checks)
261+
- VV_Logs/VV_log_VV_TRIMMED_READS_GLbulkRNAseq.tsv (table containing V&V flags for trimmed reads checks ONLY)
262+
- VV_Logs/VV_log_VV_STAR_ALIGNMENTS_GLbulkRNAseq.tsv (table containing V&V flags for alignment file checks ONLY)
263+
- VV_Logs/VV_log_VV_RSEQC_GLbulkRNAseq.tsv (table containing V&V flags for RSeQC file checks ONLY)
264+
- VV_Logs/VV_log_VV_RSEM_COUNTS_GLbulkRNAseq.tsv (table containing V&V flags for RSEM raw count file checks ONLY)
265+
- VV_Logs/VV_log_VV_DESEQ2_ANALYSIS_GLbulkRNAseq.tsv (table containing V&V flags for DESeq2 Analysis output checks ONLY)
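
To narrow one of these tables down to the most serious flags, the rows can be filtered by flag code. The column layout of the log files is not described here, so the sketch below takes the name of the flag column as an argument and looks it up in the header row. The helper name `filter_vv_flags` and the column name `flag_code` in the example are assumptions; check the header of your own log file first.

```bash
# Print the header plus every row whose flag column is >= a minimum code.
#   $1 = path to a tab-separated V&V log
#   $2 = name of the column holding the flag code (assumed; check your file)
#   $3 = minimum flag code to keep (e.g. 50 for RED and HALT)
# Hypothetical helper: not shipped with NF_RCP-F.
filter_vv_flags() {
    awk -F '\t' -v col="$2" -v min="$3" '
        NR == 1 {
            # Locate the requested column in the header row.
            for (i = 1; i <= NF; i++) if ($i == col) idx = i
            if (!idx) { print "column not found: " col > "/dev/stderr"; exit 1 }
            print
            next
        }
        ($idx + 0) >= (min + 0)
    ' "$1"
}

# Example call (column name is an assumption):
# filter_vv_flags VV_Logs/VV_log_final_GLbulkRNAseq.tsv flag_code 50
```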
266+
267+
<br>
268+
269+
Standard Nextflow resource usage logs are also produced as follows:
270+
> Further details about these logs can also be found on [this Nextflow documentation page](https://www.nextflow.io/docs/latest/tracing.html#execution-report).
271+
272+
**Nextflow Resource Usage Logs**
273+
274+
- Output:
275+
- Resource_Usage/execution_report_{timestamp}.html (an html report that includes metrics about the workflow execution including computational resources and exact workflow process commands)
276+
- Resource_Usage/execution_timeline_{timestamp}.html (an html timeline for all processes executed in the workflow)
277+
- Resource_Usage/execution_trace_{timestamp}.txt (an execution tracing file that contains information about each process executed in the workflow, including: submission time, start time, completion time, cpu and memory used, machine-readable output)
278+
279+
<br>
