Commit 1716f14

Merge pull request #102 from olabiyi/DEV_Amplicon_Illumina_NF_conversion
Amplicon Illumina Nextflow workflow: Added Post-processing workflow
2 parents 3fc4b79 + e879110 commit 1716f14

47 files changed: +8186 -1 lines changed

Amplicon/Illumina/README.md

Lines changed: 1 addition & 0 deletions
@@ -30,6 +30,7 @@
 **Developed by:**
 Michael D. Lee (Mike.Lee@nasa.gov)
 **Maintained by:**
+Olabiyi A. Obayomi (olabiyi.a.obayomi@nasa.gov)
 Michael D. Lee (Mike.Lee@nasa.gov)
 Alexis Torres (alexis.torres@nasa.gov)
 Amanda Saravia-Butler (amanda.m.saravia-butler@nasa.gov)
Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
# Workflow change log

## [1.0.0](https://github.com/nasa/GeneLab_Data_Processing/tree/NF_AmpIllumina_1.0.0/Amplicon/Illumina/Workflow_Documentation/NF_AmpIllumina)
- Version that converted the workflow from Snakemake to Nextflow

<br>

---

<br>

Lines changed: 211 additions & 0 deletions
@@ -0,0 +1,211 @@
# Workflow Information and Usage Instructions

## General Workflow Info

### Implementation Tools

The current GeneLab Illumina amplicon sequencing data processing pipeline (AmpIllumina), [GL-DPPD-7104-B.md](../../Pipeline_GL-DPPD-7104_Versions/GL-DPPD-7104-B.md), is implemented as a [Nextflow](https://nextflow.io/) DSL2 workflow and utilizes [Singularity](https://docs.sylabs.io/guides/3.10/user-guide/introduction.html) containers or [conda](https://docs.conda.io/en/latest/) environments to install/run all tools. This workflow is run using the command line interface (CLI) of any unix-based system. While knowledge of creating workflows in Nextflow is not required to run the workflow as is, [the Nextflow documentation](https://nextflow.io/docs/latest/index.html) is a useful resource for users who want to modify and/or extend this workflow.

## Utilizing the Workflow

1. [Install Nextflow and Singularity](#1-install-nextflow-and-singularity)
   1a. [Install Nextflow](#1a-install-nextflow)
   1b. [Install Singularity](#1b-install-singularity)

2. [Download the workflow files](#2-download-the-workflow-files)

3. [Fetch Singularity Images](#3-fetch-singularity-images)

4. [Run the Workflow](#4-run-the-workflow)
   4a. [Approach 1: Run slurm jobs in singularity containers with OSD or GLDS accession as input](#4a-approach-1-run-slurm-jobs-in-singularity-containers-with-osd-or-glds-accession-as-input)
   4b. [Approach 2: Run slurm jobs in singularity containers with a csv file as input](#4b-approach-2-run-slurm-jobs-in-singularity-containers-with-a-csv-file-as-input)
   4c. [Approach 3: Run jobs locally in conda environments and specify the path to one or more existing conda environment(s)](#4c-approach-3-run-jobs-locally-in-conda-environments-and-specify-the-path-to-one-or-more-existing-conda-environments)
   4d. [Modify parameters and cpu resources in the nextflow config file](#4d-modify-parameters-and-cpu-resources-in-the-nextflow-config-file)

5. [Workflow outputs](#5-workflow-outputs)
   5a. [Main outputs](#5a-main-outputs)
   5b. [Resource logs](#5b-resource-logs)

6. [Post Processing](#6-post-processing)

<br>

---

### 1. Install Nextflow and Singularity

#### 1a. Install Nextflow

Nextflow can be installed either through [Anaconda](https://anaconda.org/bioconda/nextflow) or as documented on the [Nextflow documentation page](https://www.nextflow.io/docs/latest/getstarted.html).

> Note: If you want to install Anaconda, we recommend installing the Miniconda (Python 3) version appropriate for your system, as instructed by [Happy Belly Bioinformatics](https://astrobiomike.github.io/unix/conda-intro#getting-and-installing-conda).
>
> Once conda is installed on your system, you can install the latest version of Nextflow by running the following commands:
>
> ```bash
> conda install -c bioconda nextflow
> nextflow self-update
> ```

<br>

#### 1b. Install Singularity

Singularity is a container platform that allows usage of containerized software. This enables the GeneLab workflow to retrieve and use all software required for processing without the need to install the software directly on the user's system.

We recommend installing Singularity at a system-wide level as per the associated [documentation](https://docs.sylabs.io/guides/3.10/admin-guide/admin_quickstart.html).

> Note: Singularity is also available through [Anaconda](https://anaconda.org/conda-forge/singularity).

<br>

---

### 2. Download the workflow files

All files required for utilizing the NF_XXX GeneLab workflow for processing amplicon Illumina data are in the [workflow_code](workflow_code) directory. To get a copy of the latest *NF_XXX* version onto your system, download the code as a zip file from the release page, then unzip it by running the following commands:

```bash
wget https://github.com/nasa/GeneLab_Data_Processing/releases/download/NF_AmpIllumina/NF_AmpIllumina.zip
unzip NF_AmpIllumina.zip && cd NF_XXX-X_X.X.X
```

<br>

---

### 3. Fetch Singularity Images

Although Nextflow can fetch Singularity images from a URL, doing so may cause issues as detailed [here](https://github.com/nextflow-io/nextflow/issues/1210).

To avoid this issue, run the following command to fetch the Singularity images prior to running the NF_AmpIllumina workflow:

> Note: This command should be run in the location containing the `NF_AmpIllumina` directory that was downloaded in [step 2](#2-download-the-workflow-files) above.

```bash
bash ./bin/prepull_singularity.sh nextflow.config
```

Once complete, a `singularity` folder containing the Singularity images will be created. Run the following command to export this folder as a Nextflow configuration environment variable to ensure Nextflow can locate the fetched images:

```bash
export NXF_SINGULARITY_CACHEDIR=$(pwd)/singularity
```
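
As a quick sanity check before launching the workflow, the sketch below confirms that the exported variable points at an existing folder and lists its contents. It is an illustration only and not part of the workflow; it assumes it is run from the directory in which the `singularity` folder was created.

```bash
# Point Nextflow at the pre-pulled images (same export as in the step above).
export NXF_SINGULARITY_CACHEDIR="$(pwd)/singularity"

# Report whether the cache folder exists and, if so, list what it contains.
if [ -d "${NXF_SINGULARITY_CACHEDIR}" ]; then
    echo "Singularity cache found: ${NXF_SINGULARITY_CACHEDIR}"
    ls -1 "${NXF_SINGULARITY_CACHEDIR}"
else
    echo "Singularity cache not found: ${NXF_SINGULARITY_CACHEDIR}" >&2
fi
```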

<br>

---

### 4. Run the Workflow

For options and detailed help on how to run the workflow, run the following command:

```bash
nextflow run main.nf --help
```

> Note: Nextflow commands use both single hyphen arguments (e.g. -help) that denote general nextflow arguments and double hyphen arguments (e.g. --input_file) that denote workflow specific parameters. Take care to use the proper number of hyphens for each argument.

<br>

#### 4a. Approach 1: Run slurm jobs in singularity containers with OSD or GLDS accession as input

```bash
nextflow run main.nf -resume -profile slurm,singularity --accession GLDS-487 --target_region 16S
```

<br>

#### 4b. Approach 2: Run slurm jobs in singularity containers with a csv file as input

```bash
nextflow run main.nf -resume -profile slurm,singularity --input_file PE_file.csv --target_region 16S --F_primer AGAGTTTGATCCTGGCTCAG --R_primer CTGCCTCCCGTAGGAGT
```

<br>

#### 4c. Approach 3: Run jobs locally in conda environments and specify the path to one or more existing conda environment(s)

```bash
nextflow run main.nf -resume -profile conda --input_file SE_file.csv --target_region 16S --F_primer AGAGTTTGATCCTGGCTCAG --R_primer CTGCCTCCCGTAGGAGT --conda.qc <path/to/existing/conda/environment>
```

<br>

**Required Parameters For All Approaches:**

* `run main.nf` - Instructs Nextflow to run the NF_XXX workflow
* `-resume` - Resumes workflow execution using previously cached results
* `-profile` - Specifies the configuration profile(s) to load; `singularity` instructs Nextflow to set up and use Singularity for all software called in the workflow
* `--target_region` - Specifies the amplicon target region to be analyzed: 16S, 18S or ITS.

*Required only if you would like to pull and process data directly from OSDR*

* `--accession` - A GeneLab / OSD accession number, e.g. GLDS-487.

*Required only if --accession is not passed as an argument*

* `--input_file` - A 4-column (single-end) or 5-column (paired-end) input csv file with the following headers (sample_id, forward, [reverse,] paired, groups). Please see the sample [SE_file.csv](workflow_code/SE_file.csv) and [PE_file.csv](workflow_code/PE_file.csv) in this repository for examples on how to format this file.

* `--F_primer` - Forward primer sequence.

* `--R_primer` - Reverse primer sequence.

> See `nextflow run -h` and [Nextflow's CLI run command documentation](https://nextflow.io/docs/latest/cli.html#run) for more options and details on how to run nextflow.

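
To make the `--input_file` format concrete, the sketch below writes a two-sample paired-end csv file and checks that its header matches one of the two layouts described above and that every row has as many fields as the header. The helper name `check_input_csv` and the file name `PE_example.csv` are hypothetical and not part of the workflow; the read paths are placeholders.

```bash
# Hypothetical helper: validate the header and per-row field count of an input csv file.
check_input_csv() {
    awk -F',' '
        BEGIN { bad = 0 }
        NR == 1 {
            # Accept only the paired-end (5-column) or single-end (4-column) header.
            if ($0 != "sample_id,forward,reverse,paired,groups" &&
                $0 != "sample_id,forward,paired,groups") {
                print "unexpected header: " $0
                bad = 1
            }
            ncol = NF
            next
        }
        NF != ncol {
            print "line " NR ": expected " ncol " fields, found " NF
            bad = 1
        }
        END { exit bad }
    ' "$1"
}

# Write a paired-end example; replace the placeholder paths with real ones.
cat > PE_example.csv <<'EOF'
sample_id,forward,reverse,paired,groups
Sample-1,/path/to/raw-reads/Sample-1_R1_raw.fastq.gz,/path/to/raw-reads/Sample-1_R2_raw.fastq.gz,true,A
Sample-2,/path/to/raw-reads/Sample-2_R1_raw.fastq.gz,/path/to/raw-reads/Sample-2_R2_raw.fastq.gz,true,B
EOF

if check_input_csv PE_example.csv; then
    echo "PE_example.csv looks well-formed"
fi
```

The check covers layout only; it does not verify that the listed read files exist.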
<br>

#### 4d. Modify parameters and cpu resources in the nextflow config file

Additionally, the parameters and workflow resources can be directly specified in the nextflow.config file. For detailed instructions on how to modify and set parameters in the nextflow.config file, please see the [documentation here](https://www.nextflow.io/docs/latest/config.html).

Once you've downloaded the workflow template, you can modify the parameters in the `params` scope and cpus/memory requirements in the `process` scope in your downloaded version of the [nextflow.config](workflow_code/nextflow.config) file as needed in order to match your dataset and system setup. For example, you can directly set the full paths to available conda environments in the `conda` scope within the `params` scope. If necessary, modify each variable in the [nextflow.config](workflow_code/nextflow.config) file so that it is consistent with the study you want to process and the machine you're using.
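
As an illustration only, the sketch below writes a small override file rather than editing `nextflow.config` in place. The file name `custom.config` and every value in it are assumptions; the `target_region` and `conda.qc` parameter names are taken from the command-line options shown above, and `cpus` and `memory` are standard Nextflow process directives.

```bash
# Write a hypothetical override file; adjust every value to your dataset and system.
cat > custom.config <<'EOF'
params {
    target_region = "16S"
    conda {
        // Placeholder path to an existing conda environment
        qc = "/path/to/existing/conda/environment"
    }
}

process {
    // Example default resources requested for every process
    cpus   = 2
    memory = '4 GB'
}
EOF

echo "Wrote $(pwd)/custom.config"
```

Such a file can be passed with Nextflow's `-c` option, for example `nextflow run main.nf -c custom.config -resume -profile slurm,singularity --accession GLDS-487`. Settings supplied this way take precedence over those in `nextflow.config`, while parameters given on the command line override both.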

<br>

---

### 5. Workflow outputs

#### 5a. Main outputs

The outputs from this pipeline are documented in the [GL-DPPD-7104-B](../../Pipeline_GL-DPPD-7104_Versions/GL-DPPD-7104-B.md) processing protocol.

#### 5b. Resource logs

Standard nextflow resource usage logs are also produced as follows:

- Output:
  - `Resource_Usage/execution_report_{timestamp}.html` (an html report that includes metrics about the workflow execution including computational resources and exact workflow process commands)
  - `Resource_Usage/execution_timeline_{timestamp}.html` (an html timeline for all processes executed in the workflow)
  - `Resource_Usage/execution_trace_{timestamp}.txt` (an execution tracing file that contains information about each process executed in the workflow, including: submission time, start time, completion time, cpu and memory used, machine-readable output)

> Further details about these logs can also be found within [this Nextflow documentation page](https://www.nextflow.io/docs/latest/tracing.html#execution-report).

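
Because each run adds a new timestamped file, it can be handy to locate the most recent trace. The sketch below is one way to do that; the helper name `latest_trace` is hypothetical, and it assumes the logs are in a `Resource_Usage` folder in the current directory and that the trace uses Nextflow's default tab-separated layout.

```bash
# Hypothetical helper: print the most recently modified trace file in a directory.
latest_trace() {
    newest=""
    for f in "$1"/execution_trace_*.txt; do
        # Skip the literal pattern when no trace file matches.
        [ -e "$f" ] || continue
        if [ -z "$newest" ] || [ "$f" -nt "$newest" ]; then
            newest="$f"
        fi
    done
    printf '%s\n' "$newest"
}

trace_file="$(latest_trace Resource_Usage)"
if [ -n "${trace_file}" ]; then
    # Show the header and the first few process records.
    head -n 5 "${trace_file}"
else
    echo "No trace file found in Resource_Usage" >&2
fi
```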
<br>

---

### 6. Post Processing

For options and detailed help on how to run the post-processing workflow, run the following command:

```bash
nextflow run post_processing.nf --help
```

To generate a README file, a protocols file, an md5sums table and a file association table after running the processing workflow successfully, modify and set the parameters in [post_processing.config](workflow_code/post_processing.config) then run the following command:

```bash
nextflow -C post_processing.config run post_processing.nf -resume -profile slurm,singularity
```

The outputs of the run will be in a directory called `Post_Processing` by default and they are as follows:
- `Post_processing/FastQC_Outputs/filtered_multiqc_GLAmpSeq_report.zip` (Filtered sequence multiqc report with paths purged)
- `Post_processing/FastQC_Outputs/raw_multiqc_GLAmpSeq_report.zip` (Raw sequence multiqc report with paths purged)
- `Post_processing/<GLDS_accession>_-associated-file-names.tsv` (File association table for curation)
- `Post_processing/<GLDS_accession>_amplicon-validation.log` (Automatic verification and validation log file)
- `Post_processing/processed_md5sum_GLAmpSeq.tsv` (md5sums for the files to be released on OSDR)
- `Post_processing/processing_info_GLAmpSeq.zip` (Zip file containing all files used to run the workflow and required logs with paths purged)
- `Post_processing/protocol.txt` (File describing the methods used by the workflow)
- `Post_processing/README_GLAmpSeq.txt` (README file listing and describing the outputs of the workflow)
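
One way to confirm that a post-processing run produced everything listed above is sketched below. The helper name `check_post_processing` is hypothetical and not part of the workflow; the output directory is passed as an argument, and the two accession-prefixed files are matched with a wildcard since the accession differs per dataset.

```bash
# Hypothetical helper: report which expected post-processing outputs are missing.
check_post_processing() {
    outdir="$1"
    missing=0
    for pattern in \
        "FastQC_Outputs/filtered_multiqc_GLAmpSeq_report.zip" \
        "FastQC_Outputs/raw_multiqc_GLAmpSeq_report.zip" \
        "*associated-file-names.tsv" \
        "*amplicon-validation.log" \
        "processed_md5sum_GLAmpSeq.tsv" \
        "processing_info_GLAmpSeq.zip" \
        "protocol.txt" \
        "README_GLAmpSeq.txt"
    do
        # Leaving $pattern unquoted lets the shell expand the accession wildcard.
        set -- "$outdir"/$pattern
        if [ ! -e "$1" ]; then
            echo "missing: $outdir/$pattern"
            missing=1
        fi
    done
    return "$missing"
}

if check_post_processing Post_processing; then
    echo "All expected post-processing outputs are present"
else
    echo "Some post-processing outputs are missing (see above)" >&2
fi
```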
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
sample_id,forward,reverse,paired,groups
Sample-1,/path/to/raw-reads/Sample-1_R1_raw.fastq.gz,/path/to/raw-reads/Sample-1_R2_raw.fastq.gz,true,A
Sample-2,/path/to/raw-reads/Sample-2_R1_raw.fastq.gz,/path/to/raw-reads/Sample-2_R2_raw.fastq.gz,true,B
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
sample_id,forward,paired,groups
Sample-1,/path/to/raw-reads/Sample-1_R1_raw.fastq.gz,false,A
Sample-2,/path/to/raw-reads/Sample-2_R1_raw.fastq.gz,false,B
