Commit 5b5cfa3
Added README.md
1 parent d2cfbdf commit 5b5cfa3

File tree: 3 files changed (+260, −153 lines)

README.md: 122 additions & 50 deletions
# Workflow Information and Usage Instructions

## General Workflow Info

### Implementation Tools

The current GeneLab Illumina metagenomics sequencing data processing pipeline (MGIllumina), [GL-DPPD-7107.md](../../Pipeline_GL-DPPD-7107_Versions/GL-DPPD-7107.md), is implemented as a [Nextflow](https://nextflow.io/) DSL2 workflow and utilizes [Singularity](https://docs.sylabs.io/guides/3.10/user-guide/introduction.html) containers or [conda](https://docs.conda.io/en/latest/) environments to install/run all tools. This workflow is run using the command line interface (CLI) of any unix-based system. While knowledge of creating workflows in Nextflow is not required to run the workflow as is, [the Nextflow documentation](https://nextflow.io/docs/latest/index.html) is a useful resource for users who want to modify and/or extend this workflow.
> **Note on reference databases**
> Many reference databases are relied upon throughout this workflow. They will be installed and set up automatically the first time the workflow is run. Altogether, once installed and unpacked, they will take up about 340 GB of storage, but they may also require up to 500 GB during installation and initial unpacking, so be sure there is enough room on your system before running the workflow.
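Before the first run, the available space can be checked from the command line; a minimal sketch (run it from the directory that will hold the reference databases):

```shell
# Show free space on the filesystem that will hold the reference databases;
# compare the "Avail" column against the ~500 GB needed during initial setup.
df -h .
```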
## Utilizing the Workflow

1. [Install nextflow, conda and singularity](#1-install-nextflow-conda-and-singularity)
   1a. [Install nextflow and conda](#1a-install-nextflow-and-conda)
   1b. [Install singularity](#1b-install-singularity)

2. [Download the workflow files](#2-download-the-workflow-files)

3. [Run the workflow](#3-run-the-workflow)
   3a. [Approach 1: Run slurm jobs in singularity containers with OSD accession as input](#3a-approach-1-run-slurm-jobs-in-singularity-containers-with-osd-accession-as-input)
   3b. [Approach 2: Run slurm jobs in singularity containers with a csv file as input](#3b-approach-2-run-slurm-jobs-in-singularity-containers-with-a-csv-file-as-input)
   3c. [Approach 3: Run jobs locally in conda environments and specify the path to one or more existing conda environments](#3c-approach-3-run-jobs-locally-in-conda-environments-and-specify-the-path-to-one-or-more-existing-conda-environments)
   3d. [Modify parameters and cpu resources in the nextflow config file](#3d-modify-parameters-and-cpu-resources-in-the-nextflow-config-file)

4. [Workflow outputs](#4-workflow-outputs)
   4a. [Main outputs](#4a-main-outputs)
   4b. [Resource logs](#4b-resource-logs)

<br>
### 1. Install nextflow, conda and singularity

#### 1a. Install nextflow and conda

Nextflow can be installed either through [Anaconda](https://anaconda.org/bioconda/nextflow) or as documented on the [Nextflow documentation page](https://www.nextflow.io/docs/latest/getstarted.html).

> Note: If you want to install Anaconda, we recommend installing a Miniconda, Python3 version appropriate for your system, as instructed by [Happy Belly Bioinformatics](https://astrobiomike.github.io/unix/conda-intro#getting-and-installing-conda).

Once conda is installed on your system, we recommend installing [mamba](https://github.com/mamba-org/mamba#mamba), as it generally allows for much faster conda installations:

```bash
conda install -n base -c conda-forge mamba
```

> You can read a quick intro to mamba [here](https://astrobiomike.github.io/unix/conda-intro#bonus-mamba-no-5).

Once mamba is installed, you can install the genelab-utils conda package, which includes Nextflow, with the following command:

```bash
mamba create -n genelab-utils -c conda-forge -c bioconda -c defaults -c astrobiomike genelab-utils
```

The environment then needs to be activated:

```bash
conda activate genelab-utils

# Test that nextflow is installed
nextflow -h

# Update nextflow
nextflow self-update
```

<br>
#### 1b. Install singularity

Singularity is a container platform that allows usage of containerized software. This enables the GeneLab workflow to retrieve and use all software required for processing without the need to install the software directly on the user's system.

We recommend installing singularity on a system-wide level as per the associated [documentation](https://docs.sylabs.io/guides/3.10/admin-guide/admin_quickstart.html).

<br>
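After installation, you can verify that singularity is available on your PATH; a minimal check (prints the version if found, a hint otherwise):

```shell
# Verify singularity is available; print its version, or a hint if it is missing
if command -v singularity >/dev/null 2>&1; then
    singularity --version
else
    echo "singularity not found on PATH - see the admin quickstart linked above"
fi
```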
### 2. Download the workflow files

All files required for utilizing the NF_XXX GeneLab workflow for processing metagenomics Illumina data are in the [workflow_code](workflow_code) directory. To get a copy of the latest *NF_XXX* version onto your system, the code can be downloaded as a zip file from the release page, then unzipped, by running the following commands:

```bash
wget https://github.com/nasa/GeneLab_Data_Processing/releases/download/NF_MGIllumina/NF_MGIllumina.zip
unzip NF_MGIllumina.zip && cd NF_XXX-X_X.X.X
```

OR by using the genelab-utils conda package:

```bash
GL-get-workflow MG-Illumina
```

<br>
### 3. Run the Workflow

For options and detailed help on how to run the workflow, run the following command:

```bash
nextflow run main.nf --help
```

> Note: Nextflow commands use both single-hyphen arguments (e.g. `-help`) that denote general Nextflow arguments and double-hyphen arguments (e.g. `--csv_file`) that denote workflow-specific parameters. Take care to use the proper number of hyphens for each argument.

<br>
#### 3a. Approach 1: Run slurm jobs in singularity containers with OSD accession as input

```bash
nextflow run main.nf -resume -profile slurm,singularity --GLDS_accession OSD-574
```

<br>

#### 3b. Approach 2: Run slurm jobs in singularity containers with a csv file as input

```bash
nextflow run main.nf -resume -profile slurm,singularity --csv_file PE_file.csv
```

<br>
#### 3c. Approach 3: Run jobs locally in conda environments and specify the path to one or more existing conda environment(s)

```bash
nextflow run main.nf -resume -profile conda --csv_file SE_file.csv --conda.qc <path/to/existing/conda/environment>
```

<br>
**Required Parameters For All Approaches:**

* `run main.nf` – Instructs Nextflow to run the NF_XXX workflow
* `-resume` – Resumes workflow execution using previously cached results
* `-profile` – Specifies the configuration profile(s) to load; `singularity` instructs Nextflow to set up and use singularity for all software called in the workflow

*Required only if you would like to pull and process data directly from OSDR*

* `--GLDS_accession` – A GeneLab / OSD accession number, e.g. OSD-574

*Required only if --GLDS_accession is not passed as an argument*

* `--csv_file` – A 3-column (single-end) or 4-column (paired-end) input csv file (sample_id, forward, [reverse,] paired). Please see the sample `SE_file.csv` and `PE_file.csv` in this repository for examples on how to format this file.
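For illustration, a paired-end input csv following the 4-column layout described above might look like the sketch below. The sample names, paths, and the values in the `paired` column are hypothetical; the bundled `PE_file.csv` is the authoritative template.

```csv
sample_id,forward,reverse,paired
Sample-1,/path/to/Sample-1_R1_raw.fastq.gz,/path/to/Sample-1_R2_raw.fastq.gz,true
Sample-2,/path/to/Sample-2_R1_raw.fastq.gz,/path/to/Sample-2_R2_raw.fastq.gz,true
```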
> See `nextflow run -h` and [Nextflow's CLI run command documentation](https://nextflow.io/docs/latest/cli.html#run) for more options and details on how to run nextflow.

<br>
#### 3d. Modify parameters and cpu resources in the nextflow config file

Additionally, the parameters and workflow resources can be directly specified in the nextflow.config file. For detailed instructions on how to modify and set parameters in the nextflow.config file, please see the [documentation here](https://www.nextflow.io/docs/latest/config.html).

Once you've downloaded the workflow template, you can modify the parameters in the `params` scope and cpus/memory requirements in the `process` scope in your downloaded version of the [nextflow.config](workflow_code/nextflow.config) file as needed in order to match your dataset and system setup. For example, you can directly set the full paths to available conda environments in the `conda` scope within the `params` scope. Additionally, if necessary, you'll need to modify each variable in the nextflow.config file to be consistent with the study you want to process and the machine you're using.
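As a sketch of the kind of edits described above (the values and parameter names here are hypothetical; the authoritative scopes and names live in the bundled [nextflow.config](workflow_code/nextflow.config)):

```groovy
// Hypothetical sketch: adjusting inputs and resources in nextflow.config
params {
    csv_file = "PE_file.csv"          // input sample sheet
    conda {
        qc = "/path/to/existing/env"  // reuse an existing conda environment
    }
}

process {
    cpus   = 4        // per-process cpu allotment
    memory = '8 GB'   // per-process memory limit
}
```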
### 4. Workflow outputs

#### 4a. Main outputs

The outputs from this pipeline are documented in the [GL-DPPD-7107](../../Pipeline_GL-DPPD-7107_Versions/GL-DPPD-7107.md) processing protocol.

#### 4b. Resource logs

Standard Nextflow resource usage logs are also produced as follows:

- Output:
  - Resource_Usage/execution_report_{timestamp}.html (an html report that includes metrics about the workflow execution including computational resources and exact workflow process commands)
  - Resource_Usage/execution_timeline_{timestamp}.html (an html timeline for all processes executed in the workflow)
  - Resource_Usage/execution_trace_{timestamp}.txt (an execution tracing file that contains information about each process executed in the workflow, including: submission time, start time, completion time, cpu and memory used, machine-readable output)

> Further details about these logs can also be found within [this Nextflow documentation page](https://www.nextflow.io/docs/latest/tracing.html#execution-report).
