Commit 620598f

Metagenomics Illumina Nextflow conversion (#134)
* improved checkm performance by running separately on every bin
* fixed typo in config
* updated the nextflow version
* changed the default value of accession input parameter
* reverted AmpIllumina pipeline doc to remove updates
* added launchDir variable
* made format changes to config files
* updated README
* updated MGIllumina pipeline doc after CCB approval
* added launch scripts and fixed bugs
* deleted cluster path
* fixed humman utilities mounting bug
* commented out singularity cache_dir
* updated the content of processing_info.zip to match other GeneLab Nextflow workflows
1 parent 703f469 commit 620598f

File tree

19 files changed: +666 additions, -471 deletions
Amplicon/Illumina/Pipeline_GL-DPPD-7104_Versions/GL-DPPD-7104-B.md

4 additions & 4 deletions

@@ -38,7 +38,7 @@ Amanda Saravia-Butler (GeneLab Data Processing Lead)
 <!-- Included R packages -->
 - Assay-specific suffixes were added where needed for GeneLab repo ("GLAmpSeq")
-- The ITS UNITE reference database used was updated to "UNITE_v2023_July2023.RData", from http://www2.decipher.codes/Classification/TrainingSets/
+- The ITS UNITE reference database used was updated to "UNITE_v2023_July2023.RData", from https://www2.decipher.codes/data/Downloads/TrainingSets/
 - Several program versions were updated (all versions listed in [Software used](#software-used) below)

 ---

@@ -103,8 +103,8 @@ Amanda Saravia-Butler (GeneLab Data Processing Lead)
 |Program used| Database| Relevant Links|
 |:-----|:-----:|--------:|
-|DECIPHER| SILVA SSU r138 | [http://www2.decipher.codes/Classification/TrainingSets/SILVA_SSU_r138_2019.RData](http://www2.decipher.codes/Classification/TrainingSets/)|
-|DECIPHER| UNITE v2020 | [http://www2.decipher.codes/Classification/TrainingSets/UNITE_v2020_February2020.RData](http://www2.decipher.codes/Classification/TrainingSets/)|
+|DECIPHER| SILVA SSU r138 | [https://www2.decipher.codes/data/Downloads/TrainingSets/SILVA_SSU_r138_2019.RData](https://www2.decipher.codes/data/Downloads/TrainingSets/)|
+|DECIPHER| UNITE v2023 | [https://www2.decipher.codes/data/Downloads/TrainingSets/UNITE_v2023_July2023.RData](https://www2.decipher.codes/data/Downloads/TrainingSets/)|

 ---

@@ -443,7 +443,7 @@ dna <- DNAStringSet(getSequences(seqtab.nochim))
 Downloading the reference R taxonomy object:
 ```R
-download.file(url="http://www2.decipher.codes/Classification/TrainingSets/SILVA_SSU_r138_2019.RData", destfile="SILVA_SSU_r138_2019.RData")
+download.file(url="https://www2.decipher.codes/data/Downloads/TrainingSets/SILVA_SSU_r138_2019.RData", destfile="SILVA_SSU_r138_2019.RData")
 ```

 **Parameter Definitions:**

Metagenomics/Illumina/Pipeline_GL-DPPD-7107_Versions/GL-DPPD-7107-A.md

1 addition & 1 deletion

@@ -4,7 +4,7 @@
 ---

-**Date:** October XX, 2024
+**Date:** October 28, 2024
 **Revision:** -A
 **Document Number:** GL-DPPD-7107
Metagenomics/Illumina/Workflow_Documentation/NF_MGIllumina-A/README.md

9 additions & 6 deletions

@@ -51,7 +51,10 @@ Nextflow can be installed either through [Anaconda](https://anaconda.org/biocond
 > conda install -c bioconda nextflow
 > nextflow self-update
 > ```
-
+> You may also install [mamba](https://mamba.readthedocs.io/en/latest/index.html) which is a faster implementation of conda like so:
+> ```bash
+> conda install -c conda-forge mamba
+> ```
 <br>

 #### 1b. Install Singularity

@@ -111,7 +114,7 @@ For options and detailed help on how to run the workflow, run the following comm
 nextflow run main.nf --help
 ```

-> Note: Nextflow commands use both single hyphen arguments (e.g. -help) that denote general nextflow arguments and double hyphen arguments (e.g. --csv_file) that denote workflow specific parameters. Take care to use the proper number of hyphens for each argument.
+> Note: Nextflow commands use both single hyphen arguments (e.g. -help) that denote general nextflow arguments and double hyphen arguments (e.g. --input_file) that denote workflow specific parameters. Take care to use the proper number of hyphens for each argument.

 <br>

@@ -126,15 +129,15 @@ nextflow run main.nf -resume -profile slurm,singularity --accession OSD-574
 #### 4b. Approach 2: Run slurm jobs in singularity containers with a csv file as input

 ```bash
-nextflow run main.nf -resume -profile slurm,singularity --csv_file PE_file.csv
+nextflow run main.nf -resume -profile slurm,singularity --input_file PE_file.csv
 ```

 <br>

 #### 4c. Approach 3: Run jobs locally in conda environments and specify the path to one or more existing conda environment(s)

 ```bash
-nextflow run main.nf -resume -profile conda --csv_file SE_file.csv --conda.qc <path/to/existing/conda/environment>
+nextflow run main.nf -resume -profile mamba --input_file SE_file.csv --conda_megahit <path/to/existing/conda/environment>
 ```

 <br>

@@ -153,7 +156,7 @@ nextflow run main.nf -resume -profile conda --csv_file SE_file.csv --conda.qc <p
 *Required only if --accession is not passed as an argument*

-* `--csv_file` – A single-end or paired-end input csv file containing assay metadata for each sample, including sample_id, forward, reverse, and/or paired. Please see the sample [SE_file.csv](workflow_code/SE_file.csv) and [PE_file.csv](workflow_code/PE_file.csv) in this repository for examples on how to format this file.
+* `--input_file` – A single-end or paired-end input csv file containing assay metadata for each sample, including sample_id, forward, reverse, and/or paired. Please see the sample [SE_file.csv](workflow_code/SE_file.csv) and [PE_file.csv](workflow_code/PE_file.csv) in this repository for examples on how to format this file.

 > See `nextflow run -h` and [Nextflow's CLI run command documentation](https://nextflow.io/docs/latest/cli.html#run) for more options and details on how to run nextflow.

@@ -163,7 +166,7 @@ nextflow run main.nf -resume -profile conda --csv_file SE_file.csv --conda.qc <p
 Additionally, the parameters and workflow resources can be directly specified in the nextflow.config file. For detailed instructions on how to modify and set parameters in the nextflow.config file, please see the [documentation here](https://www.nextflow.io/docs/latest/config.html).

-Once you've downloaded the workflow template, you can modify the parameters in the `params` scope and cpus/memory requirements in the `process` scope in your downloaded version of the [nextflow.config](workflow_code/nextflow.config) file as needed in order to match your dataset and system setup. For example, you can directly set the the full paths to available conda environments in the `conda` scope within the `params` scope. Additionally, if necessary, you'll need to modify each variable in the [nextflow.config](workflow_code/nextflow.config) file to be consistent with the study you want to process and the machine you're using.
+Once you've downloaded the workflow template, you can modify the parameters in the `params` scope and cpus/memory requirements in the `process` scope in your downloaded version of the [nextflow.config](workflow_code/nextflow.config) file as needed in order to match your dataset and system setup. Additionally, if necessary, you'll need to modify each variable in the [nextflow.config](workflow_code/nextflow.config) file to be consistent with the study you want to process and the machine you're using.

 <br>
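The diff above renames `--csv_file` to `--input_file` and describes the runsheet columns (sample_id, forward, reverse, and/or paired). A minimal sketch of what such a paired-end runsheet might look like, using made-up sample and file names (the repository's own PE_file.csv is the authoritative example):

```bash
# Hypothetical paired-end runsheet matching the columns the README describes;
# all sample and fastq file names here are invented for illustration.
cat > PE_file.csv <<'EOF'
sample_id,forward,reverse,paired
Sample-1,Sample-1_R1_raw.fastq.gz,Sample-1_R2_raw.fastq.gz,true
Sample-2,Sample-2_R1_raw.fastq.gz,Sample-2_R2_raw.fastq.gz,true
EOF

# The renamed workflow parameter would then be passed like so
# (printed rather than executed, since nextflow is not assumed here):
echo "nextflow run main.nf -resume -profile slurm,singularity --input_file PE_file.csv"
```

Note the double hyphen on `--input_file` (a workflow parameter) versus the single hyphens on `-resume` and `-profile` (general Nextflow arguments), as the Note in the diff warns.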

Metagenomics/Illumina/Workflow_Documentation/NF_MGIllumina-A/workflow_code/bin/clean-paths.sh

16 additions & 3 deletions

@@ -13,12 +13,25 @@ if [ -s t ]; then
 exit
 fi

-
+FILE=$1
 ROOT_DIR=$(echo $2 | awk '{N=split($0,a,"/"); for(i=0; i < N-1; i++) printf "%s/", a[i]}' | sed 's|//|/|')

+# Remove path in paired end runsheet
+if [ `awk 'NR==1{print}' ${FILE} | grep -c reverse` -gt 0 ]; then
+
+    awk 'BEGIN{FS=OFS=","} NR==1{print} NR>1{split($2, f, "/");split($3, r, "/"); print $1,f[length(f)],r[length(r)],$4}' ${FILE} > temp && mv temp ${FILE}
+
+# Remove path in single end runsheet
+elif [ `awk 'NR==1{print}' ${FILE} | grep -c forward` -gt 0 ]; then
+
+    awk 'BEGIN{FS=OFS=","} NR==1{print} NR>1{split($2, f, "/"); print $1,f[length(f)],$3}' ${FILE} > temp && mv temp ${FILE}
+
+fi

-sed -E 's|.*/GLDS_Datasets/(.+)|\1|g' ${1} \
+sed -E 's|.*/GLDS_Datasets/(.+)|\1|g' ${FILE} \
 | sed -E 's|.+/miniconda.+/envs/[^/]*/||g' \
 | sed -E 's|/[^ ]*/GLDS-|GLDS-|g' \
 | sed -E 's|/[a-z]{6}/[^ ]*|<path-removed-for-security-purposes>|g' \
-| sed -E "s|${ROOT_DIR}||g" > t && mv t ${1}
+| sed -E "s|${ROOT_DIR}||g" > t && mv t ${FILE}
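The paired-end branch added above strips directory components from the forward and reverse fastq paths, leaving only basenames in the runsheet. A small standalone sketch of that awk one-liner, run against a hypothetical runsheet (paths and sample names invented):

```bash
# Hypothetical paired-end runsheet with absolute fastq paths (names made up)
cat > runsheet.csv <<'EOF'
sample_id,forward,reverse,paired
S1,/cluster/data/S1_R1.fastq.gz,/cluster/data/S1_R2.fastq.gz,true
EOF

# Same logic as the paired-end branch in clean-paths.sh: keep the header,
# then split fields 2 and 3 on "/" and print only the last path component
awk 'BEGIN{FS=OFS=","} NR==1{print} NR>1{split($2, f, "/"); split($3, r, "/"); print $1, f[length(f)], r[length(r)], $4}' runsheet.csv
```

The data row comes out as `S1,S1_R1.fastq.gz,S1_R2.fastq.gz,true`, which is why the script only needs a header check (`grep -c reverse` versus `grep -c forward`) to pick the paired-end or single-end column layout.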

Metagenomics/Illumina/Workflow_Documentation/NF_MGIllumina-A/workflow_code/bin/get-cov-and-depth.sh

0 additions & 67 deletions — this file was deleted.

New file: 102 additions & 0 deletions

```bash
#!/usr/bin/env bash
set -euo pipefail

# Script to launch a nextflow workflow on a slurm cluster

# Usage: bash ./launch.sh [mode] [main.nf] [config] '[extra arguments]'
# Examples
#
# Processing:
#   bash ./launch.sh processing path/to/main.nf path/to/nextflow.config '--accession OSD-574'
#
# Postprocessing:
#   bash ./launch.sh post_processing path/to/post_processing.nf path/to/post_processing.config \
#     '--name FirstName M. LastName --email email@domain.com --GLDS_accession GLDS-574 --OSD_accession OSD-574 --isa_zip ../GeneLab/OSD-574_metadata_GLDS-574-ISA.zip --runsheet ../GeneLab/GLfile.csv'

MODE=${1:-''}    # Script run mode, i.e. processing or post_processing
MAIN=${2:-''}    # Path to the main.nf or post_processing.nf nextflow script for processing and post_processing, respectively
CONFIG=${3:-''}  # Nextflow config file, i.e. nextflow.config or post_processing.config
EXTRA=${4:-''}   # Extra arguments to the nextflow run command

#==============================================================================
# SETUP START
#==============================================================================
eval "$(conda shell.bash hook)"
conda activate /path/to/conda/envs/nextflow
export NXF_SINGULARITY_CACHEDIR=<PATH TO SINGULARITY IMAGES>
export TOWER_ACCESS_TOKEN=<YOUR ACCESS TOKEN>
export TOWER_WORKSPACE_ID=<YOUR WORKSPACE ID>

#==============================================================================
# UMASK CONFIGURATION
#==============================================================================
echo "Setting umask to enable group read-access by default"
umask u=rwx,g=rx
echo "Umask settings for this launch: $(umask -S)"

#==============================================================================
# NEXTFLOW COMMAND START
#==============================================================================
if [ "${MODE}" == "processing" ]; then

    RUN_NAME=MAIN_$(date +%Y%m%d%H%M%S)

    RUN_COMMAND="nextflow -C ${CONFIG} \
                 run \
                 -name ${RUN_NAME} \
                 ${MAIN} \
                 -resume \
                 -profile slurm,singularity \
                 -with-tower \
                 -process.queue 'normal' \
                 -ansi-log false \
                 ${EXTRA}"

    echo "Running command: ${RUN_COMMAND}"
    echo ""
    [ -d processing_scripts ] || mkdir processing_scripts
    eval ${RUN_COMMAND} && echo ${RUN_COMMAND} > processing_scripts/command.txt

    # Save the nextflow log to a file
    echo "Creating Nextflow processing info file..."
    nextflow log ${RUN_NAME} -f name,script > processing_scripts/nextflow_processing_info_GLmetagenomics.txt
    echo nextflow log ${RUN_NAME} -f name,script >> processing_scripts/nextflow_processing_info_GLmetagenomics.txt
    echo "Nextflow processing info written to processing_scripts/nextflow_processing_info_GLmetagenomics.txt"

elif [ "${MODE}" == "post_processing" ]; then

    RUN_NAME=POST_$(date +%Y%m%d%H%M%S)

    RUN_COMMAND="nextflow -C ${CONFIG} \
                 run \
                 -name ${RUN_NAME} \
                 ${MAIN} \
                 -resume \
                 -profile slurm,singularity \
                 -with-tower \
                 -process.queue 'normal' \
                 -ansi-log false \
                 ${EXTRA}"

    echo "Running command: ${RUN_COMMAND}"
    echo ""
    eval ${RUN_COMMAND}

else
    echo 'Please provide a valid mode to run the workflow:'
    echo 'either processing or post_processing for running the processing or post_processing workflows, respectively.'
    exit 1
fi

# Set permissions on launch directory
echo ""
echo "Setting permissions on launch directory..."
chmod -R 755 .
echo "Permissions set to 755 recursively on launch directory"
```
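The launch script above dispatches on a mode argument that defaults to empty via `${1:-''}`, then builds a timestamped `-name` for the Nextflow run. A minimal, nextflow-free sketch of just that dispatch pattern (the `demo_mode` function and its messages are invented for illustration):

```bash
# Standalone sketch of the mode-dispatch pattern used by launch.sh;
# demo_mode and its output strings are hypothetical, not part of the workflow.
demo_mode() {
  local MODE=${1:-''}   # empty string when no argument is given, as in launch.sh
  if [ "${MODE}" = "processing" ]; then
    echo "would launch run: MAIN_<timestamp>"
  elif [ "${MODE}" = "post_processing" ]; then
    echo "would launch run: POST_<timestamp>"
  else
    echo "invalid mode"
    return 1
  fi
}

demo_mode processing
demo_mode post_processing
demo_mode || true   # a missing mode falls through to the error branch
```

Quoting `"${MODE}"` in the test brackets matters under `set -u`/`set -e`: an empty, unquoted expansion would otherwise make `[ ... ]` fail with a syntax error rather than reach the `else` branch.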
New file: 51 additions & 0 deletions

```bash
#!/bin/bash

#SBATCH --job-name="nf_master"        ## Replace nf_master with the name of the job you are running ##
#SBATCH --output=nf_master.o.%j      ## Replace nf_master with the name of the job you are running ##
#SBATCH --error=nf_master.e.%j       ## Replace nf_master with the name of the job you are running ##
#SBATCH --partition=normal           ## Specifies the job queue to use; for urgent jobs change normal to priority ##
#SBATCH --mem=20G                    ## Memory required to run the job; this example requests 20 GB, change this based on how much RAM you need ##
#SBATCH --cpus-per-task=1            ## Number of CPUs to run the job; this example requests 1 CPU, change this based on how many CPUs you need ##
#SBATCH --mail-user=name@domain.com  ## E-mail address to notify when the job is complete; replace with your NASA e-mail address ##
#SBATCH --mail-type=END              ## Tells slurm to e-mail the address above when the job has completed ##

. ~/.profile

echo "nf_master"  ## Replace with the name of the job you are running ##
echo ""

## Add a time-stamp at the start of the job ##
start=$(date +%s)
echo "start time: $start"

## Print the name of the compute node executing the job ##
echo $HOSTNAME

WORKFLOW_DIR='/path/to/nextflow/workflow_code'

# Processing
bash ./launch.sh processing ${WORKFLOW_DIR}/main.nf ${WORKFLOW_DIR}/nextflow.config '--accession OSD-574'

# Post Processing
#bash ./launch.sh post_processing ${WORKFLOW_DIR}/post_processing.nf ${WORKFLOW_DIR}/post_processing.config \
#  '--name First M. Last --email name@domain.com --GLDS_accession GLDS-574 --OSD_accession OSD-574 --isa_zip ../GeneLab/OSD-574_metadata_OSD-574-ISA.zip --runsheet ../GeneLab/GLfile.csv'

## Add a time-stamp at the end of the job, then calculate how long the job took to run in seconds, minutes, and hours ##
echo ""
end=$(date +%s)
echo "end time: $end"
runtime_s=$(( end - start ))
echo "total run time(s): $runtime_s"
sec_per_min=60
sec_per_hr=3600
runtime_m=$(echo "scale=2; $runtime_s / $sec_per_min;" | bc)
echo "total run time(m): $runtime_m"
runtime_h=$(echo "scale=2; $runtime_s / $sec_per_hr;" | bc)
echo "total run time(h): $runtime_h"
echo ""

## Print the slurm job ID so you have it recorded and can view slurm job statistics if needed ##
echo "slurm job ID: ${SLURM_JOB_ID}"
```
