Commit 703f469

Merge pull request #129 from olabiyi/DEV_Metagenomics_Illumina_NF_conversion
Nextflow Metagenomics Illumina conversion: Added missing post-processing script and fixed no assemblies produced bug
2 parents: 0dcec9b + e91d7a5

7 files changed: +73 −20 lines

Metagenomics/Illumina/Workflow_Documentation/NF_MGIllumina-A/README.md

Lines changed: 10 additions & 10 deletions

````diff
@@ -115,7 +115,7 @@ nextflow run main.nf --help
 
 <br>
 
-#### 4a. Approach 1: Run slurm jobs in singularity containers with OSD accession as input
+#### 4a. Approach 1: Run slurm jobs in singularity containers with OSD or GLDS accession as input
 
 ```bash
 nextflow run main.nf -resume -profile slurm,singularity --accession OSD-574
@@ -195,30 +195,30 @@ Standard nextflow resource usage logs are also produced as follows:
 For options and detailed help on how to run the post-processing workflow, run the following command:
 
 ```bash
-nextflow run post_processng.nf --help
+nextflow run post_processing.nf --help
 ```
 
 To generate a README file, a protocols file, a md5sums table and a file association table after running the processing workflow successfully, modify and set the parameters in [post_processing.config](workflow_code/post_processing.config) then run the following command:
 
 ```bash
-nextflow -C post_processing.config run post_processng.nf -resume -profile slurm,singularity
+nextflow -C post_processing.config run post_processing.nf -resume -profile slurm,singularity
 ```
 
 The outputs of the run will be in a directory called `Post_Processing` by default and they are as follows:
 
-- Post_processing/FastQC_Outputs/filtered_multiqc_GLmetagenomics_report.zip (Filtered sequence multiqc report with paths purged)
+- Post_processing/FastQC_Outputs/filtered_multiqc_GLmetagenomics_report.zip (Filtered sequence multiqc report with paths purged)
 
-- Post_processing/FastQC_Outputs/raw_multiqc_GLmetagenomics_report.zip (Raw sequence multiqc report with paths purged)
+- Post_processing/FastQC_Outputs/raw_multiqc_GLmetagenomics_report.zip (Raw sequence multiqc report with paths purged)
 
-- Post_processing/<GLDS_accession>_-associated-file-names.tsv (File association table for curation)
+- Post_processing/<GLDS_accession>_-associated-file-names.tsv (File association table for curation)
 
-- Post_processing/<GLDS_accession>_metagenomics-validation.log (Automatic verification and validation log file)
+- Post_processing/<GLDS_accession>_metagenomics-validation.log (Automatic verification and validation log file)
 
-- Post_processing/processed_md5sum_GLmetagenomics.tsv (md5sums for the files to be released on OSDR)
+- Post_processing/processed_md5sum_GLmetagenomics.tsv (md5sums for the files to be released on OSDR)
 
-- Post_processing/processing_info_GLmetagenomics.zip (Zip file containing all files used to run the workflow and required logs with paths purged)
+- Post_processing/processing_info_GLmetagenomics.zip (Zip file containing all files used to run the workflow and required logs with paths purged)
 
-- Post_processing/protocol.txt (File describing the methods used by the workflow)
+- Post_processing/protocol.txt (File describing the methods used by the workflow)
 
 - Post_processing/README_GLmetagenomics.txt (README file listing and describing the outputs of the workflow)
````
generate_protocol.sh (new file)

Lines changed: 36 additions & 0 deletions

```bash
#!/usr/bin/env bash

# Generate protocol according to a pipeline document

# USAGE:
# generate_protocol.sh <software_versions> <protocol_id>
# EXAMPLE:
# generate_protocol.sh ../Metadata/software_versions.txt GL-DPPD-7107-A

FASTQC=`grep -i 'fastqc' $1 | awk '{print $2}' | sed -E 's/v//'`
MULTIQC=`grep -i 'multiqc' $1 | awk '{print $3}'`
BBMAP=`grep -i 'bbtools' $1 | awk '{print $2}'`
HUMANN=`grep -i 'humann' $1 | awk '{print $2}' | sed -E 's/v//'`
MEGAHIT=`grep -i 'megahit' $1 | awk '{print $2}' | sed -E 's/v//'`
PRODIGAL=`grep -i 'prodigal' $1 | awk '{print $2}' | sed -E 's/[vV:]//g'`
CAT=`grep 'CAT' $1 | awk '{print $2}' | sed -E 's/v//'`
KOFAMSCAN=`grep 'exec_annotation' $1 | awk '{print $2}'`
BOWTIE2=`grep -i 'bowtie' $1 | awk '{print $3}'`
SAMTOOLS=`grep -i 'samtools' $1 | awk '{print $2}'`
METABAT2=`grep -i 'metabat' $1 | awk '{print $2}'`
BIT=`grep -i 'bioinformatics tools' $1 | awk '{print $3}' | sed 's/v//' | sed -E 's/.+([0-9]+.[0-9]+.[0-9]+).+/\1/'`
CHECKM=`grep -i 'checkm' $1 | awk '{print $2}' | sed -E 's/v//'`
GTDBTK=`grep -i '^GTDB' $1 | awk '{print $2}' | sed -E 's/v//' | head -n2` # If 2 versions are used, choose the second

PROTOCOL_ID=$2

PROTOCOL="Data were processed as described in ${PROTOCOL_ID} (https://github.com/nasa/GeneLab_Data_Processing/blob/master/Metagenomics/Illumina/Pipeline_GL-DPPD-7107_Versions/${PROTOCOL_ID}.md), using workflow NF_MGIllumina v1.0.0 (https://github.com/nasa/GeneLab_Data_Processing/tree/NF_MGIllumina_1.0.0/Metagenomics/Illumina/Workflow_Documentation/NF_MGIllumina). \
In brief, quality assessment of reads was performed with FastQC v${FASTQC} and reports were summarized with MultiQC v${MULTIQC}. \
Quality trimming and filtering were performed with bbmap v${BBMAP}. Read-based processing was performed with humann3 v${HUMANN}. \
Individual samples were assembled with megahit v${MEGAHIT}. Genes were called with prodigal v${PRODIGAL}. \
Taxonomic classification of genes and contigs was performed with CAT v${CAT}. Functional annotation was done with KOFamScan v${KOFAMSCAN}. \
Reads were mapped to assemblies with bowtie2 v${BOWTIE2} and coverage information was extracted for reads and contigs with samtools v${SAMTOOLS} and bbmap v${BBMAP}. \
Binning of contigs was performed with metabat2 v${METABAT2}. Bins were summarized with bit v${BIT} and estimates of quality were generated with checkm v${CHECKM}. \
High-quality bins (> 90% est. completeness and < 10% est. redundancy) were taxonomically classified with gtdb-tk v${GTDBTK}."

echo ${PROTOCOL}
```
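All of the version variables above follow the same grep → awk → sed extraction pattern. As a sketch of how that pattern behaves, the snippet below runs the FASTQC pipeline against a made-up versions file (the file name and its contents are illustrative, not the workflow's actual `software_versions.txt` format):

```shell
# Illustrative input only -- the real software_versions.txt produced by the
# workflow may format its entries differently.
printf 'FastQC v0.12.1\nMultiQC version 1.14\n' > /tmp/sample_versions.txt

# Same pipeline as the FASTQC line above: case-insensitive match on the tool
# name, take the second whitespace-delimited field, strip the leading "v".
FASTQC=$(grep -i 'fastqc' /tmp/sample_versions.txt | awk '{print $2}' | sed -E 's/v//')
echo "FastQC version: ${FASTQC}"   # prints "FastQC version: 0.12.1"
```

Note that `sed -E 's/v//'` removes only the first `v` on the line, which is enough when the field is of the form `v0.12.1`.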

Metagenomics/Illumina/Workflow_Documentation/NF_MGIllumina-A/workflow_code/modules/assembly.nf

Lines changed: 1 addition & 0 deletions

```diff
@@ -61,6 +61,7 @@ process RENAME_HEADERS {
     output:
         tuple val(sample_id), path("${sample_id}-assembly.fasta"), emit: contigs
         path("versions.txt"), emit: version
+        path("Failed-assemblies.tsv"), optional: true, emit: failed_assembly
 
     script:
         """
         bit-rename-fasta-headers -i ${assembly} \\
```

Metagenomics/Illumina/Workflow_Documentation/NF_MGIllumina-A/workflow_code/modules/assembly_based_processing.nf

Lines changed: 6 additions & 0 deletions

```diff
@@ -48,6 +48,12 @@ workflow assembly_based {
                         sample_id, assembly -> file("${assembly}")
                         }.collect()
         SUMMARIZE_ASSEMBLIES(assemblies_ch)
+
+        // Write failed assemblies to a Failed-assemblies file
+        failed_assemblies = RENAME_HEADERS.out.failed_assembly
+        failed_assemblies
+             .map{ it.text }
+             .collectFile(name: "${params.assemblies_dir}/Failed-assemblies.tsv", cache: false)
 
         // Map reads to assembly
         MAPPING(assembly_ch.join(filtered_ch))
```
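The `collectFile` operator above concatenates each sample's optional `Failed-assemblies.tsv` into one table under `params.assemblies_dir` (with `cache: false` so the merged file is rebuilt on `-resume`). A rough plain-shell equivalent of that aggregation, with all paths and sample names made up for illustration:

```shell
# Create two hypothetical per-sample failure records, then merge them the
# way collectFile merges the per-sample Failed-assemblies.tsv files.
mkdir -p /tmp/per_sample /tmp/assemblies
printf 'sampleA\tassembly produced no contigs\n' > /tmp/per_sample/sampleA-failed.tsv
printf 'sampleC\tassembly produced no contigs\n' > /tmp/per_sample/sampleC-failed.tsv

cat /tmp/per_sample/*-failed.tsv > /tmp/assemblies/Failed-assemblies.tsv
wc -l < /tmp/assemblies/Failed-assemblies.tsv   # 2 rows, one per failed sample
```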

Metagenomics/Illumina/Workflow_Documentation/NF_MGIllumina-A/workflow_code/modules/create_runsheet.nf

Lines changed: 11 additions & 10 deletions

```diff
@@ -1,36 +1,37 @@
 #!/usr/bin/env nextflow
 nextflow.enable.dsl = 2
 
-//params.GLDS_accession = "OSD-574"
+//params.accession = "OSD-574"
 //params.RawFilePattern = null // Pattern of files on OSDR for the OSD accession you want to process
 
 process GET_RUNSHEET {
 
     beforeScript "chmod +x ${baseDir}/bin/create_runsheet.sh"
+    tag "Downloading raw fastq files and runsheet for ${accession}..."
 
     input:
-        val(GLDS_accession)
+        val(accession)
     output:
         path("a_*metagenomic*.txt"), emit: assay_TABLE
         path("*.zip"), emit: zip
         path("GLfile.csv"), emit: input_file
         path("versions.txt"), emit: version
     script:
         """
-        # Download ISA zip file for the GLDS_accession then unzip it
-        GL-download-GLDS-data -g ${GLDS_accession} -p ISA -f && unzip *-ISA.zip
+        # Download ISA zip file for the GLDS/OSD accession then unzip it
+        GL-download-GLDS-data -g ${accession} -p ISA -f && unzip *-ISA.zip
 
         if [ ${params.RawFilePattern} == null ];then
 
            # Attempt to download the sequences using the assay table, if that fails then
            # attempt retrieving all fastq.gz files
-           GL-download-GLDS-data -f -g ${GLDS_accession} -a a_*metagenomic*.txt -o Raw_Sequence_Data || \\
-           GL-download-GLDS-data -f -g ${GLDS_accession} -p ".fastq.gz" -o Raw_Sequence_Data
+           GL-download-GLDS-data -f -g ${accession} -a a_*metagenomic*.txt -o Raw_Sequence_Data || \\
+           GL-download-GLDS-data -f -g ${accession} -p ".fastq.gz" -o Raw_Sequence_Data
 
         else
 
-           GL-download-GLDS-data -f -g ${GLDS_accession} -p ${params.RawFilePattern} -o Raw_Sequence_Data
+           GL-download-GLDS-data -f -g ${accession} -p ${params.RawFilePattern} -o Raw_Sequence_Data
 
         fi
@@ -39,8 +40,8 @@ process GET_RUNSHEET {
         grep '+' *wanted-file-download-commands.sh | \\
         sort -u | \\
         awk '{gsub(/\\+/,"%2B", \$NF);print}' \\
-        > plus_containing_${GLDS_accession}-wanted-file-download-commands.sh
-        cat plus_containing_${GLDS_accession}-wanted-file-download-commands.sh | parallel -j $task.cpus
+        > plus_containing_${accession}-wanted-file-download-commands.sh
+        cat plus_containing_${accession}-wanted-file-download-commands.sh | parallel -j $task.cpus
         fi
 
         # Create runsheet from the assay table
@@ -52,7 +53,7 @@ process GET_RUNSHEET {
 
 workflow {
 
-    GET_RUNSHEET(params.GLDS_accession)
+    GET_RUNSHEET(params.accession)
     file_ch = GET_RUNSHEET.out.input_file
                   .splitCsv(header:true)
```

Metagenomics/Illumina/Workflow_Documentation/NF_MGIllumina-A/workflow_code/modules/read_mapping.nf

Lines changed: 2 additions & 0 deletions

```diff
@@ -30,6 +30,7 @@ process MAPPING {
         else
 
             touch ${sample_id}.sam
+            echo "Mapping not performed for ${sample_id} because the assembly didn't produce anything." > ${sample_id}-mapping-info.txt
             printf "Mapping not performed for ${sample_id} because the assembly didn't produce anything.\\n"
 
         fi
@@ -48,6 +49,7 @@ process MAPPING {
         else
 
             touch ${sample_id}.sam
+            echo "Mapping not performed for ${sample_id} because the assembly didn't produce anything." > ${sample_id}-mapping-info.txt
             printf "Mapping not performed for ${sample_id} because the assembly didn't produce anything.\\n"
 
         fi
```
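Both branches now leave a `${sample_id}-mapping-info.txt` breadcrumb next to the empty placeholder `.sam`, so the skipped mapping is recorded in a file as well as on the console. A standalone sketch of that fallback; the `sampleA` id and the emptiness test are illustrative, as the actual guard condition sits above these hunks in the module:

```shell
# Work in a throwaway directory so nothing collides with real files.
cd "$(mktemp -d)"
sample_id="sampleA"

# Assumed guard: no non-empty assembly fasta exists for this sample.
if [ ! -s "${sample_id}-assembly.fasta" ]; then
    # Empty placeholder so downstream channels still receive a .sam file.
    touch "${sample_id}.sam"
    # Record why mapping was skipped, for curators and logs.
    echo "Mapping not performed for ${sample_id} because the assembly didn't produce anything." \
        > "${sample_id}-mapping-info.txt"
fi
```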

Metagenomics/Illumina/Workflow_Documentation/NF_MGIllumina-A/workflow_code/nextflow.config

Lines changed: 7 additions & 0 deletions

```diff
@@ -327,6 +327,13 @@ process {
         publishDir = [path: params.logs_dir, pattern: "*-assembly.log", mode: params.publishDir_mode]
     }
 
+    withName: RENAME_HEADERS {
+
+        publishDir = [path: params.assemblies_dir, pattern: "*-assembly.fasta", mode: params.publishDir_mode]
+
+    }
+
+
     withLabel: mapping {
         conda = {params.conda.mapping != null ? params.conda.mapping : "envs/mapping.yaml"}
         cpus = 8
```
