Commit 8d82d81

Merge pull request #66 from torres-alexis/SW_AmpIllumina-B-optional-vis
SW-AmpIllumina-B optional visualizations updates

- Moved the visualization script from workflow_code/scripts/ to new folder workflow_code/visualizations/
- Visualizations are now optional with the default being off.
- R-visualizations.log is now checked for and then copied to the outputs along with the other logs
- Visualizations can be enabled by either:
  - run_workflow.py launch: using the `--visualizations TRUE` argument.
  - Direct snakemake launch: setting config["enable_visualizations"] to "TRUE"
- Reformatted Snakefile rule all inputs (final outputs) to make addition of conditional outputs easier
2 parents 786fcbd + 2aa92ab commit 8d82d81

7 files changed, +122 -108 lines changed

Amplicon/Illumina/Workflow_Documentation/README.md

Lines changed: 1 addition & 1 deletion
@@ -6,7 +6,7 @@
 
 |Pipeline Version|Current Workflow Version (for respective pipeline version)|
 |:---------------|:---------------------------------------------------------|
-|*[GL-DPPD-7104-B.md](../Pipeline_GL-DPPD-7104_Versions/GL-DPPD-7104-B.md)|[1.2.1](SW_AmpIllumina-B)|
+|*[GL-DPPD-7104-B.md](../Pipeline_GL-DPPD-7104_Versions/GL-DPPD-7104-B.md)|[1.2.2](SW_AmpIllumina-B)|
 |*[GL-DPPD-7104-A.md](../Pipeline_GL-DPPD-7104_Versions/GL-DPPD-7104-A.md)|[1.1.1](SW_AmpIllumina-A)|
 
 *Current GeneLab Pipeline/Workflow Implementation

Amplicon/Illumina/Workflow_Documentation/SW_AmpIllumina-B/CHANGELOG.md

Lines changed: 6 additions & 0 deletions
@@ -1,5 +1,11 @@
 # Workflow change log
 
+## [1.2.2](https://github.com/nasa/GeneLab_Data_Processing/tree/SW_AmpIllumina-B_1.2.2/Amplicon/Illumina/Workflow_Documentation/SW_AmpIllumina-B)
+- Visualizations are now optional with the default being off.
+- Enable with optional `run_workflow.py` argument `--visualizations TRUE` or setting `config.yaml` `enable_visualizations` to "TRUE"
+- Moved the visualization script from `workflow_code/scripts/` to new folder `workflow_code/visualizations/`
+- Refactored Snakefile outputs
+
 ## [1.2.1](https://github.com/nasa/GeneLab_Data_Processing/tree/SW_AmpIllumina-B_1.2.1/Amplicon/Illumina/Workflow_Documentation/SW_AmpIllumina-B)
 - Moved SW_AmpIllumina-A_1.2.1 to SW_AmpIllumina-B_1.2.1
 - Workflow runs the [GL-DPPD-7104-B version](../../Pipeline_GL-DPPD-7104_Versions/GL-DPPD-7104-B.md) of the GeneLab standard pipeline, which includes data visualization outputs

Amplicon/Illumina/Workflow_Documentation/SW_AmpIllumina-B/README.md

Lines changed: 7 additions & 5 deletions
@@ -52,15 +52,15 @@ ___
 <!-- All files required for utilizing the GeneLab workflow for processing Illumina amplicon sequencing data are in the [workflow_code](workflow_code) directory. To get a copy of latest SW_AmpIllumina-B version on to your system, the code can be downloaded as a zip file from the release page then unzipped after downloading by running the following commands:
 
 ```bash
-wget https://github.com/nasa/GeneLab_Data_Processing/releases/download/SW_AmpIllumina-B_1.2.1/SW_AmpIllumina-B_1.2.1.zip
+wget https://github.com/nasa/GeneLab_Data_Processing/releases/download/SW_AmpIllumina-B_1.2.2/SW_AmpIllumina-B_1.2.2.zip
 
-unzip SW_AmpIllumina-B_1.2.1.zip
+unzip SW_AmpIllumina-B_1.2.2.zip
 ```
 
-This downloaded the workflow into a directory called `SW_AmpIllumina-B_1.2.1`. To run the workflow, you will need to move into that directory by running the following command:
+This downloaded the workflow into a directory called `SW_AmpIllumina-B_1.2.2`. To run the workflow, you will need to move into that directory by running the following command:
 
 ```bash
-cd SW_AmpIllumina-B_1.2.1
+cd SW_AmpIllumina-B_1.2.2
 ``` -->
 
 All files required for utilizing the GeneLab workflow for processing Illumina amplicon sequencing data are in the [workflow_code](workflow_code) directory. To get a copy of the latest SW_AmpIllumina-B version on to your system, run the following command:
@@ -132,7 +132,7 @@ ___
 * `--run` - specifies the command used to execute the snakemake workflow; snakemake-specific parameters are defined below
 
 * `--outputDir` - specifies the output directory for the output files generated by the workflow
-> *This is an optional command that can be added outside the quotation marks in either approach to specify the output directory. If this option is not used, the output files will be printed to the current working directory, i.e. in the `SW_AmpIllumina-B_1.2.1` directory that was downloaded in [step 2](#2-download-the-workflow-template-files).*
+> *This is an optional command that can be added outside the quotation marks in either approach to specify the output directory. If this option is not used, the output files will be printed to the current working directory, i.e. in the `SW_AmpIllumina-B_1.2.2` directory that was downloaded in [step 2](#2-download-the-workflow-template-files).*
 
 * `--trim-primers TRUE/FALSE` - specifies to trim primers (TRUE) or not (FALSE). Default: TRUE
 > *Note: Primers should virtually always be trimmed from amplicon datasets. This option is here for cases where they have already been removed.*
@@ -167,6 +167,8 @@ ___
 * `--specify-runsheet` - specifies the runsheet to use when multiple runsheets are generated
 > *Optional parameter used in Approach 1 for datasets that have multiple assays for the same amplicon target (e.g. [OSD-249](https://osdr.nasa.gov/bio/repo/data/studies/OSD-249)).*
 
+* `--visualizations TRUE/FALSE` - if set to TRUE, the [visualizations script](workflow_code/visualizations/Illumina-R-visualizations.R) will be run. Default: FALSE
+
 <br>
 
 **Parameter Definitions for `snakemake`**
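
The `--visualizations TRUE/FALSE` option documented above could be wired to the `enable_visualizations` config key roughly as follows. This is a minimal sketch only: `run_workflow.py` itself is not shown in this diff, so `parse_bool_flag`, `build_parser`, and `config_overrides` are hypothetical names, and the mapping from flag to config key is an assumption based on the commit message.

```python
import argparse


def parse_bool_flag(value: str) -> str:
    """Normalize a TRUE/FALSE command-line value to its upper-case string form."""
    normalized = value.strip().upper()
    if normalized not in ("TRUE", "FALSE"):
        raise argparse.ArgumentTypeError(f"expected TRUE or FALSE, got {value!r}")
    return normalized


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser; the real run_workflow.py is not part of this diff.
    parser = argparse.ArgumentParser(description="Sketch of a --visualizations option")
    parser.add_argument(
        "--visualizations",
        type=parse_bool_flag,
        default="FALSE",
        metavar="TRUE/FALSE",
        help="run the visualizations script (default: FALSE)",
    )
    return parser


def config_overrides(args: argparse.Namespace) -> dict:
    # Assumed mapping from the CLI flag to the key the Snakefile reads.
    return {"enable_visualizations": args.visualizations}


if __name__ == "__main__":
    args = build_parser().parse_args(["--visualizations", "true"])
    print(config_overrides(args))
```

Keeping the value as the string "TRUE" or "FALSE" matches the string comparison the Snakefile performs, so no boolean conversion is needed on the workflow side.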

Amplicon/Illumina/Workflow_Documentation/SW_AmpIllumina-B/workflow_code/Snakefile

Lines changed: 91 additions & 98 deletions
@@ -9,6 +9,7 @@ import os
 
 configfile: "config.yaml"
 
+enable_visualizations = config["enable_visualizations"]
 
 ########################################
 ############# General Info #############
@@ -54,17 +55,85 @@ if len(set(sample_ID_list)) != len(sample_ID_list):
 ######## Setting up directories ########
 ########################################
 
+# Initialize the list of needed directories without plots_dir
 if config["trim_primers"] == "TRUE":
-    needed_dirs = [config["info_out_dir"], config["fastqc_out_dir"], config["trimmed_reads_dir"], config["filtered_reads_dir"], config["final_outputs_dir"], config["plots_dir"], "benchmarks"]
+    needed_dirs = [
+        config["info_out_dir"],
+        config["fastqc_out_dir"],
+        config["trimmed_reads_dir"],
+        config["filtered_reads_dir"],
+        config["final_outputs_dir"],
+        "benchmarks"
+    ]
 else:
-    needed_dirs = [config["info_out_dir"], config["fastqc_out_dir"], config["filtered_reads_dir"], config["final_outputs_dir"], config["plots_dir"], "benchmarks"]
-
+    needed_dirs = [
+        config["info_out_dir"],
+        config["fastqc_out_dir"],
+        config["filtered_reads_dir"],
+        config["final_outputs_dir"],
+        "benchmarks"
+    ]
+
+# Conditionally add plots_dir if enable_visualizations is True
+if enable_visualizations == "TRUE":
+    needed_dirs.append(config["plots_dir"])
+
+# Try to create the directories
 for dir in needed_dirs:
     try:
-        os.mkdir(dir)
-    except:
-        pass
+        os.makedirs(dir, exist_ok=True)
+    except Exception as e:
+        print(f"Could not create directory {dir}: {e}")
 
+########################################
+########## Setting up outputs ##########
+########################################
+
+# Base rule all inputs (final outs) for PE, with or without trimming
+base_PE_inputs = [
+    expand(config["filtered_reads_dir"] + "{ID}" + config["filtered_R1_suffix"], ID = sample_ID_list),
+    expand(config["filtered_reads_dir"] + "{ID}" + config["filtered_R2_suffix"], ID = sample_ID_list),
+    config["final_outputs_dir"] + config["output_prefix"] + f"taxonomy_{assay_suffix}.tsv",
+    config["final_outputs_dir"] + config["output_prefix"] + f"taxonomy-and-counts_{assay_suffix}.biom.zip",
+    config["final_outputs_dir"] + config["output_prefix"] + f"ASVs_{assay_suffix}.fasta",
+    config["final_outputs_dir"] + config["output_prefix"] + f"read-count-tracking_{assay_suffix}.tsv",
+    config["final_outputs_dir"] + config["output_prefix"] + f"counts_{assay_suffix}.tsv",
+    config["final_outputs_dir"] + config["output_prefix"] + f"taxonomy-and-counts_{assay_suffix}.tsv",
+    config["fastqc_out_dir"] + config["output_prefix"] + f"raw_multiqc_{assay_suffix}_report.zip",
+    config["fastqc_out_dir"] + config["output_prefix"] + f"filtered_multiqc_{assay_suffix}_report.zip"
+]
+
+# Base rule all inputs (final outs) for SE, with or without trimming
+base_SE_inputs = [
+    expand(config["filtered_reads_dir"] + "{ID}" + config["filtered_R1_suffix"], ID = sample_ID_list),
+    config["final_outputs_dir"] + config["output_prefix"] + f"taxonomy_{assay_suffix}.tsv",
+    config["final_outputs_dir"] + config["output_prefix"] + f"taxonomy-and-counts_{assay_suffix}.biom.zip",
+    config["final_outputs_dir"] + config["output_prefix"] + f"ASVs_{assay_suffix}.fasta",
+    config["final_outputs_dir"] + config["output_prefix"] + f"read-count-tracking_{assay_suffix}.tsv",
+    config["final_outputs_dir"] + config["output_prefix"] + f"counts_{assay_suffix}.tsv",
+    config["final_outputs_dir"] + config["output_prefix"] + f"taxonomy-and-counts_{assay_suffix}.tsv",
+    config["fastqc_out_dir"] + config["output_prefix"] + f"raw_multiqc_{assay_suffix}_report.zip",
+    config["fastqc_out_dir"] + config["output_prefix"] + f"filtered_multiqc_{assay_suffix}_report.zip"
+]
+
+# Add additional inputs for trimming
+if config["trim_primers"] == "TRUE":
+    if config["data_type"] == "PE":
+        base_PE_inputs += [
+            expand(config["trimmed_reads_dir"] + "{ID}" + config["primer_trimmed_R1_suffix"], ID = sample_ID_list),
+            expand(config["trimmed_reads_dir"] + "{ID}" + config["primer_trimmed_R2_suffix"], ID = sample_ID_list),
+            config["trimmed_reads_dir"] + config["output_prefix"] + f"cutadapt_{assay_suffix}.log",
+            config["trimmed_reads_dir"] + config["output_prefix"] + f"trimmed-read-counts_{assay_suffix}.tsv",
+        ]
+    else: # SE with primer trimming
+        base_SE_inputs += [
+            expand(config["trimmed_reads_dir"] + "{ID}" + config["primer_trimmed_R1_suffix"], ID = sample_ID_list),
+            config["trimmed_reads_dir"] + config["output_prefix"] + f"cutadapt_{assay_suffix}.log",
+            config["trimmed_reads_dir"] + config["output_prefix"] + f"trimmed-read-counts_{assay_suffix}.tsv",
+        ]
+
+# Conditional addition of visualization outputs (color legend only to keep it simple)
+visualization_outputs = [config["plots_dir"] + config["output_prefix"] + f"color_legend_{assay_suffix}.png"] if enable_visualizations == "TRUE" else []
 
 ########################################
 ############# Rules start ##############
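
The hunk above replaces four near-identical `rule all` input lists with a base list plus conditional additions. The same composition can be sketched in plain Python, outside Snakemake. This is an illustration under stated assumptions: `expand_per_sample` is a hypothetical stand-in for Snakemake's `expand()`, only one representative final output is listed, the values in `EXAMPLE_CONFIG` and the `assay_suffix` are made up, and a missing `enable_visualizations` key is treated as "FALSE", which the Snakefile in this commit does not do.

```python
def expand_per_sample(directory, suffix, sample_ids):
    # Plain-Python stand-in for Snakemake's expand() over the sample IDs.
    return [f"{directory}{sample_id}{suffix}" for sample_id in sample_ids]


def rule_all_inputs(config, sample_ids, assay_suffix):
    """Build a flat target list: base outputs plus optional trimming and plot outputs."""
    reads = ["R1", "R2"] if config["data_type"] == "PE" else ["R1"]

    targets = []
    for read in reads:
        targets += expand_per_sample(
            config["filtered_reads_dir"], config[f"filtered_{read}_suffix"], sample_ids
        )

    # One representative final output; the Snakefile lists several more.
    targets.append(
        config["final_outputs_dir"] + config["output_prefix"] + f"counts_{assay_suffix}.tsv"
    )

    if config["trim_primers"] == "TRUE":
        for read in reads:
            targets += expand_per_sample(
                config["trimmed_reads_dir"], config[f"primer_trimmed_{read}_suffix"], sample_ids
            )

    # Assumption: a missing key means visualizations are off.
    if config.get("enable_visualizations", "FALSE") == "TRUE":
        targets.append(
            config["plots_dir"] + config["output_prefix"] + f"color_legend_{assay_suffix}.png"
        )
    return targets


# Hypothetical values, for illustration only.
EXAMPLE_CONFIG = {
    "data_type": "PE",
    "trim_primers": "TRUE",
    "enable_visualizations": "FALSE",
    "filtered_reads_dir": "Filtered_Sequence_Data/",
    "trimmed_reads_dir": "Trimmed_Sequence_Data/",
    "final_outputs_dir": "Final_Outputs/",
    "plots_dir": "Final_Outputs/Plots/",
    "output_prefix": "",
    "filtered_R1_suffix": "_R1_filtered.fastq.gz",
    "filtered_R2_suffix": "_R2_filtered.fastq.gz",
    "primer_trimmed_R1_suffix": "_R1_trimmed.fastq.gz",
    "primer_trimmed_R2_suffix": "_R2_trimmed.fastq.gz",
}

if __name__ == "__main__":
    for target in rule_all_inputs(EXAMPLE_CONFIG, ["S1", "S2"], "GLAmpSeq"):
        print(target)
```

The sketch returns a flat list, whereas the Snakefile keeps each `expand()` result as a nested list inside `base_PE_inputs` and `base_SE_inputs` and relies on Snakemake to flatten rule inputs.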
@@ -73,53 +142,13 @@ for dir in needed_dirs:
 #### rules if paired-end data ####
 if config["data_type"] == "PE":
 
-    # "all" starting rule for paired-end data
-    if config["trim_primers"] == "TRUE":
-
-        rule all:
-            input:
-                expand(config["filtered_reads_dir"] + "{ID}" + config["filtered_R1_suffix"], ID = sample_ID_list),
-                expand(config["filtered_reads_dir"] + "{ID}" + config["filtered_R2_suffix"], ID = sample_ID_list),
-                expand(config["trimmed_reads_dir"] + "{ID}" + config["primer_trimmed_R1_suffix"], ID = sample_ID_list),
-                expand(config["trimmed_reads_dir"] + "{ID}" + config["primer_trimmed_R2_suffix"], ID = sample_ID_list),
-                config["trimmed_reads_dir"] + config["output_prefix"] + f"cutadapt_{assay_suffix}.log",
-                config["trimmed_reads_dir"] + config["output_prefix"] + f"trimmed-read-counts_{assay_suffix}.tsv",
-                config["final_outputs_dir"] + config["output_prefix"] + f"taxonomy_{assay_suffix}.tsv",
-                config["final_outputs_dir"] + config["output_prefix"] + f"taxonomy-and-counts_{assay_suffix}.biom.zip",
-                config["final_outputs_dir"] + config["output_prefix"] + f"ASVs_{assay_suffix}.fasta",
-                config["final_outputs_dir"] + config["output_prefix"] + f"read-count-tracking_{assay_suffix}.tsv",
-                config["final_outputs_dir"] + config["output_prefix"] + f"counts_{assay_suffix}.tsv",
-                config["final_outputs_dir"] + config["output_prefix"] + f"taxonomy-and-counts_{assay_suffix}.tsv",
-                config["fastqc_out_dir"] + config["output_prefix"] + f"raw_multiqc_{assay_suffix}_report.zip",
-                config["fastqc_out_dir"] + config["output_prefix"] + f"filtered_multiqc_{assay_suffix}_report.zip",
-                config["plots_dir"] + config["output_prefix"] + f"color_legend_{assay_suffix}.png"
-            shell:
-                """
-                bash scripts/combine-benchmarks.sh
-                python scripts/copy_info.py
-                """
-
-    # if we are not trimming the primers
-    else:
-
-        rule all:
-            input:
-                expand(config["filtered_reads_dir"] + "{ID}" + config["filtered_R1_suffix"], ID = sample_ID_list),
-                expand(config["filtered_reads_dir"] + "{ID}" + config["filtered_R2_suffix"], ID = sample_ID_list),
-                config["final_outputs_dir"] + config["output_prefix"] + f"taxonomy_{assay_suffix}.tsv",
-                config["final_outputs_dir"] + config["output_prefix"] + f"taxonomy-and-counts_{assay_suffix}.biom.zip",
-                config["final_outputs_dir"] + config["output_prefix"] + f"ASVs_{assay_suffix}.fasta",
-                config["final_outputs_dir"] + config["output_prefix"] + f"read-count-tracking_{assay_suffix}.tsv",
-                config["final_outputs_dir"] + config["output_prefix"] + f"counts_{assay_suffix}.tsv",
-                config["final_outputs_dir"] + config["output_prefix"] + f"taxonomy-and-counts_{assay_suffix}.tsv",
-                config["fastqc_out_dir"] + config["output_prefix"] + f"raw_multiqc_{assay_suffix}_report.zip",
-                config["fastqc_out_dir"] + config["output_prefix"] + f"filtered_multiqc_{assay_suffix}_report.zip",
-                config["plots_dir"] + config["output_prefix"] + f"color_legend_{assay_suffix}.png"
-            shell:
-                """
-                bash scripts/combine-benchmarks.sh
-                python scripts/copy_info.py
-                """
+    rule all:
+        input: base_PE_inputs + visualization_outputs
+        shell:
+            """
+            bash scripts/combine-benchmarks.sh
+            python scripts/copy_info.py
+            """
 
 
     # R processing rule for paired-end data
@@ -371,50 +400,14 @@ if config["data_type"] == "PE":
 ##################################
 if config["data_type"] == "SE":
 
-    # "all" starting rule for single-end data
-    if config["trim_primers"] == "TRUE":
-
-        rule all:
-            input:
-                expand(config["filtered_reads_dir"] + "{ID}" + config["filtered_R1_suffix"], ID = sample_ID_list),
-                expand(config["trimmed_reads_dir"] + "{ID}" + config["primer_trimmed_R1_suffix"], ID = sample_ID_list),
-                config["trimmed_reads_dir"] + config["output_prefix"] + f"cutadapt_{assay_suffix}.log",
-                config["trimmed_reads_dir"] + config["output_prefix"] + f"trimmed-read-counts_{assay_suffix}.tsv",
-                config["final_outputs_dir"] + config["output_prefix"] + f"taxonomy_{assay_suffix}.tsv",
-                config["final_outputs_dir"] + config["output_prefix"] + f"taxonomy-and-counts_{assay_suffix}.biom.zip",
-                config["final_outputs_dir"] + config["output_prefix"] + f"ASVs_{assay_suffix}.fasta",
-                config["final_outputs_dir"] + config["output_prefix"] + f"read-count-tracking_{assay_suffix}.tsv",
-                config["final_outputs_dir"] + config["output_prefix"] + f"counts_{assay_suffix}.tsv",
-                config["final_outputs_dir"] + config["output_prefix"] + f"taxonomy-and-counts_{assay_suffix}.tsv",
-                config["fastqc_out_dir"] + config["output_prefix"] + f"raw_multiqc_{assay_suffix}_report.zip",
-                config["fastqc_out_dir"] + config["output_prefix"] + f"filtered_multiqc_{assay_suffix}_report.zip",
-                config["plots_dir"] + config["output_prefix"] + f"color_legend_{assay_suffix}.png"
-            shell:
-                """
-                bash scripts/combine-benchmarks.sh
-                python scripts/copy_info.py
-                """
-
-    # if we are not trimming the primers
-    else:
+    rule all:
+        input: base_SE_inputs + visualization_outputs
+        shell:
+            """
+            bash scripts/combine-benchmarks.sh
+            python scripts/copy_info.py
+            """
 
-        rule all:
-            input:
-                expand(config["filtered_reads_dir"] + "{ID}" + config["filtered_R1_suffix"], ID = sample_ID_list),
-                config["final_outputs_dir"] + config["output_prefix"] + f"taxonomy_{assay_suffix}.tsv",
-                config["final_outputs_dir"] + config["output_prefix"] + f"taxonomy-and-counts_{assay_suffix}.biom.zip",
-                config["final_outputs_dir"] + config["output_prefix"] + f"ASVs_{assay_suffix}.fasta",
-                config["final_outputs_dir"] + config["output_prefix"] + f"read-count-tracking_{assay_suffix}.tsv",
-                config["final_outputs_dir"] + config["output_prefix"] + f"counts_{assay_suffix}.tsv",
-                config["final_outputs_dir"] + config["output_prefix"] + f"taxonomy-and-counts_{assay_suffix}.tsv",
-                config["fastqc_out_dir"] + config["output_prefix"] + f"raw_multiqc_{assay_suffix}_report.zip",
-                config["fastqc_out_dir"] + config["output_prefix"] + f"filtered_multiqc_{assay_suffix}_report.zip",
-                config["plots_dir"] + config["output_prefix"] + f"color_legend_{assay_suffix}.png"
-            shell:
-                """
-                bash scripts/combine-benchmarks.sh
-                python scripts/copy_info.py
-                """
 
 
     # R processing rule for single-end data
@@ -664,7 +657,7 @@ rule r_visualizations:
         "benchmarks/r-visualizations-benchmarks.tsv"
     shell:
         """
-        Rscript scripts/Illumina-R-visualizations.R "{input.runsheet}" "{input.sample_info}" "{input.counts}" "{input.taxonomy}" "{params.assay_suffix}" "{params.plots_dir}" "{params.output_prefix}" > {log} 2>&1
+        Rscript visualizations/Illumina-R-visualizations.R "{input.runsheet}" "{input.sample_info}" "{input.counts}" "{input.taxonomy}" "{params.assay_suffix}" "{params.plots_dir}" "{params.output_prefix}" > {log} 2>&1
         """
 
 
@@ -701,4 +694,4 @@ rule combine_cutadapt_logs_and_summarize:
 
 rule clean_all:
     shell:
-        "rm -rf {needed_dirs}"
+        "rm -rf {needed_dirs}"
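
Because `clean_all` removes whatever is in `needed_dirs`, the contents of that list now depend on `enable_visualizations`. The sketch below mirrors the directory setup from this diff in plain Python so the effect of each flag can be checked in isolation. The function names, the values in `EXAMPLE_CONFIG`, and the use of `config.get` with a "FALSE" fallback are assumptions for illustration; the Snakefile reads `config["enable_visualizations"]` directly.

```python
import os
import tempfile


def needed_dirs(config: dict) -> list:
    """Return the directories to create, in the same order as the Snakefile list."""
    dirs = [config["info_out_dir"], config["fastqc_out_dir"]]
    if config["trim_primers"] == "TRUE":
        dirs.append(config["trimmed_reads_dir"])
    dirs += [config["filtered_reads_dir"], config["final_outputs_dir"], "benchmarks"]
    # Assumption: a missing key is treated as "FALSE" rather than raising KeyError.
    if config.get("enable_visualizations", "FALSE") == "TRUE":
        dirs.append(config["plots_dir"])
    return dirs


def create_dirs(dirs: list, root: str) -> list:
    """Create each directory under root; an already existing directory is not an error."""
    created = []
    for name in dirs:
        path = os.path.join(root, name)
        os.makedirs(path, exist_ok=True)
        created.append(path)
    return created


# Hypothetical config values, for illustration only.
EXAMPLE_CONFIG = {
    "info_out_dir": "Processing_Info/",
    "fastqc_out_dir": "FastQC_Outputs/",
    "trimmed_reads_dir": "Trimmed_Sequence_Data/",
    "filtered_reads_dir": "Filtered_Sequence_Data/",
    "final_outputs_dir": "Final_Outputs/",
    "plots_dir": "Final_Outputs/Plots/",
    "trim_primers": "TRUE",
    "enable_visualizations": "FALSE",
}

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        for path in create_dirs(needed_dirs(EXAMPLE_CONFIG), root):
            print(path)
```

One consequence that may be worth checking against the real workflow: with visualizations off, `plots_dir` is absent from the list, so a plots directory left over from an earlier run would only be cleaned up if it sits inside another listed directory.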

Amplicon/Illumina/Workflow_Documentation/SW_AmpIllumina-B/workflow_code/scripts/copy_info.py

Lines changed: 5 additions & 1 deletion
@@ -26,11 +26,15 @@ def main(config, sample_IDs_file):
         (sample_IDs_file, os.path.join(info_out_dir, os.path.basename(sample_IDs_file))),
         (config["runsheet"], os.path.join(info_out_dir, os.path.basename(config["runsheet"]))),
         ("R-processing.log", os.path.join(info_out_dir, "R-processing.log")),
-        ("R-visualizations.log", os.path.join(info_out_dir, "R-visualizations.log")),
         ("all-benchmarks.tsv", os.path.join(info_out_dir,"all-benchmarks.tsv")),
         ("Snakefile", os.path.join(info_out_dir, "Snakefile"))
     ]
 
+    # Check and add "R-visualizations.log" if it exists (visualizations are optional)
+    r_visualizations_log_path = "R-visualizations.log"
+    if os.path.isfile(r_visualizations_log_path):
+        files_to_copy.append((r_visualizations_log_path, os.path.join(info_out_dir, "R-visualizations.log")))
+
     # Optional ISA archive
     if config.get("isa_archive") and os.path.isfile(config["isa_archive"]):
         files_to_copy.append((config["isa_archive"], os.path.join(info_out_dir, os.path.basename(config["isa_archive"]))))
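
The change to `copy_info.py` treats `R-visualizations.log` like the ISA archive: it is copied only when the file exists. A self-contained sketch of that pattern follows. `collect_files` and `copy_all` are hypothetical helpers, not functions from `copy_info.py`, and the copy step via `shutil.copy2` is an assumption because the diff does not show how `files_to_copy` is consumed.

```python
import os
import shutil
import tempfile


def collect_files(required, optional, dest_dir):
    """Build (source, destination) pairs; optional sources are skipped when absent."""
    pairs = [(src, os.path.join(dest_dir, os.path.basename(src))) for src in required]
    for src in optional:
        if os.path.isfile(src):
            pairs.append((src, os.path.join(dest_dir, os.path.basename(src))))
    return pairs


def copy_all(pairs):
    """Copy every source to its destination, creating parent directories as needed."""
    for src, dst in pairs:
        parent = os.path.dirname(dst)
        if parent:
            os.makedirs(parent, exist_ok=True)
        shutil.copy2(src, dst)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as work:
        processing_log = os.path.join(work, "R-processing.log")
        visualizations_log = os.path.join(work, "R-visualizations.log")  # never created here
        with open(processing_log, "w") as handle:
            handle.write("processing finished\n")
        pairs = collect_files([processing_log], [visualizations_log], os.path.join(work, "info"))
        copy_all(pairs)
        print([os.path.basename(dst) for _, dst in pairs])
```

Checking for the file rather than for the config flag keeps the copy script independent of how the workflow was launched, at the cost of also copying a stale log left in the working directory by an earlier run.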
