Skip to content

Commit 9f03919

Browse files
Merge pull request #68 from torres-alexis/Amp-B-config-typo
SW_AmpIllumina-B Fixes: Add enable_visualizations to default config.yaml example Fix run_workflow.py issue with single assay amplicon OSDR datasets Account for different sizes of characters in dendrogram sample label scaling Account for volcano plot x-axis labels that are too long to fit on plot Stand-alone visualization script execution changes: Add new visualizations/README.md for manual execution Reference this README in Workflow Documentation's --visualizations description details Add RColorBrewer palette variable to R script, include it in visualizations/README.md Move envs/R_visualizations.yaml to visualizations/R_visualizations.yaml Rename visualizations script variable "final_outputs_dir" to "plots_dir"
2 parents dd75cc8 + d961d02 commit 9f03919

File tree

7 files changed

+155
-44
lines changed

7 files changed

+155
-44
lines changed

Amplicon/Illumina/Workflow_Documentation/SW_AmpIllumina-B/README.md

Lines changed: 7 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -15,8 +15,8 @@ The current GeneLab Illumina amplicon sequencing data processing pipeline (AmpIl
1515
- [3. Run the workflow using `run_workflow.py`](#3-run-the-workflow-using-run_workflowpy)
1616
- [3a. Approach 1: Run the workflow on a GeneLab Amplicon (Illumina) sequencing dataset with automatic retrieval of raw read files and metadata](#3a-approach-1-run-the-workflow-on-a-genelab-amplicon-illumina-sequencing-dataset-with-automatic-retrieval-of-raw-read-files-and-metadata)
1717
- [3b. Approach 2: Run the workflow on a non-OSD dataset using a user-created runsheet](#3b-approach-2-run-the-workflow-on-a-non-osd-dataset-using-a-user-created-runsheet)
18-
- [4. Parameter Definitions](#4-parameter-definitions)
19-
- [5. Additional Output Files](#5-additional-output-files)
18+
- [4. Parameter definitions](#4-parameter-definitions)
19+
- [5. Additional output files](#5-additional-output-files)
2020

2121
<br>
2222

@@ -114,11 +114,11 @@ python ./scripts/run_workflow.py --runsheetPath </path/to/runsheet> --run "snake
114114
115115
___
116116
117-
### 4. Parameter Definitions
117+
### 4. Parameter definitions
118118
119119
<br>
120120
121-
**Parameter Definitions for `run_workflow.py`:**
121+
**Parameter definitions for `run_workflow.py`:**
122122
123123
* `--OSD OSD-###` - specifies the OSD dataset to process through the SW_AmpIllumina workflow (replace ### with the OSD number)
124124
> *Used for Approach 1 only.*
@@ -168,10 +168,11 @@ ___
168168
> *Optional parameter used in Approach 1 for datasets that have multiple assays for the same amplicon target (e.g. [OSD-249](https://osdr.nasa.gov/bio/repo/data/studies/OSD-249)).*
169169
170170
* `--visualizations TRUE/FALSE` - if set to TRUE, the [visualizations script](workflow_code/visualizations/Illumina-R-visualizations.R) will be run. Default: FALSE
171+
> *Note: For instructions on manually executing the visualizations script, refer to the [stand-alone execution documentation](./workflow_code/visualizations/README.md).*
171172
172173
<br>
173174
174-
**Parameter Definitions for `snakemake`**
175+
**Parameter definitions for `snakemake`**
175176
176177
* `--use-conda` – specifies to use the conda environments included in the workflow (these are specified in the [envs](workflow_code/envs) directory)
177178
* `--conda-prefix` – indicates where the needed conda environments will be stored. Adding this option will also allow the same conda environments to be re-used when processing additional datasets, rather than making new environments each time you run the workflow. The value listed for this option, `${CONDA_PREFIX}/envs`, points to the default location for conda environments (note: the variable `${CONDA_PREFIX}` will be expanded to the appropriate location on whichever system it is run on).
@@ -186,7 +187,7 @@ See `snakemake -h` and [Snakemake's documentation](https://snakemake.readthedocs
186187
187188
___
188189
189-
### 5. Additional Output Files
190+
### 5. Additional output files
190191
191192
The outputs from the `run_workflow.py` and differential abundance analysis (DAA) / visualizations scripts are described below:
192193
> Note: Outputs from the Amplicon Seq - Illumina pipeline are documented in the [GL-DPPD-7104-B.md](../../Pipeline_GL-DPPD-7104_Versions/GL-DPPD-7104-B.md) processing protocol.

Amplicon/Illumina/Workflow_Documentation/SW_AmpIllumina-B/workflow_code/Snakefile

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -635,7 +635,7 @@ if config["data_type"] == "SE":
635635
rule r_visualizations:
636636
""" This rule generates R visualizations using trimmed read data and grouping info from the runsheet"""
637637
conda:
638-
"envs/R_visualizations.yaml"
638+
"visualizations/R_visualizations.yaml"
639639
input:
640640
runsheet = config["runsheet"],
641641
sample_info = config["sample_info_file"],

Amplicon/Illumina/Workflow_Documentation/SW_AmpIllumina-B/workflow_code/config.yaml

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -131,6 +131,9 @@ final_outputs_dir:
131131
plots_dir:
132132
"workflow_output/Final_Outputs/Plots/"
133133

134+
enable_visualizations:
135+
"FALSE"
136+
134137
############################################################
135138
###################### GENERAL INFO ########################
136139
############################################################

Amplicon/Illumina/Workflow_Documentation/SW_AmpIllumina-B/workflow_code/scripts/run_workflow.py

Lines changed: 10 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -161,10 +161,16 @@ def handle_runsheet_selection(runsheet_files, target=None, specified_runsheet=No
161161
return selected_runsheet
162162

163163
if len(runsheet_files) == 1:
164-
# Automatically use the single runsheet file
165-
if target_region == runsheet_df['Parameter Value[Library Selection]'].unique()[0]:
166-
selected_runsheet = runsheet_files[0]
167-
print(f"Using runsheet: {selected_runsheet}")
164+
if target:
165+
runsheet = runsheet_files[0]
166+
try:
167+
runsheet_df = pd.read_csv(runsheet)
168+
target_region = runsheet_df['Parameter Value[Library Selection]'].unique()[0]
169+
if target.lower() == target_region.lower():
170+
selected_runsheet = runsheet
171+
except Exception as e:
172+
print(f"Error reading {runsheet}: {e}")
173+
print(f"Using runsheet: {selected_runsheet}")
168174

169175
elif len(runsheet_files) > 1:
170176
if target:

Amplicon/Illumina/Workflow_Documentation/SW_AmpIllumina-B/workflow_code/visualizations/Illumina-R-visualizations.R

Lines changed: 63 additions & 33 deletions
Original file line numberDiff line numberDiff line change
@@ -12,24 +12,21 @@ library(grid)
1212
##################################################################################
1313
## R visualization script for Illumina paired-end amplicon data ##
1414
##################################################################################
15+
# This script is automatically executed as part of the Snakemake workflow when the run_workflow.py --visualizations TRUE argument is used.
16+
# This script can also be manually executed using processed data from the workflow
1517

16-
# For local testing:
17-
# runsheet_file <- ""
18-
# sample_info <- ""
19-
# counts <- ""
20-
# taxonomy <- ""
21-
# final_outputs_dir <- ""
22-
23-
# For Snakemake:
24-
# Assign arguments to variables based on snakemake shell line
18+
# Store command line args as variables #
2519
args <- commandArgs(trailingOnly = TRUE)
2620
runsheet_file <- paste0(args[1])
2721
sample_info <- paste0(args[2])
2822
counts <- paste0(args[3])
2923
taxonomy <- paste0(args[4])
3024
assay_suffix <- paste(args[5])
31-
final_outputs_dir <- paste0(args[6])
25+
plots_dir <- paste0(args[6])
3226
output_prefix <- paste0(args[7])
27+
########################################
28+
29+
RColorBrewer_Palette <- "Set1"
3330

3431
# Runsheet read1 path/filename column name
3532
read1_path_colname <- 'read1_path'
@@ -93,16 +90,16 @@ g_legend <- function(a.gplot){
9390
###########################################
9491

9592
# Assign the directory paths to variables
96-
beta_diversity_out_dir <- paste0(final_outputs_dir, "beta_diversity", .Platform$file.sep)
97-
alpha_diversity_out_dir <- paste0(final_outputs_dir, "alpha_diversity", .Platform$file.sep)
98-
taxonomy_out_dir <- paste0(final_outputs_dir, "taxonomy", .Platform$file.sep)
99-
de_out_dir <- paste0(final_outputs_dir, "da", .Platform$file.sep)
93+
beta_diversity_out_dir <- paste0(plots_dir, "beta_diversity", .Platform$file.sep)
94+
alpha_diversity_out_dir <- paste0(plots_dir, "alpha_diversity", .Platform$file.sep)
95+
taxonomy_out_dir <- paste0(plots_dir, "taxonomy", .Platform$file.sep)
96+
de_out_dir <- paste0(plots_dir, "da", .Platform$file.sep)
10097

10198
abundance_out_dir <- paste0(de_out_dir, "differential_abundance", .Platform$file.sep)
10299
volcano_out_dir <- paste0(de_out_dir, "volcano", .Platform$file.sep)
103100

104101
# List of all directory variables
105-
out_dirs <- list(final_outputs_dir, beta_diversity_out_dir, alpha_diversity_out_dir, taxonomy_out_dir, de_out_dir, abundance_out_dir, volcano_out_dir)
102+
out_dirs <- list(plots_dir, beta_diversity_out_dir, alpha_diversity_out_dir, taxonomy_out_dir, de_out_dir, abundance_out_dir, volcano_out_dir)
106103

107104
# Loop through each directory path to check and create if necessary
108105
for (dir_path in out_dirs) {
@@ -183,14 +180,17 @@ vst_trans_count_tab <- assay(deseq_counts_vst)
183180
# Add colors to runsheet
184181
num_colors <- length(unique(runsheet[[groups_colname]]))
185182

186-
# Check if number of colors exceeds the limit of the "Set1" palette (9)
187-
if (num_colors > 9) {
188-
# Create a custom palette with more colors
189-
custom_palette <- colorRampPalette(brewer.pal(9, "Set1"))(num_colors)
183+
# List of RColorBrewer Palette lengths
184+
RColorBrewer_Palette_lengths <- c(Accent=8, Dark2=8, Paired=12, Pastel1=9, Pastel2=8, Set1=9, Set2=8, Set3=12)
185+
186+
# Check if number of colors exceeds the limit of the selected RColorBrewer_Palette palette
187+
if (num_colors > RColorBrewer_Palette_lengths[RColorBrewer_Palette]) {
188+
# If so, reate a custom palette with more colors
189+
custom_palette <- colorRampPalette(brewer.pal(RColorBrewer_Palette_lengths[RColorBrewer_Palette], RColorBrewer_Palette))(num_colors)
190190
colors <- custom_palette
191191
} else {
192-
# Use the standard "Set1" palette
193-
colors <- brewer.pal(num_colors, "Set1")
192+
# Else just use the standard RColorBrewer_Palette palette
193+
colors <- brewer.pal(num_colors, RColorBrewer_Palette)
194194
}
195195

196196

@@ -244,20 +244,37 @@ default_cex <- adjust_cex(length(rownames(sample_info_tab)))
244244
# Set for 11x8 plot margins, else try ggdendrogram
245245
space_available <- height_in_inches/5.4
246246

247-
calculate_max_cex <- function(n, space_avail) {
248-
base_char_width_inch <- 0.1 # average width of a character in inches at cex = 1
247+
longest_name <- rownames(sample_info_tab)[which.max(nchar(rownames(sample_info_tab)))]
248+
249+
calculate_max_cex <- function(longest_name, space_avail) {
250+
# Define weights for lower case letters, periods, else
251+
lower_case_weight <- 0.10
252+
other_char_weight <- 0.15
253+
dot_weight <- 0.02 # Weight for the period character
254+
255+
# Calculate weights in longest sample name
256+
char_weights <- sapply(strsplit(longest_name, "")[[1]], function(char) {
257+
if (char == ".") {
258+
return(dot_weight)
259+
} else if (grepl("[a-z]", char)) {
260+
return(lower_case_weight)
261+
} else {
262+
return(other_char_weight)
263+
}
264+
})
265+
266+
average_weight <- mean(char_weights)
249267

250-
# Calculate the maximum cex that fits the space
251-
max_cex <- space_avail / (n * base_char_width_inch)
268+
# Calculate the maximum cex that fits the space using the average weight
269+
n = nchar(longest_name)
270+
max_cex <- space_avail / (n * average_weight)
252271

253272
return(max_cex)
254273
}
255274

275+
max_cex <- calculate_max_cex(longest_name, space_available)
276+
dendro_cex <- min(max_cex, default_cex)
256277

257-
# Lower cex based on max sample names to prevent clipping of sample names on plot
258-
max_length <- max(nchar(rownames(sample_info_tab)))
259-
max_cex <- calculate_max_cex(max_length, space_available)
260-
max_cex <- min(max_cex, default_cex)
261278

262279
legend_groups <- unique(sample_info_tab$groups)
263280
legend_colors <- unique(sample_info_tab$color)
@@ -269,7 +286,7 @@ png(file.path(beta_diversity_out_dir, paste0(output_prefix, "dendrogram_by_group
269286
height = height_in_pixels,
270287
res = dpi)
271288
par(mar = c(10.5, 4.1, 0.6 , 2.1))
272-
euc_dend %>% set("labels_cex", max_cex) %>% plot(ylab = "VST Euc. dist.")
289+
euc_dend %>% set("labels_cex", dendro_cex) %>% plot(ylab = "VST Euc. dist.")
273290
par(xpd=TRUE)
274291
legend("bottom", inset = c(0, -.34), legend = legend_groups, fill = legend_colors, bty = 'n', cex = legend_cex)
275292
dev.off()
@@ -438,7 +455,7 @@ ggsave(filename = paste0(alpha_diversity_out_dir, output_prefix, "richness_and_d
438455
legend <- g_legend(ordination_plot)
439456
grid.newpage()
440457
grid.draw(legend)
441-
legend_filename <- paste0(final_outputs_dir, output_prefix, "color_legend_", assay_suffix, ".png")
458+
legend_filename <- paste0(plots_dir, output_prefix, "color_legend_", assay_suffix, ".png")
442459
increment <- ifelse(length(unique(sample_info_tab$groups)) > 9, ceiling((length(unique(sample_info_tab$groups)) - 9) / 3), 0)
443460
legend_height <- 3 + increment
444461
ggsave(legend_filename, plot = legend, device = "png", width = 11.1, height = legend_height, dpi = 300)
@@ -565,6 +582,8 @@ deseq_modeled <- tryCatch({
565582
write.table(counts(deseq_modeled, normalized=TRUE), file = paste0(de_out_dir, output_prefix, "normalized_counts_", assay_suffix, ".tsv"), sep="\t", row.names=TRUE, quote=FALSE)
566583
# make the volcanoplot
567584
plot_comparison <- function(group1, group2) {
585+
plot_width_inches = 11.1
586+
plot_height_inches = 8.33
568587

569588
deseq_res <- results(deseq_modeled, contrast = c("groups", group1, group2))
570589
norm_tab <- counts(deseq_modeled, normalized = TRUE) %>% data.frame()
@@ -575,13 +594,24 @@ plot_comparison <- function(group1, group2) {
575594
volcano_data <- volcano_data[!is.na(volcano_data$padj), ]
576595
volcano_data$significant <- volcano_data$padj <= p_val #also logfc cutoff?
577596

597+
######Long x-axis label adjustments##########
598+
x_label <- paste("Log2 Fold Change\n(",group1," vs ",group2,")")
599+
label_length <- nchar(x_label)
600+
max_allowed_label_length = plot_width_inches * 10
601+
602+
# Construct x-axis label with new line breaks if was too long
603+
if (label_length > max_allowed_label_length){
604+
x_label <- paste("Log2 Fold Change\n\n(", group1, "\n vs \n", group2, ")", sep="")
605+
}
606+
#######################################
607+
578608
# ASVs promoted in space on right, reduced on left
579609
p <- ggplot(volcano_data, aes(x=log2FoldChange, y=-log10(padj), color=significant)) +
580610
geom_point(alpha=0.7, size=2) +
581611
scale_color_manual(values=c("black", "red"), labels=c(paste0("padj > ", p_val), paste0("padj \u2264 ", p_val))) +
582612
theme_bw() +
583613
labs(title="Volcano Plot",
584-
x=paste("Log2 Fold Change\n(",group1," vs ",group2,")"),
614+
x=x_label,
585615
y="-Log10 P-value",
586616
color=paste0("")) +
587617
theme(legend.position="top")
@@ -598,7 +628,7 @@ plot_comparison <- function(group1, group2) {
598628
"_vs_",
599629
gsub(" ", "_", group2), ".png"),
600630
plot=volcano_plot,
601-
width = 11.1, height = 8.33, dpi = 300)
631+
width = plot_width_inches, height = plot_height_inches, dpi = 300)
602632

603633
write.csv(deseq_res, file = paste0(abundance_out_dir,
604634
output_prefix, gsub(" ", "_", group1),
Lines changed: 71 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,71 @@
1+
# SW_AmpIllumina-B Visualization Script Information and Usage Instructions<!-- omit in toc -->
2+
3+
4+
## General info <!-- omit in toc -->
5+
The documentation for this script and its outputs can be found in [sections 6-10 of the GL-DPPD-7104-B.md processing protocol](/Amplicon/Illumina/Pipeline_GL-DPPD-7104_Versions/GL-DPPD-7104-B.md#6-amplicon-seq-data-analysis-set-up). This script is automatically executed as an optional step of the SW_AmpIllumina-B Snakemake workflow when the `run_workflow.py` argument `--visualizations TRUE` is used. Alternatively, the script can be executed manually as detailed below.
6+
7+
<br>
8+
9+
---
10+
11+
## Utilizing the script <!-- omit in toc -->
12+
13+
14+
- [1. Set up the execution environment](#1-run-the-workflow-using-run_workflowpy)
15+
- [2. Run the visualization script manually](#2-run-the-visualization-script-manually)
16+
- [3. Parameter definitions](#3-parameter-definitions)
17+
18+
<br>
19+
20+
___
21+
22+
### 1. Set up the execution environment
23+
24+
The script should be executed from a Conda environment created using the [R_visualizations.yaml](/Amplicon/Illumina/Workflow_Documentation/SW_AmpIllumina-B/workflow_code/visualizations/R_visualizations.yaml) environment file.
25+
26+
<br>
27+
28+
___
29+
30+
### 2. Run the visualization script manually
31+
32+
To run the script, the variables `runsheet_file`, `sample_info`, `counts`, `taxonomy`, `assay_suffix`, `plots_dir`, and `output_prefix` must be specified. The script can be manually executed via the command line by providing positional arguments.
33+
34+
Additionally, the `RColorBrewer_Palette` variable can be modified in the script. This variable determines the color palette from the RColorBrewer package that is applied to the plots.
35+
36+
```R
37+
# Store command line args as variables #
38+
args <- commandArgs(trailingOnly = TRUE)
39+
runsheet_file <- paste0(args[1])
40+
sample_info <- paste0(args[2])
41+
counts <- paste0(args[3])
42+
taxonomy <- paste0(args[4])
43+
assay_suffix <- paste(args[5])
44+
plots_dir <- paste0(args[6])
45+
output_prefix <- paste0(args[7])
46+
########################################
47+
48+
RColorBrewer_Palette <- "Set1"
49+
50+
```
51+
52+
Example run command:
53+
```bash
54+
Rscript /path/to/visualizations/Illumina-R-visualizations.R "{runsheet_file}" "{sample_info}" "{counts}" "{taxonomy}" "{assay_suffix}" "{plots_dir}" "{output_prefix}"
55+
```
56+
57+
<br>
58+
59+
___
60+
61+
### 3. Parameter definitions
62+
63+
**Parameter Definitions for Illumina-R-visualizations.R:**
64+
* `runsheet_file` – specifies the runsheet containing sample metadata required for processing (output from [GL-DPPD-7104-B step 6a](/Amplicon/Illumina/Pipeline_GL-DPPD-7104_Versions/GL-DPPD-7104-B.md#6a-create-sample-runsheet))
65+
* `sample_info` – specifies the text file containing the IDs of each sample used, required for running the SW_AmpIllumina workflow (output from [run_workflow.py](/Amplicon/Illumina/Workflow_Documentation/SW_AmpIllumina-B/README.md#5-additional-output-files))
66+
* `counts` – specifies the ASV counts table (output from [GL-DPPD-7104-B step 5g](/Amplicon/Illumina/Pipeline_GL-DPPD-7104_Versions/GL-DPPD-7104-B.md#5g-generating-and-writing-standard-outputs))
67+
* `taxonomy` – specifies the taxonomy table (output from [GL-DPPD-7104-B step 5g](/Amplicon/Illumina/Pipeline_GL-DPPD-7104_Versions/GL-DPPD-7104-B.md#5g-generating-and-writing-standard-outputs))
68+
* `assay_suffix` – specifies a string that is prepended to the start of the output file names. Default: ""
69+
* `plots_dir` – specifies the path where output files will be saved
70+
* `output_prefix` – specifies a string that is appended to the end of the output file names. Default: "_GLAmpSeq"
71+
* `RColorBrewer_Palette` – specifies the RColorBrewer palette that will be used for coloring in the plots. Options include "Set1", "Accent", "Dark2", "Paired", "Pastel1", "Pastel2", "Set2", and "Set3". Default: "Set1"

0 commit comments

Comments
 (0)