The goal of daylily is to enable more rigorous comparisons of informatics tools by formalizing their compute environments and establishing hardware profiles that reproduce each tool’s accuracy and runtime/cost performance. This approach is general, not tied to a single toolset; while AWS is involved, nothing prevents deployment elsewhere. AWS simply offers a standardized hardware environment accessible to anyone with an account. By “compute environment,” I mean more than a container—containers alone don’t guarantee hardware performance, and cost/runtime considerations demand reproducibility on specific hardware. Though daylily uses containers and conda, it remains agnostic about the tools themselves. I have three main aims:
Move away from unhelpful debates over “the best” tool and toward evidence-based evaluations. Real use cases dictate tool choice, so let’s make sure relevant data and clear methodologies are accessible, or at least ensure enough detail is published to make meaningful comparisons. Specifically, I wish to move away from scant and overly reductive metrics that fail to describe our tools in the rich detail they deserve. For example:
If I am looking for the best possible recall in SNV calling, initial data suggest I might look toward `sentieon bwa` + `sentieon DNAscope`; and, interestingly, if I wanted the best possible precision, it would be worth investigating `strobealign` + `deepvariant` (see the reference data below). F-score alone would not be as informative for these more specific cases.
Demand better metrics and documentation in tool publications: thorough cost data, specific and reproducible hardware details, more nuanced concordance metrics, and expansive QC reporting. Half-measures shouldn’t pass as “sufficient.”
These conventions were helpful at first, but our field is stuck in 2012. We need shareable frameworks that capture both accuracy and cost/runtime for truly reproducible pipeline performance, so we can finally move forward.
This repo holds the analysis of results from the first stable release of daylily running on 7 GIAB samples. The outcome of these analyses will be a whitepaper (actively being drafted).
Concordance Metrics (F-score, Recall, Precision, FDR, PPV, Sensitivity), by Sample, by Pipeline, by Variant Class:

- Best Recall: `sentieon bwa` + `sentieon DNAscope` == 0.9961
- Best Precision: `strobealign` + `deepvariant` == 0.9993
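For readers less familiar with how these metrics relate to one another, here is a minimal sketch of the arithmetic; the TP/FP/FN counts are hypothetical placeholders, not values from these runs:

```python
# Relationships among the reported concordance metrics, from true-positive
# (TP), false-positive (FP), and false-negative (FN) call counts.
# The counts below are hypothetical placeholders, not values from these runs.
tp, fp, fn = 3_900_000, 2_700, 15_200

recall = tp / (tp + fn)            # a.k.a. sensitivity
precision = tp / (tp + fp)         # a.k.a. PPV (positive predictive value)
fdr = 1 - precision                # false discovery rate
fscore = 2 * precision * recall / (precision + recall)  # harmonic mean

print(f"recall={recall:.4f} precision={precision:.4f} "
      f"FDR={fdr:.4f} F-score={fscore:.4f}")
```

Because the F-score collapses precision and recall into one number, two pipelines can post near-identical F-scores while sitting at opposite ends of the precision/recall trade-off, which is exactly why the best-recall and best-precision pipelines above differ.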
Note: the deepvariant times are artificially long in these data. These are `euc1c-two` data, and the deepvariant costs are accurate for the AZ spot-price market at that time; they are useful for predicting per-sample analysis costs in advance of starting an ephemeral cluster.
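One way to use these data for that pre-launch prediction is to multiply the vCPU-seconds a pipeline has historically consumed per 30x sample by the current spot price per vCPU-second. A minimal sketch, with illustrative numbers only (not measured values):

```python
# Rough pre-launch, per-sample cost estimate: historical vCPU-seconds per
# 30x sample multiplied by the current spot price per vCPU-second.
# All numbers here are illustrative assumptions, not measured values.
vcpu_seconds_per_sample = 1_900_000   # e.g., pulled from a prior benchmarks TSV
spot_usd_per_hr = 1.80                # current spot price for a 192-vCPU instance
vcpus_per_instance = 192

usd_per_vcpu_second = spot_usd_per_hr / vcpus_per_instance / 3600
estimated_cost = vcpu_seconds_per_sample * usd_per_vcpu_second
print(f"estimated EC2 cost per sample: ${estimated_cost:.2f}")
```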
- Demonstrate the Utility of Daylily
  - Ephemeral cluster usage: spin up an AWS cluster only when needed, run your WGS analysis, then tear it down to minimize cost.
  - Built-in cost tracking, spot instance usage, and performance metrics.
- Promote Rigorous Comparison of Tools
  - Typically, WGS comparisons omit cost and raw compute details. Daylily captures CPU time, wall time, spot pricing, and cost per vCPU-second, among other metrics (see the sketch after this list).
  - The data produced here can help drive cost estimations for your pipeline in the best availability zone.
- Streamlined Data & Results
  - All raw data (FASTQs, references, alignstats, etc.) are in `./data/`.
  - Summaries of runtime, cost, coverage, and variant-caller performance are in `./results/`.
  - MultiQC reports, F-score heatmaps, boxplots, and more provide a comprehensive overview of pipeline performance.
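As referenced above, here is a minimal sketch of deriving per-task cost and cost-per-vCPU-second figures from a benchmarks table. The path and column names are assumptions for illustration and should be checked against the actual headers of `build_annotation_benchmarks_summary.tsv`:

```python
# Sketch: per-task cost and cost per vCPU-second from a benchmarks table.
# The path and column names are assumptions for illustration; adjust them to
# the actual headers of build_annotation_benchmarks_summary.tsv.
import pandas as pd

bench = pd.read_csv(
    "results/us_west_2d/all/benchmarks/build_annotation_benchmarks_summary.tsv",  # hypothetical path
    sep="\t",
)

# assumed columns: rule, vcpus, cpu_time_s, spot_usd_per_hr
bench["usd_per_vcpu_s"] = bench["spot_usd_per_hr"] / bench["vcpus"] / 3600
bench["task_cost_usd"] = bench["cpu_time_s"] * bench["usd_per_vcpu_s"]

print(bench.groupby("rule")[["cpu_time_s", "task_cost_usd"]].sum().round(2))
```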
For information on installing or configuring Daylily, please see the Daylily repository. This README focuses on reproducing these analyses and showcasing the results, not the low-level setup.
We analyzed 7 GIAB samples (Illumina, ~30× coverage) on different references (hg38 or b37) in separate AWS regions/clusters, each with distinct aligners, variant callers, and QC steps.
- Reference: hg38
- Region: `us-west-2d`
- Tools:
- Aligners: Sentieon BWA, BWA-MEM2, Strobealign
- SNV Callers: Sentieon DNAscope, DeepVariant, Octopus, LoFreq2, Clair3
- SV Callers: Manta, Tiddit, Dysgu
- QC: MultiQC, alignstats + ~14 other tools.
- Note: A resizing event on FSx caused prolonged task hang times, which inflated costs and CPU/wall-time metrics for some runs. Cost/timing insights should be drawn from the `eu-central-1c` datasets or the `b37` dataset below.
- Reference: b37
- Region: `us-west-2d`
- Tools:
- Three aligners: Strobealign, Sentieon BWA, BWA-MEM2
- Two SNV Callers: Sentieon DNAscope, DeepVariant
- Additional QC: MultiQC, alignstats + ~14 other tools.
- Reference: hg38
- Region: `eu-central-1c`
- Tools:
- Cluster A: Sentieon BWA + Sentieon DNAscope
- Cluster B: BWA-MEM2 + DeepVariant
- Goal: Show ephemeral cluster usage under ideal conditions (with spot pricing and no interruptions). This is effectively a smaller targeted run that can be combined for cost/CPU comparisons, emphasizing Daylily’s cost management features.
- Obtain Daylily from its main repository.
- Configure AWS credentials, references (hg38/b37), and your GIAB FASTQ paths in Daylily’s config.
- Process GIAB FASTQs on an ephemeral cluster (>> detailed step-by-step commands may be found here <<), which will guide you through:
  a. Launching an ephemeral cluster via Daylily.
  b. Running the WGS pipeline, which executes the alignment, variant calling, and QC steps.
  c. Reviewing the final results in `./results/*`, which is where all files used in the analysis to follow will be found.
- Mirror FSx data back to S3.
- Delete the ephemeral cluster.
Each dataset analyzed here includes:
- `build_annotation_giab_concordance_mqc.tsv`: SNV caller concordance results (F-scores, etc.) by variant class (SNP transitions/transversions, indels, etc.).
- `build_annotation_benchmarks_summary.tsv`: summaries of runtime, CPU usage, and spot pricing per task, used to derive cost metrics.
- Alignstats (coverage metrics) and, for the two `us_west_2d` runs, MultiQC reports.
- Consolidated plots and tables inside each subdirectory.
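A minimal sketch of summarizing the concordance table by pipeline and variant class; the path and column names are assumptions for illustration and should be checked against the actual file headers:

```python
# Sketch: mean concordance metrics per pipeline and variant class.
# The path and column names are assumptions; check the actual headers of
# build_annotation_giab_concordance_mqc.tsv.
import pandas as pd

conc = pd.read_csv(
    "results/us_west_2d/all/concordance/build_annotation_giab_concordance_mqc.tsv",  # hypothetical path
    sep="\t",
)

# assumed columns: sample, pipeline, variant_class, fscore, recall, precision
summary = (
    conc.groupby(["pipeline", "variant_class"])[["fscore", "recall", "precision"]]
    .mean()
    .sort_values("recall", ascending=False)
)
print(summary.head(10))
```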
Below, we highlight key plots and observations from the three main datasets. Additional figures and tables can be found under each dataset’s `concordance/` and `benchmarks/` subdirectories.
Goal: Explore a broad matrix of aligners (3) and SNV callers (5) plus multiple SV callers on the 7 GIAB samples (hg38).
- Technical Interruption: A resizing of FSx inflated run times for some tasks, leading to higher computed costs.
- Interesting Plots:
  - Cost per vCPU-second per GB boxplots (see `results/us_west_2d/all/concordance/boxplots/`):
    - Observation: BWA-MEM2 + DeepVariant shows relatively lower cost per vCPU-second per GB. Strobealign + DNAscope is comparable, but with higher standard deviation.
  - F-score heatmaps (see `heatmaps/`; a sketch for rebuilding one follows this list):
    - Observation: DNAscope and DeepVariant consistently hit high F-scores for SNP transitions and transversions across samples.
  - Precision-vs-Recall (PVR) plots (`pvr/`):
    - Observation: Points cluster near the top-right for most aligner+caller combos, but we see slight differences in recall for indels vs. SNPs.
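If you want to rebuild a heatmap like those shipped under `heatmaps/`, a minimal sketch follows; the input path and column names are assumptions, not the exact files used here:

```python
# Sketch: pipeline x variant-class F-score heatmap from a concordance table.
# The input path and column names are assumptions for illustration.
import pandas as pd
import matplotlib.pyplot as plt

conc = pd.read_csv("concordance_table.tsv", sep="\t")  # hypothetical input
pivot = conc.pivot_table(index="pipeline", columns="variant_class",
                         values="fscore", aggfunc="mean")

fig, ax = plt.subplots(figsize=(8, 4))
im = ax.imshow(pivot.values, vmin=0.95, vmax=1.0, cmap="viridis")
ax.set_xticks(range(len(pivot.columns)))
ax.set_xticklabels(pivot.columns, rotation=45, ha="right")
ax.set_yticks(range(len(pivot.index)))
ax.set_yticklabels(pivot.index)
fig.colorbar(im, ax=ax, label="mean F-score")
fig.tight_layout()
fig.savefig("fscore_heatmap.png", dpi=150)
```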
Despite the FSx interruption, these runs confirm:
- High baseline accuracy for widely used pipelines (BWA + DeepVariant, Sentieon bwa mem + Sentieon DNAscope).
- Potentially lower-cost solutions with Strobealign, albeit with some variability.

Note: SNV callers are not tuned to expect strobealign alignments, and I expect there is significant room for improvement. Given its disadvantage, its out-of-the-box performance is quite encouraging.
Goal: Evaluate a smaller matrix (3 aligners × 2 SNV callers) against b37-based GIAB data in `us_west_2d`.
- Benchmarks:
- Faster alignments with BWA-MEM2 than with older BWA in some tasks.
- Slightly higher coverage variance with Strobealign.
- Concordance:
- DeepVariant typically edges out DNAscope in terms of SNP recall on certain GIAB samples, though the difference is small.
- Indel F-scores are relatively similar for both.
- Costs:
  - Overall lower than the `hg38_usw2d-all` set, due to fewer pipeline steps and no major FSx interruptions.
  - Boxplots in `benchmarks/` show cost scaling with CPU time but remaining within typical spot pricing bounds.
Goal: Show ideal ephemeral usage for two minimal pipelines on 7 GIAB samples with the hg38 reference:
- Sentieon BWA + DNAscope,
- BWA-MEM2 + DeepVariant.
- Separate Clusters: Each pipeline was run on a separate ephemeral cluster at spot pricing in `eu_central_1c`.
- Highlights:
  - Lower overall cost because the cluster spooled up only for these specific tasks, then shut down.
  - Daylily’s cost-tracking reveals consistent spot pricing across tasks, with minimal idle time.
- Comparisons:
- Aligners performed comparably for coverage on these GIAB samples.
- DNAscope vs. DeepVariant differences remain subtle but show up in some variant classes; check `concordance/raw_metrics` for exact precision/recall.
This dataset underscores Daylily’s ephemeral cluster approach. The ability to create and destroy a cluster quickly, with tasks tracked by cost, was highly effective for controlling expenses.
- `concordance/boxplots/`: Side-by-side boxplots for cost, F-scores, coverage distribution, etc.
- `concordance/heatmaps/`: Heatmaps covering per-tool performance across variant classes.
- `concordance/pvr/`: Precision vs. Recall (P/R) curves for each pipeline.
- `benchmarks/`: Summaries of CPU/wall time per Snakefile rule, cost breakdowns, and spot instance logs.
Feel free to explore the raw_metrics subfolders for CSV/TSV data if you want to do further custom analysis or re-plot these metrics.
From the daylily repo, generate a spot instance pricing report.
python bin/check_current_spot_market_by_zones.py --profile $AWS_PROFILE -o ./sentieon_case_study.csv --zones us-west-2a,us-west-2b,us-west-2c,us-west-2d,us-east-1a,ap-south-1a,eu-central-1a,eu-central-1b,eu-central-1c,ca-central-1a,ca-central-1b
`eu-central-1c` has been among the cheapest, with reasonable stability, for a few weeks. Proceed with this AZ to create an ephemeral cluster, run the analysis, and clean it up when idle. See the daylily repo docs for how to create and run an ephemeral cluster.
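Once the report exists, something like the following can rank candidate AZs; the column names are assumptions and should be checked against the actual header of `sentieon_case_study.csv`:

```python
# Sketch: rank availability zones from the spot pricing report generated above.
# Column names are assumptions for illustration; inspect the CSV header first.
import pandas as pd

spot = pd.read_csv("./sentieon_case_study.csv")
# assumed columns: zone, spot_price_usd_per_hr
ranked = spot.sort_values("spot_price_usd_per_hr")
print(ranked[["zone", "spot_price_usd_per_hr"]].head())
```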
BWA MEM2 + DEEPVARIANT // Complete Ephemeral Cluster Cost Analysis for 7 30x GIAB Samples, FASTQ->snv.VCF (bwa mem2, doppelmark, deepvariant)
`daylily` tracks every AWS service involved in creating, running, and tearing down ephemeral clusters. Below is the complete cost of running an ephemeral cluster to analyze 7 GIAB 30x fastq files using a `bwa mem2` + `doppelmark duplicates` + `deepvariant` pipeline, in this case running vs `hg38`.
This ephemeral cluster was created in AZ `eu-central-1c`, as it had a very favorable spot market for the `192vcpu` spot instances daylily relies upon, which cost ~$1.80/hr at that time.
This AZ had quota restrictions on how many spot instances could be run at one time, so it existed for 5hr.
- total AWS cost (EC2, FSx, networking, etc.) to run this cluster = $47.05
- total EC2 compute cost = $41.50
- active EC2 compute cost, as calculated from hg38_eu-central-1c_benchmarks.tsv = $36.33
- idle EC2 compute cost (total EC2 - active EC2) = $5.17 (12% idle)
  - Idle time is vCPU-seconds not actively in use by a job/task. 12% likely represents an upper bound, as this cluster was not running at capacity, and many jobs ran on partially utilized instances. This time can be dialed back by reducing the time threshold for tearing down idle spot instances.
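The idle figure above is simply the gap between total and active EC2 compute, expressed as a fraction of the total:

```python
# Arithmetic behind the idle EC2 figure reported above.
total_ec2_usd = 41.50    # total EC2 compute cost for the cluster
active_ec2_usd = 36.33   # active compute, summed from hg38_eu-central-1c_benchmarks.tsv

idle_ec2_usd = total_ec2_usd - active_ec2_usd
idle_fraction = idle_ec2_usd / total_ec2_usd
print(f"idle EC2 cost ${idle_ec2_usd:.2f} ({idle_fraction:.0%} idle)")
# -> idle EC2 cost $5.17 (12% idle)
```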
SENTIEON // Complete Ephemeral Cluster Cost Analysis for 7 30x GIAB Samples, FASTQ->snv.VCF (sentieon bwa mem, doppelmark, sentieon DNAscope)
$2.02 Avg EC2 Cost per Sample (some as low as $1/sample) // $2.26 (ESTIMATED) Burdened AWS Ephemeral Cluster Cost per Sample
`daylily` tracks every AWS service involved in creating, running, and tearing down ephemeral clusters. Below is the complete cost of running an ephemeral cluster to analyze 7 GIAB 30x fastq files using a `sentieon bwa` + `doppelmark duplicates` + `sentieon DNAscope` pipeline, in this case running vs `hg38`.
This ephemeral cluster was created in AZ `eu-central-1c`, as it had a very favorable spot market for the `192vcpu` spot instances daylily relies upon, which cost ~$1.10/hr at that time.
This AZ had quota restrictions on how many spot instances could be run at one time, so it existed for ~3hr (creation takes ~20m, teardown takes ~20m).
- total AWS cost (EC2, FSx, networking, etc.) to run this cluster = $??.00
- total EC2 compute cost = $??.00
- active EC2 compute cost, as calculated from hg38_eu-central-1c_SENTIEON_benchmarks.tsv = $14.17
- idle EC2 compute cost (total EC2 - active EC2) = $?? (??% idle)
  - Idle time is vCPU-seconds not actively in use by a job/task. ??% likely represents an upper bound, as this cluster was not running at capacity, and many jobs ran on partially utilized instances. This time can be dialed back by reducing the time threshold for tearing down idle spot instances.
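Assuming the headline per-sample figures are simple divisions of cluster totals by the 7 samples (an assumption, since the cluster totals above are still placeholders), the arithmetic would look like this:

```python
# Assumed derivation of the headline per-sample figures: cluster totals
# divided across the 7 GIAB samples processed.
n_samples = 7
active_ec2_usd = 14.17        # from hg38_eu-central-1c_SENTIEON_benchmarks.tsv

avg_ec2_per_sample = active_ec2_usd / n_samples          # ~$2.02
burdened_per_sample = 2.26                               # stated estimate above
implied_cluster_total = burdened_per_sample * n_samples  # ~$15.82 implied total AWS cost

print(f"avg EC2 per sample ${avg_ec2_per_sample:.2f}; "
      f"implied burdened cluster total ${implied_cluster_total:.2f}")
```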
These three data sets illustrate how Daylily:
- Unifies alignment, variant calling, SV detection, and QC in a cost-effective, highly observable, scalable, rigorously reproducible hardware environment paired with tools optimized for that hardware. The magic is more in the approach to hardware-software management than in the workflow itself (NOTE: daylily can already run Cromwell workflows, and it is reasonably trivial to enable other workflow managers).
- Captures & predicts cost, CPU, coverage, and F-score metrics in one place.
- Scales from smaller targeted runs (`hg38_euc1c-two`) to comprehensive tool comparisons (`hg38_usw2d-all`) and should be able to handle 1000's of genomes in parallel per cluster (given appropriate quotas and so on).
Next Steps:
- Integrate final results and plots into a whitepaper or preprint (see the whitepaper sketch).
- Include additional variant callers (e.g., GPU-accelerated) or references for broader coverage.
- Expand cost-estimator logic to automatically recommend an optimal AWS region or instance type based on real-time pricing.
Questions or Contributions:
- Please open an issue or pull request in this repository.
- For Daylily-specific usage, see the main Daylily repo.
Last updated: February 2025