Commit d3f6ba5

PGEN + Docker image updates [VS-1254] (#8749)
1 parent bf8dc13 commit d3f6ba5

28 files changed: +156329 −6 lines

.dockstore.yml

Lines changed: 24 additions & 0 deletions
@@ -433,3 +433,27 @@ workflows:
         - master
       tags:
         - /.*/
+  - name: GvsExtractCallsetPgen
+    subclass: WDL
+    primaryDescriptorPath: /scripts/variantstore/wdl/GvsExtractCallsetPgen.wdl
+    filters:
+      branches:
+        - master
+        - ah_var_store
+        - EchoCallset
+  - name: GvsExtractCallsetPgenMerged
+    subclass: WDL
+    primaryDescriptorPath: /scripts/variantstore/wdl/GvsExtractCallsetPgenMerged.wdl
+    filters:
+      branches:
+        - master
+        - ah_var_store
+        - EchoCallset
+  - name: MergePgenHierarchicalWdl
+    subclass: WDL
+    primaryDescriptorPath: /scripts/variantstore/wdl/MergePgenHierarchical.wdl
+    filters:
+      branches:
+        - master
+        - ah_var_store
+        - EchoCallset

Dockerfile

Lines changed: 7 additions & 0 deletions
@@ -80,6 +80,13 @@ RUN echo "source activate gatk" > /root/run_unit_tests.sh && \
     echo "ln -s /gatkCloneMountPoint/build/ /gatkCloneMountPoint/scripts/docker/build" >> /root/run_unit_tests.sh && \
     echo "cd /gatk/ && /gatkCloneMountPoint/gradlew -Dfile.encoding=UTF-8 -b /gatkCloneMountPoint/dockertest.gradle testOnPackagedReleaseJar jacocoTestReportOnPackagedReleaseJar -a -p /gatkCloneMountPoint" >> /root/run_unit_tests.sh

+# TODO: Determine whether we actually need this. For now it seems to be required because the version of libstdc++ on
+# TODO: the gatk base docker is out of date (maybe?)
+RUN add-apt-repository -y ppa:ubuntu-toolchain-r/test && \
+    apt-get update && \
+    apt-get -y upgrade libstdc++6 && \
+    apt-get -y dist-upgrade
+
 WORKDIR /root
 RUN cp -r /root/run_unit_tests.sh /gatk
 RUN cp -r gatk.jar /gatk
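
Given the TODO's uncertainty about whether the libstdc++ upgrade is actually needed, one way to settle it is to compare the GLIBCXX symbol versions exported before and after the upgrade step. A hypothetical diagnostic, not part of this commit; it assumes an x86_64 Ubuntu base-image layout and that binutils (`strings`) is available in the image:

```dockerfile
# Hypothetical check, not in the commit: print the newest GLIBCXX symbol
# versions exported by the installed libstdc++. Placing this RUN once before
# and once after the upgrade step shows whether the PPA changed anything.
# Assumes an x86_64 Ubuntu base and that `strings` (binutils) is installed.
RUN strings /usr/lib/x86_64-linux-gnu/libstdc++.so.6 | grep '^GLIBCXX_' | sort -V | tail -n 3
```

If the versions printed are identical before and after, the PPA upgrade is a no-op on this base image and the TODO can likely be resolved by removing the step.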

build.gradle

Lines changed: 5 additions & 0 deletions
@@ -353,6 +353,9 @@ dependencies {
     implementation('net.grey-panther:natural-comparator:1.1')
     implementation('com.fasterxml.jackson.module:jackson-module-scala_2.12:2.9.8')

+    // pgen jni
+    implementation('org.broadinstitute:pgenjni:1.0.1')
+
     testUtilsImplementation sourceSets.main.output
     testUtilsImplementation 'org.testng:testng:' + testNGVersion
     testUtilsImplementation 'org.apache.hadoop:hadoop-minicluster:' + hadoopVersion
@@ -361,6 +364,8 @@ dependencies {

     testImplementation "org.mockito:mockito-core:2.28.2"
     testImplementation "com.google.jimfs:jimfs:1.1"
+
+    testImplementation "com.github.luben:zstd-jni:1.5.5-11"
 }

 // This list needs to be kept in sync with the corresponding list in scripts/dockertest.gradle.

scripts/variantstore/docs/aou/AOU_DELIVERABLES.md

Lines changed: 12 additions & 1 deletion
@@ -19,6 +19,7 @@
 - [GvsExtractAvroFilesForHail](https://dockstore.org/my-workflows/github.com/broadinstitute/gatk/GvsExtractAvroFilesForHail) workflow
 - [GvsPrepareRangesCallset](https://dockstore.org/my-workflows/github.com/broadinstitute/gatk/GvsPrepareRangesCallset) workflow
 - [GvsExtractCallset](https://dockstore.org/my-workflows/github.com/broadinstitute/gatk/GvsExtractCallset) workflow
+- [GvsExtractCallsetPgenMerged](https://dockstore.org/my-workflows/github.com/broadinstitute/gatk/GvsExtractCallsetPgenMerged) workflow
 - [GvsCallsetStatistics](https://dockstore.org/workflows/github.com/broadinstitute/gatk/GvsCallsetStatistics) workflow
 - [GvsCalculatePrecisionAndSensitivity](https://dockstore.org/workflows/github.com/broadinstitute/gatk/GvsCalculatePrecisionAndSensitivity) workflow
 - [GvsCallsetCost](https://dockstore.org/workflows/github.com/broadinstitute/gatk/GvsCallsetCost) workflow
@@ -92,7 +93,7 @@
    - If you are debugging a Hail-related issue, you may want to set `leave_hail_cluster_running_at_end` to `true` and refer to [the suggestions for debugging issues with Hail](HAIL_DEBUGGING.md).

 1. `GvsCallsetStatistics` workflow
-   - You will need to run `GvsPrepareRangesCallset` workflow first, if it has not been run already
+   - You will need to run `GvsPrepareRangesCallset` workflow for callset statistics first, if it has not been run already.
    - This workflow transforms the data in the vet tables into a schema optimized for callset stats creation and for calculating sensitivity and precision.
    - The `only_output_vet_tables` input should be set to `true` (the default value is `false`).
    - The `enable_extract_table_ttl` input should be set to `true` (the default value is `false`), which will add a TTL of two weeks to the tables it creates.
@@ -102,6 +103,16 @@
    - This workflow needs to be run with the `extract_table_prefix` input from `GvsPrepareRangesCallset` step.
    - This workflow needs to be run with the `filter_set_name` input from `GvsCreateFilterSet` step.
    - This workflow does not use the Terra Data Entity Model to run, so be sure to select the `Run workflow with inputs defined by file paths` workflow submission option.
+1. `GvsExtractCallset` / `GvsExtractCallsetPgenMerged` workflow
+   - You will need to run the `GvsPrepareRangesCallset` workflow for each "[Region](https://support.researchallofus.org/hc/en-us/articles/14929793660948-Smaller-Callsets-for-Analyzing-Short-Read-WGS-SNP-Indel-Data-with-Hail-MT-VCF-and-PLINK)" (interval list) for which a PGEN or VCF deliverable is required for the callset.
+   - This workflow transforms the data in the vet, ref_ranges, and samples tables into a schema optimized for extract.
+   - The `enable_extract_table_ttl` input should be set to `true` (the default value is `false`), which will add a TTL of two weeks to the tables it creates.
+   - `extract_table_prefix` should be set to a name that is unique to the given Region / interval list. See the [naming conventions doc](https://docs.google.com/document/d/1pNtuv7uDoiOFPbwe4zx5sAGH7MyxwKqXkyrpNmBxeow) for guidance on what to use.
+   - Specify the `interval_list` appropriate for the PGEN / VCF extraction run you are performing.
+   - This workflow does not use the Terra Data Entity Model to run, so be sure to select the `Run workflow with inputs defined by file paths` workflow submission option.
+   - Specify the same `call_set_identifier`, `dataset_name`, `project_id`, `extract_table_prefix`, and `interval_list` that were used in the `GvsPrepareRangesCallset` run documented above.
+   - Specify the `interval_weights_bed` appropriate for the PGEN / VCF extraction run you are performing. `gs://gvs_quickstart_storage/weights/gvs_full_vet_weights_1kb_padded_orig.bed` is the interval weights BED used for Quickstart.
+   - These workflows do not use the Terra Data Entity Model to run, so be sure to select the `Run workflow with inputs defined by file paths` workflow submission option.
 1. `GvsCalculatePrecisionAndSensitivity` workflow
    - Please see the detailed instructions for running the Precision and Sensitivity workflow [here](../../tieout/AoU_PRECISION_SENSITIVITY.md).
 1. `GvsCallsetCost` workflow
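
Since the doc change above says the PGEN extract workflow runs with the `Run workflow with inputs defined by file paths` option rather than the Terra Data Entity Model, its parameters end up in an inputs JSON. A hypothetical sketch of such a file, assuming the fully-qualified input names follow the usual `WorkflowName.input_name` convention for `GvsExtractCallsetPgenMerged`; every value below is a placeholder except the Quickstart weights BED path quoted in the doc:

```json
{
  "GvsExtractCallsetPgenMerged.call_set_identifier": "example_callset",
  "GvsExtractCallsetPgenMerged.dataset_name": "example_bq_dataset",
  "GvsExtractCallsetPgenMerged.project_id": "example-gcp-project",
  "GvsExtractCallsetPgenMerged.extract_table_prefix": "example_region_prefix",
  "GvsExtractCallsetPgenMerged.interval_list": "gs://example-bucket/example_region.interval_list",
  "GvsExtractCallsetPgenMerged.interval_weights_bed": "gs://gvs_quickstart_storage/weights/gvs_full_vet_weights_1kb_padded_orig.bed"
}
```

Per the doc, `call_set_identifier`, `dataset_name`, `project_id`, `extract_table_prefix`, and `interval_list` must match the values used in the corresponding `GvsPrepareRangesCallset` run for that Region.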
