
Half-bad hetvars: one bad allele fails the genotype [VS-1615] [VS-1649] #9149


Merged May 8, 2025 (57 commits; changes shown from 17 commits).

Commits:
f75c299  Half-bad hetvars, VCF edition: one bad allele fails the genotype [VS-… (mcovarr, Apr 14, 2025)
1d30881  dockstore (mcovarr, Apr 14, 2025)
e21f0ac  VQSR version (mcovarr, Apr 15, 2025)
3ee277a  VDS attempt (mcovarr, Apr 15, 2025)
756d363  Merge remote-tracking branch 'origin/ah_var_store' into vs_1615_hetva… (mcovarr, Apr 16, 2025)
4e18503  dockstore (mcovarr, Apr 16, 2025)
0171302  make VDS from half-bad failed hetvar VCFs (mcovarr, Apr 16, 2025)
d5b88b6  update merge and rescore too (mcovarr, Apr 16, 2025)
d428e13  turn on filtering for VQSR manual tieout (mcovarr, Apr 17, 2025)
182019c  Merge remote-tracking branch 'origin/ah_var_store' into vs_1615_hetva… (mcovarr, Apr 17, 2025)
815c1a3  variants docker (mcovarr, Apr 17, 2025)
bda44b7  dockstore (mcovarr, Apr 17, 2025)
45b65b2  holy moly tabs (mcovarr, Apr 18, 2025)
ed09ccc  docker (mcovarr, Apr 18, 2025)
ace768f  dockstore (mcovarr, Apr 18, 2025)
852e96f  dockstore (mcovarr, Apr 18, 2025)
7b47b30  remove stuff prior to review (mcovarr, Apr 22, 2025)
7c9e753  Merge remote-tracking branch 'origin/ah_var_store' into vs_1615_hetva… (mcovarr, Apr 23, 2025)
eaf3b08  docker (mcovarr, Apr 23, 2025)
f919d24  include Hail optimizations, update Docker (mcovarr, Apr 23, 2025)
1a21244  simplify more (mcovarr, Apr 24, 2025)
23d45a6  Docker (mcovarr, Apr 24, 2025)
f444f30  oops (mcovarr, Apr 24, 2025)
84eaa02  docker (mcovarr, Apr 24, 2025)
a040dc3  hash bump (mcovarr, Apr 24, 2025)
124d917  another hash bump (mcovarr, Apr 24, 2025)
c533916  yet another hash bump (mcovarr, Apr 24, 2025)
eb82f9f  update truth path (mcovarr, Apr 25, 2025)
2c00049  hash bump (mcovarr, Apr 25, 2025)
a545738  another hash bump (mcovarr, Apr 25, 2025)
fdb1988  Merge remote-tracking branch 'origin/ah_var_store' into vs_1615_hetva… (mcovarr, Apr 25, 2025)
2311f27  how did that get messed up (mcovarr, Apr 29, 2025)
7dc7c85  spanning deletion . quality scores become NaN doubles (mcovarr, Apr 29, 2025)
ab94c99  fresh baked GATK Docker with . quality fixes (mcovarr, Apr 29, 2025)
b3bb025  hash bump for all chr run (mcovarr, Apr 30, 2025)
218dae4  hash bump (mcovarr, Apr 30, 2025)
41f3844  hash bump (mcovarr, May 1, 2025)
27ac2bb  hash bump (mcovarr, May 1, 2025)
8e52897  hash bump (mcovarr, May 1, 2025)
0f21f6f  vat hash bump (mcovarr, May 1, 2025)
6f831cb  vat hash bump (mcovarr, May 1, 2025)
64a5bcc  vat hash bump (mcovarr, May 1, 2025)
3a9a0ae  vat hash bump (mcovarr, May 1, 2025)
6f606f0  hash bump (mcovarr, May 1, 2025)
347eb2f  hash bump (mcovarr, May 2, 2025)
6d92d22  hash bump (mcovarr, May 3, 2025)
521a606  Merge remote-tracking branch 'origin/ah_var_store' into vs_1615_hetva… (mcovarr, May 5, 2025)
5fbc434  cleanup (mcovarr, May 5, 2025)
a8e5796  hash bump (mcovarr, May 5, 2025)
33a67a8  hash bump (mcovarr, May 6, 2025)
2a8a8b5  hash bump (mcovarr, May 6, 2025)
e2bc338  hash bump (mcovarr, May 6, 2025)
09a0d5f  cleanup, VAT from VDS (mcovarr, May 7, 2025)
6c6fa2b  hash bump (mcovarr, May 7, 2025)
9ade1a4  hash bump (mcovarr, May 8, 2025)
c09d1a3  hash bump (mcovarr, May 8, 2025)
7aa2ce1  hash bump (mcovarr, May 8, 2025)
8 changes: 4 additions & 4 deletions scripts/variantstore/scripts/import_gvs.py
@@ -378,11 +378,11 @@ def convert_array_with_id_keys_to_dense_array(arr, ids, drop=[]):
     any_yes=acc.any_yes | allele_YES[called_idx - 1],
     any_snp=acc.any_snp | allele_is_snp[called_idx - 1],
     any_indel=acc.any_indel | ~allele_is_snp[called_idx - 1],
-    any_snp_ok=acc.any_snp_ok | (allele_is_snp[called_idx - 1] & allele_OK[called_idx - 1]),
-    any_indel_ok=acc.any_indel_ok | (~allele_is_snp[called_idx - 1] & allele_OK[called_idx - 1]),
-), hl.struct(any_no=False, any_yes=False, any_snp=False, any_indel=False, any_snp_ok=False, any_indel_ok=False)))
+    all_snps_ok=acc.all_snps_ok & (~allele_is_snp[called_idx - 1] | allele_OK[called_idx - 1]),
+    all_indels_ok=acc.all_indels_ok & (allele_is_snp[called_idx - 1] | allele_OK[called_idx - 1]),
Collaborator:

I am confused as to why this is not:

Suggested change:
-all_indels_ok=acc.all_indels_ok & (allele_is_snp[called_idx - 1] | allele_OK[called_idx - 1]),
+all_snps_ok=acc.all_snps_ok & (allele_is_snp[called_idx - 1] & allele_OK[called_idx - 1]),
+all_indels_ok=acc.all_indels_ok & (~allele_is_snp[called_idx - 1] & allele_OK[called_idx - 1]),

Maybe I'm just confounded by the logic

mcovarr (Collaborator, Author), Apr 23, 2025:

Taking the SNP line as an example, I interpreted this as "nothing wrong SNP-wise if it's not a SNP at all, OR (implying it is a SNP) if the allele is OK".

If the allele was bad and it was an INDEL, the following line would catch it.

mcovarr (Collaborator, Author):

Actually now that I look at this again I think I can/should clean it up a bit... I don't think we care about "any" SNPs or INDELs any more now that the default for the "ok" accumulators is True.

+), hl.struct(any_no=False, any_yes=False, any_snp=False, any_indel=False, all_snps_ok=True, all_indels_ok=True)))

-vd = vd.annotate_entries(FT=~ft.any_no & (ft.any_yes | ((~ft.any_snp | ft.any_snp_ok) & (~ft.any_indel | ft.any_indel_ok))))
+vd = vd.annotate_entries(FT=~ft.any_no & (ft.any_yes | ((~ft.any_snp | ft.all_snps_ok) & (~ft.any_indel | ft.all_indels_ok))))

vd = vd.drop('allele_NO', 'allele_YES', 'allele_is_snp', 'allele_OK')
hl.vds.VariantDataset(
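The per-genotype fold above can be sketched in plain Python (an illustrative stand-in for the Hail expression, not the real implementation; `called_alleles`, `allele_is_snp`, `allele_ok`, `allele_no`, and `allele_yes` are hypothetical lists mirroring the per-allele arrays in the script):

```python
# Sketch of the "half-bad hetvar" rule: the all_*_ok accumulators start True,
# so a single bad allele of either class fails the whole genotype.

def genotype_passes(called_alleles, allele_is_snp, allele_ok,
                    allele_no=None, allele_yes=None):
    """Return the FT (passes-filter) flag for one genotype."""
    n = len(allele_is_snp)
    allele_no = allele_no or [False] * n
    allele_yes = allele_yes or [False] * n

    acc = dict(any_no=False, any_yes=False, any_snp=False, any_indel=False,
               all_snps_ok=True, all_indels_ok=True)
    for idx in called_alleles:      # 1-based alt allele indices; 0 = ref
        if idx == 0:
            continue                # ref alleles don't affect filtering
        i = idx - 1
        acc['any_no'] |= allele_no[i]
        acc['any_yes'] |= allele_yes[i]
        acc['any_snp'] |= allele_is_snp[i]
        acc['any_indel'] |= not allele_is_snp[i]
        # "nothing wrong SNP-wise if it's not a SNP at all, OR the allele is OK"
        acc['all_snps_ok'] &= (not allele_is_snp[i]) or allele_ok[i]
        acc['all_indels_ok'] &= allele_is_snp[i] or allele_ok[i]

    return (not acc['any_no']) and (
        acc['any_yes']
        or ((not acc['any_snp'] or acc['all_snps_ok'])
            and (not acc['any_indel'] or acc['all_indels_ok'])))

# het 1/2 genotype: SNP allele OK, indel allele bad -> whole genotype fails
print(genotype_passes([1, 2], [True, False], [True, False]))   # False
# both alleles OK -> genotype passes
print(genotype_passes([1, 2], [True, False], [True, True]))    # True
```

This also shows why the old `any_snp_ok`/`any_indel_ok` accumulators were too lenient: one OK allele could mask a bad allele of the same class.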
14 changes: 7 additions & 7 deletions scripts/variantstore/scripts/merge_and_rescore_vdses.py
@@ -153,18 +153,18 @@ def patch_variant_data(vd: hl.MatrixTable, site_filters: hl.Table, vets_filters:
     any_yes=acc.any_yes | allele_YES[called_idx - 1],
     any_snp=acc.any_snp | allele_is_snp[called_idx - 1],
     any_indel=acc.any_indel | ~allele_is_snp[called_idx - 1],
-    any_snp_ok=acc.any_snp_ok
-    | (allele_is_snp[called_idx - 1] & allele_OK[called_idx - 1]),
-    any_indel_ok=acc.any_indel_ok
-    | (~allele_is_snp[called_idx - 1] & allele_OK[called_idx - 1]),
+    all_snps_ok=acc.all_snps_ok
+    & (~allele_is_snp[called_idx - 1] | allele_OK[called_idx - 1]),
+    all_indels_ok=acc.all_indels_ok
+    & (allele_is_snp[called_idx - 1] | allele_OK[called_idx - 1]),
),
hl.struct(
any_no=False,
any_yes=False,
any_snp=False,
any_indel=False,
-    any_snp_ok=False,
-    any_indel_ok=False,
+    all_snps_ok=True,
+    all_indels_ok=True,
),
)
)
@@ -173,7 +173,7 @@ def patch_variant_data(vd: hl.MatrixTable, site_filters: hl.Table, vets_filters:
FT=~ft.any_no
& (
ft.any_yes
-    | ((~ft.any_snp | ft.any_snp_ok) & (~ft.any_indel | ft.any_indel_ok))
+    | ((~ft.any_snp | ft.all_snps_ok) & (~ft.any_indel | ft.all_indels_ok))
)
)

36 changes: 18 additions & 18 deletions scripts/variantstore/scripts/vds_validation.py
@@ -17,34 +17,34 @@

mcovarr (Collaborator, Author):

these are all stowaway changes fixing some bonkers whitespace that was driving IntelliJ nuts


(whitespace-only diff: each line is deleted and re-added with tabs replaced by spaces; the resulting code is shown once)

def check_samples_match(vds):
    print('checking sample equivalence between reference and variant MTs')
    assert vds.reference_data.cols().select().collect() == vds.variant_data.cols().select().collect()

def check_ref_blocks(vds):
    print('checking that:\n * no reference blocks have GQ=0\n * all ref blocks have END after start\n * all ref blocks are max 1000 bases long')
    rd = vds.reference_data
    rd = rd.annotate_rows(locus_start = rd.locus.position)

    LEN = rd.END - rd.locus_start + 1

    print('checking that: no reference blocks have GQ=0')
    assert rd.aggregate_entries(hl.agg.all(hl.all(rd.GQ > 0)))

    print('checking that: all ref blocks have END after start')
    assert rd.aggregate_entries(hl.agg.all(hl.all(LEN >= 0)))

    print('checking that: all ref blocks are max 1000 bases long')
    assert rd.aggregate_entries(hl.agg.all(hl.all(LEN <= rd.ref_block_max_length)))

def check_densify_small_region(vds):
    print('running densify on 200kb region')
    from time import time
    t1 = time()

    filt = hl.vds.filter_intervals(vds, [hl.parse_locus_interval('chr16:29.5M-29.7M', reference_genome='GRCh38')])
    n = hl.vds.to_dense_mt(filt).select_entries('LGT')._force_count_rows()

    print(f'took {time() - t1:.1f}s to densify {n} rows after interval query')



2 changes: 1 addition & 1 deletion scripts/variantstore/wdl/GvsUtils.wdl
@@ -129,7 +129,7 @@ task GetToolVersions {
# GVS generally uses the smallest `alpine` version of the Google Cloud SDK as it suffices for most tasks, but
    # there are a handful of tasks that require the larger GNU libc-based `slim`.
String cloud_sdk_slim_docker = "gcr.io/google.com/cloudsdktool/cloud-sdk:435.0.0-slim"
-    String variants_docker = "us-central1-docker.pkg.dev/broad-dsde-methods/gvs/variants:2025-04-17-alpine-927ee022e0b8"
+    String variants_docker = "us-central1-docker.pkg.dev/broad-dsde-methods/gvs/variants:2025-04-18-alpine-364b678b8945"
String variants_nirvana_docker = "us.gcr.io/broad-dsde-methods/variantstore:nirvana_2022_10_19"
String gatk_docker = "us-central1-docker.pkg.dev/broad-dsde-methods/gvs/gatk:2025-03-20-gatkbase-728e45646e04"
String real_time_genomics_docker = "docker.io/realtimegenomics/rtg-tools:latest"
@@ -1006,14 +1006,14 @@ private VariantContext createVariantContextFromSampleRecord(final ExtractCohortR
 boolean isFailingGenotype(final Stream<Allele> nonRefAlleles,
                           final Map<Allele, Double> remappedVQScoreMap,
                           final Double vqScoreThreshold) {
-    // get the max (best) vqslod of the alleles in this genotype
-    Optional<Double> maxVal =
+    // get the min (worst) vqslod of the alleles in this genotype
+    Optional<Double> minVal =
         nonRefAlleles
             .map(remappedVQScoreMap::get)
             .filter(Objects::nonNull)
-            .max(Double::compareTo);
-    // It's a failing site if the maximum vqlod (if found) is less than the threshold
-    return maxVal.isPresent() && maxVal.get() < vqScoreThreshold;
+            .min(Double::compareTo);
+    // It's a failing site if the minimum vqlod (if found) is less than the threshold
+    return minVal.isPresent() && minVal.get() < vqScoreThreshold;
 }

private SortingCollection<GenericRecord> createSortedVetCollectionFromBigQuery(final String projectID,
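The VQSLOD change above can be sketched in plain Python (illustrative only; `is_failing_genotype_vqslod` and its arguments are hypothetical stand-ins for the Java method and its allele-to-score map):

```python
# Sketch of the fixed rule: fail the genotype when the WORST (minimum) VQSLOD
# among its scored non-ref alleles is below the threshold, so one bad allele
# now fails the whole genotype. (Higher VQSLOD is better.)

def is_failing_genotype_vqslod(non_ref_allele_scores, vq_score_threshold):
    """non_ref_allele_scores: VQSLOD per non-ref allele; None means unscored."""
    scored = [s for s in non_ref_allele_scores if s is not None]
    # mirrors minVal.isPresent() && minVal.get() < vqScoreThreshold
    return bool(scored) and min(scored) < vq_score_threshold

# het 1/2 genotype: good SNP allele (10.0), bad indel allele (-3.0), threshold 0.0.
# The old max-based check passed this genotype; the min-based check fails it.
print(is_failing_genotype_vqslod([10.0, -3.0], 0.0))  # True
print(is_failing_genotype_vqslod([10.0, 5.0], 0.0))   # False
```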
@@ -129,13 +129,14 @@ boolean isFailingSite(final Stream<Double> vqScores, final Double vqScoreThresho
 boolean isFailingGenotype(final Stream<Allele> nonRefAlleles,
                           final Map<Allele, Double> remappedVQScoreMap,
                           final Double vqScoreThreshold) {
-    // get the minimum (best) calibration sensitivity for all non-Yay sites, and apply the filter
-    Optional<Double> minVal =
+    // Get the maximum (worst) calibration sensitivity for these non-Yay alleles. If there is an allele, and it is
+    // greater than the vqScoreThreshold, fail the genotype.
+    Optional<Double> maxVal =
         nonRefAlleles
             .map(remappedVQScoreMap::get)
             .filter(Objects::nonNull)
-            .min(Double::compareTo);
+            .max(Double::compareTo);

-    return minVal.isPresent() && minVal.get() > vqScoreThreshold;
+    return maxVal.isPresent() && maxVal.get() > vqScoreThreshold;
 }
}
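The calibration-sensitivity change follows the same pattern with the polarity reversed; a plain-Python sketch (hypothetical stand-in for the Java method, not GATK itself):

```python
# Unlike VQSLOD, LOWER calibration sensitivity is better, so the worst allele
# score is the MAXIMUM, and the genotype fails when that worst sensitivity
# exceeds the threshold -- again, one bad allele fails the whole genotype.

def is_failing_genotype_sensitivity(non_ref_allele_scores, threshold):
    """non_ref_allele_scores: calibration sensitivity per non-ref allele; None means unscored."""
    scored = [s for s in non_ref_allele_scores if s is not None]
    # mirrors maxVal.isPresent() && maxVal.get() > vqScoreThreshold
    return bool(scored) and max(scored) > threshold

# one allele over a 0.99 threshold fails the genotype
print(is_failing_genotype_sensitivity([0.90, 0.997], 0.99))  # True
print(is_failing_genotype_sensitivity([0.90, 0.95], 0.99))   # False
```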