Dev #1050

Merged 26 commits on Mar 6, 2025

Commits
f164733 gnomad v4 sv migration (bpblanken, Feb 19, 2025)
e8d9616 ruff (bpblanken, Feb 19, 2025)
abeb65e Update 0004_add_gnomad_svs.py (bpblanken, Feb 20, 2025)
e88c303 Update 0004_add_gnomad_svs.py (bpblanken, Feb 20, 2025)
3e3658e Update 0004_add_gnomad_svs.py (bpblanken, Feb 20, 2025)
282bad2 Update 0004_add_gnomad_svs.py (bpblanken, Feb 20, 2025)
cf5875c Update 0004_add_gnomad_svs.py (bpblanken, Feb 20, 2025)
026631f comment (bpblanken, Feb 20, 2025)
94c30fa Merge branch 'benb/sv_gnomad_v4_migration' of github.com:broadinstitu… (bpblanken, Feb 20, 2025)
4b3f678 ruff (bpblanken, Feb 20, 2025)
5c0cbef Merge branch 'dev' of github.com:broadinstitute/seqr-loading-pipeline… (bpblanken, Feb 23, 2025)
4278cc3 Merge branch 'dev' of github.com:broadinstitute/seqr-loading-pipeline… (bpblanken, Feb 23, 2025)
c80c04b Merge pull request #1048 from broadinstitute/benb/sv_gnomad_v4_migration (jklugherz, Feb 26, 2025)
21d0527 Update 0004_add_gnomad_svs.py (jklugherz, Mar 3, 2025)
bfde429 Update 0004_add_gnomad_svs.py (jklugherz, Mar 4, 2025)
b87bd2e do alleles field validation only if it exists on ht (jklugherz, Mar 4, 2025)
19b81ca handle set of dataset types during allele type validation (bpblanken, Mar 5, 2025)
a5330e5 this is a cleaner approach (bpblanken, Mar 5, 2025)
e4f2b80 format (bpblanken, Mar 5, 2025)
0addb5b Merge remote-tracking branch 'origin/benb/remove_hardcoded_datasettyp… (jklugherz, Mar 5, 2025)
fb0f6fe Update reference_dataset.py (bpblanken, Mar 5, 2025)
5412715 run validation on sv get_ht (jklugherz, Mar 5, 2025)
49e90e3 Merge pull request #1053 from broadinstitute/benb/remove_hardcoded_da… (jklugherz, Mar 5, 2025)
416937d Merge remote-tracking branch 'origin/dev' into sv-locus-alleles (jklugherz, Mar 5, 2025)
2511423 fix gnomad_svs ref data mock table (jklugherz, Mar 5, 2025)
aa0027b Merge pull request #1052 from broadinstitute/sv-locus-alleles (jklugherz, Mar 6, 2025)
52 changes: 52 additions & 0 deletions v03_pipeline/migrations/annotations/0004_add_gnomad_svs.py
@@ -0,0 +1,52 @@
import hail as hl

from v03_pipeline.lib.annotations import sv
from v03_pipeline.lib.migration.base_migration import BaseMigration
from v03_pipeline.lib.model import DatasetType, ReferenceGenome
from v03_pipeline.lib.reference_datasets.reference_dataset import ReferenceDataset

# This vcf was generated with the gatk command:
#
# gatk SVConcordance --verbosity DEBUG --evaluation /var/seqr/phase4.seqr.gnomad_v4_tmp.vcf.gz
# --truth /var/seqr/gnomad.v4.1.sv.sites.modified.vcf.bgz
# --sequence-dictionary gs://gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.dict
#
# Followed by:
# bcftools annotate --rename-annots /var/seqr/remap /var/seqr/phase4.seqr.gnomad_v4_tmp.vcf.gz | bgzip > /var/seqr/phase4.seqr.gnomad_v4.vcf.gz
#
# where remap contains "INFO/TRUTH_VID GNOMAD_V4.1_TRUTH_VID"
PHASE_4_CALLSET_WITH_GNOMAD_V4 = 'gs://seqr-loading-temp/phase4.seqr.gnomad_v4.vcf.gz'


class AddGnomadSVs(BaseMigration):
    reference_genome_dataset_types: frozenset[
        tuple[ReferenceGenome, DatasetType]
    ] = frozenset(
        ((ReferenceGenome.GRCh38, DatasetType.SV),),
    )

    @staticmethod
    def migrate(ht: hl.Table, **_) -> hl.Table:
        mapping_ht = (
            hl.import_vcf(
                PHASE_4_CALLSET_WITH_GNOMAD_V4,
                reference_genome=ReferenceGenome.GRCh38.value,
                force_bgz=True,
            )
            .key_rows_by('rsid')
            .rows()
        )
        ht = ht.annotate(
            **{
                'info.GNOMAD_V4.1_TRUTH_VID': mapping_ht[ht.key].info[
                    'GNOMAD_V4.1_TRUTH_VID'
                ],
            },
        )
        gnomad_svs_ht = ReferenceDataset.gnomad_svs.get_ht(ReferenceGenome.GRCh38)
        ht = ht.annotate(gnomad_svs=sv.gnomad_svs(ht, gnomad_svs_ht))
        ht = ht.drop('info.GNOMAD_V4.1_TRUTH_VID')
        return ht.annotate_globals(
            versions=ht.globals.versions.annotate(gnomad_svs='1.0'),
Collaborator

wouldn't this already be the version? shouldn't we bump it?

Contributor Author

@jklugherz jklugherz Mar 3, 2025

this migration will create the gnomad_svs table, but you're right: the globals annotation happens when we call ReferenceDataset.gnomad_svs.get_ht, so we don't need to set the version/enums here.

Collaborator

oooh, there's some confusing bits here.

ReferenceDataset.gnomad_svs.get_ht sets the globals on the gnomad_svs reference dataset (gs://seqr-reference-data/v3.1/GRCh38/gnomad_svs/1.0.ht if persisted). This code was setting the globals on the SV annotations table, as they're not currently there. It's not overwhelmingly important: we exclude gnomad_svs from consideration when computing the reference datasets to update, and it would also be annotated by the normal process, because I kept separate for_reference_genome_dataset_type_annotations (which includes gnomad_svs) and for_reference_genome_dataset_type_annotations_updates (which does not) methods. I felt the migration should include the correct globals on the annotations table for completeness though!

Contributor Author

@jklugherz jklugherz Mar 4, 2025

right I got jumbled, this is an annotation table migration.

"This code was setting the globals on the SV annotations table as they're not currently there"

This is a good reason to keep the globals annotation in this migration. @hanars, I'm going to add the globals versions and enums annotations back and re-request review.

Collaborator

yeah, it would be awesome if these had tests, but not entirely sure it would have caught this 😬.
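
The distinction drawn in this thread can be sketched in plain Python. This is only an illustration under stated assumptions, not the pipeline's code: `CURRENT_VERSIONS`, `example_dataset`, and all three helper functions are hypothetical stand-ins for the two `for_reference_genome_dataset_type_annotations*` methods named above. Only the `gnomad_svs` version `'1.0'` comes from the diff.

```python
# Hypothetical stand-ins: every name and version here is illustrative,
# except gnomad_svs='1.0', which is the value set in this migration.
CURRENT_VERSIONS = {'gnomad_svs': '1.0', 'example_dataset': '2.0'}


def datasets_for_annotations() -> set[str]:
    """Datasets annotated onto the SV annotations table (includes gnomad_svs)."""
    return set(CURRENT_VERSIONS)


def datasets_for_annotation_updates() -> set[str]:
    """Datasets considered when deciding what to re-annotate (excludes gnomad_svs)."""
    return datasets_for_annotations() - {'gnomad_svs'}


def datasets_to_update(table_versions: dict[str, str]) -> set[str]:
    """Return datasets whose version in the table globals is stale or missing."""
    return {
        name
        for name in datasets_for_annotation_updates()
        if table_versions.get(name) != CURRENT_VERSIONS[name]
    }


# Globals before the migration: gnomad_svs is absent from the table's versions.
before = {'example_dataset': '2.0'}
# The migration records the gnomad_svs version for completeness.
after = {**before, 'gnomad_svs': '1.0'}

print(datasets_to_update(before))  # set()
print(datasets_to_update(after))   # set()
```

Under these assumptions, the missing `gnomad_svs` entry never triggers an update, because the dataset is excluded from the update set. That matches the reviewer's point that setting the globals in the migration is for completeness rather than correctness.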

            enums=ht.globals.enums.annotate(gnomad_svs=hl.Struct()),
Collaborator

this feels unnecessary, as this would already be in the existing table globals and does not seem to be changing

Contributor Author


        )
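
The data flow of `migrate` can be mirrored in plain Python to make the three steps easier to follow: attach a temporary truth-ID field, derive `gnomad_svs` from it, then drop the temporary field and record the globals. This is an analogy using dictionaries, not Hail code. The variant IDs, the `AF`/`ID` fields, and the assumption that `sv.gnomad_svs` behaves like a simple lookup by truth ID are all illustrative guesses rather than facts taken from the pipeline.

```python
# Plain-Python analogy of the migration's data flow; none of this is Hail code.
# All IDs and annotation values below are made up for illustration.
TEMP_FIELD = 'info.GNOMAD_V4.1_TRUTH_VID'


def migrate(rows, mapping, gnomad_svs, globals_):
    migrated = []
    for row in rows:
        # Step 1: look up the matched gnomAD variant ID by the row's key.
        annotated = {**row, TEMP_FIELD: mapping.get(row['variant_id'])}
        # Step 2: derive the gnomad_svs annotation from the temporary field
        # (assumed to be a simple lookup; unmatched rows get None).
        annotated['gnomad_svs'] = gnomad_svs.get(annotated[TEMP_FIELD])
        # Step 3: drop the temporary field so the row schema stays clean.
        del annotated[TEMP_FIELD]
        migrated.append(annotated)
    # Record the new dataset in the table-level globals without touching others.
    new_globals = {
        'versions': {**globals_['versions'], 'gnomad_svs': '1.0'},
        'enums': {**globals_['enums'], 'gnomad_svs': {}},
    }
    return migrated, new_globals


rows = [{'variant_id': 'sv_1'}, {'variant_id': 'sv_2'}]
mapping = {'sv_1': 'hypothetical_gnomad_id_1'}
gnomad_svs = {'hypothetical_gnomad_id_1': {'AF': 0.01, 'ID': 'hypothetical_gnomad_id_1'}}
globals_ = {'versions': {'other': '3.0'}, 'enums': {}}

migrated, new_globals = migrate(rows, mapping, gnomad_svs, globals_)
```

In this sketch a row with no match in the mapping ends up with `gnomad_svs` set to `None`, which stands in for a missing value after a join; whether the real annotation does the same is an assumption.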