Dev #1050
Merged
26 commits
f164733
gnomad v4 sv migration
bpblanken e8d9616
ruff
bpblanken abeb65e
Update 0004_add_gnomad_svs.py
bpblanken e88c303
Update 0004_add_gnomad_svs.py
bpblanken 3e3658e
Update 0004_add_gnomad_svs.py
bpblanken 282bad2
Update 0004_add_gnomad_svs.py
bpblanken cf5875c
Update 0004_add_gnomad_svs.py
bpblanken 026631f
comment
bpblanken 94c30fa
Merge branch 'benb/sv_gnomad_v4_migration' of github.com:broadinstitu…
bpblanken 4b3f678
ruff
bpblanken 5c0cbef
Merge branch 'dev' of github.com:broadinstitute/seqr-loading-pipeline…
bpblanken 4278cc3
Merge branch 'dev' of github.com:broadinstitute/seqr-loading-pipeline…
bpblanken c80c04b
Merge pull request #1048 from broadinstitute/benb/sv_gnomad_v4_migration
jklugherz 21d0527
Update 0004_add_gnomad_svs.py
jklugherz bfde429
Update 0004_add_gnomad_svs.py
jklugherz b87bd2e
do alleles field validation only if it exists on ht
jklugherz 19b81ca
handle set of dataset types during allele type validation
bpblanken a5330e5
this is a cleaner approach
bpblanken e4f2b80
format
bpblanken 0addb5b
Merge remote-tracking branch 'origin/benb/remove_hardcoded_datasettyp…
jklugherz fb0f6fe
Update reference_dataset.py
bpblanken 5412715
run validation on sv get_ht
jklugherz 49e90e3
Merge pull request #1053 from broadinstitute/benb/remove_hardcoded_da…
jklugherz 416937d
Merge remote-tracking branch 'origin/dev' into sv-locus-alleles
jklugherz 2511423
fix gnomad_svs ref data mock table
jklugherz aa0027b
Merge pull request #1052 from broadinstitute/sv-locus-alleles
jklugherz
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,93 @@ | ||
import unittest | ||
from unittest.mock import patch | ||
|
||
import hail as hl | ||
|
||
from v03_pipeline.lib.model import ReferenceGenome | ||
from v03_pipeline.lib.reference_datasets.reference_dataset import ReferenceDataset | ||
|
||
TEST_GNOMAD_SVS_RAW_HT = ( | ||
'v03_pipeline/var/test/reference_datasets/raw/gnomad_svs_from_vcf.ht' | ||
) | ||
|
||
|
||
class GnomadSVsTest(unittest.TestCase): | ||
@patch('v03_pipeline.lib.reference_datasets.gnomad_svs.vcf_to_ht') | ||
def test_gnomad_svs(self, mock_vcf_to_ht): | ||
mock_vcf_to_ht.return_value = hl.read_table(TEST_GNOMAD_SVS_RAW_HT) | ||
ht = ReferenceDataset.gnomad_svs.get_ht(ReferenceGenome.GRCh38) | ||
self.assertEqual( | ||
ht.collect(), | ||
[ | ||
hl.Struct( | ||
KEY='gnomAD-SV_v3_BND_chr1_1a45f73a', | ||
locus=hl.Locus( | ||
contig='chr1', | ||
position=10434, | ||
reference_genome=ReferenceGenome.GRCh38, | ||
), | ||
alleles=['N', '<BND>'], | ||
AF=0.11413399875164032, | ||
AC=8474, | ||
AN=74246, | ||
N_HET=8426, | ||
N_HOMREF=28673, | ||
), | ||
hl.Struct( | ||
KEY='gnomAD-SV_v3_BND_chr1_3fa36917', | ||
locus=hl.Locus( | ||
contig='chr1', | ||
position=10440, | ||
reference_genome=ReferenceGenome.GRCh38, | ||
), | ||
alleles=['N', '<BND>'], | ||
AF=0.004201000090688467, | ||
AC=466, | ||
AN=110936, | ||
N_HET=466, | ||
N_HOMREF=55002, | ||
), | ||
hl.Struct( | ||
KEY='gnomAD-SV_v3_BND_chr1_7bbf34b5', | ||
locus=hl.Locus( | ||
contig='chr1', | ||
position=10464, | ||
reference_genome=ReferenceGenome.GRCh38, | ||
), | ||
alleles=['N', '<BND>'], | ||
AF=0.03698499873280525, | ||
AC=3119, | ||
AN=84332, | ||
N_HET=3115, | ||
N_HOMREF=39049, | ||
), | ||
hl.Struct( | ||
KEY='gnomAD-SV_v3_BND_chr1_933a2971', | ||
locus=hl.Locus( | ||
contig='chr1', | ||
position=10450, | ||
reference_genome=ReferenceGenome.GRCh38, | ||
), | ||
alleles=['N', '<BND>'], | ||
AF=0.3238990008831024, | ||
AC=21766, | ||
AN=67200, | ||
N_HET=21616, | ||
N_HOMREF=11909, | ||
), | ||
hl.Struct( | ||
KEY='gnomAD-SV_v3_DUP_chr1_01c2781c', | ||
locus=hl.Locus( | ||
contig='chr1', | ||
position=10000, | ||
reference_genome=ReferenceGenome.GRCh38, | ||
), | ||
alleles=['N', '<DUP>'], | ||
AF=0.0019970000721514225, | ||
AC=139, | ||
AN=69594, | ||
N_HET=139, | ||
N_HOMREF=34658, | ||
), | ||
], | ||
) |
52 changes: 52 additions & 0 deletions
v03_pipeline/migrations/annotations/0004_add_gnomad_svs.py
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,52 @@ | ||
import hail as hl | ||
|
||
from v03_pipeline.lib.annotations import sv | ||
from v03_pipeline.lib.migration.base_migration import BaseMigration | ||
from v03_pipeline.lib.model import DatasetType, ReferenceGenome | ||
from v03_pipeline.lib.reference_datasets.reference_dataset import ReferenceDataset | ||
|
||
# This vcf was generated with the gatk command: | ||
# | ||
# gatk SVConcordance --verbosity DEBUG --evaluation /var/seqr/phase4.seqr.gnomad_v4_tmp.vcf.gz | ||
# --truth /var/seqr/gnomad.v4.1.sv.sites.modified.vcf.bgz | ||
# --sequence-dictionary gs://gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.dict | ||
# | ||
# Followed by: | ||
# bcftools annotate --rename-annots /var/seqr/remap /var/seqr/phase4.seqr.gnomad_v4_tmp.vcf.gz | bgzip > /var/seqr/phase4.seqr.gnomad_v4.vcf.gz | ||
# | ||
# where remap contains "INFO/TRUTH_VID GNOMAD_V4.1_TRUTH_VID" | ||
PHASE_4_CALLSET_WITH_GNOMAD_V4 = 'gs://seqr-loading-temp/phase4.seqr.gnomad_v4.vcf.gz' | ||
|
||
|
||
class AddGnomadSVs(BaseMigration): | ||
reference_genome_dataset_types: frozenset[ | ||
tuple[ReferenceGenome, DatasetType] | ||
] = frozenset( | ||
((ReferenceGenome.GRCh38, DatasetType.SV),), | ||
) | ||
|
||
@staticmethod | ||
def migrate(ht: hl.Table, **_) -> hl.Table: | ||
mapping_ht = ( | ||
hl.import_vcf( | ||
PHASE_4_CALLSET_WITH_GNOMAD_V4, | ||
reference_genome=ReferenceGenome.GRCh38.value, | ||
force_bgz=True, | ||
) | ||
.key_rows_by('rsid') | ||
.rows() | ||
) | ||
ht = ht.annotate( | ||
**{ | ||
'info.GNOMAD_V4.1_TRUTH_VID': mapping_ht[ht.key].info[ | ||
'GNOMAD_V4.1_TRUTH_VID' | ||
], | ||
}, | ||
) | ||
gnomad_svs_ht = ReferenceDataset.gnomad_svs.get_ht(ReferenceGenome.GRCh38) | ||
ht = ht.annotate(gnomad_svs=sv.gnomad_svs(ht, gnomad_svs_ht)) | ||
ht = ht.drop('info.GNOMAD_V4.1_TRUTH_VID') | ||
return ht.annotate_globals( | ||
versions=ht.globals.versions.annotate(gnomad_svs='1.0'), | ||
enums=ht.globals.enums.annotate(gnomad_svs=hl.Struct()), | ||
Review comment: this feels unnecessary, as this would already be in the existing table globals and does not seem to be changing | ||
||
) |
Binary file modified
BIN
+0 Bytes
(100%)
v03_pipeline/var/test/reference_datasets/GRCh38/gnomad_svs/1.0.ht/.README.txt.crc
Binary file not shown.
Binary file modified
BIN
+0 Bytes
(100%)
v03_pipeline/var/test/reference_datasets/GRCh38/gnomad_svs/1.0.ht/.metadata.json.gz.crc
Binary file not shown.
2 changes: 1 addition & 1 deletion
v03_pipeline/var/test/reference_datasets/GRCh38/gnomad_svs/1.0.ht/README.txt
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,3 +1,3 @@ | ||
This folder comprises a Hail (www.hail.is) native Table or MatrixTable. | ||
Written with version 0.2.133-4c60fddb171a | ||
-Created at 2025/02/16 18:40:38 | ||
+Created at 2025/03/05 12:27:53 |
Binary file added
BIN
+12 Bytes
...GRCh38/gnomad_svs/1.0.ht/index/part-0-6ba285cc-16c9-426c-9af3-382a9815db5f.idx/.index.crc
Binary file not shown.
Binary file added
BIN
+12 Bytes
...ad_svs/1.0.ht/index/part-0-6ba285cc-16c9-426c-9af3-382a9815db5f.idx/.metadata.json.gz.crc
Binary file not shown.
Binary file added
BIN
+130 Bytes
...sets/GRCh38/gnomad_svs/1.0.ht/index/part-0-6ba285cc-16c9-426c-9af3-382a9815db5f.idx/index
Binary file not shown.
Binary file added
BIN
+157 Bytes
.../gnomad_svs/1.0.ht/index/part-0-6ba285cc-16c9-426c-9af3-382a9815db5f.idx/metadata.json.gz
Binary file not shown.
Binary file removed
BIN
-12 Bytes
...GRCh38/gnomad_svs/1.0.ht/index/part-0-febb7dd0-28ce-479c-8ea7-9fe142bddf4c.idx/.index.crc
Binary file not shown.
Binary file removed
BIN
-12 Bytes
...ad_svs/1.0.ht/index/part-0-febb7dd0-28ce-479c-8ea7-9fe142bddf4c.idx/.metadata.json.gz.crc
Binary file not shown.
Binary file removed
BIN
-129 Bytes
...sets/GRCh38/gnomad_svs/1.0.ht/index/part-0-febb7dd0-28ce-479c-8ea7-9fe142bddf4c.idx/index
Binary file not shown.
Binary file removed
BIN
-158 Bytes
.../gnomad_svs/1.0.ht/index/part-0-febb7dd0-28ce-479c-8ea7-9fe142bddf4c.idx/metadata.json.gz
Binary file not shown.
Binary file modified
BIN
+30 Bytes
(110%)
v03_pipeline/var/test/reference_datasets/GRCh38/gnomad_svs/1.0.ht/metadata.json.gz
Binary file not shown.
Binary file modified
BIN
+0 Bytes
(100%)
v03_pipeline/var/test/reference_datasets/GRCh38/gnomad_svs/1.0.ht/rows/.metadata.json.gz.crc
Binary file not shown.
Binary file modified
BIN
+56 Bytes
(110%)
v03_pipeline/var/test/reference_datasets/GRCh38/gnomad_svs/1.0.ht/rows/metadata.json.gz
Binary file not shown.
Binary file added
BIN
+12 Bytes
...sets/GRCh38/gnomad_svs/1.0.ht/rows/parts/.part-0-6ba285cc-16c9-426c-9af3-382a9815db5f.crc
Binary file not shown.
Binary file removed
BIN
-12 Bytes
...sets/GRCh38/gnomad_svs/1.0.ht/rows/parts/.part-0-febb7dd0-28ce-479c-8ea7-9fe142bddf4c.crc
Binary file not shown.
Binary file added
BIN
+147 Bytes
..._datasets/GRCh38/gnomad_svs/1.0.ht/rows/parts/part-0-6ba285cc-16c9-426c-9af3-382a9815db5f
Binary file not shown.
Binary file removed
BIN
-125 Bytes
..._datasets/GRCh38/gnomad_svs/1.0.ht/rows/parts/part-0-febb7dd0-28ce-479c-8ea7-9fe142bddf4c
Binary file not shown.
Binary file added
BIN
+12 Bytes
v03_pipeline/var/test/reference_datasets/raw/gnomad_svs_from_vcf.ht/.README.txt.crc
Binary file not shown.
Binary file added
BIN
+8 Bytes
v03_pipeline/var/test/reference_datasets/raw/gnomad_svs_from_vcf.ht/._SUCCESS.crc
Binary file not shown.
Binary file added
BIN
+32 Bytes
v03_pipeline/var/test/reference_datasets/raw/gnomad_svs_from_vcf.ht/.metadata.json.gz.crc
Binary file not shown.
3 changes: 3 additions & 0 deletions
v03_pipeline/var/test/reference_datasets/raw/gnomad_svs_from_vcf.ht/README.txt
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
This folder comprises a Hail (www.hail.is) native Table or MatrixTable. | ||
Written with version 0.2.133-4c60fddb171a | ||
Created at 2025/03/04 14:19:46 |
Empty file.
Binary file added
BIN
+12 Bytes
...line/var/test/reference_datasets/raw/gnomad_svs_from_vcf.ht/globals/.metadata.json.gz.crc
Binary file not shown.
Binary file added
BIN
+239 Bytes
v03_pipeline/var/test/reference_datasets/raw/gnomad_svs_from_vcf.ht/globals/metadata.json.gz
Binary file not shown.
Binary file added
BIN
+12 Bytes
...pipeline/var/test/reference_datasets/raw/gnomad_svs_from_vcf.ht/globals/parts/.part-0.crc
Binary file not shown.
Binary file added
BIN
+36 Bytes
v03_pipeline/var/test/reference_datasets/raw/gnomad_svs_from_vcf.ht/globals/parts/part-0
Binary file not shown.
Binary file added
BIN
+12 Bytes
...w/gnomad_svs_from_vcf.ht/index/part-0-e3666fa7-5bc8-471d-ab31-f4fad8e9ebb6.idx/.index.crc
Binary file not shown.
Binary file added
BIN
+12 Bytes
...s_from_vcf.ht/index/part-0-e3666fa7-5bc8-471d-ab31-f4fad8e9ebb6.idx/.metadata.json.gz.crc
Binary file not shown.
Binary file added
BIN
+112 Bytes
...ts/raw/gnomad_svs_from_vcf.ht/index/part-0-e3666fa7-5bc8-471d-ab31-f4fad8e9ebb6.idx/index
Binary file not shown.
Binary file added
BIN
+185 Bytes
...ad_svs_from_vcf.ht/index/part-0-e3666fa7-5bc8-471d-ab31-f4fad8e9ebb6.idx/metadata.json.gz
Binary file not shown.
Binary file added
BIN
+2.6 KB
v03_pipeline/var/test/reference_datasets/raw/gnomad_svs_from_vcf.ht/metadata.json.gz
Binary file not shown.
Binary file added
BIN
+48 Bytes
...ipeline/var/test/reference_datasets/raw/gnomad_svs_from_vcf.ht/rows/.metadata.json.gz.crc
Binary file not shown.
Binary file added
BIN
+4.86 KB
v03_pipeline/var/test/reference_datasets/raw/gnomad_svs_from_vcf.ht/rows/metadata.json.gz
Binary file not shown.
Binary file added
BIN
+56 Bytes
...ts/raw/gnomad_svs_from_vcf.ht/rows/parts/.part-0-e3666fa7-5bc8-471d-ab31-f4fad8e9ebb6.crc
Binary file not shown.
Binary file added
BIN
+5.89 KB
...atasets/raw/gnomad_svs_from_vcf.ht/rows/parts/part-0-e3666fa7-5bc8-471d-ab31-f4fad8e9ebb6
Binary file not shown.
wouldn't this already be the version? shouldn't we bump it?

this migration will create the `gnomad_svs` table, but you're right, the globals annotation happens when we call `ReferenceDataset.gnomad.get_ht`, so we don't need to set the version/enums here.

oooh, there's some confusing bits here. `ReferenceDataset.gnomad_svs.get_ht` sets the globals on the `gnomad_svs` reference dataset (`gs://seqr-reference-data/v3.1/GRCh38/gnomad_svs/1.0.ht` if persisted). This code was setting the globals on the SV annotations table as they're not currently there. It's not overwhelmingly important, as we exclude gnomad_svs from consideration when computing the reference datasets to update, and it would also be annotated by the normal process because I kept separate `for_reference_genome_dataset_type_annotations` (which includes `gnomad_svs`) and `for_reference_genome_dataset_type_annotations_updates` methods (which does not). I felt the migration should include the correct globals on annotations table for completeness though!

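To make the distinction in the comment above concrete, here is a minimal sketch in plain Python rather than Hail. Dicts stand in for table globals, and every name in it (`ANNOTATION_DATASETS`, `UPDATABLE_DATASETS`, `datasets_needing_update`, `migrate_globals`) is hypothetical, not the pipeline's real API. It only illustrates why recording `gnomad_svs` in the annotations table globals is harmless to the update check but keeps the globals complete.

```python
# Hypothetical stand-ins: plain dicts model Hail table globals. None of these
# names come from the pipeline itself.
ANNOTATION_DATASETS = frozenset({'gnomad_svs'})  # joined onto the SV annotations table
UPDATABLE_DATASETS = frozenset()  # gnomad_svs is excluded from update checks

REFERENCE_DATASET_VERSIONS = {'gnomad_svs': '1.0'}


def datasets_needing_update(annotations_globals: dict) -> set[str]:
    """Return updatable datasets whose recorded version is missing or stale."""
    recorded = annotations_globals.get('versions', {})
    return {
        name
        for name in UPDATABLE_DATASETS
        if recorded.get(name) != REFERENCE_DATASET_VERSIONS[name]
    }


def migrate_globals(annotations_globals: dict) -> dict:
    """Return a copy of the globals with gnomad_svs version and enums recorded."""
    return {
        **annotations_globals,
        'versions': {**annotations_globals.get('versions', {}), 'gnomad_svs': '1.0'},
        'enums': {**annotations_globals.get('enums', {}), 'gnomad_svs': {}},
    }


before = {'versions': {}, 'enums': {}}
after = migrate_globals(before)
```

In this sketch the update check never flags `gnomad_svs`, whether or not the migration has run, because it is not in the updatable set. The migration still records the version and an empty enums entry so the annotations table globals describe every annotation the table carries.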
right I got jumbled, this is an `annotation` table migration. This is good reason to keep the globals annotation in this migration. @hanars, I'm going to add the globals `versions` and `enums` annotations back and re-request review.

yeah, it would be awesome if these had tests, but not entirely sure it would have caught this 😬.