-
Notifications
You must be signed in to change notification settings - Fork 21
bugfix/improvement: handle duplicate seqr internal truth vid #1049
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bugfix/improvement: handle duplicate seqr internal truth vid #1049
Conversation
…te/seqr-loading-pipelines into benb/sv_gnomad_v4_migration
…te/seqr-loading-pipelines into benb/handle_duplicate_SEQR_INTERNAL_TRUTH_VID
…s into benb/sv_gnomad_v4_migration
…te/seqr-loading-pipelines into benb/handle_duplicate_SEQR_INTERNAL_TRUTH_VID
…s into benb/sv_gnomad_v4_migration
…te/seqr-loading-pipelines into benb/handle_duplicate_SEQR_INTERNAL_TRUTH_VID
'info.SEQR_INTERNAL_TRUTH_VID', | ||
end_locus=sv.end_locus(mt), | ||
) | ||
return itertools.groupby( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a reason you are collecting results from hail and then grouping them out of hail instead of just doing the grouping in hail directly?
ht = mt.order_by('info.SEQR_INTERNAL_TRUTH_VID']).rows()
return ht.aggregate(
hl.agg.filter(
duplicate_internal_variant_ids.contains(
ht['info.SEQR_INTERNAL_TRUTH_VID'],
),
hl.agg.group_by(ht.info.SEQR_INTERNAL_TRUTH_VID, hl.agg.collect(ht.row))
)
for v in ( | ||
annotations_ht.filter( | ||
duplicate_internal_variant_ids.contains(annotations_ht.variant_id), | ||
).collect() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it feel like you are doing a lot of this outside of hail when it would be better to do it inside of hail. I feel like you could create a ht of the duplicate rows of the new callset, inner join with the relevant columns from the existing annotations table and then do this computation inside a single hail aggregation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the choice to do it outside hail was intentional, based on the assumption that the number of duplicate variants would be small in perpetuity, would fit in local memory, and would be easier to manage in native python than in hail. This is loosely based on the opinion (and not fact) that hail isn’t as useful for SVs as it is for SNV_INDELs and that we’re more maintainable and performant without the shuffles hail would induce.
), | ||
).globals.updates, | ||
) | ||
> 0, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this >0 check here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm just checking that there's been at least one update, but it's actually cleaner to just check the truthyness of hl.eval(hl.len(...)))
* First pass * Finish test * do it correctly :/ * ruff * correct feature flag * bump * fix test * fix svs * ruff * gnomad svs reference dataset * tweaks * add global to table * ruff * use the function * add missing enums struct * sprawl :/ * test finished! * ruff * change field name * Comment * first pass * unit test * put this back * delete * fixes * ruff * need loadable families * ruff * account for case when field is entirely missing * add new test file * frequenies correct * ruff * dict items * it worked! * ruff * stuff * updating to use hail call stats * tweak annotation syntax * test that variant aligns * ruff * bugfix/improvement: handle duplicate seqr internal truth vid (#1049) * gnomad v4 sv migration * ruff * Update 0004_add_gnomad_svs.py * Update 0004_add_gnomad_svs.py * Update 0004_add_gnomad_svs.py * Update 0004_add_gnomad_svs.py * Update 0004_add_gnomad_svs.py * comment * ruff * First pass * ruff * re-key * better * improve check on callset validity * fix test * ruff
No description provided.