Skip to content

bugfix/improvement: handle duplicate seqr internal truth vid #1049

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

bpblanken
Copy link
Collaborator

No description provided.

@bpblanken bpblanken changed the base branch from benb/sv_gnomad_v4_migration to dev February 23, 2025 12:56
@bpblanken bpblanken changed the base branch from dev to benb/sv_gt_stats February 23, 2025 12:57
@bpblanken bpblanken changed the title Benb/handle duplicate seqr internal truth vid bugfix/improvment: handle duplicate seqr internal truth vid Feb 23, 2025
@bpblanken bpblanken changed the title bugfix/improvment: handle duplicate seqr internal truth vid bugfix/improvement: handle duplicate seqr internal truth vid Feb 23, 2025
@bpblanken bpblanken marked this pull request as ready for review February 25, 2025 16:31
@bpblanken bpblanken requested a review from a team as a code owner February 25, 2025 16:31
'info.SEQR_INTERNAL_TRUTH_VID',
end_locus=sv.end_locus(mt),
)
return itertools.groupby(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a reason you are collecting results from hail and then grouping them out of hail instead of just doing the grouping in hail directly?

ht = mt.order_by('info.SEQR_INTERNAL_TRUTH_VID']).rows()
return ht.aggregate(
    hl.agg.filter(
        duplicate_internal_variant_ids.contains(
            ht['info.SEQR_INTERNAL_TRUTH_VID'],
        ),
        hl.agg.group_by(ht.info.SEQR_INTERNAL_TRUTH_VID, hl.agg.collect(ht.row))
)

for v in (
annotations_ht.filter(
duplicate_internal_variant_ids.contains(annotations_ht.variant_id),
).collect()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it feel like you are doing a lot of this outside of hail when it would be better to do it inside of hail. I feel like you could create a ht of the duplicate rows of the new callset, inner join with the relevant columns from the existing annotations table and then do this computation inside a single hail aggregation

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the choice to do it outside hail was intentional, based on the assumption that the number of duplicate variants would be small in perpetuity, would fit in local memory, and would be easier to manage in native python than in hail. This is loosely based on the opinion (and not fact) that hail isn’t as useful for SVs as it is for SNV_INDELs and that we’re more maintainable and performant without the shuffles hail would induce.

),
).globals.updates,
)
> 0,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this >0 check here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm just checking that there's been at least one update, but it's actually cleaner to just check the truthyness of hl.eval(hl.len(...)))

@bpblanken bpblanken merged commit 2a72646 into benb/sv_gt_stats Mar 18, 2025
1 check passed
@bpblanken bpblanken deleted the benb/handle_duplicate_SEQR_INTERNAL_TRUTH_VID branch March 18, 2025 15:07
bpblanken added a commit that referenced this pull request Mar 18, 2025
* First pass

* Finish test

* do it correctly :/

* ruff

* correct feature flag

* bump

* fix test

* fix svs

* ruff

* gnomad svs reference dataset

* tweaks

* add global to table

* ruff

* use the function

* add missing enums struct

* sprawl :/

* test finished!

* ruff

* change field name

* Comment

* first pass

* unit test

* put this back

* delete

* fixes

* ruff

* need loadable families

* ruff

* account for case when field is entirely missing

* add new test file

* frequenies correct

* ruff

* dict items

* it worked!

* ruff

* stuff

* updating to use hail call stats

* tweak annotation syntax

* test that variant aligns

* ruff

* bugfix/improvement: handle duplicate seqr internal truth vid (#1049)

* gnomad v4 sv migration

* ruff

* Update 0004_add_gnomad_svs.py

* Update 0004_add_gnomad_svs.py

* Update 0004_add_gnomad_svs.py

* Update 0004_add_gnomad_svs.py

* Update 0004_add_gnomad_svs.py

* comment

* ruff

* First pass

* ruff

* re-key

* better

* improve check on callset validity

* fix test

* ruff
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants