Skip to content

Parameter refactoring + cleanup. #809

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 56 commits into from
Jul 1, 2024
Merged

Parameter refactoring + cleanup. #809

merged 56 commits into from
Jul 1, 2024

Conversation

bpblanken
Copy link
Collaborator

@bpblanken bpblanken commented Jun 12, 2024

List of changes:

  • removed callset_paths in favor of callset_path.
  • removed liftover_ref_path param in favor of an env var.
  • imputed_sex_path and filters_path are computed and enabled via env vars (and optionally disabled via a "skip" param).
  • Make use of luigi.util.inherits to reduce parameter copying.

@bpblanken bpblanken changed the title Benb/single callset path Parameter refactoring + cleanup. Jun 21, 2024
default=False,
parsing=luigi.BoolParameter.EXPLICIT_PARSING,
)
skip_check_sex_and_relatedness = luigi.BoolParameter(
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some context on this:

  • I wanted options that defaulted to False here, so that we didn't need to pass a flag for our default case in airflow (which we are doing now for s&r, which has led to some confusion).
  • I wanted to limit these to options that would either be manually specified while debugging a run or would be enabled dependent on sample source.
  • skip_ prefix for consistency.
  • There are also env vars for the two features that we don't want to enable for local users, so we wouldn't have to hardcode skip_expect_filters=True and skip_check_sex_and_relatedness=True (though we could).
  • The "env var only" approach doesn't work well because env vars are set at the dataproc cluster level and I haven't been able to get job env vars to override cluster env vars correctly (one afternoon already wasted on this a few months ago).

@bpblanken bpblanken marked this pull request as ready for review June 24, 2024 13:50
@bpblanken bpblanken requested a review from a team as a code owner June 24, 2024 13:50
if (
not Env.EXPECT_WES_FILTERS
or not dataset_type.expect_filters(sample_type)
or 'part_one_outputs' not in callset_path
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what are 'part_one_outputs' and 'part_two_outputs'?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@matren395 can give some context here! I think it's just that internal exome callsets have always been delivered in two different folders, and we run an extra step to join the filters part onto the first part!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's what I'd believe this is too, yes!

WES callsets are delivered in two parts usually, with two folders for Part_One_Outputs and Part_Two_Outputs. Part_One_Outputs are the "normal" sharded VCFs, one set of sharded VCFs per chromosome, complete with tons of annotations and information. Part_Two_Outputs is I believe just one VCF per chromosome, containing the sites to filter the Part_One VCFs to .

)
is_new_gcnv_joint_call = luigi.BoolParameter(
default=False,
description='Is this a fully joint-called callset.',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we also need parsing=luigi.BoolParameter.EXPLICIT_PARSING here?

self.project_guids,
self.project_remap_paths,
self.project_pedigree_paths,
self.imputed_sex_paths,
strict=False,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

strict=False makes it so that if any of the 3 lists are of different length we'll never know - do we expect these to ever be of different lengths?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah! I think this is worth changing. ruff complains if you zip without the param... and I do think it's impossible for the lists to not be the same length with the way it's been simplified now....but throwing an error is definitely clearer!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

interesting about ruff too because it looks like the parameter is defaulted to False

@bpblanken bpblanken merged commit e2d6433 into dev Jul 1, 2024
3 checks passed
@bpblanken bpblanken deleted the benb/single_callset_path branch July 1, 2024 17:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants