Build PyData prototype for GWAS analysis #20

@eric-czech

Description

This issue tracks several more specific issues related to working towards a usable prototype.

Some things we should tackle for this are:

  • IO (PyData prototype IO #23)
    • Select IO libraries to integrate and determine a plugin system around them
  • Frontend (Explore Xarray as the basis for a genetic toolkit API #5)
    • Is Xarray really the right choice? Lack of support for out-of-core coords, uneven chunk sizes, and overlapping blockwise computations may become a huge hurdle.
  • Backend Dispatch (PyData prototype backend dispatching #24)
    • How do we dispatch to duck array backends and IO plugins? (a rough registry sketch is included after this list)
  • Data Structures (Define Xarray data structures for PyData prototype #22)
    • We'll need to survey a reasonable portion of the space of all possible input structures
  • Methods
  • Simulation tools (PyData prototype simulation methods #31)
  • Testing (PyData prototype testing #21)
    • How can we build a framework for validation against external software, namely Hail? This will be very tedious without some abstraction
  • Indexing
    • Should users define indexes uniquely identifying variants/phenotypes or should we manage this internally?
    • PheWAS, HLA association studies, and alignment-free GWAS are examples where it would be good to leave this up to the user
    • For comparison, internal Hail implementations hard-code checks that indexes equal ['locus', 'alleles'] -- I don't think we want this
  • Configuration
    • We should probably pin down a configuration framework early (it may be overkill, but configuration is always difficult to work in later)
    • Personally, I like the idea of making configuration objects live attributes with documentation, like Pandas does (this makes inline lookups convenient), though integrating this with a file-backed configuration will require some leg-work (a rough sketch is included after this list)
  • Dask DevOps
  • Sub-byte Representations
    • It might not be too ridiculous to support some simpler (ideally early) QC operations on bitpacked int arrays
    • Doing the packing at the dask/numpy level would look like this (an example from Matt; a rough sketch of the same idea is included after this list)
    • Alistair has some related thoughts in this post
  • Enrichment
    • How do we add and represent data along axes (e.g. variants/samples)? The approach taken in Hail/Glow is to attach results of methods as new fields along the axes, and this is a good fit for new Dataset variables, but how will this work with multi-indexing? What happens if there are non-unique values? Is relying on Pandas indexing going to cause excessive memory overhead? (a small Xarray sketch is included after this list)
  • Limitations
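
Re: backend dispatch above, a rough sketch of one direction this could take: a registry keyed on duck-array type (IO plugins could use a similar format-keyed registry). Everything here is hypothetical, not a settled API:

```python
import numpy as np
import dask.array as da

# Hypothetical registry: duck-array type -> backend implementation
_BACKENDS = {}

def register_backend(array_type, impl):
    _BACKENDS[array_type] = impl

def dispatch(method_name, calls, *args, **kwargs):
    # Route a method call to whichever implementation is registered
    # for the type of the incoming duck array.
    for array_type, impl in _BACKENDS.items():
        if isinstance(calls, array_type):
            return getattr(impl, method_name)(calls, *args, **kwargs)
    raise TypeError(f"no backend registered for {type(calls)}")

class NumpyBackend:
    @staticmethod
    def alt_allele_frequency(calls):
        # calls: (variants, samples) alt allele counts in {0, 1, 2}
        return calls.mean(axis=1) / 2

class DaskBackend:
    @staticmethod
    def alt_allele_frequency(calls):
        return calls.mean(axis=1) / 2  # stays lazy

register_backend(np.ndarray, NumpyBackend)
register_backend(da.Array, DaskBackend)

# Same call, different backends
dispatch("alt_allele_frequency", np.array([[0, 1], [2, 2]]))
dispatch("alt_allele_frequency", da.ones((10, 4), chunks=(5, 4)))
```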
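
Re: configuration, a minimal sketch of the Pandas-style "live, documented attributes" idea, with the file-backed layer left out; option names are hypothetical:

```python
class Option:
    """A single configuration value plus its documentation."""
    def __init__(self, default, doc):
        self.value = default
        self.__doc__ = doc

class Config:
    """Namespace whose attributes are documented, mutable options."""
    def __init__(self):
        # Bypass __setattr__ for the internal registry itself
        object.__setattr__(self, "_options", {})

    def register(self, name, default, doc):
        self._options[name] = Option(default, doc)

    def __getattr__(self, name):
        try:
            return self._options[name].value
        except KeyError:
            raise AttributeError(name) from None

    def __setattr__(self, name, value):
        if name not in self._options:
            raise AttributeError(f"unknown option: {name}")
        self._options[name].value = value

    def describe(self, name):
        return self._options[name].__doc__

config = Config()
config.register("io_backend", "zarr", "Default IO backend used when none is specified.")

config.io_backend              # -> "zarr" (convenient inline lookup)
config.io_backend = "bgen"     # live update
config.describe("io_backend")  # -> the option's documentation
```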
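
Re: sub-byte representations, Matt's example isn't inlined here, but a rough sketch of the same idea: packing a boolean per-call mask to one bit per call with numpy inside dask.array.map_blocks (the layout and names are illustrative only):

```python
import numpy as np
import dask.array as da

def pack_block(block):
    # Pack a boolean (variants, samples) block along the samples axis:
    # 8 calls per output byte. np.packbits zero-pads the final byte.
    return np.packbits(block, axis=1)

# Hypothetical per-call boolean mask, e.g. "call is non-missing"
mask = da.random.random(size=(10_000, 1_024), chunks=(2_500, 1_024)) > 0.05

packed = mask.map_blocks(
    pack_block,
    dtype="uint8",
    # each samples-chunk of n calls shrinks to ceil(n / 8) bytes
    chunks=(mask.chunks[0], tuple(-(-c // 8) for c in mask.chunks[1])),
)
```

Early QC operations (e.g. per-variant call rate) could then run on the packed form by counting set bits, via np.unpackbits or a popcount lookup table.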
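
Re: enrichment and indexing, a small Xarray sketch of attaching a method result as a new per-variant variable, and what the multi-index option looks like (variable and dimension names are illustrative only):

```python
import numpy as np
import xarray as xr

# Hypothetical layout: a (variants, samples) call matrix with per-variant coords
ds = xr.Dataset(
    {"call": (("variants", "samples"), np.random.randint(0, 3, size=(5, 4)))},
    coords={
        "contig": ("variants", ["1", "1", "2", "2", "2"]),
        "position": ("variants", [101, 204, 55, 56, 90]),
    },
)

# "Enrichment" as a new Dataset variable along the variants axis:
# the fraction of samples carrying a non-reference call, per variant
ds["carrier_fraction"] = (ds["call"] > 0).mean(dim="samples")

# The multi-index question: a (contig, position) index over variants is
# possible, but it materializes a Pandas MultiIndex in memory, and behavior
# with non-unique pairs needs thought.
ds = ds.set_index(variants=["contig", "position"])
ds.sel(variants=("2", 55))
```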
