Build PyData prototype for GWAS analysis #20

@eric-czech

Description

This issue tracks several more specific issues related to working towards a usable prototype.

Some things we should tackle for this are:

  • IO (PyData prototype IO #23)
    • Select IO libraries to integrate and determine a plugin system around them
  • Frontend (Explore Xarray as the basis for a genetic toolkit API #5)
    • Is Xarray really the right choice? Lack of support for out-of-core coords, uneven chunk sizes, and overlapping blockwise computations may become a huge hurdle.
  • Backend Dispatch (PyData prototype backend dispatching #24)
    • How do we dispatch to duck array backends and IO plugins? (a rough registry sketch is included after this list)
  • Data Structures (Define Xarray data structures for PyData prototype #22)
    • We'll need to survey a reasonable portion of the space of all possible input structures
  • Methods
  • Simulation tools (PyData prototype simulation methods #31)
  • Testing (PyData prototype testing #21)
    • How can we build a framework for validation against external software, namely Hail? This will be very tedious without some abstraction
  • Indexing
    • Should users define indexes uniquely identifying variants/phenotypes or should we manage this internally?
    • PheWAS, HLA association studies, and alignment-free GWAS are examples where it would be good to leave this up to the user
    • For comparison, internal Hail implementations hard-code checks that indexes equal ['locus', 'alleles'] -- I don't think we want this
  • Configuration
    • We should probably pin down a configuration framework early (it may be overkill, but configuration is always difficult to work in later)
    • Personally, I like the idea of making configuration objects live attributes with documentation, like Pandas does (this makes inline lookups convenient), though integrating this with a file-backed configuration will require some leg-work (a rough sketch is included after this list)
  • Dask DevOps
  • Sub-byte Representations
    • It might not be too ridiculous to support some simpler (ideally early) QC operations on bitpacked int arrays
    • Doing the packing at the dask/numpy level would look like this (an example from Matt; a rough sketch of the same idea is included after this list)
    • Alistair has some related thoughts in this post
  • Enrichment
    • How do we add and represent data along axes (e.g. variants/samples)? The approach taken in Hail/Glow is to attach results of methods as new fields along the axes, and this is a good fit for new Dataset variables, but how will this work with multi-indexing? What happens if there are non-unique values? Is relying on Pandas indexing going to cause excessive memory overhead? (a small Xarray sketch is included after this list)
  • Limitations
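
Re: backend dispatch above, a rough sketch of one direction this could take: a registry keyed on duck-array type (IO plugins could use a similar format-keyed registry). Everything here is hypothetical, not a settled API:

```python
import numpy as np
import dask.array as da

# Hypothetical registry: duck-array type -> backend implementation
_BACKENDS = {}

def register_backend(array_type, impl):
    _BACKENDS[array_type] = impl

def dispatch(method_name, calls, *args, **kwargs):
    # Route a method call to whichever implementation is registered
    # for the type of the incoming duck array.
    for array_type, impl in _BACKENDS.items():
        if isinstance(calls, array_type):
            return getattr(impl, method_name)(calls, *args, **kwargs)
    raise TypeError(f"no backend registered for {type(calls)}")

class NumpyBackend:
    @staticmethod
    def alt_allele_frequency(calls):
        # calls: (variants, samples) alt allele counts in {0, 1, 2}
        return calls.mean(axis=1) / 2

class DaskBackend:
    @staticmethod
    def alt_allele_frequency(calls):
        return calls.mean(axis=1) / 2  # stays lazy

register_backend(np.ndarray, NumpyBackend)
register_backend(da.Array, DaskBackend)

# Same call, different backends
dispatch("alt_allele_frequency", np.array([[0, 1], [2, 2]]))
dispatch("alt_allele_frequency", da.ones((10, 4), chunks=(5, 4)))
```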
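
Re: configuration, a minimal sketch of the Pandas-style "live, documented attributes" idea, with the file-backed layer left out; option names are hypothetical:

```python
class Option:
    """A single configuration value plus its documentation."""
    def __init__(self, default, doc):
        self.value = default
        self.__doc__ = doc

class Config:
    """Namespace whose attributes are documented, mutable options."""
    def __init__(self):
        # Bypass __setattr__ for the internal registry itself
        object.__setattr__(self, "_options", {})

    def register(self, name, default, doc):
        self._options[name] = Option(default, doc)

    def __getattr__(self, name):
        try:
            return self._options[name].value
        except KeyError:
            raise AttributeError(name) from None

    def __setattr__(self, name, value):
        if name not in self._options:
            raise AttributeError(f"unknown option: {name}")
        self._options[name].value = value

    def describe(self, name):
        return self._options[name].__doc__

config = Config()
config.register("io_backend", "zarr", "Default IO backend used when none is specified.")

config.io_backend              # -> "zarr" (convenient inline lookup)
config.io_backend = "bgen"     # live update
config.describe("io_backend")  # -> the option's documentation
```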
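
Re: sub-byte representations, Matt's example isn't inlined here, but a rough sketch of the same idea: packing a boolean per-call mask to one bit per call with numpy inside dask.array.map_blocks (the layout and names are illustrative only):

```python
import numpy as np
import dask.array as da

def pack_block(block):
    # Pack a boolean (variants, samples) block along the samples axis:
    # 8 calls per output byte. np.packbits zero-pads the final byte.
    return np.packbits(block, axis=1)

# Hypothetical per-call boolean mask, e.g. "call is non-missing"
mask = da.random.random(size=(10_000, 1_024), chunks=(2_500, 1_024)) > 0.05

packed = mask.map_blocks(
    pack_block,
    dtype="uint8",
    # each samples-chunk of n calls shrinks to ceil(n / 8) bytes
    chunks=(mask.chunks[0], tuple(-(-c // 8) for c in mask.chunks[1])),
)
```

Early QC operations (e.g. per-variant call rate) could then run on the packed form by counting set bits, via np.unpackbits or a popcount lookup table.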
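
Re: enrichment and indexing, a small Xarray sketch of attaching a method result as a new per-variant variable, and what the multi-index option looks like (variable and dimension names are illustrative only):

```python
import numpy as np
import xarray as xr

# Hypothetical layout: a (variants, samples) call matrix with per-variant coords
ds = xr.Dataset(
    {"call": (("variants", "samples"), np.random.randint(0, 3, size=(5, 4)))},
    coords={
        "contig": ("variants", ["1", "1", "2", "2", "2"]),
        "position": ("variants", [101, 204, 55, 56, 90]),
    },
)

# "Enrichment" as a new Dataset variable along the variants axis:
# the fraction of samples carrying a non-reference call, per variant
ds["carrier_fraction"] = (ds["call"] > 0).mean(dim="samples")

# The multi-index question: a (contig, position) index over variants is
# possible, but it materializes a Pandas MultiIndex in memory, and behavior
# with non-unique pairs needs thought.
ds = ds.set_index(variants=["contig", "position"])
ds.sel(variants=("2", 55))
```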
