This issue tracks several more specific issues related to working towards a usable prototype.
Some things we should tackle for this are:
- IO (PyData prototype IO #23)
  - Select IO libraries to integrate and determine a plugin system around them
- Frontend (Explore Xarray as the basis for a genetic toolkit API #5)
  - Is Xarray really the right choice? Lack of support for out-of-core coords, uneven chunk sizes, and overlapping blockwise computations may become a huge hurdle.
- Backend Dispatch (PyData prototype backend dispatching #24)
  - How do we dispatch to duck array backends and IO plugins? (see the registry sketch below this list)
- Data Structures (Define Xarray data structures for PyData prototype #22)
  - We'll need to survey a reasonable portion of the space of possible input structures (see the example Dataset sketch below this list)
- Methods
  - Document and identify methods we actually need (Determine core operations necessary in general-purpose GWAS toolkits #16)
  - Implementations (PyData prototype genetics method implementations #30)
  - Simulation tools (PyData prototype simulation methods #31)
- Testing (PyData prototype testing #21)
  - How can we build a framework for validating results against external software, namely Hail? This will be very tedious without some abstraction (see the comparison-harness sketch below this list)
- Indexing
  - Should users define indexes uniquely identifying variants/phenotypes or should we manage this internally? (see the indexing sketch below this list)
  - PheWAS, HLA association studies, and alignment-free GWAS are examples where it would be good to leave this up to the user
  - For comparison, internal Hail implementations hard-code checks that indexes equal ['locus', 'alleles'] -- I don't think we want this
- Configuration
  - We should probably pin down a configuration framework early (it may be overkill, but it is always difficult to work in later)
  - Personally, I like the idea of making configuration objects live attributes with documentation, like Pandas does (this makes inline lookups convenient), though integrating this with a file-backed configuration will require some legwork (see the options sketch below this list)
- Dask DevOps
  - We need to know how to use Dask at scale
  - Figuring out what is going on with it in https://h2oai.github.io/db-benchmark/ would be a good exercise
- Sub-byte Representations (see the bit-packing sketch below this list)
- Enrichment
  - How do we add and represent data along axes (e.g. variants/samples)? The approach taken in Hail/Glow is to attach results of methods as new fields along the axes, and this is a good fit for new Dataset variables (see the enrichment sketch below this list), but:
    - How will this work with multi-indexing?
    - What happens if there are non-unique values?
    - Is relying on Pandas indexing going to cause excessive memory overhead?
- Limitations
  - Since we're targeting some future state for dependencies, we should make sure to keep track of what's missing:
    - https://github.com/related-sciences/rs-platform/issues/19#issuecomment-594211481 - Xarray/Dask/Numba
    - Explore Xarray as the basis for a genetic toolkit API #5 (comment) - Xarray limitations in more detail
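
Some rough sketches for the items above follow. Everything in them is illustrative rather than a proposal: all function, variable, and option names are made up unless noted otherwise.

For backend dispatch, a minimal sketch of a registry keyed on duck array type; IO plugins could hang off a similar registry or be discovered via entry points:

```python
from typing import Callable, Dict, Type

import dask.array as da
import numpy as np

# Hypothetical registry mapping duck array types to implementations.
_BACKENDS: Dict[Type, Callable] = {}

def register_backend(array_type: Type, impl: Callable) -> None:
    """Register an implementation of some method for a duck array type."""
    _BACKENDS[array_type] = impl

def dispatch(array):
    """Pick an implementation based on the type of the input array."""
    for array_type, impl in _BACKENDS.items():
        if isinstance(array, array_type):
            return impl
    raise TypeError(f"No backend registered for {type(array)}")

# The same logical method, registered once per backend.
register_backend(np.ndarray, lambda x: x.mean(axis=0))  # eager
register_backend(da.Array, lambda x: x.mean(axis=0))    # lazy

x = np.zeros((4, 3))
result = dispatch(x)(x)  # routes to the NumPy implementation
```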
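For the data structures survey, one possible Dataset layout for diploid genotype calls; the dimension and variable names here are placeholders:

```python
import numpy as np
import xarray as xr

n_variants, n_samples, ploidy = 100, 10, 2
ds = xr.Dataset(
    {
        # Calls as small integers: one value per (variant, sample, chromosome copy).
        "call_genotype": (
            ("variants", "samples", "ploidy"),
            np.random.randint(0, 2, size=(n_variants, n_samples, ploidy), dtype="int8"),
        ),
        # Per-variant fields along the variants dimension.
        "variant_contig": ("variants", np.zeros(n_variants, dtype="int16")),
        "variant_position": ("variants", np.arange(n_variants, dtype="int32")),
    }
)
```

Surveying the input space would mean enumerating variations on this: haploid/polyploid calls, dosages, phased vs. unphased data, missing-data encodings, and so on.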
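For testing, one way to abstract validation against Hail is a table of paired implementations plus a single comparison routine; the harness below is only a shape, with the Hail wrappers left as stubs:

```python
import numpy as np

# Hypothetical table pairing our implementation with a reference implementation
# (e.g. a thin wrapper around the corresponding Hail call). Both sides take the
# same inputs and return a comparable NumPy array.
COMPARISONS = {
    # "allele_frequency": (our_allele_frequency, hail_allele_frequency),
}

def check_against_reference(name, *inputs, rtol=1e-6):
    """Run both implementations on the same inputs and compare numerically."""
    ours, reference = COMPARISONS[name]
    np.testing.assert_allclose(ours(*inputs), reference(*inputs), rtol=rtol)
```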
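For indexing, a sketch of leaving index definition to the user instead of hard-coding ['locus', 'alleles']: `set_index` is existing Xarray API, and the coordinate names are whatever uniquely identifies variants in the user's study:

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {
        "variant_contig": ("variants", np.array(["1", "1", "2"])),
        "variant_position": ("variants", np.array([100, 200, 50])),
    }
)

# The user decides what identifies a variant; for a PheWAS or an
# alignment-free GWAS this could be entirely different coordinates.
ds = ds.set_coords(["variant_contig", "variant_position"])
ds = ds.set_index(variants=["variant_contig", "variant_position"])
```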
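For configuration, a toy version of the "live attributes with documentation" idea: Pandas' actual mechanism is `pd.options`/`pd.set_option`, and this descriptor-based sketch only imitates the attribute-access flavor; a file-backed layer could seed the defaults:

```python
class Option:
    """Descriptor exposing a documented, live configuration attribute."""

    def __init__(self, default, doc):
        self.default = default
        self.__doc__ = doc

    def __set_name__(self, owner, name):
        self.name = name

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self  # class-level access exposes the docstring
        return obj.__dict__.get(self.name, self.default)

    def __set__(self, obj, value):
        obj.__dict__[self.name] = value

class Options:
    # Hypothetical option names.
    backend = Option("dask", "Default duck array backend.")
    chunk_size = Option(10_000, "Default chunk length along the variants dimension.")

options = Options()
options.chunk_size = 5_000         # live update
print(options.chunk_size)          # 5000 -- convenient inline lookup
print(Options.chunk_size.__doc__)  # inline documentation
```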
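For sub-byte representations, a sketch of 2-bit packing with NumPy (PLINK's BED format stores genotypes at 2 bits per call, enough for unphased diploid calls plus a missing sentinel); the helper names are made up:

```python
import numpy as np

def pack_2bit(calls: np.ndarray) -> np.ndarray:
    """Pack values in [0, 3] into 2 bits each, four calls per byte."""
    calls = np.asarray(calls, dtype=np.uint8)
    pad = (-calls.size) % 4  # pad to a multiple of four calls
    grouped = np.pad(calls, (0, pad)).reshape(-1, 4)
    shifts = np.arange(0, 8, 2, dtype=np.uint8)
    return np.bitwise_or.reduce(grouped << shifts, axis=1).astype(np.uint8)

def unpack_2bit(packed: np.ndarray, n: int) -> np.ndarray:
    """Inverse of pack_2bit; n is the original number of calls."""
    shifts = np.arange(0, 8, 2, dtype=np.uint8)
    return ((packed[:, None] >> shifts) & 0b11).reshape(-1)[:n]

calls = np.array([0, 1, 2, 3, 1], dtype=np.uint8)
assert np.array_equal(unpack_2bit(pack_2bit(calls), calls.size), calls)
```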
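For enrichment, a sketch of the Hail/Glow-style pattern of attaching a method result as a new variable along an existing axis; the variable names are illustrative:

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {
        "call_genotype": (
            ("variants", "samples", "ploidy"),
            np.random.randint(0, 2, size=(100, 10, 2), dtype="int8"),
        )
    }
)

# A method result (here, alternate allele count per variant) becomes a new
# variable aligned along the existing variants dimension.
ds["variant_alt_count"] = ds["call_genotype"].sum(dim=("samples", "ploidy"))
```

The open questions from the list above are whether this assignment stays cheap once a multi-index is attached to variants, and what alignment does when index values are non-unique.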