This issue tracks several more specific issues related to working towards a usable prototype.
Some things we should tackle for this are:
- IO (PyData prototype IO #23)
  - Select IO libraries to integrate and determine a plugin system around them
 
- Frontend (Explore Xarray as the basis for a genetic toolkit API #5)
  - Is Xarray really the right choice? Its lack of support for out-of-core coordinates, uneven chunk sizes, and overlapping blockwise computations may become a huge hurdle (a strawman Dataset is sketched below)
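
A minimal, hedged sketch of what a genotype call Dataset could look like in Xarray; every variable, dimension, and coordinate name here (`call_genotype`, `variants`, `samples`, `ploidy`, etc.) is a placeholder, not a settled schema:

```python
# Strawman genotype Dataset; names and dtypes are illustrative only.
import numpy as np
import xarray as xr

n_variants, n_samples, ploidy = 100, 10, 2

ds = xr.Dataset(
    data_vars={
        # Allele index per chromosome copy; a real schema would also encode missingness
        "call_genotype": (
            ("variants", "samples", "ploidy"),
            np.random.randint(0, 2, size=(n_variants, n_samples, ploidy), dtype="int8"),
        )
    },
    coords={
        "contig": ("variants", np.repeat("chr1", n_variants)),
        "position": ("variants", np.arange(n_variants)),
        "sample_id": ("samples", [f"S{i}" for i in range(n_samples)]),
    },
)
print(ds)
```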
 
- Backend Dispatch (PyData prototype backend dispatching #24)
  - How do we dispatch to duck array backends and IO plugins? One possible registry pattern is sketched below
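
As one hypothetical answer, a type-based registry could map duck array types to backend implementations; nothing here reflects a decided design, and `register_backend`/`allele_count` are made-up names:

```python
# Hypothetical dispatch registry keyed on array type.
from typing import Callable, Dict, Type

import numpy as np

_BACKENDS: Dict[Type, Callable] = {}

def register_backend(array_type: Type):
    """Register an implementation for a given duck array type."""
    def decorator(func: Callable) -> Callable:
        _BACKENDS[array_type] = func
        return func
    return decorator

def allele_count(calls):
    """Dispatch to whichever backend matches the input array type."""
    for array_type, impl in _BACKENDS.items():
        if isinstance(calls, array_type):
            return impl(calls)
    raise TypeError(f"No backend registered for {type(calls)}")

@register_backend(np.ndarray)
def _allele_count_numpy(calls):
    # Count alternate alleles per variant (missingness ignored for brevity)
    return (calls > 0).sum(axis=(1, 2))

print(allele_count(np.zeros((5, 3, 2), dtype="int8")))
```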
 
- Data Structures (Define Xarray data structures for PyData prototype #22)
  - We'll need to survey a reasonable portion of the space of all possible input structures; a few examples are sketched below
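
For illustration, a few of the input structures such a survey might cover, with purely hypothetical names and shapes:

```python
# Distinct input structures a survey might need to account for.
import numpy as np

n_variants, n_samples, ploidy = 100, 10, 2

inputs = {
    # Hard diploid genotype calls: allele index per chromosome copy
    "calls": np.zeros((n_variants, n_samples, ploidy), dtype="int8"),
    # Imputed dosages: expected alternate allele count per sample
    "dosage": np.zeros((n_variants, n_samples), dtype="float32"),
    # Genotype probabilities, e.g. (hom-ref, het, hom-alt) for diploids
    "probabilities": np.zeros((n_variants, n_samples, 3), dtype="float32"),
}
for name, arr in inputs.items():
    print(name, arr.shape, arr.dtype)
```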
 
- Methods
  - Document and identify methods we actually need (Determine core operations necessary in general-purpose GWAS toolkits #16)
  - Implementations (PyData prototype genetics method implementations #30); one example operation is sketched below
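
As a flavor of what an implementation could look like, here is a sketch of one candidate core operation (alternate allele frequency) over a Dask-chunked call array; this is illustrative, not a proposed API:

```python
# Alternate allele frequency over chunked genotype calls (missingness ignored).
import dask.array as da
import numpy as np

calls = da.from_array(
    np.random.randint(0, 2, size=(1000, 50, 2), dtype="int8"),
    chunks=(100, 50, 2),
)

# Fraction of alternate alleles per variant
alt_freq = calls.sum(axis=(1, 2)) / (calls.shape[1] * calls.shape[2])
print(alt_freq.compute()[:5])
```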
 
- Simulation tools (PyData prototype simulation methods #31); a toy simulator is sketched below
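
A toy sketch of what a minimal simulator could look like, sampling diploid calls under Hardy-Weinberg from random allele frequencies; the function name and defaults are hypothetical:

```python
# Toy genotype simulator built on nothing but NumPy.
import numpy as np

def simulate_calls(n_variants: int, n_samples: int, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    # Per-variant alternate allele frequencies
    p = rng.uniform(0.01, 0.5, size=n_variants)
    # Each chromosome copy carries the alternate allele with probability p
    return rng.binomial(1, p[:, None, None], size=(n_variants, n_samples, 2)).astype("int8")

calls = simulate_calls(100, 10)
print(calls.shape, calls.mean())
```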
- Testing (PyData prototype testing #21)
  - How can we frame a solution for validating against external software, namely Hail? This will be very tedious without some abstraction; one possible adapter pattern is sketched below
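
One possible abstraction: each tool (ours, Hail, etc.) implements a small adapter interface, and a single comparison helper validates outputs against it. Everything below is hypothetical, including the `AlleleFrequencyBackend` protocol; in practice the reference adapter would wrap Hail:

```python
# Hypothetical adapter interface for cross-tool validation.
from typing import Protocol

import numpy as np

class AlleleFrequencyBackend(Protocol):
    def allele_frequency(self, calls: np.ndarray) -> np.ndarray: ...

def check_against_reference(ours: AlleleFrequencyBackend,
                            reference: AlleleFrequencyBackend,
                            calls: np.ndarray) -> None:
    """Assert our result matches the reference implementation's."""
    np.testing.assert_allclose(
        ours.allele_frequency(calls),
        reference.allele_frequency(calls),
        rtol=1e-6,
    )

class NumpyBackend:
    def allele_frequency(self, calls: np.ndarray) -> np.ndarray:
        return calls.mean(axis=(1, 2))

# Trivial self-comparison just to show the harness; a real test would compare
# NumpyBackend against a HailBackend adapter on shared input data.
check_against_reference(NumpyBackend(), NumpyBackend(), np.zeros((4, 3, 2)))
```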
 
- Indexing
  - Should users define indexes uniquely identifying variants/phenotypes, or should we manage this internally? (the user-defined option is sketched below)
  - Supporting PheWAS, HLA association studies, and alignment-free GWAS are examples where it would be good to leave this up to the user
  - For comparison, internal Hail implementations hard-code checks on indexes being equal to ['locus', 'alleles'] -- I don't think we want this
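
A sketch of the user-defined option in Xarray terms: any set of coordinates can be promoted to a (multi-)index on the variants dimension, rather than hard-coding something like `['locus', 'alleles']`. Coordinate names here are illustrative:

```python
# User-chosen multi-index over the variants dimension.
import xarray as xr

ds = xr.Dataset(
    coords={
        "contig": ("variants", ["chr1", "chr1", "chr2"]),
        "position": ("variants", [100, 200, 50]),
    },
)

# Promote whichever coordinates the user considers identifying
ds = ds.set_index(variants=["contig", "position"])
print(ds.indexes["variants"])
```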
 
- Configuration
  - We should probably pin down a configuration framework early (it may be overkill, but it is always difficult to work in later)
  - Personally, I like the idea of making configuration objects live attributes with documentation, like Pandas does (this makes inline lookups convenient; see the sketch below), though integrating this with a file-backed configuration will require some leg-work
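
A minimal sketch of the Pandas-style "live attribute" idea, with each option as a documented property; the file-backed part is left out and all names (`Options`, `chunk_size`) are hypothetical:

```python
# Live-attribute configuration in the spirit of pandas.options.
class Options:
    def __init__(self):
        self._chunk_size = 10_000

    @property
    def chunk_size(self) -> int:
        """Default number of variants per chunk for new arrays."""
        return self._chunk_size

    @chunk_size.setter
    def chunk_size(self, value: int) -> None:
        if value <= 0:
            raise ValueError("chunk_size must be positive")
        self._chunk_size = value

options = Options()
options.chunk_size = 50_000   # inline lookup and assignment, as with pandas.options
print(options.chunk_size)
```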
 
- Dask DevOps
  - We need to know how to use Dask at scale
  - Figuring out what is going on with it in https://h2oai.github.io/db-benchmark/ would be a good exercise
 
- Sub-byte Representations (e.g. bit-packed genotype encodings; a toy example is sketched below)
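
A toy illustration of one sub-byte representation: packing biallelic haplotype calls into one bit each with NumPy. A real design would also need to handle missingness and higher ploidy/allele counts:

```python
# Bit-packing 0/1 calls to one bit each; 8x smaller along the packed axis.
import numpy as np

calls = np.random.randint(0, 2, size=(100, 32), dtype="uint8")

packed = np.packbits(calls, axis=1)
unpacked = np.unpackbits(packed, axis=1)[:, : calls.shape[1]]

assert np.array_equal(calls, unpacked)
print(calls.nbytes, "->", packed.nbytes, "bytes")
```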
- Enrichment
  - How do we add and represent data along axes (e.g. variants/samples)?
  - The approach taken in Hail/Glow is to attach results of methods as new fields along the axes, and this is a good fit for new Dataset variables (sketched below), but how will this work with multi-indexing? What happens if there are non-unique values?
  - Is relying on Pandas indexing going to cause excessive memory overhead?
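
A sketch of Hail/Glow-style enrichment expressed as new Dataset variables along an existing dimension; variable and dimension names are illustrative:

```python
# Attach a per-variant method result as a new variable along the variants axis.
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {"call_genotype": (("variants", "samples", "ploidy"),
                       np.random.randint(0, 2, size=(100, 10, 2), dtype="int8"))}
)

# Method result becomes a new field along the variants dimension
ds = ds.assign(
    variant_alt_freq=ds["call_genotype"].mean(dim=("samples", "ploidy"))
)
print(ds["variant_alt_freq"][:3].values)
```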
 
- Limitations
  - Since we're targeting some future state for dependencies, we should make sure to keep track of what's missing:
    - https://github.com/related-sciences/rs-platform/issues/19#issuecomment-594211481 -- Xarray/Dask/Numba
    - Explore Xarray as the basis for a genetic toolkit API #5 (comment) -- Xarray limitations in more detail
 
 