We would like to move towards an API of only functions that act on or create Xarray datasets. The wrapper classes in `core.py` should be removed and the conversion functions in them moved elsewhere.
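As a rough illustration of what a functions-only API could look like, here is a minimal sketch. The function names, variable names (`call_genotype`, `variant_position`, `call_dosage`), dimension names, and the use of negative values for missing calls are all assumptions made for this example, not decisions from this issue.

```python
import numpy as np
import xarray as xr


def create_genotype_dataset(call_genotype, variant_position):
    """Build a dataset from plain arrays (hypothetical variable/dimension names)."""
    return xr.Dataset(
        {
            "call_genotype": (
                ("variants", "samples", "ploidy"),
                np.asarray(call_genotype, dtype="int8"),
            ),
            "variant_position": (("variants",), np.asarray(variant_position)),
        }
    )


def count_alt_alleles(ds):
    """Return a new dataset with per-call non-reference allele counts added."""
    gt = ds["call_genotype"]
    # Assumption: negative values mark missing calls; any missing allele
    # makes the whole call missing (NaN) in the result.
    called = (gt >= 0).all(dim="ploidy")
    dosage = (gt > 0).sum(dim="ploidy").where(called)
    return ds.assign(call_dosage=dosage)


# Two variants, two samples, diploid; the last call is missing.
ds = create_genotype_dataset(
    [[[0, 0], [0, 1]], [[1, 1], [-1, -1]]],
    [100, 200],
)
ds = count_alt_alleles(ds)
```

Both functions take or return a plain `xr.Dataset`, so there is no wrapper class for users to learn and the result composes with ordinary Xarray indexing and selection.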
A few problems that the wrappers were, at least in part, intended to solve are:
- What conventions should I/O readers adhere to when building datasets?
  - Should they have default coordinates? This facilitates indexing/selecting data, but it could be left up to the user.
  - Should the strategy for representing missing values in floating point data be the same as the strategy used by readers of integer types? If we go the scikit-allel sentinel + boolean mask route then probably not, but if we use masked Dask arrays then it likely makes more sense for the reader to be responsible for creating them.
  - How should we represent phased genotypes? As far as I've seen, phasing could be specific to only variants, variants + samples, or an entire dataset, so it may make sense for readers to return a 1D array, a 2D array, or global attributes (whatever is most appropriate).
- How do we assert dimensions and dtypes on datasets? Maybe we shouldn't do this at all, or there could be functions to do this at the beginning of method functions (as in scikit-learn).
- How do we standardize naming conventions for fields like `contig`, `pos`, `alleles`, `GT`, etc.?
- How do we make it clear which kinds of datasets can be converted to others? It is probably best to have functions for things like computing dosages, hard calls, GWAS encodings, allele counts etc. that take arrays/datasets and leave it up to users not to pass them anything that doesn't make sense.
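One way the validation and conversion questions above could fit together is sketched below: a small check on dimensions and dtypes that runs at the start of a method function, followed by an allele count computed directly from the dataset. The `SPEC` table, the variable and dimension names, and the negative-sentinel encoding of missing calls are hypothetical choices for this example only.

```python
import numpy as np
import xarray as xr

# Hypothetical spec: variable name -> (expected dims, expected dtype kind).
SPEC = {
    "call_genotype": (("variants", "samples", "ploidy"), "i"),
    "variant_position": (("variants",), "i"),
}


def check_dataset(ds, spec=SPEC):
    """Raise ValueError if a required variable is missing or malformed."""
    for name, (dims, kind) in spec.items():
        if name not in ds:
            raise ValueError(f"missing variable: {name}")
        var = ds[name]
        if var.dims != dims:
            raise ValueError(f"{name}: expected dims {dims}, got {var.dims}")
        if var.dtype.kind != kind:
            raise ValueError(f"{name}: expected dtype kind {kind!r}, got {var.dtype}")


def count_alleles(ds, n_alleles=2):
    """Count each allele per variant; returns dims (variants, alleles)."""
    check_dataset(ds)  # validate up front, then trust the inputs
    gt = ds["call_genotype"]
    # Missing calls use a negative sentinel, so they never equal an allele
    # index and are excluded from every count.
    counts = [(gt == a).sum(dim=("samples", "ploidy")) for a in range(n_alleles)]
    return xr.concat(counts, dim="alleles").transpose("variants", "alleles")


ds = xr.Dataset(
    {
        "call_genotype": (
            ("variants", "samples", "ploidy"),
            np.array([[[0, 0], [0, 1]], [[1, 1], [-1, -1]]], dtype="int8"),
        ),
        "variant_position": (("variants",), np.array([100, 200])),
    }
)
counts = count_alleles(ds)
```

With this layout, passing a dataset that does not make sense for the computation (for example, floating point genotypes) fails early with a readable message instead of producing silently wrong counts.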