Please describe your wishes and possible alternatives to achieve the desired result.
Our use of the `pandas.DataFrame` API is relatively small and is now clearly documented by the adapter `Dataset2D` class. In theory, anything that satisfies this `Protocol` should work as a `DataFrame`-like class in `anndata`:
anndata/src/anndata/_core/xarray.py
Lines 33 to 51 in 401e0d1
class Dataset2D:
    r"""
    Bases :class:`~collections.abc.Mapping`\ [:class:`~collections.abc.Hashable`, :class:`~xarray.DataArray` | :class:`~anndata.experimental.backed.Dataset2D`\ ]

    A wrapper class meant to enable working with lazy dataframe data according to
    :class:`~anndata.AnnData`'s internal API.  This class ensures that "dataframe-invariants"
    are respected, namely that there is only one 1d dim and coord with the same name i.e.,
    like a :class:`pandas.DataFrame`.

    You should not have to initiate this class yourself.  Setting an :class:`xarray.Dataset`
    into a relevant part of the :class:`~anndata.AnnData` object will attempt to wrap that
    object in this object, trying to enforce the "dataframe-invariants."

    Because xarray requires :attr:`xarray.Dataset.coords` to be in-memory, this class provides
    handling for an out-of-memory index via :attr:`~anndata.experimental.backed.Dataset2D.true_index`.
    This feature is helpful for loading remote data faster where the index itself may not be initially useful
    for constructing the object e.g., cell ids.
    """
To me the concrete steps would be:
- Refactor the current `Dataset2D` class to inherit from a runtime-checkable `Protocol`, and then replace all instances of `pd.DataFrame` or `Dataset2D` throughout the codebase with said `Protocol`. This will likely entail adding a new method to the `Protocol` to handle `anndata.concat`.
- Move the `xarray`/`Dataset` dependency out of `anndata` and into its own package.
- Create new test cases that use both the `Protocol` and an actual `pandas.DataFrame` object to test the functionality in the absence of `Dataset2D`.
- Create other readers/adapters for other dataframe-like libraries (`dask.DataFrame`, `polars`, `cudf`, etc.).
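As a rough illustration of the first and third steps, here is a minimal sketch of what a runtime-checkable `Protocol` could look like. Everything in it is an assumption made for illustration: the name `DataFrameLike`, the choice of members (`shape`, `index`, `columns`, `__getitem__`, `__len__`), the toy `DictFrame` adapter, and the `n_rows` helper are hypothetical and are not taken from `anndata` or from the real `Dataset2D` interface.

```python
from collections.abc import Hashable, Sequence
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class DataFrameLike(Protocol):
    """Hypothetical protocol; the member list is an assumption, not anndata's API."""

    @property
    def shape(self) -> tuple[int, int]: ...

    @property
    def index(self) -> Sequence[Hashable]: ...

    @property
    def columns(self) -> Sequence[Hashable]: ...

    def __getitem__(self, key: Hashable) -> Any: ...

    def __len__(self) -> int: ...


class DictFrame:
    """Toy adapter over a dict of equal-length columns (illustration only)."""

    def __init__(self, data: dict[Hashable, list[Any]], index: list[Hashable]) -> None:
        if any(len(col) != len(index) for col in data.values()):
            raise ValueError("every column must have the same length as the index")
        self._data = data
        self._index = index

    @property
    def shape(self) -> tuple[int, int]:
        return (len(self._index), len(self._data))

    @property
    def index(self) -> list[Hashable]:
        return self._index

    @property
    def columns(self) -> list[Hashable]:
        return list(self._data)

    def __getitem__(self, key: Hashable) -> list[Any]:
        return self._data[key]

    def __len__(self) -> int:
        return len(self._index)


def n_rows(df: DataFrameLike) -> int:
    """Library-style code typed against the protocol rather than a concrete class."""
    if not isinstance(df, DataFrameLike):
        raise TypeError(f"{type(df).__name__} does not satisfy DataFrameLike")
    return df.shape[0]


# DictFrame never inherits from DataFrameLike; it matches structurally.
obs = DictFrame({"cell_type": ["B", "T"], "n_genes": [120, 98]}, index=["c0", "c1"])
```

Note that `isinstance` against a runtime-checkable `Protocol` only verifies that the members exist, not their signatures or return types, so the test cases from the third step would still need to exercise actual behaviour. Since `pandas.DataFrame` exposes the same attribute names used here, it should pass the same structural check without any adapter.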