DataFrame API for obs and var keys #2043

@ilan-gold

Description

Please describe your wishes and possible alternatives to achieve the desired result.

The subset of the pandas.DataFrame API that anndata relies on is relatively small and is now clearly documented by the adapter Dataset2D class. In theory, anything that satisfies the same interface (expressed as a Protocol) should work as a DataFrame-like class in anndata:

class Dataset2D:
    r"""
    Bases :class:`~collections.abc.Mapping`\ [:class:`~collections.abc.Hashable`, :class:`~xarray.DataArray` | :class:`~anndata.experimental.backed.Dataset2D`\ ]
    A wrapper class meant to enable working with lazy dataframe data according to
    :class:`~anndata.AnnData`'s internal API. This class ensures that "dataframe-invariants"
    are respected, namely that there is only one 1d dim and coord with the same name i.e.,
    like a :class:`pandas.DataFrame`.
    You should not have to initiate this class yourself. Setting an :class:`xarray.Dataset`
    into a relevant part of the :class:`~anndata.AnnData` object will attempt to wrap that
    object in this object, trying to enforce the "dataframe-invariants."
    Because xarray requires :attr:`xarray.Dataset.coords` to be in-memory, this class provides
    handling for an out-of-memory index via :attr:`~anndata.experimental.backed.Dataset2D.true_index`.
    This feature is helpful for loading remote data faster where the index itself may not be initially useful
    for constructing the object e.g., cell ids.
    """

To me the concrete steps would be

  • Refactor the current Dataset2D class to inherit from a runtime-checkable Protocol and then replace all instances throughout the codebase of pd.DataFrame or Dataset2D with said Protocol. This will likely entail adding a new method to the Protocol to handle anndata.concat.
  • Move the xarray/Dataset dependency out of anndata and into its own package
  • Create new test cases that use both the Protocol and an actual pandas.DataFrame object to test the functionality in the absence of Dataset2D
  • Create other readers/adapters for other dataframe-like libraries (dask.DataFrame, polars, cudf, etc.)
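
To make the first step more concrete, here is a minimal sketch of what a runtime-checkable Protocol could look like. Everything in it is an assumption for illustration: the name DataFrameLike, the choice of members (shape, columns, __getitem__, __len__), and the toy DictFrame class are hypothetical and are not taken from anndata or from Dataset2D. The real Protocol would need to be derived from what anndata actually calls on obs and var.

```python
from __future__ import annotations

from collections.abc import Hashable, Sequence
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class DataFrameLike(Protocol):
    """Hypothetical protocol; the member list is an illustrative assumption."""

    @property
    def shape(self) -> tuple[int, int]: ...

    @property
    def columns(self) -> Sequence[Hashable]: ...

    def __getitem__(self, key: Hashable) -> Any: ...

    def __len__(self) -> int: ...


class DictFrame:
    """Toy column store standing in for a third-party dataframe library.

    It does not inherit from DataFrameLike; it only matches it structurally.
    """

    def __init__(self, data: dict[Hashable, list[Any]]) -> None:
        if len({len(col) for col in data.values()}) > 1:
            raise ValueError("all columns must have the same length")
        self._data = data

    @property
    def shape(self) -> tuple[int, int]:
        return (len(self), len(self._data))

    @property
    def columns(self) -> list[Hashable]:
        return list(self._data)

    def __getitem__(self, key: Hashable) -> list[Any]:
        return self._data[key]

    def __len__(self) -> int:
        # Number of rows: length of any column, or 0 if there are none.
        return len(next(iter(self._data.values()), []))


def describe(df: DataFrameLike) -> str:
    """Hypothetical consumer that accepts anything matching the protocol."""
    if not isinstance(df, DataFrameLike):
        raise TypeError(f"{type(df).__name__} is not DataFrame-like")
    n_rows, n_cols = df.shape
    return f"{n_rows} rows x {n_cols} columns"


frame = DictFrame({"cell_type": ["B", "T"], "n_genes": [10, 20]})
print(describe(frame))
```

Note that an isinstance check against a runtime-checkable Protocol only verifies that the members exist, not their signatures or return types, so the proposed test cases would still be needed to catch behavioural mismatches. A pandas.DataFrame also exposes shape, columns, __getitem__ and __len__, so it would pass the same check without any adapter.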
