How to handle encoding #68

@TomNicholas

Encoding is the part of the virtualization problem that I have the least clear idea of how to handle using this library.

The discussion in #42 shows that round-tripping an xarray dataset to disk as kerchunk references and then opening it with fsspec currently requires ignoring the encoding of the time variable. (In general, time variables are going to be the most common offenders for encoding mistakes, as different files in the same dataset can easily end up with different encoding options for times.)

Part of the problem is that encoding is kind of a second-class citizen in xarray's data model. It happens magically behind the scenes in the backend file opener system (see pydata/xarray#8548 for a great overview), and then is attached at the Variable level as var.encoding, but there aren't general rules for propagating it through operations (see pydata/xarray#1614 and pydata/xarray#6323). xr.concat will just keep the encoding of the first object passed, which for us could lead to an incorrect result.
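
To make the concat issue concrete, here is a minimal sketch of checking for encoding mismatches before concatenating. The helper `encoding_mismatches`, the `make_dataset` fixture, and the list of encoding keys it inspects are hypothetical, made up for illustration; only `Variable.encoding` and `xr.concat` are real xarray API. It assumes every dataset contains the same variables.

```python
import numpy as np
import xarray as xr


def encoding_mismatches(datasets, keys=("units", "calendar", "dtype", "scale_factor", "add_offset")):
    """Map variable name -> {encoding key: [value in each dataset]} where values differ.

    Hypothetical helper: assumes all datasets share the same variable names.
    """
    mismatches = {}
    for name in datasets[0].variables:
        for key in keys:
            values = [ds.variables[name].encoding.get(key) for ds in datasets]
            if any(v != values[0] for v in values[1:]):
                mismatches.setdefault(name, {})[key] = values
    return mismatches


def make_dataset(dates, units):
    # Tiny in-memory stand-in for one file; `units` mimics what a backend would record.
    ds = xr.Dataset(
        {"x": ("time", np.arange(len(dates), dtype="float64"))},
        coords={"time": np.array(dates, dtype="datetime64[ns]")},
    )
    ds.variables["time"].encoding["units"] = units
    return ds


ds_a = make_dataset(["2000-01-01", "2000-01-02"], "days since 2000-01-01")
ds_b = make_dataset(["2000-01-03", "2000-01-04"], "hours since 2000-01-03")

# Report which variables disagree on encoding across the inputs.
print(encoding_mismatches([ds_a, ds_b]))

# Concatenate anyway and inspect what encoding survives on the result.
combined = xr.concat([ds_a, ds_b], dim="time")
print(combined.variables["time"].encoding)
```

For in-memory (already decoded) data like this the surviving encoding only affects how the result is written out, but for virtual references to still-encoded chunks a single surviving encoding would be applied to chunks that were written with a different one.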

There is also some ambiguity around whether certain types of encoding should be handled by xarray or zarr. For example, scale and offset (scale_factor and add_offset in the CF conventions) are normally handled by xarray, but could also be handled by a Zarr codec. If all encoding were represented at the zarr level then virtual array concatenation (#5) could potentially solve this.
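
As a rough illustration of why per-file scale and offset matter, the sketch below uses plain NumPy and the CF unpacking rule (unpacked = packed * scale_factor + add_offset). The `cf_decode` helper and the two "files" are invented for the example; they are not part of xarray or zarr.

```python
import numpy as np


def cf_decode(packed, scale_factor, add_offset):
    # CF convention for packed data: unpacked = packed * scale_factor + add_offset
    return packed * scale_factor + add_offset


# Two hypothetical files storing identical packed integers with different encodings.
file_a = {"packed": np.array([100, 200], dtype="int16"), "scale_factor": 0.1, "add_offset": 0.0}
file_b = {"packed": np.array([100, 200], dtype="int16"), "scale_factor": 0.01, "add_offset": 5.0}

# Correct: decode each file with its own parameters, then concatenate.
correct = np.concatenate([cf_decode(**file_a), cf_decode(**file_b)])

# Naive: concatenate the packed bytes and decode everything with the first file's encoding,
# which is what happens if only one encoding survives concatenation.
naive = cf_decode(
    np.concatenate([file_a["packed"], file_b["packed"]]),
    file_a["scale_factor"],
    file_a["add_offset"],
)

print(correct)
print(naive)
```

The second half of `naive` is silently wrong, which is the failure mode to avoid; keeping the scale and offset attached to each array (for example at the codec level) would sidestep it.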

One practical workaround for a lot of cases would be to load the encoded variables into memory (i.e. "inline" them) - see #62. xarray then effectively decodes them on load, you do the concatenation in memory, and they are re-encoded when you save back out to the new zarr store. For dimension coordinates you often might want to do this anyway.
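
A minimal sketch of that decode, concatenate, re-encode flow, using only in-memory xarray objects. The `encoded_dataset` fixture stands in for the still-encoded contents of each file and is hypothetical; how this library would actually inline the variables is not shown here. The target units chosen at the end are also just an assumption.

```python
import numpy as np
import xarray as xr


def encoded_dataset(offsets, units):
    # Stand-in for one file's still-encoded contents: integer offsets plus a CF units attribute.
    return xr.Dataset(
        {"x": ("time", np.zeros(len(offsets)))},
        coords={"time": ("time", np.array(offsets, dtype="int64"), {"units": units})},
    )


raw_a = encoded_dataset([0, 1], "days since 2000-01-01")
raw_b = encoded_dataset([0, 24], "hours since 2000-01-03")

# Decode each dataset with its own encoding, so the values become real datetimes.
decoded = [xr.decode_cf(ds) for ds in (raw_a, raw_b)]

# Concatenating decoded values is safe regardless of how each file was encoded.
combined = xr.concat(decoded, dim="time")

# Pick one encoding explicitly for the output rather than inheriting it implicitly.
combined["time"].encoding = {"units": "days since 2000-01-01"}

print(combined["time"].values)
```

Writing `combined` out afterwards (e.g. to a zarr store) would then encode the whole time axis consistently with the chosen units.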

Labels: encoding, help wanted (extra attention is needed), xarray (requires changes to xarray upstream), zarr-python (relevant to zarr-python upstream)
