Encoding is the part of the virtualization problem that I have the least clear idea of how to handle using this library.
The discussion in #42 shows that roundtripping an xarray dataset to disk as kerchunk references and then opening it with fsspec currently requires ignoring the encoding of the time variable. (In general, time variables are going to be the most common offenders for encoding mistakes, as different files in the same dataset can easily end up with different encoding options for times.)
Part of the problem is that encoding is kind of a second-class citizen in xarray's data model. It happens magically behind the scenes in the backend file opener system (see pydata/xarray#8548 for a great overview), and then is attached at the `Variable` level as `var.encoding`, but there aren't general rules for propagating it through operations (see pydata/xarray#1614 and pydata/xarray#6323). `xr.concat` will just keep the encoding of the first object passed, which for us could lead to an incorrect result.
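One way to guard against silently keeping the first object's encoding is to compare encodings before concatenating. The sketch below is only an illustration: `find_encoding_conflicts` and `DECODING_KEYS` are hypothetical names, not part of xarray, kerchunk, or this library, and the plain dicts stand in for what `var.encoding` might contain for the same variable in two files.

```python
# Hypothetical helper, not an existing xarray/kerchunk API.
# Keys assumed (for this sketch) to change how raw bytes are decoded.
DECODING_KEYS = ("units", "calendar", "scale_factor", "add_offset", "_FillValue", "dtype")


def find_encoding_conflicts(encodings, keys=DECODING_KEYS):
    """Return {key: [value per input]} for each key whose value differs between inputs."""
    conflicts = {}
    for key in keys:
        values = [enc.get(key) for enc in encodings]
        if any(v != values[0] for v in values[1:]):
            conflicts[key] = values
    return conflicts


# Stand-ins for var.encoding of the time variable in two files of one dataset.
time_enc_a = {"units": "days since 2000-01-01", "calendar": "standard", "dtype": "int64"}
time_enc_b = {"units": "days since 2001-01-01", "calendar": "standard", "dtype": "int64"}

conflicts = find_encoding_conflicts([time_enc_a, time_enc_b])
print(conflicts)
# {'units': ['days since 2000-01-01', 'days since 2001-01-01']}
```

An empty result would mean that keeping the first encoding is harmless for these keys; a non-empty one flags exactly the case where the concatenated result would be decoded incorrectly.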
There is also some ambiguity around whether certain types of encoding should be handled by xarray or zarr. For example `scale` and `offset` exist in the CF conventions and so are normally handled by xarray, but could also be handled by a Zarr codec. If all encoding were represented at the zarr level then virtual array concatenation (#5) could potentially solve this.
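To make the scale/offset case concrete: the CF conventions unpack stored values as `packed * scale_factor + add_offset`. The sketch below uses plain Python lists and made-up numbers (the file contents and `cf_unpack` are assumptions for illustration only) to show what goes wrong if the second file's chunks are decoded with the first file's parameters.

```python
def cf_unpack(packed, scale_factor=1.0, add_offset=0.0):
    """Apply the CF unpacking formula to a list of stored values."""
    return [p * scale_factor + add_offset for p in packed]


# Made-up example: the same variable packed differently in two files.
file_a = {"packed": [0, 100, 200], "scale_factor": 0.5, "add_offset": 250.0}
file_b = {"packed": [0, 100, 200], "scale_factor": 0.25, "add_offset": 273.0}

# Decoding file_b with its own parameters gives the intended values...
correct_b = cf_unpack(file_b["packed"], file_b["scale_factor"], file_b["add_offset"])
# ...while reusing file_a's parameters (as happens when only the first
# encoding survives concatenation) gives different numbers.
wrong_b = cf_unpack(file_b["packed"], file_a["scale_factor"], file_a["add_offset"])

print(correct_b)  # [273.0, 298.0, 323.0]
print(wrong_b)    # [250.0, 300.0, 350.0]
```

If the scale and offset were instead recorded per array at the zarr level, each set of chunks would carry its own parameters and this mismatch would not arise.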
One practical workaround for a lot of cases would be to load the encoded variables into memory (i.e. "inline" them) - see #62. Then effectively xarray has decoded them, you do the concatenation in-memory, and re-encode them when you save back out to the new zarr store. For dimension coordinates you often might want to do this anyway.
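The decode, concatenate, re-encode sequence can be sketched without xarray for a time coordinate. This is a simplified illustration, not how xarray implements it: the helpers are hypothetical, they only understand units of the form `days since YYYY-MM-DD`, and they ignore non-standard calendars.

```python
from datetime import datetime, timedelta


def parse_reference(units):
    """Parse 'days since YYYY-MM-DD' and return the reference datetime."""
    unit, _, ref = units.partition(" since ")
    if unit != "days":
        raise ValueError(f"this sketch only handles 'days since ...', got {units!r}")
    return datetime.strptime(ref, "%Y-%m-%d")


def decode_days(values, units):
    ref = parse_reference(units)
    return [ref + timedelta(days=v) for v in values]


def encode_days(times, units):
    ref = parse_reference(units)
    return [(t - ref).days for t in times]


# Made-up time variables from two files with different reference dates.
file_a = {"values": [0, 1], "units": "days since 2000-01-01"}
file_b = {"values": [0, 1], "units": "days since 2000-01-03"}

# Concatenating the raw values and keeping the first encoding repeats days 0 and 1.
naive = file_a["values"] + file_b["values"]

# Decoding each file with its own encoding, concatenating in memory,
# then re-encoding once with a single encoding gives a consistent axis.
decoded = decode_days(file_a["values"], file_a["units"]) + decode_days(
    file_b["values"], file_b["units"]
)
encoded = encode_days(decoded, "days since 2000-01-01")

print(naive)    # [0, 1, 0, 1]
print(encoded)  # [0, 1, 2, 3]
```

Because coordinate variables are usually small, holding the decoded values in memory like this is cheap compared with the data variables they index.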