How to handle encoding

Encoding is the part of the virtualization problem that I am least sure how to handle with this library.

The [discussion in #42](https://github.com/TomNicholas/VirtualiZarr/pull/42#discussion_r1536040324) shows that roundtripping an xarray dataset to disk as kerchunk references and then opening it with fsspec currently requires ignoring the encoding of the time variable. (In general, time variables are going to be the most common offenders for encoding mistakes, as different files in the same dataset can easily end up with different encoding options for times.)

Part of the problem is that encoding is kind of a second-class citizen in xarray's data model. It happens magically behind the scenes in the backend file opener system (see https://github.com/pydata/xarray/issues/8548 for a great overview), and then is attached at the `Variable` level as `var.encoding`, but there aren't general rules for propagating it through operations (see https://github.com/pydata/xarray/issues/1614 and https://github.com/pydata/xarray/issues/6323). `xr.concat` will just keep the encoding of the first object passed, which for us could lead to an incorrect result.
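To make the `xr.concat` concern concrete, here is a minimal sketch of a check that could run before concatenation. It only compares plain dictionaries, so it assumes you have already collected `ds[name].encoding` from each dataset yourself. The helper name `find_encoding_conflicts` is hypothetical and not part of xarray or this library.

```python
def find_encoding_conflicts(encodings):
    """Return {key: [value per input]} for every encoding key whose values differ.

    `encodings` is a list of dicts, e.g. the `.encoding` of the same variable
    taken from each file. A key missing from one dict shows up as None.
    (Hypothetical helper, for illustration only.)
    """
    conflicts = {}
    all_keys = sorted(set().union(*encodings))
    for key in all_keys:
        values = [enc.get(key) for enc in encodings]
        if any(v != values[0] for v in values[1:]):
            conflicts[key] = values
    return conflicts


# Two files whose time variable was written with different units.
enc_a = {"units": "days since 2000-01-01", "calendar": "standard", "dtype": "int64"}
enc_b = {"units": "hours since 2000-01-01", "calendar": "standard", "dtype": "int64"}

conflicts = find_encoding_conflicts([enc_a, enc_b])
print(conflicts)
```

If this returns a non-empty dict, keeping only the first object's encoding would silently misinterpret the stored values from the other files.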

There is also some ambiguity around whether certain types of encoding should be handled by xarray or zarr. For example, `scale_factor` and `add_offset` exist in the CF conventions and so are normally handled by xarray, but the same transformation could also be handled by a Zarr codec. If all encoding were represented at the zarr level then virtual array concatenation (#5) could potentially solve this.
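For reference, the CF packing arithmetic in question is small enough to sketch in plain Python. The function names `cf_encode` and `cf_decode` are made up for illustration. The formula `unpacked = packed * scale_factor + add_offset` is the one given in the CF conventions, and whichever layer owns the encoding (xarray or a Zarr codec) would need to apply an equivalent transformation.

```python
def cf_decode(packed, scale_factor=1.0, add_offset=0.0):
    """Unpack stored integers into physical values (CF packed-data formula)."""
    return [p * scale_factor + add_offset for p in packed]


def cf_encode(values, scale_factor=1.0, add_offset=0.0):
    """Pack physical values into integers; the inverse of cf_decode, with rounding."""
    return [round((v - add_offset) / scale_factor) for v in values]


# Illustrative values only: temperatures in kelvin packed at 0.01 K resolution.
temperatures = [273.15, 280.0, 300.5]
packed = cf_encode(temperatures, scale_factor=0.01, add_offset=273.15)
unpacked = cf_decode(packed, scale_factor=0.01, add_offset=273.15)
print(packed)
print(unpacked)
```

Two files with different `scale_factor` or `add_offset` values store different integers for the same physical value, which is why their chunks cannot be concatenated virtually under a single encoding.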

One practical workaround for a lot of cases would be to load the encoded variables into memory (i.e. "inline" them) - see #62. Then effectively xarray has decoded them, you do the concatenation in-memory, and re-encode them when you save back out to the new zarr store. For dimension coordinates you often might want to do this anyway.
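The decode, concatenate, re-encode sequence can be sketched without xarray for the time-variable case. This is a toy model under stated assumptions: it only handles units of the form `"<unit> since <ISO date>"` with units that `datetime.timedelta` accepts as keywords (`days`, `hours`, `minutes`, `seconds`), it ignores calendars, and `decode_times` / `encode_times` are hypothetical names rather than xarray functions.

```python
from datetime import datetime, timedelta


def _parse_units(units):
    """Split e.g. 'hours since 2000-01-01' into ('hours', datetime(2000, 1, 1))."""
    unit, _, reference = units.partition(" since ")
    return unit, datetime.fromisoformat(reference)


def decode_times(values, units):
    """Turn stored numbers into datetimes using the file's own units."""
    unit, epoch = _parse_units(units)
    return [epoch + timedelta(**{unit: v}) for v in values]


def encode_times(times, units):
    """Turn datetimes back into numbers under a single, chosen set of units."""
    unit, epoch = _parse_units(units)
    step = timedelta(**{unit: 1})
    return [(t - epoch) / step for t in times]


# Two files covering consecutive days, written with different time units.
file_a = {"values": [0, 1], "units": "days since 2000-01-01"}
file_b = {"values": [48, 72], "units": "hours since 2000-01-01"}

# Concatenating the raw stored values mixes days and hours.
naive = file_a["values"] + file_b["values"]

# Decoding each file first, then concatenating in memory, gives the right times.
decoded = decode_times(**file_a) + decode_times(**file_b)

# Re-encode once, with a single consistent encoding, when writing back out.
reencoded = encode_times(decoded, "days since 2000-01-01")
print(naive)
print(reencoded)
```

The cost is that the variable's data has to be read into memory, which is usually acceptable for small variables such as dimension coordinates.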

How to handle encoding #68
