Stores written by VirtualiZarr handle scaling/offset via a `FixedScaleOffset` filter in dataset encoding, rather than with CF-convention-style attributes, for good reasons. Unfortunately, it looks like this filter is applied before masking with `_FillValue`, when it should be applied afterward (since `_FillValue` should have the same type as the packed data).
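To make the expected order concrete, here's a plain-numpy sketch of CF-style decoding: mask against `_FillValue` in the packed dtype first, apply `scale_factor` second (values chosen to match the MVE below):

```python
import numpy as np

# Packed uint8 data: 1 and 199 are valid samples, 255 is the fill value.
packed = np.array([1, 255, 199], dtype=np.uint8)
fill_value = np.uint8(255)  # same dtype as the packed data, per CF conventions
scale_factor = 0.5

# CF-style decode: mask against _FillValue first, *then* apply scale_factor.
unpacked = np.where(packed == fill_value, np.nan, packed * scale_factor)
print(unpacked)  # [ 0.5  nan 99.5]
```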
It looks like there's been a ton of discussion on here around fill values, encoding, scaling, etc., but I haven't found anybody yet who's run into this exact issue. I'm torn on whether this is a Zarr issue or a VirtualiZarr issue but in the end I decided to post here, as I think it could only ever come up in this context.
As an MVE, let's take some low-precision floats between 0 and 100 that we want to store as `uint8` with a `scale_factor` of 0.5 and a `_FillValue` of 255:
```python
import xarray as xr
import numpy as np

ds = xr.Dataset(
    {
        "x": ("t", np.array([0.55, np.nan, 99.45])),
    },
    coords={
        "t": [0, 1, 2],
    },
)
ds.to_netcdf(
    "mve.nc",
    encoding={"x": {"dtype": "uint8", "scale_factor": 0.5, "_FillValue": 255}},
    mode="w",
)
```
Then our packed data will have values about twice the actual data, potentially ranging from 0 to 200, with 255 replacing NaNs. `xr.decode_cf` handles this as we'd expect, returning our original data (with some loss of precision due to our own choices, of course):
```python
file_ds = xr.open_dataset("mve.nc", mask_and_scale=False)
print(file_ds.x.encoding)
print(file_ds.x.attrs)
print(file_ds.x.values)
print(xr.decode_cf(file_ds).x.values)
```
```
{'dtype': dtype('uint8'), 'zlib': False, 'szip': False, 'zstd': False, 'bzip2': False, 'blosc': False, 'shuffle': False, 'complevel': 0, 'fletcher32': False, 'contiguous': True, 'chunksizes': None, 'source': '/home/charriso/org/work_projects/mve.nc', 'original_shape': (3,)}
{'_FillValue': np.uint8(255), 'scale_factor': np.float64(0.5)}
[  1 255 199]
[ 0.5  nan 99.5]
```
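The encode side that produced those packed values can be sketched the same way (my reading of the CF semantics, not xarray's actual code path): scale first, substitute `_FillValue` for NaN, then cast to the packed dtype:

```python
import numpy as np

# Encode: divide by scale_factor, replace NaN with _FillValue, cast to uint8.
data = np.array([0.55, np.nan, 99.45])
scaled = data / 0.5
packed = np.where(np.isnan(scaled), 255, np.round(scaled)).astype(np.uint8)
print(packed)  # [  1 255 199]
```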
Whereas after round-tripping through virtualizarr, `scale_factor` has been applied before masking the data, so we just get 255/2 in there instead of NaN (I checked that masking isn't simply being skipped: manually changing the `fill_value` in the JSON to 127.5 results in proper masking):
```python
from virtualizarr import open_virtual_dataset

manifest_ds = open_virtual_dataset("mve.nc")
manifest_ds.virtualize.to_kerchunk("vds.json", format="json")
vds = xr.open_dataset(
    "vds.json",
    engine="kerchunk",
    mask_and_scale=False,
)
print(vds.x.encoding)
print(vds.x.attrs)
print(vds.x.values)
print(xr.decode_cf(vds).x.values)
```
```
/home/charriso/micromamba/envs/fillval_mve/lib/python3.12/site-packages/zarr/core/metadata/v2.py:192: UserWarning: Found an empty list of filters in the array metadata document. This is contrary to the Zarr V2 specification, and will cause an error in the future. Use None (or Null in a JSON document) instead of an empty list of filters.
  warnings.warn(msg, UserWarning, stacklevel=1)
{'chunks': (3,), 'preferred_chunks': {'t': 3}, 'compressors': (), 'filters': (FixedScaleOffset(scale=2.0, offset=0, dtype='<f8', astype='|u1'),), 'shards': None, 'dtype': dtype('float64')}
{'_FillValue': np.float64(255.0)}
[  0.5 127.5  99.5]
[  0.5 127.5  99.5]
```
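The failure mode falls out of the filter semantics: `FixedScaleOffset` unpacks the raw bytes with no knowledge of `_FillValue`, so by the time xarray applies the mask, 255 has already become 127.5. A plain-numpy sketch mirroring (to my understanding) the filter's `data / scale + offset` decode:

```python
import numpy as np

packed = np.array([1, 255, 199], dtype=np.uint8)

# FixedScaleOffset decode, as I understand it: data / scale + offset,
# applied with no knowledge of _FillValue.
decoded = packed / 2.0 + 0
print(decoded)  # [  0.5 127.5  99.5]

# Masking afterwards against _FillValue=255.0 can never match anything:
masked = np.where(decoded == 255.0, np.nan, decoded)
print(masked)  # still [  0.5 127.5  99.5] -- no NaN anywhere
```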
I notice the warning about an empty list of filters; I'm not sure whether it's relevant here.
One could work around this by manually editing `.zmetadata` to contain the scaled `fill_value` rather than the original, but a proper fix would be great! I'm happy to make a PR somewhere, but given the extensive existing debate on Zarr/CF/xarray encoding I figured I'd make this post first.
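For anyone hitting this in the meantime, that workaround can be scripted: rewrite the array's `fill_value` in the reference metadata into unpacked units (255 / scale = 127.5). A stdlib-only sketch against an in-memory stand-in for the kerchunk refs dict (the key names follow the single-file kerchunk JSON layout; adapt to your actual reference file):

```python
import json

# Stand-in for the contents of vds.json (single-file kerchunk layout);
# in practice you'd json.load() the real reference file instead.
refs = {
    "version": 1,
    "refs": {
        "x/.zarray": json.dumps({
            "shape": [3], "chunks": [3], "dtype": "<f8",
            "compressor": None, "order": "C", "zarr_format": 2,
            "fill_value": 255,
            "filters": [{"id": "fixedscaleoffset", "scale": 2.0,
                         "offset": 0, "dtype": "<f8", "astype": "|u1"}],
        }),
    },
}

# Workaround: convert fill_value into unpacked units (255 / 2.0 = 127.5).
zarray = json.loads(refs["refs"]["x/.zarray"])
zarray["fill_value"] = zarray["fill_value"] / zarray["filters"][0]["scale"]
refs["refs"]["x/.zarray"] = json.dumps(zarray)

print(json.loads(refs["refs"]["x/.zarray"])["fill_value"])  # 127.5
```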