Description
What happened?
When using the `chunk` method to change the chunk sizes of a Dataset (or a DataArray, which uses the Dataset implementation of `chunk`), the chunk sizes of the underlying Dask arrays are changed, but the "chunks" entry of the `encoding` attribute is not updated accordingly. This causes a NotImplementedError to be raised when attempting to write the Dataset to zarr (and presumably to other formats as well).
Looking at the implementation of `chunk`, every variable is rechunked using the `_maybe_chunk` function, which actually has a parameter, `overwrite_encoded_chunks`, to control exactly this behavior. However, it is an optional parameter that defaults to False, and the call in `chunk` neither provides a value for it nor offers the caller any way to influence it (for example, by exposing an `overwrite_encoded_chunks` parameter itself).
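For illustration, here is a simplified sketch of the relevant logic in `_maybe_chunk` (paraphrased from the xarray source; the real function takes additional parameters such as token and lock):

def _maybe_chunk(name, var, chunks, overwrite_encoded_chunks=False):
    if var.ndim:
        var = var.chunk(chunks)
        # This branch would keep encoding["chunks"] in sync with the new
        # Dask chunks, but it only runs when overwrite_encoded_chunks=True,
        # and Dataset.chunk never passes that.
        if overwrite_encoded_chunks and var.chunks is not None:
            var.encoding["chunks"] = tuple(x[0] for x in var.chunks)
    return var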
I do not know why False was chosen as the default, or what would break if it were changed to True, but judging from the documentation this seems to be the opposite of the intended behavior. From the documentation of `to_zarr`:

> Zarr chunks are determined in the following way:
> From the chunks attribute in each variable’s encoding (can be set via Dataset.chunk).

Which is exactly what it does not do.
What did you expect to happen?
I would expect the "chunks" entry of the `encoding` attribute to be updated to reflect the new chunking scheme.
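In other words (hypothetical output, showing what I would expect rather than what currently happens):

ds = ds.chunk({"x": 25, "y": 25})
print(ds.my_var.encoding["chunks"])
# expected: (25, 25)
# actual:   (50, 50)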
Minimal Complete Verifiable Example
import xarray as xr
import numpy as np
# Create a test Dataset with dimensions x and y, each of size 100, and a chunk size of 50
ds_original = xr.Dataset({"my_var": (["x", "y"], np.random.randn(100, 100))})
# Since 'chunk' does not work, manually set encoding
ds_original.my_var.encoding["chunks"] = (50, 50)
# To best showcase the real-life example, write it to file and read it back again.
# The same could be achieved by just calling .chunk() with chunksizes of 25, but this feels more 'complete'
filepath = "~/chunk_test.zarr"
ds_original.to_zarr(filepath)
ds = xr.open_zarr(filepath)
# Check the chunksizes and "chunks" encoding
print(ds.my_var.chunks)
# >>> ((50, 50), (50, 50))
print(ds.my_var.encoding["chunks"])
# >>> (50, 50)
# Rechunk the Dataset
ds = ds.chunk({"x": 25, "y": 25})
# The chunksizes have changed
print(ds.my_var.chunks)
# >>> ((25, 25, 25, 25), (25, 25, 25, 25))
# But the encoding value remains the same
print(ds.my_var.encoding["chunks"])
# >>> (50, 50)
# Attempting to write this back to zarr raises an error
ds.to_zarr("~/chunk_test_rechunked.zarr")
# NotImplementedError: Specified zarr chunks encoding['chunks']=(50, 50) for variable named 'my_var' would overlap multiple dask chunks ((25, 25, 25, 25), (25, 25, 25, 25)). Writing this array in parallel with dask could lead to corrupted data. Consider either rechunking using `chunk()`, deleting or modifying `encoding['chunks']`, or specify `safe_chunks=False`.
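For completeness, the workaround suggested by the error message does work here (a sketch; manually updating the entry to (25, 25) instead of deleting it works as well):

# Workaround sketch: drop the stale "chunks" encoding entry so that
# to_zarr derives the zarr chunks from the Dask chunks instead.
del ds.my_var.encoding["chunks"]
ds.to_zarr("~/chunk_test_rechunked.zarr")  # now succeeds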
MVCE confirmation
- Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
- Complete example — the example is self-contained, including all data and the text of any traceback.
- Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
- New issue — a search of GitHub Issues suggests this is not a duplicate.
Relevant log output
No response
Anything else we need to know?
No response
Environment
INSTALLED VERSIONS
commit: None
python: 3.10.6 (main, May 29 2023, 11:10:38) [GCC 11.3.0]
python-bits: 64
OS: Linux
OS-release: 5.10.16.3-microsoft-standard-WSL2
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: C.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.10.7
libnetcdf: 4.8.1
xarray: 2023.7.0
pandas: 1.5.3
numpy: 1.24.2
scipy: 1.10.0
netCDF4: 1.5.8
pydap: None
h5netcdf: 0.12.0
h5py: 3.6.0
Nio: None
zarr: 2.14.1
cftime: 1.5.2
nc_time_axis: None
PseudoNetCDF: None
iris: None
bottleneck: 1.3.6
dask: 2022.01.0+dfsg
distributed: 2022.01.0+ds.1
matplotlib: 3.5.1
cartopy: None
seaborn: None
numbagg: None
fsspec: 2023.1.0
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 59.6.0
pip: 23.2.1
conda: None
pytest: 7.2.2
mypy: 1.1.1
IPython: 7.31.1
sphinx: None