Description
What happened?
Problem summary
Appending to a Zarr dataset with xarray.to_zarr() in a Dask cluster context sometimes leads to silent data loss, depending on chunking order or batch size.
Data loss with xarray 2024.9.0
My dataset is multi-dimensional (TIME and WAVELENGTH). Some variables are functions of TIME only, others of both dims, within the same dataset. WAVELENGTH is fixed while TIME grows.
My input files are NetCDF4, accessed via s3fs (worth noting, as there are known serialization issues between xarray and fsspec).
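For context, the access pattern looks roughly like this (a sketch with made-up paths, not my real data; the point is only the dataset structure and the s3fs + h5netcdf combination):
import s3fs
import xarray as xr

# Made-up bucket/prefix: the real dataset is 1000+ NetCDF4 files on S3.
fs = s3fs.S3FileSystem(anon=False)
files = [fs.open(p) for p in fs.glob("s3://some-bucket/some-prefix/*.nc")]

# The h5netcdf engine is needed because the inputs are file-like objects
# coming from S3 rather than local paths.
ds = xr.open_mfdataset(files, engine="h5netcdf", combine="by_coords")

# Structure: some variables depend on TIME only, others on (TIME, WAVELENGTH);
# WAVELENGTH is fixed, TIME grows with every new batch of files.
print(ds)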
First, I'm creating a brand new Zarr dataset from a first batch of files. That Zarr dataset seems to be all good.
However, when appending to the Zarr store from an open_mfdataset call with the following options, I noticed while plotting data from the newly written store that some data is missing at seemingly random positions:
ds.to_zarr(
    self.store,
    mode="a-",
    write_empty_chunks=False,
    compute=True,
    consolidated=True,
    append_dim="TIME",
    safe_chunks=True,
)
safe_chunks is set to True because the processing runs on a Coiled cluster, and the xarray documentation recommends this option to avoid race conditions.
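For completeness, the overall write pattern is roughly the following (a sketch with hypothetical first_batch/next_batch/store names; self.store is an S3 mapping in my real code):
# First batch of files: create the Zarr store from scratch.
first_batch.to_zarr(
    store,
    mode="w",
    consolidated=True,
    safe_chunks=True,
)

# Every subsequent batch: append along TIME with the options shown above.
next_batch.to_zarr(
    store,
    mode="a-",
    append_dim="TIME",
    write_empty_chunks=False,
    compute=True,
    consolidated=True,
    safe_chunks=True,
)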
I've tried various hacks (forcing a compute, sorting, ...) without any luck. I noticed that doing a ds.sortby(["TIME", "WAVELENGTH"]) prior to writing will "drop" chunks of data in a different spot than if I sort the dataset in another order, e.g. ds.sortby(["WAVELENGTH", "TIME"]). This makes no sense to me.
If I run the same code on the same data, the missing data is always at the same location (which could potentially exclude a race condition). However, if I change something (the size of my batch of files, or the order of the dims as explained above), the data loss appears in another spot of the dataset.
The only way I've succeeded in not having chunks disappear silently has been to process and append to the Zarr store one NetCDF file at a time, rather than opening a batch of NetCDF4 files with open_mfdataset (note that I have to use the h5netcdf engine, as the input files are on S3).
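The per-file workaround that does not lose data looks roughly like this (a sketch; fs, file_paths and store are hypothetical names standing in for my real objects):
import xarray as xr

# Workaround: append one NetCDF file at a time instead of a batch opened
# with open_mfdataset. Much slower, but no chunks go missing.
for path in file_paths:
    with fs.open(path) as f:
        ds_single = xr.open_dataset(f, engine="h5netcdf")
        ds_single.to_zarr(
            store,
            mode="a-",
            append_dim="TIME",
            consolidated=True,
            safe_chunks=True,
        )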
See, for example, what I mean by data loss. Note that none of the plots should have any gaps.
I added checks on the ds prior to calling to_zarr to ensure the variables don't contain the NaN values the plot suggests, and I can confirm they do not. The problem lies in the to_zarr() call.
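The kind of check I ran before the write, to rule out NaNs already being present in the input (a sketch; the variable names here are placeholders):
# Sanity check just before to_zarr(): the variables plotted above contain
# no NaNs in the incoming dataset, so the gaps must be introduced by the write.
for name in ["SOME_VAR_TIME_ONLY", "SOME_VAR_TIME_WAVELENGTH"]:
    n_nan = int(ds[name].isnull().sum().compute())
    assert n_nan == 0, f"{name} already has {n_nan} NaNs before to_zarr()"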
Upgrading xarray to 2025.6.1: different behaviour
Every xarray version after 2024.9.0 has caused chunking problems in my work, but I decided to try again.
If I upgrade to a version of xarray strictly greater than 2024.9.0 (in my case 2025.6.1), it's impossible for me to append to the Zarr dataset with these options:
ds.to_zarr(
    self.store,
    mode="a-",
    write_empty_chunks=False,
    compute=True,
    consolidated=True,
    append_dim="TIME",
    safe_chunks=True,
)
I get:
File "/home/lbesnard/miniforge3/envs/AodnCloudOptimised/lib/python3.12/site-packages/xarray/backends/chunks.py", line 237, in validate_grid_chunks_alignment
raise ValueError(
ValueError: Specified Zarr chunks encoding['chunks']=(10000,) for variable named 'LONGITUDE' would overlap multiple Dask chunks. Check the chunk at position 0, which has a size of 10000 on dimension 0. It is unaligned with backend chunks of size
10000 in region slice(199134, None, None). Writing this array in parallel with Dask could lead to corrupted data. To resolve this issue, consider one of the following options: - Rechunk the array using `chunk()`. - Modify or delete `encoding['c
hunks']`. - Set `safe_chunks=False`. - Enable automatic chunks alignment with `align_chunks=True`.
First, I'll admit that I don't understand this error message, which I've seen so often. In this scenario I'm only appending data, not overwriting existing chunks. Regardless, I tried to proceed.
The tip to delete the encoding is, I find, not helpful. I used safe_chunks=False, which seems counter-intuitive since I'm running my process on a cluster, and it goes against the documentation. I don't even understand how one could decide not to use safe_chunks if disabling it can corrupt the data.
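For what it's worth, here is what I understand the suggested fixes to amount to (a sketch of my understanding only, assuming the store's TIME chunk size is the 10000 from the message; none of this is verified to prevent the data loss):
# The appended region starts at index 199134 in the store, which is not a
# multiple of the 10000-element chunk size, so a plain ds.chunk({"TIME": 10000})
# on the new data still straddles the store's chunk boundaries.

# Option A: drop the chunk encoding inherited from the NetCDF inputs and let
# xarray/zarr pick the chunks for the appended region.
for name in ds.variables:
    ds[name].encoding.pop("chunks", None)

# Option B: ask xarray to split/realign the Dask chunks itself (the option
# added in #9914), which I also tried (see below), without success.
ds.to_zarr(
    store,
    mode="a-",
    append_dim="TIME",
    consolidated=True,
    align_chunks=True,
)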
The doc mentions:
Note: Even with these validations it can still be unsafe to write two or more chunked arrays in the same location in parallel if they are not writing in independent regions, for those cases it is better to use a synchronizer.
However, I've tried many times to find good information on how to use such a synchronizer with a remote Dask cluster, without success. So I ended up assuming that the Dask scheduler would probably synchronize the writes correctly (but this is out of scope for this issue, I think).
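For reference, the only synchronizer examples I could find look like the sketch below; as far as I understand, zarr-python v2 synchronizers are file-lock based and only coordinate writers that share a local filesystem, which is not the case for workers on a remote Coiled cluster:
import zarr

# zarr v2 synchronizers (ThreadSynchronizer, ProcessSynchronizer) rely on
# local locks, so they presumably cannot coordinate remote Dask workers.
sync = zarr.ProcessSynchronizer("/tmp/zarr_append.sync")

ds.to_zarr(
    store,
    mode="a-",
    append_dim="TIME",
    consolidated=True,
    safe_chunks=True,
    synchronizer=sync,
)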
Anyway, at this stage I'm desperate, so I gave it a go. I also tried the new align_chunks feature (which I was following in #9914), without any success.
safe_chunks=False is a bit more successful (maybe 1 run out of 5). Sometimes I don't seem to have data loss (but maybe I do somewhere else in the dataset), sometimes I do. At the moment, I have absolutely no confidence whatsoever in the Zarr dataset I'm creating.
Again, another example of the randomness of the data loss:
MVCE?
I understand the need to create an MVCE. However, when an issue like this appears in the context of 1000+ files being processed with dask and xarray on a cluster, it is near impossible for a "normal" xarray user (not a dev) to achieve.
This seems to share some similarities with #8882.
What did you expect to happen?
No response
Minimal Complete Verifiable Example
MVCE confirmation
- Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
- Complete example — the example is self-contained, including all data and the text of any traceback.
- Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
- New issue — a search of GitHub Issues suggests this is not a duplicate.
- Recent environment — the issue occurs with the latest version of xarray and its dependencies.
Relevant log output
Anything else we need to know?
No response
Environment
with xarray 2025.6.1
xarray: 2025.6.1
pandas: 2.3.0
numpy: 1.26.4
scipy: 1.15.3
netCDF4: 1.6.5
pydap: None
h5netcdf: 1.6.1
h5py: 3.11.0
zarr: 2.18.3
cftime: 1.6.4.post1
nc_time_axis: 1.4.1
iris: None
bottleneck: 1.5.0
dask: 2025.5.1
distributed: 2025.5.1
matplotlib: 3.10.3
cartopy: 0.24.1
seaborn: 0.13.2
numbagg: 0.9.0
fsspec: 2025.3.2
cupy: None
pint: None
sparse: None
flox: 0.10.4
numpy_groupies: 0.11.3
setuptools: 80.9.0
pip: 25.1.1
conda: None
pytest: 8.4.0
mypy: 1.16.1
IPython: 7.34.0
sphinx: 8.1.3