Description
What happened?
Problem summary
Appending to a Zarr dataset with xarray.to_zarr() in a Dask cluster context sometimes leads to silent data loss, depending on chunking order or batch size.
Data loss with xarray 2024.9.0
My dataset is multi-dimensional (TIME and WAVELENGTH). Some variables are functions of TIME only, others of both dims, within the same dataset. WAVELENGTH is fixed while TIME grows.
My input files are NetCDF4, accessed via s3fs (worth noting, as there are known serialization issues between xarray and fsspec).
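For context, the access pattern looks roughly like this (a sketch with made-up paths, not my real data; the point is only the dataset structure and the s3fs + h5netcdf combination):
import s3fs
import xarray as xr

# Made-up bucket/prefix: the real dataset is 1000+ NetCDF4 files on S3.
fs = s3fs.S3FileSystem(anon=False)
files = [fs.open(p) for p in fs.glob("s3://some-bucket/some-prefix/*.nc")]

# The h5netcdf engine is needed because the inputs are file-like objects
# coming from S3 rather than local paths.
ds = xr.open_mfdataset(files, engine="h5netcdf", combine="by_coords")

# Structure: some variables depend on TIME only, others on (TIME, WAVELENGTH);
# WAVELENGTH is fixed, TIME grows with every new batch of files.
print(ds)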
First, I'm creating a brand new Zarr dataset from a first batch of files. That Zarr dataset seems to be all good.
However, when appending to the Zarr store from an open_mfdataset call with the following options, I noticed while plotting data from the newly written store that some data is missing at seemingly random positions:
ds.to_zarr(
    self.store,
    mode="a-",
    write_empty_chunks=False,
    compute=True,
    consolidated=True,
    append_dim="TIME",
    safe_chunks=True,
)
safe_chunks is set to True because the processing runs on a Coiled cluster, and the xarray documentation recommends this option to avoid race conditions.
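For completeness, the overall write pattern is roughly the following (a sketch with hypothetical first_batch/next_batch/store names; self.store is an S3 mapping in my real code):
# First batch of files: create the Zarr store from scratch.
first_batch.to_zarr(
    store,
    mode="w",
    consolidated=True,
    safe_chunks=True,
)

# Every subsequent batch: append along TIME with the options shown above.
next_batch.to_zarr(
    store,
    mode="a-",
    append_dim="TIME",
    write_empty_chunks=False,
    compute=True,
    consolidated=True,
    safe_chunks=True,
)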
I've tried various hacks (forcing a compute, sorting, ...) without any luck. I noticed that doing a ds.sortby(["TIME", "WAVELENGTH"]) prior to writing will "drop" chunks of data in a different spot than if I sort the dataset in another order, e.g. ds.sortby(["WAVELENGTH", "TIME"]). This makes no sense to me.
If I run the same code on the same data, the missing data is always at the same location (which could potentially exclude a race condition). However, if I change something (the size of my batch of files, or the order of the dims as explained above), the data loss appears in another spot of the dataset.
The only way I've succeeded in not having chunks disappear silently has been to process and append to the Zarr store one NetCDF file at a time, rather than opening a batch of NetCDF4 files with open_mfdataset (note that I have to use the h5netcdf engine, as the input files are on S3).
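The per-file workaround that does not lose data looks roughly like this (a sketch; fs, file_paths and store are hypothetical names standing in for my real objects):
import xarray as xr

# Workaround: append one NetCDF file at a time instead of a batch opened
# with open_mfdataset. Much slower, but no chunks go missing.
for path in file_paths:
    with fs.open(path) as f:
        ds_single = xr.open_dataset(f, engine="h5netcdf")
        ds_single.to_zarr(
            store,
            mode="a-",
            append_dim="TIME",
            consolidated=True,
            safe_chunks=True,
        )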
See, for example, what I mean by data loss. Note that none of the plots should have any gaps.
I added checks on the ds prior to calling to_zarr to ensure the variables don't contain the NaN values the plot suggests, and I can confirm they do not. The problem lies in the to_zarr() call.
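The kind of check I ran before the write, to rule out NaNs already being present in the input (a sketch; the variable names here are placeholders):
# Sanity check just before to_zarr(): the variables plotted above contain
# no NaNs in the incoming dataset, so the gaps must be introduced by the write.
for name in ["SOME_VAR_TIME_ONLY", "SOME_VAR_TIME_WAVELENGTH"]:
    n_nan = int(ds[name].isnull().sum().compute())
    assert n_nan == 0, f"{name} already has {n_nan} NaNs before to_zarr()"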
Upgrading xarray to 2025.6.1: different behaviour
Every xarray version after 2024.9.0 has caused chunking problems in my work, but I decided to try again.
If I upgrade to a version of xarray strictly greater than 2024.9.0 (in my case 2025.6.1), it's impossible for me to append to the Zarr dataset with these options:
ds.to_zarr(
    self.store,
    mode="a-",
    write_empty_chunks=False,
    compute=True,
    consolidated=True,
    append_dim="TIME",
    safe_chunks=True,
)
I get:
File "/home/lbesnard/miniforge3/envs/AodnCloudOptimised/lib/python3.12/site-packages/xarray/backends/chunks.py", line 237, in validate_grid_chunks_alignment
raise ValueError(
ValueError: Specified Zarr chunks encoding['chunks']=(10000,) for variable named 'LONGITUDE' would overlap multiple Dask chunks. Check the chunk at position 0, which has a size of 10000 on dimension 0. It is unaligned with backend chunks of size
10000 in region slice(199134, None, None). Writing this array in parallel with Dask could lead to corrupted data. To resolve this issue, consider one of the following options: - Rechunk the array using `chunk()`. - Modify or delete `encoding['c
hunks']`. - Set `safe_chunks=False`. - Enable automatic chunks alignment with `align_chunks=True`.
First, I'll admit that I don't understand this error message, which I've seen so often. In this scenario I'm only appending data, not overwriting existing chunks. Regardless, I tried to proceed.
The tip to delete the encoding is, I find, not helpful. I used safe_chunks=False, which seems counter-intuitive since I'm running my process on a cluster, and it goes against the documentation. I don't even understand how one could decide not to use safe_chunks if disabling it can corrupt the data.
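For what it's worth, here is what I understand the suggested fixes to amount to (a sketch of my understanding only, assuming the store's TIME chunk size is the 10000 from the message; none of this is verified to prevent the data loss):
# The appended region starts at index 199134 in the store, which is not a
# multiple of the 10000-element chunk size, so a plain ds.chunk({"TIME": 10000})
# on the new data still straddles the store's chunk boundaries.

# Option A: drop the chunk encoding inherited from the NetCDF inputs and let
# xarray/zarr pick the chunks for the appended region.
for name in ds.variables:
    ds[name].encoding.pop("chunks", None)

# Option B: ask xarray to split/realign the Dask chunks itself (the option
# added in #9914), which I also tried (see below), without success.
ds.to_zarr(
    store,
    mode="a-",
    append_dim="TIME",
    consolidated=True,
    align_chunks=True,
)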
The doc mentions:
Note: Even with these validations it can still be unsafe to write two or more chunked arrays in the same location in parallel if they are not writing in independent regions, for those cases it is better to use a synchronizer.
However, I've tried many times to find good information on how to use such a synchronizer with a remote Dask cluster, without success. So I ended up assuming that the Dask scheduler would probably synchronize the writes correctly (but this is out of scope for this issue, I think).
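For reference, the only synchronizer examples I could find look like the sketch below; as far as I understand, zarr-python v2 synchronizers are file-lock based and only coordinate writers that share a local filesystem, which is not the case for workers on a remote Coiled cluster:
import zarr

# zarr v2 synchronizers (ThreadSynchronizer, ProcessSynchronizer) rely on
# local locks, so they presumably cannot coordinate remote Dask workers.
sync = zarr.ProcessSynchronizer("/tmp/zarr_append.sync")

ds.to_zarr(
    store,
    mode="a-",
    append_dim="TIME",
    consolidated=True,
    safe_chunks=True,
    synchronizer=sync,
)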
Anyway, at this stage I'm desperate, so I gave it a go. I also tried the new align_chunks feature (which I was following in #9914), without any success.
safe_chunks=False is a bit more successful (maybe 1 run out of 5). Sometimes I don't seem to have data loss (but maybe I do somewhere else in the dataset), sometimes I do. At the moment, I have absolutely no confidence whatsoever in the Zarr dataset I'm creating.
Again, another example of the randomness of the data loss:
MVCE?
I understand the need to create an MVCE. However, when an issue like this appears in the context of 1000+ files being processed with dask and xarray on a cluster, it is near impossible for a "normal" xarray user (not a dev) to achieve.
This seems to share some similarities with #8882.
What did you expect to happen?
No response
Minimal Complete Verifiable Example
MVCE confirmation
- Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
- Complete example — the example is self-contained, including all data and the text of any traceback.
- Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
- New issue — a search of GitHub Issues suggests this is not a duplicate.
- Recent environment — the issue occurs with the latest version of xarray and its dependencies.
Relevant log output
Anything else we need to know?
No response
Environment
with xarray 2025.6.1
xarray: 2025.6.1
pandas: 2.3.0
numpy: 1.26.4
scipy: 1.15.3
netCDF4: 1.6.5
pydap: None
h5netcdf: 1.6.1
h5py: 3.11.0
zarr: 2.18.3
cftime: 1.6.4.post1
nc_time_axis: 1.4.1
iris: None
bottleneck: 1.5.0
dask: 2025.5.1
distributed: 2025.5.1
matplotlib: 3.10.3
cartopy: 0.24.1
seaborn: 0.13.2
numbagg: 0.9.0
fsspec: 2025.3.2
cupy: None
pint: None
sparse: None
flox: 0.10.4
numpy_groupies: 0.11.3
setuptools: 80.9.0
pip: 25.1.1
conda: None
pytest: 8.4.0
mypy: 1.16.1
IPython: 7.34.0
sphinx: 8.1.3