xarray.align Inflating Source NetCDF Data #8176

matthewyakubiw · 2022-11-02T17:36:04Z

matthewyakubiw
Nov 2, 2022

What is your issue?

Hi there,

I've been experiencing a peculiar issue that I hope I could get some further insight on. Some background - I'm processing data from raw NetCDF with Xarray to be written to a Zarr store. The hope is that I can append new pieces of data as they are received.
The code is as follows:
First I write an empty Zarr store to prepare for incoming data. Each of the dimensions/coordinates are pre-populated with all known values. Dimension w is one million in length. The others vary in length but are generally < 50

empty_ds = xarray.Dataset(
    coords=dict(
        a=(a), b=(b), c=(c), d=(d), e=(e), f=(f), g=(g),  h=(h), w=(w)
    ),
    attrs=dict(description=f"Dataset for {a[0]}")
)

ds.to_zarr('./zarr', compute=False, mode="w")

I then need to open individual NetCDF files, whose coordinates are a subset of those that exist in the Zarr store. As you can see below, the dataset is ~8MB per file.

I then perform an alignment between the Zarr store, which has all coordinate values, and the individual NetCDF files contents to make sure dimensions and coordinates match:

ds1 = xarray.open_mfdataset('data.nc')
zarr_ds = xarray.open_zarr(".zarr/") 

a, b = xarray.align(zarr_ds, ds1, join="outer", exclude="h") #we exclude h because its the dimension we will append by
b.to_zarr("./zarr/", mode="a", append_dim="h")

These operations work, however the new aligned dataset b that I want to write to the Zarr store has blown up in size as a result:

And after writing back to the Zarr store, results in a large file, whose side should theoretically be ~8MB before compression.

Does anyone have insight as to why this might be happening? I've tried changing chunking settings when both opening and writing data, changing the dtypes of dimensions, etc. My hunch is that it has something to do with the dimension that is one million in length. For context, the data variable contains one million data points that correspond to the one million values of w. Thanks for the help!

max-sixty · 2022-11-02T18:57:48Z

max-sixty
Nov 2, 2022
Maintainer

Even without looking at the size of the data, the dimensions of the array seem to have grown a lot!

The size of the data seems consistent with that (i.e. 2*27*4*3/2*48*8MB=125GB, though not exactly 58GB, looking at the difference in the "Array Shape" between the two images)

Does looking at the values of the dimension that's grown to size 48 help — are those possibly different, even though you don't intend them to be, and causing it to be huge (but sparse!)?

Does that make sense?

0 replies

matthewyakubiw · 2022-11-02T20:09:36Z

matthewyakubiw
Nov 2, 2022
Author

The size of the data seems consistent with that (i.e. 2*27*4*3/2*48*8MB=125GB, though not exactly 58GB, looking at the difference in the "Array Shape" between the two images)

This would be correct if we added that many 8MB data sets, but we just added one! So I'd expect the entire store to be roughly that size (8MB) before compression since in the code example above we are just adding one data array to the Zarr data store. Would you be able to clarify this calculation 2*27*4*3/2*48*8MB=125GB ?

Does looking at the values of the dimension that's grown to size 48 help — are those possibly different, even though you don't intend them to be, and causing it to be huge (but sparse!)?

Does that make sense?

So the reason why all the dimension sizes have grown is because we initialize the Zarr with the set of all possibilities that the coordinates might be (read somewhere that this is what you should do - initialize an empty Zarr with the coords/dimensions you will need). Then what I figured I needed to do was align the dimensions of the .nc file with that of the Zarr store, so that when the data array is added the coordinates and dimensions line up - this is why we see the dimensions grow.

Is this the correct approach? I found the limitation/design of to_zarr() that allows you to only append to one dimension meant that if you want to append new data to a Zarr store, all dimensions/coords must pre-exist in both the Zarr store and individual DataSet that you'd like to append. This is what drives the decision to perform xarray.align.

Without aligning, I would get error messages that complained about dimension sizes being different. Thanks again for your help!

0 replies

dcherian · 2022-11-03T15:00:47Z

dcherian
Nov 3, 2022
Maintainer

I think perhaps the region kwarg is a better solution than appending for your case: https://docs.xarray.dev/en/stable/user-guide/io.html#appending-to-existing-zarr-stores

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

xarray.align Inflating Source NetCDF Data #8176

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Replies: 3 comments

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Uh oh!

xarray.align Inflating Source NetCDF Data #8176

Uh oh!

Uh oh!

matthewyakubiw Nov 2, 2022

What is your issue?

Replies: 3 comments

Uh oh!

max-sixty Nov 2, 2022 Maintainer

Uh oh!

matthewyakubiw Nov 2, 2022 Author

Uh oh!

dcherian Nov 3, 2022 Maintainer

matthewyakubiw
Nov 2, 2022

max-sixty
Nov 2, 2022
Maintainer

matthewyakubiw
Nov 2, 2022
Author

dcherian
Nov 3, 2022
Maintainer