Curious how compression behaves when appending with `to_zarr`:

```python
import xarray as xr
import zarr as za

nlat = 700
nlon = 2000
nlev = 20
chunks3d = {'time': 1, 'feature': 1, 'lev': nlev, 'lat': nlat // 7, 'lon': nlon // 10}
chunks2d = {'time': 1, 'feature': 1, 'lev': 1, 'lat': nlat // 7, 'lon': nlon // 10}
compressor = za.Blosc(cname="zstd", clevel=9, shuffle=2)
g3d = 'f3d'
g2d = 'f2d'
store = 'test.zarr'
Append = False

# Each ts in datasets represents a unique variable, but is generically named 'varb3d'.
# Each varb3d has dims (time=1, feature=1, lev=nlev, lat=nlat, lon=nlon).
for ts in datasets:
    if Append:
        ts.chunk(chunks3d).to_zarr(store, group=g3d, append_dim='feature', consolidated=True)
    else:
        ts.chunk(chunks3d).to_zarr(store, group=g3d, consolidated=True,
                                   encoding={'varb3d': {"compressor": compressor}})
    Append = True
```

The above code works, but changing the compression level from 6 to 9 didn't reduce the size by much. So I'm wondering whether the first variable in the loop is actually compressed? It is not intuitive that it is. If I add the compression in the initial creation phase of the file, I get a key error. Curious whether the whole array is compressed during the append process, or whether only the appended arrays are compressed?
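
One way to check whether compression actually took effect, for the first variable and for everything appended afterwards, is to open the store with the zarr API and inspect the array's metadata. Below is a minimal sketch using the zarr-python 2.x API (matching the `za.Blosc` usage above) and assuming the `test.zarr` store, `f3d` group, and `varb3d` variable names from the snippet; the compressor is recorded in the Zarr array's metadata when the array is created, and chunks written by later `append_dim` calls are compressed with that same compressor.

```python
import zarr

# Open the existing group read-only and inspect the stored array's metadata.
grp = zarr.open_group('test.zarr', mode='r', path='f3d')
arr = grp['varb3d']

print(arr.compressor)                  # e.g. Blosc(cname='zstd', clevel=9, shuffle=BITSHUFFLE, ...)
print(arr.nbytes, arr.nbytes_stored)   # logical size vs. bytes on disk -> effective compression ratio
```

If `arr.compressor` shows the Blosc codec you passed in `encoding`, every chunk of that array, including the appended ones, is compressed with it.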
I don't know what your data source is, but in general, floating-point data doesn't compress well using lossless compression. I would not expect any meaningful size difference between clevel 6 and 9.
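
As a rough illustration of that point (a sketch, not the discussion's code: it uses synthetic random float32 data and calls `numcodecs` directly, and assumes a reasonably recent `numcodecs` is installed), raising `clevel` barely changes the compressed size, while reducing float precision, if that is acceptable for the data, does:

```python
import numpy as np
import numcodecs

# Synthetic noisy float data; real geophysical fields may compress somewhat better.
data = np.random.rand(1_000_000).astype('float32')

for clevel in (6, 9):
    codec = numcodecs.Blosc(cname='zstd', clevel=clevel, shuffle=numcodecs.Blosc.BITSHUFFLE)
    print(f"clevel={clevel}: {len(codec.encode(data))} bytes")

# Lossy alternative: keep ~10 mantissa bits before compressing (only if the lost
# precision is acceptable for the data); the rounded bytes compress far better.
rounded = numcodecs.BitRound(keepbits=10).encode(data)
codec = numcodecs.Blosc(cname='zstd', clevel=6, shuffle=numcodecs.Blosc.BITSHUFFLE)
print(f"bitround + zstd: {len(codec.encode(rounded))} bytes")
```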