Commit 0b6eb02

Add a rechunking example (#681)
1 parent 9f67ec8 commit 0b6eb02

2 files changed: 70 additions, 0 deletions

docs/examples/index.md

Lines changed: 1 addition & 0 deletions
@@ -10,5 +10,6 @@ how-to-run
 basic-array-ops
 zarr
 xarray
+rechunking
 pangeo
 ```

docs/examples/rechunking.md

Lines changed: 69 additions & 0 deletions
@@ -0,0 +1,69 @@
---
file_format: mystnb
kernelspec:
  name: python3
---
# Rechunking

This example uses Xarray to rechunk a dataset.

Install the package pre-requisites by running the following:

```shell
pip install cubed cubed-xarray xarray pooch netCDF4
```

## Open dataset

Start by importing Xarray - note that we don't need to import Cubed or `cubed-xarray`, since they will be picked up automatically.

```{code-cell} ipython3
import xarray as xr

xr.set_options(display_expand_attrs=False);
```

We open an Xarray dataset (in netCDF format) using the usual `open_dataset` function. By specifying `chunks={}` we ensure that the dataset is chunked using the on-disk chunking (here it is the netCDF file chunking). The `chunked_array_type` argument specifies which chunked array type to use - Cubed in this case.

```{code-cell} ipython3
ds = xr.tutorial.open_dataset(
    "air_temperature", chunked_array_type="cubed", chunks={}
)
```

The `air` data variable is a `cubed.Array`, and we can see that this small dataset has a single on-disk chunk.

```{code-cell} ipython3
ds["air"]
```

## Rechunk

To change the chunking we use Xarray's `chunk` function:

```{code-cell} ipython3
rds = ds.chunk({'time':1}, chunked_array_type="cubed")
```

Looking at the `air` data variable again, we can see that it is now chunked along the time dimension.

```{code-cell} ipython3
rds["air"]
```
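
The effect of a chunk specification such as `{'time': 1}` can be worked out by hand. The following is a minimal pure-Python sketch, not Xarray's or Cubed's implementation: `expand_chunks` is a hypothetical helper, and the dimension sizes (2920 x 25 x 53) are assumed for the tutorial's `air` variable. It treats any dimension missing from the specification as a single block, which matches this example only because the dataset started as one on-disk chunk.

```python
import math


def expand_chunks(sizes, spec):
    """Return {dim: tuple of block lengths} for the given dimension sizes.

    `spec` maps a dimension name to a target block length. Dimensions that
    are not mentioned are kept as one block spanning the whole dimension
    (an assumption that fits a dataset starting out as a single chunk).
    """
    chunks = {}
    for dim, size in sizes.items():
        block = spec.get(dim, size)
        if block <= 0:
            raise ValueError(f"block length for {dim!r} must be positive")
        full, rest = divmod(size, block)
        # `full` blocks of the requested length, plus a shorter final block
        # when the dimension size is not an exact multiple.
        chunks[dim] = (block,) * full + ((rest,) if rest else ())
    return chunks


# Assumed dimension sizes for the `air` variable of the tutorial dataset.
sizes = {"time": 2920, "lat": 25, "lon": 53}

chunks = expand_chunks(sizes, {"time": 1})
n_blocks = math.prod(len(blocks) for blocks in chunks.values())

print(len(chunks["time"]), chunks["lat"], chunks["lon"])  # 2920 (25,) (53,)
print(n_blocks)  # 2920
```

Under these assumptions the rechunked variable has 2920 blocks, one per time step, so a store written with this chunking would be expected to hold roughly that many chunks for `air`.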

## Save to Zarr

Since Cubed has a lazy computation model, the data has not been loaded from disk yet. We can save a copy of the rechunked dataset by calling `to_zarr`:

```{code-cell} ipython3
rds.to_zarr("rechunked_air_temperature.zarr", mode="w", consolidated=True);
```

This will run a computation that loads the input data and writes it out to a Zarr store on the local filesystem with the new chunking. We can check that it worked by re-loading from disk using `xarray.open_dataset` and checking that the chunks are the same:

```{code-cell} ipython3
ds = xr.open_dataset(
    "rechunked_air_temperature.zarr", chunked_array_type="cubed", chunks={}
)
assert ds.chunks == rds.chunks
```
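
`Dataset.chunks` is a mapping from dimension name to a tuple of block lengths, so the assertion above compares two such mappings. A bare `assert` does not say which dimension differs when the check fails. The sketch below shows one way to report that, using plain dictionaries in place of real datasets; `chunk_mismatches` is a hypothetical helper and the chunk tuples are made-up illustrative values, not output from the tutorial dataset.

```python
def chunk_mismatches(expected, actual):
    """Return {dim: (expected_blocks, actual_blocks)} for dims that differ.

    Both arguments are mappings from dimension name to a tuple of block
    lengths, the same shape of data that `Dataset.chunks` returns.
    A dimension missing from one side is reported with None for that side.
    """
    dims = sorted(set(expected) | set(actual))
    return {
        dim: (expected.get(dim), actual.get(dim))
        for dim in dims
        if expected.get(dim) != actual.get(dim)
    }


# Illustrative values only: four time steps, one block each.
written = {"time": (1, 1, 1, 1), "lat": (25,), "lon": (53,)}
reloaded_same = dict(written)
reloaded_merged = {"time": (4,), "lat": (25,), "lon": (53,)}

print(chunk_mismatches(written, reloaded_same))    # {}
print(chunk_mismatches(written, reloaded_merged))  # {'time': ((1, 1, 1, 1), (4,))}
```

With real datasets, the call would be `chunk_mismatches(rds.chunks, ds.chunks)`; an empty result corresponds to the assertion passing.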
