@@ -741,6 +741,65 @@ instance and pass this, as follows:
.. _Google Cloud Storage: https://cloud.google.com/storage/
.. _gcsfs: https://github.com/fsspec/gcsfs

+ .. _io.zarr.distributed_writes:
+
+ Distributed writes
+ ~~~~~~~~~~~~~~~~~~
+
+ Xarray will natively use dask to write in parallel to a zarr store, which should
+ satisfy most moderately sized datasets. For more flexible parallelization, we
+ can use ``region`` to write to limited regions of arrays in an existing Zarr
+ store.
+
+ To scale this up to writing large datasets, first create an initial Zarr store
+ without writing all of its array data. This can be done by first creating a
+ ``Dataset`` with dummy values stored in :ref:`dask <dask>`, and then calling
+ ``to_zarr`` with ``compute=False`` to write only metadata (including ``attrs``)
+ to Zarr:
+
+ .. ipython:: python
+     :suppress:
+
+     !rm -rf path/to/directory.zarr
+
+ .. ipython:: python
+
+     import dask.array
+
+     # The values of this dask array are entirely irrelevant; only the dtype,
+     # shape and chunks are used
+     dummies = dask.array.zeros(30, chunks=10)
+     ds = xr.Dataset({"foo": ("x", dummies)}, coords={"x": np.arange(30)})
+     path = "path/to/directory.zarr"
+     # Now we write the metadata without computing any array values
+     ds.to_zarr(path, compute=False)
+
+ Now, a Zarr store with the correct variable shapes and attributes exists that
+ can be filled out by subsequent calls to ``to_zarr``.
+ The ``region`` argument can be set to ``"auto"``, which will open the existing
+ store and determine the correct alignment of the new data with the existing
+ dimensions, or to an explicit mapping from dimension names to Python ``slice``
+ objects indicating where the data should be written (in index space, not label
+ space), e.g.,
+
+ .. ipython:: python
+
+     # For convenience, we'll slice a single dataset, but in the real use-case
+     # we would create them separately, possibly even from separate processes.
+     ds = xr.Dataset({"foo": ("x", np.arange(30))}, coords={"x": np.arange(30)})
+     # Any of the following region specifications are valid
+     ds.isel(x=slice(0, 10)).to_zarr(path, region="auto")
+     ds.isel(x=slice(10, 20)).to_zarr(path, region={"x": "auto"})
+     ds.isel(x=slice(20, 30)).to_zarr(path, region={"x": slice(20, 30)})
+
+ Concurrent writes with ``region`` are safe as long as they modify distinct
+ chunks in the underlying Zarr arrays (or use an appropriate ``lock``).
+
+ As a safety check to make it harder to inadvertently override existing values,
+ if you set ``region`` then *all* variables included in a Dataset must have
+ dimensions included in ``region``. Other variables (typically coordinates)
+ need to be explicitly dropped and/or written in separate calls to ``to_zarr``
+ with ``mode='a'``.
+
Zarr Compressors and Filters
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -767,37 +826,6 @@ For example:
Not all native zarr compression and filtering options have been tested with
xarray.

- .. _io.zarr.consolidated_metadata:
-
- Consolidated Metadata
- ~~~~~~~~~~~~~~~~~~~~~
-
- Xarray needs to read all of the zarr metadata when it opens a dataset.
- In some storage mediums, such as with cloud object storage (e.g. `Amazon S3`_),
- this can introduce significant overhead, because two separate HTTP calls to the
- object store must be made for each variable in the dataset.
- By default Xarray uses a feature called
- *consolidated metadata*, storing all metadata for the entire dataset with a
- single key (by default called ``.zmetadata``). This typically drastically speeds
- up opening the store. (For more information on this feature, consult the
- `zarr docs on consolidating metadata <https://zarr.readthedocs.io/en/latest/tutorial.html#consolidating-metadata>`_.)
-
- By default, xarray writes consolidated metadata and attempts to read stores
- with consolidated metadata, falling back to use non-consolidated metadata for
- reads. Because this fall-back option is so much slower, xarray issues a
- ``RuntimeWarning`` with guidance when reading with consolidated metadata fails:
-
-     Failed to open Zarr store with consolidated metadata, falling back to try
-     reading non-consolidated metadata. This is typically much slower for
-     opening a dataset. To silence this warning, consider:
-
-     1. Consolidating metadata in this existing store with
-        :py:func:`zarr.consolidate_metadata`.
-     2. Explicitly setting ``consolidated=False``, to avoid trying to read
-        consolidate metadata.
-     3. Explicitly setting ``consolidated=True``, to raise an error in this case
-        instead of falling back to try reading non-consolidated metadata.
-

.. _io.zarr.appending:

Modifying existing Zarr stores
@@ -856,59 +884,6 @@ order, e.g., for time-stepping a simulation:
    )
    ds2.to_zarr("path/to/directory.zarr", append_dim="t")

- Finally, you can use ``region`` to write to limited regions of existing arrays
- in an existing Zarr store. This is a good option for writing data in parallel
- from independent processes.
-
- To scale this up to writing large datasets, the first step is creating an
- initial Zarr store without writing all of its array data. This can be done by
- first creating a ``Dataset`` with dummy values stored in :ref:`dask <dask>`,
- and then calling ``to_zarr`` with ``compute=False`` to write only metadata
- (including ``attrs``) to Zarr:
-
- .. ipython:: python
-     :suppress:
-
-     !rm -rf path/to/directory.zarr
-
- .. ipython:: python
-
-     import dask.array
-
-     # The values of this dask array are entirely irrelevant; only the dtype,
-     # shape and chunks are used
-     dummies = dask.array.zeros(30, chunks=10)
-     ds = xr.Dataset({"foo": ("x", dummies)}, coords={"x": np.arange(30)})
-     path = "path/to/directory.zarr"
-     # Now we write the metadata without computing any array values
-     ds.to_zarr(path, compute=False)
-
- Now, a Zarr store with the correct variable shapes and attributes exists that
- can be filled out by subsequent calls to ``to_zarr``.
- Setting ``region="auto"`` will open the existing store and determine the
- correct alignment of the new data with the existing coordinates, or as an
- explicit mapping from dimension names to Python ``slice`` objects indicating
- where the data should be written (in index space, not label space), e.g.,
-
- .. ipython:: python
-
-     # For convenience, we'll slice a single dataset, but in the real use-case
-     # we would create them separately possibly even from separate processes.
-     ds = xr.Dataset({"foo": ("x", np.arange(30))}, coords={"x": np.arange(30)})
-     # Any of the following region specifications are valid
-     ds.isel(x=slice(0, 10)).to_zarr(path, region="auto")
-     ds.isel(x=slice(10, 20)).to_zarr(path, region={"x": "auto"})
-     ds.isel(x=slice(20, 30)).to_zarr(path, region={"x": slice(20, 30)})
-
- Concurrent writes with ``region`` are safe as long as they modify distinct
- chunks in the underlying Zarr arrays (or use an appropriate ``lock``).
-
- As a safety check to make it harder to inadvertently override existing values,
- if you set ``region`` then *all* variables included in a Dataset must have
- dimensions included in ``region``. Other variables (typically coordinates)
- need to be explicitly dropped and/or written in a separate calls to ``to_zarr``
- with ``mode='a'``.
-

.. _io.zarr.writing_chunks:

Specifying chunks in a zarr store
@@ -978,6 +953,38 @@ length of each dimension by using the shorthand chunk size ``-1``:
The number of chunks on Tair matches our dask chunks, while there is now only a single
chunk in the directory stores of each coordinate.

+ .. _io.zarr.consolidated_metadata:
+
+ Consolidated Metadata
+ ~~~~~~~~~~~~~~~~~~~~~
+
+ Xarray needs to read all of the zarr metadata when it opens a dataset.
+ In some storage media, such as cloud object storage (e.g. `Amazon S3`_),
+ this can introduce significant overhead, because two separate HTTP calls to the
+ object store must be made for each variable in the dataset.
+ By default Xarray uses a feature called
+ *consolidated metadata*, storing all metadata for the entire dataset with a
+ single key (by default called ``.zmetadata``). This typically drastically speeds
+ up opening the store. (For more information on this feature, consult the
+ `zarr docs on consolidating metadata <https://zarr.readthedocs.io/en/latest/tutorial.html#consolidating-metadata>`_.)
+
+ By default, xarray writes consolidated metadata and attempts to read stores
+ with consolidated metadata, falling back to non-consolidated metadata for
+ reads. Because this fall-back option is so much slower, xarray issues a
+ ``RuntimeWarning`` with guidance when reading with consolidated metadata fails:
+
+     Failed to open Zarr store with consolidated metadata, falling back to try
+     reading non-consolidated metadata. This is typically much slower for
+     opening a dataset. To silence this warning, consider:
+
+     1. Consolidating metadata in this existing store with
+        :py:func:`zarr.consolidate_metadata`.
+     2. Explicitly setting ``consolidated=False``, to avoid trying to read
+        consolidated metadata.
+     3. Explicitly setting ``consolidated=True``, to raise an error in this case
+        instead of falling back to try reading non-consolidated metadata.
+
+
.. _io.iris:

Iris