
Commit 6f0e753

Merge branch 'main' into micro_datasetisel
* main: Split out distributed writes in zarr docs (#9132)
2 parents 314c72b + 3fd162e


doc/user-guide/io.rst

Lines changed: 91 additions & 84 deletions
@@ -741,6 +741,65 @@ instance and pass this, as follows:
 .. _Google Cloud Storage: https://cloud.google.com/storage/
 .. _gcsfs: https://github.com/fsspec/gcsfs

+.. _io.zarr.distributed_writes:
+
+Distributed writes
+~~~~~~~~~~~~~~~~~~
+
+Xarray will natively use dask to write in parallel to a zarr store, which should
+satisfy most moderately sized datasets. For more flexible parallelization, we
+can use ``region`` to write to limited regions of arrays in an existing Zarr
+store.
+
+To scale this up to writing large datasets, first create an initial Zarr store
+without writing all of its array data. This can be done by first creating a
+``Dataset`` with dummy values stored in :ref:`dask <dask>`, and then calling
+``to_zarr`` with ``compute=False`` to write only metadata (including ``attrs``)
+to Zarr:
+
+.. ipython:: python
+    :suppress:
+
+    ! rm -rf path/to/directory.zarr
+
+.. ipython:: python
+
+    import dask.array
+
+    # The values of this dask array are entirely irrelevant; only the dtype,
+    # shape and chunks are used
+    dummies = dask.array.zeros(30, chunks=10)
+    ds = xr.Dataset({"foo": ("x", dummies)}, coords={"x": np.arange(30)})
+    path = "path/to/directory.zarr"
+    # Now we write the metadata without computing any array values
+    ds.to_zarr(path, compute=False)
+
+Now, a Zarr store with the correct variable shapes and attributes exists that
+can be filled out by subsequent calls to ``to_zarr``.
+Setting ``region="auto"`` will open the existing store and determine the
+correct alignment of the new data with the existing dimensions; alternatively,
+``region`` can be an explicit mapping from dimension names to Python ``slice``
+objects indicating where the data should be written (in index space, not label
+space), e.g.,
+
+.. ipython:: python
+
+    # For convenience, we'll slice a single dataset, but in the real use-case
+    # we would create them separately, possibly even from separate processes.
+    ds = xr.Dataset({"foo": ("x", np.arange(30))}, coords={"x": np.arange(30)})
+    # Any of the following region specifications are valid
+    ds.isel(x=slice(0, 10)).to_zarr(path, region="auto")
+    ds.isel(x=slice(10, 20)).to_zarr(path, region={"x": "auto"})
+    ds.isel(x=slice(20, 30)).to_zarr(path, region={"x": slice(20, 30)})
+
+Concurrent writes with ``region`` are safe as long as they modify distinct
+chunks in the underlying Zarr arrays (or use an appropriate ``lock``).
+
+As a safety check to make it harder to inadvertently override existing values,
+if you set ``region`` then *all* variables included in a Dataset must have
+dimensions included in ``region``. Other variables (typically coordinates)
+need to be explicitly dropped and/or written in separate calls to ``to_zarr``
+with ``mode='a'``.

 Zarr Compressors and Filters
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~

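The new section describes concurrent region writes from independent processes but only demonstrates them serially. A minimal sketch of the parallel pattern, assuming the store initialized above with ``compute=False``; the ``write_block`` helper and the use of ``ProcessPoolExecutor`` are illustrative, not from the docs:

.. code:: python

    from concurrent.futures import ProcessPoolExecutor

    import numpy as np
    import xarray as xr

    path = "path/to/directory.zarr"  # store created above with compute=False


    def write_block(start, stop):
        # Each worker builds (or loads) only its own slice of the data ...
        block = xr.Dataset(
            {"foo": ("x", np.arange(start, stop))},
            coords={"x": np.arange(start, stop)},
        )
        # ... and writes it to the matching region of the existing store;
        # region="auto" aligns the block with the store via its "x" coordinate.
        block.to_zarr(path, region="auto")


    if __name__ == "__main__":
        # The slices fall on chunk boundaries (chunks of 10), so the
        # concurrent writes touch distinct Zarr chunks and need no lock.
        with ProcessPoolExecutor() as pool:
            list(pool.map(write_block, [0, 10, 20], [10, 20, 30]))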
@@ -767,37 +826,6 @@ For example:
 Not all native zarr compression and filtering options have been tested with
 xarray.

-.. _io.zarr.consolidated_metadata:
-
-Consolidated Metadata
-~~~~~~~~~~~~~~~~~~~~~
-
-Xarray needs to read all of the zarr metadata when it opens a dataset.
-In some storage mediums, such as with cloud object storage (e.g. `Amazon S3`_),
-this can introduce significant overhead, because two separate HTTP calls to the
-object store must be made for each variable in the dataset.
-By default Xarray uses a feature called
-*consolidated metadata*, storing all metadata for the entire dataset with a
-single key (by default called ``.zmetadata``). This typically drastically speeds
-up opening the store. (For more information on this feature, consult the
-`zarr docs on consolidating metadata <https://zarr.readthedocs.io/en/latest/tutorial.html#consolidating-metadata>`_.)
-
-By default, xarray writes consolidated metadata and attempts to read stores
-with consolidated metadata, falling back to use non-consolidated metadata for
-reads. Because this fall-back option is so much slower, xarray issues a
-``RuntimeWarning`` with guidance when reading with consolidated metadata fails:
-
-    Failed to open Zarr store with consolidated metadata, falling back to try
-    reading non-consolidated metadata. This is typically much slower for
-    opening a dataset. To silence this warning, consider:
-
-    1. Consolidating metadata in this existing store with
-       :py:func:`zarr.consolidate_metadata`.
-    2. Explicitly setting ``consolidated=False``, to avoid trying to read
-       consolidate metadata.
-    3. Explicitly setting ``consolidated=True``, to raise an error in this case
-       instead of falling back to try reading non-consolidated metadata.

 .. _io.zarr.appending:

 Modifying existing Zarr stores
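One way to satisfy the safety check quoted in the hunks above: drop variables that have no dimension in the region, then write them in a separate call with ``mode='a'``. A minimal sketch, reusing the store from the distributed-writes example; the scalar variable ``scale`` is hypothetical:

.. code:: python

    import numpy as np
    import xarray as xr

    path = "path/to/directory.zarr"
    ds = xr.Dataset(
        {"foo": ("x", np.arange(30)), "scale": ((), 2.0)},
        coords={"x": np.arange(30)},
    )
    # "scale" has no dimension covered by the region, so it must be
    # dropped for the region write ...
    ds.drop_vars("scale").isel(x=slice(0, 10)).to_zarr(path, region="auto")
    # ... and written separately, appending to the existing store
    ds[["scale"]].to_zarr(path, mode="a")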
@@ -856,59 +884,6 @@ order, e.g., for time-stepping a simulation:
     )
     ds2.to_zarr("path/to/directory.zarr", append_dim="t")

-Finally, you can use ``region`` to write to limited regions of existing arrays
-in an existing Zarr store. This is a good option for writing data in parallel
-from independent processes.
-
-To scale this up to writing large datasets, the first step is creating an
-initial Zarr store without writing all of its array data. This can be done by
-first creating a ``Dataset`` with dummy values stored in :ref:`dask <dask>`,
-and then calling ``to_zarr`` with ``compute=False`` to write only metadata
-(including ``attrs``) to Zarr:
-
-.. ipython:: python
-    :suppress:
-
-    ! rm -rf path/to/directory.zarr
-
-.. ipython:: python
-
-    import dask.array
-
-    # The values of this dask array are entirely irrelevant; only the dtype,
-    # shape and chunks are used
-    dummies = dask.array.zeros(30, chunks=10)
-    ds = xr.Dataset({"foo": ("x", dummies)}, coords={"x": np.arange(30)})
-    path = "path/to/directory.zarr"
-    # Now we write the metadata without computing any array values
-    ds.to_zarr(path, compute=False)
-
-Now, a Zarr store with the correct variable shapes and attributes exists that
-can be filled out by subsequent calls to ``to_zarr``.
-Setting ``region="auto"`` will open the existing store and determine the
-correct alignment of the new data with the existing coordinates, or as an
-explicit mapping from dimension names to Python ``slice`` objects indicating
-where the data should be written (in index space, not label space), e.g.,
-
-.. ipython:: python
-
-    # For convenience, we'll slice a single dataset, but in the real use-case
-    # we would create them separately possibly even from separate processes.
-    ds = xr.Dataset({"foo": ("x", np.arange(30))}, coords={"x": np.arange(30)})
-    # Any of the following region specifications are valid
-    ds.isel(x=slice(0, 10)).to_zarr(path, region="auto")
-    ds.isel(x=slice(10, 20)).to_zarr(path, region={"x": "auto"})
-    ds.isel(x=slice(20, 30)).to_zarr(path, region={"x": slice(20, 30)})
-
-Concurrent writes with ``region`` are safe as long as they modify distinct
-chunks in the underlying Zarr arrays (or use an appropriate ``lock``).
-
-As a safety check to make it harder to inadvertently override existing values,
-if you set ``region`` then *all* variables included in a Dataset must have
-dimensions included in ``region``. Other variables (typically coordinates)
-need to be explicitly dropped and/or written in a separate calls to ``to_zarr``
-with ``mode='a'``.

 .. _io.zarr.writing_chunks:

 Specifying chunks in a zarr store
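For the chunk-specification section this hunk leads into, a minimal sketch of pinning on-disk chunks via ``encoding``, using the ``-1`` full-length shorthand mentioned in the hunk header below; the store path, variable, and dimension names are hypothetical:

.. code:: python

    import dask.array
    import xarray as xr

    # 12 time steps in two dask chunks of 6; one dask chunk along "cell"
    tair = dask.array.zeros((12, 100), chunks=(6, 100))
    ds = xr.Dataset({"tair": (("time", "cell"), tair)})
    # On-disk chunks may differ from the dask chunks as long as each zarr
    # chunk fits within a single dask chunk: here 3 steps along "time",
    # and -1 as shorthand for one full-length chunk along "cell".
    ds.to_zarr(
        "path/to/chunked.zarr",
        encoding={"tair": {"chunks": (3, -1)}},
        mode="w",
    )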
@@ -978,6 +953,38 @@ length of each dimension by using the shorthand chunk size ``-1``:
 The number of chunks on Tair matches our dask chunks, while there is now only a single
 chunk in the directory stores of each coordinate.

+.. _io.zarr.consolidated_metadata:
+
+Consolidated Metadata
+~~~~~~~~~~~~~~~~~~~~~
+
+Xarray needs to read all of the zarr metadata when it opens a dataset.
+In some storage media, such as cloud object storage (e.g. `Amazon S3`_),
+this can introduce significant overhead, because two separate HTTP calls to the
+object store must be made for each variable in the dataset.
+By default Xarray uses a feature called
+*consolidated metadata*, storing all metadata for the entire dataset with a
+single key (by default called ``.zmetadata``). This typically drastically speeds
+up opening the store. (For more information on this feature, consult the
+`zarr docs on consolidating metadata <https://zarr.readthedocs.io/en/latest/tutorial.html#consolidating-metadata>`_.)
+
+By default, xarray writes consolidated metadata and attempts to read stores
+with consolidated metadata, falling back to non-consolidated metadata for
+reads. Because this fall-back option is so much slower, xarray issues a
+``RuntimeWarning`` with guidance when reading with consolidated metadata fails:
+
+    Failed to open Zarr store with consolidated metadata, falling back to try
+    reading non-consolidated metadata. This is typically much slower for
+    opening a dataset. To silence this warning, consider:
+
+    1. Consolidating metadata in this existing store with
+       :py:func:`zarr.consolidate_metadata`.
+    2. Explicitly setting ``consolidated=False``, to avoid trying to read
+       consolidated metadata.
+    3. Explicitly setting ``consolidated=True``, to raise an error in this case
+       instead of falling back to try reading non-consolidated metadata.
+

 .. _io.iris:

 Iris
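To act on the ``RuntimeWarning`` guidance added above, a minimal sketch of the three options; ``zarr.consolidate_metadata`` and the ``consolidated`` argument of ``xr.open_zarr`` are the APIs the warning itself refers to, while the store path is hypothetical:

.. code:: python

    import xarray as xr
    import zarr

    path = "path/to/directory.zarr"

    # 1. Consolidate the metadata in-place so future opens are fast
    zarr.consolidate_metadata(path)

    # 2. Skip the consolidated-metadata lookup (and the warning) entirely
    ds = xr.open_zarr(path, consolidated=False)

    # 3. Require consolidated metadata, raising an error if it is missing
    ds = xr.open_zarr(path, consolidated=True)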
