|
| 1 | +Data types |
| 2 | +========== |
| 3 | + |
| 4 | +Zarr's data type model |
| 5 | +---------------------- |
| 6 | + |
| 7 | +Every Zarr array has a "data type", which defines the meaning and physical layout of the |
| 8 | +array's elements. As Zarr Python is tightly integrated with `NumPy <https://numpy.org/doc/stable/>`_, |
| 9 | +it's easy to create arrays with NumPy data types: |
| 10 | + |
| 11 | +.. code-block:: python |
| 12 | +
|
| 13 | + >>> import zarr |
| 14 | + >>> import numpy as np |
| 15 | + >>> z = zarr.create_array(store={}, shape=(10,), dtype=np.dtype('uint8')) |
| 16 | + >>> z |
| 17 | + <Array memory:... shape=(10,) dtype=uint8> |
| 18 | +
|
| 19 | +Unlike NumPy arrays, Zarr arrays are designed to be accessed by Zarr
| 20 | +implementations in different programming languages. This means Zarr data types must be interpreted
| 21 | +correctly when clients read an array. Each Zarr data type therefore defines procedures for
| 22 | +encoding and decoding both the data type itself and scalars of that data type to and from Zarr array metadata. These serialization procedures
| 23 | +depend on the Zarr format.
| 24 | + |
| 25 | +Data types in Zarr version 2 |
| 26 | +----------------------------- |
| 27 | + |
| 28 | +Version 2 of the Zarr format defined its data types relative to |
| 29 | +`NumPy's data types <https://numpy.org/doc/2.1/reference/arrays.dtypes.html#data-type-objects-dtype>`_, |
| 30 | +and added a few non-NumPy data types as well. Thus the JSON identifier for a NumPy-compatible data |
| 31 | +type is just the NumPy ``str`` attribute of that data type: |
| 32 | + |
| 33 | +.. code-block:: python |
| 34 | +
|
| 35 | + >>> import zarr |
| 36 | + >>> import numpy as np |
| 37 | + >>> import json |
| 38 | + >>> |
| 39 | + >>> store = {} |
| 40 | + >>> np_dtype = np.dtype('int64') |
| 41 | + >>> z = zarr.create_array(store=store, shape=(1,), dtype=np_dtype, zarr_format=2) |
| 42 | + >>> dtype_meta = json.loads(store['.zarray'].to_bytes())["dtype"] |
| 43 | + >>> dtype_meta |
| 44 | + '<i8' |
| 45 | + >>> assert dtype_meta == np_dtype.str |
| 46 | +
|
| 47 | +.. note:: |
| 48 | + The ``<`` character in the data type metadata encodes the |
| 49 | + `endianness <https://numpy.org/doc/2.2/reference/generated/numpy.dtype.byteorder.html>`_, |
| 50 | + or "byte order", of the data type. Following NumPy's example, |
| 51 | + in Zarr version 2 each data type has an endianness where applicable. |
| 52 | + However, Zarr version 3 data types do not store endianness information. |
| 53 | + |
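To make the byte-order prefix concrete, the sketch below packs the same 64-bit integer in
little-endian (``<``) and big-endian (``>``) order. It uses only the ``struct`` module from the
Python standard library, which is not part of Zarr and is used here purely for illustration:

.. code-block:: python

    import struct

    # "<q" and ">q" are struct format strings for a signed 64-bit integer
    # in little-endian and big-endian byte order, respectively.
    little = struct.pack("<q", 1)
    big = struct.pack(">q", 1)

    # Same value, reversed byte sequence.
    assert little == big[::-1]
    assert struct.unpack("<q", little) == struct.unpack(">q", big) == (1,)
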
| 54 | +In addition to defining a representation of the data type itself (which in the example above was |
| 55 | +just a simple string ``"<i8"``), Zarr also |
| 56 | +defines a metadata representation for scalars associated with each data type. This is necessary |
| 57 | +because Zarr arrays have a ``JSON``-serializable ``fill_value`` attribute that defines a scalar value to use when reading |
| 58 | +uninitialized chunks of a Zarr array. |
| 59 | +Integer and float scalars are stored as ``JSON`` numbers, except for special floats like ``NaN``, |
| 60 | +positive infinity, and negative infinity, which are stored as strings. |
| 61 | + |
| 62 | +More broadly, each Zarr data type defines its own rules for how scalars of that type are stored in |
| 63 | +``JSON``. |
| 64 | + |
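The rule for special floats can be sketched in plain Python. The helper ``encode_float_scalar``
below is a hypothetical name used only for illustration and is not part of Zarr Python's API.
The string spellings ``"NaN"``, ``"Infinity"``, and ``"-Infinity"`` are the ones the Zarr
version 2 specification gives for float fill values:

.. code-block:: python

    import json
    import math
    from typing import Union


    def encode_float_scalar(value: float) -> Union[float, str]:
        """Return a JSON-compatible representation of a float scalar."""
        if math.isnan(value):
            return "NaN"
        if math.isinf(value):
            return "Infinity" if value > 0 else "-Infinity"
        return value


    encoded = [encode_float_scalar(v) for v in (1.5, float("nan"), float("-inf"))]
    # Strict JSON has no literal for NaN or infinity, so strings are used instead.
    text = json.dumps(encoded, allow_nan=False)
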
| 65 | + |
| 66 | +Data types in Zarr version 3 |
| 67 | +----------------------------- |
| 68 | + |
| 69 | +Zarr V3 brings several key changes to how data types are represented: |
| 70 | + |
| 71 | +- Zarr V3 identifies the basic data types as strings like ``"int8"``, ``"int16"``, etc. |
| 72 | + |
| 73 | + By contrast, Zarr V2 uses the NumPy character code representation for data types.
| 74 | + For example, in Zarr V2, ``int8`` is represented as ``"|i1"``.
| 75 | +- A Zarr V3 data type does not have endianness. This is a departure from Zarr V2, where multi-byte |
| 76 | + data types are defined with endianness information. Instead, Zarr V3 requires that endianness, |
| 77 | + where applicable, is specified in the ``codecs`` attribute of array metadata. |
| 78 | +- While some Zarr V3 data types are identified by strings, others can be identified by a ``JSON`` |
| 79 | + object. For example, consider this specification of a ``datetime`` data type: |
| 80 | + |
| 81 | + .. code-block:: json |
| 82 | +
|
| 83 | + { |
| 84 | + "name": "numpy.datetime64", |
| 85 | + "configuration": { |
| 86 | + "unit": "s", |
| 87 | + "scale_factor": 10 |
| 88 | + } |
| 89 | + } |
| 90 | +
|
| 91 | +
|
| 92 | + Zarr V2 generally uses structured string representations to convey the same information. The |
| 93 | + data type given in the previous example would be represented as the string ``">M8[10s]"`` in
| 94 | + Zarr V2. This is more compact, but can be harder to parse. |
| 95 | + |
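The first two differences above can be sketched as a conversion from a Zarr V2 data type string
to a Zarr V3 name plus a byte order. The helper ``split_v2_dtype`` and its lookup tables are
hypothetical, cover only a handful of integer and float types, and are not how Zarr Python
implements this. The ``"little"`` and ``"big"`` values mirror the ``endian`` setting of the
``bytes`` codec in the Zarr V3 specification:

.. code-block:: python

    from typing import Optional, Tuple

    # Partial mapping from NumPy type code and item size to Zarr V3 names.
    V3_NAMES = {
        "i1": "int8", "i2": "int16", "i4": "int32", "i8": "int64",
        "u1": "uint8", "f4": "float32", "f8": "float64",
    }

    # NumPy byte-order characters; "|" means byte order is not applicable.
    BYTE_ORDERS = {"<": "little", ">": "big", "|": None}


    def split_v2_dtype(v2_dtype: str) -> Tuple[str, Optional[str]]:
        """Split a V2 dtype string such as "<i8" into (V3 name, byte order)."""
        return V3_NAMES[v2_dtype[1:]], BYTE_ORDERS[v2_dtype[0]]


    assert split_v2_dtype("<i8") == ("int64", "little")
    assert split_v2_dtype("|i1") == ("int8", None)
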
| 96 | +For more about data types in Zarr V3, see the |
| 97 | +`V3 specification <https://zarr-specs.readthedocs.io/en/latest/v3/data-types/index.html>`_. |
| 98 | + |
| 99 | +Data types in Zarr Python |
| 100 | +------------------------- |
| 101 | + |
| 102 | +The two Zarr formats that Zarr Python supports specify data types in different ways:
| 103 | +Zarr version 2 encodes data types as NumPy-compatible strings, while Zarr version
| 104 | +3 encodes them as either strings or ``JSON`` objects.
| 105 | +In addition, Zarr V3 data types carry no endianness information, unlike Zarr V2 data types.
| 106 | + |
| 107 | +To abstract over these syntactical and semantic differences, Zarr Python uses a class called
| 108 | +`ZDType <../api/zarr/dtype/index.html#zarr.dtype.ZDType>`_ to provide Zarr V2 and Zarr V3 compatibility
| 109 | +routines for "native" data types. In this context, a "native" data type is a Python class,
| 110 | +typically defined in another library, that models an array's data type. For example, ``np.uint8`` is a native
| 111 | +data type defined in NumPy, which Zarr Python wraps with a ``ZDType`` subclass called
| 112 | +`UInt8 <../api/zarr/dtype/index.html#zarr.dtype.ZDType>`_.
| 113 | + |
| 114 | +Each data type supported by Zarr Python is modeled by a ``ZDType`` subclass, which provides an
| 115 | +API for the following operations: |
| 116 | + |
| 117 | +- Wrapping / unwrapping a native data type
| 118 | +- Encoding / decoding a data type to / from Zarr V2 and Zarr V3 array metadata
| 119 | +- Encoding / decoding a scalar value to / from Zarr V2 and Zarr V3 array metadata
| 120 | + |
| 121 | + |
| 122 | +Example Usage |
| 123 | +~~~~~~~~~~~~~ |
| 124 | + |
| 125 | +Create a ``ZDType`` from a native data type: |
| 126 | + |
| 127 | +.. code-block:: python |
| 128 | +
|
| 129 | + >>> from zarr.core.dtype import Int8 |
| 130 | + >>> import numpy as np |
| 131 | + >>> int8 = Int8.from_native_dtype(np.dtype('int8')) |
| 132 | +
|
| 133 | +Convert back to native data type: |
| 134 | + |
| 135 | +.. code-block:: python |
| 136 | +
|
| 137 | + >>> native_dtype = int8.to_native_dtype() |
| 138 | + >>> assert native_dtype == np.dtype('int8') |
| 139 | +
|
| 140 | +Get the default scalar value for the data type: |
| 141 | + |
| 142 | +.. code-block:: python |
| 143 | +
|
| 144 | + >>> default_value = int8.default_scalar() |
| 145 | + >>> assert default_value == np.int8(0) |
| 146 | +
|
| 147 | +
|
| 148 | +Serialize to JSON for Zarr V2 and V3:
| 149 | + |
| 150 | +.. code-block:: python |
| 151 | +
|
| 152 | + >>> json_v2 = int8.to_json(zarr_format=2) |
| 153 | + >>> json_v2 |
| 154 | + {'name': '|i1', 'object_codec_id': None} |
| 155 | + >>> json_v3 = int8.to_json(zarr_format=3) |
| 156 | + >>> json_v3 |
| 157 | + 'int8' |
| 158 | +
|
| 159 | +Serialize a scalar value to JSON: |
| 160 | + |
| 161 | +.. code-block:: python |
| 162 | +
|
| 163 | + >>> json_value = int8.to_json_scalar(42, zarr_format=3) |
| 164 | + >>> json_value |
| 165 | + 42 |
| 166 | +
|
| 167 | +Deserialize a scalar value from JSON: |
| 168 | + |
| 169 | +.. code-block:: python |
| 170 | +
|
| 171 | + >>> scalar_value = int8.from_json_scalar(42, zarr_format=3) |
| 172 | + >>> assert scalar_value == np.int8(42) |