Commit 6798466

d-v-b, jhamman, ilan-gold, and rabernat authored
refactor v3 data types (#2874)
* modernize typing
* lint
* new dtypes
* rename base dtype, change type to kind
* start working on JSON serialization
* get json de/serialization largely working, and start making tests pass
* tweak json type guards
* fix dtype sizes, adjust fill value parsing in from_dict, fix tests
* mid-refactor commit
* working form for dtype classes
* remove unused code
* use wrap / unwrap instead of to_dtype / from_dtype; push into v2 codebase
* push into v2
* remove endianness kwarg to methods, make it an instance variable instead
* make wrapping safe by default
* dtype-specific tests
* more tests, fix void type default value logic
* fix dtype mechanics in bytescodec
* remove __post_init__ magic in favor of more explicit declaration
* fix tests
* refactor data types
* start design doc
* more design doc
* update docs
* fix sphinx warnings
* tweak docs
* info about v3 data types
* adjust note
* fix: use unparametrized types in direct assignment
* start fixing config
* Update src/zarr/core/_info.py
  Co-authored-by: Joe Hamman <jhamman1@gmail.com>
* add placeholder disclaimer to v3 data types summary
* make example runnable
* placeholder section for adding a custom dtype
* define native data type and native scalar
* update data type names
* fix config test failures
* call to_dtype once in blosc evolve_from_array_spec
* refactor dtypewrapper -> zdtype
* update code examples in docs; remove native endianness
* adjust type annotations
* fix info tests to use zdtype
* remove dead code and add code coverage exemption to zarr format checks
* fix: add special check for resolving int32 on windows
* add dtype entry point test
* remove default parameters for parametric dtypes; add mixin classes for numpy dtypes; define zdtypelike
* Update docs/user-guide/data_types.rst
  Co-authored-by: Ilan Gold <ilanbassgold@gmail.com>
* refactor: use inheritance to remove boilerplate in dtype definitions
* update data types documentation, and expose core/dtype module to autodoc
* add failing endianness round-trip test
* fix endianness
* additional check in test_explicit_endianness
* add failing test for round-tripping vlen strings
* route object dtype arrays to vlen string dtype when numpy > 2
* relax endianness mismatch to a warning instead of an error
* use public dtype module for docs instead of special-casing the core dype module
* use public dtype module for docs instead of special-casing the core dype module
* silence mypy error about array indexing
* add release note
* fix doctests, excluding config tests
* revert addition of linkage between dtype endianness and bytes codec endianness
* remove Any types
* add docstring for wrapper module
* simplify config and docs
* update config test
* fix S dtype test for v2
* fully remove v3jsonencoder
* refactor dtype module structure
* add timedelta64
* refactor time dtypes
* widen dtype test strategies
* modify structured dtype fill value rt to avoid to_dict
* wip: begin creating isomorphic test suite for dtypes
* finish common tests
* wip: test infrastructure for dtypes
* wip: use class-based tests for all dtypes
* fill out more tests, and adjust sized dtypes
* wip: json schema test
* add casting tests
* use relative link for changes
* typo
* make bytes codec dtype logic a bit more literate
* increase deadline to 500ms
* fewer commented sections of problematic lru_store_cache section of the sharding codecs
* add link to gh issue about lru_cache for sharding codec
* attempt to speed up hypothesis tests by reducing max array size
* clean up docs
* remove placeholder
* make final example section doctested and more readable
* revert change to auto chunking
* revert quotation of literal type
* lint
* fix broken code block
* specialize test to handle stringdtype changes coming in numpy 2.3
* add docstring to _TestZDType class
* type hints
* expand changelog
* tweak docstring
* support v3 nan strings in JSON for float dtypes
* revert removal of metadata chunk grid attribute
* use none to denote default fill value; remove old structured tests; use cast_value where appropriate
* add item size abstraction
* rename fixed-length string dtypes, and be strict about the numpy object dtype (i.e., refuse to match it)
* remove vestigial use of to_dtype().itemsize()
* remove another vestigial use of to_dtype().itemsize()
* emit warning about unstable dtype when serializing Structured dtype to JSON
* put string dtypes in the strings module
* make tests isomorphic to source code
* remove old string logic
* use scale_factor and unit in cast_value for datetime
* add regression testing against v2.18
* truncate U and S scalars in _cast_value_unsafe
* docstrings and simplification for regression tests
* changes necessary for linting with regression tests
* improve method names, refactor type hints with typeddictionaries, fix registry load frequency, add object_codec_id for v2 json deserialization
* fix storage info discrepancy in docs
* fix docstring that was troubling sphinx
* wip: add vlen-bytes
* add vlen-bytes
* replace placeholder text with links to a github issue
* refactor fixed-length bytes dtypes
* more v3 unstable dtype warnings, and their exemptions from tests
* clean up typeddicts
* update docstrings
* Update docs/user-guide/data_types.rst
  Co-authored-by: Ryan Abernathey <ryan.abernathey@gmail.com>
* refactor wrapper to allow subclasses to freely define their own type guards for native dtype and json input
* make method definition order consistent
* allow structured scalars to be np.void
* use a common function signature for from_json by packing the object_codec_id in a typeddict for zarr v2 metadata
* fix dtype doc example

---------

Co-authored-by: Joe Hamman <jhamman1@gmail.com>
Co-authored-by: Ilan Gold <ilanbassgold@gmail.com>
Co-authored-by: Ryan Abernathey <ryan.abernathey@gmail.com>
1 parent 11d488d commit 6798466

80 files changed

+7067
-1647
lines changed

changes/2874.feature.rst

Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
1+
Adds zarr-specific data type classes. This replaces the internal use of numpy data types for zarr
2+
v2 and a fixed set of string enums for zarr v3. This change is largely internal, but it does
3+
change the type of the ``dtype`` and ``data_type`` fields on the ``ArrayV2Metadata`` and
4+
``ArrayV3Metadata`` classes. It also changes the JSON metadata representation of the
5+
variable-length string data type, but the old metadata representation can still be
6+
used when reading arrays. The logic for automatically choosing the chunk encoding for a given data
7+
type has also changed, and this necessitated changes to the ``config`` API.
8+
9+
For more on this new feature, see the `documentation </user-guide/data_types.html>`_

docs/user-guide/arrays.rst

Lines changed: 7 additions & 7 deletions
@@ -182,7 +182,7 @@ which can be used to print useful diagnostics, e.g.::
182182
>>> z.info
183183
Type : Array
184184
Zarr format : 3
185-
Data type : DataType.int32
185+
Data type : Int32(endianness='little')
186186
Fill value : 0
187187
Shape : (10000, 10000)
188188
Chunk shape : (1000, 1000)
@@ -200,7 +200,7 @@ prints additional diagnostics, e.g.::
200200
>>> z.info_complete()
201201
Type : Array
202202
Zarr format : 3
203-
Data type : DataType.int32
203+
Data type : Int32(endianness='little')
204204
Fill value : 0
205205
Shape : (10000, 10000)
206206
Chunk shape : (1000, 1000)
@@ -248,7 +248,7 @@ built-in delta filter::
248248
The default compressor can be changed using Zarr's
249249
:ref:`user-guide-config`, e.g.::
250250

251-
>>> with zarr.config.set({'array.v2_default_compressor.numeric': {'id': 'blosc'}}):
251+
>>> with zarr.config.set({'array.v2_default_compressor.default': {'id': 'blosc'}}):
252252
... z = zarr.create_array(store={}, shape=(100000000,), chunks=(1000000,), dtype='int32', zarr_format=2)
253253
>>> z.filters
254254
()
@@ -288,7 +288,7 @@ Here is an example using a delta filter with the Blosc compressor::
288288
>>> z.info
289289
Type : Array
290290
Zarr format : 3
291-
Data type : DataType.int32
291+
Data type : Int32(endianness='little')
292292
Fill value : 0
293293
Shape : (10000, 10000)
294294
Chunk shape : (1000, 1000)
@@ -603,7 +603,7 @@ Sharded arrays can be created by providing the ``shards`` parameter to :func:`za
603603
>>> a.info_complete()
604604
Type : Array
605605
Zarr format : 3
606-
Data type : DataType.uint8
606+
Data type : UInt8()
607607
Fill value : 0
608608
Shape : (10000, 10000)
609609
Shard shape : (1000, 1000)
@@ -612,10 +612,10 @@ Sharded arrays can be created by providing the ``shards`` parameter to :func:`za
612612
Read-only : False
613613
Store type : LocalStore
614614
Filters : ()
615-
Serializer : BytesCodec(endian=<Endian.little: 'little'>)
615+
Serializer : BytesCodec(endian=None)
616616
Compressors : (ZstdCodec(level=0, checksum=False),)
617617
No. bytes : 100000000 (95.4M)
618-
No. bytes stored : 3981552
618+
No. bytes stored : 3981473
619619
Storage ratio : 25.1
620620
Shards Initialized : 100
621621
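As a quick sanity check on the figures in this ``info_complete()`` output, the following sketch recomputes the human-readable size and the storage ratio from the two byte counts shown in the updated lines. The binary-megabyte convention (2**20 bytes) and the one-decimal rounding are assumptions inferred from the printed values, not taken from the Zarr source.

```python
# Values copied from the updated info_complete() output.
n_bytes = 10000 * 10000 * 1      # shape (10000, 10000), 1 byte per uint8 element
n_bytes_stored = 3981473

# Assumption: "M" denotes 2**20 bytes and values are shown to one decimal place.
size_in_m = round(n_bytes / 2**20, 1)
storage_ratio = round(n_bytes / n_bytes_stored, 1)

print(f"{size_in_m}M")   # 95.4M
print(storage_ratio)     # 25.1
```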

docs/user-guide/config.rst

Lines changed: 25 additions & 34 deletions
@@ -43,39 +43,30 @@ This is the current default configuration::
4343

4444
>>> zarr.config.pprint()
4545
{'array': {'order': 'C',
46-
'v2_default_compressor': {'bytes': {'checksum': False,
47-
'id': 'zstd',
48-
'level': 0},
49-
'numeric': {'checksum': False,
50-
'id': 'zstd',
51-
'level': 0},
52-
'string': {'checksum': False,
46+
'v2_default_compressor': {'default': {'checksum': False,
5347
'id': 'zstd',
54-
'level': 0}},
55-
'v2_default_filters': {'bytes': [{'id': 'vlen-bytes'}],
56-
'numeric': None,
57-
'raw': None,
58-
'string': [{'id': 'vlen-utf8'}]},
59-
'v3_default_compressors': {'bytes': [{'configuration': {'checksum': False,
60-
'level': 0},
61-
'name': 'zstd'}],
62-
'numeric': [{'configuration': {'checksum': False,
48+
'level': 0},
49+
'variable-length-string': {'checksum': False,
50+
'id': 'zstd',
51+
'level': 0}},
52+
'v2_default_filters': {'default': None,
53+
'variable-length-string': [{'id': 'vlen-utf8'}]},
54+
'v3_default_compressors': {'default': [{'configuration': {'checksum': False,
6355
'level': 0},
6456
'name': 'zstd'}],
65-
'string': [{'configuration': {'checksum': False,
66-
'level': 0},
67-
'name': 'zstd'}]},
68-
'v3_default_filters': {'bytes': [], 'numeric': [], 'string': []},
69-
'v3_default_serializer': {'bytes': {'name': 'vlen-bytes'},
70-
'numeric': {'configuration': {'endian': 'little'},
71-
'name': 'bytes'},
72-
'string': {'name': 'vlen-utf8'}},
73-
'write_empty_chunks': False},
74-
'async': {'concurrency': 10, 'timeout': None},
75-
'buffer': 'zarr.core.buffer.cpu.Buffer',
76-
'codec_pipeline': {'batch_size': 1,
77-
'path': 'zarr.core.codec_pipeline.BatchedCodecPipeline'},
78-
'codecs': {'blosc': 'zarr.codecs.blosc.BloscCodec',
57+
'variable-length-string': [{'configuration': {'checksum': False,
58+
'level': 0},
59+
'name': 'zstd'}]},
60+
'v3_default_filters': {'default': [], 'variable-length-string': []},
61+
'v3_default_serializer': {'default': {'configuration': {'endian': 'little'},
62+
'name': 'bytes'},
63+
'variable-length-string': {'name': 'vlen-utf8'}},
64+
'write_empty_chunks': False},
65+
'async': {'concurrency': 10, 'timeout': None},
66+
'buffer': 'zarr.core.buffer.cpu.Buffer',
67+
'codec_pipeline': {'batch_size': 1,
68+
'path': 'zarr.core.codec_pipeline.BatchedCodecPipeline'},
69+
'codecs': {'blosc': 'zarr.codecs.blosc.BloscCodec',
7970
'bytes': 'zarr.codecs.bytes.BytesCodec',
8071
'crc32c': 'zarr.codecs.crc32c_.Crc32cCodec',
8172
'endian': 'zarr.codecs.bytes.BytesCodec',
@@ -85,7 +76,7 @@ This is the current default configuration::
8576
'vlen-bytes': 'zarr.codecs.vlen_utf8.VLenBytesCodec',
8677
'vlen-utf8': 'zarr.codecs.vlen_utf8.VLenUTF8Codec',
8778
'zstd': 'zarr.codecs.zstd.ZstdCodec'},
88-
'default_zarr_format': 3,
89-
'json_indent': 2,
90-
'ndbuffer': 'zarr.core.buffer.cpu.NDBuffer',
91-
'threading': {'max_workers': None}}
79+
'default_zarr_format': 3,
80+
'json_indent': 2,
81+
'ndbuffer': 'zarr.core.buffer.cpu.NDBuffer',
82+
'threading': {'max_workers': None}}
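The new layout keys each default by a data type category, with ``default`` covering everything except ``variable-length-string``. The sketch below shows one way such a lookup could work on a plain dictionary. The ``pick_default`` helper, its fallback rule, and the ``"int32"`` category name are hypothetical and are not taken from Zarr's implementation.

```python
# A trimmed copy of two entries from the configuration printed above.
config = {
    "v3_default_serializer": {
        "default": {"name": "bytes", "configuration": {"endian": "little"}},
        "variable-length-string": {"name": "vlen-utf8"},
    },
    "v3_default_filters": {"default": [], "variable-length-string": []},
}


def pick_default(config: dict, setting: str, category: str):
    """Hypothetical helper: return the entry for `category`, else the 'default' entry."""
    options = config[setting]
    return options.get(category, options["default"])


# A category with its own entry is returned as-is.
print(pick_default(config, "v3_default_serializer", "variable-length-string"))
# Any other category (here a made-up "int32") falls back to 'default'.
print(pick_default(config, "v3_default_serializer", "int32"))
```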

docs/user-guide/consolidated_metadata.rst

Lines changed: 3 additions & 3 deletions
@@ -47,7 +47,7 @@ that can be used.:
4747
>>> from pprint import pprint
4848
>>> pprint(dict(sorted(consolidated_metadata.items())))
4949
{'a': ArrayV3Metadata(shape=(1,),
50-
data_type=<DataType.float64: 'float64'>,
50+
data_type=Float64(endianness='little'),
5151
chunk_grid=RegularChunkGrid(chunk_shape=(1,)),
5252
chunk_key_encoding=DefaultChunkKeyEncoding(name='default',
5353
separator='/'),
@@ -60,7 +60,7 @@ that can be used.:
6060
node_type='array',
6161
storage_transformers=()),
6262
'b': ArrayV3Metadata(shape=(2, 2),
63-
data_type=<DataType.float64: 'float64'>,
63+
data_type=Float64(endianness='little'),
6464
chunk_grid=RegularChunkGrid(chunk_shape=(2, 2)),
6565
chunk_key_encoding=DefaultChunkKeyEncoding(name='default',
6666
separator='/'),
@@ -73,7 +73,7 @@ that can be used.:
7373
node_type='array',
7474
storage_transformers=()),
7575
'c': ArrayV3Metadata(shape=(3, 3, 3),
76-
data_type=<DataType.float64: 'float64'>,
76+
data_type=Float64(endianness='little'),
7777
chunk_grid=RegularChunkGrid(chunk_shape=(3, 3, 3)),
7878
chunk_key_encoding=DefaultChunkKeyEncoding(name='default',
7979
separator='/'),

docs/user-guide/data_types.rst

Lines changed: 172 additions & 0 deletions
@@ -0,0 +1,172 @@
1+
Data types
2+
==========
3+
4+
Zarr's data type model
5+
----------------------
6+
7+
Every Zarr array has a "data type", which defines the meaning and physical layout of the
8+
array's elements. As Zarr Python is tightly integrated with `NumPy <https://numpy.org/doc/stable/>`_,
9+
it's easy to create arrays with NumPy data types:
10+
11+
.. code-block:: python
12+
13+
>>> import zarr
14+
>>> import numpy as np
15+
>>> z = zarr.create_array(store={}, shape=(10,), dtype=np.dtype('uint8'))
16+
>>> z
17+
<Array memory:... shape=(10,) dtype=uint8>
18+
19+
Unlike NumPy arrays, Zarr arrays are designed to be accessed by Zarr
20+
implementations in different programming languages. This means Zarr data types must be interpreted
21+
correctly when clients read an array. Each Zarr data type defines procedures for
22+
encoding and decoding both the data type itself and scalars of that data type to and from Zarr array metadata. These serialization procedures
23+
depend on the Zarr format.
24+
25+
Data types in Zarr version 2
26+
-----------------------------
27+
28+
Version 2 of the Zarr format defined its data types relative to
29+
`NumPy's data types <https://numpy.org/doc/2.1/reference/arrays.dtypes.html#data-type-objects-dtype>`_,
30+
and added a few non-NumPy data types as well. Thus the JSON identifier for a NumPy-compatible data
31+
type is just the NumPy ``str`` attribute of that data type:
32+
33+
.. code-block:: python
34+
35+
>>> import zarr
36+
>>> import numpy as np
37+
>>> import json
38+
>>>
39+
>>> store = {}
40+
>>> np_dtype = np.dtype('int64')
41+
>>> z = zarr.create_array(store=store, shape=(1,), dtype=np_dtype, zarr_format=2)
42+
>>> dtype_meta = json.loads(store['.zarray'].to_bytes())["dtype"]
43+
>>> dtype_meta
44+
'<i8'
45+
>>> assert dtype_meta == np_dtype.str
46+
47+
.. note::
48+
The ``<`` character in the data type metadata encodes the
49+
`endianness <https://numpy.org/doc/2.2/reference/generated/numpy.dtype.byteorder.html>`_,
50+
or "byte order", of the data type. Following NumPy's example,
51+
in Zarr version 2 each data type has an endianness where applicable.
52+
However, Zarr version 3 data types do not store endianness information.
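To make the byte-order prefix concrete, the sketch below uses Python's standard ``struct`` module (rather than NumPy or Zarr) to pack the same 64-bit integer in little-endian and big-endian order. The ``struct`` format codes ``<q`` and ``>q`` are used here as stand-ins for the Zarr V2 strings ``<i8`` and ``>i8``.

```python
import struct

value = 1

little = struct.pack("<q", value)  # 8-byte signed integer, least significant byte first
big = struct.pack(">q", value)     # same integer, most significant byte first

print(little.hex())  # 0100000000000000
print(big.hex())     # 0000000000000001

# Decoding with the wrong byte order yields a different number.
print(struct.unpack(">q", little)[0])  # 72057594037927936
```

This is why a reader needs to know the byte order of multi-byte data: the same eight bytes decode to different values depending on it.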
53+
54+
In addition to defining a representation of the data type itself (which in the example above was
55+
just a simple string ``"<i8"``), Zarr also
56+
defines a metadata representation for scalars associated with each data type. This is necessary
57+
because Zarr arrays have a ``JSON``-serializable ``fill_value`` attribute that defines a scalar value to use when reading
58+
uninitialized chunks of a Zarr array.
59+
Integer and float scalars are stored as ``JSON`` numbers, except for special floats like ``NaN``,
60+
positive infinity, and negative infinity, which are stored as strings.
61+
62+
More broadly, each Zarr data type defines its own rules for how scalars of that type are stored in
63+
``JSON``.
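The rule for float scalars can be sketched with the standard library alone. The helper names below are hypothetical, and the string spellings ``"NaN"``, ``"Infinity"``, and ``"-Infinity"`` follow the Zarr specifications. This is an illustration, not Zarr Python's implementation.

```python
import json
import math


def float_to_json(value: float):
    """Encode a float fill value: ordinary numbers stay numbers, specials become strings."""
    if math.isnan(value):
        return "NaN"
    if math.isinf(value):
        return "Infinity" if value > 0 else "-Infinity"
    return value


def float_from_json(data) -> float:
    """Invert float_to_json; float() accepts the three special spellings."""
    return float(data)


encoded = [float_to_json(v) for v in (1.5, float("nan"), float("inf"), float("-inf"))]
print(json.dumps(encoded))  # [1.5, "NaN", "Infinity", "-Infinity"]
```

Strings are needed for these values because standard JSON has no literal for NaN or infinity.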
64+
65+
66+
Data types in Zarr version 3
67+
-----------------------------
68+
69+
Zarr V3 brings several key changes to how data types are represented:
70+
71+
- Zarr V3 identifies the basic data types as strings like ``"int8"``, ``"int16"``, etc.
72+
73+
By contrast, Zarr V2 uses the NumPy character code representation for data types:
74+
In Zarr V2, ``int8`` is represented as ``"|i1"``.
75+
- A Zarr V3 data type does not have endianness. This is a departure from Zarr V2, where multi-byte
76+
data types are defined with endianness information. Instead, Zarr V3 requires that endianness,
77+
where applicable, is specified in the ``codecs`` attribute of array metadata.
78+
- While some Zarr V3 data types are identified by strings, others can be identified by a ``JSON``
79+
object. For example, consider this specification of a ``datetime`` data type:
80+
81+
.. code-block:: json
82+
83+
{
84+
"name": "numpy.datetime64",
85+
"configuration": {
86+
"unit": "s",
87+
"scale_factor": 10
88+
}
89+
}
90+
91+
92+
Zarr V2 generally uses structured string representations to convey the same information. The
93+
data type given in the previous example would be represented as the string ``">M8[10s]"`` in
94+
Zarr V2. This is more compact, but can be harder to parse.
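The relationship between the two representations can be sketched as a small conversion function. The function name and the explicit ``byte_order`` argument are hypothetical (a Zarr V3 data type carries no endianness, so it has to be supplied separately), and the output follows NumPy's spelling for ``datetime64``, which includes the 8-byte item size.

```python
def datetime_v3_to_v2(spec: dict, byte_order: str = ">") -> str:
    """Build a NumPy-style datetime64 string from a V3-style JSON object."""
    unit = spec["configuration"]["unit"]
    scale_factor = spec["configuration"]["scale_factor"]
    # NumPy omits a scale factor of 1, e.g. "M8[s]" rather than "M8[1s]".
    multiplier = "" if scale_factor == 1 else str(scale_factor)
    return f"{byte_order}M8[{multiplier}{unit}]"


spec = {"name": "numpy.datetime64", "configuration": {"unit": "s", "scale_factor": 10}}
print(datetime_v3_to_v2(spec))  # >M8[10s]
```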
95+
96+
For more about data types in Zarr V3, see the
97+
`V3 specification <https://zarr-specs.readthedocs.io/en/latest/v3/data-types/index.html>`_.
98+
99+
Data types in Zarr Python
100+
-------------------------
101+
102+
The two Zarr formats that Zarr Python supports specify data types in two different ways:
103+
data types in Zarr version 2 are encoded as NumPy-compatible strings, while data types in Zarr version
104+
3 are encoded as either strings or ``JSON`` objects,
105+
and the Zarr V3 data types don't have any associated endianness information, unlike Zarr V2 data types.
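For fixed-size integers and floats, the correspondence can be sketched as a small parser that splits a Zarr V2 string into a Zarr V3 name and a byte order. The function and lookup tables are hypothetical and cover only a few kinds; they are not how Zarr Python resolves data types.

```python
from typing import Optional, Tuple

BYTE_ORDERS = {"<": "little", ">": "big", "|": None}  # "|" means byte order is not applicable
KINDS = {"i": "int", "u": "uint", "f": "float"}


def split_v2_dtype(v2_name: str) -> Tuple[str, Optional[str]]:
    """Return (v3_name, endianness) for strings such as '<i8' or '|i1'."""
    byte_order, kind, n_bytes = v2_name[0], v2_name[1], int(v2_name[2:])
    return f"{KINDS[kind]}{n_bytes * 8}", BYTE_ORDERS[byte_order]


print(split_v2_dtype("<i8"))  # ('int64', 'little')
print(split_v2_dtype("|i1"))  # ('int8', None)
```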
106+
107+
To abstract over these syntactical and semantic differences, Zarr Python uses a class called
108+
`ZDType <../api/zarr/dtype/index.html#zarr.dtype.ZDType>`_ to provide Zarr V2 and Zarr V3 compatibility
109+
routines for "native" data types. In this context, a "native" data type is a Python class,
110+
typically defined in another library, that models an array's data type. For example, ``np.uint8`` is a native
111+
data type defined in NumPy, which Zarr Python wraps with a ``ZDType`` instance called
112+
`UInt8 <../api/zarr/dtype/index.html#zarr.dtype.ZDType>`_.
113+
114+
Each data type supported by Zarr Python is modeled by ``ZDType`` subclass, which provides an
115+
API for the following operations:
116+
117+
- Wrapping / unwrapping a native data type
118+
- Encoding / decoding a data type to / from Zarr V2 and Zarr V3 array metadata.
119+
- Encoding / decoding a scalar value to / from Zarr V2 and Zarr V3 array metadata.
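The shape of that API can be illustrated with a toy class. ``ToyInt8`` below is a hypothetical stand-in written with the standard library only. It borrows the method names ``to_json``, ``to_json_scalar``, and ``from_json_scalar`` from Zarr Python's data type classes, but it is not Zarr's ``Int8``: it omits native-dtype wrapping entirely and validates nothing beyond the int8 range.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyInt8:
    """Toy model of a data type class; not part of Zarr."""

    def to_json(self, zarr_format: int):
        # V2 uses the NumPy-style string, V3 uses the plain name.
        if zarr_format == 2:
            return {"name": "|i1", "object_codec_id": None}
        if zarr_format == 3:
            return "int8"
        raise ValueError(f"unsupported zarr_format: {zarr_format}")

    def to_json_scalar(self, value: int, zarr_format: int) -> int:
        # int8 scalars are stored as JSON numbers in both formats.
        if not -128 <= value <= 127:
            raise ValueError(f"{value} is out of range for int8")
        return int(value)

    def from_json_scalar(self, data: int, zarr_format: int) -> int:
        # Decoding applies the same range check as encoding.
        return self.to_json_scalar(data, zarr_format)


toy = ToyInt8()
print(toy.to_json(zarr_format=2))             # {'name': '|i1', 'object_codec_id': None}
print(toy.to_json(zarr_format=3))             # int8
print(toy.to_json_scalar(42, zarr_format=3))  # 42
```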
120+
121+
122+
Example Usage
123+
~~~~~~~~~~~~~
124+
125+
Create a ``ZDType`` from a native data type:
126+
127+
.. code-block:: python
128+
129+
>>> from zarr.core.dtype import Int8
130+
>>> import numpy as np
131+
>>> int8 = Int8.from_native_dtype(np.dtype('int8'))
132+
133+
Convert back to native data type:
134+
135+
.. code-block:: python
136+
137+
>>> native_dtype = int8.to_native_dtype()
138+
>>> assert native_dtype == np.dtype('int8')
139+
140+
Get the default scalar value for the data type:
141+
142+
.. code-block:: python
143+
144+
>>> default_value = int8.default_scalar()
145+
>>> assert default_value == np.int8(0)
146+
147+
148+
Serialize to JSON for Zarr V2 and V3:
149+
150+
.. code-block:: python
151+
152+
>>> json_v2 = int8.to_json(zarr_format=2)
153+
>>> json_v2
154+
{'name': '|i1', 'object_codec_id': None}
155+
>>> json_v3 = int8.to_json(zarr_format=3)
156+
>>> json_v3
157+
'int8'
158+
159+
Serialize a scalar value to JSON:
160+
161+
.. code-block:: python
162+
163+
>>> json_value = int8.to_json_scalar(42, zarr_format=3)
164+
>>> json_value
165+
42
166+
167+
Deserialize a scalar value from JSON:
168+
169+
.. code-block:: python
170+
171+
>>> scalar_value = int8.from_json_scalar(42, zarr_format=3)
172+
>>> assert scalar_value == np.int8(42)

docs/user-guide/groups.rst

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -128,7 +128,7 @@ property. E.g.::
128128
>>> bar.info_complete()
129129
Type : Array
130130
Zarr format : 3
131-
Data type : DataType.int64
131+
Data type : Int64(endianness='little')
132132
Fill value : 0
133133
Shape : (1000000,)
134134
Chunk shape : (100000,)
@@ -145,7 +145,7 @@ property. E.g.::
145145
>>> baz.info
146146
Type : Array
147147
Zarr format : 3
148-
Data type : DataType.float32
148+
Data type : Float32(endianness='little')
149149
Fill value : 0.0
150150
Shape : (1000, 1000)
151151
Chunk shape : (100, 100)

docs/user-guide/index.rst

Lines changed: 1 addition & 0 deletions
@@ -8,6 +8,7 @@ User guide
88

99
installation
1010
arrays
11+
data_types
1112
groups
1213
attributes
1314
storage
