Converter for MDIO dataset spec to Xarray Dataset and serialize it to Zarr #571

dmitriyrepin · 2025-07-11T20:16:36Z

Outstanding work

Add more unit tests for internal functions
Add a test comparing expected and actual .zmetadata for the serialized dataset

src/mdio/schemas/v1/dataset_serializer.py

BrianMichell · 2025-07-11T21:35:12Z

src/mdio/schemas/v1/dataset_serializer.py

+    if isinstance(data_type, ScalarType):
+        return fill_value_map.get(data_type)
+    if isinstance(data_type, StructuredType):
+        return "AAAAAAAAAAAAAAAA"  # BUG: this does not work!!!


https://github.com/TGSAI/mdio-cpp/blob/main/mdio/dataset_factory.h#L306C5-L317C6
Here's the implementation in C++.

I think this may handle what we need: https://github.com/zarr-developers/zarr-python/blob/ea4d7e96c0738526bbf08bb99ed0ea23a6081836/src/zarr/core/dtype/npy/common.py#L241-L258

+1 on doing the encoding with the zarr function. The fill values of structured types are all base64 encoded, and the zarr function @BrianMichell pointed at handles this iirc.
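To illustrate why a single hard-coded string cannot serve every structured type, here is a minimal stdlib-only sketch. It assumes the desired fill value is an all-zero record with a packed layout, and that the metadata stores the record's raw bytes as base64, as the comment above describes. SCALAR_SIZES, zero_fill_value, and the field names are hypothetical and are not part of MDIO or zarr; the linked zarr helper is still the better fit for the actual serializer.

```python
import base64

# Hypothetical byte widths for a few scalar type names; the real MDIO
# ScalarType enum may use different names or cover more types.
SCALAR_SIZES = {
    "int8": 1, "int16": 2, "int32": 4, "int64": 8,
    "float32": 4, "float64": 8,
}


def zero_fill_value(fields: list[tuple[str, str]]) -> str:
    """Return the base64 text for an all-zero record with the given fields."""
    # Assumes a packed layout: the record size is the sum of the field sizes.
    itemsize = sum(SCALAR_SIZES[type_name] for _, type_name in fields)
    return base64.standard_b64encode(bytes(itemsize)).decode("ascii")


# A 12-byte record reproduces the hard-coded string from the diff.
print(zero_fill_value([("cdp_x", "int32"), ("cdp_y", "int32"), ("elevation", "float32")]))
# prints "AAAAAAAAAAAAAAAA"

# A 16-byte record encodes to a longer string, so the constant cannot be shared.
print(zero_fill_value([("cdp_x", "float64"), ("cdp_y", "float64")]))
```

The first call shows that "AAAAAAAAAAAAAAAA" is simply twelve zero bytes in base64, so the constant only matches structured types whose records happen to be 12 bytes wide.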

BrianMichell · 2025-07-13T19:57:47Z

tests/unit/v1/test_dataset_serializer.py

+
+    # file_name = "XYZ"
+    file_name = f"{xr_ds.attrs['name']}"
+    to_zarr(xr_ds, f"test-data/{file_name}.zarr", mode="w")


We should be using the pytest tmp_path fixture. That way we don't need to add test-data to .gitignore or manually manage test artifacts.

+1 on this, but an even simpler way would be to use an in-memory store in Zarr, like memory://path_to_zarr.
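A minimal sketch of the tmp_path suggestion follows. To stay runnable without xarray or zarr installed, it uses a hypothetical stand-in, write_store, in place of the PR's to_zarr; the dataset dictionary and the name "campos_3d" are also illustrative assumptions. In the real test, the call would presumably become to_zarr(xr_ds, tmp_path / f"{file_name}.zarr", mode="w").

```python
import json
from pathlib import Path


def write_store(dataset: dict, path: Path) -> None:
    """Hypothetical stand-in for to_zarr(): writes a metadata file only."""
    path.mkdir(parents=True, exist_ok=True)
    (path / ".zmetadata").write_text(json.dumps({"attrs": dataset["attrs"]}))


def test_store_is_written(tmp_path: Path) -> None:
    """pytest injects tmp_path, a fresh per-test directory that it manages."""
    dataset = {"attrs": {"name": "campos_3d"}}  # illustrative dataset only
    store = tmp_path / f"{dataset['attrs']['name']}.zarr"
    write_store(dataset, store)

    # Read the metadata back and compare it with what was written.
    metadata = json.loads((store / ".zmetadata").read_text())
    assert metadata["attrs"]["name"] == "campos_3d"
```

Because the output lands under the directory pytest hands to the test, nothing is written into the repository tree and no extra .gitignore pattern is needed.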

tasansal

looks good! I put in lots of nitpicks.

_dev/DEVELOPERS_NOTES.md

tasansal · 2025-07-14T14:06:46Z

.gitignore

please use the existing tmp directory and avoid adding more patterns to .gitignore.

_dev/zmetadata.cpp.json

_dev/zmetadata.python.json

tests/unit/v1/helpers.py

tasansal · 2025-07-14T14:10:42Z

src/mdio/schemas/v1/dataset_serializer.py

please add single-line docstrings to all functions that are missing them. please use expanded docstrings where the logic is complicated.

src/mdio/schemas/v1/dataset_serializer.py

tasansal · 2025-07-14T14:13:38Z

src/mdio/schemas/v1/dataset_serializer.py

+    if isinstance(data_type, ScalarType):
+        return fill_value_map.get(data_type)
+    if isinstance(data_type, StructuredType):
+        return "AAAAAAAAAAAAAAAA"  # BUG: this does not work!!!


+1 on doing the encoding with the zarr function. The fill values of structured types are all base64 encoded, and the zarr function @BrianMichell pointed at handles this iirc.

tasansal · 2025-07-14T14:14:32Z

src/mdio/schemas/v1/dataset_serializer.py

+        # Let's store the data array for the second pass
+        data_arrays[v.name] = data_array
+
+    # Second pass: Add non-dimension coordinates to the data arrays


remove comment

@tasansal This was actually a hint for a discussion :-)
Just by looking at an example at https://zarr.readthedocs.io/en/stable/user-guide/arrays.html#compressors
I was hoping that the following would work out of the box, but it does not.

"compressor": _to_dictionary(v.compressor)

I see the following error:

>   _compressor = parse_compressor(compressor[0])
>   return numcodecs.get_codec(data)
E   numcodecs.errors.UnknownCodecError: codec not available: 'None'

Thus, we need a conversion function. Shouldn't this be addressed at the schema layer?
It would be nice to have a schema that can be serialized to a JSON dictionary and used directly to parametrize the encoder.
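One possible shape for such a conversion function is sketched below. numcodecs.get_codec looks up a codec by the "id" key of its config dictionary, and the error above is consistent with a dictionary that has no "id" key. The sketch assumes, without confirmation from the MDIO schema, that the serialized compressor names its codec under "name" and may carry None-valued optional fields; to_codec_config and the sample fields are hypothetical.

```python
from typing import Any, Optional


def to_codec_config(compressor: Optional[dict[str, Any]]) -> Optional[dict[str, Any]]:
    """Convert a schema-style compressor dict into a numcodecs-style config."""
    if compressor is None:
        return None  # no compression requested

    # Drop unset optional fields so they are not passed on as None.
    config = {key: value for key, value in compressor.items() if value is not None}

    # Assumption: the schema stores the codec name under "name";
    # numcodecs expects it under "id".
    config["id"] = config.pop("name")
    return config


print(to_codec_config({"name": "blosc", "cname": "zstd", "clevel": 5, "blocksize": None}))
```

If the schema itself emitted the "id" key and omitted unset fields, the dictionary could be handed to the encoder directly and this extra layer would not be needed, which is the point raised in the question above.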

src/mdio/schemas/v1/dataset_serializer.py

dmitriyrepin and others added 29 commits June 24, 2025 20:17

schema_v1-dataset_builder-add_dimension

9dd9fbc

Merge remote-tracking branch 'upstream/v1' into v1

f88531e

First take on add_dimension(), add_coordinate(), add_variable()

1358f95

Finished add_dimension, add_coordinate, add_variable

e5261cb

Work on build

95c01d8

Generalize _to_dictionary()

46f82f0

build

0dc7cc8

Dataset Build - pass one

79863ac

Merge the latest TGSAI/mdio-python:v1 branch

ec480f1

Merge branch 'v1' into v1

fa81ea2

Revert .container changes

4b2b163

PR review: remove DEVELOPER_NOTES.md

c532c3b

PR Review: add_coordinate() should accept only data_type: ScalarType

08798cd

PR review: add_variable() data_type remove default

e8febe4

RE review: do not add dimension variable

0a4be3f

PR Review: get api version from the package version

7b25d6b

PR Review: remove add_dimension_coordinate

7ca3ed8

PR Review: add_coordinate() remove data_type default value

4d1ec9c

PR Review: improve unit tests by extracting common functionality in validate* functions

99fcf43

Remove the Dockerfile changes. They are not supposed to be a part of this PR

0778fdd

PR Review: run ruff

7e74567

PR Review: fix pre-commit errors

0aaa5f6

remove some noqa overrides

1904dee

Writing XArray / Zarr

4c7c833

gitignore

4b39ffa

Merge remote-tracking branch 'upstream/v1' into v1

e772a4f

to_zarr() fix compression

cea7308

Fix precommit issues

850135e

Use only make_campos_3d_acceptance_dataset

82f1960

BrianMichell requested changes Jul 13, 2025

tasansal changed the title ~~Converting MDIO dataset to XArray DataArray and wring it to ZARR~~ Converter for MDIO dataset spec to Xarray Dataset and serialize it to Zarr Jul 14, 2025

tasansal requested changes Jul 14, 2025
