Converter for MDIO dataset spec to Xarray Dataset and serialize it to Zarr #571

Open · wants to merge 29 commits into base: v1
Conversation

dmitriyrepin

Outstanding work

  • Add more unit tests for internal functions
  • Add a test comparing expected and actual .zmetadata for the serialized dataset

if isinstance(data_type, ScalarType):
    return fill_value_map.get(data_type)
if isinstance(data_type, StructuredType):
    return "AAAAAAAAAAAAAAAA"  # BUG: this does not work!!!
Collaborator:
+1 on doing the encoding with the zarr function. The fill values of structured types are all base64 encoded, and the zarr function @BrianMichell pointed at handles this, IIRC.
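For reference, a minimal stdlib-only sketch of what that base64 encoding looks like. The struct format here is a hypothetical stand-in for an MDIO StructuredType; the real code should rely on the zarr helper mentioned above rather than this function:

```python
import base64
import struct

# Hypothetical stand-in for an MDIO StructuredType with two
# int32 fields ("inline", "crossline").
STRUCT_FMT = "<ii"

def encode_structured_fill(fmt: str) -> str:
    """Base64-encode a zero-filled record, matching how zarr v2
    stores structured fill values in .zarray metadata."""
    raw = b"\x00" * struct.calcsize(fmt)
    return base64.standard_b64encode(raw).decode("ascii")

print(encode_structured_fill(STRUCT_FMT))  # 8 zero bytes -> "AAAAAAAAAAA="
```

This also shows why the hard-coded `"AAAAAAAAAAAAAAAA"` above can't work: the encoded string's length depends on the byte size of the structured dtype.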


# file_name = "XYZ"
file_name = f"{xr_ds.attrs['name']}"
to_zarr(xr_ds, f"test-data/{file_name}.zarr", mode="w")
Collaborator:
We should be using the pytest tmp_path fixture. That way we don't need to add test-data to .gitignore or manually manage test artifacts.
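A sketch of that pattern; `write_store` is a hypothetical helper, and `TemporaryDirectory` stands in here for the `tmp_path` that pytest would inject into the test function:

```python
import pathlib
import tempfile

def write_store(base: pathlib.Path, name: str) -> pathlib.Path:
    """Create a placeholder .zarr store under `base`, the way the test
    would write under pytest's tmp_path (hypothetical helper)."""
    store = base / f"{name}.zarr"
    store.mkdir()
    (store / ".zgroup").write_text('{"zarr_format": 2}')
    return store

# In a real test, `def test_x(tmp_path):` gets the path from pytest;
# TemporaryDirectory emulates that fixture here and cleans up after itself.
with tempfile.TemporaryDirectory() as tmp:
    store = write_store(pathlib.Path(tmp), "demo")
    print(store.name)  # demo.zarr
```

The directory is deleted when the context manager exits, so no test artifacts are left behind, which is exactly what tmp_path gives for free.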

Collaborator:
+1 on this, but an even simpler way would be to use an in-memory store in zarr, like memory://path_to_zarr.
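A sketch of why this is cheap: a zarr v2 store is just a MutableMapping from keys to bytes, so even a plain dict works as an in-memory store with no files on disk at all (with fsspec installed, the same idea is spelled `memory://path_to_zarr`). The `init_group` helper below is hypothetical:

```python
import json

def init_group(store: dict) -> None:
    """Write minimal zarr v2 group metadata into the mapping,
    roughly what zarr.group(store=...) would do."""
    store[".zgroup"] = json.dumps({"zarr_format": 2}).encode()

store: dict = {}      # a plain dict acting as an in-memory zarr store
init_group(store)
print(sorted(store))  # ['.zgroup']
```

Passing such a mapping (or a memory:// URL) to `to_zarr` would keep the whole test in RAM.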

@tasansal tasansal changed the title Converting MDIO dataset to XArray DataArray and wring it to ZARR Converter for MDIO dataset spec to Xarray Dataset and serialize it to Zarr Jul 14, 2025
@tasansal (Collaborator) left a comment:

Looks good! I put in lots of nitpicks.

Collaborator:

Please use the existing tmp directory and avoid adding more patterns to .gitignore.

Collaborator:
Please add single-line docstrings to all functions that are missing them, and use expanded docstrings where the logic is complicated.


# Let's store the data array for the second pass
data_arrays[v.name] = data_array

# Second pass: Add non-dimension coordinates to the data arrays
Collaborator:
Remove this comment.

Author:
@tasansal This was actually a hint for a discussion :-)
Just by looking at an example at https://zarr.readthedocs.io/en/stable/user-guide/arrays.html#compressors
I was hoping that the following would work out of the box, but it does not.

   "compressor": _to_dictionary(v.compressor)

I see the following error:

> _compressor = parse_compressor(compressor[0])
>     return numcodecs.get_codec(data)
> E   numcodecs.errors.UnknownCodecError: codec not available: 'None'

Thus, we have to have a conversion function. Shouldn't this be addressed at the schema layer?
It would be nice to have a schema that serializes to a JSON dictionary that can be used directly to parametrize the encoder.
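One shape such a conversion function could take, as a sketch: drop None-valued fields before handing the dict to `numcodecs.get_codec`, and map a missing compressor to `None` rather than the string `'None'`. The `Blosc` dataclass below is a stand-in for the schema's compressor model, and its field names are assumptions:

```python
from dataclasses import dataclass, asdict
from typing import Optional

# Stand-in for the schema's compressor model (field names are assumptions).
@dataclass
class Blosc:
    id: str = "blosc"
    cname: str = "zstd"
    clevel: int = 5
    shuffle: Optional[int] = None  # unset fields would serialize as None

def compressor_to_codec_config(compressor) -> Optional[dict]:
    """Turn a schema compressor (or None) into the plain dict that
    numcodecs.get_codec expects, dropping None-valued fields."""
    if compressor is None:
        return None  # zarr treats a missing compressor as "no compression"
    return {k: v for k, v in asdict(compressor).items() if v is not None}

print(compressor_to_codec_config(Blosc()))
# {'id': 'blosc', 'cname': 'zstd', 'clevel': 5}
print(compressor_to_codec_config(None))  # None
```

If the schema models could emit this dict shape directly (e.g. by excluding unset fields during serialization), the converter would not need this extra layer at all, which is the point being raised above.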
