-
Notifications
You must be signed in to change notification settings - Fork 15
Converter for MDIO dataset spec to Xarray Dataset and serialize it to Zarr #571
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: v1
Are you sure you want to change the base?
Conversation
…alidate* functions
if isinstance(data_type, ScalarType): | ||
return fill_value_map.get(data_type) | ||
if isinstance(data_type, StructuredType): | ||
return "AAAAAAAAAAAAAAAA" # BUG: this does not work!!! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://github.com/TGSAI/mdio-cpp/blob/main/mdio/dataset_factory.h#L306C5-L317C6
Here's the implementation in C++.
I think this may handle what we need: https://github.com/zarr-developers/zarr-python/blob/ea4d7e96c0738526bbf08bb99ed0ea23a6081836/src/zarr/core/dtype/npy/common.py#L241-L258
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 on the encoding by using zarr function. The fill value of structured types are all base64 encoded and the zarr function @BrianMichell pointed at handles this iirc.
|
||
# file_name = "XYZ" | ||
file_name = f"{xr_ds.attrs['name']}" | ||
to_zarr(xr_ds, f"test-data/{file_name}.zarr", mode="w") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should be using the PyTest tmp_path
fixture. That way we don't need to add test-data
as a gitignore or have to manually manage test artifacts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 on this but an even simpler way would be to use an in memory store in zarr like memory://path_to_zarr
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good! i put in lots of nitpick
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please use the existing tmp
directory and avoid adding more patterns to .gitignore
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please add single line docstrings to all functions that are missing it. please do expanded docstrings if logic is complicated.
if isinstance(data_type, ScalarType): | ||
return fill_value_map.get(data_type) | ||
if isinstance(data_type, StructuredType): | ||
return "AAAAAAAAAAAAAAAA" # BUG: this does not work!!! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 on the encoding by using zarr function. The fill value of structured types are all base64 encoded and the zarr function @BrianMichell pointed at handles this iirc.
# Let's store the data array for the second pass | ||
data_arrays[v.name] = data_array | ||
|
||
# Second pass: Add non-dimension coordinates to the data arrays |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tasansal This was actually a hint for a discussion :-)
Just by looking at an example at https://zarr.readthedocs.io/en/stable/user-guide/arrays.html#compressors
I was hoping that the following would work out of the box, but it does not.
"compressor": _to_dictionary(v.compressor)
I see the following error:
>_compressor = parse_compressor(compressor[0])
> return numcodecs.get_codec(data)
E - numcodecs.errors.UnknownCodecError: codec not available: 'None'"
Thus, we have to have a conversion function. Shouldn't this be addressed at the schema layer?
It would be nice to have the schema that can be serialized to JSON dictionary that can be used directly to parametrize the encoder.
Outstanding work