Description
First, thank you for making MATLAB/Zarr integration a priority; this work will be highly valuable as more and more data moves to the cloud.
I’m part of the development team behind MatNWB (https://github.com/NeurodataWithoutBorders/matnwb), a MATLAB package for reading and writing files in the Neurodata Without Borders (NWB) format. We’re interested in implementing support for Zarr as an alternative backend to HDF5 for NWB files, and MATLAB-support-for-Zarr-files looks like a very promising starting point.
While testing NWB Zarr files exported with PyNWB, we ran into read failures whenever a dataset has dtype |O (Python object). Below is a minimal reproduction:
% MATLAB R2024b + commit 3a7b0a3 of this repo
% zarrFilePath points at the attached "file_create_date" dataset
data = zarrread(zarrFilePath);
Observed error
Error using zarrread (line 15)
Python Error: ValueError: FAILED_PRECONDITION: Error opening "zarr" driver: Error reading local file
"~/zarr_matlab/test_data/test_zarr_sub_anm00239123_ses_20170627T093549_ecephys_and_ogen.nwb.zarr/file_create_date/.zarray":
Error parsing object member "dtype": Unsupported zarr dtype: "|O" [source
locations='tensorstore/driver/zarr/dtype.cc:225\ntensorstore/driver/zarr/dtype.cc:324\ntensorstore/driver/zarr/dtype.cc:356\ntensorstore/internal/json_binding/json_binding.h:865\ntensorstore/internal/json_binding/json_binding.h:830\ntensorstore/internal/json_binding/json_binding.h:388\ntensorstore/driver/zarr/driver.cc:108\ntensorstore/driver/kvs_backed_chunk_driver.cc:1162\ntensorstore/internal/cache/kvs_backed_cache.h:208\ntensorstore/driver/driver.cc:112']
[tensorstore_spec='{\"context\":{\"cache_pool\":{},\"data_copy_concurrency\":{},\"file_io_concurrency\":{},\"file_io_locking\":{},\"file_io_memmap\":false,\"file_io_sync\":true},\"driver\":\"zarr\",\"kvstore\":{\"driver\":\"file\",\"path\":\"/Users/Eivind/Code/MATLAB/Sandbox/CN/zarr_matlab/test_data/test_zarr_sub_anm00239123_ses_20170627T093549_ecephys_and_ogen.nwb.zarr/file_create_date/\"}}']
The dataset in question is attached below.
Expected behavior
For NWB, object dtypes typically contain variable-length UTF-8 strings or JSON-encoded metadata blobs. Ideally, they’d be returned as MATLAB cell arrays of char/string.
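To make this concrete, an illustrative sketch of the hoped-for behavior (not actual output; the shape and value are assumptions based on the attached file_create_date dataset):
data = zarrread(zarrFilePath);
% Hoped-for result for a variable-length string dataset:
%   data =
%     1x1 cell array
%       {'<ISO 8601 timestamp>'}
% JSON-encoded blobs could then be handled by the caller, e.g. jsondecode(data{1}).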
Investigation so far
- PyNWB relies on zarr-python (v2.18), which stores object arrays as VLEN metadata plus bytes; tensorstore does not appear to support reading this data type (tensorstore cannot open vlen UTF8 string written with Zarr-Python google/tensorstore#103 (comment)). A write-side sketch that reproduces such an array is shown below.
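A minimal sketch, assuming zarr-python v2.x and numcodecs are available in the Python environment MATLAB is configured to use (the file name object_demo.zarr and the stored value are arbitrary placeholders); it creates an object-dtype string array comparable to what PyNWB writes:
codec = py.numcodecs.VLenUTF8();                  % variable-length string codec
z = py.zarr.open_array('object_demo.zarr', pyargs( ...
    'mode', 'w', 'shape', int64(1), 'dtype', 'object', ...
    'object_codec', codec));
setItem = py.getattr(z, '__setitem__');           % MATLAB has no z[...] syntax
setItem(int64(0), '2024-01-01T00:00:00');         % placeholder string value
% The resulting object_demo.zarr/.zarray contains "dtype": "|O" plus a
% "vlen-utf8" filter (the attached dataset uses the related "vlen-bytes"
% codec) and should trigger the same zarrread error as above.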
Questions
- Are you already tracking support for object dtypes in tensorstore or your MATLAB layer?
- Would you be interested in working to support this, and/or would you accept PRs adding read/write support for object dtypes?
Preliminary workaround
zInfo = zarrinfo(zarrFilePath);
if strcmp(zInfo.dtype, '|O')
    % Object dtype: fall back to reading through zarr-python
    data = read_zarr_object(zarrFilePath);
else
    data = zarrread(zarrFilePath);
end
read_zarr_object.m
function result = read_zarr_object(zarrPath)
    % Read an object-dtype (|O) zarr array via zarr-python and convert its
    % (single) element to a MATLAB type.
    z = py.zarr.open_array(zarrPath, pyargs('mode', 'r'));

    % Create a slice object: slice(None) means ':'
    pySlice = py.slice(py.None);

    % Read the array with explicit slicing (MATLAB cannot use z[:] syntax)
    sliceFcn = py.getattr(z, '__getitem__');
    rawData = sliceFcn(pySlice);

    matCell = cell(rawData.tolist());
    pyElem = matCell{1};  % The attached dataset has a single element

    if isa(pyElem, 'py.bytes')
        result = char(pyElem.decode('utf-8'));
    elseif isa(pyElem, 'py.str')
        result = char(pyElem);
    elseif isa(pyElem, 'py.hdmf_zarr.utils.ZarrReference')
        % Decode the reference as JSON (brittle: assumes no quotes in values)
        result = char(py.str(pyElem));
        result = strrep(result, '''', '"');
        result = jsondecode(result);
    else
        error('Unhandled type: %s', class(pyElem));
    end
end
Reproduction materials
- Test dataset: file_create_date.zip
- zarr metadata snippet:
{
    "dtype": "|O",
    "fill_value": 0,
    "filters": [
        {
            "id": "vlen-bytes"
        }
    ]
}
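For completeness, a hedged sketch of how the object dtype could be detected by parsing the .zarray file directly (zarrFilePath and the variable names are illustrative; this sidesteps zarrinfo entirely):
% Detect the |O dtype straight from the zarr v2 metadata file
zarrayMeta = jsondecode(fileread(fullfile(zarrFilePath, '.zarray')));
isObjectDtype = strcmp(zarrayMeta.dtype, '|O');
hasVlenFilter = ~isempty(zarrayMeta.filters) && ...
    any(startsWith({zarrayMeta.filters.id}, 'vlen-'));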