
Support for Zarr object dtype ("|O") datasets #112

@ehennestad


First, thank you for making MATLAB/Zarr integration a priority. This work will be highly valuable as more and more data moves to the cloud.

I’m part of the development team behind MatNWB (https://github.com/NeurodataWithoutBorders/matnwb), a MATLAB package for reading and writing files in the Neurodata Without Borders (NWB) format. We’re interested in supporting Zarr as an alternative backend to HDF5 for NWB files, and MATLAB-support-for-Zarr-files looks like a very promising starting point.

While testing NWB-Zarr files exported with PyNWB, we ran into read failures whenever a dataset has dtype |O (Python object). Below is a minimal reproduction:

% MATLAB R2024b + commit 3a7b0a3 of this repo
data = zarrread(zarrFilePath);

Observed error

Error using zarrread (line 15)
Python Error: ValueError: FAILED_PRECONDITION: Error opening "zarr" driver: Error reading local file
"~/zarr_matlab/test_data/test_zarr_sub_anm00239123_ses_20170627T093549_ecephys_and_ogen.nwb.zarr/file_create_date/.zarray":
Error parsing object member "dtype": Unsupported zarr dtype: "|O" [source
locations='tensorstore/driver/zarr/dtype.cc:225\ntensorstore/driver/zarr/dtype.cc:324\ntensorstore/driver/zarr/dtype.cc:356\ntensorstore/internal/json_binding/json_binding.h:865\ntensorstore/internal/json_binding/json_binding.h:830\ntensorstore/internal/json_binding/json_binding.h:388\ntensorstore/driver/zarr/driver.cc:108\ntensorstore/driver/kvs_backed_chunk_driver.cc:1162\ntensorstore/internal/cache/kvs_backed_cache.h:208\ntensorstore/driver/driver.cc:112']
[tensorstore_spec='{\"context\":{\"cache_pool\":{},\"data_copy_concurrency\":{},\"file_io_concurrency\":{},\"file_io_locking\":{},\"file_io_memmap\":false,\"file_io_sync\":true},\"driver\":\"zarr\",\"kvstore\":{\"driver\":\"file\",\"path\":\"/Users/Eivind/Code/MATLAB/Sandbox/CN/zarr_matlab/test_data/test_zarr_sub_anm00239123_ses_20170627T093549_ecephys_and_ogen.nwb.zarr/file_create_date/\"}}']

The dataset in question is attached below.

Expected behavior

For NWB, object dtypes typically contain variable-length UTF-8 strings or JSON-encoded metadata blobs. Ideally, they’d be returned as MATLAB cell arrays of char/string.
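To make the expected result concrete, here is a minimal sketch in Python (standard library only) of how one chunk of a `vlen-bytes` or `vlen-utf8` dataset could be split into strings. It assumes the layout I understand numcodecs to use for these codecs: a little-endian uint32 item count, then for each item a uint32 byte length followed by the payload. It also assumes the chunk is stored with `"compressor": null`; a compressed chunk would need to be decompressed first. The names `decode_vlen_chunk` and `encode_vlen_chunk` are hypothetical and not part of any library.

```python
import struct


def decode_vlen_chunk(buf):
    """Split one uncompressed vlen-bytes/vlen-utf8 chunk into a list of bytes items."""
    # Header: number of items as a little-endian uint32.
    (count,) = struct.unpack_from("<I", buf, 0)
    offset = 4
    items = []
    for _ in range(count):
        # Each item: uint32 byte length, then that many bytes of payload.
        (length,) = struct.unpack_from("<I", buf, offset)
        offset += 4
        if offset + length > len(buf):
            raise ValueError("chunk is truncated")
        items.append(bytes(buf[offset:offset + length]))
        offset += length
    return items


def encode_vlen_chunk(items):
    """Inverse of decode_vlen_chunk; used here only to build demo input."""
    parts = [struct.pack("<I", len(items))]
    for item in items:
        parts.append(struct.pack("<I", len(item)))
        parts.append(item)
    return b"".join(parts)


# Hand-built chunk so the sketch runs without zarr or numcodecs installed.
demo_chunk = encode_vlen_chunk(["first".encode("utf-8"), "µV".encode("utf-8")])
decoded = [item.decode("utf-8") for item in decode_vlen_chunk(demo_chunk)]
print(decoded)
```

On the MATLAB side, the decoded list would then map naturally onto a cell array of char vectors, one cell per array element.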

Investigation so far

Questions

  1. Are you already tracking support for object dtypes in tensorstore or your MATLAB layer?
  2. Would you be interested in supporting this, and/or would you accept PRs that add read/write support for object dtypes?

Preliminary workaround

    zInfo = zarrinfo(zarrFilePath);
    if strcmp(zInfo.dtype, '|O')
        data = read_zarr_object(zarrFilePath);
    else
        data = zarrread(zarrFilePath);
    end

read_zarr_object.m

function result = read_zarr_object(zarrPath)
    % Read a Zarr dataset of dtype "|O" through the Python zarr package.
    % Only the first element is decoded, so this covers scalar datasets only.

    z = py.zarr.open_array(zarrPath, pyargs('mode', 'r'));

    % Create a slice object: slice(None) means ':'
    pySlice = py.slice(py.None);

    % Read the array with explicit slicing
    sliceFcn = py.getattr(z, '__getitem__');
    rawData = sliceFcn(pySlice);

    matCell = cell(rawData.tolist());
    pyElem = matCell{1};  % There's only one element

    if isa(pyElem, 'py.bytes')
        result = char(pyElem.decode('utf-8'));
    elseif isa(pyElem, 'py.str')
        result = char(pyElem);
    elseif isa(pyElem, 'py.hdmf_zarr.utils.ZarrReference')
        % Decode as JSON. This turns the Python repr into JSON by swapping
        % quote characters, so it fails on values that contain apostrophes
        % or Python-only literals such as None.
        result = char(pyElem);
        result = strrep(result, '''', '"');
        result = jsondecode(result);
    else
        error('Unhandled type: %s', class(pyElem));
    end
end

Reproduction materials

    {
        "dtype": "|O",
        "fill_value": 0,
        "filters": [
            {
                "id": "vlen-bytes"
            }
        ]
    }
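For triage, the object dtype and its codec can be read straight from the `.zarray` JSON without going through tensorstore. The sketch below (Python, standard library only) assumes the Zarr v2 convention that an object array carries its object codec as the first entry in `filters`. The function name `object_codec_id` is hypothetical, and the metadata string is just the excerpt above collapsed onto one line.

```python
import json


def object_codec_id(zarray_text):
    """Return the id of the first filter for an object-dtype array, else None."""
    meta = json.loads(zarray_text)
    if meta.get("dtype") != "|O":
        return None  # not an object array; the regular read path applies
    filters = meta.get("filters") or []
    if not filters:
        raise ValueError("object dtype without a filter; cannot tell how to decode")
    # Assumption: the object codec is the first filter in the list.
    return filters[0]["id"]


zarray_text = '{"dtype": "|O", "fill_value": 0, "filters": [{"id": "vlen-bytes"}]}'
print(object_codec_id(zarray_text))
```

A reader could branch on the returned id, for example treating `vlen-bytes` as raw byte strings to be decoded as UTF-8, and fall back to an error for codecs it does not know.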


Labels: enhancement
