v3 codec structure in zarr.json #298

@d-v-b

The v3 spec states that codecs are stored in a JSON array under the key codecs. But the spec also states that the list of codecs is structured:

...the list of codecs must be of the following form:

zero or more array -> array codecs; followed by
exactly one array -> bytes codec; followed by
zero or more bytes -> bytes codecs.
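
To make the cost of this constraint concrete, here is a rough sketch of the check an implementation has to run against the flat array form. The `CODEC_KINDS` table and the function name `validate_flat_codecs` are hypothetical; a real implementation would presumably look up each codec's kind in its own codec registry.

```python
# Hypothetical lookup table mapping codec names to their kind.
# Assumption: a real implementation would get this from its codec registry.
CODEC_KINDS = {
    "transpose": "array_array",
    "bytes": "array_bytes",
    "sharding_indexed": "array_bytes",
    "gzip": "bytes_bytes",
    "blosc": "bytes_bytes",
    "crc32c": "bytes_bytes",
}

# The order in which the three kinds must appear in the flat list.
KIND_ORDER = ["array_array", "array_bytes", "bytes_bytes"]


def validate_flat_codecs(codecs):
    """Raise ValueError unless the flat codec list has the required form."""
    kinds = []
    for codec in codecs:
        name = codec["name"]
        if name not in CODEC_KINDS:
            raise ValueError(f"unknown codec: {name!r}")
        kinds.append(CODEC_KINDS[name])

    # Kinds must appear in non-decreasing order of their position in KIND_ORDER.
    ranks = [KIND_ORDER.index(kind) for kind in kinds]
    if ranks != sorted(ranks):
        raise ValueError("codecs are not ordered as array->array, array->bytes, bytes->bytes")

    # There must be exactly one array -> bytes codec.
    n_array_bytes = kinds.count("array_bytes")
    if n_array_bytes != 1:
        raise ValueError(f"expected exactly one array -> bytes codec, found {n_array_bytes}")
```

Note that neither the ordering rule nor the "exactly one" rule is visible in the JSON itself; both live only in code like this.
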

This is actually a lot of semantic load for something simple like a JSON array. Instead of using a JSON array, I believe that the above structure could be expressed much better (where "expressed better" means "conveys intent more clearly, with no loss of information, and minimal added complexity") by using a JSON object with the following structure:

{
  "codecs": {
    "array_array": [],                # array of array -> array codecs, possibly empty
    "array_bytes": {"name": "bytes"}, # a single array -> bytes codec, required
    "bytes_bytes": []                 # array of bytes -> bytes codecs, possibly empty
  }
}

I am noting this because over in the zarr-python v3 implementation effort, we have written something like the above data structure as part of the basic parsing of the contents of zarr.json. In fact I think this data structure will arise in any implementation, because implementations must represent the structure of the codecs, and that structure is not captured at all by the JSON array representation. But, as I show here, it is trivial to describe the codec structure explicitly with JSON. A side benefit is that the proposed data structure makes the constraint that there be exactly one array -> bytes codec explicit, which would lift some of the validation burden from implementations.
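
As an illustration of the kind of parsing step described above, here is a minimal sketch (not the actual zarr-python code) that turns the flat array into the proposed object and back. The names `CODEC_KINDS`, `group_codecs`, and `flatten_codecs` are hypothetical, and the kind lookup table is an assumption standing in for a real codec registry.

```python
# Hypothetical lookup table mapping codec names to their kind.
# Assumption: a real implementation would get this from its codec registry.
CODEC_KINDS = {
    "transpose": "array_array",
    "bytes": "array_bytes",
    "sharding_indexed": "array_bytes",
    "gzip": "bytes_bytes",
    "blosc": "bytes_bytes",
    "crc32c": "bytes_bytes",
}


def flatten_codecs(grouped):
    """Turn the proposed object form back into the flat array form."""
    return [*grouped["array_array"], grouped["array_bytes"], *grouped["bytes_bytes"]]


def group_codecs(codecs):
    """Turn the flat array form into the proposed object form."""
    buckets = {"array_array": [], "array_bytes": [], "bytes_bytes": []}
    for codec in codecs:
        buckets[CODEC_KINDS[codec["name"]]].append(codec)

    if len(buckets["array_bytes"]) != 1:
        raise ValueError("expected exactly one array -> bytes codec")

    grouped = {
        "array_array": buckets["array_array"],
        "array_bytes": buckets["array_bytes"][0],
        "bytes_bytes": buckets["bytes_bytes"],
    }
    # If flattening does not reproduce the input, the input was out of order.
    if flatten_codecs(grouped) != list(codecs):
        raise ValueError("codecs are not ordered as array->array, array->bytes, bytes->bytes")
    return grouped


flat = [{"name": "transpose"}, {"name": "bytes"}, {"name": "gzip"}]
grouped = group_codecs(flat)
```

With the object form, the ordering check and the "exactly one" check would no longer be needed when reading zarr.json, because the shape of the document already enforces both; and since `flatten_codecs(group_codecs(flat))` gives back `flat` for any valid list, no information is lost.
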

So, if we care about making this easier for implementations (and I think making it easy for implementations also makes it easier for users), we should consider this change to zarr.json. There is no change to the semantics of the spec, but it makes zarr.json clearer. I understand that people may not want to change the spec. But I consider that a separate question from whether the current spec has defects that could in principle be fixed, such as the one described here.
