Lessons to learn from STAC's extensibility #316

@TomAugspurger

As mentioned in #309, I ran across some challenges with how the Zarr v3 spec does extensions. I think that we might be able to learn some lessons from how STAC handles extensions.


tl;dr: I think Zarr would benefit from a better extension story, one that removes the need for involvement from anyone other than the extension author and the tooling that wants to use the extension. JSON Schema plus a zarr_extensions field on Group and Array would get us most of the way there. The current requirements of must_understand: false and name: URL in the extension objects feel like a weaker version of this.


How STAC does extensibility

STAC is a JSON-based format for cataloging geospatial assets. https://github.com/radiantearth/stac-spec/blob/master/extensions/README.md#overview lays out how STAC allows itself to be extended, but there are a few key components:

  1. STAC uses JSON Schema to define schemas for both the core metadata and extensions.
  2. All STAC objects (Collection, Item, etc.) include a stac_version field.
  3. All STAC objects (Collection, Item) include a stac_extensions array with a list of URLs to JSON Schema definitions that can be used for validation.

Together, these are sufficient to allow extensions to extend basically any part of STAC without any involvement from the core of STAC. Tooling built around STAC coordinates through stac_extensions. For example, a validator can load the JSON Schema definitions for the core metadata (using the stac_version field) and all extensions (using the URLs in stac_extensions) and validate a document against those schemas. Libraries wishing to use some feature can check for the presence of a specific URL in stac_extensions.
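
A rough sketch of that coordination, using only the standard library and no network access. The Item fragment, the core-schema URL template, and the helper names `schema_urls` and `uses_extension` are illustrative assumptions, not part of any STAC library:

```python
# Illustrative STAC Item fragment. The URLs are examples and are not fetched.
item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "stac_extensions": [
        "https://stac-extensions.github.io/projection/v2.0.0/schema.json",
    ],
    "properties": {"proj:code": "EPSG:32633"},
}

# Assumed template for locating the core Item schema from stac_version.
CORE_SCHEMA_TEMPLATE = (
    "https://schemas.stacspec.org/v{version}/item-spec/json-schema/item.json"
)


def schema_urls(doc):
    """Return every schema URL a validator would need for this document."""
    core = CORE_SCHEMA_TEMPLATE.format(version=doc["stac_version"])
    return [core, *doc.get("stac_extensions", [])]


def uses_extension(doc, prefix):
    """Return True if any declared extension URL starts with `prefix`."""
    return any(url.startswith(prefix) for url in doc.get("stac_extensions", []))
```

A validator would fetch each URL from `schema_urls(item)` and validate the document against it, while a library that only cares about projection metadata could gate that feature on `uses_extension(item, "https://stac-extensions.github.io/projection/")`.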

You also get the ability to version things separately. The core metadata can be at 1.0.0 while the proj extension is at 2.0.0, without issue.

How that might apply to Zarr

Two immediate reactions to the thought of applying that to Zarr:

  1. Zarr does have JSON documents for describing the metadata of nodes in a Zarr hierarchy. We could pretty easily take the same concepts and apply them more or less directly to the Group and Array definitions (and possibly other fields within; STAC does this as well for, e.g. Assets which live inside an Item).
  2. STAC is entirely JSON-based, while much of Zarr concerns how binary blobs are stored, transformed, etc. While portions of these extension points might be configured (and validated by JSON schema) in the metadata document, much of it will lie outside.
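
To make the first point concrete, here is a sketch of what an array metadata document might look like with a zarr_extensions field. This field is only the proposal from this issue and does not exist in the Zarr v3 spec (under the current rules a reader would reject it as an unknown field); the example.com URL and the helper `unrecognized_extensions` are likewise made up for illustration:

```python
# Zarr v3 array metadata plus a HYPOTHETICAL top-level "zarr_extensions" list,
# mirroring STAC's stac_extensions. The extension URL is a placeholder.
array_metadata = {
    "zarr_format": 3,
    "node_type": "array",
    "shape": [1000, 1000],
    "data_type": "float32",
    "chunk_grid": {"name": "regular", "configuration": {"chunk_shape": [100, 100]}},
    "chunk_key_encoding": {"name": "default", "configuration": {"separator": "/"}},
    "fill_value": 0.0,
    "codecs": [{"name": "bytes", "configuration": {"endian": "little"}}],
    "attributes": {},
    "zarr_extensions": [
        "https://example.com/zarr-extensions/my-ext/v1.0.0/schema.json",
    ],
}


def unrecognized_extensions(metadata, understood):
    """Return declared extension URLs that this tool does not know about."""
    declared = metadata.get("zarr_extensions", [])
    return [url for url in declared if url not in understood]
```

A tool could then decide for itself whether an unrecognized extension matters for what it is trying to do, without the core spec having to know the extension exists.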

How does this relate to what Zarr has today?

I'm not sure. I was confused about some things reading https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#extension-points. The spec seems overly prescriptive about putting keys in the top level of the metadata:

> The array metadata object must not contain any other names. Those are reserved for future versions of this specification. An implementation must fail to open Zarr hierarchies, groups or arrays with unknown metadata fields, with the exception of objects with a "must_understand": false key-value pair.

STAC / JSON Schema take the opposite approach with their metadata documents: any extra fields are allowed and ignored by default, but schemas (core or extension) can define required fields.
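
The difference between the two policies can be sketched in a few lines of plain Python. This is a simplification for illustration, not how any Zarr or STAC implementation is actually written; the function names and the sample document are assumptions:

```python
# Top-level names defined for array metadata in the Zarr v3 core spec.
KNOWN_FIELDS = {
    "zarr_format", "node_type", "shape", "data_type", "chunk_grid",
    "chunk_key_encoding", "fill_value", "codecs", "attributes",
    "storage_transformers", "dimension_names",
}


def blocking_fields(metadata):
    """Zarr-style policy: list unknown fields that must cause a failure.

    An unknown field is tolerated only if its value is an object
    containing "must_understand": false.
    """
    blocking = []
    for name, value in metadata.items():
        if name in KNOWN_FIELDS:
            continue
        if isinstance(value, dict) and value.get("must_understand") is False:
            continue
        blocking.append(name)
    return blocking


def missing_required(metadata, required):
    """JSON-Schema-style policy: extras are ignored, only gaps are errors."""
    return [name for name in required if name not in metadata]


# Sample document with one tolerated and one blocking unknown field.
doc = {
    "zarr_format": 3,
    "node_type": "array",
    "optional_ext": {"must_understand": False, "setting": 1},
    "other_ext": {"setting": 2},
}
```

With this sample, `blocking_fields(doc)` reports only `other_ext`, whereas the permissive policy accepts both extra fields and complains only if a schema's required names are absent.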

> Specifications for new extensions are recommended to be published in the zarr-developers/zarr-specs repository via the ZEP process. If a specification is published decentralized (e.g. for initial experimentation or due to a very specialized scope), it must use a URL in the name key of its metadata, which identifies the publishing organization or individual, and should point to the specification of the extension.

Having a central place to advertise extensions is great. But to me having to write a ZEP feels like a pretty high bar. STAC extensions are quick and easy to create, and that's led to a lot of experimentation and eventual stabilization in STAC core. And some institutions will have private STAC extensions that they never intend to publish. IMO the extension story should lead with that and offer a zarr-extensions repository / organization for commonly used extensions / shared maintenance.
