
add support for top-level custom zarr extensions using json schema references #1

Open
wants to merge 2 commits into
base: main

geospatial-jeff
Owner

@geospatial-jeff geospatial-jeff commented Jul 8, 2025

The goal of this PR is to provide an example of how zarr-python could be extended to support custom extensions. This PR is not complete; the intent is to inform discussion around ZEP10. There are several high-level goals:

  1. Don't make any breaking changes to the existing Zarr spec, which means extensions should be stored at the top-level of each node.
  2. Follow STAC's implementation of extensions, which includes a top-level key linking to an array of remote JSON Schema references which validate the extensions present in the node.
  3. Provide a consistent mechanism that works similarly across "custom" extensions and "official" extensions (e.g. chunk_grid, data_type).
  4. Align with the zarr-python approach of modeling each node as a class.

This PR proposes the addition of the extension_schemas key to the Zarr v3 spec. This is a physical key that may be present on any node type and contains an array of JSON Schema references (URLs) indicating which extensions are present in the node, similar to stac_extensions in the STAC spec.

It also proposes the addition of a logical extensions key to zarr-python which contains all extensions implemented by the node, allowing zarr-python to provide consistent access patterns to both custom and official Zarr extensions.

from zarr.core.group import GroupMetadata

group_metadata = {
    "attributes": {},
    "zarr_format": 3,
    "consolidated_metadata": None,
    "node_type": "group",
    # A list of JSON schemas, one for each extension implemented by the node
    "extension_schemas": [
        "https://raw.githubusercontent.com/geospatial-jeff/cog2zarr/refs/heads/pydantic-zarr/jsonschemas/gdal.json"
    ],
    # Top-level extensions
    "geo": {
        "name": "gdal",
        "configuration": {
            "band_names": [
                "red",
                "green",
                "blue",
                "nir"
            ],
            "group_configuration": "chunky",
            "transform": [
                499980.0,
                10.0,
                0.0,
                5400000.0,
                0.0,
                -10.0
            ],
            "epsg": "EPSG:32633",
            "wkt": "PROJCS[\"WGS 84 / UTM zone 33N\",GEOGCS[\"WGS 84\",DATUM[\"WGS_1984\",SPHEROID[\"WGS 84\",6378137,298.257223563,AUTHORITY[\"EPSG\",\"7030\"]],AUTHORITY[\"EPSG\",\"6326\"]],PRIMEM[\"Greenwich\",0,AUTHORITY[\"EPSG\",\"8901\"]],UNIT[\"degree\",0.0174532925199433,AUTHORITY[\"EPSG\",\"9122\"]],AUTHORITY[\"EPSG\",\"4326\"]],PROJECTION[\"Transverse_Mercator\"],PARAMETER[\"latitude_of_origin\",0],PARAMETER[\"central_meridian\",15],PARAMETER[\"scale_factor\",0.9996],PARAMETER[\"false_easting\",500000],PARAMETER[\"false_northing\",0],UNIT[\"metre\",1,AUTHORITY[\"EPSG\",\"9001\"]],AXIS[\"Easting\",EAST],AXIS[\"Northing\",NORTH],AUTHORITY[\"EPSG\",\"32633\"]]",
            "projjson": "{\"$schema\":\"https://proj.org/schemas/v0.7/projjson.schema.json\",\"type\":\"ProjectedCRS\",\"name\":\"WGS 84 / UTM zone 33N\",\"base_crs\":{\"name\":\"WGS 84\",\"datum\":{\"type\":\"GeodeticReferenceFrame\",\"name\":\"World Geodetic System 1984\",\"ellipsoid\":{\"name\":\"WGS 84\",\"semi_major_axis\":6378137,\"inverse_flattening\":298.257223563}},\"coordinate_system\":{\"subtype\":\"ellipsoidal\",\"axis\":[{\"name\":\"Geodetic latitude\",\"abbreviation\":\"Lat\",\"direction\":\"north\",\"unit\":\"degree\"},{\"name\":\"Geodetic longitude\",\"abbreviation\":\"Lon\",\"direction\":\"east\",\"unit\":\"degree\"}]},\"id\":{\"authority\":\"EPSG\",\"code\":4326}},\"conversion\":{\"name\":\"UTM zone 33N\",\"method\":{\"name\":\"Transverse Mercator\",\"id\":{\"authority\":\"EPSG\",\"code\":9807}},\"parameters\":[{\"name\":\"Latitude of natural origin\",\"value\":0,\"unit\":\"degree\",\"id\":{\"authority\":\"EPSG\",\"code\":8801}},{\"name\":\"Longitude of natural origin\",\"value\":15,\"unit\":\"degree\",\"id\":{\"authority\":\"EPSG\",\"code\":8802}},{\"name\":\"Scale factor at natural origin\",\"value\":0.9996,\"unit\":\"unity\",\"id\":{\"authority\":\"EPSG\",\"code\":8805}},{\"name\":\"False easting\",\"value\":500000,\"unit\":\"metre\",\"id\":{\"authority\":\"EPSG\",\"code\":8806}},{\"name\":\"False northing\",\"value\":0,\"unit\":\"metre\",\"id\":{\"authority\":\"EPSG\",\"code\":8807}}]},\"coordinate_system\":{\"subtype\":\"Cartesian\",\"axis\":[{\"name\":\"Easting\",\"abbreviation\":\"\",\"direction\":\"east\",\"unit\":\"metre\"},{\"name\":\"Northing\",\"abbreviation\":\"\",\"direction\":\"north\",\"unit\":\"metre\"}]},\"id\":{\"authority\":\"EPSG\",\"code\":32633}}"
        }
    }
}
md = GroupMetadata.from_dict(group_metadata)
assert "geo" in md.extensions
d = md.to_dict()
assert "geo" in d
assert d == group_metadata
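
For readers without the PR branch checked out, the round trip above can be approximated with a small stand-in class. This is only a sketch: GroupMetadataSketch and CORE_GROUP_KEYS are hypothetical names, and treating every non-core top-level key as an extension is an assumption, not necessarily how the PR implements it.

```python
from dataclasses import dataclass, field
from typing import Any

# Assumption: the core keys of a v3 group document plus the proposed
# "extension_schemas" key. Any other top-level key is treated as an extension.
CORE_GROUP_KEYS = {
    "zarr_format",
    "node_type",
    "attributes",
    "consolidated_metadata",
    "extension_schemas",
}


@dataclass
class GroupMetadataSketch:
    """Hypothetical stand-in for GroupMetadata, not the zarr-python class."""

    core: dict[str, Any]
    extensions: dict[str, Any] = field(default_factory=dict)

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "GroupMetadataSketch":
        # Split the physical document into core keys and the logical
        # "extensions" mapping.
        core = {k: v for k, v in data.items() if k in CORE_GROUP_KEYS}
        extensions = {k: v for k, v in data.items() if k not in CORE_GROUP_KEYS}
        return cls(core=core, extensions=extensions)

    def to_dict(self) -> dict[str, Any]:
        # Writing back merges the extensions into the top level again.
        return {**self.core, **self.extensions}


doc = {
    "zarr_format": 3,
    "node_type": "group",
    "attributes": {},
    "consolidated_metadata": None,
    "extension_schemas": ["https://example.com/schemas/gdal.json"],  # placeholder URL
    "geo": {"name": "gdal", "configuration": {"epsg": "EPSG:32633"}},
}
sketch = GroupMetadataSketch.from_dict(doc)
```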

I have not yet updated ArrayMetadata to support this; however, I think it's fairly clear how that would work. The only difference is that the array node type currently implements several official Zarr extensions, which would be rolled into the logical extensions member for consistent access alongside any other custom extensions.
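
One way the array case might look is sketched below. It assumes that official extension points may be written either as a bare name or as an object with name and configuration, and it normalizes both to the same shape so official and custom extensions can be read identically. The names normalize, logical_extensions and OFFICIAL_EXTENSION_KEYS are hypothetical and not part of zarr-python.

```python
from typing import Any

# Assumption: array keys treated here as "official" extension points.
OFFICIAL_EXTENSION_KEYS = ("data_type", "chunk_grid", "chunk_key_encoding", "codecs")


def normalize(value: Any) -> Any:
    """Bring an extension entry into a uniform {"name", "configuration"} shape."""
    if isinstance(value, str):
        # Short form: just the extension name.
        return {"name": value, "configuration": {}}
    if isinstance(value, list):
        # For example the codec pipeline: normalize each entry.
        return [normalize(v) for v in value]
    return {"name": value["name"], "configuration": value.get("configuration", {})}


def logical_extensions(array_doc: dict[str, Any], custom_keys: tuple[str, ...] = ()) -> dict[str, Any]:
    """Collect official and custom extensions of an array document into one mapping."""
    keys = [k for k in (*OFFICIAL_EXTENSION_KEYS, *custom_keys) if k in array_doc]
    return {k: normalize(array_doc[k]) for k in keys}


array_doc = {
    "zarr_format": 3,
    "node_type": "array",
    "shape": [4, 1024, 1024],
    "data_type": "uint16",
    "chunk_grid": {"name": "regular", "configuration": {"chunk_shape": [4, 512, 512]}},
    "codecs": [{"name": "bytes", "configuration": {"endian": "little"}}],
    "geo": {"name": "gdal", "configuration": {"epsg": "EPSG:32633"}},
}
exts = logical_extensions(array_doc, custom_keys=("geo",))
```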

Importantly, this PR doesn't update zarr-python to do anything with these custom extensions besides parsing / validating them. Whether or not a reader such as zarr-python does anything with an extension depends on extension maturity, target audience, etc.

@d-v-b

d-v-b commented Jul 8, 2025

Ignoring the question of whether this is permitted by the spec, as that seems like the source of some disagreement, what is the advantage of putting the extensions at the top level of the metadata document instead of in a dedicated extensions key, or in the attributes key?

@geospatial-jeff
Owner Author

Only because I didn't want to make a breaking change and figured that the Zarr spec was already using top-level keys for extensions. If that's not the case because "extensions" like chunk_grid are more "core-like" than "extension-like" then I agree it doesn't really matter where in the node they are placed.

The placement of extensions in the node is much less important than linking to explicit JSON Schemas that reference what extensions are included in the node and give readers a way to validate the node against those extension types.
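
A reader-side check along these lines could be sketched as follows. It assumes, as in STAC, that each referenced schema validates the whole node document. validate_node, require_keys and the example.com URL are hypothetical; the schema loader and the validator are passed in so the flow itself stays independent of any particular JSON Schema library.

```python
from typing import Any, Callable

Schema = dict[str, Any]


def validate_node(
    node: dict[str, Any],
    load_schema: Callable[[str], Schema],
    validate: Callable[[dict[str, Any], Schema], None],
) -> list[str]:
    """Validate a node against every schema listed in "extension_schemas".

    `validate` is expected to raise on failure. Returns the URLs checked.
    """
    checked = []
    for url in node.get("extension_schemas", []):
        schema = load_schema(url)
        validate(node, schema)
        checked.append(url)
    return checked


def require_keys(instance: dict[str, Any], schema: Schema) -> None:
    # Stand-in validator for this sketch: only honours the "required" keyword.
    missing = [k for k in schema.get("required", []) if k not in instance]
    if missing:
        raise ValueError(f"missing required keys: {missing}")


# In-memory registry standing in for remote schema documents.
registry = {
    "https://example.com/schemas/geo.json": {"type": "object", "required": ["geo"]},
}
node = {
    "zarr_format": 3,
    "node_type": "group",
    "attributes": {},
    "extension_schemas": ["https://example.com/schemas/geo.json"],
    "geo": {"name": "gdal", "configuration": {}},
}
checked = validate_node(node, registry.__getitem__, require_keys)
```

In practice the loader would fetch and cache the remote document, and a full validator such as `jsonschema.validate` from the third-party jsonschema package, which takes the instance and the schema in the same order, could be passed as `validate`.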

@d-v-b

d-v-b commented Jul 8, 2025

Zarr spec was already using top-level keys for extensions.

This is currently not happening. That might have been a source of confusion.

@geospatial-jeff
Owner Author

geospatial-jeff commented Jul 8, 2025

I'm not sure how else to interpret the spec. As someone building an extension, I searched through the spec for recommendations on how to build my own extension. This led me to the Extensions section, notably the Guidance for extension authors section.

image

I then noticed there are already several Zarr extensions I could base my own extension on:

image

And there is even an example of these extensions (stored at the top level)

image

I feel like this is a reasonable interpretation of the spec. If this isn't the intent, then the spec should be rewritten to be clearer. I feel strongly that the "Extensions" section should be removed from the spec, as these are part of "core". Including them under a section called "Extensions" is confusing.

@d-v-b

d-v-b commented Jul 8, 2025

That text was added recently, and I agree that it's not very clear. Sorry about that.

As someone building an extension, I searched through the spec for recommendations on how to build my own extension.

Assuming your extension would be associated with an array, does your extension change anything about the process of reading / writing chunks for that array, or does the extension just provide information about the stuff inside the array? In the latter case, you can use the attributes field for your extension. This is what the attributes field is designed for, and people have been using it successfully for years. The former case (an extension that changes how chunks are read / written) is more complicated, and that's the focus of a lot of debate over in zarr-developers/zeps#67.
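
As a rough illustration of the two placements under discussion, the sketch below stores the same extension payload once under attributes and once at the top level, and reads it back with a helper that accepts either. find_extension is a hypothetical helper, and the lookup order is an arbitrary choice for this example.

```python
from typing import Any, Optional


def find_extension(node: dict[str, Any], name: str) -> Optional[dict[str, Any]]:
    """Look for an extension under "attributes" first, then at the top level."""
    attrs = node.get("attributes") or {}
    if name in attrs:
        return attrs[name]
    return node.get(name)


geo = {"name": "gdal", "configuration": {"epsg": "EPSG:32633"}}

# Purely descriptive metadata stored in the attributes field.
in_attributes = {"zarr_format": 3, "node_type": "array", "attributes": {"geo": geo}}

# The same payload stored as a top-level key, as in this PR's example.
at_top_level = {"zarr_format": 3, "node_type": "array", "attributes": {}, "geo": geo}
```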

@geospatial-jeff
Owner Author

It changes how a library like rioxarray would approach indexing over chunks. For example, rioxarray typically relies on x and y variables for bounding box lookups; with the geo extension it can simply parse the affine transform included in the extension. While this does modify reading, I wouldn't expect zarr-python to ever support the geo extension because it is very domain-specific and there are already libraries like rioxarray to handle this domain.
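
To make the transform-based lookup concrete, here is a minimal sketch that turns a bounding box into row and column slices. It assumes the six transform values follow GDAL geotransform order (x origin, pixel width, row rotation, y origin, column rotation, pixel height) and only handles north-up, unrotated grids. bbox_to_window is a hypothetical helper, not a rioxarray or zarr-python function.

```python
import math

# Transform values taken from the example metadata in this PR.
TRANSFORM = (499980.0, 10.0, 0.0, 5400000.0, 0.0, -10.0)


def bbox_to_window(transform, bbox, shape):
    """Map (minx, miny, maxx, maxy) to (row_slice, col_slice) for a 2D grid.

    Assumes GDAL geotransform order and a north-up grid without rotation.
    """
    x0, dx, rx, y0, ry, dy = transform
    if rx != 0.0 or ry != 0.0:
        raise ValueError("rotated transforms are not handled in this sketch")
    if dx <= 0 or dy >= 0:
        raise ValueError("expected a north-up grid (dx > 0, dy < 0)")

    minx, miny, maxx, maxy = bbox
    height, width = shape

    # Columns grow with x; rows grow as y decreases because dy is negative.
    col_start = max(0, math.floor((minx - x0) / dx))
    col_stop = min(width, math.ceil((maxx - x0) / dx))
    row_start = max(0, math.floor((maxy - y0) / dy))
    row_stop = min(height, math.ceil((miny - y0) / dy))
    return slice(row_start, row_stop), slice(col_start, col_stop)


rows, cols = bbox_to_window(
    TRANSFORM,
    bbox=(500000.0, 5399900.0, 500100.0, 5400000.0),
    shape=(1000, 1000),
)
```

The resulting slices can be used to index the array directly, so only the chunks overlapping the bounding box need to be read and no x and y coordinate variables have to be loaded.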

I currently have it stored in attributes on the main branch (see here); moving it to the top level was for example's sake!

At the end of the day I only care about following Zarr's extension mechanisms as closely as possible so whatever I build is interoperable with the rest of the ecosystem.

@d-v-b

d-v-b commented Jul 8, 2025

At the end of the day I only care about following Zarr's extension mechanisms as closely as possible so whatever I build is interoperable with the rest of the ecosystem.

This makes total sense, and I'm sorry that the spec right now isn't clear about how things work. If you have the time, I would find it useful if you could open an issue in the zarr specs repo describing your issues with the way the spec is currently written. I think the spec should be as clear as possible about things like this, so if we are failing there, that's something to fix.

@joshmoore

joshmoore commented Jul 9, 2025

Big thanks for this work, @geospatial-jeff. 👍 I'll tend towards commenting on the ZEP, partially because I won't get or at least notice notifications from this PR.

3 participants