
VirtualiZarr for incrementally-populated arrays #600

@DahnJ


This issue discusses what VirtualiZarr would need to enable incrementally-populated zarr arrays.

Background: Incrementally populated arrays

In the original issue (zarr-developers/zarr-specs#300), I described the pattern of populating Zarr arrays incrementally and on-demand, chunk-by-chunk.

I then described data structures we use to track and reason about initialized chunks with a potential future open-source solution.

A typical question the user might ask is "in this area, defined by physical coordinates or a geometry, what chunks were initialized?".
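A query like this can be answered from a boolean chunk-space array. The sketch below is a minimal, hypothetical illustration (the function name and the bounding-box query are stand-ins, not the MetaArrays API): a pixel-space box is converted to the range of chunk indices it intersects, and initialized chunks inside it are returned.

```python
# Hypothetical sketch: track initialized chunks as a boolean array indexed in
# chunk space, then answer "which chunks inside this pixel box are initialized?"
import numpy as np

def initialized_chunks_in_box(mask, chunk_shape, box):
    """mask: bool array indexed by chunk; box: ((y0, y1), (x0, x1)) in pixels."""
    (y0, y1), (x0, x1) = box
    cy, cx = chunk_shape
    # Convert the pixel box to the range of chunk indices it intersects.
    ys = slice(y0 // cy, -(-y1 // cy))  # ceil division for the upper bound
    xs = slice(x0 // cx, -(-x1 // cx))
    sub = mask[ys, xs]
    # Return absolute chunk indices of the initialized chunks inside the box.
    iy, ix = np.nonzero(sub)
    return list(zip((iy + ys.start).tolist(), (ix + xs.start).tolist()))

mask = np.zeros((4, 4), dtype=bool)
mask[0, 0] = mask[1, 2] = mask[3, 3] = True
# Pixels 0..199 in y and 150..349 in x, with 100x100-pixel chunks.
print(initialized_chunks_in_box(mask, (100, 100), ((0, 200), (150, 350))))
# → [(1, 2)]
```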


There's a significant overlap between the responsibilities of MetaArrays and VirtualiZarr.

MetaArray example implementation

In DahnJ/metaarrays, I have provided an example implementation of constructing metaarrays from Icechunk's chunk_coordinates.

See walkthrough.ipynb for a walkthrough of the API.
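The core of that construction can be sketched in a few lines. This is a simplified stand-in, not the repository's implementation: it assumes `chunk_coordinates` yields tuples of chunk indices (the exact Icechunk call is not shown; `coords` below is hard-coded for illustration).

```python
# Hedged sketch: build a boolean chunk-space "metaarray" from an iterable of
# initialized chunk indices, such as the output of Icechunk's chunk_coordinates.
import numpy as np

def metaarray_from_coords(coords, array_shape, chunk_shape):
    # One boolean cell per chunk of the underlying array (ceil division).
    grid = tuple(-(-s // c) for s, c in zip(array_shape, chunk_shape))
    meta = np.zeros(grid, dtype=bool)
    for idx in coords:
        meta[idx] = True
    return meta

coords = [(0, 0), (0, 3), (2, 1)]        # stand-in for chunk_coordinates output
meta = metaarray_from_coords(coords, (500, 400), (100, 100))
print(meta.shape, int(meta.sum()))       # → (5, 4) 3
```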

Integration with VirtualiZarr

The following is a discussion of features of VirtualiZarr that would be needed to support metaarray-like functionality.

Read Icechunk manifests

VirtualiZarr needs to be able to read existing Icechunk manifests. The issue earth-mover/icechunk#104 tracks this.

There is also a question about the performance of reading all the information a ManifestArray requires, compared with simply reading the initialized indexes through chunk_coordinates.

Subsetting in the label space

There is discussion (#51) and an implementation (#499) of slicing along chunk boundaries.

I still haven't looked into this in detail. How VirtualiZarr and MetaArrays represent the indexes is arguably the greatest difference:

  • VirtualiZarr indexes individual pixels in label space
  • MetaArrays index chunks in label space

It's a question of whether the limitation of slicing only along chunk boundaries can be lifted. The user should be able to simply ask for any chunks intersecting the query's area and ideally also be able to provide a polygon as a query.

Optimization: Reading a subset

In earth-mover/icechunk#401 it is argued that 100 million chunks should cover everyone's needs.

However, the number of chunks in our datasets frequently goes into billions (largest is 600B), with at most 100s of millions of initialized chunks.

This might seem excessive, but incrementally-populated arrays bring a natural push towards small chunks. This is because the chunk is the smallest writable unit – otherwise we would end up with partially-initialized chunks, which would make it difficult to reason about which data has been populated.

Furthermore, for certain use cases, such as visualisation, small chunks are preferable.

Monthly time-series of high-resolution multi-band data can thus easily get into billions of chunks.
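A back-of-the-envelope calculation with hypothetical (not our actual) dataset dimensions shows how quickly this compounds:

```python
# Back-of-the-envelope with made-up numbers: a monthly, multi-band,
# high-resolution dataset with small chunks reaches billions of chunks.
months, bands = 120, 10            # 10 years of monthly data, 10 bands
height = width = 400_000           # e.g. ~10 m pixels over a continental extent
chunk = 256                        # small chunks: the minimal writable unit
chunks_per_band = -(-height // chunk) * -(-width // chunk)  # ceil division
total = months * bands * chunks_per_band
print(f"{total:,}")                # → 2,931,562,800 (~3 billion chunks)
```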

Optimization: Request a subset of the manifest

Reading all manifests when we only need a subset is wasteful. This would have to be a feature of Icechunk and the manifest data format.

This is perhaps not necessary in the case of chunk_coordinates, which can fetch tens of millions of initialized chunks in seconds. It might however be necessary if we wanted to load datasets with billions or more of (mostly uninitialized) chunks into the ManifestArray.

Discussion

I would love to see MetaArrays dissolve into existing FOSS tooling.

However, it's not clear that VirtualiZarr and MetaArrays overlap sufficiently. Right now, it seems to me that MetaArrays would have to be a whole new duck array implementation alongside ManifestArray, including its own reading method using chunk_coordinates. Perhaps that's stretching the responsibility of VirtualiZarr too far. I will still look into VirtualiZarr more to get a better idea about that.

A big question for me is to what extent any of this is useful to others. Perhaps MetaArrays are too tailored to our specific use case and only subsets of the functionality should be open-sourced.

The Mapper class (see walkthrough.ipynb) could be a good example. However, it is not clear to me where it would live. Since it deals with label space, it belongs at the level of Xarray rather than Zarr itself. Or perhaps a translation between pixel and chunk indices could live in Zarr itself, making the implementation of MetaArrays easier whilst potentially being useful for other use cases.
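To make the layering concrete, here is a hypothetical sketch of such a translation (not the Mapper implementation; the axis parameters are invented): a label-space interval on a regular coordinate axis is mapped to pixel indices, then to the chunk indices that intersect it.

```python
# Hypothetical label -> pixel -> chunk translation for one regular axis,
# where coordinate(i) = origin + i * step. Not the MetaArrays Mapper API.
def labels_to_chunks(lo, hi, origin, step, chunk):
    """Return (first, one-past-last) chunk index covering label interval [lo, hi)."""
    # label -> pixel: floor for the start, ceil for the stop, to cover [lo, hi)
    p0 = int((lo - origin) // step)
    p1 = -int(-(hi - origin) // step)
    # pixel -> chunk: same floor/ceil pattern
    return p0 // chunk, -(-p1 // chunk)

# An axis starting at x=500000 with 10 m pixels and 256-pixel chunks (made up).
print(labels_to_chunks(502500, 530000, 500000, 10, 256))  # → (0, 12)
```

The label-to-pixel step depends on coordinate metadata (Xarray territory), while the pixel-to-chunk step needs only the chunk grid (Zarr territory), which is what makes the split plausible.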
