
VirtualiZarr for incrementally-populated arrays #600

@DahnJ


This issue discusses what VirtualiZarr would need to enable incrementally-populated zarr arrays.

Background: Incrementally populated arrays

In the original issue (zarr-developers/zarr-specs#300), I described the pattern of populating Zarr arrays incrementally and on-demand, chunk-by-chunk.

I then described data structures we use to track and reason about initialized chunks with a potential future open-source solution.

A typical question the user might ask is "in this area, defined by physical coordinates or a geometry, what chunks were initialized?".
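A query like this can be answered from a boolean chunk-space array. The sketch below is a minimal, hypothetical illustration (the function name and the bounding-box query are stand-ins, not the MetaArrays API): a pixel-space box is converted to the range of chunk indices it intersects, and initialized chunks inside it are returned.

```python
# Hypothetical sketch: track initialized chunks as a boolean array indexed in
# chunk space, then answer "which chunks inside this pixel box are initialized?"
import numpy as np

def initialized_chunks_in_box(mask, chunk_shape, box):
    """mask: bool array indexed by chunk; box: ((y0, y1), (x0, x1)) in pixels."""
    (y0, y1), (x0, x1) = box
    cy, cx = chunk_shape
    # Convert the pixel box to the range of chunk indices it intersects.
    ys = slice(y0 // cy, -(-y1 // cy))  # ceil division for the upper bound
    xs = slice(x0 // cx, -(-x1 // cx))
    sub = mask[ys, xs]
    # Return absolute chunk indices of the initialized chunks inside the box.
    iy, ix = np.nonzero(sub)
    return list(zip((iy + ys.start).tolist(), (ix + xs.start).tolist()))

mask = np.zeros((4, 4), dtype=bool)
mask[0, 0] = mask[1, 2] = mask[3, 3] = True
# Pixels 0..199 in y and 150..349 in x, with 100x100-pixel chunks.
print(initialized_chunks_in_box(mask, (100, 100), ((0, 200), (150, 350))))
# → [(1, 2)]
```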


There's a significant overlap between the responsibilities of MetaArrays and VirtualiZarr.

MetaArray example implementation

In DahnJ/metaarrays, I have provided an example implementation of constructing metaarrays from Icechunk's chunk_coordinates.

See walkthrough.ipynb for a walkthrough of the API.
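The core of that construction can be sketched in a few lines. This is a simplified stand-in, not the repository's implementation: it assumes `chunk_coordinates` yields tuples of chunk indices (the exact Icechunk call is not shown; `coords` below is hard-coded for illustration).

```python
# Hedged sketch: build a boolean chunk-space "metaarray" from an iterable of
# initialized chunk indices, such as the output of Icechunk's chunk_coordinates.
import numpy as np

def metaarray_from_coords(coords, array_shape, chunk_shape):
    # One boolean cell per chunk of the underlying array (ceil division).
    grid = tuple(-(-s // c) for s, c in zip(array_shape, chunk_shape))
    meta = np.zeros(grid, dtype=bool)
    for idx in coords:
        meta[idx] = True
    return meta

coords = [(0, 0), (0, 3), (2, 1)]        # stand-in for chunk_coordinates output
meta = metaarray_from_coords(coords, (500, 400), (100, 100))
print(meta.shape, int(meta.sum()))       # → (5, 4) 3
```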

Integration with VirtualiZarr

The following is a discussion of features of VirtualiZarr that would be needed to support metaarray-like functionality.

Read Icechunk manifests

VirtualiZarr needs to be able to read existing Icechunk manifests. The issue earth-mover/icechunk#104 tracks this.

There is also a question about the performance of reading all the information a ManifestArray requires, compared with simply reading the initialized indexes through chunk_coordinates.

Subsetting in the label space

There is discussion (#51) and an implementation (#499) of slicing along chunk boundaries.

I still haven't looked into this in detail. How VirtualiZarr and MetaArrays represent the indexes is arguably the greatest difference:

  • VirtualiZarr indexes individual pixels in label space
  • MetaArrays index chunks in label space

It's a question of whether the limitation of slicing only along chunk boundaries can be lifted. The user should be able to simply ask for any chunks intersecting the query's area and ideally also be able to provide a polygon as a query.

Optimization: Reading a subset

In earth-mover/icechunk#401 it is argued that 100 million chunks should cover everyone's needs.

However, the number of chunks in our datasets frequently goes into billions (largest is 600B), with at most 100s of millions of initialized chunks.

This might seem excessive, but incrementally-populated arrays bring a natural push towards small chunks. This is because the chunk is the smallest writable unit – otherwise we would end up with partially-initialized chunks, which would make it difficult to reason about which data has been populated.

Furthermore, for certain use cases, such as visualisation, small chunks are preferable.

Monthly time-series of high-resolution multi-band data can thus easily get into billions of chunks.
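A back-of-the-envelope calculation with hypothetical (not our actual) dataset dimensions shows how quickly this compounds:

```python
# Back-of-the-envelope with made-up numbers: a monthly, multi-band,
# high-resolution dataset with small chunks reaches billions of chunks.
months, bands = 120, 10            # 10 years of monthly data, 10 bands
height = width = 400_000           # e.g. ~10 m pixels over a continental extent
chunk = 256                        # small chunks: the minimal writable unit
chunks_per_band = -(-height // chunk) * -(-width // chunk)  # ceil division
total = months * bands * chunks_per_band
print(f"{total:,}")                # → 2,931,562,800 (~3 billion chunks)
```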

Optimization: Request a subset of the manifest

Reading all manifests when we only need a subset is wasteful. This would have to be a feature of Icechunk and the manifest data format.

This is perhaps not necessary in the case of chunk_coordinates, which can fetch tens of millions of initialized chunks in seconds. It might however be necessary if we wanted to load datasets with billions or more of (mostly uninitialized) chunks into the ManifestArray.

Discussion

I would love to see MetaArrays dissolve into existing FOSS tooling.

However, it's not clear that VirtualiZarr and MetaArrays overlap sufficiently. Right now, it seems to me that MetaArrays would have to be a whole new duck array implementation alongside ManifestArray, including its own reading method using chunk_coordinates. Perhaps that's stretching the responsibility of VirtualiZarr too far. I will still look into VirtualiZarr more to get a better idea about that.

A big question for me is to what extent any of this is useful to others. Perhaps MetaArrays are too tailored to our specific use case and only subsets of the functionality should be open-sourced.

The Mapper class (see walkthrough.ipynb) could be a good example. However, it is not clear to me where it would live. Since it deals with label space, it belongs at the level of Xarray rather than Zarr itself. Or perhaps a translation between pixel and chunk indices could live in Zarr itself, making the implementation of MetaArrays easier whilst potentially being useful for other use cases.
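To make the layering concrete, here is a hypothetical sketch of such a translation (not the Mapper implementation; the axis parameters are invented): a label-space interval on a regular coordinate axis is mapped to pixel indices, then to the chunk indices that intersect it.

```python
# Hypothetical label -> pixel -> chunk translation for one regular axis,
# where coordinate(i) = origin + i * step. Not the MetaArrays Mapper API.
def labels_to_chunks(lo, hi, origin, step, chunk):
    """Return (first, one-past-last) chunk index covering label interval [lo, hi)."""
    # label -> pixel: floor for the start, ceil for the stop, to cover [lo, hi)
    p0 = int((lo - origin) // step)
    p1 = -int(-(hi - origin) // step)
    # pixel -> chunk: same floor/ceil pattern
    return p0 // chunk, -(-p1 // chunk)

# An axis starting at x=500000 with 10 m pixels and 256-pixel chunks (made up).
print(labels_to_chunks(502500, 530000, 500000, 10, 256))  # → (0, 12)
```

The label-to-pixel step depends on coordinate metadata (Xarray territory), while the pixel-to-chunk step needs only the chunk grid (Zarr territory), which is what makes the split plausible.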
