From f1186ed4e785b4e6c14e9488263ab3e019a13cdb Mon Sep 17 00:00:00 2001 From: ilan-gold Date: Fri, 28 Mar 2025 16:07:48 +0100 Subject: [PATCH 01/33] (docs): add a `zarr` v3 guide --- docs/zarr-v3.md | 25 +++++++++++++++++++++++++ 1 file changed, 25 insertions(+) create mode 100644 docs/zarr-v3.md diff --git a/docs/zarr-v3.md b/docs/zarr-v3.md new file mode 100644 index 000000000..bb96ea051 --- /dev/null +++ b/docs/zarr-v3.md @@ -0,0 +1,25 @@ +# zarr-v3 Guide/Roadmap + +`anndata` now uses the much improved {mod}`zarr` v3 package and also [allows writing of datasets in the v3](https://anndata.readthedocs.io/en/stable/generated/anndata.settings.html#anndata.settings.zarr_write_format) format, with the exception of structured arrays. Users should notice a significant performance improvement, especially for cloud data, but also likely for local data as well. Here is a quick guide on some of our learnings so far: + +## Remote data + +We now provide the {func}`anndata.experimental.read_lazy` feature for reading as much of the {class}`~anndata.AnnData` object as lazily as possible, using {mod}`dask` and {mod}`xarray`. Please note that this feature is experimental and subject to change. To enable this functionality in a performant and feature-complete way for remote data sources, we use [consolidated metadata](https://zarr.readthedocs.io/en/stable/user-guide/consolidated_metadata.html) on the `zarr` store (written by default). Please note that this introduces consistency issues - if you update the structure of the underlying `zarr` store i.e., remove a column from `obs`, the consolidated metadata will no longer be valid. Further, note that without consolidated metadata, we cannot guarantee your stored `AnnData` object will be fully readable. And even if it is fully readable, it will almost certainly be much slower to read. + +There are two ways of opening remote [`zarr` stores from the `zarr-python` package](https://zarr.readthedocs.io/en/stable/api/zarr/storage/index.html), `fsspec` and `obstore`, and both can be used with `read_lazy`. [`obstore` claims to be more performant out-of-the-box](https://developmentseed.org/obstore/latest/performance), but notes that this claim has not been benchmarked with the `uvloop` event loop, which itself claims to be 2X more performant than the default event loop for `python`. + +## Local data + +Local data generally poses a different set of challenges. First, write speeds can be somewhat slow and second, the creation of many small files on a file system can slow down a filesystem. For the "many small files" problem, `zarr` has introduced [sharding in the v3 file format](https://zarr.readthedocs.io/en/stable/user-guide/performance.html#sharding). + +However, `zarr-python` can be slow with sharding throughput as well as writing throughput. If you wish to [speed up](https://github.com/LDeakin/zarr_benchmarks) either operation (or receive a moderate boost for reading in general), a [bridge to the `zarr` implementation in Rust](https://zarrs-python.readthedocs.io/en/latest/) can help with that. + +## GPU i/o + +At the moment, it is unlikely your `anndata` i/o will work if you use [`zarr.enable_gpu`](https://zarr.readthedocs.io/en/stable/user-guide/gpu.html#reading-data-into-device-memory). It's *possible* dense data i/o i.e., using {func}`anndata.io.read_elem` will work as expected, but this functionality is untested - sparse data, awkward arrays, and dataframes will not. 
`kvikio` currently provides a [`GDS`-enabled store](https://docs.rapids.ai/api/kvikio/nightly/api/#kvikio.zarr.GDSStore) although there are no working compressors at the moment exported from the `zarr-python` package (work is [underway for `Zstd`](https://github.com/zarr-developers/zarr-python/pull/2863)).

We anticipate officially supporting this functionality for dense data, sparse data, and possibly awkward arrays in the next minor release, 0.13.

## `async`

At the moment, `anndata` exports no `async` functions. However, `zarr-python` is fully `async` and provides its own event-loop so that users like `anndata` can interact with a synchronous API while still benefitting from `zarr-python`'s asynchronous functionality under that API. We anticipate providing `async` versions of {func}`anndata.io.read_elem` and {func}`anndata.experimental.read_dispatched` so that users can download data asynchronously without using the `zarr-python` event loop. We also would like to create an asynchronous partial reader to enable iterative streaming of a dataset. From c9bdccbaeb93616c8383de2c138fc7c4938016a4 Mon Sep 17 00:00:00 2001 From: ilan-gold Date: Mon, 31 Mar 2025 10:50:19 +0200 Subject: [PATCH 02/33] (chore): clean up `async` docs --- docs/zarr-v3.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/zarr-v3.md b/docs/zarr-v3.md index bb96ea051..59e8a5128 100644 --- a/docs/zarr-v3.md +++ b/docs/zarr-v3.md @@ -22,4 +22,4 @@ We anticipate officially supporting this functionality for d ## `async` -At the moment, `anndata` exports no `async` functions. However, `zarr-python` is fully `async` and provides its own event-loop so that users like `anndata` can interact with a synchronous API while still benefitting from `zarr-python`'s asynchronous functionality under that API. We anticipate providing `async` versions of {func}`anndata.io.read_elem` and {func}`anndata.experimental.read_dispatched` so that users can download data asynchronously without using the `zarr-python` event loop. We also would like to create an asynchronous partial reader to enable iterative streaming of a dataset. +At the moment, `anndata` exports no `async` functions. However, `zarr-python` has a fully `async` API and provides its own event-loop so that users like `anndata` can interact with a synchronous API while still benefitting from `zarr-python`'s asynchronous functionality under that API. We anticipate providing `async` versions of {func}`anndata.io.read_elem` and {func}`anndata.experimental.read_dispatched` so that users can download data asynchronously without using the `zarr-python` event loop. We also would like to create an asynchronous partial reader to enable iterative streaming of a dataset. From 46c08240a8139993181472eb049a075bf92bf146 Mon Sep 17 00:00:00 2001 From: ilan-gold Date: Mon, 31 Mar 2025 11:29:31 +0200 Subject: [PATCH 03/33] (chore): add sharding code snippet --- docs/zarr-v3.md | 31 ++++++++++++++++++++++++++--- 1 file changed, 28 insertions(+), 3 deletions(-) diff --git a/docs/zarr-v3.md b/docs/zarr-v3.md index 59e8a5128..79aaf188a 100644 --- a/docs/zarr-v3.md +++ b/docs/zarr-v3.md @@ -10,9 +10,34 @@ There are two ways of opening remote [`zarr` stores from the `zarr-python` packa ## Local data -Local data generally poses a different set of challenges. First, write speeds can be somewhat slow and second, the creation of many small files on a file system can slow down a filesystem. 
For the "many small files" problem, `zarr` has introduced [sharding in the v3 file format](https://zarr.readthedocs.io/en/stable/user-guide/performance.html#sharding). - -However, `zarr-python` can be slow with sharding throughput as well as writing throughput. If you wish to [speed up](https://github.com/LDeakin/zarr_benchmarks) either operation (or receive a moderate boost for reading in general), a [bridge to the `zarr` implementation in Rust](https://zarrs-python.readthedocs.io/en/latest/) can help with that. +Local data generally poses a different set of challenges. First, write speeds can be somewhat slow and second, the creation of many small files on a file system can slow down a filesystem. For the "many small files" problem, `zarr` has introduced [sharding in the v3 file format](https://zarr.readthedocs.io/en/stable/user-guide/performance.html#sharding). Sharding requires knowledge of the array element you are writing, though, and therefore you will need to use {func}`anndata.experimental.write_dispatched` to use sharding: + +```python +import anndata as ad +from collections.abc import Mapping +from typing import Any +import zarr + +ad.settings.zarr_write_format = 3 # Absolutely crucial! Sharding is only for the v3 file format! + +def write_sharded(group: zarr.Group, adata: ad.AnnData): + def callback(func: ad.experimental.Write, g: zarr.Group, k: str, elem: ad.typing.RWAble, dataset_kwargs: Mapping[str, Any], iospec: ad.experimental.IOSpec): + if iospec.encoding_type in { "array" }: + dataset_kwargs = { "shards": tuple(int(2 ** (16 / len(elem.shape))) for _ in elem.shape), **dataset_kwargs} + dataset_kwargs["chunks"] = tuple(i // 2 for i in dataset_kwargs["shards"]) + elif iospec.encoding_type in { "csr_matrix", "csc_matrix" }: + dataset_kwargs = { "shards": (2**16,), "chunks": (2**8, ), **dataset_kwargs } + func(g, k, elem, dataset_kwargs=dataset_kwargs) + return ad.experimental.write_dispatched(group, "/", adata, callback=callback) +``` + +However, `zarr-python` can be slow with sharding throughput as well as writing throughput. If you wish to [speed up](https://github.com/LDeakin/zarr_benchmarks) either operation (or receive a moderate boost for reading in general), a [bridge to the `zarr` implementation in Rust](https://zarrs-python.readthedocs.io/en/latest/) can help with that: + +```python +import zarr +import zarrs +zarr.config.set({"codec_pipeline.path": "zarrs.ZarrsCodecPipeline"}) +``` ## GPU i/o From a3367a90cb389dbfe965b2c106758d0e4da540c1 Mon Sep 17 00:00:00 2001 From: ilan-gold Date: Mon, 31 Mar 2025 13:12:51 +0200 Subject: [PATCH 04/33] (fix): use `np.dtypes.StringDtype` --- docs/zarr-v3.md | 4 ++++ src/anndata/_io/specs/methods.py | 4 ++-- 2 files changed, 6 insertions(+), 2 deletions(-) diff --git a/docs/zarr-v3.md b/docs/zarr-v3.md index 79aaf188a..df9aa779d 100644 --- a/docs/zarr-v3.md +++ b/docs/zarr-v3.md @@ -33,6 +33,10 @@ def write_sharded(group: zarr.Group, adata: ad.AnnData): However, `zarr-python` can be slow with sharding throughput as well as writing throughput. 
If you wish to [speed up](https://github.com/LDeakin/zarr_benchmarks) either operation (or receive a moderate boost for reading in general), a [bridge to the `zarr` implementation in Rust](https://zarrs-python.readthedocs.io/en/latest/) can help with that: +``` +uv pip install zarrs +``` + ```python import zarr import zarrs diff --git a/src/anndata/_io/specs/methods.py b/src/anndata/_io/specs/methods.py index a8f4b1e44..7b7c330bb 100644 --- a/src/anndata/_io/specs/methods.py +++ b/src/anndata/_io/specs/methods.py @@ -632,7 +632,7 @@ def write_vlen_string_array_zarr( filters, dtype = ( ([VLenUTF8()], object) if ad.settings.zarr_write_format == 2 - else (None, str) + else (None, np.dtypes.StringDType()) ) f.create_array( k, @@ -1287,7 +1287,7 @@ def write_scalar_zarr( case 2, str(): filters, dtype = [VLenUTF8()], object case 3, str(): - filters, dtype = None, str + filters, dtype = None, np.dtypes.StringDType() case _, _: filters, dtype = None, np.array(value).dtype a = f.create_array( From be4066efc4b2160c04873063ca10fb57389037bf Mon Sep 17 00:00:00 2001 From: ilan-gold Date: Mon, 31 Mar 2025 15:38:36 +0200 Subject: [PATCH 05/33] (chore): clarify docs --- docs/zarr-v3.md | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/docs/zarr-v3.md b/docs/zarr-v3.md index df9aa779d..deacf7b63 100644 --- a/docs/zarr-v3.md +++ b/docs/zarr-v3.md @@ -1,6 +1,6 @@ # zarr-v3 Guide/Roadmap -`anndata` now uses the much improved {mod}`zarr` v3 package and also [allows writing of datasets in the v3](https://anndata.readthedocs.io/en/stable/generated/anndata.settings.html#anndata.settings.zarr_write_format) format, with the exception of structured arrays. Users should notice a significant performance improvement, especially for cloud data, but also likely for local data as well. Here is a quick guide on some of our learnings so far: +`anndata` now uses the much improved {mod}`zarr` v3 package and also [allows writing of datasets in the v3 format](https://anndata.readthedocs.io/en/stable/generated/anndata.settings.html#anndata.settings.zarr_write_format), with the exception of structured arrays. Users should notice a significant performance improvement, especially for cloud data, but also likely for local data as well. Here is a quick guide on some of our learnings so far: ## Remote data @@ -31,7 +31,7 @@ def write_sharded(group: zarr.Group, adata: ad.AnnData): return ad.experimental.write_dispatched(group, "/", adata, callback=callback) ``` -However, `zarr-python` can be slow with sharding throughput as well as writing throughput. If you wish to [speed up](https://github.com/LDeakin/zarr_benchmarks) either operation (or receive a moderate boost for reading in general), a [bridge to the `zarr` implementation in Rust](https://zarrs-python.readthedocs.io/en/latest/) can help with that: +However, `zarr-python` can be slow with sharding throughput as well as writing throughput. Thus if you wish to [speed up](https://github.com/LDeakin/zarr_benchmarks) either writing, sharding, or both (or receive a modest speed-boost for reading), a [bridge to the `zarr` implementation in Rust](https://zarrs-python.readthedocs.io/en/latest/) can help with that: ``` uv pip install zarrs @@ -43,6 +43,8 @@ import zarrs zarr.config.set({"codec_pipeline.path": "zarrs.ZarrsCodecPipeline"}) ``` +However, this pipeline is not compatible with all types of zarr store, especially remote stores. 
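Because the codec-pipeline setting is global, you may want to scope it to local writes only. Here is a minimal sketch of doing so — it assumes `zarrs` is installed and that the config object supports the usual context-manager form; the store path and stand-in data are hypothetical:

```python
import anndata as ad
import numpy as np
import zarr

adata = ad.AnnData(X=np.random.random((100, 50)))  # stand-in data

# Scope the Rust codec pipeline to this write only; the default
# zarr-python pipeline is restored once the block exits.
with zarr.config.set({"codec_pipeline.path": "zarrs.ZarrsCodecPipeline"}):
    adata.write_zarr("local_data.zarr")
```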
## GPU i/o

At the moment, it is unlikely your `anndata` i/o will work if you use [`zarr.enable_gpu`](https://zarr.readthedocs.io/en/stable/user-guide/gpu.html#reading-data-into-device-memory). It's *possible* dense data i/o i.e., using {func}`anndata.io.read_elem` will work as expected, but this functionality is untested - sparse data, awkward arrays, and dataframes will not. `kvikio` currently provides a [`GDS`-enabled store](https://docs.rapids.ai/api/kvikio/nightly/api/#kvikio.zarr.GDSStore) although there are no working compressors at the moment exported from the `zarr-python` package (work is [underway for `Zstd`](https://github.com/zarr-developers/zarr-python/pull/2863)).

We anticipate officially supporting this functionality for dense data, sparse data, and possibly awkward arrays in the next minor release, 0.13.

## `async`

At the moment, `anndata` exports no `async` functions. However, `zarr-python` has a fully `async` API and provides its own event-loop so that users like `anndata` can interact with a synchronous API while still benefitting from `zarr-python`'s asynchronous functionality under that API. We anticipate providing `async` versions of {func}`anndata.io.read_elem` and {func}`anndata.experimental.read_dispatched` so that users can download data asynchronously without using the `zarr-python` event loop. We also would like to create an asynchronous partial reader to enable iterative streaming of a dataset. From 2d3f3aaa239c94bbaa391005d5ae35a5eccbbf81 Mon Sep 17 00:00:00 2001 From: ilan-gold Date: Mon, 31 Mar 2025 16:02:26 +0200 Subject: [PATCH 06/33] more docs --- docs/zarr-v3.md | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/docs/zarr-v3.md b/docs/zarr-v3.md index deacf7b63..a4e01f387 100644 --- a/docs/zarr-v3.md +++ b/docs/zarr-v3.md @@ -43,7 +43,13 @@ import zarrs zarr.config.set({"codec_pipeline.path": "zarrs.ZarrsCodecPipeline"}) ``` -However, this pipeline is not compatible with all types of zarr store, especially remote stores. +However, this pipeline is not compatible with all types of zarr store, especially remote stores and there are limitations on where rust can give a performance boost for indexing. We therefore recommend this pipeline for writing full datasets and reading contiguous regions of data.

## Codecs

The default `zarr-python` v3 codec for `v3 file-format` is no longer `blosc` but `zstd`. While `zstd` is more widespread, you may find its performance to not meet your old expectations. Therefore, we recommend passing in the [`BloscCodec`](https://zarr.readthedocs.io/en/stable/api/zarr/codecs/index.html#zarr.codecs.BloscCodec) if you wish to return to the old behavior.

There is currently a bug with `numcodecs` that prevents data written from other non-numcodecs `zstd` implementations from being read in by the default zarr pipeline (to which the above rust pipeline falls back if it cannot handle a datatype or indexing scheme, like `vlen-string`): https://github.com/zarr-developers/numcodecs/issues/424. The same applies to data that may eventually be written by the GPU `zstd` implementation (see below). From fa14046dd533aa803ca04ede955265f2c2817521 Mon Sep 17 00:00:00 2001 From: ilan-gold Date: Mon, 31 Mar 2025 16:17:15 +0200 Subject: [PATCH 07/33] (fix): clarify `zstd` issue --- docs/zarr-v3.md | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/docs/zarr-v3.md b/docs/zarr-v3.md index a4e01f387..530f3e127 100644 --- a/docs/zarr-v3.md +++ b/docs/zarr-v3.md @@ -43,13 +43,15 @@ import zarrs zarr.config.set({"codec_pipeline.path": "zarrs.ZarrsCodecPipeline"}) ``` -However, this pipeline is not compatible with all types of zarr store, especially remote stores and there are limitations on where rust can give a performance boost for indexing. We therefore recommend this pipeline for writing full datasets and reading contiguous regions of data. +However, this pipeline is not compatible with all types of zarr store, especially remote stores and there are limitations on where rust can give a performance boost for indexing. We therefore recommend this pipeline for writing full datasets and reading contiguous regions of said written data. ## Codecs The default `zarr-python` v3 codec for `v3 file-format` is no longer `blosc` but `zstd`. 
While `zstd` is more widespread, you may find its performance to not meet your old expectations. Therefore, we recommend passing in the [`BloscCodec`](https://zarr.readthedocs.io/en/stable/api/zarr/codecs/index.html#zarr.codecs.BloscCodec) if you wish to return to the old behavior.

-There is currently a bug with `numcodecs` that prevents data written from other non-numcodecs `zstd` implementations from being read in by the default zarr pipeline (to which the above rust pipeline falls back if it cannot handle a datatype or indexing scheme, like `vlen-string`): https://github.com/zarr-developers/numcodecs/issues/424. The same applies to data that may eventually be written by the GPU `zstd` implementation (see below). +There is currently a bug with `numcodecs` that prevents data written from other non-numcodecs `zstd` implementations from being read in by the default zarr pipeline (to which the above rust pipeline falls back if it cannot handle a datatype or indexing scheme, like `vlen-string`): https://github.com/zarr-developers/numcodecs/issues/424. Thus it may be advisable to use `BloscCodec` with `zarr` v3 file format data if you wish to use the rust-accelerated pipeline until this issue is resolved. +The same issue with `zstd` applies to data that may eventually be written by the GPU `zstd` implementation (see below). From cb019fca468ab35080d623eacf4fc15baa218efe Mon Sep 17 00:00:00 2001 From: ilan-gold Date: Mon, 31 Mar 2025 16:25:33 +0200 Subject: [PATCH 08/33] (fix): links + index --- docs/index.md | 1 + docs/zarr-v3.md | 24 +++++++++++++++++------- 2 files changed, 18 insertions(+), 7 deletions(-) diff --git a/docs/index.md b/docs/index.md index 06aacaed6..dc55ca3bc 100644 --- a/docs/index.md +++ b/docs/index.md @@ -20,6 +20,7 @@ benchmarks contributing release-notes/index references +zarr-v3 ``` # News diff --git a/docs/zarr-v3.md b/docs/zarr-v3.md index 530f3e127..404b68fdb 100644 --- a/docs/zarr-v3.md +++ b/docs/zarr-v3.md @@ -1,16 +1,16 @@ # zarr-v3 Guide/Roadmap -`anndata` now uses the much improved {mod}`zarr` v3 package and also [allows writing of datasets in the v3 format](https://anndata.readthedocs.io/en/stable/generated/anndata.settings.html#anndata.settings.zarr_write_format), with the exception of structured arrays. Users should notice a significant performance improvement, especially for cloud data, but also likely for local data as well. Here is a quick guide on some of our learnings so far: +`anndata` now uses the much improved {mod}`zarr` v3 package and also allows writing of datasets in the [v3 format], with the exception of structured arrays. Users should notice a significant performance improvement, especially for cloud data, but also likely for local data as well. Here is a quick guide on some of our learnings so far: ## Remote data -We now provide the {func}`anndata.experimental.read_lazy` feature for reading as much of the {class}`~anndata.AnnData` object as lazily as possible, using {mod}`dask` and {mod}`xarray`. Please note that this feature is experimental and subject to change. To enable this functionality in a performant and feature-complete way for remote data sources, we use [consolidated metadata](https://zarr.readthedocs.io/en/stable/user-guide/consolidated_metadata.html) on the `zarr` store (written by default). Please note that this introduces consistency issues - if you update the structure of the underlying `zarr` store i.e., remove a column from `obs`, the consolidated metadata will no longer be valid. Further, note that without consolidated metadata, we cannot guarantee your stored `AnnData` object will be fully readable. And even if it is fully readable, it will almost certainly be much slower to read. +We now provide the {func}`anndata.experimental.read_lazy` feature for reading as much of the {class}`~anndata.AnnData` object as lazily as possible, using `dask` and {mod}`xarray`. Please note that this feature is experimental and subject to change. To enable this functionality in a performant and feature-complete way for remote data sources, we use [consolidated metadata] on the `zarr` store (written by default). Please note that this introduces consistency issues - if you update the structure of the underlying `zarr` store i.e., remove a column from `obs`, the consolidated metadata will no longer be valid. 
Further, note that without consolidated metadata, we cannot guarantee your stored `AnnData` object will be fully readable. And even if it is fully readable, it will almost certainly be much slower to read. +We now provide the {func}`anndata.experimental.read_lazy` feature for reading as much of the {class}`~anndata.AnnData` object as lazily as possible, using `dask` and {mod}`xarray`. Please note that this feature is experimental and subject to change. To enable this functionality in a performant and feature-complete way for remote data sources, we use [consolidated metadata] on the `zarr` store (written by default). Please note that this introduces consistency issues - if you update the structure of the underlying `zarr` store i.e., remove a column from `obs`, the consolidated metadata will no longer be valid. Further, note that without consolidated metadata, we cannot guarantee your stored `AnnData` object will be fully readable. And even if it is fully readable, it will almost certainly be much slower to read. -There are two ways of opening remote [`zarr` stores from the `zarr-python` package](https://zarr.readthedocs.io/en/stable/api/zarr/storage/index.html), `fsspec` and `obstore`, and both can be used with `read_lazy`. [`obstore` claims to be more performant out-of-the-box](https://developmentseed.org/obstore/latest/performance), but notes that this claim has not been benchmarked with the `uvloop` event loop, which itself claims to be 2X more performant than the default event loop for `python`. +There are two ways of opening remote [`zarr` stores] from the `zarr-python` package, `fsspec` and `obstore`, and both can be used with `read_lazy`. [`obstore` claims] to be more performant out-of-the-box, but notes that this claim has not been benchmarked with the `uvloop` event loop, which itself claims to be 2X more performant than the default event loop for `python`. ## Local data -Local data generally poses a different set of challenges. First, write speeds can be somewhat slow and second, the creation of many small files on a file system can slow down a filesystem. For the "many small files" problem, `zarr` has introduced [sharding in the v3 file format](https://zarr.readthedocs.io/en/stable/user-guide/performance.html#sharding). Sharding requires knowledge of the array element you are writing, though, and therefore you will need to use {func}`anndata.experimental.write_dispatched` to use sharding: +Local data generally poses a different set of challenges. First, write speeds can be somewhat slow and second, the creation of many small files on a file system can slow down a filesystem. For the "many small files" problem, `zarr` has introduced [sharding] in the v3 file format. Sharding requires knowledge of the array element you are writing, though, and therefore you will need to use {func}`anndata.experimental.write_dispatched` to use sharding: ```python import anndata as ad @@ -47,7 +47,7 @@ However, this pipeline is not compatible with all types of zarr store, especiall ## Codecs -The default `zarr-python` v3 codec for `v3 file-format` is no longer `blosc` but `zstd`. While `zstd` is more widespread, you may find its performance to not meet your old expectations. Therefore, we recommend passing in the [`BloscCodec`](https://zarr.readthedocs.io/en/stable/api/zarr/codecs/index.html#zarr.codecs.BloscCodec) if you wish to return to the old behavior. +The default `zarr-python` v3 codec for `v3 file-format` is no longer `blosc` but `zstd`. 
While `zstd` is more widespread, you may find its performance to not meet your old expectations. Therefore, we recommend passing in the [`BloscCodec`] if you wish to return to the old behavior.

There is currently a bug with `numcodecs` that prevents data written from other non-numcodecs `zstd` implementations from being read in by the default zarr pipeline (to which the above rust pipeline falls back if it cannot handle a datatype or indexing scheme, like `vlen-string`): https://github.com/zarr-developers/numcodecs/issues/424. Thus it may be advisable to use `BloscCodec` with `zarr` v3 file format data if you wish to use the rust-accelerated pipeline until this issue is resolved.

The same issue with `zstd` applies to data that may eventually be written by the GPU `zstd` implementation (see below).

## GPU i/o

At the moment, it is unlikely your `anndata` i/o will work if you use [`zarr.enable_gpu`]. It's *possible* dense data i/o i.e., using {func}`anndata.io.read_elem` will work as expected, but this functionality is untested - sparse data, awkward arrays, and dataframes will not. `kvikio` currently provides a [`GDS`-enabled store] although there are no working compressors at the moment exported from the `zarr-python` package (work is underway for `Zstd`: https://github.com/zarr-developers/zarr-python/pull/2863).

We anticipate officially supporting this functionality for dense data, sparse data, and possibly awkward arrays in the next minor release, 0.13.

## Asynchronous i/o

At the moment, `anndata` exports no `async` functions. However, `zarr-python` has a fully `async` API and provides its own event-loop so that users like `anndata` can interact with a synchronous API while still benefitting from `zarr-python`'s asynchronous functionality under that API. We anticipate providing `async` versions of {func}`anndata.io.read_elem` and {func}`anndata.experimental.read_dispatched` so that users can download data asynchronously without using the `zarr-python` event loop. We also would like to create an asynchronous partial reader to enable iterative streaming of a dataset.
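In the meantime, one possible user-side pattern is to push the synchronous read onto a worker thread so that an application's own event loop stays responsive — a sketch only, with a hypothetical store path:

```python
import asyncio

import anndata as ad

async def load(path: str) -> ad.AnnData:
    # read_zarr is synchronous; running it in a thread keeps the caller's
    # event loop free while zarr-python's internal loop performs the i/o.
    return await asyncio.to_thread(ad.read_zarr, path)

adata = asyncio.run(load("data.zarr"))
```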
+ + +[v3 format]: https://anndata.readthedocs.io/en/stable/generated/anndata.settings.html#anndata.settings.zarr_write_format +[consolidated metadata]: https://zarr.readthedocs.io/en/stable/user-guide/consolidated_metadata.html +[`zarr` stores]: https://zarr.readthedocs.io/en/stable/api/zarr/storage/index.html +[`obstore` claims]: https://developmentseed.org/obstore/latest/performance +[sharding]: https://zarr.readthedocs.io/en/stable/user-guide/performance.html#sharding +[`BloscCodec`]: https://zarr.readthedocs.io/en/stable/api/zarr/codecs/index.html#zarr.codecs.BloscCodec +[`zarr.enable_gpu`]: https://zarr.readthedocs.io/en/stable/user-guide/gpu.html#reading-data-into-device-memory +[`GDS`-enabled store]: https://docs.rapids.ai/api/kvikio/nightly/api/#kvikio.zarr.GDSStore From 87873d8ea7e8bdc347aadcbae71784b7fda2735c Mon Sep 17 00:00:00 2001 From: ilan-gold Date: Mon, 31 Mar 2025 16:35:40 +0200 Subject: [PATCH 09/33] (chore): more small fixes --- docs/zarr-v3.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/zarr-v3.md b/docs/zarr-v3.md index 404b68fdb..58d5f7657 100644 --- a/docs/zarr-v3.md +++ b/docs/zarr-v3.md @@ -31,7 +31,7 @@ def write_sharded(group: zarr.Group, adata: ad.AnnData): return ad.experimental.write_dispatched(group, "/", adata, callback=callback) ``` -However, `zarr-python` can be slow with sharding throughput as well as writing throughput. Thus if you wish to [speed up](https://github.com/LDeakin/zarr_benchmarks) either writing, sharding, or both (or receive a modest speed-boost for reading), a [bridge to the `zarr` implementation in Rust](https://zarrs-python.readthedocs.io/en/latest/) can help with that: +However, `zarr-python` can be slow with sharding throughput as well as writing throughput. Thus if you wish to speed up either writing, sharding, or both (or receive a modest speed-boost for reading), a bridge to the `zarr` implementation in Rust: https://zarrs-python.readthedocs.io/en/latest/ can help with that (see https://github.com/LDeakin/zarr_benchmarks for benchmarks): ``` uv pip install zarrs From dfd93850afdb8f06a86157911c61fea51a869e5e Mon Sep 17 00:00:00 2001 From: ilan-gold Date: Mon, 31 Mar 2025 16:50:51 +0200 Subject: [PATCH 10/33] (fix): url --- docs/zarr-v3.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/zarr-v3.md b/docs/zarr-v3.md index 58d5f7657..39fccc16f 100644 --- a/docs/zarr-v3.md +++ b/docs/zarr-v3.md @@ -10,7 +10,7 @@ There are two ways of opening remote [`zarr` stores] from the `zarr-python` pack ## Local data -Local data generally poses a different set of challenges. First, write speeds can be somewhat slow and second, the creation of many small files on a file system can slow down a filesystem. For the "many small files" problem, `zarr` has introduced [sharding] in the v3 file format. Sharding requires knowledge of the array element you are writing, though, and therefore you will need to use {func}`anndata.experimental.write_dispatched` to use sharding: +Local data generally poses a different set of challenges. First, write speeds can be somewhat slow and second, the creation of many small files on a file system can slow down a filesystem. For the "many small files" problem, `zarr` has introduced [sharding] in the v3 file format. Sharding requires knowledge of the array element you are writing, though, and therefore you will need to use {func}`anndata.experimental.write_dispatched` to use sharding. 
Here is a short example, although you should tune the sizes to your own use-case and also use the compression that makes the most sense for you: ```python import anndata as ad @@ -47,7 +47,7 @@ However, this pipeline is not compatible with all types of zarr store, especiall ## Codecs -The default `zarr-python` v3 codec for `v3 file-format` is no longer `blosc` but `zstd`. While `zstd` is more widespread, you may find its performance to not meet your old expectations. Therefore, we recommend passing in the [`BloscCodec`] if you wish to return to the old behavior. +The default `zarr-python` v3 codec for the [v3 format] is no longer `blosc` but `zstd`. While `zstd` is more widespread, you may find its performance to not meet your old expectations. Therefore, we recommend passing in the [`BloscCodec`] if you wish to return to the old behavior. There is currently a bug with `numcodecs` that prevents data written from other non-numcodecs `zstd` implementations from being read in by the default zarr pipeline (to which the above rust pipeline falls back if it cannot handle a datatype or indexing scheme, like `vlen-string`): https://github.com/zarr-developers/numcodecs/issues/424. Thus is may be advisable to use `BloscCodec` with `zarr` v3 file format data if you wish to use the rust-accelerated pipeline until this issue is resolved. From a4f5ad644b5d777a22aa6e127902eb57c8b3a099 Mon Sep 17 00:00:00 2001 From: ilan-gold Date: Tue, 1 Apr 2025 10:23:05 +0200 Subject: [PATCH 11/33] (fix): clarify `compressor` argument --- docs/zarr-v3.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/zarr-v3.md b/docs/zarr-v3.md index 39fccc16f..7035ccf85 100644 --- a/docs/zarr-v3.md +++ b/docs/zarr-v3.md @@ -47,7 +47,7 @@ However, this pipeline is not compatible with all types of zarr store, especiall ## Codecs -The default `zarr-python` v3 codec for the [v3 format] is no longer `blosc` but `zstd`. While `zstd` is more widespread, you may find its performance to not meet your old expectations. Therefore, we recommend passing in the [`BloscCodec`] if you wish to return to the old behavior. +The default `zarr-python` v3 codec for the [v3 format] is no longer `blosc` but `zstd`. While `zstd` is more widespread, you may find its performance to not meet your old expectations. Therefore, we recommend passing in the [`BloscCodec`] to `compressor` on {func}`~anndata.AnnData.write_zarr` if you wish to return to the old behavior. There is currently a bug with `numcodecs` that prevents data written from other non-numcodecs `zstd` implementations from being read in by the default zarr pipeline (to which the above rust pipeline falls back if it cannot handle a datatype or indexing scheme, like `vlen-string`): https://github.com/zarr-developers/numcodecs/issues/424. Thus is may be advisable to use `BloscCodec` with `zarr` v3 file format data if you wish to use the rust-accelerated pipeline until this issue is resolved. From daa433af794c015599814c7f4e9034c8128cab96 Mon Sep 17 00:00:00 2001 From: "Philipp A." 
Date: Fri, 4 Apr 2025 09:32:08 +0200 Subject: [PATCH 12/33] one sentence per line --- docs/zarr-v3.md | 42 ++++++++++++++++++++++++++++++++---------- 1 file changed, 32 insertions(+), 10 deletions(-) diff --git a/docs/zarr-v3.md b/docs/zarr-v3.md index 7035ccf85..327dda7f3 100644 --- a/docs/zarr-v3.md +++ b/docs/zarr-v3.md @@ -1,16 +1,28 @@ # zarr-v3 Guide/Roadmap -`anndata` now uses the much improved {mod}`zarr` v3 package and also allows writing of datasets in the [v3 format], with the exception of structured arrays. Users should notice a significant performance improvement, especially for cloud data, but also likely for local data as well. Here is a quick guide on some of our learnings so far: +`anndata` now uses the much improved {mod}`zarr` v3 package and also allows writing of datasets in the [v3 format], with the exception of structured arrays. +Users should notice a significant performance improvement, especially for cloud data, but also likely for local data as well. +Here is a quick guide on some of our learnings so far: ## Remote data -We now provide the {func}`anndata.experimental.read_lazy` feature for reading as much of the {class}`~anndata.AnnData` object as lazily as possible, using `dask` and {mod}`xarray`. Please note that this feature is experimental and subject to change. To enable this functionality in a performant and feature-complete way for remote data sources, we use [consolidated metadata] on the `zarr` store (written by default). Please note that this introduces consistency issues - if you update the structure of the underlying `zarr` store i.e., remove a column from `obs`, the consolidated metadata will no longer be valid. Further, note that without consolidated metadata, we cannot guarantee your stored `AnnData` object will be fully readable. And even if it is fully readable, it will almost certainly be much slower to read. +We now provide the {func}`anndata.experimental.read_lazy` feature for reading as much of the {class}`~anndata.AnnData` object as lazily as possible, using `dask` and {mod}`xarray`. +Please note that this feature is experimental and subject to change. +To enable this functionality in a performant and feature-complete way for remote data sources, we use [consolidated metadata] on the `zarr` store (written by default). +Please note that this introduces consistency issues - if you update the structure of the underlying `zarr` store i.e., remove a column from `obs`, the consolidated metadata will no longer be valid. +Further, note that without consolidated metadata, we cannot guarantee your stored `AnnData` object will be fully readable. +And even if it is fully readable, it will almost certainly be much slower to read. -There are two ways of opening remote [`zarr` stores] from the `zarr-python` package, `fsspec` and `obstore`, and both can be used with `read_lazy`. [`obstore` claims] to be more performant out-of-the-box, but notes that this claim has not been benchmarked with the `uvloop` event loop, which itself claims to be 2X more performant than the default event loop for `python`. +There are two ways of opening remote [`zarr` stores] from the `zarr-python` package, `fsspec` and `obstore`, and both can be used with `read_lazy`. +[`obstore` claims] to be more performant out-of-the-box, but notes that this claim has not been benchmarked with the `uvloop` event loop, which itself claims to be 2X more performant than the default event loop for `python`. ## Local data -Local data generally poses a different set of challenges. 
First, write speeds can be somewhat slow and second, the creation of many small files on a file system can slow down a filesystem. For the "many small files" problem, `zarr` has introduced [sharding] in the v3 file format. Sharding requires knowledge of the array element you are writing, though, and therefore you will need to use {func}`anndata.experimental.write_dispatched` to use sharding. Here is a short example, although you should tune the sizes to your own use-case and also use the compression that makes the most sense for you: +Local data generally poses a different set of challenges. +First, write speeds can be somewhat slow and second, the creation of many small files on a file system can slow down a filesystem. +For the "many small files" problem, `zarr` has introduced [sharding] in the v3 file format. +Sharding requires knowledge of the array element you are writing, though, and therefore you will need to use {func}`anndata.experimental.write_dispatched` to use sharding. +Here is a short example, although you should tune the sizes to your own use-case and also use the compression that makes the most sense for you: ```python import anndata as ad @@ -31,7 +43,8 @@ def write_sharded(group: zarr.Group, adata: ad.AnnData): return ad.experimental.write_dispatched(group, "/", adata, callback=callback) ``` -However, `zarr-python` can be slow with sharding throughput as well as writing throughput. Thus if you wish to speed up either writing, sharding, or both (or receive a modest speed-boost for reading), a bridge to the `zarr` implementation in Rust: https://zarrs-python.readthedocs.io/en/latest/ can help with that (see https://github.com/LDeakin/zarr_benchmarks for benchmarks): +However, `zarr-python` can be slow with sharding throughput as well as writing throughput. +Thus if you wish to speed up either writing, sharding, or both (or receive a modest speed-boost for reading), a bridge to the `zarr` implementation in Rust: https://zarrs-python.readthedocs.io/en/latest/ can help with that (see https://github.com/LDeakin/zarr_benchmarks for benchmarks): ``` uv pip install zarrs @@ -43,25 +56,34 @@ import zarrs zarr.config.set({"codec_pipeline.path": "zarrs.ZarrsCodecPipeline"}) ``` -However, this pipeline is not compatible with all types of zarr store, especially remote stores and there are limitations on where rust can give a performance boost for indexing. We therefore recommend this pipeline for writing full datasets and reading contiguous regions of said written data. +However, this pipeline is not compatible with all types of zarr store, especially remote stores and there are limitations on where rust can give a performance boost for indexing. +We therefore recommend this pipeline for writing full datasets and reading contiguous regions of said written data. ## Codecs -The default `zarr-python` v3 codec for the [v3 format] is no longer `blosc` but `zstd`. While `zstd` is more widespread, you may find its performance to not meet your old expectations. Therefore, we recommend passing in the [`BloscCodec`] to `compressor` on {func}`~anndata.AnnData.write_zarr` if you wish to return to the old behavior. +The default `zarr-python` v3 codec for the [v3 format] is no longer `blosc` but `zstd`. +While `zstd` is more widespread, you may find its performance to not meet your old expectations. +Therefore, we recommend passing in the [`BloscCodec`] to `compressor` on {func}`~anndata.AnnData.write_zarr` if you wish to return to the old behavior. 
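For illustration, a sketch of what that might look like — the codec parameters here are assumptions roughly mirroring the old v2 `blosc`/`lz4` default, and the path is hypothetical:

```python
import anndata as ad
import numpy as np
import zarr

ad.settings.zarr_write_format = 3
adata = ad.AnnData(X=np.random.random((100, 50)))  # stand-in data

# Write with blosc compression instead of the new zstd default.
adata.write_zarr(
    "blosc_compressed.zarr",
    compressor=zarr.codecs.BloscCodec(cname="lz4", clevel=5),
)
```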
There is currently a bug with `numcodecs` that prevents data written from other non-numcodecs `zstd` implementations from being read in by the default zarr pipeline (to which the above rust pipeline falls back if it cannot handle a datatype or indexing scheme, like `vlen-string`): https://github.com/zarr-developers/numcodecs/issues/424.
Thus it may be advisable to use `BloscCodec` with `zarr` v3 file format data if you wish to use the rust-accelerated pipeline until this issue is resolved.

The same issue with `zstd` applies to data that may eventually be written by the GPU `zstd` implementation (see below).

## GPU i/o

At the moment, it is unlikely your `anndata` i/o will work if you use [`zarr.enable_gpu`].
It's *possible* dense data i/o i.e., using {func}`anndata.io.read_elem` will work as expected, but this functionality is untested - sparse data, awkward arrays, and dataframes will not.
`kvikio` currently provides a [`GDS`-enabled store] although there are no working compressors at the moment exported from the `zarr-python` package (work is underway for `Zstd`: https://github.com/zarr-developers/zarr-python/pull/2863).

We anticipate officially supporting this functionality for dense data, sparse data, and possibly awkward arrays in the next minor release, 0.13.

## Asynchronous i/o

At the moment, `anndata` exports no `async` functions.
However, `zarr-python` has a fully `async` API and provides its own event-loop so that users like `anndata` can interact with a synchronous API while still benefitting from `zarr-python`'s asynchronous functionality under that API.
We anticipate providing `async` versions of {func}`anndata.io.read_elem` and {func}`anndata.experimental.read_dispatched` so that users can download data asynchronously without using the `zarr-python` event loop.
We also would like to create an asynchronous partial reader to enable iterative streaming of a dataset.
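Until those wrappers land, adventurous users can drive `zarr-python`'s coroutines directly for simple cases.
A rough sketch against the asynchronous API as we currently understand it (the exact entry points may shift between releases), manually fetching the dense `X` array of a hypothetical store:

```python
import asyncio

import zarr.api.asynchronous as azarr

async def fetch_dense_X(path: str):
    # This bypasses anndata's readers entirely: fine for a plain dense
    # array, but sparse matrices and dataframes need anndata's decoding.
    arr = await azarr.open_array(store=path, path="X")
    return await arr.getitem(...)  # Ellipsis selects the whole array

X = asyncio.run(fetch_dense_X("data.zarr"))
```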
[v3 format]: https://anndata.readthedocs.io/en/stable/generated/anndata.settings.html#anndata.settings.zarr_write_format
[consolidated metadata]: https://zarr.readthedocs.io/en/stable/user-guide/consolidated_metadata.html
[`zarr` stores]: https://zarr.readthedocs.io/en/stable/api/zarr/storage/index.html
[`obstore` claims]: https://developmentseed.org/obstore/latest/performance
[sharding]: https://zarr.readthedocs.io/en/stable/user-guide/performance.html#sharding
[`BloscCodec`]: https://zarr.readthedocs.io/en/stable/api/zarr/codecs/index.html#zarr.codecs.BloscCodec
[`zarr.enable_gpu`]: https://zarr.readthedocs.io/en/stable/user-guide/gpu.html#reading-data-into-device-memory
[`GDS`-enabled store]: https://docs.rapids.ai/api/kvikio/nightly/api/#kvikio.zarr.GDSStore From 2b86efad5b62899aee923b5f785dac6acfaa8a29 Mon Sep 17 00:00:00 2001 From: Ilan Gold Date: Fri, 4 Apr 2025 13:13:25 +0200 Subject: [PATCH 13/33] Apply suggestions from code review Co-authored-by: Philipp A. --- docs/zarr-v3.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/zarr-v3.md b/docs/zarr-v3.md index 327dda7f3..efc6025ba 100644 --- a/docs/zarr-v3.md +++ b/docs/zarr-v3.md @@ -9,12 +9,12 @@ Here is a quick guide on some of our learnings so far: We now provide the {func}`anndata.experimental.read_lazy` feature for reading as much of the {class}`~anndata.AnnData` object as lazily as possible, using `dask` and {mod}`xarray`. Please note that this feature is experimental and subject to change. To enable this functionality in a performant and feature-complete way for remote data sources, we use [consolidated metadata] on the `zarr` store (written by default). -Please note that this introduces consistency issues - if you update the structure of the underlying `zarr` store i.e., remove a column from `obs`, the consolidated metadata will no longer be valid. +Please note that this introduces consistency issues – if you update the structure of the underlying `zarr` store i.e., remove a column from `obs`, the consolidated metadata will no longer be valid. Further, note that without consolidated metadata, we cannot guarantee your stored `AnnData` object will be fully readable. And even if it is fully readable, it will almost certainly be much slower to read. There are two ways of opening remote [`zarr` stores] from the `zarr-python` package, `fsspec` and `obstore`, and both can be used with `read_lazy`. -[`obstore` claims] to be more performant out-of-the-box, but notes that this claim has not been benchmarked with the `uvloop` event loop, which itself claims to be 2X more performant than the default event loop for `python`. +[`obstore` claims] to be more performant out-of-the-box, but notes that this claim has not been benchmarked with the `uvloop` event loop, which itself claims to be 2× more performant than the default event loop for `python`. @@ -73,7 +73,7 @@ The same issue with `zstd` applies to data that may eventually be written by the ## GPU i/o At the moment, it is unlikely your `anndata` i/o will work if you use [`zarr.enable_gpu`]. -It's *possible* dense data i/o i.e., using {func}`anndata.io.read_elem` will work as expected, but this functionality is untested - sparse data, awkward arrays, and dataframes will not. +It's *possible* dense data i/o i.e., using {func}`anndata.io.read_elem` will work as expected, but this functionality is untested – sparse data, awkward arrays, and dataframes will not. `kvikio` currently provides a [`GDS`-enabled store] although there are no working compressors at the moment exported from the `zarr-python` package (work is underway for `Zstd`: https://github.com/zarr-developers/zarr-python/pull/2863). We anticipate officially supporting this functionality for dense data, sparse data, and possibly awkward arrays in the next minor release, 0.13. 
From 3ac84a8c8bf4037fb2cd06054888072f67e94fe5 Mon Sep 17 00:00:00 2001 From: ilan-gold Date: Fri, 4 Apr 2025 15:20:15 +0200 Subject: [PATCH 14/33] (fix): numpy `StringDType` version condition --- src/anndata/_io/specs/methods.py | 15 ++++++++++----- 1 file changed, 10 insertions(+), 5 deletions(-) diff --git a/src/anndata/_io/specs/methods.py b/src/anndata/_io/specs/methods.py index 7b7c330bb..c1f4ebb0d 100644 --- a/src/anndata/_io/specs/methods.py +++ b/src/anndata/_io/specs/methods.py @@ -629,11 +629,16 @@ def write_vlen_string_array_zarr( dataset_kwargs = dataset_kwargs.copy() dataset_kwargs = zarr_v3_compressor_compat(dataset_kwargs) - filters, dtype = ( - ([VLenUTF8()], object) - if ad.settings.zarr_write_format == 2 - else (None, np.dtypes.StringDType()) - ) + match ( + ad.settings.zarr_write_format, + Version(np.__version__) >= Version("2.0.0"), + ): + case 2, _: + filters, dtype = [VLenUTF8()], object + case 3, True: + filters, dtype = None, np.dtypes.StringDType() + case 3, False: + filters, dtype = None, np.dtypes.ObjectDType() f.create_array( k, shape=elem.shape, From b5af99b9f0feb90b56259239423193f65a743618 Mon Sep 17 00:00:00 2001 From: ilan-gold Date: Fri, 4 Apr 2025 15:43:03 +0200 Subject: [PATCH 15/33] (fix): snippet formatting --- docs/zarr-v3.md | 27 +++++++++++++++++++-------- 1 file changed, 19 insertions(+), 8 deletions(-) diff --git a/docs/zarr-v3.md b/docs/zarr-v3.md index efc6025ba..c5a485ae8 100644 --- a/docs/zarr-v3.md +++ b/docs/zarr-v3.md @@ -25,21 +25,32 @@ Sharding requires knowledge of the array element you are writing, though, and th Here is a short example, although you should tune the sizes to your own use-case and also use the compression that makes the most sense for you: ```python +import zarr import anndata as ad from collections.abc import Mapping from typing import Any -import zarr ad.settings.zarr_write_format = 3 # Absolutely crucial! Sharding is only for the v3 file format! def write_sharded(group: zarr.Group, adata: ad.AnnData): - def callback(func: ad.experimental.Write, g: zarr.Group, k: str, elem: ad.typing.RWAble, dataset_kwargs: Mapping[str, Any], iospec: ad.experimental.IOSpec): - if iospec.encoding_type in { "array" }: - dataset_kwargs = { "shards": tuple(int(2 ** (16 / len(elem.shape))) for _ in elem.shape), **dataset_kwargs} + def callback( + func: ad.experimental.Write, + g: zarr.Group, + k: str, + elem: ad.typing.RWAble, + dataset_kwargs: Mapping[str, Any], + iospec: ad.experimental.IOSpec, + ): + if iospec.encoding_type in {"array"}: + dataset_kwargs = { + "shards": tuple(int(2 ** (16 / len(elem.shape))) for _ in elem.shape), + **dataset_kwargs, + } dataset_kwargs["chunks"] = tuple(i // 2 for i in dataset_kwargs["shards"]) - elif iospec.encoding_type in { "csr_matrix", "csc_matrix" }: - dataset_kwargs = { "shards": (2**16,), "chunks": (2**8, ), **dataset_kwargs } + elif iospec.encoding_type in {"csr_matrix", "csc_matrix"}: + dataset_kwargs = {"shards": (2**16,), "chunks": (2**8,), **dataset_kwargs} func(g, k, elem, dataset_kwargs=dataset_kwargs) + return ad.experimental.write_dispatched(group, "/", adata, callback=callback) ``` @@ -65,7 +76,7 @@ The default `zarr-python` v3 codec for the [v3 format] is no longer `blosc` but While `zstd` is more widespread, you may find its performance to not meet your old expectations. Therefore, we recommend passing in the [`BloscCodec`] to `compressor` on {func}`~anndata.AnnData.write_zarr` if you wish to return to the old behavior. 
There is currently a bug with `numcodecs` that prevents data written from other non-numcodecs `zstd` implementations from being read in by the default zarr pipeline (to which the above rust pipeline falls back if it cannot handle a datatype or indexing scheme, like `vlen-string`): {issue}`zarr-developers/numcodecs#424`.
Thus it may be advisable to use `BloscCodec` with `zarr` v3 file format data if you wish to use the rust-accelerated pipeline until this issue is resolved.

The same issue with `zstd` applies to data that may eventually be written by the GPU `zstd` implementation (see below).

## GPU i/o

At the moment, it is unlikely your `anndata` i/o will work if you use [`zarr.enable_gpu`].
It's *possible* dense data i/o i.e., using {func}`anndata.io.read_elem` will work as expected, but this functionality is untested – sparse data, awkward arrays, and dataframes will not.
`kvikio` currently provides a [`GDS`-enabled store] although there are no working compressors at the moment exported from the `zarr-python` package (work is underway for `Zstd`: {pr}`zarr-developers/zarr-python#2863`).

We anticipate officially supporting this functionality for dense data, sparse data, and possibly awkward arrays in the next minor release, 0.13. From 660e367f8a576ad663302c62d7c20d0a0c64b007 Mon Sep 17 00:00:00 2001 From: ilan-gold Date: Fri, 4 Apr 2025 15:44:34 +0200 Subject: [PATCH 16/33] (fix): more sharding info --- docs/zarr-v3.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/docs/zarr-v3.md b/docs/zarr-v3.md index c5a485ae8..fc7b6d110 100644 --- a/docs/zarr-v3.md +++ b/docs/zarr-v3.md @@ -21,7 +21,8 @@ There are two ways of opening remote [`zarr` stores] from the `zarr-python` pack Local data generally poses a different set of challenges. First, write speeds can be somewhat slow and second, the creation of many small files on a file system can slow down a filesystem. For the "many small files" problem, `zarr` has introduced [sharding] in the v3 file format. -Sharding requires knowledge of the array element you are writing, though, and therefore you will need to use {func}`anndata.experimental.write_dispatched` to use sharding. +Sharding requires knowledge of the array element you are writing (such as shape or data type), though, and therefore you will need to use {func}`anndata.experimental.write_dispatched` to use sharding. 
Here is a short example, although you should tune the sizes to your own use-case and also use the compression that makes the most sense for you: ```python From 7c3dd0b2071aee05e8b97ea7057da57c4e21cde5 Mon Sep 17 00:00:00 2001 From: ilan-gold Date: Fri, 4 Apr 2025 17:06:37 +0200 Subject: [PATCH 17/33] (fix): intersphinx linking --- docs/zarr-v3.md | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/docs/zarr-v3.md b/docs/zarr-v3.md index fc7b6d110..f5f1f7b91 100644 --- a/docs/zarr-v3.md +++ b/docs/zarr-v3.md @@ -1,6 +1,6 @@ # zarr-v3 Guide/Roadmap -`anndata` now uses the much improved {mod}`zarr` v3 package and also allows writing of datasets in the [v3 format], with the exception of structured arrays. +`anndata` now uses the much improved {mod}`zarr` v3 package and also allows writing of datasets in the v3 format via {attr}`anndata.settings.remove_unused_category`, with the exception of structured arrays. Users should notice a significant performance improvement, especially for cloud data, but also likely for local data as well. Here is a quick guide on some of our learnings so far: @@ -73,7 +73,7 @@ We therefore recommend this pipeline for writing full datasets and reading conti ## Codecs -The default `zarr-python` v3 codec for the [v3 format] is no longer `blosc` but `zstd`. +The default `zarr-python` v3 codec for the v3 format is no longer `blosc` but `zstd`. While `zstd` is more widespread, you may find its performance to not meet your old expectations. Therefore, we recommend passing in the [`BloscCodec`] to `compressor` on {func}`~anndata.AnnData.write_zarr` if you wish to return to the old behavior. @@ -97,8 +97,6 @@ However, `zarr-python` has a fully `async` API and provides its own event-loop s We anticipate providing `async` versions of {func}`anndata.io.read_elem` and {func}`anndata.experimental.read_dispatched` so that users can download data asynchronously without using the `zarr-python` event loop. We also would like to create an asynchronous partial reader to enable iterative streaming of a dataset. - -[v3 format]: https://anndata.readthedocs.io/en/stable/generated/anndata.settings.html#anndata.settings.zarr_write_format [consolidated metadata]: https://zarr.readthedocs.io/en/stable/user-guide/consolidated_metadata.html [`zarr` stores]: https://zarr.readthedocs.io/en/stable/api/zarr/storage/index.html [`obstore` claims]: https://developmentseed.org/obstore/latest/performance From 1cab3d8a756e1c07af0ec261f83411d17cbd2af7 Mon Sep 17 00:00:00 2001 From: Ilan Gold Date: Fri, 4 Apr 2025 18:28:24 +0200 Subject: [PATCH 18/33] Update docs/zarr-v3.md Co-authored-by: Philipp A. --- docs/zarr-v3.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/zarr-v3.md b/docs/zarr-v3.md index f5f1f7b91..af91ddf4c 100644 --- a/docs/zarr-v3.md +++ b/docs/zarr-v3.md @@ -1,6 +1,6 @@ # zarr-v3 Guide/Roadmap -`anndata` now uses the much improved {mod}`zarr` v3 package and also allows writing of datasets in the v3 format via {attr}`anndata.settings.remove_unused_category`, with the exception of structured arrays. +`anndata` now uses the much improved {mod}`zarr` v3 package and also allows writing of datasets in the v3 format via {attr}`anndata.settings.zarr_write_format`, with the exception of structured arrays. Users should notice a significant performance improvement, especially for cloud data, but also likely for local data as well. 
Here is a quick guide on some of our learnings so far: From 4020071279edf5e53001ed508b873e6c74d06fdf Mon Sep 17 00:00:00 2001 From: ilan-gold Date: Mon, 7 Apr 2025 12:23:35 +0200 Subject: [PATCH 19/33] (fix): notebook tweaks --- docs/tutorials/notebooks | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/tutorials/notebooks b/docs/tutorials/notebooks index 24c1fd147..1be5cae95 160000 --- a/docs/tutorials/notebooks +++ b/docs/tutorials/notebooks @@ -1 +1 @@ -Subproject commit 24c1fd147181f2027017e33ae2b19c1d8434fec4 +Subproject commit 1be5cae95f335573d0f7e1ef8fd31b8a6fe1110a From 92eb6af5795be9be765bf84aa478aed091183633 Mon Sep 17 00:00:00 2001 From: ilan-gold Date: Mon, 7 Apr 2025 12:25:07 +0200 Subject: [PATCH 20/33] (fix): update `zarr-v3` location --- docs/index.md | 1 - docs/tutorials/index.md | 1 + docs/{ => tutorials}/zarr-v3.md | 0 3 files changed, 1 insertion(+), 1 deletion(-) rename docs/{ => tutorials}/zarr-v3.md (100%) diff --git a/docs/index.md b/docs/index.md index dc55ca3bc..06aacaed6 100644 --- a/docs/index.md +++ b/docs/index.md @@ -20,7 +20,6 @@ benchmarks contributing release-notes/index references -zarr-v3 ``` # News diff --git a/docs/tutorials/index.md b/docs/tutorials/index.md index 29bc1f1fc..08ae7fc23 100644 --- a/docs/tutorials/index.md +++ b/docs/tutorials/index.md @@ -15,4 +15,5 @@ notebooks/anndata_dask_array notebooks/awkward-arrays notebooks/{read,write}_dispatched notebooks/read_lazy +zarr-v3 ``` diff --git a/docs/zarr-v3.md b/docs/tutorials/zarr-v3.md similarity index 100% rename from docs/zarr-v3.md rename to docs/tutorials/zarr-v3.md From d3f1b1ccaeecd1c9cc1d683abdd0e6b8f22d1baf Mon Sep 17 00:00:00 2001 From: ilan-gold Date: Mon, 7 Apr 2025 13:51:24 +0200 Subject: [PATCH 21/33] (fix): point at correct docs --- docs/conf.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/conf.py b/docs/conf.py index 6f964b6d1..de2aeb51e 100644 --- a/docs/conf.py +++ b/docs/conf.py @@ -113,7 +113,7 @@ def setup(app: Sphinx): python=("https://docs.python.org/3", None), scipy=("https://docs.scipy.org/doc/scipy", None), sklearn=("https://scikit-learn.org/stable", None), - zarr=("https://zarr.readthedocs.io/en/stable/", None), + zarr=("https://zarr.readthedocs.io/en/latest/", None), xarray=("https://docs.xarray.dev/en/stable", None), ) qualname_overrides = { From 5cc4d165ad84f42491f36c7a5945438e6742ce13 Mon Sep 17 00:00:00 2001 From: ilan-gold Date: Mon, 7 Apr 2025 13:51:45 +0200 Subject: [PATCH 22/33] (chore): update notebook --- docs/tutorials/notebooks | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/tutorials/notebooks b/docs/tutorials/notebooks index 1be5cae95..f302bda51 160000 --- a/docs/tutorials/notebooks +++ b/docs/tutorials/notebooks @@ -1 +1 @@ -Subproject commit 1be5cae95f335573d0f7e1ef8fd31b8a6fe1110a +Subproject commit f302bda517e77e0a631d13f971dc3e5a8f92ab49 From f44c2479f86e9f3a93da6f7c2a4c6319e90470ec Mon Sep 17 00:00:00 2001 From: ilan-gold Date: Mon, 7 Apr 2025 14:47:22 +0200 Subject: [PATCH 23/33] (chore): update notebooks --- docs/tutorials/notebooks | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/tutorials/notebooks b/docs/tutorials/notebooks index f302bda51..dd1ea16af 160000 --- a/docs/tutorials/notebooks +++ b/docs/tutorials/notebooks @@ -1 +1 @@ -Subproject commit f302bda517e77e0a631d13f971dc3e5a8f92ab49 +Subproject commit dd1ea16afcc34f13d28ed0f58c1013eb419c7f41 From 63288c214951360c30238ad4aa1fb86c7291abee Mon Sep 17 00:00:00 2001 From: ilan-gold Date: Tue, 8 Apr 2025 
10:07:17 +0200 Subject: [PATCH 24/33] (fix): most intersphinx links --- docs/conf.py | 2 ++ docs/tutorials/zarr-v3.md | 20 +++++++------------- 2 files changed, 9 insertions(+), 13 deletions(-) diff --git a/docs/conf.py b/docs/conf.py index de2aeb51e..525799fae 100644 --- a/docs/conf.py +++ b/docs/conf.py @@ -115,6 +115,8 @@ def setup(app: Sphinx): sklearn=("https://scikit-learn.org/stable", None), zarr=("https://zarr.readthedocs.io/en/latest/", None), xarray=("https://docs.xarray.dev/en/stable", None), + obstore=("https://developmentseed.org/obstore/latest/", None), + kvikio=("https://docs.rapids.ai/api/kvikio/stable/", None), ) qualname_overrides = { "h5py._hl.group.Group": "h5py.Group", diff --git a/docs/tutorials/zarr-v3.md b/docs/tutorials/zarr-v3.md index af91ddf4c..004f2a2cd 100644 --- a/docs/tutorials/zarr-v3.md +++ b/docs/tutorials/zarr-v3.md @@ -8,19 +8,19 @@ Here is a quick guide on some of our learnings so far: We now provide the {func}`anndata.experimental.read_lazy` feature for reading as much of the {class}`~anndata.AnnData` object as lazily as possible, using `dask` and {mod}`xarray`. Please note that this feature is experimental and subject to change. -To enable this functionality in a performant and feature-complete way for remote data sources, we use [consolidated metadata] on the `zarr` store (written by default). +To enable this functionality in a performant and feature-complete way for remote data sources, we use {doc}`conslidated metadata ` on the `zarr` store (written by default). Please note that this introduces consistency issues – if you update the structure of the underlying `zarr` store i.e., remove a column from `obs`, the consolidated metadata will no longer be valid. Further, note that without consolidated metadata, we cannot guarantee your stored `AnnData` object will be fully readable. And even if it is fully readable, it will almost certainly be much slower to read. -There are two ways of opening remote [`zarr` stores] from the `zarr-python` package, `fsspec` and `obstore`, and both can be used with `read_lazy`. +There are two ways of opening remote `zarr` stores from the `zarr-python` package, {class}`zarr.storage.FsspecStore` and {class}`zarr.storage.ObjectStore`, and both can be used with `read_lazy`. [`obstore` claims] to be more performant out-of-the-box, but notes that this claim has not been benchmarked with the `uvloop` event loop, which itself claims to be 2× more performant than the default event loop for `python`. ## Local data Local data generally poses a different set of challenges. First, write speeds can be somewhat slow and second, the creation of many small files on a file system can slow down a filesystem. -For the "many small files" problem, `zarr` has introduced [sharding] in the v3 file format. +For the "many small files" problem, `zarr` has introduced `{ref} sharding ` in the v3 file format. Sharding requires knowledge of the array element you are writing (such as shape or data type), though, and therefore you will need to use {func}`anndata.experimental.write_dispatched` to use sharding. For example, you cannot shard a 1D array with `shard` sizes `(256, 256)`. Here is a short example, although you should tune the sizes to your own use-case and also use the compression that makes the most sense for you: @@ -75,7 +75,7 @@ We therefore recommend this pipeline for writing full datasets and reading conti The default `zarr-python` v3 codec for the v3 format is no longer `blosc` but `zstd`. 
While `zstd` is more widespread, you may find its performance to not meet your old expectations.
-Therefore, we recommend passing in the [`BloscCodec`] to `compressor` on {func}`~anndata.AnnData.write_zarr` if you wish to return to the old behavior.
+Therefore, we recommend passing in the {class}`zarr.codecs.BloscCodec` to `compressor` on {func}`~anndata.AnnData.write_zarr` if you wish to return to the old behavior.

There is currently a bug with `numcodecs` that prevents data written from other non-numcodecs `zstd` implementations from being read in by the default zarr pipeline (to which the above rust pipeline falls back if it cannot handle a datatype or indexing scheme, like `vlen-string`): {issue}`zarr-developers/numcodecs#424`.
Thus it may be advisable to use `BloscCodec` with `zarr` v3 file format data if you wish to use the rust-accelerated pipeline until this issue is resolved.
@@ -84,9 +84,9 @@ The same issue with `zstd` applies to data that may eventually be written by the

## GPU i/o

-At the moment, it is unlikely your `anndata` i/o will work if you use [`zarr.enable_gpu`].
+At the moment, it is unlikely your `anndata` i/o will work if you use `zarr.enable_gpu `.
It's *possible* dense data i/o i.e., using {func}`anndata.io.read_elem` will work as expected, but this functionality is untested – sparse data, awkward arrays, and dataframes will not.
-`kvikio` currently provides a [`GDS`-enabled store] although there are no working compressors at the moment exported from the `zarr-python` package (work is underway for `Zstd`: {pr}`zarr-developers/zarr-python#2863`).
+`kvikio` currently provides a {class}`kvikio.zarr.GDSStore` although there are no working compressors at the moment exported from the `zarr-python` package (work is underway for `Zstd`: {pr}`zarr-developers/zarr-python#2863`).

We anticipate officially supporting this functionality for dense data, sparse data, and possibly awkward arrays in the next minor release, 0.13.

@@ -95,12 +95,6 @@ We anticipate officially supporting this functionality for d
At the moment, `anndata` exports no `async` functions.
However, `zarr-python` has a fully `async` API and provides its own event-loop so that users like `anndata` can interact with a synchronous API while still benefitting from `zarr-python`'s asynchronous functionality under that API.
We anticipate providing `async` versions of {func}`anndata.io.read_elem` and {func}`anndata.experimental.read_dispatched` so that users can download data asynchronously without using the `zarr-python` event loop.
-We also would like to create an asynchronous partial reader to enable iterative streaming of a dataset.
+We also would like to create an asynchronous partial reader to enable iterative streaming of a dataset.
{doc}`zarr:user-guide/consolidated_metadata` -[consolidated metadata]: https://zarr.readthedocs.io/en/stable/user-guide/consolidated_metadata.html -[`zarr` stores]: https://zarr.readthedocs.io/en/stable/api/zarr/storage/index.html [`obstore` claims]: https://developmentseed.org/obstore/latest/performance -[sharding]: https://zarr.readthedocs.io/en/stable/user-guide/performance.html#sharding -[`BloscCodec`]: https://zarr.readthedocs.io/en/stable/api/zarr/codecs/index.html#zarr.codecs.BloscCodec -[`zarr.enable_gpu`]: https://zarr.readthedocs.io/en/stable/user-guide/gpu.html#reading-data-into-device-memory -[`GDS`-enabled store]: https://docs.rapids.ai/api/kvikio/nightly/api/#kvikio.zarr.GDSStore From 5cb93dc843602eb5c0578beb82055f8f47cc353d Mon Sep 17 00:00:00 2001 From: ilan-gold Date: Tue, 8 Apr 2025 10:18:48 +0200 Subject: [PATCH 25/33] (fix): add `zarrs` intersphinx --- docs/conf.py | 2 ++ docs/tutorials/zarr-v3.md | 3 ++- 2 files changed, 4 insertions(+), 1 deletion(-) diff --git a/docs/conf.py b/docs/conf.py index 525799fae..1f14db724 100644 --- a/docs/conf.py +++ b/docs/conf.py @@ -117,7 +117,9 @@ def setup(app: Sphinx): xarray=("https://docs.xarray.dev/en/stable", None), obstore=("https://developmentseed.org/obstore/latest/", None), kvikio=("https://docs.rapids.ai/api/kvikio/stable/", None), + zarrs=("https://zarrs-python.readthedocs.io/en/stable/", None), ) + qualname_overrides = { "h5py._hl.group.Group": "h5py.Group", "h5py._hl.files.File": "h5py.File", diff --git a/docs/tutorials/zarr-v3.md b/docs/tutorials/zarr-v3.md index 004f2a2cd..8351a4d65 100644 --- a/docs/tutorials/zarr-v3.md +++ b/docs/tutorials/zarr-v3.md @@ -56,7 +56,7 @@ def write_sharded(group: zarr.Group, adata: ad.AnnData): ``` However, `zarr-python` can be slow with sharding throughput as well as writing throughput. -Thus if you wish to speed up either writing, sharding, or both (or receive a modest speed-boost for reading), a bridge to the `zarr` implementation in Rust: https://zarrs-python.readthedocs.io/en/latest/ can help with that (see https://github.com/LDeakin/zarr_benchmarks for benchmarks): +Thus if you wish to speed up either writing, sharding, or both (or receive a modest speed-boost for reading), a bridge to the `zarr` implementation in Rust {doc}`zarrs-python ` can help with that (see the [zarr-benchmarks]): ``` uv pip install zarrs @@ -98,3 +98,4 @@ We anticipate providing `async` versions of {func}`anndata.io.read_elem` and {fu We also would like to create an asynchronous partial reader to enable iterative streaming of a dataset. 
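# Note: installing `zarrs` alone is not enough; per the zarrs-python docs the
# Rust codec pipeline is then switched on from Python, roughly as follows
# (a sketch, verify against the zarrs-python README):
#
#   import zarr
#   import zarrs  # noqa: F401
#
#   zarr.config.set({"codec_pipeline.path": "zarrs.ZarrsCodecPipeline"})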
{doc}`zarr:user-guide/consolidated_metadata`
[`obstore` claims]: https://developmentseed.org/obstore/latest/performance
+[zarr-benchmarks]: https://github.com/LDeakin/zarr_benchmarks

From ecdda515ab7fd66e9787dac0ff6e696cdb7704d2 Mon Sep 17 00:00:00 2001
From: ilan-gold
Date: Tue, 8 Apr 2025 10:20:42 +0200
Subject: [PATCH 26/33] (chore): todo

---
 docs/conf.py | 1 +
 1 file changed, 1 insertion(+)

diff --git a/docs/conf.py b/docs/conf.py
index 1f14db724..702aa8b34 100644
--- a/docs/conf.py
+++ b/docs/conf.py
@@ -113,6 +113,7 @@ def setup(app: Sphinx):
     python=("https://docs.python.org/3", None),
     scipy=("https://docs.scipy.org/doc/scipy", None),
     sklearn=("https://scikit-learn.org/stable", None),
+    # TODO: move back to stable once `ObjectStore` is released
     zarr=("https://zarr.readthedocs.io/en/latest/", None),
     xarray=("https://docs.xarray.dev/en/stable", None),
     obstore=("https://developmentseed.org/obstore/latest/", None),

From 8f440830159bf5fe5698238c7e295d1cf6b6523a Mon Sep 17 00:00:00 2001
From: Ilan Gold
Date: Tue, 8 Apr 2025 11:14:35 +0200
Subject: [PATCH 27/33] Update docs/tutorials/zarr-v3.md

Co-authored-by: Philipp A.

---
 docs/tutorials/zarr-v3.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/tutorials/zarr-v3.md b/docs/tutorials/zarr-v3.md
index 8351a4d65..5f31ee5e7 100644
--- a/docs/tutorials/zarr-v3.md
+++ b/docs/tutorials/zarr-v3.md
@@ -95,7 +95,7 @@ We anticipate officially supporting this functionality for d
At the moment, `anndata` exports no `async` functions.
However, `zarr-python` has a fully `async` API and provides its own event-loop so that users like `anndata` can interact with a synchronous API while still benefitting from `zarr-python`'s asynchronous functionality under that API.
We anticipate providing `async` versions of {func}`anndata.io.read_elem` and {func}`anndata.experimental.read_dispatched` so that users can download data asynchronously without using the `zarr-python` event loop.
-We also would like to create an asynchronous partial reader to enable iterative streaming of a dataset. {doc}`zarr:user-guide/consolidated_metadata`
+We also would like to create an asynchronous partial reader to enable iterative streaming of a dataset.

[`obstore` claims]: https://developmentseed.org/obstore/latest/performance
[zarr-benchmarks]: https://github.com/LDeakin/zarr_benchmarks

From 0fe6868597cacce0fc9c8e72f81316f6aca48dde Mon Sep 17 00:00:00 2001
From: Ilan Gold
Date: Tue, 8 Apr 2025 11:14:46 +0200
Subject: [PATCH 28/33] Update docs/tutorials/zarr-v3.md

Co-authored-by: Philipp A.

---
 docs/tutorials/zarr-v3.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/tutorials/zarr-v3.md b/docs/tutorials/zarr-v3.md
index 5f31ee5e7..751434b32 100644
--- a/docs/tutorials/zarr-v3.md
+++ b/docs/tutorials/zarr-v3.md
@@ -8,7 +8,7 @@ Here is a quick guide on some of our learnings so far:

We now provide the {func}`anndata.experimental.read_lazy` feature for reading as much of the {class}`~anndata.AnnData` object as lazily as possible, using `dask` and {mod}`xarray`.
Please note that this feature is experimental and subject to change.
-To enable this functionality in a performant and feature-complete way for remote data sources, we use {doc}`conslidated metadata ` on the `zarr` store (written by default).
+To enable this functionality in a performant and feature-complete way for remote data sources, we use {doc}`zarr:user-guide/consolidated_metadata` on the `zarr` store (written by default).
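For orientation, the remote flow this change documents looks roughly like the following; a sketch with a hypothetical bucket URL and `obs` column, using the fsspec-backed route (which additionally requires `s3fs`):

```python
import zarr
from anndata.experimental import read_lazy

# With consolidated metadata, opening is roughly one request rather than one per array.
group = zarr.open_group(
    "s3://my-bucket/adata.zarr",  # hypothetical store
    mode="r",
    storage_options={"anon": True},
)
adata = read_lazy(group)  # dask/xarray-backed AnnData; nothing is loaded yet
# Only the touched columns and chunks are fetched when materializing a subset.
subset = adata[adata.obs["cell_type"] == "B cell"].to_memory()  # hypothetical column
```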
Please note that this introduces consistency issues – if you update the structure of the underlying `zarr` store i.e., remove a column from `obs`, the consolidated metadata will no longer be valid.
Further, note that without consolidated metadata, we cannot guarantee your stored `AnnData` object will be fully readable.
And even if it is fully readable, it will almost certainly be much slower to read.

From e806a508d6735f40979ff7f52e933498176186c8 Mon Sep 17 00:00:00 2001
From: ilan-gold
Date: Tue, 8 Apr 2025 11:29:03 +0200
Subject: [PATCH 29/33] (refactor): links

---
 docs/tutorials/zarr-v3.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/tutorials/zarr-v3.md b/docs/tutorials/zarr-v3.md
index 8351a4d65..899a70cf5 100644
--- a/docs/tutorials/zarr-v3.md
+++ b/docs/tutorials/zarr-v3.md
@@ -20,7 +20,7 @@
Local data generally poses a different set of challenges. First, write speeds can be somewhat slow and second, the creation of many small files on a file system can slow down a filesystem.
-For the "many small files" problem, `zarr` has introduced `{ref} sharding ` in the v3 file format.
+For the "many small files" problem, `zarr` has introduced `{ref} sharding ` in the v3 file format.
Sharding requires knowledge of the array element you are writing (such as shape or data type), though, and therefore you will need to use {func}`anndata.experimental.write_dispatched` to use sharding.
For example, you cannot shard a 1D array with `shard` sizes `(256, 256)`.
Here is a short example, although you should tune the sizes to your own use-case and also use the compression that makes the most sense for you:
@@ -84,7 +84,7 @@

## GPU i/o

-At the moment, it is unlikely your `anndata` i/o will work if you use `zarr.enable_gpu `.
+At the moment, it is unlikely your `anndata` i/o will work if you use {ref}`zarr.config.enable_gpu `.
It's *possible* dense data i/o i.e., using {func}`anndata.io.read_elem` will work as expected, but this functionality is untested – sparse data, awkward arrays, and dataframes will not.
`kvikio` currently provides a {class}`kvikio.zarr.GDSStore` although there are no working compressors at the moment exported from the `zarr-python` package (work is underway for `Zstd`: {pr}`zarr-developers/zarr-python#2863`).

From 395080deb3cbcbd98337e32d4270f9820dfb7391 Mon Sep 17 00:00:00 2001
From: ilan-gold
Date: Tue, 8 Apr 2025 14:17:50 +0200
Subject: [PATCH 30/33] (fix): add `Dask` section

---
 docs/tutorials/zarr-v3.md | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/docs/tutorials/zarr-v3.md b/docs/tutorials/zarr-v3.md
index 040702ec6..e2908c4f6 100644
--- a/docs/tutorials/zarr-v3.md
+++ b/docs/tutorials/zarr-v3.md
@@ -82,6 +82,10 @@ Thus it may be advisable to use `BloscCodec` with `zarr` v3 file format data if

The same issue with `zstd` applies to data that may eventually be written by the GPU `zstd` implementation (see below).

+## Dask
+
+Zarr v3 should be compatible with dask, although the default behavior is to use zarr's chunking for its own. With sharding, this behavior may be undesirable as shards can often contain many small chunks, thereby slowing down i/o as dask will need to index into the zarr store for every chunk. Therefore it may be better to customize this behavior by passing `chunks=my_zarr_array.shards` as an argument to the {func}`dask.array.from_zarr` or similar.
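A sketch of that shard-aligned read (assuming a dense, sharded `X`; the zarr v3 `Array.shards` property is `None` for unsharded arrays, hence the fallback):

```python
import dask.array as da
import zarr

z = zarr.open_array("adata.zarr/X", mode="r")  # hypothetical sharded array
# One dask task per shard (one storage object) instead of per tiny zarr chunk.
x = da.from_zarr(z, chunks=z.shards if z.shards is not None else z.chunks)
```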
+ ## GPU i/o At the moment, it is unlikely your `anndata` i/o will work if you use {ref}`zarr.config.enable_gpu `. From dbdfdda0362463c8600776fd0b94d46fee4cff36 Mon Sep 17 00:00:00 2001 From: ilan-gold Date: Tue, 8 Apr 2025 14:18:52 +0200 Subject: [PATCH 31/33] (fix): small changes --- docs/tutorials/zarr-v3.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/tutorials/zarr-v3.md b/docs/tutorials/zarr-v3.md index e2908c4f6..be11fc099 100644 --- a/docs/tutorials/zarr-v3.md +++ b/docs/tutorials/zarr-v3.md @@ -84,7 +84,7 @@ The same issue with `zstd` applies to data that may eventually be written by the ## Dask -Zarr v3 should be compatible with dask, although the default behavior is to use zarr's chunking for its own. With sharding, this behavior may be undesirable as shards can often contain many small chunks, thereby slowing down i/o as dask will need to index into the zarr store for every chunk. Therefore it may be better to customize this behavior by passing `chunks=my_zarr_array.shards` as an argument to the {func}`dask.array.from_zarr` or similar. +Zarr v3 should be compatible with dask, although the default behavior is to use zarr's chunking for dask's own. With sharding, this behavior may be undesirable as shards can often contain many small chunks, thereby slowing down i/o as dask will need to index into the zarr store for every chunk. Therefore it may be better to customize this behavior by passing `chunks=my_zarr_array.shards` as an argument to {func}`dask.array.from_zarr` or similar. ## GPU i/o From 5b737bec67a9677f694023011c725c3e01b3f5bf Mon Sep 17 00:00:00 2001 From: ilan-gold Date: Tue, 8 Apr 2025 18:08:29 +0200 Subject: [PATCH 32/33] (fix) `ref` --- docs/tutorials/zarr-v3.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/tutorials/zarr-v3.md b/docs/tutorials/zarr-v3.md index be11fc099..833c3c9d1 100644 --- a/docs/tutorials/zarr-v3.md +++ b/docs/tutorials/zarr-v3.md @@ -20,7 +20,7 @@ There are two ways of opening remote `zarr` stores from the `zarr-python` packag Local data generally poses a different set of challenges. First, write speeds can be somewhat slow and second, the creation of many small files on a file system can slow down a filesystem. -For the "many small files" problem, `zarr` has introduced `{ref} sharding ` in the v3 file format. +For the "many small files" problem, `zarr` has introduced {ref}`sharding ` in the v3 file format. Sharding requires knowledge of the array element you are writing (such as shape or data type), though, and therefore you will need to use {func}`anndata.experimental.write_dispatched` to use sharding. For example, you cannot shard a 1D array with `shard` sizes `(256, 256)`. Here is a short example, although you should tune the sizes to your own use-case and also use the compression that makes the most sense for you: From 1f25a0d97378f7ac43da3e7b61c4e40e78c60f8f Mon Sep 17 00:00:00 2001 From: ilan-gold Date: Wed, 9 Apr 2025 10:55:20 +0200 Subject: [PATCH 33/33] (fix): must update zarr min version --- pyproject.toml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/pyproject.toml b/pyproject.toml index db0dbfdc4..c41351017 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -47,7 +47,7 @@ dependencies = [ # array-api-compat 1.5 has https://github.com/scverse/anndata/issues/1410 "array_api_compat>1.4,!=1.5", "legacy-api-wrap", - "zarr >=2.15.0, !=3.0.0, !=3.0.1, !=3.0.2, !=3.0.3", + "zarr >=2.18.7, !=3.0.0, !=3.0.1, !=3.0.2, !=3.0.3", ] dynamic = [ "version" ]