(feat): zarr v3 guide #1948
Merged
Commits (39)
- f1186ed (docs): add a `zarr` v3 guide (ilan-gold)
- c9bdccb (chore): clean up `async` docs (ilan-gold)
- 46c0824 (chore): add sharding code snippet (ilan-gold)
- a3367a9 (fix): use `np.dtypes.StringDtype` (ilan-gold)
- 2b6ed6f Merge branch 'main' into ig/zarr_v3_doc (ilan-gold)
- be4066e (chore): clarify docs (ilan-gold)
- 2d3f3aa more docs (ilan-gold)
- 77f9af7 Merge branch 'ig/zarr_v3_doc' of github.com:scverse/anndata into ig/z… (ilan-gold)
- fa14046 (fix): clarify `zstd` issue (ilan-gold)
- cb019fc (fix): links + index (ilan-gold)
- 87873d8 (chore): more small fixes (ilan-gold)
- dfd9385 (fix): url (ilan-gold)
- a4f5ad6 (fix): clarify `compressor` argument (ilan-gold)
- daa433a one sentence per line (flying-sheep)
- 2b86efa Apply suggestions from code review (ilan-gold)
- 3ac84a8 (fix): numpy `StringDType` version condition (ilan-gold)
- b5af99b (fix): snippet formatting (ilan-gold)
- 660e367 (fix): more sharding info (ilan-gold)
- 7c3dd0b (fix): intersphinx linking (ilan-gold)
- 1cab3d8 Update docs/zarr-v3.md (ilan-gold)
- 2c3e6ad Merge branch 'main' into ig/zarr_v3_doc (ilan-gold)
- 4020071 (fix): notebook tweaks (ilan-gold)
- 92eb6af (fix): update `zarr-v3` location (ilan-gold)
- d3f1b1c (fix): point at correct docs (ilan-gold)
- 5cc4d16 (chore): update notebook (ilan-gold)
- f44c247 (chore): update notebooks (ilan-gold)
- 63288c2 (fix): most intersphinx links (ilan-gold)
- 5cb93dc (fix): add `zarrs` intersphinx (ilan-gold)
- ecdda51 (chore): todo (ilan-gold)
- 8f44083 Update docs/tutorials/zarr-v3.md (ilan-gold)
- 0fe6868 Update docs/tutorials/zarr-v3.md (ilan-gold)
- e806a50 (refactor): links (ilan-gold)
- 17ae029 Merge branch 'ig/zarr_v3_doc' of github.com:scverse/anndata into ig/z… (ilan-gold)
- cc50263 Merge branch 'main' into ig/zarr_v3_doc (ilan-gold)
- 395080d (fix): add `Dask` section (ilan-gold)
- d179c1e Merge branch 'ig/zarr_v3_doc' of github.com:scverse/anndata into ig/z… (ilan-gold)
- dbdfdda (fix): small changes (ilan-gold)
- 5b737be (fix) `ref` (ilan-gold)
- 1f25a0d (fix): must update zarr min version (ilan-gold)
# zarr-v3 Guide/Roadmap

`anndata` now uses the much improved {mod}`zarr` v3 package and also allows writing of datasets in the v3 format via {attr}`anndata.settings.zarr_write_format`, with the exception of structured arrays.
Users should notice a significant performance improvement, especially for cloud data, and likely for local data as well.
Here is a quick guide to some of our learnings so far:

## Remote data

We now provide the {func}`anndata.experimental.read_lazy` feature for reading as much of the {class}`~anndata.AnnData` object as possible lazily, using `dask` and {mod}`xarray`.
Please note that this feature is experimental and subject to change.
To enable this functionality in a performant and feature-complete way for remote data sources, we use [consolidated metadata] on the `zarr` store (written by default).
Please note that this introduces consistency issues: if you update the structure of the underlying `zarr` store, e.g., by removing a column from `obs`, the consolidated metadata will no longer be valid.
Further, note that without consolidated metadata, we cannot guarantee your stored `AnnData` object will be fully readable.
And even if it is fully readable, it will almost certainly be much slower to read.
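
If you do restructure a store in place, the consolidated metadata can be rebuilt afterwards; a minimal sketch (the store path is a placeholder):

```python
import zarr

# rewrite the consolidated metadata after structural changes
# (e.g., after removing a column from `obs`) so it is valid again
zarr.consolidate_metadata("adata.zarr")
```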

There are two ways of opening remote [`zarr` stores] from the `zarr-python` package, `fsspec` and `obstore`, and both can be used with `read_lazy`.
[`obstore` claims] to be more performant out-of-the-box, but notes that this claim has not been benchmarked with the `uvloop` event loop, which itself claims to be 2× more performant than the default `python` event loop.
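
As a minimal sketch of the `fsspec` route (the bucket URL and anonymous-access option are illustrative):

```python
import anndata as ad
import zarr

# open a read-only fsspec-backed store; the URL is a placeholder
store = zarr.storage.FsspecStore.from_url(
    "s3://my-bucket/adata.zarr",
    storage_options={"anon": True},
    read_only=True,
)
adata = ad.experimental.read_lazy(store)  # metadata is read, data stays remote
```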

## Local data

Local data generally poses a different set of challenges.
First, write speeds can be somewhat slow, and second, the creation of many small files can slow down a filesystem.
For the "many small files" problem, `zarr` has introduced [sharding] in the v3 file format.
Sharding requires knowledge of the array element you are writing (such as its shape and data type), though, and therefore you will need to use {func}`anndata.experimental.write_dispatched` to use sharding.
For example, you cannot shard a 1D array with `shard` sizes `(256, 256)`.
Here is a short example, although you should tune the sizes to your own use case and also use the compression that makes the most sense for you:

```python
import zarr
import anndata as ad
from collections.abc import Mapping
from typing import Any

ad.settings.zarr_write_format = 3  # Absolutely crucial! Sharding is only for the v3 file format!


def write_sharded(group: zarr.Group, adata: ad.AnnData):
    def callback(
        func: ad.experimental.Write,
        g: zarr.Group,
        k: str,
        elem: ad.typing.RWAble,
        dataset_kwargs: Mapping[str, Any],
        iospec: ad.experimental.IOSpec,
    ):
        if iospec.encoding_type in {"array"}:
            # spread ~2**16 elements per shard evenly across the array's dimensions
            dataset_kwargs = {
                "shards": tuple(int(2 ** (16 / len(elem.shape))) for _ in elem.shape),
                **dataset_kwargs,
            }
            # halve the shard size along each dimension to get the chunk size
            dataset_kwargs["chunks"] = tuple(i // 2 for i in dataset_kwargs["shards"])
        elif iospec.encoding_type in {"csr_matrix", "csc_matrix"}:
            # the components of sparse matrices are 1D, so use 1D shards/chunks
            dataset_kwargs = {"shards": (2**16,), "chunks": (2**8,), **dataset_kwargs}
        func(g, k, elem, dataset_kwargs=dataset_kwargs)

    return ad.experimental.write_dispatched(group, "/", adata, callback=callback)
```
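
A quick usage sketch for the function above (the toy data and file name are illustrative):

```python
import numpy as np

# a small dense AnnData purely for demonstration
adata = ad.AnnData(X=np.random.default_rng(0).random((1024, 256), dtype=np.float32))
g = zarr.open_group("adata_sharded.zarr", mode="w")
write_sharded(g, adata)
```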

However, `zarr-python`'s sharding and writing throughput can be slow.
Thus, if you wish to speed up writing, sharding, or both (or receive a modest speed boost for reading), [`zarrs-python`](https://zarrs-python.readthedocs.io/en/latest/), a bridge to the `zarr` implementation in Rust, can help (see [zarr_benchmarks](https://github.com/LDeakin/zarr_benchmarks) for benchmarks):

```
uv pip install zarrs
```

```python
import zarr
import zarrs

# route zarr's encoding/decoding through the Rust codec pipeline
zarr.config.set({"codec_pipeline.path": "zarrs.ZarrsCodecPipeline"})
```

However, this pipeline is not compatible with all types of `zarr` store, especially remote stores, and there are limitations on where Rust can give a performance boost for indexing.
We therefore recommend this pipeline for writing full datasets and reading contiguous regions of the written data.

## Codecs

The default `zarr-python` v3 codec for the v3 format is no longer `blosc` but `zstd`.
While `zstd` is more widespread, you may find that its performance does not meet your old expectations.
Therefore, we recommend passing the [`BloscCodec`] to `compressor` on {func}`~anndata.AnnData.write_zarr` if you wish to return to the old behavior.
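
For example, a sketch of restoring `blosc` compression (assuming `adata` is an existing {class}`~anndata.AnnData`; the codec parameters shown are illustrative):

```python
import zarr

# pass the codec through to the arrays being written
adata.write_zarr(
    "adata.zarr",
    compressor=zarr.codecs.BloscCodec(cname="lz4", clevel=5),
)
```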

There is currently a bug with `numcodecs` that prevents data written by other (non-`numcodecs`) `zstd` implementations from being read in by the default zarr pipeline (to which the above Rust pipeline falls back if it cannot handle a datatype or indexing scheme, like `vlen-string`): {issue}`zarr-developers/numcodecs#424`.
Thus it may be advisable to use `BloscCodec` with `zarr` v3 file format data if you wish to use the Rust-accelerated pipeline until this issue is resolved.

The same issue with `zstd` applies to data that may eventually be written by the GPU `zstd` implementation (see below).

## GPU i/o

At the moment, it is unlikely your `anndata` i/o will work if you use [`zarr.enable_gpu`].
It's *possible* that dense data i/o, i.e., using {func}`anndata.io.read_elem`, will work as expected, but this functionality is untested; sparse data, awkward arrays, and dataframes will not.
`kvikio` currently provides a [`GDS`-enabled store], although there are no working compressors exported from the `zarr-python` package at the moment (work is underway for `Zstd`: {pr}`zarr-developers/zarr-python#2863`).
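
For the adventurous, here is a sketch of what attempting a GPU-backed dense read might look like; as noted above, this is untested for `anndata`, and the store path is a placeholder:

```python
import anndata as ad
import zarr

# `zarr.config.enable_gpu()` swaps zarr's buffer classes for GPU-backed ones;
# used as a context manager so the setting is scoped to this block
with zarr.config.enable_gpu():
    g = zarr.open_group("adata.zarr", mode="r")
    x = ad.io.read_elem(g["X"])  # dense data *may* land in device memory; sparse will not work
```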

We anticipate officially supporting this functionality for dense data, sparse data, and possibly awkward arrays in the next minor release, 0.13.

## Asynchronous i/o

At the moment, `anndata` exports no `async` functions.
However, `zarr-python` has a fully `async` API and provides its own event loop so that users like `anndata` can interact with a synchronous API while still benefiting from `zarr-python`'s asynchronous functionality under that API.
We anticipate providing `async` versions of {func}`anndata.io.read_elem` and {func}`anndata.experimental.read_dispatched` so that users can download data asynchronously without using the `zarr-python` event loop.
We also would like to create an asynchronous partial reader to enable iterative streaming of a dataset.
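
Until then, `zarr-python`'s own `async` API can be used directly for element-level access; a rough sketch (the store path and the `obs/_index` key are assumptions about the on-disk layout, and this bypasses `anndata`'s readers):

```python
import asyncio

import zarr


async def fetch_obs_names(path: str):
    # open the group and fetch one array without blocking on zarr's internal event loop
    group = await zarr.api.asynchronous.open_group(store=path, mode="r")
    arr = await group.getitem("obs/_index")
    return await arr.getitem(slice(None))


obs_names = asyncio.run(fetch_obs_names("adata.zarr"))
```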

[consolidated metadata]: https://zarr.readthedocs.io/en/stable/user-guide/consolidated_metadata.html
[`zarr` stores]: https://zarr.readthedocs.io/en/stable/api/zarr/storage/index.html
[`obstore` claims]: https://developmentseed.org/obstore/latest/performance
[sharding]: https://zarr.readthedocs.io/en/stable/user-guide/performance.html#sharding
[`BloscCodec`]: https://zarr.readthedocs.io/en/stable/api/zarr/codecs/index.html#zarr.codecs.BloscCodec
[`zarr.enable_gpu`]: https://zarr.readthedocs.io/en/stable/user-guide/gpu.html#reading-data-into-device-memory
[`GDS`-enabled store]: https://docs.rapids.ai/api/kvikio/nightly/api/#kvikio.zarr.GDSStore