Commit b08fc84

Address Tom comments

1 parent 8030453

7 files changed (+106, −87 lines)

docs/faq.md

Lines changed: 4 additions & 0 deletions

@@ -92,6 +92,10 @@ You can also use this approach to write a reader that starts from a kerchunk-for
 
 Currently if you want to call your new reader from `virtualizarr.open_virtual_dataset` you would need to open a PR to this repository, but we plan to generalize this system to allow 3rd party libraries to plug in via an entrypoint (see [issue #245](https://github.com/zarr-developers/VirtualiZarr/issues/245)).
 
+### What ML/AI model formats are supported?
+
+VirtualiZarr has built-in support for [SafeTensors](safetensors.md) files, which are commonly used for storing ML model weights in a safe, efficient format.
+
 ## How does this actually work?
 
 I'm glad you asked! We can think of the problem of providing virtualized zarr-like access to a set of archival files in some other format as a series of steps:

docs/index.md

Lines changed: 2 additions & 1 deletion

@@ -15,7 +15,7 @@ VirtualiZarr aims to make the creation of cloud-optimized virtualized zarr data
 ## Features
 
 * Create virtual references pointing to bytes inside a archival file with [`open_virtual_dataset`](https://virtualizarr.readthedocs.io/en/latest/usage.html#opening-files-as-virtual-datasets),
-* Supports a [range of archival file formats](https://virtualizarr.readthedocs.io/en/latest/faq.html#how-do-virtualizarr-and-kerchunk-compare), including netCDF4 and HDF5,
+* Supports a [range of archival file formats](https://virtualizarr.readthedocs.io/en/latest/faq.html#how-do-virtualizarr-and-kerchunk-compare), including netCDF4, HDF5, and [SafeTensors](safetensors.md),
 * [Combine data from multiple files](https://virtualizarr.readthedocs.io/en/latest/usage.html#combining-virtual-datasets) into one larger store using [xarray's combining functions](https://docs.xarray.dev/en/stable/user-guide/combining.html), such as [`xarray.concat`](https://docs.xarray.dev/en/stable/generated/xarray.concat.html),
 * Commit the virtual references to storage either using the [Kerchunk references](https://fsspec.github.io/kerchunk/spec.html) specification or the [Icechunk](https://icechunk.io/) transactional storage engine.
 * Users access the virtual dataset using [`xarray.open_dataset`](https://docs.xarray.dev/en/stable/generated/xarray.open_dataset.html#xarray.open_dataset).

@@ -79,6 +79,7 @@ self
 installation
 usage
 examples
+safetensors
 faq
 api
 releases

docs/safetensors.md

Lines changed: 5 additions & 7 deletions

@@ -4,7 +4,8 @@ The SafeTensors reader in VirtualiZarr allows you to reference tensors stored in
 
 ## What is SafeTensors Format?
 
-SafeTensors is a file format for storing tensors (multidimensional arrays) that offers several advantages:
+SafeTensors is a file format developed by HuggingFace for storing tensors (multidimensional arrays)
+that offers several advantages:
 - Safe: No use of pickle, eliminating security concerns
 - Efficient: Zero-copy access for fast loading
 - Simple: Straightforward binary format with JSON header

@@ -18,7 +19,8 @@ The format consists of:
 ## How VirtualiZarr's SafeTensors Reader Works
 
 VirtualiZarr's SafeTensors reader allows you to:
-- Work with the tensors as xarray DataArrays with named dimensions
+- Create "virtual" Zarr stores pointing to chunks of data inside SafeTensors files
+- Open the virtual zarr stores as xarray DataArrays with named dimensions
 - Access specific slices of tensors from cloud storage
 - Preserve metadata from the original SafeTensors file
 

@@ -35,10 +37,6 @@ vds = vz.open_virtual_dataset("model.safetensors")
 # Access tensors as xarray variables
 weight = vds["weight"]
 bias = vds["bias"]
-
-# Convert to numpy arrays when needed
-weight_array = weight.values
-bias_array = bias.values
 ```
 
 ## Custom Dimension Names

@@ -84,7 +82,7 @@ large_tensor = vds["large_tensor"]
 
 The SafeTensors reader supports reading from the HuggingFace Hub:
 ```python
-# S3
+# HuggingFace Hub
 vds = vz.open_virtual_dataset(
     "https://huggingface.co/openai-community/gpt2/model.safetensors",
     virtual_backend_kwargs={"revision": "main"}

docs/usage.md

Lines changed: 22 additions & 0 deletions

@@ -89,6 +89,28 @@ aws_credentials = {"key": ..., "secret": ...}
 vds = open_virtual_dataset("s3://some-bucket/file.nc", reader_options={'storage_options': aws_credentials})
 ```
 
+### Opening different file formats
+
+VirtualiZarr automatically detects the file format based on the file extension or content. Currently supported formats include:
+
+- **NetCDF/HDF5**: Scientific data formats (NetCDF3, NetCDF4/HDF5)
+- **DMRPP**: OPeNDAP Data Access Protocol responses
+- **FITS**: Astronomical data in Flexible Image Transport System format
+- **TIFF**: Tagged Image File Format for geospatial and scientific imagery
+- **SafeTensors**: ML model weights format (`*.safetensors`), see the [SafeTensors guide](safetensors.md) for details
+- **Kerchunk references**: Previously created virtualized references
+
+Each format has specific readers optimized for its structure. For SafeTensors files, additional options like custom dimension naming are available:
+
+```python
+# Open a SafeTensors file with custom dimension names
+custom_dims = {"weight": ["input_features", "output_features"]}
+vds = open_virtual_dataset(
+    "model.safetensors",
+    virtual_backend_kwargs={"dimension_names": custom_dims}
+)
+```
+
 ## Chunk Manifests
 
 In the Zarr model N-dimensional arrays are stored as a series of compressed chunks, each labelled by a chunk key which indicates its position in the array. Whilst conventionally each of these Zarr chunks are a separate compressed binary file stored within a Zarr Store, there is no reason why these chunks could not actually already exist as part of another file (e.g. a netCDF file), and be loaded by reading a specific byte range from this pre-existing file.

pyproject.toml

Lines changed: 1 addition & 0 deletions

@@ -55,6 +55,7 @@ hdf = [
 safetensors = [
     "safetensors",
     "ml-dtypes",
+    "obstore>=0.5.1",
 ]
 
 # kerchunk-based readers

virtualizarr/readers/safetensors.py

Lines changed: 31 additions & 9 deletions

@@ -2,15 +2,10 @@
 import struct
 from collections.abc import Iterable, Mapping
 from pathlib import Path
-from typing import Any, Dict, Optional
+from typing import TYPE_CHECKING, Any, Dict, Optional
 from urllib.parse import urlparse
 
 import numpy as np
-from obstore.store import (
-    HTTPStore,
-    LocalStore,
-    ObjectStore,  # type: ignore[import-not-found]
-)
 from xarray import Dataset, Index
 
 from virtualizarr.manifests import (

@@ -26,6 +21,11 @@
 from virtualizarr.readers.api import VirtualBackend
 from virtualizarr.types import ChunkKey
 
+if TYPE_CHECKING:
+    from obstore.store import (
+        ObjectStore,  # type: ignore[import-not-found]
+    )
+
 
 class SafeTensorsVirtualBackend(VirtualBackend):
     """

@@ -67,7 +67,7 @@ class SafeTensorsVirtualBackend(VirtualBackend):
 
     @staticmethod
     def _parse_safetensors_header(
-        filepath: str, store: ObjectStore
+        filepath: str, store: "ObjectStore"
     ) -> tuple[dict[str, Any], int]:
         """
         Parse the header of a SafeTensors file to extract metadata.

@@ -131,7 +131,7 @@ def _parse_safetensors_header(
     def _create_manifest_group(
         filepath: str,
         drop_variables: list,
-        store: ObjectStore,
+        store: "ObjectStore",
         dimension_names: Optional[Dict[str, list[str]]] = None,
     ) -> ManifestGroup:
         """

@@ -207,8 +207,11 @@ def _create_manifest_group(
 
         data_start = 8 + header_size
 
+        def should_skip_tensor(tensor_name: str, drop_variables: list) -> bool:
+            return tensor_name == "__metadata__" or tensor_name in drop_variables
+
         for tensor_name, tensor_info in header.items():
-            if tensor_name == "__metadata__" or tensor_name in drop_variables:
+            if should_skip_tensor(tensor_name, drop_variables):
                 continue
 
             dtype_str = tensor_info["dtype"]

@@ -328,6 +331,11 @@ def _create_manifest_store(
         ...     revision="v2.0"
         ... )
         """
+        from obstore.store import (
+            HTTPStore,
+            LocalStore,
+        )
+
         store_registry = ObjectStoreRegistry()
         store = default_object_store(filepath)
 

@@ -518,6 +526,20 @@ def _create_chunk_manifest(
         a chunk manifest that points to the exact location of a tensor within the file,
         treating the entire tensor as a single chunk for efficient memory mapping.
 
+        The structure of the variable names within a Safetensors file often reflects a
+        hierarchical organization, commonly represented using a dot separator (e.g.,
+        'a.b.c'). While this structure could naturally map to a nested format like Zarr
+        groups (e.g., a/b/c), the dominant framework for using these models, PyTorch,
+        utilizes a flattened dictionary structure (a 'state dict') where these dot-separated
+        names serve as keys.
+
+        To ease integration with PyTorch's expected format, ChunkManifests are currently a
+        flattened dictionary where the keys are the dot-separated variable names.
+
+        Further consideration could be given to optionally returning the data as an
+        xarray.DataTree to better represent the inherent hierarchical structure, but
+        this has been deferred to prioritize compatibility with PyTorch workflows.
+
         Parameters
         ----------
         filepath : str
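The new docstring text contrasts PyTorch's flat dot-separated state-dict keys with the nested Zarr-group layout ('a.b.c' vs a/b/c). A hypothetical helper, not part of this commit, shows what recovering the hierarchy from flat keys would involve, which is essentially what the deferred xarray.DataTree option would require:

```python
def nest_state_dict(flat: dict) -> dict:
    """Turn dot-separated state-dict keys ('a.b.c') into nested dicts,
    mirroring how 'a.b.c' would map to Zarr groups a/b/c."""
    nested: dict = {}
    for key, value in flat.items():
        *groups, leaf = key.split(".")  # all but the last part are group names
        node = nested
        for group in groups:
            node = node.setdefault(group, {})  # descend, creating groups as needed
        node[leaf] = value
    return nested
```

Flattening back (joining nested paths with ".") would be the inverse, which is why the commit can defer the hierarchical representation without losing information.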

0 commit comments