Commit b08fc84

Address Tom comments

1 parent 8030453

7 files changed (+106, −87 lines)

docs/faq.md

Lines changed: 4 additions & 0 deletions

@@ -92,6 +92,10 @@ You can also use this approach to write a reader that starts from a kerchunk-for
 
 Currently if you want to call your new reader from `virtualizarr.open_virtual_dataset` you would need to open a PR to this repository, but we plan to generalize this system to allow 3rd party libraries to plug in via an entrypoint (see [issue #245](https://github.com/zarr-developers/VirtualiZarr/issues/245)).
 
+### What ML/AI model formats are supported?
+
+VirtualiZarr has built-in support for [SafeTensors](safetensors.md) files, which are commonly used for storing ML model weights in a safe, efficient format.
+
 ## How does this actually work?
 
 I'm glad you asked! We can think of the problem of providing virtualized zarr-like access to a set of archival files in some other format as a series of steps:

docs/index.md

Lines changed: 2 additions & 1 deletion

@@ -15,7 +15,7 @@ VirtualiZarr aims to make the creation of cloud-optimized virtualized zarr data
 ## Features
 
 * Create virtual references pointing to bytes inside a archival file with [`open_virtual_dataset`](https://virtualizarr.readthedocs.io/en/latest/usage.html#opening-files-as-virtual-datasets),
-* Supports a [range of archival file formats](https://virtualizarr.readthedocs.io/en/latest/faq.html#how-do-virtualizarr-and-kerchunk-compare), including netCDF4 and HDF5,
+* Supports a [range of archival file formats](https://virtualizarr.readthedocs.io/en/latest/faq.html#how-do-virtualizarr-and-kerchunk-compare), including netCDF4, HDF5, and [SafeTensors](safetensors.md),
 * [Combine data from multiple files](https://virtualizarr.readthedocs.io/en/latest/usage.html#combining-virtual-datasets) into one larger store using [xarray's combining functions](https://docs.xarray.dev/en/stable/user-guide/combining.html), such as [`xarray.concat`](https://docs.xarray.dev/en/stable/generated/xarray.concat.html),
 * Commit the virtual references to storage either using the [Kerchunk references](https://fsspec.github.io/kerchunk/spec.html) specification or the [Icechunk](https://icechunk.io/) transactional storage engine.
 * Users access the virtual dataset using [`xarray.open_dataset`](https://docs.xarray.dev/en/stable/generated/xarray.open_dataset.html#xarray.open_dataset).

@@ -79,6 +79,7 @@ self
 installation
 usage
 examples
+safetensors
 faq
 api
 releases

docs/safetensors.md

Lines changed: 5 additions & 7 deletions

@@ -4,7 +4,8 @@ The SafeTensors reader in VirtualiZarr allows you to reference tensors stored in
 
 ## What is SafeTensors Format?
 
-SafeTensors is a file format for storing tensors (multidimensional arrays) that offers several advantages:
+SafeTensors is a file format developed by HuggingFace for storing tensors (multidimensional arrays)
+that offers several advantages:
 - Safe: No use of pickle, eliminating security concerns
 - Efficient: Zero-copy access for fast loading
 - Simple: Straightforward binary format with JSON header

@@ -18,7 +19,8 @@ The format consists of:
 ## How VirtualiZarr's SafeTensors Reader Works
 
 VirtualiZarr's SafeTensors reader allows you to:
-- Work with the tensors as xarray DataArrays with named dimensions
+- Create "virtual" Zarr stores pointing to chunks of data inside SafeTensors files
+- Open the virtual zarr stores as xarray DataArrays with named dimensions
 - Access specific slices of tensors from cloud storage
 - Preserve metadata from the original SafeTensors file
 

@@ -35,10 +37,6 @@ vds = vz.open_virtual_dataset("model.safetensors")
 # Access tensors as xarray variables
 weight = vds["weight"]
 bias = vds["bias"]
-
-# Convert to numpy arrays when needed
-weight_array = weight.values
-bias_array = bias.values
 ```
 
 ## Custom Dimension Names

@@ -84,7 +82,7 @@ large_tensor = vds["large_tensor"]
 
 The SafeTensors reader supports reading from the HuggingFace Hub:
 ```python
-# S3
+# HuggingFace Hub
 vds = vz.open_virtual_dataset(
     "https://huggingface.co/openai-community/gpt2/model.safetensors",
     virtual_backend_kwargs={"revision": "main"}

docs/usage.md

Lines changed: 22 additions & 0 deletions

@@ -89,6 +89,28 @@ aws_credentials = {"key": ..., "secret": ...}
 vds = open_virtual_dataset("s3://some-bucket/file.nc", reader_options={'storage_options': aws_credentials})
 ```
 
+### Opening different file formats
+
+VirtualiZarr automatically detects the file format based on the file extension or content. Currently supported formats include:
+
+- **NetCDF/HDF5**: Scientific data formats (NetCDF3, NetCDF4/HDF5)
+- **DMRPP**: OPeNDAP Data Access Protocol responses
+- **FITS**: Astronomical data in Flexible Image Transport System format
+- **TIFF**: Tagged Image File Format for geospatial and scientific imagery
+- **SafeTensors**: ML model weights format (`*.safetensors`), see the [SafeTensors guide](safetensors.md) for details
+- **Kerchunk references**: Previously created virtualized references
+
+Each format has specific readers optimized for its structure. For SafeTensors files, additional options like custom dimension naming are available:
+
+```python
+# Open a SafeTensors file with custom dimension names
+custom_dims = {"weight": ["input_features", "output_features"]}
+vds = open_virtual_dataset(
+    "model.safetensors",
+    virtual_backend_kwargs={"dimension_names": custom_dims}
+)
+```
+
 ## Chunk Manifests
 
 In the Zarr model N-dimensional arrays are stored as a series of compressed chunks, each labelled by a chunk key which indicates its position in the array. Whilst conventionally each of these Zarr chunks are a separate compressed binary file stored within a Zarr Store, there is no reason why these chunks could not actually already exist as part of another file (e.g. a netCDF file), and be loaded by reading a specific byte range from this pre-existing file.

pyproject.toml

Lines changed: 1 addition & 0 deletions

@@ -55,6 +55,7 @@ hdf = [
 safetensors = [
     "safetensors",
     "ml-dtypes",
+    "obstore>=0.5.1",
 ]
 
 # kerchunk-based readers

virtualizarr/readers/safetensors.py

Lines changed: 31 additions & 9 deletions

@@ -2,15 +2,10 @@
 import struct
 from collections.abc import Iterable, Mapping
 from pathlib import Path
-from typing import Any, Dict, Optional
+from typing import TYPE_CHECKING, Any, Dict, Optional
 from urllib.parse import urlparse
 
 import numpy as np
-from obstore.store import (
-    HTTPStore,
-    LocalStore,
-    ObjectStore,  # type: ignore[import-not-found]
-)
 from xarray import Dataset, Index
 
 from virtualizarr.manifests import (

@@ -26,6 +21,11 @@
 from virtualizarr.readers.api import VirtualBackend
 from virtualizarr.types import ChunkKey
 
+if TYPE_CHECKING:
+    from obstore.store import (
+        ObjectStore,  # type: ignore[import-not-found]
+    )
+
 
 class SafeTensorsVirtualBackend(VirtualBackend):
     """

@@ -67,7 +67,7 @@ class SafeTensorsVirtualBackend(VirtualBackend):
 
     @staticmethod
     def _parse_safetensors_header(
-        filepath: str, store: ObjectStore
+        filepath: str, store: "ObjectStore"
     ) -> tuple[dict[str, Any], int]:
         """
         Parse the header of a SafeTensors file to extract metadata.

@@ -131,7 +131,7 @@ def _parse_safetensors_header(
     def _create_manifest_group(
         filepath: str,
         drop_variables: list,
-        store: ObjectStore,
+        store: "ObjectStore",
         dimension_names: Optional[Dict[str, list[str]]] = None,
     ) -> ManifestGroup:
         """

@@ -207,8 +207,11 @@ def _create_manifest_group(
 
         data_start = 8 + header_size
 
+        def should_skip_tensor(tensor_name: str, drop_variables: list) -> bool:
+            return tensor_name == "__metadata__" or tensor_name in drop_variables
+
         for tensor_name, tensor_info in header.items():
-            if tensor_name == "__metadata__" or tensor_name in drop_variables:
+            if should_skip_tensor(tensor_name, drop_variables):
                 continue
 
             dtype_str = tensor_info["dtype"]

@@ -328,6 +331,11 @@ def _create_manifest_store(
         ...     revision="v2.0"
         ... )
         """
+        from obstore.store import (
+            HTTPStore,
+            LocalStore,
+        )
+
         store_registry = ObjectStoreRegistry()
         store = default_object_store(filepath)
 

@@ -518,6 +526,20 @@ def _create_chunk_manifest(
         a chunk manifest that points to the exact location of a tensor within the file,
         treating the entire tensor as a single chunk for efficient memory mapping.
 
+        The structure of the variable names within a Safetensors file often reflects a
+        hierarchical organization, commonly represented using a dot separator (e.g.,
+        'a.b.c'). While this structure could naturally map to a nested format like Zarr
+        groups (e.g., a/b/c), the dominant framework for using these models, PyTorch,
+        utilizes a flattened dictionary structure (a 'state dict') where these dot-separated
+        names serve as keys.
+
+        To ease integration with PyTorch's expected format, ChunkManifests are currently a
+        flattened dictionary where the keys are the dot-separated variable names.
+
+        Further consideration could be given to optionally returning the data as an
+        xarray.DataTree to better represent the inherent hierarchical structure, but
+        this has been deferred to prioritize compatibility with PyTorch workflows.
+
         Parameters
         ----------
         filepath : str
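The new docstring text contrasts PyTorch's flat dot-separated state-dict keys with the nested Zarr-group layout ('a.b.c' vs a/b/c). A hypothetical helper, not part of this commit, shows what recovering the hierarchy from flat keys would involve, which is essentially what the deferred xarray.DataTree option would require:

```python
def nest_state_dict(flat: dict) -> dict:
    """Turn dot-separated state-dict keys ('a.b.c') into nested dicts,
    mirroring how 'a.b.c' would map to Zarr groups a/b/c."""
    nested: dict = {}
    for key, value in flat.items():
        *groups, leaf = key.split(".")  # all but the last part are group names
        node = nested
        for group in groups:
            node = node.setdefault(group, {})  # descend, creating groups as needed
        node[leaf] = value
    return nested
```

Flattening back (joining nested paths with ".") would be the inverse, which is why the commit can defer the hierarchical representation without losing information.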

0 commit comments