Skip to content

Add refresh_attributes() and implement cache_attrs for Group and Array #3215

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

bojidar-bg
Copy link

Should resolve #3178, if one passes cache_attrs=False when creating the various groups.

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/user-guide/*.rst
  • Changes documented as a new file in changes/
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

@github-actions github-actions bot added the needs release notes Automatically applied to PRs which haven't added release notes label Jul 8, 2025
Copy link

codecov bot commented Jul 8, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 94.75%. Comparing base (378d5af) to head (075a330).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3215      +/-   ##
==========================================
+ Coverage   94.73%   94.75%   +0.01%     
==========================================
  Files          78       78              
  Lines        8646     8670      +24     
==========================================
+ Hits         8191     8215      +24     
  Misses        455      455              
Files with missing lines Coverage Δ
src/zarr/api/synchronous.py 94.66% <ø> (ø)
src/zarr/core/array.py 98.48% <100.00%> (+0.02%) ⬆️
src/zarr/core/group.py 94.94% <100.00%> (+0.06%) ⬆️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Comment on lines +878 to +881
cache_attrs : bool, optional
If True (default), user attributes will be cached for attribute read
operations. If False, user attributes are reloaded from the store prior
to all attribute read operations.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this docstring says true is the default, but the parameter itself has a default of None

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other places defining a cache_attrs argument did the same, so I decided to follow suit. (:

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets break with the past and ensure that the docstring matches the annotation here

@d-v-b
Copy link
Contributor

d-v-b commented Jul 8, 2025

anyone have opinions on whether cache_attrs should have a default of True or False?

@TomAugspurger
Copy link
Contributor

Can someone write up a detailed proposal of the model here? The interaction between in-memory metadata objects and their serialized counterparts is a big topic.

If we're formally committing to something then we should do so carefully, and document it extensively.

@bojidar-bg
Copy link
Author

From my perspective of "just solving the issue", the crux of this PR is the refresh_attributes() function, and not the cache_attrs argument—I implemented cache_attrs only because it looked convenient to knock off another v3 missing feature. (:

As for a model... that would be a good idea, I agree.

If I understand the current main branch code correctly (from a cursory reading, mind it), the current model for caching data is that the store may cache data, but Array/Group never cache data themselves, only metadata. Metadata is cached when opening an array/group, and kept cached until you discard the respective object.
As such, array data and group items are never cached; meanwhile, group attributes and consolidated metadata, as well as array attributes, shapes, codecs, encoders, chunks sizes and such are always cached.
(Critically, since stores can never notify us of remote changes, writing to the same chunk/shard of data from multiple open instances of the same Array, or writing to the metadata of multiple open instances of the same Array or Group results in latest-write-wins behavior; which, with multiple peers resizing the same array, can result in various kinds of inconsistent data.)

This PR adds the possibility of manually (or automatically) reloading just the attributes of an array or group. Ideally, there would be support for reloading the whole metadata in case there might have been changes made by others—but for that, we would need to either make arrays/groups unfrozen, or guide users to re-open the whole array/group (and replace all references to the old version) in case there have been changes made by another process.

As for a better model:

  • Automatically watching for metadata changes seem infeasible. Perhaps support for it can be added in the store, but apart from the local and memory store, I don't think that can work for s3- and http-based stores.
  • Leaving things as-is and relying on users to manually re-open modified groups and arrays sounds like a footgun waiting to happen, especially with longer-running process which might keep a reference to an array around.
  • Adding a refresh_metadata method, with some ways to control metadata caching, seem plausible, but would require the AsyncArray() and AsyncGroup() classes to be unfrozen (so that they can replace their metadata on-the-fly).
  • Merging this PR as-is with its refresh_attributes method is probably also a footgun waiting to happen; as now not only would users have a way to slowly drift out of sync with their arrays' newest shapes, but they would also have a function to sync everything but the array's shape 😅

Thoughts?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs release notes Automatically applied to PRs which haven't added release notes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Group attributes not shared across different group instances with identical store_path, path and name
3 participants