Use icechunk with UHI compatible histograms #1059
pfackeldey started this conversation in Ideas
Replies: 3 comments 1 reply
-
Very cool use case!
-
Here is a solution from the sprints:
-
Update: It looks like I'm getting somewhere: https://github.com/pfackeldey/hizt?tab=readme-ov-file#usage. There are things to be fine-tuned and some help is needed to make e.g. structured dtypes work in zarr v3, but the concept works 🎉
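For reference, here is a minimal sketch of the field-splitting workaround I'm currently considering for structured dtypes (array and attribute names are placeholders, not what hizt actually does):

```python
import numpy as np
import zarr

# A weighted-storage histogram payload: per-bin value and variance,
# e.g. what boost-histogram's Weight() storage produces.
counts = np.zeros((3, 100), dtype=[("value", "<f8"), ("variance", "<f8")])

group = zarr.open_group("weighted_hist.zarr", mode="w")
group.attrs["fields"] = list(counts.dtype.names)

# While structured dtypes are not usable in zarr v3, store each field as its
# own array, keeping the per-category chunking for both of them.
for field in counts.dtype.names:
    arr = group.create_array(
        field, shape=counts.shape, dtype=counts.dtype[field], chunks=(1, counts.shape[1])
    )
    arr[:] = counts[field]
```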
-
Hi @rabernat,
as discussed at SciPy, we (the HEP community) are interested in trying icechunk on top of UHI-compatible histograms, which now have a standardized serialization format. The idea is to store histograms in Zarr groups and use a chunking that aligns with e.g. categorical bin axes. During the development of a HEP analysis it's relatively common that only some bins (chunks) are updated while the rest remains the same as in a previous run. Our histograms can easily amount to O(100) GB on disk, so saving only small diffs can be very efficient here. For reproducibility it's also interesting for us to be able to go back in time to any past histogram (histograms are among our most valuable outputs; they hold all the information before we run our statistical inference).
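To make the chunking idea concrete, here is a minimal sketch (the axis metadata, array names, and shapes are made up for illustration and do not follow the actual UHI serialization schema) of writing histogram bin contents into a Zarr group so that every categorical bin gets its own chunk:

```python
import numpy as np
import zarr

# Illustrative layout only: a 2D histogram with a categorical "region" axis
# and a regular axis with 100 bins. Names and metadata are placeholders.
regions = ["signal", "control_1", "control_2"]
values = np.random.poisson(5.0, size=(len(regions), 100)).astype("f8")

group = zarr.open_group("example_hist.zarr", mode="w")
group.attrs["axes"] = [
    {"type": "category_str", "categories": regions},
    {"type": "regular", "bins": 100, "lower": 0.0, "upper": 1.0},
]

# Chunk shape (1, n_bins): one chunk per category, so re-filling only the
# "control_1" region touches exactly one chunk on disk.
arr = group.create_array(
    "values", shape=values.shape, dtype=values.dtype, chunks=(1, values.shape[1])
)
arr[:] = values
```

With that chunking, an updated control region rewrites a single chunk, which is exactly the granularity at which a content-addressed store can deduplicate.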
I've prepared a zarr dev branch that can read and write our serialization format for histograms to disk: https://github.com/scikit-hep/uhi/tree/pfackeldey/zarr. Currently, I'm using auto-chunking there, because it was not yet entirely clear to me what the best chunking would be and whether `icechunk` would prefer certain conditions, e.g. that chunkings do not change between runs.
Here's a small gist that you can have a look at / play around with: https://gist.github.com/pfackeldey/8ff6f5f9c224587fd154be3e7abed8e5 (you can run it standalone with `uv run icechunk_uhi.py`).
We're typically writing one large histogram to disk as the output of one "analysis run"; it would be great if `icechunk` could figure out diffs on the chunk level automatically there. Given that our serialization format is JSON based (metadata + a blob of numeric data), this also seems to be connected to #1054.
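For completeness, here is a rough end-to-end sketch of the workflow I have in mind, based on my reading of the icechunk Python docs (the exact call names, paths, and the histogram layout should be treated as assumptions, not as what the gist currently does):

```python
import numpy as np
import zarr
import icechunk

# Local repo for illustration; in practice this would live in object storage.
storage = icechunk.local_filesystem_storage("./hist_repo")
repo = icechunk.Repository.open_or_create(storage)

# Analysis run 1: write the full histogram, one chunk per categorical bin.
session = repo.writable_session("main")
root = zarr.group(store=session.store, overwrite=True)
values = root.create_array("values", shape=(3, 100), dtype="f8", chunks=(1, 100))
values[:] = np.random.poisson(5.0, size=(3, 100))
run1 = session.commit("analysis run 1: full histogram")

# Analysis run 2: only one region changed; ideally only that chunk is rewritten
# and the new snapshot shares all other chunks with run 1.
session = repo.writable_session("main")
root = zarr.open_group(store=session.store)
root["values"][1, :] = np.random.poisson(7.0, size=100)
session.commit("analysis run 2: update control region")

# Time travel for reproducibility: read the histogram exactly as it was after run 1.
old = repo.readonly_session(snapshot_id=run1)
run1_values = zarr.open_group(store=old.store, mode="r")["values"][:]
```

If `icechunk` deduplicates unchanged chunks between those two commits, the second snapshot should only add the rewritten control-region chunk plus a small amount of metadata.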
Thanks a lot for your help and input!
Best, Peter
(tagging some people who may be interested: @matthewfeickert @henryiii)