Use icechunk with UHI compatible histograms #1059

pfackeldey · 2025-07-09T23:26:33Z

pfackeldey
Jul 9, 2025

Hi @rabernat,
as discussed at SciPy, we're (HEP community) are interested to try icechunk on top of UHI compatible histograms that have now a standardized serialization format. The idea is to store histograms in Zarr groups and use a smart chunking that aligns with e.g. categorical bin axes. It's relatively common that during development of a HEP analysis only some bins (chunks) will be updated while the rest remains the same as in a previous run. Our histograms can easily amount to O(100)GB on disk, so saving only small diffs can be very efficient here. For reproducibility it's also interesting for us to be able to go back in time to any histogram in the past (histograms are among our most valuable outputs, they hold all information before we run our statistical inference).

I've prepared a zarr dev branch that can read and write our serialization format for histograms to disk: https://github.com/scikit-hep/uhi/tree/pfackeldey/zarr. Currently, I'm using auto-chunking there, because it was not yet entirely clear to me what the best chunking would be and if icechunk would prefer certain conditions, e.g. chunkings do not change between runs?

Here's a small gist that you can have a look at / play around with: https://gist.github.com/pfackeldey/8ff6f5f9c224587fd154be3e7abed8e5 (You can run it standalone with uv run icechunk_uhi.py)

We're typically writing one large histogram to disk as the output of one "analysis run", it would be great if icechunk could figure out diffs on the chunk-level automatically there.

Given that our serialization format is JSON based (metadata + blob of numeric data), this does seem to be connected to #1054.

Thanks a lot for your help and input!
Best, Peter

(tagging some people who may be interested: @matthewfeickert @henryiii)

paraseba · 2025-07-10T00:13:24Z

paraseba
Jul 10, 2025
Maintainer

Very cool use case!

0 replies

dcherian · 2025-07-12T18:40:23Z

dcherian
Jul 12, 2025
Maintainer

Here is a solution from the sprints:
https://gist.github.com/dcherian/bb9cf78515082c99a35fdc5d50fe06f2

1 reply

pfackeldey Jul 12, 2025
Author

Thank you so much! We'll work on expanding this for HEP next week 🚀

pfackeldey · 2025-07-14T04:59:48Z

pfackeldey
Jul 14, 2025
Author

Update: It looks like I'm getting somewhere: https://github.com/pfackeldey/hizt?tab=readme-ov-file#usage

There are things to be fine-tuned and some help needed to make e.g. structured dtypes work in zarr v3, but the concept works 🎉

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Use icechunk with UHI compatible histograms #1059

Uh oh!

{{title}}

Uh oh!

Replies: 3 comments 1 reply

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Use icechunk with UHI compatible histograms #1059

Uh oh!

pfackeldey Jul 9, 2025

Replies: 3 comments · 1 reply

Uh oh!

paraseba Jul 10, 2025 Maintainer

Uh oh!

dcherian Jul 12, 2025 Maintainer

Uh oh!

pfackeldey Jul 12, 2025 Author

Uh oh!

pfackeldey Jul 14, 2025 Author

pfackeldey
Jul 9, 2025

Replies: 3 comments 1 reply

paraseba
Jul 10, 2025
Maintainer

dcherian
Jul 12, 2025
Maintainer

pfackeldey Jul 12, 2025
Author

pfackeldey
Jul 14, 2025
Author