feat: use values for all types #152

henryiii · 2025-04-16T18:33:03Z

This modifies the schema to make the storage types more consistent. When implementing it, I have to put if statements around this portion; it would be simpler if every storage had the same structure.

Now they all are structured as storage/data/values, instead of storage/data sometimes containing fields and sometimes not.

This is a change to the schema, so I'd like sign-offs from @HDembinski and @jpivarski. There will be a series of these based on my discussions with @pfackeldey today and my findings in scikit-hep/boost-histogram#997, so I'm apologizing in advance for the noise! This one specifically will also have a followup.

Signed-off-by: Henry Schreiner <henryschreineriii@gmail.com>

for more information, see https://pre-commit.ci

HDembinski · 2025-04-16T18:36:32Z

Can you summarize the rationale behind this?

Sounds like you want to make the implementation simpler at the cost of adding bloat to the serialization format. Why is that a good trade-off?

henryiii · 2025-04-16T19:29:28Z

This adds consistency between the formats, at the expense of a single level of nesting for the integer/floating storage. Every storage type then has the same structure. Without it, every writer will need to handle the inconsistency.

For example, writing and then reading in HDF5:

# Before
if not isinstance(storage_data, dict):
    storage_grp.create_dataset("data", shape=storage_data.shape, data=storage_data)
else:
    storage_data_grp = storage_grp.create_group("data")
    for key, value in storage_data.items():
        storage_data_grp.create_dataset(key, shape=value.shape, data=value)

# After
storage_data_grp = storage_grp.create_group("data")
for key, value in storage_data.items():
    storage_data_grp.create_dataset(key, shape=value.shape, data=value)

# Before
if isinstance(data_grp, h5py.Dataset):
    storage["data"] = np.array(data_grp)
else:
    assert isinstance(data_grp, h5py.Group)
    storage["data"] = {key: np.array(data_grp[key]) for key in data_grp}

# After
assert isinstance(data_grp, h5py.Group)
storage["data"] = {key: np.array(data_grp[key]) for key in data_grp}

It also makes manual inspection of the data easier: storage/data/values is always the values, whereas storage/data currently returns sometimes an object, sometimes an array, and sometimes a string (see followup).
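To make the lookup point concrete, here is a minimal sketch that uses plain dictionaries and lists in place of HDF5 groups and arrays. The helper names `values_before` and `values_after` are hypothetical and not part of uhi, and the sketch assumes the dict form of the old layout already used a `values` key.

```python
def values_before(storage):
    """Old layout: "data" is either a bare array or a dict of named fields."""
    data = storage["data"]
    if isinstance(data, dict):
        return data["values"]
    return data


def values_after(storage):
    """Proposed layout: "data" is always a dict with a "values" entry."""
    return storage["data"]["values"]


# Illustrative storages; lists stand in for real arrays.
old_integer = {"type": "integer", "data": [1, 2, 3]}
new_integer = {"type": "integer", "data": {"values": [1, 2, 3]}}
new_weighted = {
    "type": "weighted",
    "data": {"values": [1.0, 2.0], "variances": [0.5, 0.5]},
}

print(values_before(old_integer))
print(values_after(new_integer))
print(values_after(new_weighted))
```

The second helper needs no type check, which is the same simplification the HDF5 snippets above show for the writer and the reader.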

There are two follow-on changes that this also helps:

We can move the path string to the individual fields, rather than assuming a specific structure as we do now. That would be:

{
  "type": "weighted",
  "data": {
    "values": "some/path/to/array",
    "variances": "some/path/to/array"
  }
}

{
  "type": "integer",
  "data": {
    "values": "some/path/to/array"
  }
}

etc.

And the original motivation for the change would be supporting a future addition of sparse storage. A sparse histogram would also have an "index" field. (Exact design to be discussed in a follow-up).
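The first follow-on above, with one path string per field, could be consumed along these lines. This is only a sketch: `resolve_fields` and the `store` mapping are hypothetical stand-ins for whatever container actually holds the arrays, and the paths are made up for illustration.

```python
def resolve_fields(storage, store):
    """Replace each path string under storage["data"] with the array it names.

    `store` is a hypothetical mapping from path to array, standing in for an
    HDF5 file or any other container that holds the bin data.
    """
    return {field: store[path] for field, path in storage["data"].items()}


# Made-up paths and contents, for illustration only.
store = {
    "hist/values": [4.0, 5.0],
    "hist/variances": [2.0, 2.5],
}
weighted = {
    "type": "weighted",
    "data": {"values": "hist/values", "variances": "hist/variances"},
}
integer = {"type": "integer", "data": {"values": "hist/values"}}

print(resolve_fields(weighted, store))
print(resolve_fields(integer, store))
```

A loop of this shape would pick up an additional field without changes, which is the property the sparse-storage motivation relies on. How such a field would actually be laid out is not settled in this thread.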

henryiii · 2025-04-16T20:04:51Z

By the way, I'm assuming that having structure in our data format is fine and not an issue; if it is a "bloat" issue, we could always remove "data" entirely and just have "storage" be:

{
  "type": "integer",
  "values": "some/path/to/array"
}

and

{
  "type": "weighted",
  "values": "some/path/to/array",
  "variances": "some/path/to/array2"
}

etc.
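With the flat layout, a reader would have to separate the array fields from the `type` key itself. A minimal sketch, assuming `type` is the only non-field key (the helper name `storage_fields` is hypothetical):

```python
def storage_fields(storage):
    """Return the array fields of a flat storage object: every key except "type"."""
    return {key: value for key, value in storage.items() if key != "type"}


integer = {"type": "integer", "values": "some/path/to/array"}
weighted = {
    "type": "weighted",
    "values": "some/path/to/array",
    "variances": "some/path/to/array2",
}

print(storage_fields(integer))
print(storage_fields(weighted))
```

In this sketch the field names share a namespace with `type`, so any key added to the storage object later would need to be excluded in the same way.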

henryiii · 2025-04-16T21:23:42Z

src/uhi/resources/histogram.schema.json

@@ -177,9 +177,16 @@
          "oneOf": [
            {
              "type": "string",
-              "description": "A path (URI?) to the integer bin data."
+              "description": "A path (URI?) to the floating point bin data."


Suggested change

-              "description": "A path (URI?) to the floating point bin data."
+              "description": "A path (URI?) to the integer bin data."

Copy and paste mistake

HDembinski · 2025-04-18T08:17:13Z

I see. If the change simplifies multiple implementations, then OK.

I also like the idea of removing "data" completely, but I do not understand the ramifications of that. Does it cause unwanted side effects?

HDembinski · 2025-04-18T08:20:04Z

I am assuming that, technically, the complexity of the storage structure (more or less nested) is not the issue, so we are mainly discussing ease of maintenance and the desire for an orderly, clean structure that is easy to understand. Is this assumption correct?

In other words, the differences in serialized size and in read/write speed are negligible, whether this PR goes through or not.

henryiii · 2025-04-18T13:30:42Z

Yes, the data format is assumed to handle any level of nesting. I also think I like #155 a little better. Let's see what @jpivarski thinks.

jpivarski · 2025-04-18T14:30:22Z

I'm in favor of making the format more consistent, though I haven't looked at the state before and after the change to know what had been inconsistent. A slightly larger header would be acceptable for that, too, since most of the memory usage should be in the histogram data, and the header should be optimized for understandability.

Main point: don't hold this up because of me. I think you're doing the right thing!

henryiii and others added 2 commits April 16, 2025 14:28

fix: use values for all types

15f294b

Signed-off-by: Henry Schreiner <henryschreineriii@gmail.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

8cfe88c

for more information, see https://pre-commit.ci

henryiii changed the title ~~fix: use values for all types~~ feat: use values for all types Apr 16, 2025

henryiii mentioned this pull request Apr 16, 2025

feat: remove data nesting #155

Merged

henryiii closed this in #155 Apr 18, 2025
