Skip to content

feat: use values for all types #152

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from
Closed

Conversation

henryiii
Copy link
Member

This modifies the schema to keep the types more consistent. When implementing it, I have to have if statements around this portion; it would be simpler if every storage had the same structure.

Now they all are structured as storage/data/values, instead of storage/data sometimes containing fields and sometimes not.

This is a change to the schema, so I'd like sign-offs from @HDembinski and @jpivarski. There will be a series of these based on my discussions with @pfackeldey today and my findings in scikit-hep/boost-histogram#997, so I'm apologizing in advance for the noise! This one specifically will also have a followup.

henryiii and others added 2 commits April 16, 2025 14:28
Signed-off-by: Henry Schreiner <henryschreineriii@gmail.com>
@HDembinski
Copy link
Member

HDembinski commented Apr 16, 2025

Can you summarize the rationale behind this?

Sounds like you want to make the implementation simpler at the cost of adding bloat to the serialization format. Why is that a good trade-of?

@henryiii
Copy link
Member Author

henryiii commented Apr 16, 2025

This adds consistency between the formats, at the expense of a single level of nesting for the integer/floating storage. Every storage type then has the same structure. Every writer will need to handle this complexity.

For example in HDF5:
# Before
if not isinstance(storage_data, dict):
    storage_grp.create_dataset("data", shape=storage_data.shape, data=storage_data)
else:
    storage_data_grp = storage_grp.create_group("data")
    for key, value in storage_data.items():
        storage_data_grp.create_dataset(key, shape=value.shape, data=value)

# After
storage_data_grp = storage_grp.create_group("data")
for key, value in storage_data.items():
    storage_data_grp.create_dataset(key, shape=value.shape, data=value)
# Before
if isinstance(data_grp, h5py.Dataset):
    storage["data"] = np.array(data_grp)
else:
    assert isinstance(data_grp, h5py.Group)
    storage["data"] = {key: np.array(data_grp[key]) for key in data_grp}

# After
assert isinstance(data_grp, h5py.Group)
storage["data"] = {key: np.array(data_grp[key]) for key in data_grp}

It also makes manual investigation of the data easier; just looking up storage/data/values is always the values, rather than storage/data sometimes returning an object and sometimes an array and sometimes a string (see followup).

There are two follow-on changes that this also helps:

We can move the string to the individual fields, rather than assuming a specific structure like we do now. So that would be:

{
  "type": "weighed",
  "data": {
    "values": "some/path/to/array",
    "variances": "some/path/to/array"
  }
}
{
  "type": "integer",
  "data": {
    "values": "some/path/to/array"
  }
}

ect.

And the original motivation for the change would be supporting a future addition of sparse storage. A sparse histogram would also have an "index" field. (Exact design to be discussed in a follow-up).

@henryiii
Copy link
Member Author

By the way, I'm assuming that having structure in our data format is fine and not an issue; if it is a "bloat" issue, we could always remove "data" entirely and just have "storage" be:

{
  "type": "integer",
  "values": "some/path/to/array"
}

and

{
  "type": "weighed",
  "values": "some/path/to/array",
  "variances": "some/path/to/array2"
}

ect.

@henryiii henryiii changed the title fix: use values for all types feat: use values for all types Apr 16, 2025
@@ -177,9 +177,16 @@
"oneOf": [
{
"type": "string",
"description": "A path (URI?) to the integer bin data."
"description": "A path (URI?) to the floating point bin data."
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"description": "A path (URI?) to the floating point bin data."
"description": "A path (URI?) to the integer bin data."

Copy and paste mistake

@HDembinski
Copy link
Member

I see, so if the change simplifies multiple implementations then ok.

I also like the idea to remove "data" completely, but I do not understand the ramifications of that. Does this cause unwanted side-effects?

@HDembinski
Copy link
Member

HDembinski commented Apr 18, 2025

I am assuming that technically, the difference in complexity of the storage structure (more or less nested) is not the issue here, so we are mainly discussing ease of maintenance here, and the desire to have an orderly, clean structure that is easy to understand. Is this assumption correct?

In other words, the difference in size of the serialization format and reading-writing speed are negligible, whether this PR goes through or not.

@henryiii
Copy link
Member Author

Yes, the data format is assumed to handle any level of nesting. I also think I like #155 a little better. Let's see what @jpivarski thinks.

@jpivarski
Copy link
Member

I'm in favor of making the format more consistent, though I haven't looked at the state before and after the change to know what had been inconsistent. A slightly larger header would be acceptable for that, too, since most of the memory usage should be in the histogram data, and the header should be optimized for understandability.

Main point: don't hold this up because of me. I think you're doing the right thing!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants