-
Notifications
You must be signed in to change notification settings - Fork 4
feat: use values for all types #152
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: Henry Schreiner <henryschreineriii@gmail.com>
for more information, see https://pre-commit.ci
Can you summarize the rationale behind this? Sounds like you want to make the implementation simpler at the cost of adding bloat to the serialization format. Why is that a good trade-of? |
This adds consistency between the formats, at the expense of a single level of nesting for the integer/floating storage. Every storage type then has the same structure. Every writer will need to handle this complexity. For example in HDF5:# Before
if not isinstance(storage_data, dict):
storage_grp.create_dataset("data", shape=storage_data.shape, data=storage_data)
else:
storage_data_grp = storage_grp.create_group("data")
for key, value in storage_data.items():
storage_data_grp.create_dataset(key, shape=value.shape, data=value)
# After
storage_data_grp = storage_grp.create_group("data")
for key, value in storage_data.items():
storage_data_grp.create_dataset(key, shape=value.shape, data=value) # Before
if isinstance(data_grp, h5py.Dataset):
storage["data"] = np.array(data_grp)
else:
assert isinstance(data_grp, h5py.Group)
storage["data"] = {key: np.array(data_grp[key]) for key in data_grp}
# After
assert isinstance(data_grp, h5py.Group)
storage["data"] = {key: np.array(data_grp[key]) for key in data_grp} It also makes manual investigation of the data easier; just looking up There are two follow-on changes that this also helps: We can move the string to the individual fields, rather than assuming a specific structure like we do now. So that would be: {
"type": "weighed",
"data": {
"values": "some/path/to/array",
"variances": "some/path/to/array"
}
} {
"type": "integer",
"data": {
"values": "some/path/to/array"
}
} ect. And the original motivation for the change would be supporting a future addition of sparse storage. A sparse histogram would also have an |
By the way, I'm assuming that having structure in our data format is fine and not an issue; if it is a "bloat" issue, we could always remove {
"type": "integer",
"values": "some/path/to/array"
} and {
"type": "weighed",
"values": "some/path/to/array",
"variances": "some/path/to/array2"
} ect. |
@@ -177,9 +177,16 @@ | |||
"oneOf": [ | |||
{ | |||
"type": "string", | |||
"description": "A path (URI?) to the integer bin data." | |||
"description": "A path (URI?) to the floating point bin data." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"description": "A path (URI?) to the floating point bin data." | |
"description": "A path (URI?) to the integer bin data." |
Copy and paste mistake
I see, so if the change simplifies multiple implementations then ok. I also like the idea to remove "data" completely, but I do not understand the ramifications of that. Does this cause unwanted side-effects? |
I am assuming that technically, the difference in complexity of the storage structure (more or less nested) is not the issue here, so we are mainly discussing ease of maintenance here, and the desire to have an orderly, clean structure that is easy to understand. Is this assumption correct? In other words, the difference in size of the serialization format and reading-writing speed are negligible, whether this PR goes through or not. |
Yes, the data format is assumed to handle any level of nesting. I also think I like #155 a little better. Let's see what @jpivarski thinks. |
I'm in favor of making the format more consistent, though I haven't looked at the state before and after the change to know what had been inconsistent. A slightly larger header would be acceptable for that, too, since most of the memory usage should be in the histogram data, and the header should be optimized for understandability. Main point: don't hold this up because of me. I think you're doing the right thing! |
This modifies the schema to keep the types more consistent. When implementing it, I have to have if statements around this portion; it would be simpler if every storage had the same structure.
Now they all are structured as
storage/data/values
, instead ofstorage/data
sometimes containing fields and sometimes not.This is a change to the schema, so I'd like sign-offs from @HDembinski and @jpivarski. There will be a series of these based on my discussions with @pfackeldey today and my findings in scikit-hep/boost-histogram#997, so I'm apologizing in advance for the noise! This one specifically will also have a followup.