Skip to content

Add tests for std::[unordered_][multi]set #39

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
May 14, 2025

Conversation

enirolf
Copy link
Contributor

@enirolf enirolf commented Feb 28, 2025

Closes #14.

Copy link
Member

@hahnjo hahnjo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! I left some comments, mostly we should start by figuring out what we want to test for the binary format in particular. Originally I considered being lazy for other containers and put all std::[unordered_][multi]set in a single test, without the full combination of index column types. But I think the multi containers make this awkward because there we actually want to test that the duplicates are preserved, and also we may want to be 100% certain that we get the index columns right. For the non-multi containers though, duplicates are essentially handled on the C++ side, before it comes to RNTuple I think, so not sure if we need those entries here...

@enirolf
Copy link
Contributor Author

enirolf commented Mar 4, 2025

Thanks for the PR! I left some comments, mostly we should start by figuring out what we want to test for the binary format in particular. Originally I considered being lazy for other containers and put all std::[unordered_][multi]set in a single test, without the full combination of index column types. But I think the multi containers make this awkward because there we actually want to test that the duplicates are preserved, and also we may want to be 100% certain that we get the index columns right. For the non-multi containers though, duplicates are essentially handled on the C++ side, before it comes to RNTuple I think, so not sure if we need those entries here...

Yeah that's fair enough! I suppose we could squash [unordered_]set and [unordered_]multiset. One thing to consider is that it's not a given that every writer/reader will be written in C++ (we already know this isn't the case), but the spec explicitly refers to C++ types. But you're right that still the ordering and duplicate handling is not something that the format itself is responsible for.

Taking it even further, the specification explicitly states that the on-disk representation is identical to std::vector, so if we really want to be strict about the scope we could even argue that these tests are somewhat redundant (except for maybe type handling?)..

@hahnjo
Copy link
Member

hahnjo commented Mar 4, 2025

I suppose we could squash [unordered_]set and [unordered_]multiset.

That might be a good compromise because we probably want the same input entries / entry classes for multiset and unordered_multiset...

Taking it even further, the specification explicitly states that the on-disk representation is identical to std::vector, so if we really want to be strict about the scope we could even argue that these tests are somewhat redundant (except for maybe type handling?)..

Yes, in my opinion we want each supported C++ type to appear at least once in the validation suite. But indeed, the question is how much different is it from a binary format perspective. There's two axes to that: index column encoding and nesting (e.g. std::set<std::set<std::int32_t>>).

@pcanal
Copy link
Member

pcanal commented Mar 4, 2025

But indeed, the question is how much different is it from a binary format perspective.

As a side note, in roottest we do have a test of reading each STL collections in file format into all other STL collections of the same content (for a large subset of cases)

@hahnjo
Copy link
Member

hahnjo commented Mar 5, 2025

After thinking this over, here are some more considerations:

  1. I think we want the nested tests of containers, both that values can be non-fundamental types (non-simple fields in ROOT) and that they work when below another container. It's not really much different from a binary format point of view, but I think the suite should also (try to) validate the write and read implementation and there I see plenty of ways how the (de)serialization can get things wrong (for example, just taking the entry number to get the size of a collection). If we can do the two things with a single test (e.g. std::set<std::set<std::int32_t>>) the better.
  2. We may want at least kIndex64 and kSplitIndex64 because those are the two default encodings (depending on compression on/off), so they are what will be used most often. kIndex32 and kSplitIndex32 are kind of niche. Again not really an argument of the binary format but more of the implementation.

The decision whether to merge [unordered_]set and [unordered_]multiset I leave to you. In the end, given that we already have the tests written out, we may as well just keep them and make it easier to find the tests for a particular container.

@enirolf
Copy link
Contributor Author

enirolf commented Mar 17, 2025

Based on this statement:

Yes, in my opinion we want each supported C++ type to appear at least once in the validation suite.

and because I right now don't have a clear idea how to nicely merge the ordered and unordered variants (two field types instead of one, i.e., [Unordered][Split]Index{32,64}? that would mix the field type with the column representation in a way though, which I don't like), for now I would opt to keep the four variants separate. I can remove the "redundant" test cases (e.g., the duplications in std::set) but I also don't think a bit of redundancy necessarily hurts here. What do you think?

@enirolf enirolf requested a review from hahnjo March 17, 2025 10:35
@hahnjo
Copy link
Member

hahnjo commented Mar 20, 2025

Thanks for the work of also updating the infrastructure and the CI, seems to pass 😃

for now I would opt to keep the four variants separate

Fine with me.

I can remove the "redundant" test cases (e.g., the duplications in std::set) but I also don't think a bit of redundancy necessarily hurts here. What do you think?

I would tend to remove the "duplicate" and "reverse" tests because at least I personally find it confusing that the output doesn't match the input, and it's actually not RNTuple that is responsible for that. Let's hear some opinions from others maybe?

@hahnjo hahnjo added this to the 1.0 milestone May 1, 2025
@pcanal
Copy link
Member

pcanal commented May 8, 2025

I would tend to remove the "duplicate" and "reverse" ....

I agree that those tests would not be adding much value. On the other hand, see my question/pondering on unordered_ and platform independence.

@hahnjo
Copy link
Member

hahnjo commented May 9, 2025

#39 (comment)

I agree that the semantic data on disk is identical to std::vector, so it's indeed not a matter of the binary format.

While typing this, I realized that the same reasoning could be applied to all the nested tests: In the end they are the same as std::vector<std::vector<std::int32_t>...

Edit: and the Index column encoding...

@enirolf
Copy link
Contributor Author

enirolf commented May 12, 2025

#39 (comment)

I agree that the semantic data on disk is identical to std::vector, so it's indeed not a matter of the binary format.

While typing this, I realized that the same reasoning could be applied to all the nested tests: In the end they are the same as std::vector<std::vector<std::int32_t>...

Edit: and the Index column encoding...

So, what would your proposal be? I mean, fundamentally they are all identical to std::vector but we also agreed we want to have at least natively covered C++ type in once. Taking these two arguments, if you ask me it sort of makes sense to have a one-to-one mirror of the std::vector tests because that would also allow to cross-reference between the outputs of the std::vector tests and the set-related tests. In other words, I would propose to merge the tests as they are now. I still don't think there's a perfect test case that exactly matches the requirements for the tests we set ourselves (i.e., only test the binary format but also cover natively supported C++ types), simply because they end up contradicting each other.

@enirolf enirolf requested a review from hahnjo May 12, 2025 09:46
@pcanal
Copy link
Member

pcanal commented May 12, 2025

Taking these two arguments, if you ask me it sort of makes sense to have a one-to-one mirror of the std::vector tests

This indeed has the 'nice' side effect that (except for map/multimap which contains pairs) the json files would be almost identical (i.e. just the (type)names would be different, isn't it)

@hahnjo
Copy link
Member

hahnjo commented May 13, 2025

Taking these two arguments, if you ask me it sort of makes sense to have a one-to-one mirror of the std::vector tests

This indeed has the 'nice' side effect that (except for map/multimap which contains pairs) the json files would be almost identical (i.e. just the (type)names would be different, isn't it)

Not quite, since (a) sets don't preserve their order and (b) we don't store duplicate elements. We could make the std::vector tests match these restrictions, but given that maps will produce a different json file anyway I don't think we should. However, by not storing duplicate elements even for multiset, we have the same reference files for all set containers, which is probably a good idea and another reason to do it that way.

Copy link
Member

@hahnjo hahnjo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, apart from two comments that apply to all nested tests.

@hahnjo
Copy link
Member

hahnjo commented May 13, 2025

I mean, fundamentally they are all identical to std::vector but we also agreed we want to have at least natively covered C++ type in once. Taking these two arguments, if you ask me it sort of makes sense to have a one-to-one mirror of the std::vector tests because that would also allow to cross-reference between the outputs of the std::vector tests and the set-related tests.

I would have previously argued that we don't need nesting to have the C++ type once, a fundamental item type could be sufficient. But the discussions convinced me that this is useful to have the nesting since it directly affects the *Index* column data.

So, what would your proposal be?

I don't have a really good proposal. The best I could come up with after some thinking is effectively documenting the reasoning behind the current tests, and just generalize based on that: #50 Let me know what you think.

@enirolf enirolf merged commit 9d22af6 into root-project:main May 14, 2025
1 check passed
@enirolf enirolf deleted the types-sets branch May 14, 2025 15:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Test(s) for std::[unordered_][multi]set
3 participants