-
Notifications
You must be signed in to change notification settings - Fork 214
Added new method #559
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Added new method #559
Changes from 2 commits
Commits
Show all changes
5 commits
Select commit
Hold shift + click to select a range
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest reusing the existing logic like this:
HeapQuickSelectSketch.setHashTableThreshold(lgK, lgK + 1) + Family.QUICKSELECT.getMaxPreLongs()) * Long.BYTES
By the way, I noticed that the setHashTableThreshold() method is duplicated in DirectQuickSelectSketch (not introduced by this PR)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By the way, the method name setHashTableThreshold is a bit misleading. It does not set anything. It computes and returns the result.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
HeapQuickSelectSketch
function togetHashTableThreshold(..)
DirectQuickSelectSketchR
function with the same name is actually different and uses a different constant. The Direct version is tuned to delay having to resize less often because the cost of resizing in off-heap is much more expensive than on-heap. Nonetheless, I changed its name togetOffHeapHashTableThreshold(...)
.As for the internal computation of getCompactSketchMaxBytes(...) I changed it to be a little different from what you proposed.
You proposed using the HeapQuickSelectSketch.getHashTableThreshold(lgK, lgK + 1).
Unfortunately, the function input was nomEntries and not lgNomEntries, which meant I would have to convert to LgK before using the proposed function. This is not a single CPU cycle operation. If you go look at the JVM source code for
Long.numberOfTrailingZeros(long i)
you'll see that it is not trivial.Meanwhile, we have for some time also been using
lgNomEntries
in Theta. (Go look at the Theta builder.) So I decided to switch the input tolgNomEntries
.getHashTableThreshold(..)
andgetOffHeapHashTableThreshold(...)
is to allow fine tuning of the resizing operation the sketch, which we actually do because we tune the resizing of the heap sketch differently from the off-heap sketch. Both of these were package-private, and neither was used in test. So I decided to make the heap functionprivate
and the off-heap versionprotected
because it is used by the child sketchDirectQuickSelectSketch
.I really didn't want to use these special functions for this
getCompactSketchMaxBytes(...)
, so I rewrote it with one of your suggestions and leveraging the REBUILD_THRESHOLD constant already defined.Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't quite understand why not use getHashTableThreshold() for max serialized size computation. The size depends exactly on the maximum number of retained entries in the hash table. Now the knowledge about 2 * K * rebuild_threshold is sitting in two places
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see your point. But my point is that the getHashTableThreshold(...) is doing multiple things and forcing it to just return the rebuild Thresh is a hack.
But to fix this I will need to separate the two functions getHashTableResizeThreshold and getHashTableRebuildThreshold. The resize Thresh should never change but the Resize Thresh depends on the type of sketch.
I will fix this, but you will have to approve it again :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I started to implement this idea of trying to have one place that computes the 2 * K * rebuild_Threshold.
But it turned out to be much more complicated so I decided to leave it alone for now.
If, and when, we decide to make an improvement in this area, I want to do one more improvement:
We have 3 static final double constants: REBUILD_THRESHOLD, RESIZE_THRESHOLD, AND DQS_RESIZE_THRESHOLD that are used for tuning the performance of the hash tables in different situations.
Unfortunately, at the time, I decided to use doubles for these constants. This forces a computation like this:
return (int) Math.floor(<double_constant> * (1 << lgArrLongs));
every time they are used.However, the hash array sizes are always powers of 2, and never smaller than 16, and the constants are either 15/16 or 0.5, or fractions that can exactly produce an integer for all possible sizes of the hash tables.
This means, instead of using double constants, we could use static final methods that eliminate the double arithmetic and Math.floor. So the above equation becomes, for one example:
return ((1 << lgArrLongs) * 15) / 16;
or even better:
return ((16 << lgArrLongs) - (1 << lgArrLongs)) >>> 4
which is much faster.
I looked into this, but eliminating these double constants propagates everywhere and is a more complicated than I want to attempt right now. We can put this into our backlog.