Text update and delete ingestion #412
base: fulltext
Conversation
mutable sb_stemmer* stemmer_ = nullptr;

// TODO: Create a little pool of stemmers protected by this mutex
mutable std::mutex stemmer_mutex_;
Hmm, I thought that Lexer was per-thread. Why does it need a lock then?
Previously we did have a Lexer object created per thread, but it was really just a stateless set of functions. We created one stemmer per index schema and passed it down into the lexer functions. @Aksha1812 reported that the stemmer wasn't thread-safe, so I switched things around here: there is now one lexer instance per index schema that holds the stemmer and manages its concurrency.
We could create a lexer/stemmer instance per thread, but the stemmer allocates some internal buffers on every instantiation, so that would probably hurt performance. I was planning to play around with a small pool of stemmers instead (see the sketch below).
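For reference, a minimal sketch of what such a pool might look like. The class name, sizing, and Acquire/Release API are assumptions made for illustration; only the libstemmer calls (sb_stemmer_new, sb_stemmer_delete) come from the real library.

#include <libstemmer.h>

#include <mutex>
#include <string>
#include <vector>

// Hypothetical stemmer pool; names and sizing are illustrative only.
class StemmerPool {
 public:
  StemmerPool(std::string language, size_t size)
      : language_(std::move(language)) {
    for (size_t i = 0; i < size; ++i) {
      stemmers_.push_back(sb_stemmer_new(language_.c_str(), nullptr));
    }
  }

  ~StemmerPool() {
    for (auto* s : stemmers_) sb_stemmer_delete(s);
  }

  // Borrow a stemmer. If the pool is drained, create a fresh one so
  // callers never block on each other; it joins the pool on Release.
  sb_stemmer* Acquire() {
    std::lock_guard<std::mutex> lock(mutex_);
    if (stemmers_.empty()) {
      return sb_stemmer_new(language_.c_str(), nullptr);
    }
    sb_stemmer* s = stemmers_.back();
    stemmers_.pop_back();
    return s;
  }

  void Release(sb_stemmer* s) {
    std::lock_guard<std::mutex> lock(mutex_);
    stemmers_.push_back(s);
  }

 private:
  const std::string language_;
  std::mutex mutex_;
  std::vector<sb_stemmer*> stemmers_;
};

The mutex is only held for the push/pop, never across a stemming call, so concurrent indexing threads contend only briefly while amortizing the cost of the stemmer's internal buffer allocation.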
// It's expected that the caller will know whether or not the word
// exists. Passing in a word that doesn't exist along with a
// nullopt new_target will cause the word to be added and
It's not clear how this is different from MutateTarget.
After reading the code, it's clear that this is an optimization for when the caller doesn't care what was at the word before, right? If so, perhaps the comment could be improved to indicate that.
I'll update the comment. It's likely these APIs will change when we get to improving the locking. Right now the new target generated during the prefix tree update is applied to the suffix tree update, which is fine given the entire thing is wrapped in a mutex. Once the locking gets moved into the trees, we'll need some sort of coordination to make sure the suffix tree updates don't get applied out of order.
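To make the current arrangement concrete, here is a minimal sketch of the pattern described above. All type and member names are hypothetical stand-ins, not this PR's actual API; only the one-mutex-around-both-updates idea is taken from the comment.

#include <mutex>
#include <string>

// Hypothetical stand-ins for the real tree types.
struct Postings {};

struct PrefixTree {
  // Returns the (possibly new) postings target for `word`.
  Postings* MutateTarget(const std::string& word) { return &postings_; }
  Postings postings_;
};

struct SuffixTree {
  void Apply(const std::string& word, Postings* target) {}
};

class TextIndex {
 public:
  void UpdateWord(const std::string& word) {
    // One mutex spans both updates, so the target produced by the
    // prefix-tree mutation is safe to hand to the suffix tree. If
    // locking later moves inside each tree, two threads could reach
    // the suffix tree out of order and would need extra coordination.
    std::lock_guard<std::mutex> lock(mutex_);
    Postings* target = prefix_.MutateTarget(word);
    suffix_.Apply(word, target);
  }

 private:
  std::mutex mutex_;
  PrefixTree prefix_;
  SuffixTree suffix_;
};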
absl::StatusOr<bool> TextIndexSchema::IndexAttributeData(
    const InternedStringPtr& key, absl::string_view data,
    size_t text_field_number, bool stem, size_t min_stem_size, bool suffix) {
  auto tokens = lexer_.Tokenize(data, GetPunctuationBitmap(), stem,
                                min_stem_size, GetStopWordsSet());
This implementation assumes that it's easy to incrementally update the postings list, i.e., to add/remove key and position entries. While that's true for keys, it won't be true for the per-field positions. Once Aksha rebuilds the position map to be space efficient, it won't be cheap to edit incrementally.
I believe we'll want to move to a structure where the per-field/per-position information coming out of the lexer is retained in some working storage object. Then, at the end of the key ingestion, that working storage object is "flushed" out. This allows words/positions/fields to accumulate for the key and then be ingested in an efficient fashion.
I don't think this is hard to do. In the ingestion engine, there's a per-attribute loop that ends up calling AddRecord/ModifyRecord for each attribute (field) of the index. It should be easy to add this new object as an extra parameter to those functions and have the higher-level loop create and then "flush" it.
A key advantage of this scheme is that Aksha never has to incrementally edit a positions list, since the data for it can be stored in the temporary object in a fashion that's native to the position map. A sketch of the idea follows.
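A hedged sketch of that working-storage object, under the assumptions above. Every name here (PendingPosting, KeyIngestionBuffer, Add, Flush, BatchIngest) is a hypothetical illustration rather than anything from this PR.

#include <cstdint>
#include <string>
#include <vector>

// One lexer emission: which word, in which field, at what position.
struct PendingPosting {
  std::string word;
  size_t field;       // text field number within the schema
  uint32_t position;  // token position within that field
};

class KeyIngestionBuffer {
 public:
  // Called once per token as the per-attribute loop walks each field.
  void Add(std::string word, size_t field, uint32_t position) {
    pending_.push_back({std::move(word), field, position});
  }

  // Called once at the end of key ingestion. Handing the index all
  // words/fields/positions at once lets the position map be built in
  // its native layout instead of being edited incrementally.
  template <typename Index>
  void Flush(const std::string& key, Index& index) {
    index.BatchIngest(key, pending_);  // hypothetical batch API
    pending_.clear();
  }

 private:
  std::vector<PendingPosting> pending_;
};

The higher-level ingestion loop would create one buffer per key, pass it into each AddRecord/ModifyRecord call as the extra parameter, and flush it after the last attribute.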
Good point. I'll leave a TODO and track this task.
Force-pushed from ef7f639 to 1c6b335, then from 1c6b335 to 49378e4.
This PR populates a separate TextIndex for every key to support deletes and, later on, inline filtering. When deleting a key from the index, the key-specific TextIndex is traversed to gather all the words indexed from it. This tells us which paths in the schema-level TextIndex will lead to a postings object that the key needs to be removed from.
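A rough sketch of that delete path, with hypothetical names standing in for the actual types (the per-key/schema-level TextIndex split is from the PR; everything else is illustrative):

#include <map>
#include <memory>
#include <set>
#include <string>

// Hypothetical per-key index: records every word indexed from a key.
struct KeyTextIndex {
  std::set<std::string> words;
};

// Hypothetical schema-level index.
class SchemaTextIndex {
 public:
  void RemoveKeyFromPostings(const std::string& word,
                             const std::string& key) {
    // Walk the path for `word` and drop `key` from its postings object.
  }
};

class TextIndexState {
 public:
  void RemoveKey(const std::string& key) {
    auto it = per_key_.find(key);
    if (it == per_key_.end()) return;
    // The key-local index tells us exactly which schema-level paths
    // lead to postings that must forget this key.
    for (const auto& word : it->second->words) {
      schema_.RemoveKeyFromPostings(word, key);
    }
    per_key_.erase(it);
  }

 private:
  std::map<std::string, std::unique_ptr<KeyTextIndex>> per_key_;
  SchemaTextIndex schema_;
};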
Any time an indexed key is modified, it's first removed from the text index (if one is present) and then re-indexed. The inefficiency here is that if a key is only partially updated (e.g. some non-indexed metadata is changed), we may unnecessarily delete and re-index the same text. The hope is that text indexing will be fast enough that we can live with this. Otherwise, we can consider a solution where Valkey provides insight into which fields of the Hash or JSON document were actually mutated.
I've also added: