Skip to content

Conversation

BCathcart
Copy link
Collaborator

@BCathcart BCathcart commented Oct 11, 2025

This PR populates a separate TextIndex for every key to support deletes and, later on, inline filtering. When deleting a key from the index, the key-specific TextIndex is traversed to gather all the words indexed from it. This tells us which paths in the schema-level TextIndex will lead to a postings object that the key needs to be removed from.

Any time an indexed key is modified, it's first removed from the text index (if one is present), and then re-indexed. The inefficiency here is if a key is only partially updated (e.g. some non-indexed metadata is changed), we may unnecessarily delete and re-index the same text. The hope is that text indexing will be fast enough we can live with this. Otherwise, we can consider a solution where Valkey provides insight into which fields of the Hash or JSON document were actually mutated.

I've also added:

  • Code to populate the suffix tree, although I'm leaving this untested until the search side is hooked up.
  • Rudimentary locking so at least it shouldn't crash on concurrency issues.
  • Some small refactoring (e.g. the stemmer is now owned by the lexer).

Signed-off-by: Brennan Cathcart <brennancathcart@gmail.com>
Signed-off-by: Brennan Cathcart <brennancathcart@gmail.com>
Signed-off-by: Brennan Cathcart <brennancathcart@gmail.com>
Signed-off-by: Brennan Cathcart <brennancathcart@gmail.com>
Signed-off-by: Brennan Cathcart <brennancathcart@gmail.com>
…ndex

Signed-off-by: Brennan Cathcart <brennancathcart@gmail.com>
Signed-off-by: Brennan Cathcart <brennancathcart@gmail.com>
mutable sb_stemmer* stemmer_ = nullptr;

// TODO: Create a little pool of stemmers protected by this mutex
mutable std::mutex stemmer_mutex_;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I thought that Lexer was per-thread. Why does it need a lock then?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously we did have a Lexer object created per thread, but really it was just a stateless set of functions. We created one stemmer per index schema and would pass that down into the lexer functions. @Aksha1812 reported that the stemmer wasn't threadsafe. I switched things around here so that there is one lexer instance per index schema that holds the stemmer and manages its concurrency.

We could create a lexer/stemmer instance per thread, but the stemmer instantiates some internal buffers every time so that's probably going to hurt performance. I was planning to play around with a small pool of stemmers.

Comment on lines +99 to +101
// It's expected that the caller will know whether or not the word
// exists. Passing in a word that doesn't exist along with a
// nullopt new_target will cause the word to be added and
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not clear how this is different from MutateTarget.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After reading the code, it's clear that this is an optimization for when the caller doesn't care what was at the word before, right? If so, perhaps the comment could be improved to indicate that.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll update the comment. It's likely these APIs change when we get to improving the locking. Right now the new target generated during the prefix tree update is applied to the suffix tree update, which is fine given the entire thing is wrapped in a mutex. Once the locking gets moved into the trees, we'll need some sort of coordination to make sure the suffix tree updates don't get applied out of order.

Comment on lines +98 to +103
absl::StatusOr<bool> TextIndexSchema::IndexAttributeData(
const InternedStringPtr& key, absl::string_view data,
size_t text_field_number, bool stem, size_t min_stem_size, bool suffix) {
auto tokens = lexer_.Tokenize(data, GetPunctuationBitmap(), stem,
min_stem_size, GetStopWordsSet());

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This implementation assumes that it's easy to incrementally update the postings list, i.e., to add/remove key and position entries. While that's true for keys, it won't be true for positions for fields. When Aksha rebuilds the position map to be space efficient it won't be cheap to incrementally edit it.

I believe we'll want to move to structure where per-field/per-position information coming out of the lexer is retained in some working storage object. Then at the end of the key ingestion that working storage object is "flushed" out. This allows the accumulation of words/positions/fields for the keys and they can then be ingested in an efficient fashion.

I don't think this is hard to do. In the ingestion engine, there's a per-attribute loop that ends up calling AddRecord/ModifyRecord for each attribute(field) of the index. It should be easy to add this new object as an extra parameter to those functions and have the higher level loop create and then "flush" that object.

A key advantage of this scheme is that Aksha never has to incrementally edit a positions list as the data for it can be stored in the temporary object in a fashion that's native to the Position map.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I'll leave a TODO and track this task

@BCathcart BCathcart force-pushed the brennan-per-key-index branch from ef7f639 to 1c6b335 Compare October 14, 2025 20:03
Signed-off-by: Brennan Cathcart <brennancathcart@gmail.com>
@BCathcart BCathcart force-pushed the brennan-per-key-index branch from 1c6b335 to 49378e4 Compare October 14, 2025 20:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants