Text update and delete ingestion #412
base: fulltext
Conversation
mutable sb_stemmer* stemmer_ = nullptr;

// TODO: Create a little pool of stemmers protected by this mutex
mutable std::mutex stemmer_mutex_;
Hmm, I thought that Lexer was per-thread. Why does it need a lock then?
Previously we did have a Lexer object created per thread, but it was really just a stateless set of functions. We created one stemmer per index schema and passed it down into the lexer functions. @Aksha1812 reported that the stemmer wasn't thread-safe, so I switched things around here: there is now one lexer instance per index schema that holds the stemmer and manages its concurrency.
We could create a lexer/stemmer instance per thread, but the stemmer allocates some internal buffers on every instantiation, so that would probably hurt performance. I was planning to play around with a small pool of stemmers instead (see the sketch below).
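For reference, a minimal sketch of what such a pool might look like. The class name, sizing, and Acquire/Release API are assumptions made for illustration; only the libstemmer calls (sb_stemmer_new, sb_stemmer_delete) come from the real library.

#include <libstemmer.h>

#include <mutex>
#include <string>
#include <vector>

// Hypothetical stemmer pool; names and sizing are illustrative only.
class StemmerPool {
 public:
  StemmerPool(std::string language, size_t size)
      : language_(std::move(language)) {
    for (size_t i = 0; i < size; ++i) {
      stemmers_.push_back(sb_stemmer_new(language_.c_str(), nullptr));
    }
  }

  ~StemmerPool() {
    for (auto* s : stemmers_) sb_stemmer_delete(s);
  }

  // Borrow a stemmer. If the pool is drained, create a fresh one so
  // callers never block on each other; it joins the pool on Release.
  sb_stemmer* Acquire() {
    std::lock_guard<std::mutex> lock(mutex_);
    if (stemmers_.empty()) {
      return sb_stemmer_new(language_.c_str(), nullptr);
    }
    sb_stemmer* s = stemmers_.back();
    stemmers_.pop_back();
    return s;
  }

  void Release(sb_stemmer* s) {
    std::lock_guard<std::mutex> lock(mutex_);
    stemmers_.push_back(s);
  }

 private:
  const std::string language_;
  std::mutex mutex_;
  std::vector<sb_stemmer*> stemmers_;
};

The mutex is only held for the push/pop, never across a stemming call, so concurrent indexing threads contend only briefly while amortizing the cost of the stemmer's internal buffer allocation.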
// It's expected that the caller will know whether or not the word
// exists. Passing in a word that doesn't exist along with a
// nullopt new_target will cause the word to be added and
It's not clear how this is different from MutateTarget.
After reading the code, it's clear that this is an optimization for when the caller doesn't care what was at the word before, right? If so, perhaps the comment could be improved to indicate that.
I'll update the comment. It's likely these APIs will change when we get to improving the locking. Right now the new target generated during the prefix tree update is applied to the suffix tree update, which is fine given the entire thing is wrapped in a mutex. Once the locking gets moved into the trees, we'll need some sort of coordination to make sure the suffix tree updates don't get applied out of order.
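To make the current arrangement concrete, here is a minimal sketch of the pattern described above. All type and member names are hypothetical stand-ins, not this PR's actual API; only the one-mutex-around-both-updates idea is taken from the comment.

#include <mutex>
#include <string>

// Hypothetical stand-ins for the real tree types.
struct Postings {};

struct PrefixTree {
  // Returns the (possibly new) postings target for `word`.
  Postings* MutateTarget(const std::string& word) { return &postings_; }
  Postings postings_;
};

struct SuffixTree {
  void Apply(const std::string& word, Postings* target) {}
};

class TextIndex {
 public:
  void UpdateWord(const std::string& word) {
    // One mutex spans both updates, so the target produced by the
    // prefix-tree mutation is safe to hand to the suffix tree. If
    // locking later moves inside each tree, two threads could reach
    // the suffix tree out of order and would need extra coordination.
    std::lock_guard<std::mutex> lock(mutex_);
    Postings* target = prefix_.MutateTarget(word);
    suffix_.Apply(word, target);
  }

 private:
  std::mutex mutex_;
  PrefixTree prefix_;
  SuffixTree suffix_;
};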
absl::StatusOr<bool> TextIndexSchema::IndexAttributeData(
    const InternedStringPtr& key, absl::string_view data,
    size_t text_field_number, bool stem, size_t min_stem_size, bool suffix) {
  auto tokens = lexer_.Tokenize(data, GetPunctuationBitmap(), stem,
                                min_stem_size, GetStopWordsSet());
This implementation assumes that it's easy to incrementally update the postings list, i.e., to add/remove key and position entries. While that's true for keys, it won't be true for the per-field positions. Once Aksha rebuilds the position map to be space efficient, it won't be cheap to edit incrementally.
I believe we'll want to move to a structure where the per-field/per-position information coming out of the lexer is retained in some working storage object. Then, at the end of the key ingestion, that working storage object is "flushed" out. This allows words/positions/fields to accumulate for the key and then be ingested in an efficient fashion.
I don't think this is hard to do. In the ingestion engine, there's a per-attribute loop that ends up calling AddRecord/ModifyRecord for each attribute (field) of the index. It should be easy to add this new object as an extra parameter to those functions and have the higher-level loop create and then "flush" it.
A key advantage of this scheme is that Aksha never has to incrementally edit a positions list, since the data for it can be stored in the temporary object in a fashion that's native to the position map. A sketch of the idea follows.
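A hedged sketch of that working-storage object, under the assumptions above. Every name here (PendingPosting, KeyIngestionBuffer, Add, Flush, BatchIngest) is a hypothetical illustration rather than anything from this PR.

#include <cstdint>
#include <string>
#include <vector>

// One lexer emission: which word, in which field, at what position.
struct PendingPosting {
  std::string word;
  size_t field;       // text field number within the schema
  uint32_t position;  // token position within that field
};

class KeyIngestionBuffer {
 public:
  // Called once per token as the per-attribute loop walks each field.
  void Add(std::string word, size_t field, uint32_t position) {
    pending_.push_back({std::move(word), field, position});
  }

  // Called once at the end of key ingestion. Handing the index all
  // words/fields/positions at once lets the position map be built in
  // its native layout instead of being edited incrementally.
  template <typename Index>
  void Flush(const std::string& key, Index& index) {
    index.BatchIngest(key, pending_);  // hypothetical batch API
    pending_.clear();
  }

 private:
  std::vector<PendingPosting> pending_;
};

The higher-level ingestion loop would create one buffer per key, pass it into each AddRecord/ModifyRecord call as the extra parameter, and flush it after the last attribute.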
Good point. I'll leave a TODO and track this task.
Force-pushed from ef7f639 to 1c6b335, then from 1c6b335 to 49378e4.
This PR populates a separate TextIndex for every key to support deletes and, later on, inline filtering. When deleting a key from the index, the key-specific TextIndex is traversed to gather all the words indexed from it. This tells us which paths in the schema-level TextIndex will lead to a postings object that the key needs to be removed from.
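A rough sketch of that delete path, with hypothetical names standing in for the actual types (the per-key/schema-level TextIndex split is from the PR; everything else is illustrative):

#include <map>
#include <memory>
#include <set>
#include <string>

// Hypothetical per-key index: records every word indexed from a key.
struct KeyTextIndex {
  std::set<std::string> words;
};

// Hypothetical schema-level index.
class SchemaTextIndex {
 public:
  void RemoveKeyFromPostings(const std::string& word,
                             const std::string& key) {
    // Walk the path for `word` and drop `key` from its postings object.
  }
};

class TextIndexState {
 public:
  void RemoveKey(const std::string& key) {
    auto it = per_key_.find(key);
    if (it == per_key_.end()) return;
    // The key-local index tells us exactly which schema-level paths
    // lead to postings that must forget this key.
    for (const auto& word : it->second->words) {
      schema_.RemoveKeyFromPostings(word, key);
    }
    per_key_.erase(it);
  }

 private:
  std::map<std::string, std::unique_ptr<KeyTextIndex>> per_key_;
  SchemaTextIndex schema_;
};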
Any time an indexed key is modified, it's first removed from the text index (if one is present) and then re-indexed. The inefficiency here is that if a key is only partially updated (e.g. some non-indexed metadata is changed), we may unnecessarily delete and re-index the same text. The hope is that text indexing will be fast enough that we can live with this. Otherwise, we can consider a solution where Valkey provides insight into which fields of the Hash or JSON document were actually mutated.
I've also added: