Skip to content

Add ability for reverse-lookup (string retrieval) in ZeroTrie #6922

@sffc

Description

@sffc

ZeroTrie has been designed for "forward" lookup (string to integer). We could also support "backward" lookup (offset to string) by adding nodes to the ZeroTrie that allow jumping backward through the trie.

For example, a basic ZeroTrie containing "axb" and "ayc" has these bytes:

b'a'
0b11000010 <-- a branch node containing 2 edges
b'x'
b'y'
2          <-- part of the branch node: jump over 2 bytes if the value is 'y'
b'b'
0b10000000 <-- value 0
b'c'
0b10000001 <-- value 1

Reading the trie forward is as follows; suppose you are looking up "ayc":

  1. First node is an 'a'. Good, continue.
  2. Second node is a branch node with 2 values. Binary-search to find 'y', then skip 2 bytes past the end of the branch node.
  3. Third node is 'c'. Good, continue.
  4. Fourth node is the value node. This means "ayc" is in the trie.

The jump-backward nodes would be added as follows:

b'a'
0b11000010 <-- a branch node (0b110) containing 2 edges
b'x'
b'y'
3          <-- part of the branch node: jump over 3 bytes if the value is 'y'
0b11100010 <-- a jump-backward node (0b1110) that says to jump 4 bytes back
b'b'
0b10000000 <-- value node (0b10) with value 0
0b11100101 <-- a jump-backward node (0b1110) that says to jump 7 bytes back
b'c'
0b10000001 <-- value node (0b10) with value 1

Reading the trie backward would be as follows; suppose you are trying to extract the string "axb":

  1. Start at index 6 (this index is the integer that needs to be stored externally in order to perform reverse lookup)
  2. The node at index 6 is 'b'. Initialize the string with that character.
  3. Walk back. There is a jump-backward node, which tells us to go to the branch node.
  4. Evaluate the branch node to find the edge that led to our earlier position. It is a 'x'; prepend it to the string.
  5. Walk back. The node is 'a'; prepend it to the string.
  6. We are at the beginning of the string, and we successfully extracted "axb".

Alternatively, we could make a separate data structure that only supports reverse lookup:

b'a'
b'x'
b'b'
0b11100001 <-- skip 2
b'y'
b'c'

With this simpler structure, we only store jump-back nodes and remove everything else. Again, we start at the end of the string, and then walk backward, obeying skip nodes until we reach the start of the string. (We could instead store it in reverse order and walk forward.)

Related: #1397

CC @Manishearth @echeran @js-choi

Metadata

Metadata

Assignees

No one assigned

    Labels

    C-zerovecComponent: Yoke, ZeroVec, DataBake

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions