Skip to content

Make a ZeroTrie dense data optimization #6920

@sffc

Description

@sffc

ZeroTrie excels at storing sparse data and identical prefixes, but it is not as strong at dense data and identical suffixes.

For example, consider these strings (for simplicity, assume each string maps to a unique integer):

ar/FR
ar/IR
ar/SA
ar/UK
ar/US
en/AU
en/FR
en/UK
en/US
fr/FR
fr/SA
fr/UK
fr/US
it/IT

In a ZeroAsciiTrie, these strings take up 71 B. In a VarZeroVec (Index16), they use 98 B.

However, we can optimize the storage by observing that there is a dense matrix hiding inside these strings:

Lang \ Region FR SA UK US
ar ✔️ ✔️ ✔️ ✔️
en ✔️ ✔️ ✔️
fr ✔️ ✔️ ✔️ ✔️

The ❌ is to illustrate that en-SA is not in the input data and the matrix need not be fully populated.

We can then store the dense matrix alongside a a trie with region names and a language trie with just these keys:

ar
ar/IR
en
en/AU
fr
it/IT

The sum of the size of the two tries and the dense matrix, assuming 1 B per cell in the matrix, is 55 B.

(For comparison, storing the key and row headers in their own tries with a fully dense matrix would be 56 B: not a significant difference in this toy example, but it starts to blow up when the dense matrix has a lot of sparsity.)

These numbers scale up to bigger datasets. For example, with this type of optimization, we could reduce the region display names lookup table from 272 kB to 137 kB.

See some early work in #6880

An open question is how we can detect the dense matrix inside of a zerotrie. An easy way to start would be to give a field separator hint like /, but it seems like we should be able to automatically extract the dense matrix from any arbitrary set of strings.

@Manishearth @echeran thoughts?

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions