Make a ZeroTrie dense data optimization

ZeroTrie excels at storing sparse data and identical prefixes, but it is not as strong at dense data and identical suffixes.

For example, consider these strings (for simplicity, assume each string maps to a unique integer):

```
ar/FR
ar/IR
ar/SA
ar/UK
ar/US
en/AU
en/FR
en/UK
en/US
fr/FR
fr/SA
fr/UK
fr/US
it/IT
```

In a ZeroAsciiTrie, these strings take up 71 B. In a VarZeroVec (Index16), they use 98 B.

However, we can optimize the storage by observing that there is a dense matrix hiding inside these strings:

| Lang \ Region | FR | SA | UK | US |
|---|---|---|---|---|
| ar | ✔️ | ✔️ | ✔️ | ✔️ |
| en | ✔️ | ❌ | ✔️ | ✔️ |
| fr | ✔️ | ✔️ | ✔️ | ✔️ |

The ❌ is to illustrate that `en-SA` is not in the input data and the matrix need not be fully populated.

We can then store the dense matrix alongside a a trie with region names and a language trie with just these keys:

```
ar
ar/IR
en
en/AU
fr
it/IT
```

The sum of the size of the two tries and the dense matrix, assuming 1 B per cell in the matrix, is 55 B.

(For comparison, storing the key and row headers in their own tries with a fully dense matrix would be 56 B: not a significant difference in this toy example, but it starts to blow up when the dense matrix has a lot of sparsity.)

These numbers scale up to bigger datasets. For example, with this type of optimization, we could reduce the region display names lookup table from 272 kB to 137 kB.

See some early work in https://github.com/unicode-org/icu4x/pull/6880

An open question is how we can detect the dense matrix inside of a zerotrie. An easy way to start would be to give a field separator hint like `/`, but it seems like we should be able to automatically extract the dense matrix from any arbitrary set of strings.

@Manishearth @echeran thoughts?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Make a ZeroTrie dense data optimization #6920

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Lang \ Region	FR	SA	UK	US
ar	✔️	✔️	✔️	✔️
en	✔️	❌	✔️	✔️
fr	✔️	✔️	✔️	✔️

Make a ZeroTrie dense data optimization #6920

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions