JSON document search in Lance #3841
Replies: 2 comments 1 reply
-
Interesting idea. It could work well if the schema is mostly stable, but if there's really a huge variety with deep nesting, you could end up with thousands of columns, most of which you never search at all. And then doesn't fetching the whole JSON also become way too expensive if it's all broken out like that?
-
Another potential approach is to leave the data as a single column but allow indices to be created on paths within that column. For example, we could create an index on a given path. Training an index would be slightly more expensive because we'd need to do more I/O: to train the index we would have to read the entire JSON column, parse each document, strip out the path of interest, and then train the index on those values. Searching the index would not be any more expensive. Users would not be able to fetch individual columns from the JSON; if we need that, then I think we'd eventually want something like VARIANT. The advantage of this approach over the proposed approach and the tantivy approach is that we would not need to create an index on every field in the JSON column (which would be both expensive to maintain and require at least as much storage as the JSON column itself). The disadvantages are that we don't get column projection, and users have to manually specify which paths they are interested in using as filters.
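A minimal sketch of the training cost described above, in plain Python. The names (`extract_path`, `train_path_index`) and the dict-based inverted index are hypothetical stand-ins, not Lance APIs; the point is that training must scan and parse the whole JSON column, while search is an ordinary lookup:

```python
import json

def extract_path(doc: dict, path: str):
    """Walk a dotted path like 'address.city' into a parsed JSON document."""
    node = doc
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node

def train_path_index(json_rows, path):
    """Build an inverted index (value -> row ids) for one path.

    Training reads every row of the JSON column and parses it, keeping
    only the path of interest -- the extra I/O cost described above.
    """
    index = {}
    for row_id, raw in enumerate(json_rows):
        value = extract_path(json.loads(raw), path)
        if value is not None:
            index.setdefault(value, []).append(row_id)
    return index

rows = [
    '{"address": {"city": "San Francisco"}, "name": "a"}',
    '{"address": {"city": "Seattle"}, "name": "b"}',
    '{"name": "c"}',
]
idx = train_path_index(rows, "address.city")
# Searching is then a plain index lookup, no more expensive than usual:
idx["San Francisco"]  # -> [0]
```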
-
JSON documents are an important part of multimodal AI data. Currently, if we store a JSON document as a string or blob, we cannot efficiently perform common operations like searching by a path.
Existing approaches
Tantivy
https://github.com/quickwit-oss/tantivy/blob/main/doc/src/json.md
For instance, if user is a JSON field, each document is flattened into tokens that combine the JSON path with the value found at that path. The tokens are then sorted and binary encoded for search.
Parquet VARIANT
https://github.com/apache/parquet-format/blob/master/VariantEncoding.md
Shredding is used to allow fast access to certain keys within the semi-structured document.
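The idea behind shredding can be sketched like this: a frequently accessed key is pulled out into its own typed column, while the remainder of the document stays in a residual semi-structured column. The `shred` helper below is a hypothetical illustration, not the Parquet encoding itself:

```python
import json

def shred(raw_docs, key):
    """Split one hot top-level key into its own typed column;
    everything else stays in a residual variant-like column."""
    typed_col, rest_col = [], []
    for raw in raw_docs:
        doc = json.loads(raw)
        typed_col.append(doc.pop(key, None))  # None where the key is absent
        rest_col.append(json.dumps(doc))
    return typed_col, rest_col

cities, rest = shred(
    ['{"city": "Tokyo", "tags": [1, 2]}', '{"tags": []}'],
    "city",
)
# cities -> ['Tokyo', None]; reads of "city" no longer parse the documents
```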
Potential approach in Lance
I think there can be an alternative approach in Lance, leveraging its data evolution ability. Instead of emitting and encoding tokens for each document as tantivy does, or trying to shred the data in the column, we can emit one column per path. A process can sample the documents, or follow user configuration, to create new columns and related indexes for them. Taking the same example as tantivy, suppose the document is stored in a column named document; then, based on the values in each document, we create and backfill one additional column (with an index) per path.
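The sampling and backfill steps above can be sketched as follows (hypothetical helper names; in practice the columns would be written through Lance's schema evolution rather than Python lists):

```python
import json

def discover_paths(sample_docs):
    """Sample documents and collect the set of dotted paths to leaf
    values; each path becomes a candidate new column."""
    paths = set()

    def walk(node, prefix):
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else key
            if isinstance(value, dict):
                walk(value, path)
            else:
                paths.add(path)

    for raw in sample_docs:
        walk(json.loads(raw), "")
    return sorted(paths)

def backfill(raw_docs, path):
    """Materialize one discovered path as a new column (None where absent)."""
    col = []
    for raw in raw_docs:
        node = json.loads(raw)
        for key in path.split("."):
            node = node.get(key) if isinstance(node, dict) else None
            if node is None:
                break
        col.append(node)
    return col

docs = [
    '{"created_at": "2024-01-01", "address": {"city": "San Francisco"}}',
    '{"created_at": "2024-02-02"}',
]
discover_paths(docs)
# -> ['address.city', 'created_at']
```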
Then users can easily search any field in the document, like
SELECT document.created_at FROM table WHERE document.address.city = 'San Francisco'
which becomes an index-based random-access search. This also addresses the pitfalls of the tantivy JSON index described in their documentation.
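How that query could execute, sketched under the assumption that the per-path columns and a value-to-row-id index already exist (the data below is made up for illustration):

```python
# Hypothetical index on document.address.city: value -> matching row ids
city_index = {"San Francisco": [0, 2]}
# Hypothetical materialized document.created_at column
created_at = ["2024-01-01", "2024-02-02", "2024-03-03"]

# 1. Resolve the WHERE clause via the index (no document parsing).
matching_rows = city_index.get("San Francisco", [])
# 2. Random-access the projected column at just those rows.
result = [created_at[i] for i in matching_rows]
# -> ['2024-01-01', '2024-03-03']
```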
Note on that "process to sample the documents..."
I think this should be more of a business logic outside the Lance format to wrap the behavior described above, but maybe it also makes sense to have some related concepts exposed in Lance. Curious what others think.