JSON document search in Lance #3841
Replies: 2 comments 1 reply
-
Interesting idea. It could work well if the schema is mostly stable, but if there's really a huge variety with deep nesting, you could end up with thousands of columns, most of which you never search at all. And then doesn't fetching the whole JSON also become way too expensive if it's all broken out like that?
-
Another potential approach is to leave the data as a single column but allow indices to be created on paths within that column. For example, we could create an index on a given path. Training an index would be slightly more expensive because we'd need to do more I/O: to train the index we would have to read the entire JSON column, parse each document, strip out the path of interest, and then train the index on those values. Searching the index would not be any more expensive. Users would not be able to fetch individual columns from the JSON; if we need that, then I think we'd eventually want something like VARIANT. The advantage of this approach over the proposed approach and the tantivy approach is that we would not need to create an index on every field in the JSON column (which would be both expensive to maintain and require at least as much storage as the JSON column itself). The disadvantages are that we don't get column projection, and users have to manually specify which paths they are interested in using as filters.
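A minimal sketch of the training cost described above, in plain Python. The names (`extract_path`, `train_path_index`) and the dict-based inverted index are hypothetical stand-ins, not Lance APIs; the point is that training must scan and parse the whole JSON column, while search is an ordinary lookup:

```python
import json

def extract_path(doc: dict, path: str):
    """Walk a dotted path like 'address.city' into a parsed JSON document."""
    node = doc
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node

def train_path_index(json_rows, path):
    """Build an inverted index (value -> row ids) for one path.

    Training reads every row of the JSON column and parses it, keeping
    only the path of interest -- the extra I/O cost described above.
    """
    index = {}
    for row_id, raw in enumerate(json_rows):
        value = extract_path(json.loads(raw), path)
        if value is not None:
            index.setdefault(value, []).append(row_id)
    return index

rows = [
    '{"address": {"city": "San Francisco"}, "name": "a"}',
    '{"address": {"city": "Seattle"}, "name": "b"}',
    '{"name": "c"}',
]
idx = train_path_index(rows, "address.city")
# Searching is then a plain index lookup, no more expensive than usual:
idx["San Francisco"]  # -> [0]
```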
-
JSON documents are an important part of multimodal AI data. Currently, if we store a JSON document as a string or blob, we cannot efficiently perform common operations like searching by a path.
Existing approaches
Tantivy
https://github.com/quickwit-oss/tantivy/blob/main/doc/src/json.md
For instance, if user is a JSON field, each document is flattened into tokens that combine the JSON path with the value found at that path. The tokens are then sorted and binary encoded for search.
Parquet VARIANT
https://github.com/apache/parquet-format/blob/master/VariantEncoding.md
Shredding is used to allow fast access to certain keys within the semi-structured document.
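The idea behind shredding can be sketched like this: a frequently accessed key is pulled out into its own typed column, while the remainder of the document stays in a residual semi-structured column. The `shred` helper below is a hypothetical illustration, not the Parquet encoding itself:

```python
import json

def shred(raw_docs, key):
    """Split one hot top-level key into its own typed column;
    everything else stays in a residual variant-like column."""
    typed_col, rest_col = [], []
    for raw in raw_docs:
        doc = json.loads(raw)
        typed_col.append(doc.pop(key, None))  # None where the key is absent
        rest_col.append(json.dumps(doc))
    return typed_col, rest_col

cities, rest = shred(
    ['{"city": "Tokyo", "tags": [1, 2]}', '{"tags": []}'],
    "city",
)
# cities -> ['Tokyo', None]; reads of "city" no longer parse the documents
```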
Potential approach in Lance
I think there can be an alternative approach in Lance, leveraging its data evolution ability. Instead of emitting and encoding tokens for each document as tantivy does, or trying to shred the data in the column, we can emit one column per path. A process can sample the documents, or follow user configuration, to create new columns and related indexes for them. Taking the same example as tantivy, suppose the document is stored in a column named document; then, based on the values in each document, we create and backfill one additional column (with an index) per path.
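The sampling and backfill steps above can be sketched as follows (hypothetical helper names; in practice the columns would be written through Lance's schema evolution rather than Python lists):

```python
import json

def discover_paths(sample_docs):
    """Sample documents and collect the set of dotted paths to leaf
    values; each path becomes a candidate new column."""
    paths = set()

    def walk(node, prefix):
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else key
            if isinstance(value, dict):
                walk(value, path)
            else:
                paths.add(path)

    for raw in sample_docs:
        walk(json.loads(raw), "")
    return sorted(paths)

def backfill(raw_docs, path):
    """Materialize one discovered path as a new column (None where absent)."""
    col = []
    for raw in raw_docs:
        node = json.loads(raw)
        for key in path.split("."):
            node = node.get(key) if isinstance(node, dict) else None
            if node is None:
                break
        col.append(node)
    return col

docs = [
    '{"created_at": "2024-01-01", "address": {"city": "San Francisco"}}',
    '{"created_at": "2024-02-02"}',
]
discover_paths(docs)
# -> ['address.city', 'created_at']
```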
Then users can easily search any field in the document, like
SELECT document.created_at FROM table WHERE document.address.city = 'San Francisco'
which becomes an index-based random-access search. This also addresses the pitfalls of the tantivy JSON index described in their documentation.
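How that query could execute, sketched under the assumption that the per-path columns and a value-to-row-id index already exist (the data below is made up for illustration):

```python
# Hypothetical index on document.address.city: value -> matching row ids
city_index = {"San Francisco": [0, 2]}
# Hypothetical materialized document.created_at column
created_at = ["2024-01-01", "2024-02-02", "2024-03-03"]

# 1. Resolve the WHERE clause via the index (no document parsing).
matching_rows = city_index.get("San Francisco", [])
# 2. Random-access the projected column at just those rows.
result = [created_at[i] for i in matching_rows]
# -> ['2024-01-01', '2024-03-03']
```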
Note on that "process to sample the documents..."
I think this should be more of a business logic outside the Lance format to wrap the behavior described above, but maybe it also makes sense to have some related concepts exposed in Lance. Curious what others think.