**Is your feature request related to a problem? Please describe.**
When using JSON fields (or dynamic mappings), the number of fields can be arbitrarily large.
**Describe the solution you'd like**
We would like the `field_caps` endpoint to return only K results, prioritizing fields that appear in many documents. This doesn't need to be an exact top-K; we typically just want to prune a long tail of field names with random values that appear in only one or a few documents.
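To make the intent concrete, here is a minimal sketch of such a prune, assuming hypothetical types and names (`FieldCap`, `from_mapping`, and `prune_field_caps` are illustrative, not Quickwit's actual API): mapped fields are always kept, and only the top K dynamic fields by doc count survive.

```rust
/// Illustrative per-field summary; not Quickwit's actual types.
struct FieldCap {
    name: String,
    doc_count: u64,     // (approximate) number of docs containing the field
    from_mapping: bool, // declared in the doc mapping vs. dynamically created
}

/// Keep every field from the doc mapping, plus the top `k` dynamic fields
/// by doc count, pruning the long tail of rare dynamic field names.
fn prune_field_caps(fields: Vec<FieldCap>, k: usize) -> Vec<FieldCap> {
    let (mut kept, mut dynamic): (Vec<_>, Vec<_>) =
        fields.into_iter().partition(|f| f.from_mapping);
    dynamic.sort_unstable_by(|a, b| b.doc_count.cmp(&a.doc_count));
    dynamic.truncate(k);
    kept.append(&mut dynamic);
    kept
}
```

Because doc counts would be gathered per split and merged across splits, ties and slightly stale counts make this only an approximate top-K, which is all we need here.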
**Describe alternatives you've considered**
We could use pattern matching to filter out some field names, but an approximate top-K is easier to use.
**Implementation proposals**
- fields that are part of the doc mapping can always be returned: their number doesn't grow with the dataset, and they are usually "important"
- we would need an approximate count, for each field, of the number of docs where it appears. We need it at least for fields dynamically created in a JSON field. It seems that getting that count would be quite costly for fast fields, as it would require deserializing every column:
  - a) we could amortize this cost by storing the count in the `split_fields` section of the split. This section is currently extracted from the Tantivy index in the packager. It would still be fairly costly:
    - for splits with `fast: true` on JSON fields, it would mean deserializing all that data
    - if the split has many fields (e.g. millions), it could be even worse, as we would need to read all this fragmented data
  - b) we could build the list of fields with their count at index time
    - Store List of Fields in Segment tantivy#2279 proposes to store the field list in the Tantivy index
    - we could also build the list of `FieldMetadata` (with the counts) at index time without storing it in the index (i.e. something like `SegmentIndexWriter::add_document_track_fields(&mut self, document: D, fields_metadata: &mut HashMap<K, FieldMetadata>)`); a minimal sketch of such a hook follows this list
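As a rough illustration of option b), here is a minimal sketch of such a tracking hook, under assumed shapes: the real `SegmentIndexWriter` lives in Tantivy, `FieldMetadata` would carry more than a count, and a document is reduced here to a flattened list of (field_path, value) pairs.

```rust
use std::collections::{HashMap, HashSet};

/// Illustrative stand-in for the FieldMetadata mentioned above.
#[derive(Default)]
struct FieldMetadata {
    doc_count: u64, // number of documents in which the field appears
}

/// Sketch of the proposed hook: while a document is added to the segment,
/// record each field it contains. A field occurring several times in one
/// document is still counted only once for that document.
fn add_document_track_fields(
    doc: &[(&str, &str)],
    fields_metadata: &mut HashMap<String, FieldMetadata>,
) {
    let mut seen = HashSet::new();
    for (field_path, _value) in doc {
        if seen.insert(*field_path) {
            fields_metadata
                .entry((*field_path).to_string())
                .or_default()
                .doc_count += 1;
        }
    }
}

fn main() {
    let mut fields_metadata = HashMap::new();
    add_document_track_fields(&[("app.status", "ok"), ("app.user", "alice")], &mut fields_metadata);
    add_document_track_fields(&[("app.status", "err")], &mut fields_metadata);
    assert_eq!(fields_metadata["app.status"].doc_count, 2);
    assert_eq!(fields_metadata["app.user"].doc_count, 1);
}
```

The resulting map could then be written into the split's `split_fields` section at packaging time (tying into option a), so that serving `field_caps` never needs to deserialize the fast field columns.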