Limit the number of results returned by _field_caps #5832

@rdettai

Is your feature request related to a problem? Please describe.
When using JSON fields (or dynamic mappings), the number of fields can be arbitrarily large, so the _field_caps response can grow without bound.

Describe the solution you'd like
We would like the _field_caps endpoint to return only K results, prioritizing fields that appear in many documents. This doesn't need to be an exact top K: we typically just want to prune the long tail of field names with random values that appear in only one or a few documents.

Describe alternatives you've considered
We could use pattern matching to filter out some field names, but an approximate top K is easier to use.

Implementation proposals

  • Fields that are part of the doc mapping can always be returned: their number doesn't grow with the dataset and they are usually "important".
  • We would need an approximate count, for each field, of the number of docs in which it appears. We need it at least for fields dynamically created in a JSON field.
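To make the two points above concrete, here is a rough sketch of the selection step. It is only an illustration, not Quickwit code: `select_fields`, `doc_counts` and `mapped_fields` are hypothetical names, and it assumes that K caps only the dynamically created fields while doc-mapping fields are returned on top of that.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical helper (not part of Quickwit): picks the field names to return.
/// `doc_counts` maps a field name to the approximate number of docs containing it.
/// `mapped_fields` holds the fields declared in the doc mapping.
fn select_fields(
    doc_counts: &HashMap<String, u64>,
    mapped_fields: &HashSet<String>,
    k: usize,
) -> Vec<String> {
    // Fields from the doc mapping are always kept, whatever their count.
    let mut selected: Vec<String> = mapped_fields.iter().cloned().collect();
    selected.sort();

    // Dynamic fields are ranked by descending doc count; ties are broken by
    // name so that the output is deterministic.
    let mut dynamic: Vec<(&String, &u64)> = doc_counts
        .iter()
        .filter(|(name, _)| !mapped_fields.contains(*name))
        .collect();
    dynamic.sort_by(|a, b| b.1.cmp(a.1).then_with(|| a.0.cmp(b.0)));

    // Assumption: K applies to dynamic fields only.
    selected.extend(dynamic.into_iter().take(k).map(|(name, _)| name.clone()));
    selected
}
```

Since the counts only need to be approximate, the same ranking would work on counts merged from per-split estimates.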

Getting this per-field doc count seems quite costly for fast fields, as it would require deserializing every column:

  • a) We could amortize this cost by storing the count in the split_fields section of the split. This section is currently extracted from the Tantivy index in the packager. It would still be fairly costly:
    • for splits with fast: true on JSON fields, it would mean deserializing all that data
    • if the split has many fields (e.g. millions), it could be even worse, as we would need to read all this fragmented data
  • b) We could build the list of fields with their counts at index time:
    • tantivy#2279 ("Store List of Fields in Segment") proposes to store the field list in the Tantivy index
    • we could also build the list of FieldMetadata (with the counts) at index time but without storing it in the index, i.e. SegmentIndexWriter.add_document_track_fields(&mut self, document: D, fields_metadata: &mut HashMap<K, FieldMetadata>)
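A minimal sketch of what option b) could look like, independent of Tantivy. Everything here is an assumption made for illustration: the `Value` enum stands in for a JSON document, `FieldMetadata` only carries a `doc_count`, and `track_fields` plays the role of the tracking part of the proposed `add_document_track_fields`. Each field is counted once per document, even if it appears several times in it.

```rust
use std::collections::{BTreeMap, HashMap, HashSet};

/// Minimal stand-in for a JSON document (illustrative, not Tantivy's type).
enum Value {
    Leaf, // any scalar: string, number, bool...
    Array(Vec<Value>),
    Object(BTreeMap<String, Value>),
}

/// Hypothetical per-field metadata; only the doc count is sketched here.
#[derive(Debug, Default, PartialEq)]
struct FieldMetadata {
    doc_count: u64,
}

/// Collects the dotted paths of all leaves reachable from `value`.
fn collect_paths(value: &Value, prefix: &str, paths: &mut HashSet<String>) {
    match value {
        Value::Leaf => {
            // A bare scalar at the root has no field name, so it is skipped.
            if !prefix.is_empty() {
                paths.insert(prefix.to_string());
            }
        }
        Value::Array(items) => {
            // Array elements share the path of the array itself.
            for item in items {
                collect_paths(item, prefix, paths);
            }
        }
        Value::Object(entries) => {
            for (key, child) in entries {
                let path = if prefix.is_empty() {
                    key.clone()
                } else {
                    format!("{prefix}.{key}")
                };
                collect_paths(child, &path, paths);
            }
        }
    }
}

/// Updates `fields_metadata` with the fields seen in one document.
fn track_fields(document: &Value, fields_metadata: &mut HashMap<String, FieldMetadata>) {
    // The set deduplicates paths, so a field counts once per document.
    let mut paths = HashSet::new();
    collect_paths(document, "", &mut paths);
    for path in paths {
        fields_metadata.entry(path).or_default().doc_count += 1;
    }
}
```

Calling `track_fields` for every document added to a segment leaves the map holding exact per-segment counts, which is more than the approximate top K needs. The cost is one set of paths per document at index time, instead of deserializing columns later.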

Labels: enhancement (New feature or request)
