**Is your feature request related to a problem? Please describe.**
When using JSON fields (or dynamic mappings), the number of fields can be arbitrarily large.
**Describe the solution you'd like**
We would like the `field_caps` endpoint to return only K results, prioritizing fields that appear in many documents. This doesn't need to be an exact top-K; we typically just want to prune a long tail of field names with random values that appear in only one or a few documents.
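To make the intent concrete, here is a minimal sketch of such a prune, assuming hypothetical types and names (`FieldCap`, `from_mapping`, and `prune_field_caps` are illustrative, not Quickwit's actual API): mapped fields are always kept, and only the top K dynamic fields by doc count survive.

```rust
/// Illustrative per-field summary; not Quickwit's actual types.
struct FieldCap {
    name: String,
    doc_count: u64,     // (approximate) number of docs containing the field
    from_mapping: bool, // declared in the doc mapping vs. dynamically created
}

/// Keep every field from the doc mapping, plus the top `k` dynamic fields
/// by doc count, pruning the long tail of rare dynamic field names.
fn prune_field_caps(fields: Vec<FieldCap>, k: usize) -> Vec<FieldCap> {
    let (mut kept, mut dynamic): (Vec<_>, Vec<_>) =
        fields.into_iter().partition(|f| f.from_mapping);
    dynamic.sort_unstable_by(|a, b| b.doc_count.cmp(&a.doc_count));
    dynamic.truncate(k);
    kept.append(&mut dynamic);
    kept
}
```

Because doc counts would be gathered per split and merged across splits, ties and slightly stale counts make this only an approximate top-K, which is all we need here.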
**Describe alternatives you've considered**
We could use pattern matching to filter out some field names, but an approximate top-K is easier to use.
**Implementation proposals**
- fields that are part of the doc mapping can always be returned: their number doesn't grow with the dataset, and they are usually "important"
- we would need an approximate count, for each field, of the number of docs where it appears. We need it at least for fields dynamically created in a JSON field. It seems that getting that count would be quite costly for fast fields, as it would require deserializing every column:
  - a) we could amortize this cost by storing the count in the `split_fields` section of the split. This section is currently extracted from the Tantivy index in the packager. It would still be fairly costly:
    - for splits with `fast: true` on JSON fields, it would mean deserializing all that data
    - if the split has many fields (e.g. millions), it could be even worse, as we would need to read all this fragmented data
  - b) we could build the list of fields with their count at index time
    - Store List of Fields in Segment tantivy#2279 proposes to store the field list in the Tantivy index
    - we could also build the list of `FieldMetadata` (with the counts) at index time without storing it in the index (i.e. something like `SegmentIndexWriter::add_document_track_fields(&mut self, document: D, fields_metadata: &mut HashMap<K, FieldMetadata>)`); a minimal sketch of such a hook follows this list
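As a rough illustration of option b), here is a minimal sketch of such a tracking hook, under assumed shapes: the real `SegmentIndexWriter` lives in Tantivy, `FieldMetadata` would carry more than a count, and a document is reduced here to a flattened list of (field_path, value) pairs.

```rust
use std::collections::{HashMap, HashSet};

/// Illustrative stand-in for the FieldMetadata mentioned above.
#[derive(Default)]
struct FieldMetadata {
    doc_count: u64, // number of documents in which the field appears
}

/// Sketch of the proposed hook: while a document is added to the segment,
/// record each field it contains. A field occurring several times in one
/// document is still counted only once for that document.
fn add_document_track_fields(
    doc: &[(&str, &str)],
    fields_metadata: &mut HashMap<String, FieldMetadata>,
) {
    let mut seen = HashSet::new();
    for (field_path, _value) in doc {
        if seen.insert(*field_path) {
            fields_metadata
                .entry((*field_path).to_string())
                .or_default()
                .doc_count += 1;
        }
    }
}

fn main() {
    let mut fields_metadata = HashMap::new();
    add_document_track_fields(&[("app.status", "ok"), ("app.user", "alice")], &mut fields_metadata);
    add_document_track_fields(&[("app.status", "err")], &mut fields_metadata);
    assert_eq!(fields_metadata["app.status"].doc_count, 2);
    assert_eq!(fields_metadata["app.user"].doc_count, 1);
}
```

The resulting map could then be written into the split's `split_fields` section at packaging time (tying into option a), so that serving `field_caps` never needs to deserialize the fast field columns.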