From a99b93a59e9c29a3972e9bc7f985fe19d3d36f59 Mon Sep 17 00:00:00 2001
From: Anton Rubin
Date: Wed, 21 May 2025 16:07:42 +0100
Subject: [PATCH 1/3] adding rewrite parameter docs

Signed-off-by: Anton Rubin
---
 _query-dsl/rewrite-parameter.md | 174 ++++++++++++++++++++++++++++++++
 1 file changed, 174 insertions(+)
 create mode 100644 _query-dsl/rewrite-parameter.md

diff --git a/_query-dsl/rewrite-parameter.md b/_query-dsl/rewrite-parameter.md
new file mode 100644
index 00000000000..abc7d50b78c
--- /dev/null
+++ b/_query-dsl/rewrite-parameter.md
@@ -0,0 +1,174 @@
+---
+layout: default
+title: Rewrite
+nav_order: 80
+---
+
+# Rewrite
+
+Multi-term queries like `wildcard`, `prefix`, `regexp`, `fuzzy`, and `range` expand internally into sets of terms. The `rewrite` parameter allows you to control how these term expansions are executed and scored.
+
+When a multi-term query expands into many terms (for example `prefix: "error*"` matching hundreds of terms), they are converted into actual term queries internally. This process can:
+
+* Exceed the `indices.query.bool.max_clause_count` limit (default `1024`)
+* Affect how scores are calculated for matching documents
+* Impact memory and latency depending on the rewrite method used
+
+## Available rewrite methods
+
+| Rewrite method | Description |
+| [`constant_score`](#constant_score-default) | (Default) All expanded terms are evaluated together as a single unit, assigning the same score to every match. Efficient for filtering use cases. |
+| [`scoring_boolean`](#scoring_boolean) | Breaks the query into a Boolean `should` clause with one term query per match. Each result is scored individually based on relevance. |
+| [`constant_score_boolean`](#constant_score_boolean) | Similar to `scoring_boolean`, but all documents receive a fixed score regardless of term frequency. Maintains Boolean structure without TF/IDF weighting. |
+| [`top_terms_N`](#top_terms_N) | Restricts scoring and execution to the N most frequent terms. 
Reduces resource usage and prevents clause overload. |
+| [`top_terms_boost_N`](#top_terms_boost_N) | Like `top_terms_N`, but uses static boosting instead of full scoring. Offers performance improvements with simplified relevance. |
+| [`top_terms_blended_freqs_N`](#top_terms_blended_freqs_N) | Chooses the top N matching terms and averages their document frequencies for scoring. Produces balanced scores without full term explosion. |
+
+## Boolean-based rewrite limits
+
+All Boolean-based rewrites, such as `scoring_boolean`, `constant_score_boolean`, and `top_terms_*`, are subject to:
+
+```json
+indices.query.bool.max_clause_count
+```
+
+This setting controls the maximum number of allowed Boolean `should` clauses (default: 1024). If your query expands beyond this limit, it will be rejected with a `too_many_clauses` error.
+
+## constant_score (default)
+
+* Executes all term matches as a single bitset query.
+* Ignores scoring altogether; every document gets `_score = 1.0`.
+* Fastest option; ideal when filtering is the goal.
+
+```json
+POST /logs/_search
+{
+  "query": {
+    "wildcard": {
+      "message": {
+        "value": "warning*"
+      }
+    }
+  }
+}
+```
+{% include copy-curl.html %}
+
+## scoring_boolean
+
+* Expands the wildcard into individual `term` queries inside a Boolean `should` clause.
+* Each document’s score reflects how many terms it matches and their term frequency.
+* Can trigger `too_many_clauses` if many terms match.
+
+```json
+POST /logs/_search
+{
+  "query": {
+    "wildcard": {
+      "message": {
+        "value": "warning*",
+        "rewrite": "scoring_boolean"
+      }
+    }
+  }
+}
+```
+{% include copy-curl.html %}
+
+## constant_score_boolean
+
+* Similar structure to `scoring_boolean`, but documents are not ranked.
+* All matching docs receive the same score.
+* This retains Boolean clause flexibility (e.g., use with `must_not`) without ranking. 
+
+```json
+POST /logs/_search
+{
+  "query": {
+    "wildcard": {
+      "message": {
+        "value": "warning*",
+        "rewrite": "constant_score_boolean"
+      }
+    }
+  }
+}
+```
+{% include copy-curl.html %}
+
+## top_terms_N
+
+* Only the N most frequent matching terms are selected and scored.
+* Useful when you expect large expansion and want to limit load.
+* Other valid terms are ignored to preserve performance.
+
+```json
+POST /logs/_search
+{
+  "query": {
+    "wildcard": {
+      "message": {
+        "value": "warning*",
+        "rewrite": "top_terms_2"
+      }
+    }
+  }
+}
+```
+{% include copy-curl.html %}
+
+## top_terms_boost_N
+
+* Limits expansion to the top N terms like `top_terms_N`.
+* Rather than computing TF/IDF, it assigns a pre-set boost per term.
+* Provides faster execution with predictable relevance weights.
+
+```json
+POST /logs/_search
+{
+  "query": {
+    "wildcard": {
+      "message": {
+        "value": "warning*",
+        "rewrite": "top_terms_boost_2"
+      }
+    }
+  }
+}
+```
+{% include copy-curl.html %}
+
+## top_terms_blended_freqs_N
+
+* Picks the top N matching terms and applies a blended frequency to all.
+* Blending makes scoring smoother across terms that differ in frequency.
+* Good tradeoff when you want performance with realistic scoring.
+
+```json
+POST /logs/_search
+{
+  "query": {
+    "wildcard": {
+      "message": {
+        "value": "warning*",
+        "rewrite": "top_terms_blended_freqs_2"
+      }
+    }
+  }
+}
+```
+{% include copy-curl.html %}
+
+## Summary
+
+The `rewrite` parameter gives you control over how multi-term queries behave under the hood. 
+
+| Mode | Scores | Performance | Notes |
+| --------------------------- | -------------------------------------- | ----------- | --------------------------------------------- |
+| `constant_score` | Same score for all matches | Best | Default mode, ideal for filters |
+| `scoring_boolean` | TF/IDF-based | Moderate | Full relevance scoring |
+| `constant_score_boolean` | Same score, but with Boolean structure | Moderate | Use with `must_not` or `minimum_should_match` |
+| `top_terms_N` | TF/IDF on top N terms | Efficient | Truncates expansion |
+| `top_terms_boost_N` | Static boosts | Fast | Less accurate |
+| `top_terms_blended_freqs_N` | Blended score | Balanced | Best scoring/efficiency tradeoff |
+

From d2cb7f1d9f273f525fae16cfb451d700bdb991db Mon Sep 17 00:00:00 2001
From: Anton Rubin
Date: Tue, 8 Jul 2025 10:36:45 +0100
Subject: [PATCH 2/3] addressing the PR comments

Signed-off-by: Anton Rubin
---
 _query-dsl/rewrite-parameter.md | 87 ++++++++++++++++++++++++++++-----
 1 file changed, 76 insertions(+), 11 deletions(-)

diff --git a/_query-dsl/rewrite-parameter.md b/_query-dsl/rewrite-parameter.md
index abc7d50b78c..e3d63995c51 100644
--- a/_query-dsl/rewrite-parameter.md
+++ b/_query-dsl/rewrite-parameter.md
@@ -17,7 +17,7 @@ When a multi-term query expands into many terms (for example `prefix: "error*"`
 ## Available rewrite methods
 
 | Rewrite method | Description |
-| [`constant_score`](#constant_score-default) | (Default) All expanded terms are evaluated together as a single unit, assigning the same score to every match. Efficient for filtering use cases. |
+| [`constant_score`](#constant_score-default) | (Default) All expanded terms are evaluated together as a single unit, and every match is assigned the same score; matching documents are not scored individually. This makes it very efficient for filtering use cases. |
 | [`scoring_boolean`](#scoring_boolean) | Breaks the query into a Boolean `should` clause with one term query per match. 
Each result is scored individually based on relevance. |
 | [`constant_score_boolean`](#constant_score_boolean) | Similar to `scoring_boolean`, but all documents receive a fixed score regardless of term frequency. Maintains Boolean structure without TF/IDF weighting. |
 | [`top_terms_N`](#top_terms_N) | Restricts scoring and execution to the N most frequent terms. Reduces resource usage and prevents clause overload. |
@@ -32,11 +32,46 @@ All Boolean-based rewrites, such as `scoring_boolean`, `constant_score_boolean`,
 indices.query.bool.max_clause_count
 ```
 
-This setting controls the maximum number of allowed Boolean `should` clauses (default: 1024). If your query expands beyond this limit, it will be rejected with a `too_many_clauses` error.
+This setting controls the maximum number of allowed Boolean `should` clauses (default: `1024`). If your query expands beyond this limit, it will be rejected with a `too_many_clauses` error.
+
+For example, a wildcard like "error*" might expand to hundreds or thousands of matching terms, such as: "error", "errors", "error_log", "error404", and others. Each of these terms turns into a separate term query. If the number of terms exceeds the `indices.query.bool.max_clause_count` limit, the query fails with an error. See following example:
+
+```json
+POST /logs/_search
+{
+  "query": {
+    "wildcard": {
+      "message": {
+        "value": "error*",
+        "rewrite": "scoring_boolean"
+      }
+    }
+  }
+}
+```
+{% include copy-curl.html %}
+
+The query is expanded internally as follows:
+
+```json
+{
+  "bool": {
+    "should": [
+      { "term": { "message": "error" } },
+      { "term": { "message": "errors" } },
+      { "term": { "message": "error_log" } },
+      { "term": { "message": "error404" } },
+      ...
+    ]
+  }
+}
+```
 
 ## constant_score (default)
 
-* Executes all term matches as a single bitset query.
+The `constant_score` rewrite method wraps all expanded terms into a single query and skips the scoring phase entirely. 
This approach offers the following characteristics:
+
+* Executes all term matches as a single [bit array](https://en.wikipedia.org/wiki/Bit_array) query.
 * Ignores scoring altogether; every document gets `_score = 1.0`.
 * Fastest option; ideal when filtering is the goal.
 
@@ -56,9 +91,12 @@ POST /logs/_search
 
 ## scoring_boolean
 
+The `scoring_boolean` rewrite method breaks the expanded terms into separate `term` queries combined under a Boolean `should` clause. This approach works as follows:
+
 * Expands the wildcard into individual `term` queries inside a Boolean `should` clause.
 * Each document’s score reflects how many terms it matches and their term frequency.
-* Can trigger `too_many_clauses` if many terms match.
+
+The following example uses the `scoring_boolean` rewrite method:
 
 ```json
 POST /logs/_search
@@ -77,18 +115,26 @@ POST /logs/_search
 
 ## constant_score_boolean
 
+The `constant_score_boolean` rewrite method uses the same Boolean structure as `scoring_boolean` but disables scoring, making it useful when clause logic is needed without relevance ranking. This method offers the following characteristics:
+
 * Similar structure to `scoring_boolean`, but documents are not ranked.
 * All matching docs receive the same score.
-* This retains Boolean clause flexibility (e.g., use with `must_not`) without ranking.
+* This retains Boolean clause flexibility, such as using `must_not`, without ranking. 
+
+See the following example query using `must_not`:
 
 ```json
 POST /logs/_search
 {
   "query": {
-    "wildcard": {
-      "message": {
-        "value": "warning*",
-        "rewrite": "constant_score_boolean"
+    "bool": {
+      "must_not": {
+        "wildcard": {
+          "message": {
+            "value": "error*",
+            "rewrite": "constant_score_boolean"
+          }
+        }
       }
     }
   }
@@ -96,6 +142,25 @@ POST /logs/_search
 ```
 {% include copy-curl.html %}
 
+This query is internally expanded as follows:
+
+```json
+{
+  "bool": {
+    "must_not": {
+      "bool": {
+        "should": [
+          { "term": { "message": "error" } },
+          { "term": { "message": "errors" } },
+          { "term": { "message": "error_log" } },
+          ...
+        ]
+      }
+    }
+  }
+}
+```
+
 ## top_terms_N
 
 * Only the N most frequent matching terms are selected and scored.
@@ -142,7 +207,7 @@ POST /logs/_search
 
 * Picks the top N matching terms and applies a blended frequency to all.
 * Blending makes scoring smoother across terms that differ in frequency.
-* Good tradeoff when you want performance with realistic scoring.
+* Good trade-off when you want performance with realistic scoring. 
 
 ```json
 POST /logs/_search
@@ -170,5 +235,5 @@ The `rewrite` parameter gives you control over how multi-term queries behave und
 | `constant_score_boolean` | Same score, but with Boolean structure | Moderate | Use with `must_not` or `minimum_should_match` |
 | `top_terms_N` | TF/IDF on top N terms | Efficient | Truncates expansion |
 | `top_terms_boost_N` | Static boosts | Fast | Less accurate |
-| `top_terms_blended_freqs_N` | Blended score | Balanced | Best scoring/efficiency tradeoff |
+| `top_terms_blended_freqs_N` | Blended score | Balanced | Best scoring/efficiency trade-off |
 

From 266c9f9c433ff75dea3ce9cdb02ed1e745efe98c Mon Sep 17 00:00:00 2001
From: Anton Rubin
Date: Tue, 8 Jul 2025 11:27:01 +0100
Subject: [PATCH 3/3] addressing comments

Signed-off-by: Anton Rubin
---
 _query-dsl/rewrite-parameter.md | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/_query-dsl/rewrite-parameter.md b/_query-dsl/rewrite-parameter.md
index e3d63995c51..ec4165ff658 100644
--- a/_query-dsl/rewrite-parameter.md
+++ b/_query-dsl/rewrite-parameter.md
@@ -26,7 +26,7 @@ When a multi-term query expands into many terms (for example `prefix: "error*"`
 
 ## Boolean-based rewrite limits
 
-All Boolean-based rewrites, such as `scoring_boolean`, `constant_score_boolean`, and `top_terms_*`, are subject to:
+All Boolean-based rewrites, such as `scoring_boolean`, `constant_score_boolean`, and `top_terms_*`, are subject to the following configuration:
 
 ```json
 indices.query.bool.max_clause_count
@@ -34,7 +34,7 @@ indices.query.bool.max_clause_count
 
 This setting controls the maximum number of allowed Boolean `should` clauses (default: `1024`). If your query expands beyond this limit, it will be rejected with a `too_many_clauses` error.
 
-For example, a wildcard like "error*" might expand to hundreds or thousands of matching terms, such as: "error", "errors", "error_log", "error404", and others. Each of these terms turns into a separate term query. 
If the number of terms exceeds the `indices.query.bool.max_clause_count` limit, the query fails with an error. See following example:
+For example, a wildcard, such as "error*", might expand to hundreds or thousands of matching terms, which could include: "error", "errors", "error_log", "error404", and others. Each of these terms turns into a separate term query. If the number of terms exceeds the `indices.query.bool.max_clause_count` limit, the query fails. See the following example:
 
 ```json
 POST /logs/_search
@@ -184,10 +184,14 @@ POST /logs/_search
 
 ## top_terms_boost_N
 
+The `top_terms_boost_N` rewrite method selects the top N matching terms and applies static `boost` values instead of computing full relevance scores. It works as follows:
+
 * Limits expansion to the top N terms like `top_terms_N`.
 * Rather than computing TF/IDF, it assigns a pre-set boost per term.
 * Provides faster execution with predictable relevance weights.
 
+See the following example query using the `top_terms_boost_2` rewrite method:
+
 ```json
 POST /logs/_search
 {
@@ -205,10 +209,14 @@ POST /logs/_search
 
 ## top_terms_blended_freqs_N
 
+The `top_terms_blended_freqs_N` rewrite method selects the top N matching terms and blends their document frequencies to produce more balanced relevance scores. This approach offers the following characteristics:
+
 * Picks the top N matching terms and applies a blended frequency to all.
 * Blending makes scoring smoother across terms that differ in frequency.
 * Good trade-off when you want performance with realistic scoring.
 
+See the following example query using the `top_terms_blended_freqs_2` rewrite method:
+
 ```json
 POST /logs/_search
 {