Commit 5d04db8
adding more_like_this query dsl docs (#9746) (#10192)
1 parent 1c71a2c commit 5d04db8

2 files changed: +287 -1 lines changed

_query-dsl/specialized/index.md

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@ OpenSearch supports the following specialized queries:
 
 - `distance_feature`: Calculates document scores based on the dynamically calculated distance between the origin and a document's `date`, `date_nanos`, or `geo_point` fields. This query can skip non-competitive hits.
 
-- `more_like_this`: Finds documents similar to the provided text, document, or collection of documents.
+- [`more_like_this`]({{site.url}}{{site.baseurl}}/query-dsl/specialized/more-like-this/): Finds documents similar to the provided text, document, or collection of documents.
 
 - [`knn`]({{site.url}}{{site.baseurl}}/query-dsl/specialized/k-nn/): Used for searching raw vectors during [vector search]({{site.url}}{{site.baseurl}}/vector-search/).
 

Lines changed: 286 additions & 0 deletions
@@ -0,0 +1,286 @@
---
layout: default
title: More like this
parent: Specialized queries
nav_order: 45
has_math: false
---

# More like this

Use a `more_like_this` query to find documents that are similar to one or more given documents. This is useful for recommendation engines, content discovery, and identifying related items in a dataset.

The `more_like_this` query analyzes the input documents or texts and selects terms that best characterize them. It then searches for other documents that contain those significant terms.

## Prerequisites

Before you use a `more_like_this` query, ensure that the fields you target are indexed and their data type is either [`text`]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/text/) or [`keyword`]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/keyword/).

If you reference documents in the `like` section, OpenSearch needs access to their content. This is typically done through the `_source` field, which is enabled by default. If `_source` is disabled, you must either store the fields individually or configure them to save [`term_vector`]({{site.url}}{{site.baseurl}}/field-types/mapping-parameters/term-vector/) data.

Saving [`term_vector`]({{site.url}}{{site.baseurl}}/field-types/mapping-parameters/term-vector/) information when indexing documents can greatly accelerate `more_like_this` queries because the engine can directly retrieve the important terms without reanalyzing the field text at query time.
{: .note}
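
As a rough sketch of the case in which `_source` is disabled, the following mapping makes field content available in the two alternative ways described above: `title` is stored individually, and `content` saves term vectors. The index name `articles-no-source` is hypothetical and is used only for illustration:

```json
PUT /articles-no-source
{
  "mappings": {
    "_source": { "enabled": false },
    "properties": {
      "title": { "type": "text", "store": true },
      "content": { "type": "text", "term_vector": "yes" }
    }
  }
}
```
{% include copy-curl.html %}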

## Example: No term vector optimization

Create an index named `articles-basic` using the following mapping:

```json
PUT /articles-basic
{
  "mappings": {
    "properties": {
      "title": { "type": "text" },
      "content": { "type": "text" }
    }
  }
}
```
{% include copy-curl.html %}

Add sample documents:

```json
POST /articles-basic/_bulk
{ "index": { "_id": 1 }}
{ "title": "Exploring the Sahara Desert", "content": "Sand dunes and vast landscapes." }
{ "index": { "_id": 2 }}
{ "title": "Amazon Rainforest Tour", "content": "Dense jungle and exotic wildlife." }
{ "index": { "_id": 3 }}
{ "title": "Mountain Adventures", "content": "Snowy peaks and hiking trails." }
```
{% include copy-curl.html %}

Query using the following request:

```json
GET /articles-basic/_search
{
  "query": {
    "more_like_this": {
      "fields": ["content"],
      "like": "jungle wildlife",
      "min_term_freq": 1,
      "min_doc_freq": 1
    }
  }
}
```
{% include copy-curl.html %}

The `more_like_this` query searches for the terms `jungle` and `wildlife` in the `content` field, which matches only one document:

```json
{
  ...
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1.9616582,
    "hits": [
      {
        "_index": "articles-basic",
        "_id": "2",
        "_score": 1.9616582,
        "_source": {
          "title": "Amazon Rainforest Tour",
          "content": "Dense jungle and exotic wildlife."
        }
      }
    ]
  }
}
```
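
To make the term selection step more concrete, the preceding request can be thought of as roughly similar to a Boolean query with one optional `term` clause per selected term. The following request is only a conceptual sketch, not the exact query that OpenSearch builds internally, and the resulting scores may differ:

```json
GET /articles-basic/_search
{
  "query": {
    "bool": {
      "should": [
        { "term": { "content": "jungle" } },
        { "term": { "content": "wildlife" } }
      ],
      "minimum_should_match": 1
    }
  }
}
```
{% include copy-curl.html %}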

## Example: Term vector optimization

Create an index named `articles-optimized` using the following mapping, which saves term vectors for the fields used in the sample documents:

```json
PUT /articles-optimized
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "term_vector": "with_positions_offsets"
      },
      "alias": {
        "type": "text",
        "term_vector": "with_positions_offsets"
      },
      "quote": {
        "type": "text",
        "term_vector": "with_positions_offsets"
      }
    }
  }
}
```
{% include copy-curl.html %}

Insert sample documents into the optimized index:

```json
POST /articles-optimized/_bulk
{ "index": { "_id": "a1" } }
{ "name": "Diana", "alias": "Wonder Woman", "quote": "Justice will come when it is deserved." }
{ "index": { "_id": "a2" } }
{ "name": "Clark", "alias": "Superman", "quote": "Even in the darkest times, hope cuts through." }
{ "index": { "_id": "a3" } }
{ "name": "Bruce", "alias": "Batman", "quote": "I am vengeance. I am the night. I am Batman!" }
```
{% include copy-curl.html %}

Find documents in which the `quote` field contains terms similar to "dark" and "night":

```json
GET /articles-optimized/_search
{
  "query": {
    "more_like_this": {
      "fields": ["quote"],
      "like": "dark night",
      "min_term_freq": 1,
      "min_doc_freq": 1
    }
  }
}
```
{% include copy-curl.html %}

The `more_like_this` query searches for the terms `dark` and `night`. Only document `a3` matches because its `quote` contains the term `night`. With the default `standard` analyzer, which does not apply stemming, `dark` does not match `darkest`. The query returns the following hit:

```json
{
  ...
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1.2363393,
    "hits": [
      {
        "_index": "articles-optimized",
        "_id": "a3",
        "_score": 1.2363393,
        "_source": {
          "name": "Bruce",
          "alias": "Batman",
          "quote": "I am vengeance. I am the night. I am Batman!"
        }
      }
    ]
  }
}
```

## Example: Using multiple documents and text input

The `more_like_this` query allows you to provide multiple sources in the `like` parameter. You can combine free text with documents from the index. This is useful if you want the search to combine relevance signals from several examples.

In the following example, the `like` parameter contains a custom document provided directly in the request and a reference to an existing document with the ID `5` in the `heroes` index. This example assumes that a `heroes` index containing that document already exists:

```json
GET /articles-optimized/_search
{
  "query": {
    "more_like_this": {
      "fields": ["name", "alias"],
      "like": [
        {
          "doc": {
            "name": "Diana",
            "alias": "Wonder Woman",
            "quote": "Courage is not the absence of fear, but the triumph over it."
          }
        },
        {
          "_index": "heroes",
          "_id": "5"
        }
      ],
      "min_term_freq": 1,
      "min_doc_freq": 1,
      "max_query_terms": 25
    }
  }
}
```
{% include copy-curl.html %}

The response contains the documents whose `name` and `alias` fields are most similar to the input provided in the query:

```json
{
  ...
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 2.140194,
    "hits": [
      {
        "_index": "articles-optimized",
        "_id": "a1",
        "_score": 2.140194,
        "_source": {
          "name": "Diana",
          "alias": "Wonder Woman",
          "quote": "Justice will come when it is deserved."
        }
      },
      {
        "_index": "articles-optimized",
        "_id": "a2",
        "_score": 1.1596459,
        "_source": {
          "name": "Clark",
          "alias": "Superman",
          "quote": "Even in the darkest times, hope cuts through."
        }
      }
    ]
  }
}
```

Use this pattern when you want to find documents similar to new content that is not yet indexed while also drawing on terms from existing indexed documents.
{: .note}

## Parameters

The only required parameter for a `more_like_this` query is `like`. The remaining parameters have default values that you can adjust to fine-tune the query. The parameters fall into the following categories.

### Document input parameters

The following table specifies document input parameters.

| Parameter | Required/Optional | Data type | Description |
| :--- | :--- | :--- | :--- |
| `like` | Required | Array of strings or objects | Defines the text or documents for which to find similar documents. You can input free text, real documents from the index, or artificial documents. The analyzer associated with the field processes the text unless overridden. |
| `unlike` | Optional | Array of strings or objects | Provides text or documents whose terms should be *excluded* from influencing the query. Useful for specifying negative examples. |
| `fields` | Optional | Array of strings | Lists fields to use when analyzing text. If not specified, all fields are used. |
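
As an illustration of how `like` and `unlike` can be combined, the following sketch queries the `articles-basic` index from the first example. The input text and parameter values are chosen only for demonstration. The intent is that the terms `hiking` and `trails` from the `unlike` text are excluded when terms are selected from the `like` text:

```json
GET /articles-basic/_search
{
  "query": {
    "more_like_this": {
      "fields": ["content"],
      "like": ["jungle wildlife hiking trails"],
      "unlike": ["hiking trails"],
      "min_term_freq": 1,
      "min_doc_freq": 1
    }
  }
}
```
{% include copy-curl.html %}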

### Term selection parameters

The following table specifies term selection parameters.

| Parameter | Required/Optional | Data type | Description |
| :--- | :--- | :--- | :--- |
| `max_query_terms` | Optional | Integer | Sets the maximum number of terms to select from the input. A higher value increases precision but slows down execution. Default is `25`. |
| `min_term_freq` | Optional | Integer | Terms that appear in the input fewer times than this value are ignored. Default is `2`. |
| `min_doc_freq` | Optional | Integer | Terms that appear in fewer documents than this value are ignored. Default is `5`. |
| `max_doc_freq` | Optional | Integer | Terms that appear in more documents than this value are ignored. Useful for avoiding very common words. Default is unlimited (2<sup>31</sup> - 1). |
| `min_word_length` | Optional | Integer | Words shorter than this value are ignored. Default is `0`. |
| `max_word_length` | Optional | Integer | Words longer than this value are ignored. Default is unlimited. |
| `stop_words` | Optional | Array of strings | Defines a list of words that are ignored completely when selecting terms. |
| `analyzer` | Optional | String | The custom analyzer to use for processing input text. Defaults to the analyzer of the first field listed in `fields`. |
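
The following request is a minimal sketch of how several term selection parameters might be combined. It again uses the `articles-basic` index from the first example. The input text and all parameter values are illustrative assumptions, not recommendations:

```json
GET /articles-basic/_search
{
  "query": {
    "more_like_this": {
      "fields": ["content"],
      "like": "Dense jungle and exotic wildlife in the rainforest",
      "min_term_freq": 1,
      "min_doc_freq": 1,
      "max_query_terms": 10,
      "min_word_length": 4,
      "stop_words": ["and", "in", "the"]
    }
  }
}
```
{% include copy-curl.html %}

Because the sample index contains only three short documents, `min_term_freq` and `min_doc_freq` are lowered to `1` here, as in the earlier examples. With the default values (`2` and `5`), few or no terms would likely be selected from such a small dataset.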

### Query formation parameters

The following table specifies query formation parameters.

| Parameter | Required/Optional | Data type | Description |
| :--- | :--- | :--- | :--- |
| `minimum_should_match` | Optional | String | Specifies the minimum number of terms that must match in the final query. The value can be a percentage or a fixed number. Helps fine-tune the balance between recall and precision. Default is `30%`. |
| `fail_on_unsupported_field` | Optional | Boolean | Determines whether to throw an error if one of the target fields is not of a compatible type (`text` or `keyword`). Set to `false` to silently skip unsupported fields. Default is `true`. |
| `boost_terms` | Optional | Float | Applies a boost to selected terms based on their term frequency–inverse document frequency (TF–IDF) weight. Any value greater than `0` activates term boosting using the specified factor. Default is `0`. |
| `include` | Optional | Boolean | If `true`, the source documents provided in `like` are included in the result hits. Default is `false`. |
| `boost` | Optional | Float | Multiplies the relevance score of the entire `more_like_this` query. Default is `1.0`. |
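
As a sketch of how the query formation parameters fit together, the following request uses document `a3` from the `articles-optimized` index in the earlier example as the `like` input. The specific values for `minimum_should_match`, `boost_terms`, and `boost` are arbitrary and chosen only to show the syntax:

```json
GET /articles-optimized/_search
{
  "query": {
    "more_like_this": {
      "fields": ["quote"],
      "like": [
        { "_index": "articles-optimized", "_id": "a3" }
      ],
      "min_term_freq": 1,
      "min_doc_freq": 1,
      "minimum_should_match": "50%",
      "boost_terms": 1.5,
      "include": true,
      "boost": 2.0
    }
  }
}
```
{% include copy-curl.html %}

Because `include` is set to `true`, the input document `a3` itself is eligible to appear in the hits. With the default value of `false`, it would be left out.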
