Commit 5d8240a

AntonEliatra, kolchfa-aws, and natebower authored
adding termvector and mtermvector api docs (#10004)
* adding termvecto and mtermvector api docs
  Signed-off-by: Anton Rubin <anton.rubin@eliatra.com>
* adding termvecto and mtermvector api docs
  Signed-off-by: Anton Rubin <anton.rubin@eliatra.com>
* Apply suggestions from code review
  Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
  Signed-off-by: AntonEliatra <anton.rubin@eliatra.com>
* updating as per the PR review
  Signed-off-by: Anton Rubin <anton.rubin@eliatra.com>
* addressing the PR comments
  Signed-off-by: Anton Rubin <anton.rubin@eliatra.com>
* fixing vale errors
  Signed-off-by: Anton Rubin <anton.rubin@eliatra.com>
* updating the PR with the latest changes from the API spec
  Signed-off-by: Anton Rubin <anton.rubin@eliatra.com>
* updating the table of response fields
  Signed-off-by: Anton Rubin <anton.rubin@eliatra.com>
* fixing vale errors
  Signed-off-by: Anton Rubin <anton.rubin@eliatra.com>
* fixing vale
  Signed-off-by: Anton Rubin <anton.rubin@eliatra.com>
* Apply suggestions from code review
  Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
  Signed-off-by: AntonEliatra <anton.rubin@eliatra.com>
* Apply suggestions from code review
  Co-authored-by: Nathan Bower <nbower@amazon.com>
  Signed-off-by: AntonEliatra <anton.rubin@eliatra.com>
* removing the api spec automation to accomodate the changes requested
  Signed-off-by: Anton Rubin <anton.rubin@eliatra.com>

---------

Signed-off-by: Anton Rubin <anton.rubin@eliatra.com>
Signed-off-by: AntonEliatra <anton.rubin@eliatra.com>
Co-authored-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Co-authored-by: Nathan Bower <nbower@amazon.com>
1 parent 5f5d50d commit 5d8240a

2 files changed: +576 -0 lines changed

Lines changed: 333 additions & 0 deletions
@@ -0,0 +1,333 @@
---
layout: default
title: Multi term vectors
parent: Document APIs
nav_order: 33
---

# Multi term vectors

The `_mtermvectors` API retrieves term vector information for multiple documents in one request. Term vectors provide detailed information about the terms (words) in a document, including term frequency, positions, offsets, and payloads. This can be useful for applications such as relevance scoring, highlighting, or similarity calculations. For more information, see [Term vector parameter]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/text/#term-vector-parameter).

<!-- spec_insert_start
api: mtermvectors
component: endpoints
-->
## Endpoints
```json
GET /_mtermvectors
POST /_mtermvectors
GET /{index}/_mtermvectors
POST /{index}/_mtermvectors
```
<!-- spec_insert_end -->

<!-- spec_insert_start
api: mtermvectors
component: path_parameters
-->
## Path parameters

The following table lists the available path parameters. All path parameters are optional.

| Parameter | Data type | Description |
| :--- | :--- | :--- |
| `index` | String | The name of the index containing the document. |

<!-- spec_insert_end -->

<!-- spec_insert_start
api: mtermvectors
component: query_parameters
columns: Parameter, Data type, Description
-->
## Query parameters

The following table lists the available query parameters. All query parameters are optional.

| Parameter | Data type | Description |
| :--- | :--- | :--- |
| `field_statistics` | Boolean | If `true`, the response includes the document count, sum of document frequencies, and sum of total term frequencies. _(Default: `true`)_ |
| `fields` | List or String | A comma-separated list or a wildcard expression specifying the fields for which to return term vectors. |
| `ids` | List | A comma-separated list of document IDs. You must provide either the `docs` field in the request body or specify `ids` as a query parameter or in the request body. |
| `offsets` | Boolean | If `true`, the response includes term offsets. _(Default: `true`)_ |
| `payloads` | Boolean | If `true`, the response includes term payloads. _(Default: `true`)_ |
| `positions` | Boolean | If `true`, the response includes term positions. _(Default: `true`)_ |
| `preference` | String | Specifies the node or shard on which the operation should be performed. See [preference query parameter]({{site.url}}{{site.baseurl}}/api-reference/search-apis/search/#the-preference-query-parameter) for a list of available options. By default, requests are routed randomly to available shard copies (primary or replica), with no guarantee of consistency across repeated queries. |
| `realtime` | Boolean | If `true`, the request is real time as opposed to near real time. _(Default: `true`)_ |
| `routing` | List or String | A custom value used to route operations to a specific shard. |
| `term_statistics` | Boolean | If `true`, the response includes term frequency and document frequency. _(Default: `false`)_ |
| `version` | Integer | If `true`, returns the document version as part of a hit. |
| `version_type` | String | The specific version type. <br> Valid values are: <br> - `external`: The version number must be greater than the current version. <br> - `external_gte`: The version number must be greater than or equal to the current version. <br> - `force`: The version number is forced to be the given value. <br> - `internal`: The version number is managed internally by OpenSearch. |

<!-- spec_insert_end -->

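As a rough illustration of how the query parameters above combine into a request line, the following Python sketch assembles the path and query string using only the standard library. The helper name `mtermvectors_path` is hypothetical and not part of any OpenSearch client. The sketch assumes that Boolean parameters are sent as lowercase `true`/`false` and that list parameters are sent as comma-separated strings.

```python
from urllib.parse import urlencode


def mtermvectors_path(index=None, **params):
    """Build the request path and query string for an _mtermvectors call.

    Hypothetical helper for illustration only.
    """
    path = f"/{index}/_mtermvectors" if index else "/_mtermvectors"
    query = {}
    for name, value in params.items():
        if value is None:
            continue  # omitted parameters fall back to the server defaults
        if isinstance(value, bool):
            value = "true" if value else "false"  # lowercase Boolean literals
        elif isinstance(value, (list, tuple)):
            value = ",".join(str(item) for item in value)  # comma-separated list
        query[name] = value
    if not query:
        return path
    # safe="," keeps the list separators readable instead of encoding them as %2C
    return f"{path}?{urlencode(query, safe=',')}"


print(mtermvectors_path("my-index", ids=["1", "2"], fields="text"))
print(mtermvectors_path("my-index", ids=["1"], term_statistics=True, offsets=False))
```

Parameters left out of the call are simply not sent, so the defaults listed in the table apply.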
## Request body fields

The following table lists the fields that can be specified in the request body.

| Field | Data type | Description |
| :--- | :--- | :--- |
| `docs` | Array | An array of document specifications. |
| `ids` | Array of strings | A list of document IDs to retrieve. Use only when all documents share the same index specified in the request path or query. |
| `fields` | Array of strings | A list of field names for which to return term vectors. |
| `offsets` | Boolean | If `true`, the response includes character offsets for each term. *(Default: `true`)* |
| `payloads` | Boolean | If `true`, the response includes payloads for each term. *(Default: `true`)* |
| `positions` | Boolean | If `true`, the response includes token positions. *(Default: `true`)* |
| `field_statistics` | Boolean | If `true`, the response includes statistics such as document count, sum of document frequencies, and sum of total term frequencies. *(Default: `true`)* |
| `term_statistics` | Boolean | If `true`, the response includes term frequency and document frequency. *(Default: `false`)* |
| `routing` | String | A custom routing value used to identify the shard. Required if custom routing was used during indexing. |
| `version` | Integer | The specific version of the document to retrieve. |
| `version_type` | String | The type of versioning to use. Valid values: `internal`, `external`, `external_gte`. |
| `filter` | Object | Filters tokens returned in the response (for example, by frequency or position). For supported fields, see [Filtering terms]({{site.url}}{{site.baseurl}}/api-reference/document-apis/mtermvectors/#filtering-terms). |
| `per_field_analyzer` | Object | Specifies a custom analyzer to use per field. Format: `{ "field_name": "analyzer_name" }`. |

## Filtering terms

The `filter` object in the request body allows you to filter the tokens to include in the term vector response. The `filter` object supports the following fields.

| Field | Data type | Description |
| :--- | :--- | :--- |
| `max_num_terms` | Integer | The maximum number of terms to return. |
| `min_term_freq` | Integer | The minimum term frequency in the document required for a term to be included. |
| `max_term_freq` | Integer | The maximum term frequency in the document allowed for a term to be included. |
| `min_doc_freq` | Integer | The minimum document frequency across the index required for a term to be included. |
| `max_doc_freq` | Integer | The maximum document frequency across the index allowed for a term to be included. |
| `min_word_length` | Integer | The minimum length of the term to be included. |
| `max_word_length` | Integer | The maximum length of the term to be included. |

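The filter fields above can be combined in a single `filter` object. The following Python sketch builds a request body that applies a filter to one document. The helper `build_filtered_doc` is hypothetical, and placing `filter` inside each `docs` entry is an assumption made for this illustration; the field names themselves are taken from the table above.

```python
import json

# Filter keys taken from the table above; anything else is rejected locally.
FILTER_KEYS = {
    "max_num_terms", "min_term_freq", "max_term_freq",
    "min_doc_freq", "max_doc_freq", "min_word_length", "max_word_length",
}


def build_filtered_doc(index, doc_id, fields, **filter_options):
    """Return one `docs` entry with an optional `filter` object (hypothetical helper)."""
    unknown = set(filter_options) - FILTER_KEYS
    if unknown:
        raise ValueError(f"unsupported filter fields: {sorted(unknown)}")
    doc = {"_index": index, "_id": doc_id, "fields": list(fields)}
    if filter_options:
        # Assumption: the filter is specified per document inside `docs`
        doc["filter"] = dict(filter_options)
    return doc


body = {
    "docs": [
        build_filtered_doc("my-index", "1", ["text"], max_num_terms=3, min_word_length=3)
    ]
}
print(json.dumps(body, indent=2))
```

Validating the keys before sending is a convenience of this sketch only; it catches misspelled filter names on the client side.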
## Example

Create an index with term vectors enabled:

```json
PUT /my-index
{
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "term_vector": "with_positions_offsets_payloads"
      }
    }
  }
}
```
{% include copy-curl.html %}

Index the first document:

```json
POST /my-index/_doc/1
{
  "text": "OpenSearch is a search engine."
}
```
{% include copy-curl.html %}

Index the second document:

```json
POST /my-index/_doc/2
{
  "text": "OpenSearch provides powerful features."
}
```
{% include copy-curl.html %}

### Example request

Get term vectors for multiple documents:

```json
POST /_mtermvectors
{
  "docs": [
    {
      "_index": "my-index",
      "_id": "1",
      "fields": ["text"]
    },
    {
      "_index": "my-index",
      "_id": "2",
      "fields": ["text"]
    }
  ]
}
```
{% include copy-curl.html %}

Alternatively, you can specify both `ids` and `fields` as query parameters:

```json
GET /my-index/_mtermvectors?ids=1,2&fields=text
```
{% include copy-curl.html %}

You can also provide document IDs in the `ids` array instead of specifying `docs`:

```json
GET /my-index/_mtermvectors?fields=text
{
  "ids": [
    "1", "2"
  ]
}
```
{% include copy-curl.html %}

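The same requests can be prepared from a script. The following Python sketch builds (but does not send) a `POST` request using only the standard library. The helper name `build_mtermvectors_request` is hypothetical, and the base URL `http://localhost:9200` assumes a local cluster without authentication or TLS; adjust both for a real deployment.

```python
import json
import urllib.request


def build_mtermvectors_request(base_url, body, index=None):
    """Prepare a POST request for the _mtermvectors endpoint (hypothetical helper)."""
    path = f"/{index}/_mtermvectors" if index else "/_mtermvectors"
    return urllib.request.Request(
        url=base_url.rstrip("/") + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumption: an unsecured local cluster; the index and IDs match the example above
request = build_mtermvectors_request(
    "http://localhost:9200", {"ids": ["1", "2"]}, index="my-index"
)
print(request.get_method(), request.full_url)

# To actually send it (requires a reachable cluster):
# with urllib.request.urlopen(request) as response:
#     result = json.load(response)
```

Separating request construction from sending keeps the sketch runnable without a cluster and makes the generated URL and body easy to inspect.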
### Example response

The response contains term vector information for the two documents:

```json
{
  "docs": [
    {
      "_index": "my-index",
      "_id": "1",
      "_version": 1,
      "found": true,
      "took": 10,
      "term_vectors": {
        "text": {
          "field_statistics": {
            "sum_doc_freq": 9,
            "doc_count": 2,
            "sum_ttf": 9
          },
          "terms": {
            "a": {
              "term_freq": 1,
              "tokens": [
                {
                  "position": 2,
                  "start_offset": 14,
                  "end_offset": 15
                }
              ]
            },
            "engine": {
              "term_freq": 1,
              "tokens": [
                {
                  "position": 4,
                  "start_offset": 23,
                  "end_offset": 29
                }
              ]
            },
            "is": {
              "term_freq": 1,
              "tokens": [
                {
                  "position": 1,
                  "start_offset": 11,
                  "end_offset": 13
                }
              ]
            },
            "opensearch": {
              "term_freq": 1,
              "tokens": [
                {
                  "position": 0,
                  "start_offset": 0,
                  "end_offset": 10
                }
              ]
            },
            "search": {
              "term_freq": 1,
              "tokens": [
                {
                  "position": 3,
                  "start_offset": 16,
                  "end_offset": 22
                }
              ]
            }
          }
        }
      }
    },
    {
      "_index": "my-index",
      "_id": "2",
      "_version": 1,
      "found": true,
      "took": 0,
      "term_vectors": {
        "text": {
          "field_statistics": {
            "sum_doc_freq": 9,
            "doc_count": 2,
            "sum_ttf": 9
          },
          "terms": {
            "features": {
              "term_freq": 1,
              "tokens": [
                {
                  "position": 3,
                  "start_offset": 29,
                  "end_offset": 37
                }
              ]
            },
            "opensearch": {
              "term_freq": 1,
              "tokens": [
                {
                  "position": 0,
                  "start_offset": 0,
                  "end_offset": 10
                }
              ]
            },
            "powerful": {
              "term_freq": 1,
              "tokens": [
                {
                  "position": 2,
                  "start_offset": 20,
                  "end_offset": 28
                }
              ]
            },
            "provides": {
              "term_freq": 1,
              "tokens": [
                {
                  "position": 1,
                  "start_offset": 11,
                  "end_offset": 19
                }
              ]
            }
          }
        }
      }
    }
  ]
}
```

## Response body fields

The following table lists all response body fields.

| Field | Data type | Description |
| -------- | --------- | ----------- |
| `docs` | Array | A list of requested documents containing term vectors. |

Each element of the `docs` array contains the following fields.

| Field | Data type | Description |
| -------- | --------- | ----------- |
| `term_vectors` | Object | Contains term vector data for each field. |
| `term_vectors.<field>.field_statistics` | Object | Contains statistics about the field. |
| `term_vectors.<field>.field_statistics.doc_count` | Integer | The number of documents that contain at least one term in the specified field. |
| `term_vectors.<field>.field_statistics.sum_doc_freq` | Integer | The sum of document frequencies for all terms in the field. |
| `term_vectors.<field>.field_statistics.sum_ttf` | Integer | The sum of total term frequencies for all terms in the field. |
| `term_vectors.<field>.terms` | Object | A map of terms in the field, in which each term includes its frequency (`term_freq`) and associated token information. |
| `term_vectors.<field>.terms.<term>.tokens` | Array | An array of token objects for each term, including the token's `position` in the text and its character offsets (`start_offset` and `end_offset`). |
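To show how these fields fit together, the following Python sketch walks a response and lists the terms of one field in token order. The function name `tokens_in_order` is hypothetical, and the sample is a trimmed, hand-written structure shaped like the fields described above, not output captured from a cluster. The sketch assumes that positions were returned (the `positions` parameter was not disabled).

```python
def tokens_in_order(doc, field):
    """Return the terms of one field sorted by token position (hypothetical helper)."""
    terms = doc.get("term_vectors", {}).get(field, {}).get("terms", {})
    placed = []
    for term, info in terms.items():
        for token in info.get("tokens", []):
            # Each token object carries its position and character offsets
            placed.append((token["position"], term))
    return [term for _, term in sorted(placed)]


# Trimmed sample shaped like one element of the `docs` array
sample = {
    "docs": [
        {
            "_id": "1",
            "found": True,
            "term_vectors": {
                "text": {
                    "terms": {
                        "a": {"term_freq": 1, "tokens": [{"position": 2, "start_offset": 14, "end_offset": 15}]},
                        "is": {"term_freq": 1, "tokens": [{"position": 1, "start_offset": 11, "end_offset": 13}]},
                        "opensearch": {"term_freq": 1, "tokens": [{"position": 0, "start_offset": 0, "end_offset": 10}]},
                    }
                }
            },
        }
    ]
}

for doc in sample["docs"]:
    print(doc["_id"], tokens_in_order(doc, "text"))
```

Because the `terms` map is keyed by term rather than by position, sorting on `position` is what recovers the order in which the analyzer emitted the tokens.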
