Commit add78ca

MBkkt, azevaykin, anton-bobkov, and fomichev3000 authored
Vector index docs (#16484)
Co-authored-by: azevaykin <145343289+azevaykin@users.noreply.github.com> Co-authored-by: anton-bobkov <anton-bobkov@ydb.tech> Co-authored-by: Andrey Fomichev <andrey.fomichev@gmail.com>
1 parent 4eb3ce8 commit add78ca

39 files changed: +700 -72 lines changed

Lines changed: 118 additions & 0 deletions
@@ -0,0 +1,118 @@
1+
# Vector indexes
2+
3+
{{ ydb-short-name }} supports [vector indexes](https://en.wikipedia.org/wiki/Vector_database) to efficiently find the top k rows with vector values closest to a query vector. Unlike secondary indexes that optimize equality or range queries, vector indexes enable similarity search based on distance or similarity functions.
4+
5+
Vector indexes are particularly useful for:
6+
7+
* recommendation systems (finding similar items/users)
8+
* semantic search (matching text embeddings)
9+
* image similarity search
10+
* anomaly detection (finding outliers)
11+
* classification systems (finding nearest labeled examples)
12+
13+
## Vector index characteristics {#characteristics}
14+
15+
Vector indexes in {{ ydb-short-name }}:
16+
17+
* Solve nearest neighbor search problems using similarity or distance functions
18+
* Support multiple similarity and distance functions: "inner_product" and "cosine" similarity; "cosine", "euclidean", and "manhattan" distance
19+
* Currently implement a single index type: `vector_kmeans_tree`
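
To make the listed functions concrete, here is a minimal Python sketch that uses the common textbook definitions. It is an illustration only: the function names are hypothetical, and the built-in {{ ydb-short-name }} functions may differ in details such as input encoding and numeric precision.

```python
import math

def inner_product(a, b):
    """Inner-product similarity: larger values mean more similar vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors, in [-1, 1]."""
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    return inner_product(a, b) / (norm_a * norm_b)

def cosine_distance(a, b):
    """Assumed here to be 1 - cosine similarity; smaller means closer."""
    return 1.0 - cosine_similarity(a, b)

def euclidean_distance(a, b):
    """Straight-line (L2) distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan_distance(a, b):
    """Sum of absolute coordinate differences (L1 distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))
```

Similarity functions rank the best match highest, so results are sorted in descending order; distance functions rank it lowest, so results are sorted in ascending order.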
20+
21+
### Vector index `vector_kmeans_tree` type {#vector-kmeans-tree-type}
22+
23+
The `vector_kmeans_tree` index implements a hierarchical clustering structure. Its organization includes:
24+
25+
1. Hierarchical clustering:
26+
27+
- The index builds multiple levels of k-means clusters
28+
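- At each level, vectors are partitioned into the specified number of clusters raised to the power of the level number (for example, with 128 clusters per level, up to 128 clusters on the first level and 128² = 16,384 on the second)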
29+
- First level clusters the entire dataset
30+
- Subsequent levels recursively cluster each parent cluster's contents
31+
32+
2. Search process:
33+
34+
- During queries, the index examines only the most promising clusters
35+
- This search space pruning avoids exhaustive search through all vectors
36+
37+
3. Parameters:
38+
39+
- `levels`: The number of tree levels (typically 1-3). Controls search depth
40+
- `clusters`: The number of clusters on each level (typically 64-512). Determines search breadth at each level
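
The structure and search process described above can be sketched in plain Python. This is a conceptual toy, not the {{ ydb-short-name }} implementation: the names `build_tree`, `search`, and `probes` are hypothetical, the k-means start is deliberately simplistic, and Euclidean distance is assumed throughout.

```python
import math

def dist(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(p[i] for p in points) / len(points) for i in range(len(points[0]))]

def kmeans(points, k, iterations=10):
    """Plain Lloyd's k-means with a deterministic start (the first k points)."""
    centroids = [list(p) for p in points[:k]]
    groups = []
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            groups[nearest].append(p)
        # Recompute each centroid; keep the old one if its group is empty.
        centroids = [mean(g) if g else c for g, c in zip(groups, centroids)]
    return centroids, groups

def build_tree(points, levels, clusters):
    """Cluster the data, then recursively cluster each cluster's contents."""
    if levels == 0:
        return {"vectors": points}  # leaf: the raw vectors of one cluster
    centroids, groups = kmeans(points, clusters)
    children = [(c, build_tree(g, levels - 1, clusters))
                for c, g in zip(centroids, groups) if g]
    return {"children": children}

def search(tree, query, top_k, probes=1):
    """Visit only the `probes` closest clusters per level, then rank leaf vectors."""
    if "vectors" in tree:
        return sorted(tree["vectors"], key=lambda v: dist(v, query))[:top_k]
    closest = sorted(tree["children"], key=lambda child: dist(child[0], query))[:probes]
    candidates = []
    for _, subtree in closest:
        candidates.extend(search(subtree, query, top_k, probes))
    return sorted(candidates, key=lambda v: dist(v, query))[:top_k]

# Toy data: four well-separated pairs of 2D points.
points = [[0, 0], [0, 1], [10, 10], [10, 11], [0, 10], [1, 10], [10, 0], [11, 0]]
tree = build_tree(points, levels=2, clusters=2)
nearest = search(tree, [9.5, 10.2], top_k=1)
```

With `levels=2` and `clusters=2`, the toy tree has at most 2² = 4 leaf clusters, and a query with `probes=1` compares against two centroids per level plus the vectors of a single leaf instead of all eight points. The trade-off is that a true nearest neighbor sitting in an unvisited cluster can be missed, which is why such a search is approximate.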
41+
42+
## Vector index types {#types}
43+
44+
### Basic vector index {#basic}
45+
46+
The simplest form that indexes vectors without additional filtering capabilities. For example:
47+
48+
```yql
ALTER TABLE my_table
ADD INDEX my_index
GLOBAL USING vector_kmeans_tree
ON (embedding)
WITH (distance=cosine, type="uint8", dimension=512, levels=2, clusters=128);
```
55+
56+
### Vector index with covered columns {#covering}
57+
58+
Includes additional columns to avoid reading from the main table during queries:
59+
60+
```yql
ALTER TABLE my_table
ADD INDEX my_index
GLOBAL USING vector_kmeans_tree
ON (embedding) COVER (data)
WITH (distance=cosine, type="uint8", dimension=512, levels=2, clusters=128);
```
67+
68+
### Prefixed vector index {#prefixed}
69+
70+
Allows filtering by prefix columns before performing vector search:
71+
72+
```yql
ALTER TABLE my_table
ADD INDEX my_index
GLOBAL USING vector_kmeans_tree
ON (user, embedding)
WITH (distance=cosine, type="uint8", dimension=512, levels=2, clusters=128);
```
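
As a mental model for prefix filtering, the following Python sketch groups rows by the prefix column and ranks only the rows in the matching group. The actual index layout is not described here, so treat this purely as an illustration; `build_prefixed` and `search_prefixed` are hypothetical names.

```python
import math
from collections import defaultdict

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def build_prefixed(rows):
    """Group (user, embedding, data) rows by the prefix column `user`."""
    by_user = defaultdict(list)
    for user, embedding, data in rows:
        by_user[user].append((embedding, data))
    return by_user

def search_prefixed(index, user, query, top_k):
    """Rank only the rows whose prefix matches; other users are never scanned."""
    candidates = index.get(user, [])
    ranked = sorted(candidates,
                    key=lambda row: cosine_similarity(row[0], query),
                    reverse=True)
    return [data for _, data in ranked[:top_k]]

rows = [
    ("alice", [1.0, 0.0], "a1"),
    ("alice", [0.0, 1.0], "a2"),
    ("bob", [1.0, 0.0], "b1"),
]
index = build_prefixed(rows)
```

Here a search for `alice` never touches `bob`'s vectors, which is the effect a prefix condition is meant to have on the search space.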
79+
80+
### Prefixed vector index with covered columns {#prefixed-covering}
81+
82+
Combines prefix filtering with covered columns for optimal performance:
83+
84+
```yql
ALTER TABLE my_table
ADD INDEX my_index
GLOBAL USING vector_kmeans_tree
ON (user, embedding) COVER (data)
WITH (distance=cosine, type="uint8", dimension=512, levels=2, clusters=128);
```
91+
92+
## Creating vector indexes {#creation}
93+
94+
Vector indexes can be created:
95+
96+
* When creating a table with the YQL [`CREATE TABLE` statement](../../yql/reference/syntax/create_table/vector_index.md)
97+
* When adding an index to an existing table with the YQL [`ALTER TABLE` statement](../../yql/reference/syntax/alter_table/indexes.md)
98+
99+
For more information about vector index parameters, see [`CREATE TABLE` statement](../../yql/reference/syntax/create_table/vector_index.md).
100+
101+
## Using vector indexes {#usage}
102+
103+
Query vector indexes using the VIEW syntax in YQL. For prefixed indexes, include the prefix columns in the WHERE clause:
104+
105+
```yql
SELECT user, data
FROM my_table VIEW my_index
WHERE user = "..."
ORDER BY Knn::CosineSimilarity(embedding, ...) DESC
LIMIT 10;
```
112+
113+
114+
## Limitations {#limitations}
115+
116+
Currently not supported:
117+
* modifying rows in indexed tables
118+
* bit vector type

ydb/docs/en/core/concepts/column-table.md

Lines changed: 1 addition & 0 deletions
@@ -22,6 +22,7 @@ What's currently not supported:

 * Reading data from replicas
 * Secondary indexes
+* Vector indexes
 * Bloom filters
 * Change Data Capture
 * Renaming tables

ydb/docs/en/core/concepts/datamodel/_includes/table.md

Lines changed: 1 addition & 0 deletions
@@ -197,6 +197,7 @@ At the moment, not all functionality of column-oriented tables is implemented. T

 * Reading from replicas.
 * Secondary indexes.
+* Vector indexes.
 * Bloom filters.
 * Change Data Capture.
 * Table renaming.

ydb/docs/en/core/concepts/glossary.md

Lines changed: 4 additions & 0 deletions
@@ -157,6 +157,10 @@ A **primary index** or **primary key index** is the main data structure used to

 A **secondary index** is an additional data structure used to locate rows in a table, typically when it can't be done efficiently using the [primary index](#primary-index). Unlike the primary index, secondary indexes are managed independently from the main table data. Thus, a table might have multiple secondary indexes for different use cases. {{ ydb-short-name }}'s capabilities in terms of secondary indexes are covered in a separate article [{#T}](secondary_indexes.md). Secondary indexes can be either unique or non-unique.

+#### Vector Index {#vector-index}
+
+A **vector index** is an additional data structure used to speed up the [nearest neighbor search](https://en.wikipedia.org/wiki/Nearest_neighbor_search), typically when the data is too large for the [index-less approach](../yql/reference/udf/list/knn.md) to handle the load. Unlike the primary index, vector indexes are managed independently of the underlying table data. Thus, a table can have multiple vector indexes for different scenarios. For more information about using vector indexes in {{ ydb-short-name }}, see [{#T}](vector_indexes.md).
+
 #### Column family {#column-family}

 A **column family** or **column group** is a feature that allows storing a subset of [row-oriented table](#row-oriented-table) columns separately in a distinct family or group. The primary use case is to store some columns on different kinds of disk drives (offload less important columns to HDD) or with various compression settings. If the workload requires many column families, consider using [column-oriented tables](#column-oriented-table) instead.

ydb/docs/en/core/concepts/toc_i.yaml

Lines changed: 2 additions & 0 deletions
@@ -14,6 +14,8 @@ items:
   href: transactions.md
 - name: Secondary indexes
   href: secondary_indexes.md
+- name: Vector indexes
+  href: vector_indexes.md
 - name: Change Data Capture (CDC)
   href: cdc.md
   when: feature_changefeed
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+{% include [vector_indexes.md](_includes/vector_indexes.md) %}

ydb/docs/en/core/dev/toc_p.yaml

Lines changed: 2 additions & 0 deletions
@@ -18,6 +18,8 @@ items:
   path: primary-key/toc_p.yaml
 - name: Secondary indexes
   href: secondary-indexes.md
+- name: Vector indexes
+  href: vector-indexes.md
 - name: Query plans optimization
   href: query-plans-optimization.md
 - name: Batch upload
Lines changed: 62 additions & 0 deletions
@@ -0,0 +1,62 @@
1+
# Vector indexes
2+
3+
[Vector indexes](https://en.wikipedia.org/wiki/Vector_database) are specialized data structures that enable efficient similarity search in high-dimensional spaces. Unlike traditional indexes that optimize exact lookups, vector indexes allow finding the most similar items to a query vector based on mathematical distance or similarity measures.
4+
5+
Data in a {{ ydb-short-name }} table is stored and sorted by a primary key, enabling efficient point lookups and range scans. Vector indexes provide similar efficiency for nearest neighbor searches in vector spaces, which is particularly valuable for working with embeddings and other high-dimensional data representations.
6+
7+
This article describes practical operations with vector indexes. For conceptual information about vector index types and their characteristics, see [Vector indexes](../concepts/vector_indexes.md) in the Concepts section.
8+
9+
## Creating vector indexes {#create}
10+
11+
A vector index can be created with the following YQL commands:
12+
* [`CREATE TABLE`](../yql/reference/syntax/create_table/index.md)
13+
* [`ALTER TABLE`](../yql/reference/syntax/alter_table/index.md)
14+
15+
Example of creating a prefixed vector index with covered columns:
16+
17+
```yql
ALTER TABLE my_table
ADD INDEX my_index
GLOBAL USING vector_kmeans_tree
ON (user, embedding) COVER (data)
WITH (distance=cosine, type="uint8", dimension=512, levels=2, clusters=128);
```
24+
25+
Key parameters for `vector_kmeans_tree`:
26+
* `distance`/`similarity`: Metric function ("cosine", "euclidean", etc.)
27+
* `type`: Data type ("float", "int8", "uint8")
28+
* `dimension`: Number of dimensions (<= 16384)
29+
* `levels`: Tree depth
30+
* `clusters`: Number of clusters per level (values > 1000 may impact performance)
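
Before issuing the `ALTER TABLE` statement, an application might sanity-check its parameters against the limits listed above. The Python sketch below is a hypothetical client-side helper, not part of any {{ ydb-short-name }} SDK; the lower bounds (at least one dimension, level, and cluster) are assumptions added for illustration.

```python
SUPPORTED_TYPES = {"float", "int8", "uint8"}   # from the parameter list above
MAX_DIMENSION = 16384                          # from the parameter list above
CLUSTERS_SOFT_LIMIT = 1000                     # larger values may impact performance

def check_index_params(vector_type, dimension, levels, clusters):
    """Raise ValueError on invalid parameters; return a list of warnings otherwise."""
    if vector_type not in SUPPORTED_TYPES:
        raise ValueError(f"unsupported vector type: {vector_type!r}")
    # Assumed lower bound: a vector needs at least one dimension.
    if not 1 <= dimension <= MAX_DIMENSION:
        raise ValueError(f"dimension must be between 1 and {MAX_DIMENSION}")
    # Assumed lower bounds: at least one level and one cluster.
    if levels < 1 or clusters < 1:
        raise ValueError("levels and clusters must be positive")
    warnings = []
    if clusters > CLUSTERS_SOFT_LIMIT:
        warnings.append("clusters > 1000 may impact performance")
    return warnings
```

For the example index above, `check_index_params("uint8", 512, 2, 128)` returns an empty list of warnings.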
31+
32+
Since building a vector index requires processing existing data, index creation on populated tables may take significant time. This operation runs in the background, allowing continued table access during construction. The index becomes available automatically when ready.
33+
34+
## Using vector indexes for similarity search {#use}
35+
36+
To perform similarity searches, explicitly specify the index name in the VIEW clause. For prefixed indexes, include prefix column conditions in the WHERE clause:
37+
38+
```yql
DECLARE $query_vector AS List<Uint8>;

SELECT user, data
FROM my_table VIEW my_index
WHERE user = "john_doe"
ORDER BY Knn::CosineSimilarity(embedding, $query_vector) DESC
LIMIT 10;
```
47+
48+
Without the VIEW clause, the query would perform a full table scan with brute-force vector comparison.
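
For intuition about what the brute-force alternative costs, the Python sketch below scores every row against the query vector and keeps the `k` best. It only illustrates the idea of an exhaustive scan; the row layout and the name `brute_force_top_k` are hypothetical.

```python
import heapq
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def brute_force_top_k(rows, query, k):
    """Compare the query with every row: work grows linearly with table size."""
    return heapq.nlargest(
        k, rows, key=lambda row: cosine_similarity(row["embedding"], query)
    )

rows = [
    {"data": "x", "embedding": [1.0, 0.0]},
    {"data": "y", "embedding": [0.0, 1.0]},
    {"data": "z", "embedding": [1.0, 1.0]},
]
top = brute_force_top_k(rows, [1.0, 0.1], k=2)
```

Every query touches every row, so the cost scales with the number of rows times the vector dimension. An index avoids this by examining only a subset of clusters.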
49+
50+
## Checking the cost of queries {#cost}
51+
52+
Check every query issued by a transactional application for the number of I/O operations it performs in the database and the CPU time it consumes. Also make sure these indicators don't grow continuously as the database grows. {{ ydb-short-name }} returns the statistics required for this analysis after running each query.
53+
54+
If you use the {{ ydb-short-name }} CLI, pass the `--stats` option to print statistics after executing the `yql` command. All {{ ydb-short-name }} SDKs also contain structures with statistics returned after running a query. If you run a query in the UI, you'll see a tab with statistics next to the results tab.
55+
56+
{% note warning %}
57+
58+
Vector indexes currently don't support data modification operations.
59+
Any attempt to modify rows in indexed tables will fail.
60+
This limitation will be removed in future releases.
61+
62+
{% endnote %}

ydb/docs/en/core/reference/observability/metrics/index.md

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@
 | Metric name<br/>Type, units of measurement | Description<br/>Labels |
 | ----- | ----- |
 | `resources.storage.used_bytes`<br/>`IGAUGE`, bytes | The size of user and service data stored in distributed network storage. `resources.storage.used_bytes` = `resources.storage.table.used_bytes` + `resources.storage.topic.used_bytes`. |
-| `resources.storage.table.used_bytes`<br/>`IGAUGE`, bytes | The size of user and service data stored by tables in distributed network storage. Service data includes the data of the primary and [secondary indexes](../../../concepts/secondary_indexes.md). |
+| `resources.storage.table.used_bytes`<br/>`IGAUGE`, bytes | The size of user and service data stored by tables in distributed network storage. Service data includes the data of the primary index, [secondary indexes](../../../concepts/secondary_indexes.md), and [vector indexes](../../../concepts/vector_indexes.md). |
 | `resources.storage.topic.used_bytes`<br/>`IGAUGE`, bytes | The size of storage used by topics. This metric sums the `topic.storage_bytes` values of all topics. |
 | `resources.storage.limit_bytes`<br/>`IGAUGE`, bytes | A limit on the size of user and service data that a database can store in distributed network storage. |


ydb/docs/en/core/reference/ydb-cli/commands/_includes/secondary_index.md

Lines changed: 2 additions & 2 deletions
@@ -89,9 +89,9 @@ Deleting the index-building details (use the actual operation id):
 {{ ydb-cli }} -p quickstart operation forget ydb://buildindex/7?id=2814749869
 ```

-## Deleting a secondary index {#drop}
+## Deleting an index {#drop}

-Secondary indexes are deleted by the `table index drop` command:
+Indexes are deleted by the `table index drop` command:

 ```bash
 {{ ydb-cli }} [connection options] table index drop <table> --index-name STR
