Commit 16a986d

Fix YQL knn udf docs (#8425)
1 parent de9b991 commit 16a986d

4 files changed: +167 -17 lines changed
Lines changed: 7 additions & 7 deletions
@@ -1,14 +1,14 @@
+* [DateTime](../../datetime.md)
+* [Digest](../../digest.md)
+* [Histogram](../../histogram.md)
 * [Hyperscan](../../hyperscan.md)
+* [Ip](../../ip.md)
+* [Knn](../../knn.md)
+* [Math](../../math.md)
 * [Pcre](../../pcre.md)
 * [Pire](../../pire.md)
 * [Re2](../../re2.md)
 * [String](../../string.md)
 * [Unicode](../../unicode.md)
-* [DateTime](../../datetime.md)
 * [Url](../../url.md)
-* [Ip](../../ip.md)
-* [Knn](../../knn.md)
-* [Yson](../../yson.md)
-* [Digest](../../digest.md)
-* [Math](../../math.md)
-* [Histogram](../../histogram.md)
+* [Yson](../../yson.md)

ydb/docs/en/core/yql/reference/yql-core/udf/list/knn.md

Lines changed: 77 additions & 2 deletions
@@ -31,7 +31,7 @@ Approximate methods do not perform a complete search of the source data. Due to
3131
This document provides an [example of approximate search](#approximate-search-examples) using scalar quantization. This example does not require the creation of a secondary vector index.
3232

3333
**Scalar quantization** is a method to compress vectors by mapping coordinates to a smaller space.
34-
{{ ydb-short-name }} support exact search for `Float`, `Int8`, `Uint8`, `Bit` vectors.
34+
This module supports exact search for `Float`, `Int8`, `Uint8`, `Bit` vectors.
3535
So, it's possible to apply scalar quantization from `Float` to one of these other types.
3636

3737
Scalar quantization decreases read/write times by reducing vector size in bytes. For example, after quantization from `Float` to `Bit`, each vector becomes 32 times smaller.
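The 32x figure follows from a 32-bit `Float` coordinate being replaced by a single bit. The following Python sketch only illustrates that size arithmetic. The helper names and the sign-based quantization rule are assumptions made for illustration, and the byte layout is not the format produced by `Knn::ToBinaryStringFloat` or `Knn::ToBinaryStringBit`, which may include additional metadata.

```python
import struct


def float_vector_bytes(vec):
    """Pack each coordinate as a 32-bit little-endian float (4 bytes each)."""
    return struct.pack(f"<{len(vec)}f", *vec)


def bit_vector_bytes(vec):
    """Pack one bit per coordinate.

    Assumed rule for this sketch: the bit is 1 when the coordinate is positive.
    """
    out = bytearray((len(vec) + 7) // 8)
    for i, x in enumerate(vec):
        if x > 0:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


vec = [0.5, -1.25, 3.0, -0.1] * 64  # 256 coordinates
float_size = len(float_vector_bytes(vec))
bit_size = len(bit_vector_bytes(vec))
ratio = float_size // bit_size
print(float_size, bit_size, ratio)
```

With 256 coordinates the float form takes 1024 bytes and the bit form takes 32 bytes, so the ratio is 32.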
@@ -45,7 +45,7 @@ It is recommended to measure if such quantization provides sufficient accuracy/r
4545
## Data types
4646

4747
In mathematics, a vector of real or integer numbers is used to store points.
48-
In {{ ydb-short-name }}, vectors are stored in the `String` data type, which is a binary serialized representation of a vector.
48+
In this module, vectors are stored in the `String` data type, which is a binary serialized representation of a vector.
4949

5050
## Functions
5151

@@ -57,7 +57,9 @@ Conversion functions are needed to serialize vectors into an internal binary rep
5757

5858
All serialization functions wrap returned `String` data into [Tagged](../../types/special.md) types.
5959

60+
{% if backend_name == "YDB" %}
6061
The binary representation of a vector can be stored in a {{ ydb-short-name }} table column. Currently, {{ ydb-short-name }} does not support storing `Tagged` types, so you must call [Untag](../../builtins/basic#as-tagged) before storing the binary representation of a vector.
62+
{% endif %}
6163

6264
#### Function signatures
6365

@@ -123,6 +125,7 @@ Error: Failed to find UDF function: Knn.CosineDistance, reason: Error: Module: K
123125

124126
## Еxact search examples
125127

128+
{% if backend_name == "YDB" %}
126129
### Creating a table
127130

128131
```sql
@@ -142,9 +145,25 @@ $vector = [1.f, 2.f, 3.f, 4.f];
142145
UPSERT INTO Facts (id, user, fact, embedding)
143146
VALUES (123, "Williams", "Full name is John Williams", Untag(Knn::ToBinaryStringFloat($vector), "FloatVector"));
144147
```
148+
{% else %}
149+
### Data declaration
150+
151+
```sql
152+
$vector = [1.f, 2.f, 3.f, 4.f];
153+
$facts = AsList(
154+
AsStruct(
155+
123 AS id, -- Id of fact
156+
"Williams" AS user, -- User name
157+
"Full name is John Williams" AS fact, -- Human-readable description of a user fact
158+
Knn::ToBinaryStringFloat($vector) AS embedding, -- Binary representation of embedding vector
159+
),
160+
);
161+
```
162+
{% endif %}
145163

146164
### Exact search of K nearest vectors
147165

166+
{% if backend_name == "YDB" %}
148167
```sql
149168
$K = 10;
150169
$TargetEmbedding = Knn::ToBinaryStringFloat([1.2f, 2.3f, 3.4f, 4.5f]);
@@ -154,23 +173,45 @@ WHERE user="Williams"
154173
ORDER BY Knn::CosineDistance(embedding, $TargetEmbedding)
155174
LIMIT $K;
156175
```
176+
{% else %}
177+
```sql
178+
$K = 10;
179+
$TargetEmbedding = Knn::ToBinaryStringFloat([1.2f, 2.3f, 3.4f, 4.5f]);
180+
181+
SELECT * FROM AS_TABLE($facts)
182+
WHERE user="Williams"
183+
ORDER BY Knn::CosineDistance(embedding, $TargetEmbedding)
184+
LIMIT $K;
185+
```
186+
{% endif %}
157187

158188
### Exact search of vectors in radius R
159189

190+
{% if backend_name == "YDB" %}
160191
```sql
161192
$R = 0.1f;
162193
$TargetEmbedding = Knn::ToBinaryStringFloat([1.2f, 2.3f, 3.4f, 4.5f]);
163194

164195
SELECT * FROM Facts
165196
WHERE Knn::CosineDistance(embedding, $TargetEmbedding) < $R;
166197
```
198+
{% else %}
199+
```sql
200+
$R = 0.1f;
201+
$TargetEmbedding = Knn::ToBinaryStringFloat([1.2f, 2.3f, 3.4f, 4.5f]);
202+
203+
SELECT * FROM AS_TABLE($facts)
204+
WHERE Knn::CosineDistance(embedding, $TargetEmbedding) < $R;
205+
```
206+
{% endif %}
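For readers who want to see what the two exact queries compute, here is a minimal brute-force sketch in Python. It assumes the usual definition of cosine distance as one minus cosine similarity. The function names `exact_knn` and `within_radius` are hypothetical and are not part of the Knn module.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cos(angle between a and b); both vectors must be non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def exact_knn(vectors, target, k):
    """Mirror of ORDER BY distance LIMIT k: sort everything, keep the first k."""
    return sorted(vectors, key=lambda v: cosine_distance(v, target))[:k]


def within_radius(vectors, target, r):
    """Mirror of WHERE distance < r: keep every vector closer than r."""
    return [v for v in vectors if cosine_distance(v, target) < r]


vectors = [[1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0], [-1.0, 0.0, 0.0, 0.0]]
target = [1.2, 2.3, 3.4, 4.5]
nearest = exact_knn(vectors, target, 1)
close = within_radius(vectors, target, 0.1)
print(nearest, close)
```

Both calls scan every vector, which is why exact search is accurate but scales linearly with the amount of data.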
167207

168208
## Approximate search examples
169209

170210
This example differs from the [exact search example](#еxact-search-examples) by using bit quantization.
171211

172212
This makes it possible to first run an approximate preliminary search on the `embedding_bit` column and then refine the results using the original vector column `embedding`.
173213

214+
{% if backend_name == "YDB" %}
174215
### Creating a table
175216

176217
```sql
@@ -191,6 +232,22 @@ $vector = [1.f, 2.f, 3.f, 4.f];
191232
UPSERT INTO Facts (id, user, fact, embedding, embedding_bit)
192233
VALUES (123, "Williams", "Full name is John Williams", Untag(Knn::ToBinaryStringFloat($vector), "FloatVector"), Untag(Knn::ToBinaryStringBit($vector), "BitVector"));
193234
```
235+
{% else %}
236+
### Data declaration
237+
238+
```sql
239+
$vector = [1.f, 2.f, 3.f, 4.f];
240+
$facts = AsList(
241+
AsStruct(
242+
123 AS id, -- Id of fact
243+
"Williams" AS user, -- User name
244+
"Full name is John Williams" AS fact, -- Human-readable description of a user fact
245+
Knn::ToBinaryStringFloat($vector) AS embedding, -- Binary representation of embedding vector
246+
Knn::ToBinaryStringBit($vector) AS embedding_bit, -- Binary representation of embedding vector
247+
),
248+
);
249+
```
250+
{% endif %}
194251

195252
### Scalar quantization
196253

@@ -219,6 +276,7 @@ Approximate search algorithm:
219276
* an approximate list of vectors is obtained;
220277
* we search this list without using quantization.
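The two-stage idea in the list above can be sketched in Python as follows. This is an illustration under stated assumptions, not the module's implementation: `to_bits` uses a hypothetical sign-based rule, the bit form is computed on the fly rather than read from a stored `embedding_bit` column, and the oversampling factor of 10 mirrors the `$K * 10` limit used in the queries.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity; a zero vector is treated as maximally far."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # assumption made for this sketch only
    return 1.0 - dot / (norm_a * norm_b)


def to_bits(vec):
    """Hypothetical bit quantization: 1 for a positive coordinate, else 0."""
    return [1 if x > 0 else 0 for x in vec]


def approximate_knn(rows, target, k, oversample=10):
    """Coarse ranking on quantized vectors, then exact re-ranking."""
    target_bits = to_bits(target)
    # Stage 1: cheap ranking on the quantized form, keeping k * oversample rows.
    candidates = sorted(
        rows,
        key=lambda r: cosine_distance(to_bits(r["embedding"]), target_bits),
    )[: k * oversample]
    # Stage 2: re-rank only the candidates using the original float vectors.
    return sorted(
        candidates,
        key=lambda r: cosine_distance(r["embedding"], target),
    )[:k]


rows = [
    {"id": 1, "embedding": [1.0, 2.0, 3.0, 4.0]},
    {"id": 2, "embedding": [-1.0, -2.0, -3.0, -4.0]},
    {"id": 3, "embedding": [4.0, 3.0, 2.0, 1.0]},
]
result_ids = [r["id"] for r in approximate_knn(rows, [1.2, 2.3, 3.4, 4.5], k=2)]
print(result_ids)
```

The oversampling factor trades speed for recall: a larger candidate list makes it less likely that a true nearest neighbor is dropped in the coarse stage.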
221278

279+
{% if backend_name == "YDB" %}
222280
```sql
223281
$K = 10;
224282
$Target = [1.2f, 2.3f, 3.4f, 4.5f];
@@ -234,3 +292,20 @@ WHERE id IN $Ids
234292
ORDER BY Knn::CosineDistance(embedding, $TargetEmbeddingFloat)
235293
LIMIT $K;
236294
```
295+
{% else %}
296+
```sql
297+
$K = 10;
298+
$Target = [1.2f, 2.3f, 3.4f, 4.5f];
299+
$TargetEmbeddingBit = Knn::ToBinaryStringBit($Target);
300+
$TargetEmbeddingFloat = Knn::ToBinaryStringFloat($Target);
301+
302+
$Ids = SELECT id FROM AS_TABLE($facts)
303+
ORDER BY Knn::CosineDistance(embedding_bit, $TargetEmbeddingBit)
304+
LIMIT $K * 10;
305+
306+
SELECT * FROM AS_TABLE($facts)
307+
WHERE id IN $Ids
308+
ORDER BY Knn::CosineDistance(embedding, $TargetEmbeddingFloat)
309+
LIMIT $K;
310+
```
311+
{% endif %}
Lines changed: 6 additions & 6 deletions
@@ -1,17 +1,17 @@
 items:
 - name: Overview
   href: index.md
+- { name: DateTime, href: datetime.md }
+- { name: Digest, href: digest.md }
+- { name: Histogram, href: histogram.md }
 - { name: Hyperscan, href: hyperscan.md }
+- { name: Ip, href: ip.md }
+- { name: Knn, href: knn.md }
+- { name: Math, href: math.md }
 - { name: Pcre, href: pcre.md }
 - { name: Pire, href: pire.md }
 - { name: Re2, href: re2.md }
 - { name: String, href: string.md }
 - { name: Unicode, href: unicode.md }
-- { name: DateTime, href: datetime.md }
 - { name: Url, href: url.md }
-- { name: Ip, href: ip.md }
-- { name: Knn, href: knn.md }
 - { name: Yson, href: yson.md }
-- { name: Digest, href: digest.md }
-- { name: Math, href: math.md }
-- { name: Histogram, href: histogram.md }

ydb/docs/ru/core/yql/reference/yql-core/udf/list/knn.md

Lines changed: 77 additions & 2 deletions
@@ -31,7 +31,7 @@ LIMIT 10;
3131
This document provides an [example of approximate search](#примеры-приближенного-поиска) using scalar quantization that does not require building a secondary vector index.
3232

3333
**Scalar quantization** is a vector compression method in which coordinates are mapped to a smaller space.
34-
{{ ydb-short-name }} supports exact search over `Float`, `Int8`, `Uint8`, and `Bit` vectors.
34+
This module supports exact search over `Float`, `Int8`, `Uint8`, and `Bit` vectors.
3535
Accordingly, scalar quantization from `Float` to one of these types is possible.
3636

3737
Scalar quantization reduces the time needed for reads and writes because the number of bytes shrinks severalfold.
@@ -46,7 +46,7 @@ LIMIT 10;
4646
## Data types
4747

4848
In mathematics, a vector of real or integer numbers is used to store points.
49-
In {{ ydb-short-name }}, vectors are stored in the `String` data type, which is a binary serialized representation of a vector.
49+
In this module, vectors are represented by the `String` data type, which is a binary serialized representation of a vector.
5050

5151
## Functions
5252

@@ -58,8 +58,10 @@ LIMIT 10;
5858

5959
All serialization functions wrap the returned `String` data into a [Tagged](../../types/special.md) type.
6060

61+
{% if backend_name == "YDB" %}
6162
The binary representation of a vector can be stored in a {{ ydb-short-name }} column.
6263
Currently, {{ ydb-short-name }} does not support storing `Tagged` types, so before storing the binary representation of a vector you must extract the `String` using the [Untag](../../builtins/basic#as-tagged) function.
64+
{% endif %}
6365

6466
#### Function signatures
6567

@@ -125,6 +127,7 @@ Error: Failed to find UDF function: Knn.CosineDistance, reason: Error: Module: K
125127

126128
## Exact search examples
127129

130+
{% if backend_name == "YDB" %}
128131
### Creating a table
129132

130133
```sql
@@ -144,9 +147,25 @@ $vector = [1.f, 2.f, 3.f, 4.f];
144147
UPSERT INTO Facts (id, user, fact, embedding)
145148
VALUES (123, "Williams", "Full name is John Williams", Untag(Knn::ToBinaryStringFloat($vector), "FloatVector"));
146149
```
150+
{% else %}
151+
### Data declaration
152+
153+
```sql
154+
$vector = [1.f, 2.f, 3.f, 4.f];
155+
$facts = AsList(
156+
AsStruct(
157+
123 AS id, -- Id of fact
158+
"Williams" AS user, -- User name
159+
"Full name is John Williams" AS fact, -- Human-readable description of a user fact
160+
Knn::ToBinaryStringFloat($vector) AS embedding, -- Binary representation of embedding vector
161+
),
162+
);
163+
```
164+
{% endif %}
147165

148166
### Exact search of K nearest vectors
149167

168+
{% if backend_name == "YDB" %}
150169
```sql
151170
$K = 10;
152171
$TargetEmbedding = Knn::ToBinaryStringFloat([1.2f, 2.3f, 3.4f, 4.5f]);
@@ -156,22 +175,44 @@ WHERE user="Williams"
156175
ORDER BY Knn::CosineDistance(embedding, $TargetEmbedding)
157176
LIMIT $K;
158177
```
178+
{% else %}
179+
```sql
180+
$K = 10;
181+
$TargetEmbedding = Knn::ToBinaryStringFloat([1.2f, 2.3f, 3.4f, 4.5f]);
182+
183+
SELECT * FROM AS_TABLE($facts)
184+
WHERE user="Williams"
185+
ORDER BY Knn::CosineDistance(embedding, $TargetEmbedding)
186+
LIMIT $K;
187+
```
188+
{% endif %}
159189

160190
### Exact search of vectors within radius R
161191

192+
{% if backend_name == "YDB" %}
162193
```sql
163194
$R = 0.1f;
164195
$TargetEmbedding = Knn::ToBinaryStringFloat([1.2f, 2.3f, 3.4f, 4.5f]);
165196

166197
SELECT * FROM Facts
167198
WHERE Knn::CosineDistance(embedding, $TargetEmbedding) < $R;
168199
```
200+
{% else %}
201+
```sql
202+
$R = 0.1f;
203+
$TargetEmbedding = Knn::ToBinaryStringFloat([1.2f, 2.3f, 3.4f, 4.5f]);
204+
205+
SELECT * FROM AS_TABLE($facts)
206+
WHERE Knn::CosineDistance(embedding, $TargetEmbedding) < $R;
207+
```
208+
{% endif %}
169209

170210
## Approximate search examples
171211

172212
This example differs from the [exact search example](#примеры-точного-поиска) by using bit quantization.
173213
This makes it possible to first run a coarse preliminary search on the `embedding_bit` column and then refine the results using the main vector column `embedding`.
174214

215+
{% if backend_name == "YDB" %}
175216
### Creating a table
176217

177218
```sql
@@ -192,6 +233,22 @@ $vector = [1.f, 2.f, 3.f, 4.f];
192233
UPSERT INTO Facts (id, user, fact, embedding, embedding_bit)
193234
VALUES (123, "Williams", "Full name is John Williams", Untag(Knn::ToBinaryStringFloat($vector), "FloatVector"), Untag(Knn::ToBinaryStringBit($vector), "BitVector"));
194235
```
236+
{% else %}
237+
### Data declaration
238+
239+
```sql
240+
$vector = [1.f, 2.f, 3.f, 4.f];
241+
$facts = AsList(
242+
AsStruct(
243+
123 AS id, -- Id of fact
244+
"Williams" AS user, -- User name
245+
"Full name is John Williams" AS fact, -- Human-readable description of a user fact
246+
Knn::ToBinaryStringFloat($vector) AS embedding, -- Binary representation of embedding vector
247+
Knn::ToBinaryStringBit($vector) AS embedding_bit, -- Binary representation of embedding vector
248+
),
249+
);
250+
```
251+
{% endif %}
195252

196253
### Scalar quantization
197254

@@ -220,6 +277,7 @@ SELECT ListMap($FloatList, $MapInt8);
220277
* an approximate list of vectors is obtained;
221278
* this list is then searched without quantization.
222279

280+
{% if backend_name == "YDB" %}
223281
```sql
224282
$K = 10;
225283
$Target = [1.2f, 2.3f, 3.4f, 4.5f];
@@ -235,3 +293,20 @@ WHERE id IN $Ids
235293
ORDER BY Knn::CosineDistance(embedding, $TargetEmbeddingFloat)
236294
LIMIT $K;
237295
```
296+
{% else %}
297+
```sql
298+
$K = 10;
299+
$Target = [1.2f, 2.3f, 3.4f, 4.5f];
300+
$TargetEmbeddingBit = Knn::ToBinaryStringBit($Target);
301+
$TargetEmbeddingFloat = Knn::ToBinaryStringFloat($Target);
302+
303+
$Ids = SELECT id FROM AS_TABLE($facts)
304+
ORDER BY Knn::CosineDistance(embedding_bit, $TargetEmbeddingBit)
305+
LIMIT $K * 10;
306+
307+
SELECT * FROM AS_TABLE($facts)
308+
WHERE id IN $Ids
309+
ORDER BY Knn::CosineDistance(embedding, $TargetEmbeddingFloat)
310+
LIMIT $K;
311+
```
312+
{% endif %}
