Commit ec74293

DOC-5149 added JSON examples for vector search
1 parent 3eecd96 commit ec74293

1 file changed: +120 -7 lines changed

content/develop/clients/redis-py/vecsearch.md

Lines changed: 120 additions & 7 deletions
@@ -28,10 +28,12 @@ similarity of an embedding generated from some query text with embeddings stored
 or JSON fields, Redis can retrieve documents that closely match the query in terms
 of their meaning.
 
-In the example below, we use the
+The example below uses the
 [`sentence-transformers`](https://pypi.org/project/sentence-transformers/)
 library to generate vector embeddings to store and index with
-Redis Query Engine.
+Redis Query Engine. The code is first demonstrated for hash documents with a
+separate section to explain the
+[differences with JSON documents](#differences-with-json-documents).
 
 ## Initialize
 
@@ -50,6 +52,7 @@ from sentence_transformers import SentenceTransformer
 from redis.commands.search.query import Query
 from redis.commands.search.field import TextField, TagField, VectorField
 from redis.commands.search.indexDefinition import IndexDefinition, IndexType
+from redis.commands.json.path import Path
 
 import numpy as np
 import redis
@@ -86,7 +89,7 @@ except redis.exceptions.ResponseError:
     pass
 ```
 
-Next, we create the index.
+Next, create the index.
 The schema in the example below specifies hash objects for storage and includes
 three fields: the text content to index, a
 [tag]({{< relref "/develop/interact/search-and-query/advanced-concepts/tags" >}})
@@ -127,10 +130,10 @@ Use the `model.encode()` method of `SentenceTransformer`
 as shown below to create the embedding that represents the `content` field.
 The `astype()` option that follows the `model.encode()` call specifies that
 we want a vector of `float32` values. The `tobytes()` option encodes the
-vector components together as a single binary string rather than the
-default Python list of `float` values.
-Use the binary string representation when you are indexing hash objects
-(as we are here), but use the default list of `float` for JSON objects.
+vector components together as a single binary string.
+Use the binary string representation when you are indexing hashes
+or running a query (but use a list of `float` for
+[JSON documents](#differences-with-json-documents)).
 
 ```python
 content = "That is a very happy person"
@@ -226,6 +229,116 @@ As you would expect, the result for `doc:0` with the content text *"That is a ve
 is the result that is most similar in meaning to the query text
 *"That is a happy person"*.
 
+## Differences with JSON documents
+
+Indexing JSON documents is similar to hash indexing, but there are some
+important differences. JSON allows much richer data modelling with nested fields, so
+you must supply a [path]({{< relref "/develop/data-types/json/path" >}}) in the schema
+to identify each field you want to index. However, you can declare a short alias for each
+of these paths (using the `as_name` keyword argument) to avoid typing it in full for
+every query. Also, you must specify `IndexType.JSON` when you create the index.
+
+The code below shows these differences, but the index is otherwise very similar to
+the one created previously for hashes:
+
+```py
+schema = (
+    TextField("$.content", as_name="content"),
+    TagField("$.genre", as_name="genre"),
+    VectorField(
+        "$.embedding", "HNSW", {
+            "TYPE": "FLOAT32",
+            "DIM": 384,
+            "DISTANCE_METRIC": "L2"
+        },
+        as_name="embedding"
+    )
+)
+
+r.ft("vector_json_idx").create_index(
+    schema,
+    definition=IndexDefinition(
+        prefix=["jdoc:"], index_type=IndexType.JSON
+    )
+)
+```
+
+Use [`json().set()`]({{< relref "/commands/json.set" >}}) to add the data
+instead of [`hset()`]({{< relref "/commands/hset" >}}). The dictionaries
+that specify the fields have the same structure as the ones used for `hset()`
+but `json().set()` receives them in a positional argument instead of
+the `mapping` keyword argument.
+
+An important difference with JSON indexing is that the vectors are
+specified using lists instead of binary strings. Generate the list
+using the `tolist()` method instead of `tobytes()` as you would with a
+hash.
+
+```py
+content = "That is a very happy person"
+
+r.json().set("jdoc:0", Path.root_path(), {
+    "content": content,
+    "genre": "persons",
+    "embedding": model.encode(content).astype(np.float32).tolist(),
+})
+
+content = "That is a happy dog"
+
+r.json().set("jdoc:1", Path.root_path(), {
+    "content": content,
+    "genre": "pets",
+    "embedding": model.encode(content).astype(np.float32).tolist(),
+})
+
+content = "Today is a sunny day"
+
+r.json().set("jdoc:2", Path.root_path(), {
+    "content": content,
+    "genre": "weather",
+    "embedding": model.encode(content).astype(np.float32).tolist(),
+})
+```
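
The two encodings described in this section can be compared side by side without a Redis server or an embedding model. The sketch below is an illustration only: it assumes the standard-library `array` module with type code `"f"` (a 4-byte float on common platforms) as a stand-in for the NumPy `float32` array produced by `model.encode(...).astype(np.float32)`, and the three-component vector is made up.

```python
from array import array

# Hypothetical stand-in for a float32 embedding. Real embeddings in this
# guide have 384 components (the DIM value in the schema).
embedding = array("f", [0.25, -1.5, 3.0])

# Hash documents and query parameters: one packed binary string,
# 4 bytes per float32 component.
as_bytes = embedding.tobytes()

# JSON documents: a plain list of Python floats.
as_list = embedding.tolist()

# Both forms carry the same values: unpacking the bytes gives the list back.
round_trip = array("f")
round_trip.frombytes(as_bytes)

print(len(as_bytes))                    # 12
print(as_list)                          # [0.25, -1.5, 3.0]
print(round_trip.tolist() == as_list)   # True
```

By the same arithmetic, a 384-component `float32` embedding packs into 384 × 4 = 1536 bytes.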
+
+The query is almost identical to the one for the hash documents. This
+demonstrates how the right choice of aliases for the JSON paths can
+save you having to write complex queries. An important thing to notice
+is that the vector parameter for the query is still specified as a
+binary string (using the `tobytes()` method), even though the data for
+the `embedding` field of the JSON was specified as a list.
+
+```py
+q = Query(
+    "*=>[KNN 3 @embedding $vec AS vector_distance]"
+).return_field("vector_distance").return_field("content").dialect(2)
+
+query_text = "That is a happy person"
+
+res = r.ft("vector_json_idx").search(
+    q, query_params={
+        "vec": model.encode(query_text).astype(np.float32).tobytes()
+    }
+)
+```
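
Because the JSON schema exposes the vector under the alias `embedding`, the KNN clause itself contains nothing JSON-specific. A minimal sketch of that idea, assuming a hypothetical helper `knn_query_string()` that is not part of redis-py and only builds the query text:

```python
def knn_query_string(k: int, field: str = "embedding",
                     param: str = "vec",
                     score_alias: str = "vector_distance") -> str:
    """Build a KNN query string shaped like the one used in this guide."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    # "$" is literal here; only the braces are f-string placeholders.
    return f"*=>[KNN {k} @{field} ${param} AS {score_alias}]"


query_string = knn_query_string(3)
print(query_string)
# *=>[KNN 3 @embedding $vec AS vector_distance]
```

The resulting string can be passed to `Query(...)` exactly as in the code above. Only the index name given to `r.ft()` decides whether hash or JSON documents are searched.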
+
+Apart from the `jdoc:` prefixes for the keys, the result from the JSON
+query is the same as for hash:
+
+```
+Result{
+    3 total,
+    docs: [
+        Document {
+            'id': 'jdoc:0',
+            'payload': None,
+            'vector_distance': '0.114169985056',
+            'content': 'That is a very happy person'
+        },
+        .
+        .
+        .
+```
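
Note that `vector_distance` comes back as a string in the output above. The sketch below shows one way to turn a result into `(id, distance, content)` tuples. The `summarize()` helper is hypothetical, and the stub objects stand in for the result returned by `search()` so that the block runs without a server. Only the `jdoc:0` entry is taken from the output above; the second stub document and its distance are invented for illustration.

```python
from types import SimpleNamespace


def summarize(result):
    """Return (id, distance, content) tuples sorted by ascending distance."""
    rows = [
        (doc.id, float(doc.vector_distance), doc.content)
        for doc in result.docs
    ]
    return sorted(rows, key=lambda row: row[1])


# Stub result: attribute names mirror the fields shown in the output above.
stub_result = SimpleNamespace(docs=[
    SimpleNamespace(id="jdoc:x", vector_distance="0.5",
                    content="made-up entry"),
    SimpleNamespace(id="jdoc:0", vector_distance="0.114169985056",
                    content="That is a very happy person"),
])

for doc_id, distance, content in summarize(stub_result):
    print(f"{doc_id}  {distance:.3f}  {content}")
```

With a live connection, the same helper could be called as `summarize(res)` on the value returned by the query above.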
+
 ## Learn more
 
 See
