bannerText: Vector set is a new data type that is currently in preview and may be subject to change.
bannerChildren: true
---

A Redis [vector set]({{< relref "/develop/data-types/vector-sets" >}}) lets
you store a set of unique keys, each with its own associated vector.
You can then retrieve keys from the set according to the similarity between
their stored vectors and a query vector that you specify.

You can use vector sets to store any type of numeric vector, but they are
particularly optimized to work with text embedding vectors (see
[Redis for AI]({{< relref "/develop/ai" >}}) to learn more about text
embeddings). The example below shows how to use the
[`sentence-transformers`](https://pypi.org/project/sentence-transformers/)
library to generate vector embeddings and then
store and retrieve them using a vector set with `redis-py`.

## Initialize

Start by installing the preview version of `redis-py` with the following
command:

```bash
pip install redis==6.0.0b2
```

Also, install `sentence-transformers`:

```bash
pip install sentence-transformers
```

In a new Python file, import the required classes:

```python
from sentence_transformers import SentenceTransformer

import redis
import numpy as np
```

The first of these imports is the
`SentenceTransformer` class, which generates an embedding from a section of text.
Here, we create an instance of `SentenceTransformer` that uses the
[`all-MiniLM-L6-v2`](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)
model for the embeddings. This model generates vectors with 384 dimensions, regardless
of the length of the input text, but note that the input is truncated to 256
tokens (see
[Word piece tokenization](https://huggingface.co/learn/nlp-course/en/chapter6/6)
at the [Hugging Face](https://huggingface.co/) docs to learn more about the way tokens
are related to the original text).

```python
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
```
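
If you want to confirm these numbers for yourself, you can encode a short
test string and inspect the result. This is just a sanity check and isn't
needed for the rest of the example; the test sentence is arbitrary:

```python
# The embedding is a NumPy array with 384 dimensions, and the model
# reports the token limit that it applies to its input.
test_emb = model.encode("A short test sentence.")
print(test_emb.shape)        # >>> (384,)
print(model.max_seq_length)  # >>> 256
```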

## Create the data

For the example, we will use a dictionary of data that contains brief
descriptions of some famous people:

```python
peopleData = {
    "Marie Curie": {
        "born": 1867, "died": 1934,
        "description": """
        Polish-French chemist and physicist. The only person ever to win
        two Nobel prizes for two different sciences.
        """
    },
    "Linus Pauling": {
        "born": 1901, "died": 1994,
        "description": """
        American chemist and peace activist. One of only two people to win two
        Nobel prizes in different fields (chemistry and peace).
        """
    },
    "Freddie Mercury": {
        "born": 1946, "died": 1991,
        "description": """
        British musician, best known as the lead singer of the rock band
        Queen.
        """
    },
    "Marie Fredriksson": {
        "born": 1958, "died": 2019,
        "description": """
        Swedish multi-instrumentalist, mainly known as the lead singer and
        keyboardist of the band Roxette.
        """
    },
    "Paul Erdos": {
        "born": 1913, "died": 1996,
        "description": """
        Hungarian mathematician, known for his eccentric personality almost
        as much as his contributions to many different fields of mathematics.
        """
    },
    "Maryam Mirzakhani": {
        "born": 1977, "died": 2017,
        "description": """
        Iranian mathematician. The first woman ever to win the Fields medal
        for her contributions to mathematics.
        """
    },
    "Masako Natsume": {
        "born": 1957, "died": 1985,
        "description": """
        Japanese actress. She was very famous in Japan but was primarily
        known elsewhere in the world for her portrayal of Tripitaka in the
        TV series Monkey.
        """
    },
    "Chaim Topol": {
        "born": 1935, "died": 2023,
        "description": """
        Israeli actor and singer, usually credited simply as 'Topol'. He was
        best known for his many appearances as Tevye in the musical Fiddler
        on the Roof.
        """
    }
}
```

## Add the data to a vector set

The next step is to connect to Redis and add the data to a new vector set.

The code below uses the dictionary's
[`items()`](https://docs.python.org/3/library/stdtypes.html#dict.items)
view to iterate through all the key-value pairs and add corresponding
elements to a vector set called `famousPeople`.

We use the
[`encode()`](https://sbert.net/docs/package_reference/sentence_transformer/SentenceTransformer.html#sentence_transformers.SentenceTransformer.encode)
method of `SentenceTransformer` to generate the
embedding as an array of `float32` values. The `tobytes()` method converts
the array to a byte string that we pass to the
[`vadd()`]({{< relref "/commands/vadd" >}}) command to set the embedding.
Note that `vadd()` can also accept a list of `float` values to set the
vector, but the byte string format is more compact and saves a little
transmission time. If you later use
[`vemb()`]({{< relref "/commands/vemb" >}}) to retrieve the embedding,
it will return the vector as an array rather than the original byte
string (note that this is different from the behavior of byte strings in
[hash vector indexing]({{< relref "/develop/interact/search-and-query/advanced-concepts/vectors" >}})).

The call to `vadd()` also adds the `born` and `died` values from the
original dictionary as attribute data. You can access this during a query
or by using the [`vgetattr()`]({{< relref "/commands/vgetattr" >}}) method.

```py
r = redis.Redis(decode_responses=True)

for name, details in peopleData.items():
    emb = model.encode(details["description"]).astype(np.float32).tobytes()

    r.vset().vadd(
        "famousPeople",
        emb,
        name,
        attributes={
            "born": details["born"],
            "died": details["died"]
        }
    )
```
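
As a quick check, you can retrieve one of the stored embeddings and its
attribute data. This step isn't needed for the rest of the example, and
the exact return types may vary with the preview API:

```py
# vemb() returns the stored vector as a list of float values,
# not the original byte string.
curie_emb = r.vset().vemb("famousPeople", "Marie Curie")
print(len(curie_emb))  # >>> 384

# vgetattr() returns the attribute data stored with the element
# (a dict or a JSON string, depending on the preview API version).
print(r.vset().vgetattr("famousPeople", "Marie Curie"))
```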

## Query the vector set

We can now query the data in the set. The basic approach is to use the
`encode()` method to generate another embedding vector for the query text.
(This is the same method we used when we added the elements to the set.) Then, we pass
the query vector to [`vsim()`]({{< relref "/commands/vsim" >}}) to return elements
of the set, ranked in order of similarity to the query.

Start with a simple query for "actors":

```py
query_value = "actors"

actors_results = r.vset().vsim(
    "famousPeople",
    model.encode(query_value).astype(np.float32).tobytes(),
)

print(f"'actors': {actors_results}")
```

This returns the following list of elements (formatted slightly for clarity):

```
'actors': ['Masako Natsume', 'Chaim Topol', 'Linus Pauling',
'Marie Fredriksson', 'Maryam Mirzakhani', 'Marie Curie',
'Freddie Mercury', 'Paul Erdos']
```

The first two people in the list are the two actors, as expected, but none of the
people from Linus Pauling onward was especially well-known for acting (and we certainly
didn't include any information about that in the short description text).
As it stands, the search attempts to rank all the elements in the set, based
on the information contained in the embedding model.
You can use the `count` parameter of `vsim()` to limit the list of elements
to just the most relevant few items:

```py
query_value = "actors"

two_actors_results = r.vset().vsim(
    "famousPeople",
    model.encode(query_value).astype(np.float32).tobytes(),
    count=2
)

print(f"'actors (2)': {two_actors_results}")
# >>> 'actors (2)': ['Masako Natsume', 'Chaim Topol']
```
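
The [`VSIM`]({{< relref "/commands/vsim" >}}) command also has a
`WITHSCORES` option that returns a similarity score alongside each element
(1 means an identical vector, 0 means the opposite vector). A sketch of how
this might look, assuming the preview API exposes the option as a
`with_scores` parameter:

```py
query_value = "actors"

actors_scores = r.vset().vsim(
    "famousPeople",
    model.encode(query_value).astype(np.float32).tobytes(),
    count=2,
    with_scores=True  # Assumed parameter name for the WITHSCORES option.
)

# Each element is now paired with its similarity score.
print(f"'actors (with scores)': {actors_scores}")
```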

The reason for using text embeddings rather than simple text search
is that the embeddings represent semantic information. This allows a query
to find elements with a similar meaning even if the text is
different. For example, we
don't use the word "entertainer" in any of the descriptions, but
if we use it as a query, the actors and musicians are ranked highest
in the results list:

```py
query_value = "entertainer"

entertainer_results = r.vset().vsim(
    "famousPeople",
    model.encode(query_value).astype(np.float32).tobytes()
)

print(f"'entertainer': {entertainer_results}")
# >>> 'entertainer': ['Chaim Topol', 'Freddie Mercury',
# 'Marie Fredriksson', 'Masako Natsume', 'Linus Pauling',
# 'Paul Erdos', 'Maryam Mirzakhani', 'Marie Curie']
```

Similarly, we can use "science" as the query text.
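
The query follows the same pattern as the previous examples (the
`science_results` variable name is just illustrative):

```py
query_value = "science"

science_results = r.vset().vsim(
    "famousPeople",
    model.encode(query_value).astype(np.float32).tobytes()
)

print(f"'science': {science_results}")
```

This returns the following results: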

```
'science': ['Marie Curie', 'Linus Pauling', 'Maryam Mirzakhani',
'Paul Erdos', 'Marie Fredriksson', 'Freddie Mercury', 'Masako Natsume',
'Chaim Topol']
```

The scientists are ranked highest, but they are then followed by the
mathematicians. This seems reasonable given the connection between mathematics
and science.

You can also use
[filter expressions]({{< relref "/develop/data-types/vector-sets/filtered-search" >}})
with `vsim()` to restrict the search further. For example,
repeat the "science" query, but this time limit the results to people
who died before the year 2000:

```py
query_value = "science"

science2000_results = r.vset().vsim(
    "famousPeople",
    model.encode(query_value).astype(np.float32).tobytes(),
    filter=".died < 2000"
)

print(f"'science2000': {science2000_results}")
# >>> 'science2000': ['Marie Curie', 'Linus Pauling',
# 'Paul Erdos', 'Freddie Mercury', 'Masako Natsume']
```

Note that the boolean filter expression is applied to items in the list
before the vector distance calculation is performed. Items that don't
pass the filter test are removed from the results completely, rather
than just reduced in rank. This can improve the performance of the
search because there is no need to calculate vector distances for
elements that have already been filtered out.
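
Filter expressions can also combine several conditions. As an illustrative
sketch using the same filter syntax, the query below finds people born in
the twentieth century who also died before 2000, which excludes Marie Curie
from the previous result set:

```py
query_value = "science"

science20th_results = r.vset().vsim(
    "famousPeople",
    model.encode(query_value).astype(np.float32).tobytes(),
    filter=".born >= 1900 and .died < 2000"
)

print(f"'science20th': {science20th_results}")
# Based on the rankings above, this should return:
# ['Linus Pauling', 'Paul Erdos', 'Freddie Mercury', 'Masako Natsume']
```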

## More information

See the [vector sets]({{< relref "/develop/data-types/vector-sets" >}})
docs for more information and code examples. See the
[Redis for AI]({{< relref "/develop/ai" >}}) section for more details
about text embeddings and other AI techniques you can use with Redis.

You may also be interested in
[vector search]({{< relref "/develop/clients/redis-py/vecsearch" >}}).
This is a feature of the
[Redis query engine]({{< relref "/develop/interact/search-and-query" >}})
that lets you retrieve
[JSON]({{< relref "/develop/data-types/json" >}}) and
[hash]({{< relref "/develop/data-types/hashes" >}}) documents based on
vector data stored in their fields.