@@ -34,9 +34,6 @@ to generate the vector embeddings to store and index with Redis Query Engine.

## Initialize

- Install [`jedis`]({{< relref "/develop/clients/jedis" >}}) if you
- have not already done so.
-
If you are using [Maven](https://maven.apache.org/), add the following
dependencies to your `pom.xml` file:

@@ -83,6 +80,33 @@ import java.util.List;
import ai.djl.huggingface.tokenizers.HuggingFaceTokenizer;
```

+ ## Define a helper method
+
+ Our embedding model represents the vectors as an array of `long` integer values,
+ but Redis Query Engine expects the vector components to be `float` values.
+ Also, when you store vectors in a hash object, you must encode the vector
+ array as a `byte` string. To simplify this situation, we declare a helper
+ method `longsToFloatsByteString()` that takes the `long` array that the
+ embedding model returns, converts it to an array of `float` values, and
+ then encodes the `float` array as a `byte` string:
+
+ ```java
+ public static byte[] longsToFloatsByteString(long[] input) {
+     float[] floats = new float[input.length];
+     for (int i = 0; i < input.length; i++) {
+         floats[i] = input[i];
+     }
+
+     byte[] bytes = new byte[Float.BYTES * floats.length];
+     ByteBuffer
+         .wrap(bytes)
+         .order(ByteOrder.LITTLE_ENDIAN)
+         .asFloatBuffer()
+         .put(floats);
+     return bytes;
+ }
+ ```
+
## Create a tokenizer instance

We will use the
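As a standalone check of the helper added above (a reviewer sketch, not part of the documented page; the class name and sample token IDs are invented), the following program encodes a small `long` array with `longsToFloatsByteString()` and decodes the resulting `byte` string back into `float` values, confirming the little-endian FLOAT32 layout that the index expects:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

public class EncodeCheck {
    // Same conversion as in the docs: long token IDs -> float[] -> LE byte string.
    public static byte[] longsToFloatsByteString(long[] input) {
        float[] floats = new float[input.length];
        for (int i = 0; i < input.length; i++) {
            floats[i] = input[i];
        }
        byte[] bytes = new byte[Float.BYTES * floats.length];
        ByteBuffer
            .wrap(bytes)
            .order(ByteOrder.LITTLE_ENDIAN)
            .asFloatBuffer()
            .put(floats);
        return bytes;
    }

    public static void main(String[] args) {
        long[] ids = {101, 2003, 102};   // made-up token IDs for illustration
        byte[] blob = longsToFloatsByteString(ids);
        System.out.println(blob.length); // 3 floats * 4 bytes = 12

        // Decode the byte string back to floats to confirm the round trip.
        float[] decoded = new float[ids.length];
        ByteBuffer.wrap(blob).order(ByteOrder.LITTLE_ENDIAN).asFloatBuffer().get(decoded);
        System.out.println(Arrays.toString(decoded)); // [101.0, 2003.0, 102.0]
    }
}
```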
@@ -136,8 +160,7 @@ SchemaField[] schema = {
        Map.of(
            "TYPE", "FLOAT32",
            "DIM", 768,
-             "DISTANCE_METRIC", "L2",
-             "INITIAL_CAP", 3
+             "DISTANCE_METRIC", "L2"
        )
    )
    .build()
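One practical consequence of the `TYPE` and `DIM` attributes shown above is that they fix the exact blob size expected for each stored vector, so a length mismatch is a common source of indexing errors. A minimal standalone sketch of that arithmetic (the class and method names are invented for illustration):

```java
public class DimCheck {
    // Required blob length in bytes for a vector field:
    // number of dimensions times the size of one component.
    public static int blobBytes(int dim, int bytesPerComponent) {
        return dim * bytesPerComponent;
    }

    public static void main(String[] args) {
        // TYPE FLOAT32 -> Float.BYTES (4 bytes); DIM 768 from the schema above.
        System.out.println(blobBytes(768, Float.BYTES)); // prints 3072
    }
}
```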
@@ -151,29 +174,6 @@ jedis.ftCreate("vector_idx",
);
```

- ## Define a helper method
-
- The embedding model represents the vectors as an array of `long` integer values,
- but Redis Query Engine expects the vector components to be `float` values.
- Also, when you store vectors in a hash object, you must encode the vector
- array as a `byte` string. To simplify this situation, we declare a helper
- method `longsToFloatsByteString()` that takes the `long` array that the
- embedding model returns, converts it to an array of `float` values, and
- then encodes the `float` array as a `byte` string:
-
- ```java
- public static byte[] longsToFloatsByteString(long[] input) {
-     float[] floats = new float[input.length];
-     for (int i = 0; i < input.length; i++) {
-         floats[i] = input[i];
-     }
-
-     byte[] bytes = new byte[Float.BYTES * floats.length];
-     ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).asFloatBuffer().put(floats);
-     return bytes;
- }
- ```
-
## Add data

You can now supply the data objects, which will be indexed automatically
@@ -182,31 +182,33 @@ you use the `doc:` prefix specified in the index definition.

Use the `encode()` method of the `sentenceTokenizer` object
as shown below to create the embedding that represents the `content` field.
- The `getIds()` method that follows the `encode()` call obtains the vector
+ The `getIds()` method that follows `encode()` obtains the vector
of `long` values which we then convert to a `float` array stored as a `byte`
- string. Use the `byte` string representation when you are indexing hash
- objects (as we are here), but use the default list of `float` for JSON objects.
+ string using our helper method. Use the `byte` string representation when you are
+ indexing hash objects (as we are here), but use the default list of `float` for
+ JSON objects. Note that when we set the `embedding` field, we must use an overload
+ of `hset()` that requires `byte` arrays for each of the key, the field name, and
+ the value, which is why we include the `getBytes()` calls on the strings.

```java
String sentence1 = "That is a very happy person";
- jedis.hset("doc:1", Map.of( "content", sentence1, "genre", "persons"));
+ jedis.hset("doc:1", Map.of("content", sentence1, "genre", "persons"));
jedis.hset(
    "doc:1".getBytes(),
    "embedding".getBytes(),
    longsToFloatsByteString(sentenceTokenizer.encode(sentence1).getIds())
);

String sentence2 = "That is a happy dog";
- jedis.hset("doc:2", Map.of( "content", sentence2, "genre", "pets"));
+ jedis.hset("doc:2", Map.of("content", sentence2, "genre", "pets"));
jedis.hset(
    "doc:2".getBytes(),
    "embedding".getBytes(),
    longsToFloatsByteString(sentenceTokenizer.encode(sentence2).getIds())
);

String sentence3 = "Today is a sunny day";
- Map<String, String> doc3 = Map.of("content", sentence3, "genre", "weather");
- jedis.hset("doc:3", doc3);
+ jedis.hset("doc:3", Map.of("content", sentence3, "genre", "weather"));
jedis.hset(
    "doc:3".getBytes(),
    "embedding".getBytes(),
@@ -218,53 +220,65 @@ jedis.hset(

After you have created the index and added the data, you are ready to run a query.
To do this, you must create another embedding vector from your chosen query
- text. Redis calculates the similarity between the query vector and each
- embedding vector in the index as it runs the query. It then ranks the
- results in order of this numeric similarity value.
+ text. Redis calculates the vector distance between the query vector and each
+ embedding vector in the index as it runs the query. We can request the results to be
+ sorted to rank them in order of ascending distance.

The code below creates the query embedding using the `encode()` method, as with
the indexing, and passes it as a parameter when the query executes (see
[Vector search]({{< relref "/develop/interact/search-and-query/query/vector-search" >}})
for more information about using query parameters with embeddings).
+ The query is a
+ [K nearest neighbors (KNN)]({{< relref "/develop/interact/search-and-query/advanced-concepts/vectors#knn-vector-search" >}})
+ search that sorts the results in order of vector distance from the query vector.

```java
String sentence = "That is a happy person";

int K = 3;
- Query q = new Query("*=>[KNN $K @embedding $BLOB AS score]").
-     returnFields("content", "score").
-     addParam("K", K).
-     addParam(
-         "BLOB",
-         longsToFloatsByteString(
-             sentenceTokenizer.encode(sentence).getIds()
-         )
-     ).
-     dialect(2);
+ Query q = new Query("*=>[KNN $K @embedding $BLOB AS distance]")
+     .returnFields("content", "distance")
+     .addParam("K", K)
+     .addParam(
+         "BLOB",
+         longsToFloatsByteString(
+             sentenceTokenizer.encode(sentence).getIds()
+         )
+     )
+     .setSortBy("distance", true)
+     .dialect(2);

List<Document> docs = jedis.ftSearch("vector_idx", q).getDocuments();

for (Document doc: docs) {
-     System.out.println(doc);
+     System.out.println(
+         String.format(
+             "ID: %s, Distance: %s, Content: %s",
+             doc.getId(),
+             doc.get("distance"),
+             doc.get("content")
+         )
+     );
}
```

- The code is now ready to run, but note that it may take a while to complete when
+ Assuming you have added the code from the steps above to your source file,
+ it is now ready to run, but note that it may take a while to complete when
you run it for the first time (which happens because the tokenizer must download the
`all-mpnet-base-v2` model data before it can
- generate the embeddings). When you run the code, it outputs the following result
- objects:
+ generate the embeddings). When you run the code, it outputs the following result text:

```
- id:doc:1, score: 1.0, properties:[score=9301635, content=That is a very happy person]
- id:doc:2, score: 1.0, properties:[score=1411344, content=That is a happy dog]
- id:doc:3, score: 1.0, properties:[score=67178800, content=Today is a sunny day]
+ Results:
+ ID: doc:2, Distance: 1411344, Content: That is a happy dog
+ ID: doc:1, Distance: 9301635, Content: That is a very happy person
+ ID: doc:3, Distance: 67178800, Content: Today is a sunny day
```

- Note that the results are ordered according to the value of the `vector_distance`
+ Note that the results are ordered according to the value of the `distance`
field, with the lowest distance indicating the greatest similarity to the query.
- As you would expect, the result for `doc:0` with the content text *"That is a very happy person"*
- is the result that is most similar in meaning to the query text
+ For this model, the text *"That is a happy dog"*
+ is the result judged to be most similar in meaning to the query text
*"That is a happy person"*.

## Learn more
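As a closing note on the ranking behavior described above: the `L2` metric orders results by Euclidean distance, so the smallest distance wins. A minimal standalone sketch of that principle (the three-component vectors are invented; the exact value Redis reports for `distance` may be scaled differently, for example as a squared distance):

```java
public class DistanceDemo {
    // Euclidean (L2) distance between two equal-length vectors.
    public static double l2(float[] a, float[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        float[] query = {1.0f, 0.0f, 0.0f};
        float[] docA  = {0.9f, 0.1f, 0.0f}; // close to the query
        float[] docB  = {0.0f, 1.0f, 0.0f}; // farther away
        // Lower distance means greater similarity, so docA ranks first.
        System.out.println(l2(query, docA) < l2(query, docB)); // prints true
    }
}
```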
0 commit comments