@@ -31,6 +31,9 @@ of their meaning.
In the example below, we use the [HuggingFace](https://huggingface.co/) model
[`all-mpnet-base-v2`](https://huggingface.co/sentence-transformers/all-mpnet-base-v2)
to generate the vector embeddings to store and index with Redis Query Engine.
+ The code is first demonstrated for hash documents with a
+ separate section to explain the
+ [differences with JSON documents](#differences-with-json-documents).

## Initialize

@@ -75,6 +78,7 @@ import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Map;
import java.util.List;
+ import org.json.JSONObject;

// Tokenizer to generate the vector embeddings.
import ai.djl.huggingface.tokenizers.HuggingFaceTokenizer;
@@ -185,8 +189,9 @@ as shown below to create the embedding that represents the `content` field.
The `getIds()` method that follows `encode()` obtains the vector
of `long` values which we then convert to a `float` array stored as a `byte`
string using our helper method. Use the `byte` string representation when you are
- indexing hash objects (as we are here), but use the default list of `float` for
- JSON objects. Note that when we set the `embedding` field, we must use an overload
+ indexing hash objects (as we are here), but use an array of `float` for
+ JSON objects (see [Differences with JSON objects](#differences-with-json-documents)
+ below). Note that when we set the `embedding` field, we must use an overload
of `hset()` that requires `byte` arrays for each of the key, the field name, and
the value, which is why we include the `getBytes()` calls on the strings.

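The helper method that produces the `byte` string is defined outside the hunks shown in this diff. As a rough sketch of what such a helper might look like, assuming the index expects `FLOAT32` values packed in little-endian byte order (the `ByteBuffer` and `ByteOrder` imports suggest this, but it is an assumption here), and using the hypothetical class name `VectorBytesSketch`:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Illustrative sketch only; the class name is hypothetical and the
// little-endian layout is an assumption, not confirmed by this diff.
class VectorBytesSketch {
    // Converts each long to a float, then packs the floats into a
    // byte array using 4 bytes per value.
    static byte[] longsToFloatsByteString(long[] input) {
        byte[] bytes = new byte[Float.BYTES * input.length];
        ByteBuffer buffer = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
        for (long value : input) {
            buffer.putFloat((float) value);
        }
        return bytes;
    }

    public static void main(String[] args) {
        byte[] bytes = longsToFloatsByteString(new long[] {1L, 2L});
        // Two values at 4 bytes each.
        System.out.println(bytes.length);
    }
}
```

The resulting array is what would be passed as the value argument to the `byte`-array overload of `hset()` described above.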
@@ -281,6 +286,147 @@ For this model, the text *"That is a happy dog"*
is the result judged to be most similar in meaning to the query text
*"That is a happy person"*.

+ ## Differences with JSON documents
+
+ Indexing JSON documents is similar to hash indexing, but there are some
+ important differences. JSON allows much richer data modeling with nested fields, so
+ you must supply a [path]({{< relref "/develop/data-types/json/path" >}}) in the schema
+ to identify each field you want to index. However, you can declare a short alias for each
+ of these paths (using the `as()` option) to avoid typing it in full for
+ every query. Also, you must specify `IndexDataType.JSON` when you create the index.
+
+ The code below shows these differences, but the index is otherwise very similar to
+ the one created previously for hashes:
+
+ ```java
+ SchemaField[] jsonSchema = {
+     TextField.of("$.content").as("content"),
+     TagField.of("$.genre").as("genre"),
+     VectorField.builder()
+         .fieldName("$.embedding").as("embedding")
+         .algorithm(VectorAlgorithm.HNSW)
+         .attributes(
+             Map.of(
+                 "TYPE", "FLOAT32",
+                 "DIM", 768,
+                 "DISTANCE_METRIC", "L2"
+             )
+         )
+         .build()
+ };
+
+ jedis.ftCreate("vector_json_idx",
+     FTCreateParams.createParams()
+         .addPrefix("jdoc:")
+         .on(IndexDataType.JSON),
+     jsonSchema
+ );
+ ```
+
+ An important difference with JSON indexing is that the vectors are
+ specified using arrays of `float` instead of binary strings. This requires
+ a modified version of the `longsToFloatsByteString()` method
+ used previously:
+
+ ```java
+ public static float[] longArrayToFloatArray(long[] input) {
+     float[] floats = new float[input.length];
+     for (int i = 0; i < input.length; i++) {
+         floats[i] = input[i];
+     }
+     return floats;
+ }
+ ```
340
+
341
+ Use [ ` jsonSet() ` ] ({{< relref "/commands/json.set" >}}) to add the data
342
+ instead of [ ` hset() ` ] ({{< relref "/commands/hset" >}}). Use instances
343
+ of ` JSONObject ` to supply the data instead of ` Map ` , as you would for
344
+ hash objects.
345
+
346
+ ``` java
347
+ String jSentence1 = " That is a very happy person" ;
348
+
349
+ JSONObject jdoc1 = new JSONObject ()
350
+ .put(" content" , jSentence1)
351
+ .put(" genre" , " persons" )
352
+ .put(
353
+ " embedding" ,
354
+ longArrayToFloatArray(
355
+ sentenceTokenizer. encode(jSentence1). getIds()
356
+ )
357
+ );
358
+
359
+ jedis. jsonSet(" jdoc:1" , Path2 . ROOT_PATH , jdoc1);
360
+
361
+ String jSentence2 = " That is a happy dog" ;
362
+
363
+ JSONObject jdoc2 = new JSONObject ()
364
+ .put(" content" , jSentence2)
365
+ .put(" genre" , " pets" )
366
+ .put(
367
+ " embedding" ,
368
+ longArrayToFloatArray(
369
+ sentenceTokenizer. encode(jSentence2). getIds()
370
+ )
371
+ );
372
+
373
+ jedis. jsonSet(" jdoc:2" , Path2 . ROOT_PATH , jdoc2);
374
+
375
+ String jSentence3 = " Today is a sunny day" ;
376
+
377
+ JSONObject jdoc3 = new JSONObject ()
378
+ .put(" content" , jSentence3)
379
+ .put(" genre" , " weather" )
380
+ .put(
381
+ " embedding" ,
382
+ longArrayToFloatArray(
383
+ sentenceTokenizer. encode(jSentence3). getIds()
384
+ )
385
+ );
386
+
387
+ jedis. jsonSet(" jdoc:3" , Path2 . ROOT_PATH , jdoc3);
388
+ ```
389
+
390
+ The query is almost identical to the one for the hash documents. This
391
+ demonstrates how the right choice of aliases for the JSON paths can
392
+ save you having to write complex queries. An important thing to notice
393
+ is that the vector parameter for the query is still specified as a
394
+ binary string (using the ` longsToFloatsByteString() ` method), even though
395
+ the data for the ` embedding ` field of the JSON was specified as an array.
396
+
397
+ ``` java
398
+ String jSentence = " That is a happy person" ;
399
+
400
+ int jK = 3 ;
401
+ Query jq = new Query (" *=>[KNN $K @embedding $BLOB AS distance]" ).
402
+ returnFields(" content" , " distance" ).
403
+ addParam(" K" , jK).
404
+ addParam(
405
+ " BLOB" ,
406
+ longsToFloatsByteString(
407
+ sentenceTokenizer. encode(jSentence). getIds()
408
+ )
409
+ )
410
+ .setSortBy(" distance" , true )
411
+ .dialect(2 );
412
+
413
+ // Execute the query
414
+ List<Document > jDocs = jedis
415
+ .ftSearch(" vector_json_idx" , jq)
416
+ .getDocuments();
417
+
418
+ ```
+
+ Apart from the `jdoc:` prefixes for the keys, the result from the JSON
+ query is the same as for hash:
+
+ ```
+ Results:
+ ID: jdoc:2, Distance: 1411344, Content: That is a happy dog
+ ID: jdoc:1, Distance: 9301635, Content: That is a very happy person
+ ID: jdoc:3, Distance: 67178800, Content: Today is a sunny day
+ ```
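The JSON section stores each vector as an array of `float` but still passes the query vector as a binary string. The following sketch shows why the two forms can match: they encode the same values. It assumes a little-endian `FLOAT32` byte layout, and the class name `VectorEncodingComparison`, the `toByteString()` and `fromByteString()` helpers, and the token IDs are all made up for illustration rather than taken from the page.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

// Illustrative sketch only; names other than longArrayToFloatArray()
// are hypothetical, and the byte order is an assumption.
class VectorEncodingComparison {
    // Array form, as stored in the JSON document's embedding field.
    static float[] longArrayToFloatArray(long[] input) {
        float[] floats = new float[input.length];
        for (int i = 0; i < input.length; i++) {
            floats[i] = input[i];
        }
        return floats;
    }

    // Byte-string form, as passed for the query's vector parameter.
    static byte[] toByteString(float[] floats) {
        byte[] bytes = new byte[Float.BYTES * floats.length];
        ByteBuffer.wrap(bytes)
            .order(ByteOrder.LITTLE_ENDIAN)
            .asFloatBuffer()
            .put(floats);
        return bytes;
    }

    // Decodes the byte string back into floats for comparison.
    static float[] fromByteString(byte[] bytes) {
        float[] floats = new float[bytes.length / Float.BYTES];
        ByteBuffer.wrap(bytes)
            .order(ByteOrder.LITTLE_ENDIAN)
            .asFloatBuffer()
            .get(floats);
        return floats;
    }

    public static void main(String[] args) {
        long[] ids = {101L, 2008L, 2003L}; // made-up token IDs
        float[] forJson = longArrayToFloatArray(ids);
        byte[] forQuery = toByteString(forJson);
        // Compares the array form with the decoded byte-string form.
        System.out.println(Arrays.equals(forJson, fromByteString(forQuery)));
    }
}
```

Only the wire format differs between the two forms. The stored array and the query blob must still agree on the element type and dimension declared in the schema.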
+

## Learn more

See