@@ -84,69 +84,78 @@ import java.util.List;
import ai.djl.huggingface.tokenizers.HuggingFaceTokenizer;
```
+ ## Create a tokenizer instance

- The first of these imports is the
- `SentenceTransformer` class, which generates an embedding from a section of text.
- Here, we create an instance of `SentenceTransformer` that uses the
- [`all-MiniLM-L6-v2`](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)
- model for the embeddings. This model generates vectors with 384 dimensions, regardless
- of the length of the input text, but note that the input is truncated to 256
- tokens (see
- [Word piece tokenization](https://huggingface.co/learn/nlp-course/en/chapter6/6)
- at the [Hugging Face](https://huggingface.co/) docs to learn more about the way tokens
- are related to the original text).
+ We will use the
+ [`all-mpnet-base-v2`](https://huggingface.co/sentence-transformers/all-mpnet-base-v2)
+ tokenizer to generate the embeddings. The vectors that represent the
+ embeddings have 768 components, regardless of the length of the input
+ text.
- ```python
- model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
+ ```java
+ HuggingFaceTokenizer sentenceTokenizer = HuggingFaceTokenizer.newInstance(
+     "sentence-transformers/all-mpnet-base-v2",
+     Map.of("maxLength", "768", "modelMaxLength", "768")
+ );
```
## Create the index
Connect to Redis and delete any index previously created with the
- name `vector_idx`. (The `dropindex()` call throws an exception if
+ name `vector_idx`. (The `ftDropIndex()` call throws an exception if
the index doesn't already exist, which is why you need the
- `try: except:` block.)
+ `try...catch` block.)
- ```python
- r = redis.Redis(decode_responses=True)
+ ```java
+ UnifiedJedis jedis = new UnifiedJedis("redis://localhost:6379");
- try:
-     r.ft("vector_idx").dropindex(True)
- except redis.exceptions.ResponseError:
-     pass
+ try {
+     jedis.ftDropIndex("vector_idx");
+ } catch (JedisDataException e) {
+     // The index doesn't exist yet, so there is nothing to drop.
+ }
```
Next, we create the index.
- The schema in the example below specifies hash objects for storage and includes
- three fields: the text content to index, a
+ The schema in the example below includes three fields: the text content to index, a
[ tag] ({{< relref "/develop/interact/search-and-query/advanced-concepts/tags" >}})
field to represent the "genre" of the text, and the embedding vector generated from
the original text content. The `embedding` field specifies
[HNSW]({{< relref "/develop/interact/search-and-query/advanced-concepts/vectors#hnsw-index" >}})
indexing, the
[L2]({{< relref "/develop/interact/search-and-query/advanced-concepts/vectors#distance-metrics" >}})
vector distance metric, `Float32` values to represent the vector's components,
- and 384 dimensions, as required by the `all-MiniLM-L6-v2` embedding model.
+ and 768 dimensions, as required by the `all-mpnet-base-v2` embedding model.
- ```python
- schema = (
-     TextField("content"),
-     TagField("genre"),
-     VectorField("embedding", "HNSW", {
-         "TYPE": "FLOAT32",
-         "DIM": 384,
-         "DISTANCE_METRIC": "L2"
-     })
- )
+ The `FTCreateParams` object specifies hash objects for storage and a
+ prefix `doc:` that identifies the hash objects we want to index.
- r.ft("vector_idx").create_index(
-     schema,
-     definition=IndexDefinition(
-         prefix=["doc:"], index_type=IndexType.HASH
-     )
- )
+ ```java
+ SchemaField[] schema = {
+     TextField.of("content"),
+     TagField.of("genre"),
+     VectorField.builder()
+         .fieldName("embedding")
+         .algorithm(VectorAlgorithm.HNSW)
+         .attributes(
+             Map.of(
+                 "TYPE", "FLOAT32",
+                 "DIM", 768,
+                 "DISTANCE_METRIC", "L2",
+                 "INITIAL_CAP", 3
+             )
+         )
+         .build()
+ };
+
+ jedis.ftCreate("vector_idx",
+     FTCreateParams.createParams()
+         .addPrefix("doc:")
+         .on(IndexDataType.HASH),
+     schema
+ );
```
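The index above compares vectors with the L2 (Euclidean) distance metric. As a rough illustration of what that metric computes, here is a minimal, dependency-free sketch (the class and method names are ours, not part of the Jedis API):

```java
public class L2Demo {
    // Squared L2 distance between two equal-length vectors:
    // the sum over i of (a[i] - b[i])^2. Ranking by squared distance
    // is equivalent to ranking by Euclidean distance.
    static float l2Squared(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            float d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    public static void main(String[] args) {
        float[] a = {1f, 2f, 2f};
        float[] b = {1f, 0f, 0f};
        // (0)^2 + (2)^2 + (2)^2 = 8
        System.out.println(l2Squared(a, b)); // prints 8.0
    }
}
```

A smaller squared distance means the two vectors (and so the two texts they represent) are more similar.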
+ ## Define some helper methods
+
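The code in the next section calls a `longArrayToByteArray()` helper whose definition is elided from this excerpt. One plausible sketch, assuming the helper packs each token ID as a 4-byte little-endian float to match the `FLOAT32` layout the index declares (this is our guess at the implementation, not the version from this commit):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class VectorBytes {
    // Hypothetical helper: packs each long as a 4-byte little-endian
    // float, the binary layout Redis expects for FLOAT32 vector fields.
    public static byte[] longArrayToByteArray(long[] input) {
        ByteBuffer buf = ByteBuffer
                .allocate(Float.BYTES * input.length)
                .order(ByteOrder.LITTLE_ENDIAN);
        for (long v : input) {
            buf.putFloat((float) v);
        }
        return buf.array();
    }

    public static void main(String[] args) {
        byte[] bytes = longArrayToByteArray(new long[]{101, 2054, 102});
        System.out.println(bytes.length); // 3 values * 4 bytes = 12
    }
}
```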
## Add data
You can now supply the data objects, which will be indexed automatically
@@ -162,30 +171,24 @@ default Python list of `float` values.
Use the binary string representation when you are indexing hash objects
(as we are here), but use the default list of `float` for JSON objects.
- ```python
- content = "That is a very happy person"
-
- r.hset("doc:0", mapping={
-     "content": content,
-     "genre": "persons",
-     "embedding": model.encode(content).astype(np.float32).tobytes(),
- })
-
- content = "That is a happy dog"
-
- r.hset("doc:1", mapping={
-     "content": content,
-     "genre": "pets",
-     "embedding": model.encode(content).astype(np.float32).tobytes(),
- })
-
- content = "Today is a sunny day"
+ ```java
+ String sentence1 = "That is a very happy person";
+ jedis.hset("doc:1", Map.of("content", sentence1, "genre", "persons"));
+ jedis.hset(
+     "doc:1".getBytes(),
+     "embedding".getBytes(),
+     longArrayToByteArray(sentenceTokenizer.encode(sentence1).getIds())
+ );
+
+ String sentence2 = "That is a happy dog";
+ jedis.hset("doc:2", Map.of("content", sentence2, "genre", "pets"));
+ jedis.hset(
+     "doc:2".getBytes(),
+     "embedding".getBytes(),
+     longArrayToByteArray(sentenceTokenizer.encode(sentence2).getIds())
+ );
+
+ String sentence3 = "Today is a sunny day";
+ Map<String, String> doc3 = Map.of("content", sentence3, "genre", "weather");
+ jedis.hset("doc:3", doc3);
+ jedis.hset(
+     "doc:3".getBytes(),
+     "embedding".getBytes(),
+     longArrayToByteArray(sentenceTokenizer.encode(sentence3).getIds())
+ );
-
- r.hset("doc:2", mapping={
-     "content": content,
-     "genre": "weather",
-     "embedding": model.encode(content).astype(np.float32).tobytes(),
- })
-
```
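Because the hash fields store each embedding as a packed binary string, reading one back from Redis yields raw bytes. A minimal sketch of decoding such a buffer into floats, assuming the 4-byte little-endian `FLOAT32` layout used above (the class and method names are ours):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class DecodeVector {
    // Unpacks a FLOAT32 little-endian byte buffer (as stored in the
    // "embedding" hash field) back into a float array.
    public static float[] byteArrayToFloatArray(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
        float[] out = new float[bytes.length / Float.BYTES];
        for (int i = 0; i < out.length; i++) {
            out[i] = buf.getFloat();
        }
        return out;
    }

    public static void main(String[] args) {
        // Round-trip a tiny two-component example buffer.
        ByteBuffer buf = ByteBuffer.allocate(8).order(ByteOrder.LITTLE_ENDIAN);
        buf.putFloat(1.5f).putFloat(-2.0f);
        float[] v = byteArrayToFloatArray(buf.array());
        System.out.println(v[0] + " " + v[1]); // prints 1.5 -2.0
    }
}
```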
## Run a query