Commit 48c40d4

DOC-4837 added index creation code
1 parent 48e6af3 commit 48c40d4

File tree

1 file changed (+65, -62)

content/develop/clients/jedis/vecsearch.md

Lines changed: 65 additions & 62 deletions
@@ -84,69 +84,78 @@ import java.util.List;
 import ai.djl.huggingface.tokenizers.HuggingFaceTokenizer;
 ```
 
+## Create a tokenizer instance
 
-The first of these imports is the
-`SentenceTransformer` class, which generates an embedding from a section of text.
-Here, we create an instance of `SentenceTransformer` that uses the
-[`all-MiniLM-L6-v2`](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)
-model for the embeddings. This model generates vectors with 384 dimensions, regardless
-of the length of the input text, but note that the input is truncated to 256
-tokens (see
-[Word piece tokenization](https://huggingface.co/learn/nlp-course/en/chapter6/6)
-at the [Hugging Face](https://huggingface.co/) docs to learn more about the way tokens
-are related to the original text).
+We will use the
+[`all-mpnet-base-v2`](https://huggingface.co/sentence-transformers/all-mpnet-base-v2)
+tokenizer to generate the embeddings. The vectors that represent the
+embeddings have 768 components, regardless of the length of the input
+text.
 
-```python
-model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
+```java
+HuggingFaceTokenizer sentenceTokenizer = HuggingFaceTokenizer.newInstance(
+    "sentence-transformers/all-mpnet-base-v2",
+    Map.of("maxLength", "768", "modelMaxLength", "768")
+);
 ```
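
To see what the tokenizer produces, here is a minimal sketch (an editor's illustration, not part of the commit; it assumes the `ai.djl.huggingface.tokenizers.Encoding` class that `encode()` returns):

```java
// Encode a sample sentence and inspect the token IDs that are later
// packed into the "embedding" field of each hash.
Encoding enc = sentenceTokenizer.encode("That is a very happy person");
long[] ids = enc.getIds(); // one long per token, up to the maxLength of 768
System.out.println("Token count: " + ids.length);
```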
 
 ## Create the index
 
 Connect to Redis and delete any index previously created with the
-name `vector_idx`. (The `dropindex()` call throws an exception if
+name `vector_idx`. (The `ftDropIndex()` call throws an exception if
 the index doesn't already exist, which is why you need the
-`try: except:` block.)
+`try...catch` block.)
 
-```python
-r = redis.Redis(decode_responses=True)
+```java
+UnifiedJedis jedis = new UnifiedJedis("redis://localhost:6379");
 
-try:
-    r.ft("vector_idx").dropindex(True)
-except redis.exceptions.ResponseError:
-    pass
+try {jedis.ftDropIndex("vector_idx");} catch (JedisDataException j){}
 ```
 
 Next, we create the index.
-The schema in the example below specifies hash objects for storage and includes
-three fields: the text content to index, a
+The schema in the example below includes three fields: the text content to index, a
 [tag]({{< relref "/develop/interact/search-and-query/advanced-concepts/tags" >}})
 field to represent the "genre" of the text, and the embedding vector generated from
 the original text content. The `embedding` field specifies
 [HNSW]({{< relref "/develop/interact/search-and-query/advanced-concepts/vectors#hnsw-index" >}})
 indexing, the
 [L2]({{< relref "/develop/interact/search-and-query/advanced-concepts/vectors#distance-metrics" >}})
 vector distance metric, `Float32` values to represent the vector's components,
-and 384 dimensions, as required by the `all-MiniLM-L6-v2` embedding model.
+and 768 dimensions, as required by the `all-mpnet-base-v2` embedding model.
 
-```python
-schema = (
-    TextField("content"),
-    TagField("genre"),
-    VectorField("embedding", "HNSW", {
-        "TYPE": "FLOAT32",
-        "DIM": 384,
-        "DISTANCE_METRIC": "L2"
-    })
-)
+The `FTCreateParams` object specifies hash objects for storage and a
+prefix `doc:` that identifies the hash objects we want to index.
 
-r.ft("vector_idx").create_index(
-    schema,
-    definition=IndexDefinition(
-        prefix=["doc:"], index_type=IndexType.HASH
-    )
-)
+```java
+SchemaField[] schema = {
+    TextField.of("content"),
+    TagField.of("genre"),
+    VectorField.builder()
+        .fieldName("embedding")
+        .algorithm(VectorAlgorithm.HNSW)
+        .attributes(
+            Map.of(
+                "TYPE", "FLOAT32",
+                "DIM", 768,
+                "DISTANCE_METRIC", "L2",
+                "INITIAL_CAP", 3
+            )
+        )
+        .build()
+};
+
+jedis.ftCreate("vector_idx",
+    FTCreateParams.createParams()
+        .addPrefix("doc:")
+        .on(IndexDataType.HASH),
+    schema
+);
 ```
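
Once `ftCreate()` returns, the index tracks every hash whose key starts with `doc:`. As a quick check (an editor's addition, not in the commit; `ftInfo()` is the standard Jedis wrapper for `FT.INFO`, and the `num_docs` reply field name is an assumption about the server response):

```java
// Inspect the index metadata that Redis reports for vector_idx.
Map<String, Object> info = jedis.ftInfo("vector_idx");
System.out.println(info.get("num_docs")); // 0 until documents are added
```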
 
+## Define some helper methods
+
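
The commit adds this heading but leaves the section empty, although the code below calls a `longArrayToByteArray()` helper. Here is a minimal sketch of what that helper might look like, assuming it packs the token IDs as the `FLOAT32` components the index declares (this implementation is an assumption, not part of the commit):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Assumed implementation: cast each long token ID to a 32-bit float and
// pack the floats as little-endian bytes, matching the index's FLOAT32 type.
public static byte[] longArrayToByteArray(long[] input) {
    byte[] bytes = new byte[Float.BYTES * input.length];
    ByteBuffer buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
    for (long id : input) {
        buf.putFloat((float) id);
    }
    return bytes;
}
```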
 ## Add data
 
 You can now supply the data objects, which will be indexed automatically
@@ -162,30 +171,24 @@ default Python list of `float` values.
 Use the binary string representation when you are indexing hash objects
 (as we are here), but use the default list of `float` for JSON objects.
 
-```python
-content = "That is a very happy person"
-
-r.hset("doc:0", mapping={
-    "content": content,
-    "genre": "persons",
-    "embedding": model.encode(content).astype(np.float32).tobytes(),
-})
-
-content = "That is a happy dog"
-
-r.hset("doc:1", mapping={
-    "content": content,
-    "genre": "pets",
-    "embedding": model.encode(content).astype(np.float32).tobytes(),
-})
-
-content = "Today is a sunny day"
+```java
+String sentence1 = "That is a very happy person";
+jedis.hset("doc:1", Map.of("content", sentence1, "genre", "persons"));
+jedis.hset(
+    "doc:1".getBytes(),
+    "embedding".getBytes(),
+    longArrayToByteArray(sentenceTokenizer.encode(sentence1).getIds())
+);
+
+String sentence2 = "That is a happy dog";
+jedis.hset("doc:2", Map.of("content", sentence2, "genre", "pets"));
+jedis.hset("doc:2".getBytes(), "embedding".getBytes(), longArrayToByteArray(sentenceTokenizer.encode(sentence2).getIds()));
+
+String sentence3 = "Today is a sunny day";
+Map<String, String> doc3 = Map.of("content", sentence3, "genre", "weather");
+jedis.hset("doc:3", doc3);
+jedis.hset("doc:3".getBytes(), "embedding".getBytes(), longArrayToByteArray(sentenceTokenizer.encode(sentence3).getIds()));
 
-r.hset("doc:2", mapping={
-    "content": content,
-    "genre": "weather",
-    "embedding": model.encode(content).astype(np.float32).tobytes(),
-})
 ```
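
As a quick read-back (again an editor's illustration, not commit content), the string fields of any stored hash can be checked with `hgetAll()`:

```java
// Read one document back; the binary "embedding" field will also be
// returned, but decodes as opaque text.
Map<String, String> doc = jedis.hgetAll("doc:2");
System.out.println(doc.get("content")); // "That is a happy dog"
System.out.println(doc.get("genre"));   // "pets"
```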
 
 ## Run a query
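
The diff stops at this heading. For context, here is a sketch of the kind of KNN query this index supports (the `FT.SEARCH` query syntax is standard; the surrounding Jedis code and variable names are an editor's assumption, not part of this commit):

```java
// Find the 3 documents whose embeddings are nearest to the query
// sentence's vector, reusing sentenceTokenizer and longArrayToByteArray().
String querySentence = "That is a happy person";
byte[] queryVector = longArrayToByteArray(
    sentenceTokenizer.encode(querySentence).getIds()
);

SearchResult results = jedis.ftSearch(
    "vector_idx",
    "*=>[KNN 3 @embedding $vec AS score]",
    FTSearchParams.searchParams()
        .addParam("vec", queryVector)
        .returnFields("content", "score")
        .dialect(2)
);
results.getDocuments().forEach(doc ->
    System.out.println(doc.get("content") + " (score: " + doc.get("score") + ")")
);
```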
