@@ -56,7 +56,7 @@ import numpy as np
 The first of these imports is the
 `SentenceTransformer` class, which generates an embedding from a section of text.
-Here, we create an instance of `SentenceTransformer` that uses the
+This example uses an instance of `SentenceTransformer` with the
 [`all-MiniLM-L6-v2`](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)
 model for the embeddings. This model generates vectors with 384 dimensions, regardless
 of the length of the input text, but note that the input is truncated to 256
@@ -71,8 +71,8 @@ model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
 
 ## Create the data
 
-For the example, we will use a dictionary of data that contains brief
-descriptions of some famous people:
+The example data is contained in a dictionary with some brief
+descriptions of famous people:
 
 ```python
 peopleData = {
@@ -146,11 +146,11 @@ The code below uses the dictionary's
 view to iterate through all the key-value pairs and add corresponding
 elements to a vector set called `famousPeople`.
 
-We use the
+Use the
 [`encode()`](https://sbert.net/docs/package_reference/sentence_transformer/SentenceTransformer.html#sentence_transformers.SentenceTransformer.encode)
 method of `SentenceTransformer` to generate the
 embedding as an array of `float32` values. The `tobytes()` method converts
-the array to a byte string that we pass to the
+the array to a byte string that you can pass to the
 [`vadd()`]({{< relref "/commands/vadd" >}}) command to set the embedding.
 Note that `vadd()` can also accept a list of `float` values to set the
 vector, but the byte string format is more compact and saves a little
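The `float32`-array-to-byte-string conversion this hunk describes can be illustrated without Redis or the model. The vector below is a random stand-in for a real embedding, used only to show the byte layout:

```python
import numpy as np

# Stand-in for an embedding: 384 float32 values, as all-MiniLM-L6-v2 produces.
embedding = np.random.rand(384).astype(np.float32)

# tobytes() yields the compact form that vadd() accepts:
# 384 values x 4 bytes per float32 = 1536 bytes.
as_bytes = embedding.tobytes()
print(len(as_bytes))  # 1536

# The original values are recoverable exactly from the byte string.
recovered = np.frombuffer(as_bytes, dtype=np.float32)
assert np.array_equal(embedding, recovered)
```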
@@ -183,9 +183,9 @@ for name, details in peopleData.items():
 
 ## Query the vector set
 
-We can now query the data in the set. The basic approach is to use the
+You can now query the data in the set. The basic approach is to use the
 `encode()` method to generate another embedding vector for the query text.
-(This is the same method we used when we added the elements to the set.) Then, we pass
+(This is the same method used to add the elements to the set.) Then, pass
 the query vector to [`vsim()`]({{< relref "/commands/vsim" >}}) to return elements
 of the set, ranked in order of similarity to the query.
 
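Conceptually, `vsim()` ranks the stored elements by their similarity to the query vector. The self-contained sketch below imitates that ranking with cosine similarity over hypothetical three-dimension vectors (real embeddings have 384 dimensions, and `vsim()` performs this ranking server-side):

```python
import numpy as np

# Hypothetical stored embeddings (element name -> vector).
elements = {
    "A": np.array([1.0, 0.0, 0.0], dtype=np.float32),
    "B": np.array([0.9, 0.1, 0.0], dtype=np.float32),
    "C": np.array([0.0, 1.0, 0.0], dtype=np.float32),
}
query = np.array([1.0, 0.0, 0.0], dtype=np.float32)

def cosine(a, b):
    # Cosine similarity: 1.0 for identical directions, 0.0 for orthogonal.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rank element names by similarity to the query, most similar first.
ranked = sorted(elements, key=lambda name: cosine(query, elements[name]),
                reverse=True)
print(ranked)  # ['A', 'B', 'C']
```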
@@ -211,8 +211,8 @@ This returns the following list of elements (formatted slightly for clarity):
 ```
 
 The first two people in the list are the two actors, as expected, but none of the
-people from Linus Pauling onward was especially well-known for acting (and we certainly
-didn't include any information about that in the short description text).
+people from Linus Pauling onward was especially well-known for acting (and there certainly
+isn't any information about that in the short description text).
 
 As it stands, the search attempts to rank all the elements in the set, based
 on the information contained in the embedding model.
 You can use the `count` parameter of `vsim()` to limit the list of elements
@@ -234,10 +234,9 @@ print(f"'actors (2)': {two_actors_results}")
 The reason for using text embeddings rather than simple text search
 is that the embeddings represent semantic information. This allows a query
 to find elements with a similar meaning even if the text is
-different. For example, we
-don't use the word "entertainer" in any of the descriptions but
-if we use it as a query, the actors and musicians are ranked highest
-in the results list:
+different. For example, the word "entertainer" doesn't appear in any of the
+descriptions but if you use it as a query, the actors and musicians are ranked
+highest in the results list:
 
 ```py
 query_value = "entertainer"
@@ -253,7 +252,7 @@ print(f"'entertainer': {entertainer_results}")
 # 'Paul Erdos', 'Maryam Mirzakhani', 'Marie Curie']
 ```
 
-Similarly, if we use "science" as a query, we get the following results:
+Similarly, if you use "science" as a query, you get the following results:
 
 ```
 'science': ['Marie Curie', 'Linus Pauling', 'Maryam Mirzakhani',