Traditionally, if you have a data point `x` and want to find the point(s) most similar to it in your database, you would compute the similarity between `x` and every point in the database, keeping whichever points were the most similar. For instance, this is the approach taken by the classic [k-nearest neighbors algorithm](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm). However, it has two major problems:
- The time to find the most similar point(s) to `x` is linear in the number of points in your database. This can make similarity search prohibitively expensive for even moderately large datasets.
- In addition, the time complexity to compute the similarity between two datapoints is typically linear in the number of dimensions of those datapoints. If your data are high-dimensional (i.e. in the thousands to millions of dimensions), every similarity computation you perform can be fairly costly.
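The cost described in these two points can be sketched with a brute-force scan. This is a hypothetical NumPy example using cosine similarity; all names here are illustrative and not part of any package:

```python
import numpy as np

def brute_force_search(x, database, k=1):
    """Return the k points most similar to x by scanning the entire database.

    Cost is O(n * d) for n points of dimension d: one O(d) similarity
    computation for every point in the database.
    """
    # Cosine similarity between x and every row of the database
    sims = database @ x / (np.linalg.norm(database, axis=1) * np.linalg.norm(x))
    top_k = np.argsort(-sims)[:k]   # indices of the k highest similarities
    return top_k, sims[top_k]

rng = np.random.default_rng(42)
database = rng.normal(size=(1000, 64))           # 1000 points, 64 dimensions
x = database[123] + 0.01 * rng.normal(size=64)   # slightly perturbed copy of one point

idx, sims = brute_force_search(x, database, k=1)
# The nearest neighbor found is the point that x was perturbed from
```

Every query touches all `n` points, so query time grows linearly with the size of the database.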
**Locality-sensitive hashing** (LSH) is a technique for accelerating these kinds of similarity searches. Instead of measuring how similar your query point is to every point in your database, you calculate a few hashes of the query point and only compare it against those points with which it experiences a hash collision. Locality-sensitive hash functions are randomly generated, with the fundamental property that as the similarity between `x` and `y` increases, the probability of a hash collision between `x` and `y` also increases.
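To illustrate that property, here is a minimal random-hyperplane hash in the style of SimHash. This is a from-scratch Python sketch for exposition, not the package's implementation: each hash bit records which side of a random hyperplane a vector falls on, so two vectors collide on a bit with probability `1 - θ/π`, where `θ` is the angle between them.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_hashes = 64, 1000

# Each hash function is sign(<r, x>) for a random hyperplane normal r
hyperplanes = rng.normal(size=(n_hashes, d))

def hashes(x):
    return (hyperplanes @ x) >= 0   # n_hashes boolean hash bits

def collision_rate(x, y):
    # Fraction of hash functions on which x and y collide
    return np.mean(hashes(x) == hashes(y))

x = rng.normal(size=d)
similar = x + 0.1 * rng.normal(size=d)   # high cosine similarity to x
dissimilar = rng.normal(size=d)          # roughly orthogonal to x

# Similar pairs collide on far more hash bits than dissimilar ones
assert collision_rate(x, similar) > collision_rate(x, dissimilar)
```

Because collisions become more likely as similarity increases, comparing a query only against its hash-bucket collisions tends to surface the most similar points without a full scan.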
## Installation
So far, there are hash functions for the following similarity functions:

- `MonteCarloHash`
- `ChebHash`
This package still needs a lot of work, including improvement to the documentation and API.
## Examples
The easiest way to start constructing new hash functions is by calling `LSHFunction` with the following syntax:
```
hashfn = LSHFunction(similarity function,
                     number of hash functions to generate;
                     [LSH family-specific keyword arguments])
```
For example, the following snippet generates 10 locality-sensitive hash functions (bundled together into a single `SimHash` struct) for cosine similarity:
```julia
julia> using LSHFunctions;

julia> hashfn = SimHash(10);

julia> x = randn(128);

julia> x_hashes = hashfn(x);
```
For more details, [check out the LSHFunctions.jl documentation](https://kernelmethod.github.io/LSHFunctions.jl/dev/).