## What's LSH?
Traditionally, if you have a data point `x`, and want to find the most similar point(s) to `x` in your database, you would compute the similarity between `x` and all of the points in your database, and keep whichever points were the most similar. For instance, this type of approach is used by the classic [k-nearest neighbors algorithm](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm). However, it has two major problems:
- The time to find the most similar point(s) to `x` is linear in the number of points in your database. This can make similarity search prohibitively expensive for even moderately large datasets.
- In addition, the time complexity to compute the similarity between two data points is typically linear in the number of dimensions of those data points. If your data are high-dimensional (e.g. in the thousands to millions of dimensions), every similarity computation you perform can be fairly costly.
**Locality-sensitive hashing** (LSH) is a technique for accelerating these kinds of similarity searches. Instead of measuring how similar your query point is to every point in your database, you calculate a few hashes of the query point and only compare it against those points with which it experiences a hash collision. Locality-sensitive hash functions are randomly generated, with the fundamental property that as the similarity between `x` and `y` increases, the probability of a hash collision between `x` and `y` also increases.
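The core property can be illustrated without any package. Below is a minimal, self-contained sketch (in base Julia, not the LSH.jl API) of a random-hyperplane hash, the idea behind SimHash: each hash function is the sign of a random projection, and the collision probability between two vectors grows with their cosine similarity. All names here are illustrative.

```julia
using LinearAlgebra, Random

Random.seed!(42)

# One random-hyperplane hash: the sign of a random projection.
# Vectors on the same side of the hyperplane hash to the same bit.
hyperplane_hash(r, x) = dot(r, x) >= 0

# Estimate the collision probability between x and y over many
# randomly drawn hash functions.
function collision_rate(x, y; trials = 10_000)
    d = length(x)
    hits = 0
    for _ in 1:trials
        r = randn(d)
        hits += hyperplane_hash(r, x) == hyperplane_hash(r, y)
    end
    return hits / trials
end

x = randn(100)
similar_y    = x .+ 0.1 .* randn(100)   # high cosine similarity to x
dissimilar_y = randn(100)               # unrelated to x

# Higher similarity -> higher collision probability:
println(collision_rate(x, similar_y))      # close to 1
println(collision_rate(x, dissimilar_y))   # close to 0.5
```

For random hyperplanes the collision probability is exactly `1 - θ/π`, where `θ` is the angle between the two vectors, which is why the second rate hovers near 0.5 for unrelated high-dimensional vectors.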
## Installation
You can install LSH.jl from the Julia REPL with
```
pkg> add https://github.com/kernelmethod/LSH.jl
```
## Supported similarity functions
So far, there are hash functions for the following measures of similarity:

- Cosine similarity (`SimHash`)
- Jaccard similarity (`MinHash`)
- Inner product
  - `SignALSH` (recommended)
  - `MIPSHash`
- Function-space hashes (supports L1, L2, and cosine similarity)
  - `MonteCarloHash`
  - `ChebHash`
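As with cosine similarity above, the MinHash idea can be sketched in a few lines of base Julia (this is an illustrative toy, not the LSH.jl `MinHash` implementation): under a random permutation of the element universe, two sets share the same minimum permuted element with probability exactly equal to their Jaccard similarity.

```julia
using Random

Random.seed!(42)

# One MinHash function: permute the universe of elements and return
# the smallest permuted index present in the set. Two sets agree on
# this value with probability equal to their Jaccard similarity.
minhash(perm::Dict{Int,Int}, s::Set{Int}) = minimum(perm[e] for e in s)

jaccard(a, b) = length(intersect(a, b)) / length(union(a, b))

universe = 1:100
a = Set(1:60)
b = Set(41:100)   # intersection 41:60, union 1:100

# Estimate Jaccard similarity from the hash-collision frequency.
function estimated_jaccard(a, b; trials = 5_000)
    hits = 0
    for _ in 1:trials
        perm = Dict(zip(universe, shuffle(collect(universe))))
        hits += minhash(perm, a) == minhash(perm, b)
    end
    return hits / trials
end

println(jaccard(a, b))            # 0.2
println(estimated_jaccard(a, b))  # approximately 0.2
```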
This package still needs a lot of work, including improvements to the documentation and API.
## Examples
The easiest way to start constructing new hash functions is by calling `LSHFunction` with the following syntax:
```
hashfn = LSHFunction(similarity function,
                     number of hash functions to generate;
                     [LSH family-specific keyword arguments])
```
For example, the following snippet generates 10 locality-sensitive hash functions (bundled together into a single `SimHash` struct) for cosine similarity:
```julia
julia> using LSH;

julia> hashfn = LSHFunction(cossim, 10);

julia> n_hashes(hashfn)
10

julia> similarity(hashfn)
cossim
```
You can then start hashing new vectors by calling `hashfn()`: