
Commit 8206f71

Squashed commit of the following (all authored by kernelmethod <17100608+kernelmethod@users.noreply.github.com>):

- commit 4cc0ab4152a16d915962d02f4ec1f40844d9fc14 (Sat Jan 18 01:35:23 2020 -0700): Add a couple more links to the API reference on the cosine similarity page.
- commit c2e2cb6 (Sat Jan 18 01:34:06 2020 -0700): Fix some bad formatting on the cosine similarity page.
- commit e6465ab (Sat Jan 18 01:27:50 2020 -0700): Add an FAQ to the docs.
- commit 548b0ca (Fri Jan 17 22:37:14 2020 -0700): Add some basic content to the page for LSHFunction().
- commit 9606638 (Fri Jan 17 22:14:37 2020 -0700): Add warnings blocks to the docstrings for MIPSHash and SignALSH.
- commit 1fa1535 (Fri Jan 17 21:58:32 2020 -0700): Fix footnotes on the page for cosine similarity.
- commit b6e79e3 (Fri Jan 17 21:35:44 2020 -0700): Add a README for the docs/ directory.
- commit 4cd0327 (Fri Jan 17 21:29:36 2020 -0700): Add content to the page for hashing on cosine similarity.
- commit 122a9f3 (Merge ab0b60f and 999d73b, Fri Jan 17 20:41:07 2020 -0700): Merge branch 'master' into docs.

1 parent 999d73b, commit 8206f71

File tree: 7 files changed, +338 -3 lines


docs/README.md

Lines changed: 16 additions & 0 deletions

# LSH.jl documentation

Documentation for the LSH.jl package.

The module documentation is automatically built and updated whenever `master` is updated. To develop the documentation locally, run

```
$ cd docs/
$ julia --project=. --color=yes make.jl
$ python3 -m http.server 8000
```

Then go to [http://localhost:8000/build/](http://localhost:8000/build/) in your browser.

- [Stable docs](https://kernelmethod.github.io/LSH.jl/stable/)
- [Developer docs](https://kernelmethod.github.io/LSH.jl/dev/)

docs/make.jl

Lines changed: 1 addition & 0 deletions

```diff
@@ -20,6 +20,7 @@ makedocs(
                   "Inner product similarity" => joinpath("similarities", "inner_prod.md")],
         "Performance tips" => "performance.md",
         "API reference" => "full_api.md",
+        "FAQ" => "faq.md"
     ]
 )
```

docs/src/faq.md

Lines changed: 98 additions & 0 deletions

# FAQ

## Why do we compute multiple hashes for every input?

In a traditional hash table data structure, you have a single hash function (e.g. MurmurHash) that converts inputs into hashes, which you then use to index into the table. With LSH, you instead randomly generate multiple hash functions from a single LSH family. To index into the hash table, you apply each of those hash functions to your input and concatenate the computed hashes together. The concatenated hashes form the key you use to index into the hash table.
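
Here's a sketch of what that indexing scheme might look like in practice (the `Dict`-based table below is purely illustrative, not an LSH.jl API):

```julia
using LSH

hashfn = SimHash(20)           # 20 hash functions from a single LSH family
data = randn(10, 100)          # each column is a data point

# Bucket column indices by their concatenated 20-bit hash keys
table = Dict{BitVector,Vector{Int}}()
for ii in 1:size(data, 2)
    key = hashfn(data[:,ii])   # a length-20 BitArray, used as the table key
    push!(get!(table, key, Int[]), ii)
end

# To answer a query q, we only compute similarities against the points
# that landed in q's bucket
q = randn(10)
candidates = get(table, hashfn(q), Int[])
```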

The reason for computing multiple hashes is that each LSH function provides (at most) only a few bits of additional information with which to partition the input space. For example, [`SimHash`](@ref) is a single-bit hash: if you create `hashfn = SimHash()`, then `hashfn(x)` can only return either `BitArray([0])` or `BitArray([1])`. If you're trying to use `hashfn` to speed up similarity search, then the hash you compute will -- *at best* -- reduce the number of points you have to search through by only 50% on average.

In fact, the situation can be much more dire than that. If your data are highly structured, each of your hashes is likely to place the data points into just a handful of buckets -- or even a single bucket. For instance, in the snippet below we have a dataset of 100 points that all have very high cosine similarity with one another. If we only create a single hash function when we call [`SimHash`](@ref), then it's very likely that all of the data points will end up with the same hash.

```jldoctest; setup = :(using LSH, Random; Random.seed!(0))
julia> hashfn = SimHash();

julia> data = ones(10, 100); # Each column is a data point

julia> data[end,1:end] .= rand(100); # Randomize the last dimension of each point

julia> hashes = map(x -> hashfn(x), eachcol(data));

julia> unique(hashes)
1-element Array{BitArray{1},1}:
 [0]
```

The solution is to generate multiple hash functions and combine the hashes computed for an input into a single key. In the snippet below, we create 20 hash functions with [`SimHash`](@ref). Each hash computed in `map(x -> hashfn(x), eachcol(data))` is a length-20 `BitArray`.

```jldoctest; setup = :(using LSH, Random; Random.seed!(0))
julia> hashfn = SimHash(20);

julia> data = ones(10, 100); # Each column is a data point

julia> data[end,1:end] .= rand(100); # Randomize the last dimension of each point

julia> hashes = map(x -> hashfn(x), eachcol(data));

julia> unique(hashes) |> length
3

julia> for uh in unique(hashes)
           println(sum(uh == h for h in hashes))
       end
72
16
12
```

Our hash function has generated 3 unique 20-bit hashes, with 72 points sharing the first hash, 16 points sharing the second, and 12 points sharing the third. That's not a great split, but it can still drastically reduce the size of the search space. For instance, the following benchmarks (run on an Intel Core i7-8565U @ 1.80 GHz) suggest that the cost of computing [`SimHash`](@ref) on 10-dimensional data is about 34 times the cost of computing [`cossim`](@ref):

```
julia> using BenchmarkTools

julia> @benchmark(hashfn(x), setup=(x=rand(10)))
BenchmarkTools.Trial:
  memory estimate:  4.66 KiB
  allocs estimate:  6
  --------------
  minimum time:     612.231 ns (0.00% GC)
  median time:      1.563 μs (0.00% GC)
  mean time:        1.728 μs (17.60% GC)
  maximum time:     24.123 μs (92.03% GC)
  --------------
  samples:          10000
  evals/sample:     169

julia> @benchmark(cossim(x,y), setup=(x=rand(10);y=rand(10)))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     46.203 ns (0.00% GC)
  median time:      46.415 ns (0.00% GC)
  mean time:        47.467 ns (0.00% GC)
  maximum time:     160.076 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     988

julia> 1.563e-6 / 46.415e-9
33.67445868792416
```

So as long as [`SimHash`](@ref) reduces the size of the search space by at least 34 data points on average, it's faster than calculating the similarity between the query and every point in the dataset. Even for our tiny dataset of 100 points, that's well worth it: with the 72/16/12 split that we got, [`SimHash`](@ref) reduces the number of similarities we have to calculate by ``100 - \left(\frac{72^2}{100} + \frac{16^2}{100} + \frac{12^2}{100}\right) \approx 44`` points on average.
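
That estimate is easy to verify at the REPL:

```julia
bucket_sizes = [72, 16, 12]

# Expected number of candidate points in the bucket of an average data point
expected_candidates = sum(s^2 / 100 for s in bucket_sizes)   # 55.84

# Average reduction in the number of similarity computations
100 - expected_candidates                                    # ≈ 44.16
```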

!!! info "Improving LSH partitioning"
    LSH can be poor at partitioning your input space when data points are very similar to one another. In these cases, it may help to transform your data in a way that reduces their similarity.

    For instance, in the example above, we created a synthetic dataset with the following code:

    ```julia
    julia> data = ones(10, 100); # Each column is a data point

    julia> data[end,1:end] .= rand(100); # Randomize the last dimension of each point
    ```

    These data are, for all practical purposes, one-dimensional: their first nine dimensions are all the same, and only the last dimension provides any unique information about a given data point. A dimensionality-reduction technique such as principal component analysis (PCA) would help de-correlate the dimensions of the data and thereby reduce the cosine similarity between pairs of points.
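
    As a rough sketch of that idea, here is a plain SVD-based PCA built from the standard library (illustrative only; not an LSH.jl API):

    ```julia
    using LinearAlgebra, Statistics

    data = ones(10, 100)
    data[end,1:end] .= rand(100)

    # Center each dimension, then rotate the data onto its principal axes
    centered = data .- mean(data; dims=2)
    U, S, _ = svd(centered)
    projected = U' * centered     # rows of `projected` are de-correlated

    # Nearly all of the variance lies along a single principal axis
    variance_ratios = S.^2 ./ sum(S.^2)
    ```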

docs/src/lshfunction_api.md

Lines changed: 114 additions & 0 deletions

!!! warning "Under construction"
    This section is currently being developed. If you're interested in helping write this section, feel free to [open a pull request](https://github.com/kernelmethod/LSH.jl/pulls); otherwise, please check back later.

## LSHFunction

The `LSH` module exposes a simple interface for constructing new hash functions. Namely, you call [`LSHFunction`](@ref) with

- the similarity function you want to use;
- the number of hash functions you want to generate; and
- keyword parameters specific to the LSH function family that you're sampling from.

```
LSHFunction(similarity, n_hashes::Integer=1; kws...)
```

For instance, in the snippet below we create a single hash function corresponding to cosine similarity:

```jldoctest
julia> using LSH

julia> hashfn = LSHFunction(cossim);

julia> typeof(hashfn)
SimHash{Float32}

julia> n_hashes(hashfn)
1

julia> similarity(hashfn)
cossim (generic function with 2 methods)
```

As another example, the following code snippet creates 10 hash functions for inner product similarity. All of the generated hash functions are bundled together into a single [`SignALSH`](@ref) struct. We specify the following keyword arguments:

- `dtype`: the data type to use internally in the [`SignALSH`](@ref) struct.
- `maxnorm`: an upper bound on the norm of the data points we're hashing, and a required parameter for [`SignALSH`](@ref).

```jldoctest
julia> using LSH

julia> hashfn = LSHFunction(inner_prod, 10; dtype=Float64, maxnorm=5.0);

julia> n_hashes(hashfn)
10

julia> typeof(hashfn)
SignALSH{Float64}

julia> hashfn.maxnorm
5.0
```

!!! info "Creating multiple hash functions"
    In practice, you usually want to use multiple hash functions at the same time and combine their hashes to form a key with which to index into the hash table. To create `N` hash functions simultaneously, run

    ```julia
    hashfn = LSHFunction(similarity, N; kws...)
    ```

    `hashfn` will bundle `N` different hash functions together; calling `hashfn(x)` will then return a `Vector` of `N` hashes (unless `hashtype(hashfn)` is `Bool`, in which case it will return a `BitArray`).

    - [See the FAQ](@ref Why-do-we-compute-multiple-hashes-for-every-input?) for the reasoning behind using multiple locality-sensitive hash functions simultaneously.

If you want to know what hash function will be created for a given similarity, you can use [`lsh_family`](@ref):

```jldoctest; setup = :(using LSH)
julia> lsh_family(jaccard)
MinHash

julia> lsh_family(ℓ1)
L1Hash
```

## Utilities

LSH.jl provides a few common utility functions that you can use across [`LSHFunction`](@ref) subtypes:

- [`n_hashes`](@ref): returns the number of hash functions computed by an [`LSHFunction`](@ref).

  ```jldoctest; setup = :(using LSH)
  julia> hashfn = LSHFunction(jaccard);

  julia> n_hashes(hashfn)
  1

  julia> hashfn = LSHFunction(jaccard, 10);

  julia> n_hashes(hashfn)
  10
  ```

- [`similarity`](@ref): returns the similarity function for which the input [`LSHFunction`](@ref) is locality-sensitive:

  ```jldoctest; setup = :(using LSH)
  julia> hashfn = LSHFunction(cossim);

  julia> similarity(hashfn)
  cossim (generic function with 2 methods)
  ```

- [`hashtype`](@ref): returns the type of hash computed by the input hash function. Note that in practice `hashfn(x)` (or [`index_hash(hashfn,x)`](@ref) and [`query_hash(hashfn,x)`](@ref) for an [`AsymmetricLSHFunction`](@ref)) will return an array of hashes, one for each hash function you generated. [`hashtype`](@ref) is the data type of each element of `hashfn(x)`.

  ```jldoctest; setup = :(using LSH)
  julia> hashfn = LSHFunction(cossim, 5);

  julia> hashtype(hashfn)
  Bool

  julia> hashes = hashfn(rand(100));

  julia> typeof(hashes)
  BitArray{1}

  julia> typeof(hashes[1]) == hashtype(hashfn)
  true
  ```

docs/src/similarities/cosine.md

Lines changed: 98 additions & 0 deletions

!!! warning "Under construction"
    This section is currently being developed. If you're interested in helping write this section, feel free to [open a pull request](https://github.com/kernelmethod/LSH.jl/pulls); otherwise, please check back later.

## Definition

*Cosine similarity*, roughly speaking, measures the angle between a pair of inputs: two inputs are very similar when the angle between them is small, and their similarity drops as the angle grows.

Concretely, cosine similarity is computed as

``\text{cossim}(x,y) = \frac{\left\langle x,y\right\rangle}{\|x\|\cdot\|y\|} = \left\langle\frac{x}{\|x\|},\frac{y}{\|y\|}\right\rangle``

where ``\left\langle\cdot,\cdot\right\rangle`` is an inner product (e.g., the dot product) and ``\|\cdot\|`` is the norm derived from that inner product. ``\text{cossim}(x,y)`` ranges from ``-1`` (low similarity) to ``1`` (high similarity). To calculate cosine similarity, you can use the [`cossim`](@ref) function exported from the `LSH` module:

```jldoctest
julia> using LSH, LinearAlgebra

julia> x = [5, 3, -1, 1]; # norm(x) == 6

julia> y = [2, -2, -2, 2]; # norm(y) == 4

julia> cossim(x,y) == dot(x,y) / (norm(x)*norm(y))
true

julia> cossim(x,y) == (5*2 + 3*(-2) + (-1)*(-2) + 1*2) / (6*4)
true
```

## SimHash

*SimHash*[^1][^2] is a family of LSH functions for hashing with respect to cosine similarity. You can generate a new hash function from this family by calling [`SimHash`](@ref):

```jldoctest; setup = :(using LSH)
julia> hashfn = SimHash();

julia> n_hashes(hashfn)
1

julia> hashfn = SimHash(40);

julia> n_hashes(hashfn)
40
```

Once constructed, you can start hashing vectors by calling `hashfn(x)`:

```jldoctest; setup = :(using LSH, Random; Random.seed!(0)), output = false
hashfn = SimHash(100)

# x and y have high cosine similarity since they point in the same direction
# x and z have low cosine similarity since they point in opposite directions
x = randn(128)
y = 2x
z = -x

hx, hy, hz = hashfn(x), hashfn(y), hashfn(z)

# Among the 100 hash functions that we generated, we expect more hash
# collisions between x and y than between x and z
sum(hx .== hy) > sum(hx .== hz)

# output
true
```

Note that [`SimHash`](@ref) is a one-bit hash function, meaning that each hash you compute is just one bit. As a result, `hashfn(x)` returns a `BitArray`:

```jldoctest; setup = :(using LSH)
julia> hashfn = SimHash();

julia> n_hashes(hashfn)
1

julia> hashes = hashfn(randn(4));

julia> typeof(hashes)
BitArray{1}

julia> length(hashes)
1
```

Since a single-bit hash doesn't do much to reduce the cost of similarity search, you usually want to generate multiple hash functions at once. For instance, in the snippet below we sample 10 hash functions, so that `hashfn(x)` is a length-10 `BitArray`:

```jldoctest; setup = :(using LSH)
julia> hashfn = SimHash(10);

julia> n_hashes(hashfn)
10

julia> hashes = hashfn(randn(4));

julia> length(hashes)
10
```

---

### Footnotes

[^1]: Moses S. Charikar. *Similarity estimation techniques from rounding algorithms*. In Proceedings of the Thirty-Fourth Annual ACM Symposium on Theory of Computing, STOC '02, pages 380-388, New York, NY, USA, 2002. Association for Computing Machinery. DOI: 10.1145/509907.509965.

[^2]: [`SimHash` API reference](@ref SimHash)

src/hashes/mips_hash.jl

Lines changed: 5 additions & 1 deletion

```diff
@@ -42,7 +42,11 @@ Create a `MIPSHash` hash function for hashing on inner product similarity.
 
 # Keyword parameters
 - $(DTYPE_DOCSTR(MIPSHash))
-- `maxnorm::Union{Nothing,Real}` (default: `nothing`): an upper bound on the ``\\ell^2``-norm of the data points. **Note: this keyword argument must be explicitly specified.** If it left unspecified (or set to `nothing`), `MIPSHash()` will raise an error.
+- `maxnorm::Union{Nothing,Real}` (default: `nothing`): an upper bound on the ``\\ell^2``-norm of the data points.
+
+  !!! warning "Warning: maxnorm must be explicitly set"
+      The `maxnorm` keyword parameter must be explicitly specified. If it is left unspecified (or set to `nothing`), `MIPSHash()` will raise an error.
+
 - `scale::Real` (default: `1`): parameter that affects the probability of a hash collision. Large values of `scale` increase hash collision probability (even for inputs with low inner product similarity); small values of `scale` decrease hash collision probability.
 
 # Examples
```
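
A short sketch of the behavior the new warning block describes (assuming, as in the `SignALSH` example below, that the number of hash functions is passed positionally):

```julia
using LSH

# OK: maxnorm is specified explicitly
hashfn = MIPSHash(4; maxnorm=10.0)

# Raises an error: maxnorm was left unspecified (defaults to `nothing`)
MIPSHash(4)
```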

src/hashes/sign_alsh.jl

Lines changed: 6 additions & 2 deletions

````diff
@@ -41,12 +41,16 @@ Create a `SignALSH` hash function for hashing on inner product similarity.
 
 # Keyword parameters
 - $(DTYPE_DOCSTR(SignALSH))
-- `maxnorm::Union{Nothing,Real}` (default: `nothing`): an upper bound on the ``\\ell^2``-norm of the data points. **Note: this keyword argument must be explicitly specified.** If it left unspecified (or set to `nothing`), `SignALSH()` will raise an error.
+- `maxnorm::Union{Nothing,Real}` (default: `nothing`): an upper bound on the ``\\ell^2``-norm of the data points.
+
+  !!! warning "Warning: maxnorm must be set"
+      The `maxnorm` keyword parameter must be explicitly specified. If it is left unspecified (or set to `nothing`), `SignALSH()` will raise an error.
+
 - `m::Integer` (default: `3`): parameter `m` that affects the probability of a hash collision.
 - $(RESIZE_POW2_DOCSTR(SignALSH))
 
 # Examples
-`SignALSH` is an [`AsymmetricLSHFunction`](@ref), and hence hashes must be computed using `index_hash` and `query_hash`.
+`SignALSH` is an [`AsymmetricLSHFunction`](@ref), and hence hashes must be computed using [`index_hash`](@ref) and [`query_hash`](@ref).
 
 ```jldoctest; setup = :(using LSH)
 julia> hashfn = SignALSH(12; maxnorm=10);
````
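
Since `SignALSH` is asymmetric, indexing and querying use different hashes. A minimal sketch of that usage (illustrative only, building on the example above):

```julia
using LSH

hashfn = SignALSH(12; maxnorm=10)
x = rand(8)

ih = index_hash(hashfn, x)   # hash used to insert x into a hash table
qh = query_hash(hashfn, x)   # hash used to query the table with x
```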
