Commit e72794c

Update docs.
1 parent ebb9aa9 commit e72794c

2 files changed

+98
-27
lines changed

docs/src/lshfunction_api.md

Lines changed: 27 additions & 27 deletions
@@ -4,7 +4,7 @@
44
This section is currently being developed. If you're interested in helping write this section, feel free to [open a pull request](https://github.com/kernelmethod/LSHFunctions.jl/pulls); otherwise, please check back later.
55

66
## LSHFunction
7-
The `LSH` module exposes a relatively easy interface for constructing new hash functions. Namely, you call [`LSHFunction`](@ref) with
7+
The `LSHFunctions` module exposes a relatively easy interface for constructing new hash functions. Namely, you call [`LSHFunction`](@ref) with
88

99
- the similarity function you want to use;
1010
- the number of hash functions you want to generate; and
@@ -77,48 +77,48 @@ LSHFunctions.jl provides a few common utility functions that you can use across
7777

7878
- [`n_hashes`](@ref): returns the number of hash functions computed by an [`LSHFunction`](@ref).
7979

80-
```jldoctest; setup = :(using LSHFunctions)
81-
julia> hashfn = LSHFunction(jaccard);
80+
```jldoctest; setup = :(using LSHFunctions)
81+
julia> hashfn = LSHFunction(jaccard);
8282
83-
julia> n_hashes(hashfn)
84-
1
83+
julia> n_hashes(hashfn)
84+
1
8585
86-
julia> hashfn = LSHFunction(jaccard, 10);
86+
julia> hashfn = LSHFunction(jaccard, 10);
8787
88-
julia> n_hashes(hashfn)
89-
10
88+
julia> n_hashes(hashfn)
89+
10
9090
91-
julia> hashes = hashfn(randn(50));
91+
julia> hashes = hashfn(randn(50));
9292
93-
julia> length(hashes)
94-
10
95-
```
93+
julia> length(hashes)
94+
10
95+
```
9696

9797
- [`similarity`](@ref): returns the similarity function for which the input [`LSHFunction`](@ref) is locality-sensitive:
9898

99-
```jldoctest; setup = :(using LSHFunctions)
100-
julia> hashfn = LSHFunction(cossim);
99+
```jldoctest; setup = :(using LSHFunctions)
100+
julia> hashfn = LSHFunction(cossim);
101101
102-
julia> similarity(hashfn)
103-
cossim (generic function with 2 methods)
104-
```
102+
julia> similarity(hashfn)
103+
cossim (generic function with 2 methods)
104+
```
105105

106106
- [`hashtype`](@ref): returns the type of hash computed by the input hash function. Note that in practice `hashfn(x)` (or [`index_hash(hashfn,x)`](@ref) and [`query_hash(hashfn,x)`](@ref) for an [`AsymmetricLSHFunction`](@ref)) will return an array of hashes, one for each hash function you generated. [`hashtype`](@ref) is the data type of each element of `hashfn(x)`.
107107

108-
```jldoctest; setup = :(using LSHFunctions)
109-
julia> hashfn = LSHFunction(cossim, 5);
108+
```jldoctest; setup = :(using LSHFunctions)
109+
julia> hashfn = LSHFunction(cossim, 5);
110110
111-
julia> hashtype(hashfn)
112-
Bool
111+
julia> hashtype(hashfn)
112+
Bool
113113
114-
julia> hashes = hashfn(rand(100));
114+
julia> hashes = hashfn(rand(100));
115115
116-
julia> typeof(hashes)
117-
BitArray{1}
116+
julia> typeof(hashes)
117+
BitArray{1}
118118
119-
julia> typeof(hashes[1]) == hashtype(hashfn)
120-
true
121-
```
119+
julia> typeof(hashes[1]) == hashtype(hashfn)
120+
true
121+
```
122122

123123
- [`collision_probability`](@ref): returns the probability of collision for two inputs with a given similarity. For instance, the probability that a single MinHash hash function causes a collision between inputs `A` and `B` is equal to [`jaccard(A,B)`](@ref jaccard):
124124

docs/src/similarities/jaccard.md

Lines changed: 71 additions & 0 deletions
@@ -2,3 +2,74 @@
22

33
!!! warning "Under construction"
44
This section is currently being developed. If you're interested in helping write this section, feel free to [open a pull request](https://github.com/kernelmethod/LSHFunctions.jl/pulls); otherwise, please check back later.
5+
6+
## Definition
7+
*Jaccard similarity* is a statistic that measures the amount of overlap between two sets. It is defined as
8+
9+
```math
10+
J(A,B) = \frac{|A \cap B|}{|A \cup B|}
11+
```
12+
13+
``J(A,B)`` is bounded by ``0 \le J(A,B) \le 1``, with values close to 1 indicating high similarity and values close to 0 indicating low similarity.
14+
15+
You can calculate Jaccard similarity with the LSHFunctions package by calling [`jaccard`](@ref):
16+
17+
```jldoctest
18+
julia> using LSHFunctions;
19+
20+
julia> A = Set([1, 2, 3]); B = Set([2, 3, 4]);
21+
22+
julia> jaccard(A,B) ==
23+
length(A ∩ B) / length(A ∪ B) ==
24+
0.5
25+
true
26+
```
27+
28+
## MinHash
29+
*MinHash*[^Broder97] is a hash function for Jaccard similarity. It takes as input a set, and returns as output a `UInt32` or a `UInt64`. To sample a function from the MinHash LSH family, simply call [`MinHash`](@ref) with the number of hash functions you want to generate:
30+
31+
```jldoctest; setup = :(using LSHFunctions, Random; Random.seed!(0))
32+
julia> hashfn = MinHash(5);
33+
34+
julia> n_hashes(hashfn)
35+
5
36+
37+
julia> hashtype(hashfn)
38+
UInt64
39+
40+
julia> A = Set([1, 2, 3]);
41+
42+
julia> hashfn(A)
43+
5-element Array{UInt64,1}:
44+
0x21be0e591a3b69ea
45+
0x19c5f638a776ab3c
46+
0x63c12fd5d2f073ab
47+
0x5c6b11e538a36352
48+
0x129ef927e80a1b39
49+
```
50+
51+
The probability that an individual hash collides for sets ``A`` and ``B`` equals their Jaccard similarity, i.e.
52+
53+
```math
54+
Pr[h(A) = h(B)] = J(A,B)
55+
```
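
As a hedged aside, the single-hash identity above extends to several hashes if one assumes the hash functions are sampled independently (an assumption made here for illustration, not a statement from this page). The symbols ``h_1, \ldots, h_n`` are introduced only for this sketch:

```math
Pr[h_1(A) = h_1(B), \ldots, h_n(A) = h_n(B)] = \prod_{i=1}^{n} Pr[h_i(A) = h_i(B)] = J(A,B)^n
```

For example, with ``J(A,B) = 0.5`` and ``n = 3``, the probability that all three hashes collide would be ``0.5^3 = 0.125``.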
56+
57+
```@eval
58+
using PyPlot, LSHFunctions;
59+
fig = figure();
60+
hashfn = MinHash();
61+
x = range(0, 1; length=1024);
62+
y = collision_probability(hashfn, x; n_hashes=1);
63+
64+
plot(x, y)
65+
title("Probability of hash collision for MinHash")
66+
xlabel(raw"$J(A,B)$")
67+
ylabel(raw"$Pr[h(A) = h(B)]$")
68+
69+
savefig("minhash_collision_probability.svg")
70+
```
71+
72+
![Probability of collision for MinHash](minhash_collision_probability.svg)
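
The identity ``Pr[h(A) = h(B)] = J(A,B)`` can also be checked empirically. The sketch below does not use LSHFunctions.jl: it is a textbook MinHash built from random permutations of a small universe `1:n`, and the helper `minhash_collision_rate` and its parameters are hypothetical names introduced for illustration. It is not necessarily how `MinHash` is implemented in the package.

```julia
using Random

# Hypothetical helper (not part of LSHFunctions.jl): estimates how often a
# permutation-based MinHash assigns the same hash to sets A and B, whose
# elements are assumed to lie in 1:n.
function minhash_collision_rate(A, B, n; trials=20_000, rng=MersenneTwister(0))
    collisions = 0
    for _ in 1:trials
        perm = randperm(rng, n)            # perm[x] is the rank given to element x
        hA = minimum(perm[x] for x in A)   # MinHash of A under this permutation
        hB = minimum(perm[x] for x in B)   # MinHash of B under this permutation
        collisions += (hA == hB)
    end
    return collisions / trials
end

A = Set([1, 2, 3]); B = Set([2, 3, 4])
J = length(A ∩ B) / length(A ∪ B)          # exact Jaccard similarity
rate = minhash_collision_rate(A, B, 10)    # empirical collision frequency
println("J(A,B) = ", J, ", empirical collision rate = ", rate)
```

The two hashes agree exactly when the lowest-ranked element of `A ∪ B` lies in `A ∩ B`, which is why the empirical rate should land close to `J` as the number of trials grows.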
73+
74+
## Footnotes
75+
[^Broder97]: Broder, A. *On the resemblance and containment of documents*. Compression and Complexity of Sequences: Proceedings, Positano, Amalfitan Coast, Salerno, Italy, June 11-13, 1997. doi:10.1109/SEQUEN.1997.666900.
