Commit 41e6c45

Add collision_probability to the LSHFunction API, and add documentation for it. We now use this function in place of single_hash_collision_probability in order to compute the probability of a collision between a pair of inputs, as well as to compute the probability that multiple inputs simultaneously collide. This fixes issue #6.
1 parent 40c7d8a commit 41e6c45

13 files changed: +206 -24 lines changed


docs/src/full_api.md

Lines changed: 1 addition & 0 deletions
@@ -8,6 +8,7 @@ lsh_family
 hashtype
 n_hashes
 similarity
+collision_probability
 index_hash
 query_hash
 SymmetricLSHFunction

docs/src/lshfunction_api.md

Lines changed: 36 additions & 0 deletions
@@ -120,4 +120,40 @@ julia> typeof(hashes[1]) == hashtype(hashfn)
 true
 ```
 
+- [`collision_probability`](@ref): returns the probability of collision for two inputs with a given similarity. For instance, the probability that a single MinHash hash function causes a collision between inputs `A` and `B` is equal to [`jaccard(A,B)`](@ref jaccard):
+
+```jldoctest; setup = :(using LSH)
+julia> hashfn = MinHash();
+
+julia> A = Set(["a", "b", "c"]);
+
+julia> B = Set(["b", "c", "d"]);
+
+julia> collision_probability(hashfn, A, B) ==
+       collision_probability(hashfn, jaccard(A,B)) ==
+       jaccard(A,B)
+true
+```
+
+We often want to compute the probability that not just one hash collides, but that multiple hashes collide simultaneously. You can calculate this using the `n_hashes` keyword argument. If left unspecified, then [`collision_probability`](@ref) will use [`n_hashes(hashfn)`](@ref n_hashes) hash functions to compute the probability.
+
+```jldoctest; setup = :(using LSH)
+julia> hashfn = MinHash(5);
+
+julia> A = Set(["a", "b", "c"]);
+
+julia> B = Set(["b", "c", "d"]);
+
+julia> collision_probability(hashfn, A, B) ==
+       collision_probability(hashfn, A, B; n_hashes=5) ==
+       collision_probability(hashfn, A, B; n_hashes=1)^5
+true
+
+julia> sim = jaccard(A,B);
+
+julia> collision_probability(hashfn, sim) ==
+       collision_probability(hashfn, sim; n_hashes=5) ==
+       collision_probability(hashfn, sim; n_hashes=1)^5
+true
+```
 

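The relationship documented above (single-hash probability equal to the Jaccard similarity, raised to the number of hashes) can be checked without the package. This is a minimal sketch: `jaccard_sketch` and `multi_hash_probability` are hypothetical stand-ins written for illustration, not the functions exported by LSH.

```julia
# Hypothetical stand-ins for illustration; these are NOT the LSH package's own functions.

# Jaccard similarity |A ∩ B| / |A ∪ B| between two non-empty sets.
jaccard_sketch(A::Set, B::Set) = length(intersect(A, B)) / length(union(A, B))

# Probability that n independent hashes all collide, given single-hash probability p.
multi_hash_probability(p::Real, n::Integer) = p^n

A = Set(["a", "b", "c"])
B = Set(["b", "c", "d"])

p  = jaccard_sketch(A, B)           # two shared elements out of four distinct ones
p5 = multi_hash_probability(p, 5)   # probability that five hashes collide at once

println("single hash: ", p)
println("five hashes: ", p5)
```

Because each extra hash multiplies in another factor of `p`, the joint collision probability falls off geometrically as `n_hashes` grows.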
src/LSH.jl

Lines changed: 1 addition & 1 deletion
@@ -48,6 +48,6 @@ export SimHash, L1Hash, L2Hash, MIPSHash, SignALSH, MinHash,
 
 # Helper / utility functions for LSHFunctions
 export index_hash, query_hash, n_hashes, hashtype, similarity, lsh_family,
-       embedded_similarity
+       embedded_similarity, collision_probability
 
 end # module

src/LSHBase.jl

Lines changed: 143 additions & 0 deletions
@@ -57,6 +57,136 @@ macro register_similarity! end
 function LSHFunction end
 function lsh_family end
 
+@doc """
+    collision_probability(hashfn::H, sim;
+                          n_hashes::Union{Symbol,Integer}=:auto) where {H <: LSHFunction}
+
+Compute the probability of hash collision between two inputs with similarity `sim` for an [`LSHFunction`](@ref) of type `H`. This function returns the probability that `n_hashes` hashes simultaneously collide.
+
+# Arguments
+- `hashfn::LSHFunction`: the `LSHFunction` for which we want to compute the probability of collision.
+- `sim`: a similarity (or vector of similarities), computed using the similarity function returned by `similarity(hashfn)`.
+
+# Keyword arguments
+- `n_hashes::Union{Symbol,Integer}` (default: `:auto`): the number of hash functions to use to compute the probability of collision. If the probability that a single hash collides is ``p``, then the probability that `n_hashes` hashes simultaneously collide is
+
+```math
+p^{\\text{n_hashes}}
+```
+
+As a result, `collision_probability(hashfn, sim; n_hashes=N)` is the same as `collision_probability(hashfn, sim; n_hashes=1).^N`. If `n_hashes = :auto` then this function will select the number of hashes to be `n_hashes(hashfn)` (using the [`n_hashes`](@ref) function from the [`LSHFunction`](@ref) API).
+
+# Examples
+The probability that a single MinHash hash function causes a hash collision between inputs `A` and `B` is equal to `jaccard(A,B)`:
+
+```jldoctest; setup = :(using LSH)
+julia> hashfn = MinHash();
+
+julia> A = Set(["a", "b", "c"]);
+
+julia> B = Set(["b", "c", "d"]);
+
+julia> jaccard(A,B)
+0.5
+
+julia> collision_probability(hashfn, jaccard(A,B); n_hashes=1)
+0.5
+```
+
+If our [`MinHash`](@ref) struct keeps track of `N` hash functions simultaneously, then the probability of collision is `jaccard(A,B)^N`:
+
+```jldoctest; setup = :(using LSH)
+julia> hashfn = MinHash(10);
+
+julia> A = Set(["a", "b", "c"]);
+
+julia> B = Set(["b", "c", "d"]);
+
+julia> collision_probability(hashfn, jaccard(A,B)) ==
+       collision_probability(hashfn, jaccard(A,B); n_hashes=10) ==
+       collision_probability(hashfn, jaccard(A,B); n_hashes=1)^10
+true
+```
+
+See also: [`n_hashes`](@ref), [`similarity`](@ref)
+"""
+@generated function collision_probability(hashfn::LSHFunction, sim;
+                                          n_hashes::Union{Symbol,Integer} = :auto)
+
+    error_msg = :("n_hashes must be :auto or a positive Integer" |>
+                  ErrorException |>
+                  throw)
+
+    n_hashes = begin
+        if n_hashes <: Symbol
+            quote
+                if n_hashes != :auto
+                    $error_msg
+                end
+
+                n_hashes = _n_hashes(hashfn)
+            end
+        else
+            quote
+                if n_hashes ≤ 0
+                    $error_msg
+                end
+                nh = n_hashes
+            end
+        end
+    end
+
+    quote
+        $n_hashes
+        single_hash_collision_probability(hashfn, sim).^n_hashes
+    end
+end
+
+@doc """
+    collision_probability(hashfn::LSHFunction, x, y;
+                          n_hashes::Union{Symbol,Integer} = :auto)
+
+Computes the probability of a hash collision between two inputs `x` and `y` for a given hash function `hashfn`. This is the same as calling
+
+    collision_probability(hashfn, similarity(hashfn)(x,y); n_hashes=n_hashes)
+
+# Examples
+The following snippet computes the probability of collision between two sets `A` and `B` for a single MinHash. For MinHash, this probability is just equal to the Jaccard similarity between `A` and `B`.
+
+```jldoctest; setup = :(using LSH)
+julia> hashfn = MinHash();
+
+julia> A = Set(["a", "b", "c"]);
+
+julia> B = Set(["a", "b", "c"]);
+
+julia> similarity(hashfn) == jaccard
+true
+
+julia> collision_probability(hashfn, A, B) ==
+       collision_probability(hashfn, jaccard(A,B)) ==
+       jaccard(A,B)
+true
+```
+
+We can use the `n_hashes` argument to specify the probability that `n_hashes` MinHash hash functions simultaneously collide. If left unspecified, then we'll simply use `n_hashes(hashfn)` as the number of hash functions:
+
+```jldoctest; setup = :(using LSH)
+julia> hashfn = MinHash(10);
+
+julia> A = Set(["a", "b", "c"]);
+
+julia> B = Set(["a", "b", "c"]);
+
+julia> collision_probability(hashfn, A, B) ==
+       collision_probability(hashfn, A, B; n_hashes=10) ==
+       collision_probability(hashfn, A, B; n_hashes=1)^10
+true
+```
+"""
+collision_probability(hashfn::LSHFunction, A, B; kws...) =
+    collision_probability(hashfn, similarity(hashfn)(A,B); kws...)
+
 #=
 The following functions must be defined for all LSHFunction subtypes
 =#
@@ -129,6 +259,19 @@ julia> length(hashes)
 """
 function n_hashes end
 
+# Alias for n_hashes that's occasionally useful when we need to process
+# variables that are named n_hashes
+const _n_hashes = n_hashes
+
+# The function
+#
+#     single_hash_collision_probability(hashfn::H, sim)
+#
+# must be implemented for every subtype H of LSHFunction. Note that users don't
+# access this function directly; instead, they use the collision_probability
+# function exported by the LSH API.
+function single_hash_collision_probability end
+
 #========================
 SymmetricLSHFunction API
 ========================#

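The division of labour in this file (each hash type supplies a single-hash probability, and a shared wrapper resolves `n_hashes` and raises the result to that power) can be illustrated with a plain, non-generated function. This is only a sketch under stated assumptions: `SketchLSHFunction`, `ToyMinHash`, `sketch_n_hashes`, `sketch_single_prob`, and `sketch_collision_probability` are hypothetical names, and the run-time branch on `n_hashes` stands in for the `@generated` dispatch used in the commit.

```julia
# Hypothetical miniature of the API shape; none of these names exist in the LSH package.
abstract type SketchLSHFunction end

# A toy hash type that only records how many hash functions it holds.
struct ToyMinHash <: SketchLSHFunction
    n::Int
end

# Per-type hooks: the number of hashes and the single-hash collision probability.
sketch_n_hashes(h::ToyMinHash) = h.n
sketch_single_prob(::ToyMinHash, sim) = sim   # for MinHash this is the Jaccard similarity

# Shared wrapper: validate n_hashes, resolve :auto, then raise to the n-th power.
function sketch_collision_probability(hashfn::SketchLSHFunction, sim;
                                      n_hashes::Union{Symbol,Integer} = :auto)
    msg = "n_hashes must be :auto or a positive Integer"
    n = if n_hashes isa Symbol
        n_hashes == :auto || throw(ErrorException(msg))
        sketch_n_hashes(hashfn)
    else
        n_hashes > 0 || throw(ErrorException(msg))
        n_hashes
    end
    # Broadcasting lets sim be a scalar or a vector of similarities.
    sketch_single_prob(hashfn, sim) .^ n
end

h = ToyMinHash(10)
println(sketch_collision_probability(h, 0.5))                    # uses all 10 hashes
println(sketch_collision_probability(h, 0.5; n_hashes = 1))      # single hash
println(sketch_collision_probability(h, [0.5, 1.0]; n_hashes = 2))
```

Under this layout, a new hash type would only need to define the two per-type hooks; the validation and exponentiation live in one place.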
src/function_hashing/chebhash.jl

Lines changed: 2 additions & 2 deletions
@@ -83,8 +83,8 @@ hashtype(hashfn::ChebHash) =
     hashtype(hashfn.discrete_hashfn)
 
 # TODO: this may not be true
-single_hash_collision_probability(hashfn::ChebHash, args...; kws...) =
-    single_hash_collision_probability(hashfn.discrete_hashfn, args...; kws...)
+collision_probability(hashfn::ChebHash, args...; kws...) =
+    collision_probability(hashfn.discrete_hashfn, args...; kws...)
 
 #===============
 Hash computation

src/function_hashing/monte_carlo.jl

Lines changed: 2 additions & 2 deletions
@@ -82,8 +82,8 @@ n_hashes(hashfn::MonteCarloHash) =
     n_hashes(hashfn.discrete_hashfn)
 
 # TODO: this may not be true
-single_hash_collision_probability(hashfn::MonteCarloHash, args...; kws...) =
-    single_hash_collision_probability(hashfn.discrete_hashfn, args...; kws...)
+collision_probability(hashfn::MonteCarloHash, args...; kws...) =
+    collision_probability(hashfn.discrete_hashfn, args...; kws...)
 
 #========================
 SymmetricLSHFunction API compliance

src/hashes/lphash.jl

Lines changed: 10 additions & 8 deletions
@@ -183,18 +183,20 @@ LSHFunction and SymmetricLSHFunction API compliance
 n_hashes(h::LpHash) = length(h.shift)
 hashtype(::LpHash) = Int32
 
-# See Section 3.2 of the reference
+# See Section 3.2 of the reference paper
 function single_hash_collision_probability(hashfn::LpHash, sim::Real)
+    ### Compute the collision probability for a single hash function
     distr, r = hashfn.distr, hashfn.r
-    integral, err = quadgk(x -> pdf(distr, x/sim) * (1 - x/r), 0, r, rtol=1e-5)
-    integral /= sim
+    integral, err = quadgk(x -> pdf(distr, x/sim) * (1 - x/r),
+                           0, r, rtol=1e-5)
+    integral = integral ./ sim
 
     # Note that from the reference for the L^p LSH family, we're supposed to
-    # integrate over the p.d.f. for the _absolute value_ of the underlying random
-    # variable, rather than the raw p.d.f.. Luckily, all of the distributions we
-    # have to deal with here are symmetric and centered at zero, so all we have
-    # to do is multiply the integral by two.
-    integral *= 2
+    # integrate over the p.d.f. for the _absolute value_ of the underlying
+    # random variable, rather than the raw p.d.f. Luckily, all of the
+    # distributions we have to deal with here are symmetric and centered at
+    # zero, so all we have to do is multiply the integral by two.
+    single_hash_prob = integral .* 2
 end
 
 function similarity(hashfn::LpHash)

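Read directly off the `LpHash` code in this hunk, the quantity returned can be written as a single formula. The symbols are introduced here for illustration only: $s$ stands for `sim`, $f$ for the density `pdf(distr, ·)`, and $r$ for `hashfn.r`; the factor of two assumes, as the code comment states, that $f$ is symmetric about zero.

```math
P(s) \;=\; \frac{2}{s} \int_0^{r} f\!\left(\frac{x}{s}\right) \left(1 - \frac{x}{r}\right) \, dx
```

The $1/s$ factor corresponds to the line `integral = integral ./ sim`, and the leading $2$ to `integral .* 2`.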
src/hashes/simhash.jl

Lines changed: 2 additions & 1 deletion
@@ -117,7 +117,8 @@ LSHFunction and SymmetricLSHFunction API compliance
 hashtype(::SimHash) = Bool
 n_hashes(hashfn::SimHash) = size(hashfn.coeff, 2)
 similarity(::SimHash) = cossim
-single_hash_collision_probability(::SimHash, sim::Real) = (1 - acos(sim) / π)
+single_hash_collision_probability(::SimHash, sim::Real) =
+    @. (1 - acos(sim) / π)
 
 ### Hash computation
 

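The SimHash expression in this hunk maps a cosine similarity to a collision probability, and the `@.` macro turns every operation into its broadcast form. A standalone sketch of that formula follows; `simhash_probability_sketch` is a hypothetical name, and dropping the `::Real` annotation so that vectors are accepted is an assumption made for this example, not something the commit does.

```julia
# Hypothetical standalone version of the SimHash formula; not the package's method.
# @. rewrites the expression as 1 .- acos.(sim) ./ π, so scalars and arrays both work.
simhash_probability_sketch(sim) = @. 1 - acos(sim) / π

# Orthogonal inputs (cosine similarity 0) collide about half the time.
p_orthogonal = simhash_probability_sketch(0.0)

# Opposite, orthogonal, and identical directions in one call.
p_vector = simhash_probability_sketch([-1.0, 0.0, 1.0])

println(p_orthogonal)
println(p_vector)
```

The probability rises monotonically from 0 for opposite directions to 1 for identical ones, since `acos` is decreasing on `[-1, 1]`.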
src/similarities.jl

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -395,8 +395,8 @@ Compute the order-``p`` Wasserstein distance between two probability distributio
395395
- `p::Real`: the order of Wasserstein distance to compute.
396396
"""
397397
function wasserstein_1d(f, g, p::Real)
398-
# For one-dimensional probability distributions, the Wasserstein distance has the
399-
# closed form
398+
# For one-dimensional probability distributions, the Wasserstein distance has
399+
# the closed form
400400
#
401401
# ∫_0^1 |F^{-1}(x) - G^{-1}(x)|^p dx
402402
#

test/function_hashing/test_chebhash.jl

Lines changed: 2 additions & 2 deletions
@@ -72,7 +72,7 @@ Tests
 
         sim = cossim(f, g, interval)
         hf, hg = hashfn(f), hashfn(g)
-        prob = LSH.single_hash_collision_probability(hashfn, sim)
+        prob = collision_probability(hashfn, sim; n_hashes=1)
 
         prob - 0.05 ≤ mean(hf .== hg) ≤ prob + 0.05
     end
@@ -123,7 +123,7 @@ Tests
 
         sim = L2(f, g, interval)
         hf, hg = hashfn(f), hashfn(g)
-        prob = LSH.single_hash_collision_probability(hashfn, sim)
+        prob = collision_probability(hashfn, sim; n_hashes=1)
 
         prob - 0.05 ≤ mean(hf .== hg) ≤ prob + 0.05
     end
