Commit fa0994c

Separate documentation for different similarities and related hash functions into their own pages. Improve documentation across the board for various functions and LSHFunction subtypes. In addition, add DEFAULT_* constants to define default values of common arguments used by LSHFunction subtypes. Finally, add utilities to LSHBase.jl that generate docstrings for common arguments and keyword parameters.
Squashed commit of the following:

commit cdaacc5
Author: kernelmethod <17100608+kernelmethod@users.noreply.github.com>
Date: Fri Jan 17 16:03:11 2020 -0700

    Remove old MIPSHash documentation.

commit 64596cd
Author: kernelmethod <17100608+kernelmethod@users.noreply.github.com>
Date: Fri Jan 17 16:00:48 2020 -0700

    Fix SignALSH docs. Add improved documentation for MIPSHash, and fix an issue where MIPSHash doesn't raise an error when the input exceeds maxnorm.

commit 0e5f506
Author: kernelmethod <17100608+kernelmethod@users.noreply.github.com>
Date: Fri Jan 17 15:34:03 2020 -0700

    Add improved documentation for SignALSH.

commit 653c5cc
Author: kernelmethod <17100608+kernelmethod@users.noreply.github.com>
Date: Fri Jan 17 15:13:24 2020 -0700

    Ensure that only documentation for inner_prod(::AbstractVector, ::AbstractVector) is shown on the page for inner product similarity.

commit a2caca8
Author: kernelmethod <17100608+kernelmethod@users.noreply.github.com>
Date: Fri Jan 17 15:11:55 2020 -0700

    Add DEFAULT_* arguments to specify default values for common keyword arguments used in the LSH module. Add some functions to automatically generate docstrings for those arguments.

commit 8954625
Author: kernelmethod <17100608+kernelmethod@users.noreply.github.com>
Date: Fri Jan 17 14:32:05 2020 -0700

    Add missing docstring for SimHash.

commit 5c575e9
Author: kernelmethod <17100608+kernelmethod@users.noreply.github.com>
Date: Fri Jan 17 14:22:28 2020 -0700

    Reformat reference cited in MinHash documentation.

commit 2f8c9ef
Author: kernelmethod <17100608+kernelmethod@users.noreply.github.com>
Date: Fri Jan 17 14:21:21 2020 -0700

    Fix some docstrings for l^p / L^p distance.

commit 003ed87
Author: kernelmethod <17100608+kernelmethod@users.noreply.github.com>
Date: Fri Jan 17 14:01:11 2020 -0700

    Add some refs for l^p and L^p distances in similarities.jl.

commit a528d43
Author: kernelmethod <17100608+kernelmethod@users.noreply.github.com>
Date: Fri Jan 17 13:49:16 2020 -0700

    Generate docs for L1Hash and L2Hash.

commit 6f261a9
Author: kernelmethod <17100608+kernelmethod@users.noreply.github.com>
Date: Fri Jan 17 13:19:38 2020 -0700

    Use names of hash functions (instead of just a section titled 'Hash Functions') on the pages for each similarity function.

commit 7d4cf5f
Author: kernelmethod <17100608+kernelmethod@users.noreply.github.com>
Date: Fri Jan 17 13:10:51 2020 -0700

    Separate similarity functions so that they have their own pages.

commit 277ed49
Author: kernelmethod <17100608+kernelmethod@users.noreply.github.com>
Date: Fri Jan 17 13:01:15 2020 -0700

    Remove information about other similarity search techniques from the landing page of the docs.
1 parent f85536b commit fa0994c

15 files changed: +299 -131 lines changed

docs/make.jl

Lines changed: 6 additions & 1 deletion
@@ -12,7 +12,12 @@ makedocs(
     format = Documenter.HTML(),
     modules = [LSH],
     pages = ["Home" => "index.md",
-             "Similarity functions" => "similarities.md"]
+             "Similarity functions" => [
+                 "Cosine similarity" => joinpath("similarities", "cosine.md"),
+                 raw"``\ell^p`` distance" => joinpath("similarities", "lp_distance.md"),
+                 "Jaccard similarity" => joinpath("similarities", "jaccard.md"),
+                 "Inner product similarity" => joinpath("similarities", "inner_prod.md")]
+            ]
 )

 deploydocs(

docs/src/index.md

Lines changed: 2 additions & 6 deletions
@@ -16,12 +16,6 @@ Broadly, there are two computational issues with this approach:
 - First, the database may be massive, much larger than could possibly fit in memory. This would make the brute-force approach of computing ``s(x,y)`` for every point ``y`` in the database far too expensive to be practical.
 - Second, the dimensionality of the data may be such that computing ``s(x,y)`` is itself expensive. In addition, the similarity function itself may simply be intrinsically difficult to compute. For instance, calculating Wasserstein distance entails solving a very high-dimensional linear program.

-In order to solve these problems, researchers have over time developed a variety of techniques to accelerate similarity search:
-
-- [``k``-d trees](https://en.wikipedia.org/wiki/K-d_tree)
-- [Ball trees](https://en.wikipedia.org/wiki/Ball_tree)
-- Data reduction techniques
-
 ## Locality-sensitive hashing
 *Locality-sensitive hashing* (LSH) is a technique for accelerating similarity search that works by using a hash function on the query point ``x`` and limiting similarity search to only those points in the database that experience a hash collision with ``x``. The hash functions that are used are randomly generated from a family of *locality-sensitive hash functions*. These hash functions have the property that ``Pr[h(x) = h(y)]`` (i.e., the probability of a hash collision) increases the more similar that ``x`` and ``y`` are.

@@ -34,5 +28,7 @@ LSH.jl is a package that provides definitions of locality-sensitive hash functio
 - Inner product (`inner_prod`)
 - Function-space hashes (`L1`, `L2`, and `cossim`)

+## Contents
+
 ```@contents
 ```
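
The collision property described in the index.md context lines above can be illustrated with a small sketch. This is not the LSH.jl implementation: it is a plain-Julia random-hyperplane sign hash for cosine similarity, and the names `make_sign_hash` and `collision_rate` are hypothetical helpers introduced only for this illustration.

```julia
using Random

# Hypothetical helper (not part of LSH.jl): build `n_hashes` random-hyperplane
# sign hashes for vectors of length `dim`.
function make_sign_hash(rng::AbstractRNG, dim::Integer, n_hashes::Integer)
    planes = randn(rng, n_hashes, dim)
    return x -> (planes * x) .>= 0   # one Boolean hash per hyperplane
end

# Fraction of the hash functions on which x and y collide; this is an
# empirical estimate of Pr[h(x) = h(y)].
collision_rate(h, x, y) = count(h(x) .== h(y)) / length(h(x))

rng = MersenneTwister(0)
h = make_sign_hash(rng, 16, 4096)

x = randn(rng, 16)
near = x .+ 0.1 .* randn(rng, 16)   # small perturbation of x
far = -x                            # cosine similarity of -1 with x

println("near: ", collision_rate(h, x, near))
println("far:  ", collision_rate(h, x, far))
```

Under these assumptions, the estimated collision rate for `near` should come out higher than the one for `far`, which is the behavior a locality-sensitive family is meant to have.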

docs/src/similarities.md

Lines changed: 0 additions & 29 deletions
This file was deleted.

docs/src/similarities/cosine.md

Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
+# Cosine similarity
+
+## SimHash
+
+```@docs
+SimHash
+```
+
+## Utilities
+
+```@docs
+cossim
+```

docs/src/similarities/inner_prod.md

Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
+# Inner product similarity
+
+## SignALSH
+
+```@docs
+SignALSH
+```
+
+## MIPSHash
+
+```@docs
+MIPSHash
+```
+
+## Utilities
+
+```@docs
+inner_prod(::AbstractVector, ::AbstractVector)
+```

docs/src/similarities/jaccard.md

Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
+# Jaccard similarity
+
+## MinHash
+
+```@docs
+MinHash
+```
+
+## Utilities
+
+```@docs
+jaccard
+```

docs/src/similarities/lp_distance.md

Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
+# ``\ell^p`` distance
+
+## L1Hash and L2Hash
+
+```@docs
+L1Hash
+L2Hash
+```
+
+## Utility functions
+
+```@docs
+ℓp
+ℓp_norm
+Lp(::AbstractVector{T}, ::AbstractVector, ::Real) where T
+Lp_norm(::AbstractVector, ::Real)
+```

src/LSHBase.jl

Lines changed: 21 additions & 1 deletion
@@ -12,7 +12,10 @@ Global variables and constants
 # been associated with hash functions via the register_similarity! macro.
 const available_similarities = Set()

-available_similarities_as_strings() = available_similarities .|> string |> sort
+# Defaults to use for common arguments
+const DEFAULT_N_HASHES = 1
+const DEFAULT_DTYPE = Float32
+const DEFAULT_RESIZE_POW2 = false

 #========================
 Abstract typedefs
@@ -64,3 +67,20 @@ The following functions must be defined for all AsymmetricLSHFunction subtypes
 =#
 function index_hash end
 function query_hash end
+
+#========================
+Documentation utilities
+========================#
+
+available_similarities_as_strings() = available_similarities .|> string |> sort
+
+### Docstring generators for common keyword arguments
+N_HASHES_DOCSTR(; default = DEFAULT_N_HASHES) = """
+`n_hashes::Integer` (default: `$(default)`): the number of hash functions to generate."""
+
+DTYPE_DOCSTR(hashfn; default = DEFAULT_DTYPE) = """
+`dtype::DataType` (default: `$(default)`): the data type to use in the $(hashfn) internals. For performance reasons you should pick `dtype` to match the type of the data you're hashing."""
+
+RESIZE_POW2_DOCSTR(hashfn; default = DEFAULT_RESIZE_POW2) = """
+`resize_pow2::Bool` (default: `$(default)`): affects the way in which the returned `$(hashfn)` resizes to hash inputs of different sizes. If you think you'll be hashing inputs of many different sizes, it's more efficient to set `resize_pow2 = true`."""
+
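
A minimal sketch of how one of these docstring generators might be consumed. `DEFAULT_N_HASHES` and `N_HASHES_DOCSTR` are copied from the diff above; `MyHash` is a hypothetical stand-in for an LSHFunction subtype and does not exist in the package.

```julia
# Default shared by every constructor (mirrors DEFAULT_N_HASHES in the diff).
const DEFAULT_N_HASHES = 1

# Docstring generator, as defined in the diff above.
N_HASHES_DOCSTR(; default = DEFAULT_N_HASHES) = """
`n_hashes::Integer` (default: `$(default)`): the number of hash functions to generate."""

# `MyHash` is hypothetical. Its docstring interpolates both the default value
# and the generated argument description, so the text cannot drift from the code.
"""
    MyHash(n_hashes::Integer = $(DEFAULT_N_HASHES))

# Arguments
- $(N_HASHES_DOCSTR())
"""
struct MyHash
    n_hashes::Int
end
MyHash() = MyHash(DEFAULT_N_HASHES)

println(N_HASHES_DOCSTR())
println(N_HASHES_DOCSTR(default = 8))
```

Keeping the default in one constant means the constructor signature, the docstring signature, and the argument description all change together when the default changes.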

src/hashes/lphash.jl

Lines changed: 34 additions & 36 deletions
@@ -52,10 +52,10 @@ end

 ### External LpHash constructors

-function LpHash{T}(n_hashes::Integer = 1;
+function LpHash{T}(n_hashes::Integer = DEFAULT_N_HASHES;
                    r::Real = T(1.0),
                    power::Integer = 2,
-                   resize_pow2::Bool = false) where {T <: Union{Float32,Float64}}
+                   resize_pow2::Bool = DEFAULT_RESIZE_POW2) where {T <: Union{Float32,Float64}}

     coeff = Matrix{T}(undef, n_hashes, 0)
     shift = rand(T, n_hashes)
@@ -77,53 +77,53 @@ L1Hash(args...; kws...) where {T} = LpHash(args...; power = 1, kws...)

 L2Hash(args...; kws...) where {T} = LpHash(args...; power = 2, kws...)

-LpHash(args...; dtype::DataType = Float32, kws...) =
+LpHash(args...; dtype::DataType = DEFAULT_DTYPE, kws...) =
     LpHash{dtype}(args...; kws...)

-# Documentation for L1Hash and L2Hash
-@doc raw"""
-    L1Hash(n_hashes::Integer = 1;
-           dtype::DataType = Float32,
-           r::Real = 1.0,
-           resize_pow2::Bool = false)
-    L2Hash(n_hashes::Integer = 1;
-           dtype::DataType = Float32,
-           r::Real = 1.0,
-           resize_pow2::Bool = false)
+### Documentation for L1Hash and L2Hash
+for (hashfn, power) in zip((:L1Hash, :L2Hash), (1, 2))
+    sim = "$(power)"
+    equation = (power == 1) ?
+        "\\|x - y\\|_$(power) = \\sum_i |x_i - y_i|" :
+        "\\|x - y\\|_$(power) = \\left(\\sum_i |x_i - y_i|^$(power)\\right)^{1/$(power)}"

-Constructs a locality-sensitive hash for ``\ell^p`` distance (``\|x - y\|_p``). `L1Hash` constructs a hash function for ``\ell^1`` distance, and `L2Hash` constructs a hash function for ``\ell^2`` distance.
+    quote
+        @doc """
+    $($hashfn)(
+        n_hashes::Integer = $(DEFAULT_N_HASHES);
+        dtype::DataType = $(DEFAULT_DTYPE),
+        r::Real = 1.0,
+        resize_pow2::Bool = $(DEFAULT_RESIZE_POW2)
+    )
+
+Constructs a locality-sensitive hash for ``\\ell^$($power)`` distance (``\\|x - y\\|_$($power)``), defined as
+
+``$($equation)``

 # Arguments
-- `n_hashes::Integer` (default: `1`): the number of hash functions to generate.
+- $(N_HASHES_DOCSTR())

 # Keyword parameters
-- `dtype::DataType` (default: `Float32`): the type to use for the resulting `LSH.LpHash`'s coefficients. Can be either `Float32` or `Float64`. You generally want to pick `dtype` to match the type of the data you're hashing.
+- $(DTYPE_DOCSTR($hashfn))
 - `r::Real` (default: `1.0`): a positive coefficient whose magnitude influences the collision rate. Larger values of `r` will increase the collision rate, even for distant points. See references for more information.
-- `resize_pow2::Bool` (default: `false`): affects the way in which the `LSH.LpHash` struct resizes to hash inputs of different sizes. If you think you'll be hashing inputs of many different sizes, it's more efficient to set `resize_pow2 = true`.
+- $(RESIZE_POW2_DOCSTR($hashfn))

 # Examples
-Construct an `LSH.LpHash` by calling `L1Hash` or `L2Hash` with the number of hash functions you want to generate:
+Construct an `$($hashfn)` with the number of hash functions you want to generate:

 ```jldoctest; setup = :(using LSH)
-julia> hashfn = L1Hash();
-
-julia> hashfn.power == 1 &&
-       n_hashes(hashfn) == 1 &&
-       similarity(hashfn) == ℓ1
-true
-
-julia> hashfn = L2Hash(128);
+julia> hashfn = $($hashfn)(128);

-julia> hashfn.power == 2 &&
+julia> hashfn.power == $($power) &&
        n_hashes(hashfn) == 128 &&
-       similarity(hashfn) == ℓ2
+       similarity(hashfn) == $($sim)
 true
 ```

 After creating a hash function, you can compute hashes with `hashfn(x)`:

 ```jldoctest; setup = :(using LSH)
-julia> hashfn = L1Hash(20);
+julia> hashfn = $($hashfn)(20);

 julia> x = rand(4);

@@ -133,14 +133,12 @@ julia> hashes = hashfn(x);

 # References

-```
-Datar, Mayur & Indyk, Piotr & Immorlica, Nicole & Mirrokni, Vahab. (2004). Locality-sensitive hashing scheme based on p-stable distributions. Proceedings of the Annual Symposium on Computational Geometry. 10.1145/997817.997857.
-```
+- Datar, Mayur & Indyk, Piotr & Immorlica, Nicole & Mirrokni, Vahab. (2004). *Locality-sensitive hashing scheme based on p-stable distributions*. Proceedings of the Annual Symposium on Computational Geometry. 10.1145/997817.997857.

-See also: [`ℓp`](@ref), [`ℓ1`](@ref), [`ℓ2`](@ref)
-""" L1Hash
-
-@doc (@doc L1Hash) L2Hash
+See also: [`$($sim)`](@ref ℓp)
+""" $hashfn
+    end |> eval
+end

 #========================
 Helper functions for LpHash
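
The loop-generated documentation in this file can be hard to follow inside a diff, so here is a reduced sketch of the same idea. It uses `@eval` with a precomputed string rather than the `quote ... end |> eval` form in the commit, and `l1_demo` / `l2_demo` are hypothetical functions invented for this illustration, not part of LSH.jl.

```julia
# Generate two documented functions from one template, one per value of `power`.
for (fn, power) in zip((:l1_demo, :l2_demo), (1, 2))
    # Build the docstring text at loop time; `\\ell` yields a literal \ell.
    text = "    $(fn)(x, y)\n\nDemo ``\\ell^$(power)`` distance between `x` and `y`."

    @eval begin
        # Define the function first, then attach the generated docstring to it.
        $fn(x, y) = sum(abs.(x .- y) .^ $power)^(1 / $power)
        @doc $text $fn
    end
end

println(l1_demo([0, 0], [3, 4]))
println(l2_demo([0, 0], [3, 4]))
```

Each iteration produces its own docstring, which is the property the commit relies on to replace the single docstring previously shared by `L1Hash` and `L2Hash`.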

src/hashes/minhash.jl

Lines changed: 4 additions & 6 deletions
@@ -24,14 +24,14 @@ struct MinHash{T, I <: Union{UInt32,UInt64}} <: SymmetricLSHFunction
 end

 """
-    MinHash(n_hashes::Integer = 1;
+    MinHash(n_hashes::Integer = $(DEFAULT_N_HASHES);
             dtype::DataType = Any,
             symbols::Union{Vector,Set} = Set())

 Construct a locality-sensitive hash function for Jaccard similarity.

 # Arguments
-- `n_hashes::Integer` (default: `1`): the number of hash functions to generate.
+- $(N_HASHES_DOCSTR())

 # Keyword parameters
 - `dtype::DataType` (default: `Any`): the type of symbols in the sets you're hashing. This is overriden by the data type contained in `symbols` when `symbols` is non-empty.
@@ -75,9 +75,7 @@ julia> hashfn(Set(["a", "b", "c"]));
 ```

 # References
-```
-Broder, A. "On the resemblance and containment of documents". Compression and Complexity of Sequences: Proceedings, Positano, Amalfitan Coast, Salerno, Italy, June 11-13, 1997. doi:10.1109/SEQUEN.1997.666900.
-```
+- Broder, A. *On the resemblance and containment of documents*. Compression and Complexity of Sequences: Proceedings, Positano, Amalfitan Coast, Salerno, Italy, June 11-13, 1997. doi:10.1109/SEQUEN.1997.666900.

 See also: [`jaccard`](@ref)
 """
@@ -92,7 +90,7 @@ function MinHash(args...;
     end
 end

-function MinHash{T}(n_hashes::Integer = 1;
+function MinHash{T}(n_hashes::Integer = DEFAULT_N_HASHES;
                     symbols::C = Set{T}()) where {T, C <: Union{Vector{<:T},Set{<:T}}}

     fixed_symbols = (length(symbols) > 0)
