# FAQ

## Why do we compute multiple hashes for every input?
In a traditional hash table data structure, you have a single hash function (e.g. MurmurHash) that you use to convert inputs into hashes, which you can then use to index into the table. With LSH, you randomly generate multiple hash functions from a single LSH family. To index into the hash table, you apply each of those hash functions to your input and concatenate the computed hashes together. The concatenated hashes form the key you use to index into the hash table.
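
As a rough sketch of that indexing scheme, the snippet below builds a dictionary keyed by the concatenation of 20 single-bit [`SimHash`](@ref) hashes. The `table` dictionary and the way it's filled are only an illustration of the idea, not how the package implements its hash tables internally:

```julia
using LSH

# Twenty independent single-bit hash functions drawn from the same LSH family
hashfns = [SimHash() for _ in 1:20]

# Illustrative table mapping a concatenated hash (the key) to the indices of
# the data points that fall into that bucket
table = Dict{BitVector,Vector{Int}}()

data = rand(10, 100)   # each column is a data point

for (i, x) in enumerate(eachcol(data))
    key = vcat((fn(x) for fn in hashfns)...)   # concatenate the 20 one-bit hashes
    push!(get!(table, key, Int[]), i)
end
```

In the rest of this page we simply call `SimHash(20)`, which bundles 20 such functions into a single hash function whose output is a length-20 `BitArray`.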

The reason for computing multiple hashes is that every LSH function provides (at most) only a few bits of additional information with which to partition the input space. For example, [`SimHash`](@ref) is a single-bit hash: that is, if you create `hashfn = SimHash()`, then `hashfn(x)` can only return either `BitArray([0])` or `BitArray([1])`. If you're trying to use `hashfn` to speed up similarity search, then the hash you compute will -- *at best* -- reduce the number of points you have to search through by only 50% on average.

In fact, the situation can be much more dire than that. If your data are highly structured, it is likely that each of your hashes will place data points into a tiny handful of buckets -- even just one bucket. For instance, in the snippet below we have a dataset of 100 points that all have very high cosine similarity with one another. If we only create a single hash function when we call [`SimHash`](@ref), then it's very likely that all of the data points will have the same hash.

```jldoctest; setup = :(using LSH, Random; Random.seed!(0))
julia> hashfn = SimHash();

julia> data = ones(10, 100); # Each column is a data point

julia> data[end,1:end] .= rand(100); # Randomize the last dimension of each point

julia> hashes = map(x -> hashfn(x), eachcol(data));

julia> unique(hashes)
1-element Array{BitArray{1},1}:
 [0]
```

The solution to this is to generate multiple hash functions and combine the hashes we compute for an input into a single key. In the snippet below, we create 20 hash functions with [`SimHash`](@ref). Each hash computed in `map(x -> hashfn(x), eachcol(data))` is a length-20 `BitArray`.

```jldoctest; setup = :(using LSH, Random; Random.seed!(0))
julia> hashfn = SimHash(20);

julia> data = ones(10,100); # Each column is a data point

julia> data[end,1:end] .= rand(100); # Randomize the last dimension of each point

julia> hashes = map(x -> hashfn(x), eachcol(data));

julia> unique(hashes) |> length
3

julia> for uh in unique(hashes)
           println(sum(uh == h for h in hashes))
       end
72
16
12
```

Our hash function has generated 3 unique 20-bit hashes, with 72 points sharing the first hash, 16 points sharing the second, and 12 points sharing the third. That's not a great split, but it can still drastically reduce the size of the search space. For instance, the following benchmarks (on an Intel Core i7-8565U @ 1.80 GHz) suggest that the cost of computing [`SimHash`](@ref) on 10-dimensional data is about 34 times the cost of computing [`cossim`](@ref):

```
julia> using BenchmarkTools

julia> @benchmark(hashfn(x), setup=(x=rand(10)))
BenchmarkTools.Trial:
  memory estimate:  4.66 KiB
  allocs estimate:  6
  --------------
  minimum time:     612.231 ns (0.00% GC)
  median time:      1.563 μs (0.00% GC)
  mean time:        1.728 μs (17.60% GC)
  maximum time:     24.123 μs (92.03% GC)
  --------------
  samples:          10000
  evals/sample:     169

julia> @benchmark(cossim(x,y), setup=(x=rand(10);y=rand(10)))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     46.203 ns (0.00% GC)
  median time:      46.415 ns (0.00% GC)
  mean time:        47.467 ns (0.00% GC)
  maximum time:     160.076 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     988

julia> 1.563e-6 / 46.415e-9
33.67445868792416
```

So as long as [`SimHash`](@ref) reduces the size of the search space by at least 34 data points on average, it's faster than calculating the similarity between every pair of points. Even for our tiny dataset, which only has 100 points, that's still well worth it: with the 72/16/12 split that we got, [`SimHash`](@ref) reduces the number of similarities we have to calculate by ``100 - \left(\frac{72^2}{100} + \frac{16^2}{100} + \frac{12^2}{100}\right) \approx 44`` points on average.
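
To sanity-check that figure: a query point lands in a bucket of size ``b`` with probability ``b/100``, and must then be compared against all ``b`` points in that bucket, so the expected number of comparisons is the sum of ``b^2/100`` over the buckets. A quick back-of-the-envelope calculation (the variable names below are purely illustrative):

```julia
bucket_sizes = [72, 16, 12]

# A query lands in a bucket of size b with probability b/100, and is then
# compared against all b points in that bucket
expected_comparisons = sum(b^2 for b in bucket_sizes) / 100   # ≈ 55.8

comparisons_saved = 100 - expected_comparisons                # ≈ 44
```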

!!! info "Improving LSH partitioning"
    LSH can be poor at partitioning your input space when data points are very similar to one another. In these cases, it may be helpful to find ways to transform your data in order to reduce their similarity.

    For instance, in the example above, we created a synthetic dataset with the following code:

    ```julia
    julia> data = ones(10,100); # Each column is a data point

    julia> data[end,1:end] .= rand(100); # Randomize the last dimension of each point
    ```

    These data are, for all practical purposes, one-dimensional. Their first nine dimensions are all the same; only the last dimension provides any unique information about a given data point. As a result, a dimensionality reduction technique like principal component analysis (PCA) would help de-correlate the dimensions of the data and thereby reduce the cosine similarity between pairs of points.
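
    A minimal sketch of that idea, using simple mean-centering as a stand-in for full PCA (only the `Statistics` standard library and the [`cossim`](@ref) function are assumed): subtracting each dimension's mean zeroes out the nine constant dimensions, which lowers the average pairwise cosine similarity and gives hash functions like [`SimHash`](@ref) a better chance of splitting the points into balanced buckets.

    ```julia
    using LSH, Statistics

    data = ones(10, 100)
    data[end, :] .= rand(100)

    # Subtract each dimension's mean; the nine constant dimensions become zero,
    # so only the informative last dimension contributes to the hashes
    centered = data .- mean(data, dims=2)

    # Average cosine similarity over all distinct pairs of columns
    avg_sim(X) = mean(cossim(X[:,i], X[:,j]) for i in 1:100 for j in (i+1):100)

    avg_sim(data)       # very close to 1.0
    avg_sim(centered)   # much lower on average
    ```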