Commit 2d0a87a

Update the docs.

1 parent 39bae7e commit 2d0a87a

File tree: 4 files changed, +169 -4 lines changed

docs/README.md

Lines changed: 0 additions & 1 deletion
@@ -5,7 +5,6 @@ The module documentation is automatically built and updated whenever `master` is
 
 ```
 $ cd docs/
-$ julia make.jl
 $ julia --project=. --color=yes make.jl
 $ python3 -m http.server 8000
 ```

docs/src/faq.md

Lines changed: 1 addition & 1 deletion
@@ -82,7 +82,7 @@ julia> 1.563e-6 / 46.415e-9
 So as long as [`SimHash`](@ref) reduces the size of the search space by 34 data points on average, it's faster than calculating the similarity between every pair of points. Even for our tiny dataset, which only had 100 points, that's still well worth it: with the 72/16/12 split that we got, [`SimHash`](@ref) reduces the number of similarities we have to calculate by ``100 - \left(\frac{72^2}{100} + \frac{16^2}{100} + \frac{12^2}{100}\right) \approx 44`` points on average.
 
 !!! info "Improving LSH partitioning"
-    LSH can be poor at partitioning your input space when data points are very similar to one another. In these cases, it may be helpful to find ways to transform your data in order to reduce their similarity.
+    LSH can be poor at partitioning your input space when all of your data points are very similar to one another. In these cases, it may be helpful to find ways to transform your data in order to reduce their similarity.
 
 For instance, in the example above, we created a synthetic dataset with the following code:
 
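
For reference, the ``\approx 44`` figure in the context line above is a one-line check in Julia:

```julia
# Average number of similarity computations saved by the 72/16/12 split
100 - (72^2 + 16^2 + 12^2)/100   # = 44.16 ≈ 44
```
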
docs/src/similarities/cosine.md

Lines changed: 7 additions & 2 deletions
@@ -8,7 +8,9 @@
 
 Concretely, cosine similarity is computed as
 
-``\text{cossim}(x,y) = \frac{\left\langle x,y\right\rangle}{\|x\|\cdot\|y\|} = \left\langle\frac{x}{\|x\|},\frac{y}{\|y\|}\right\rangle``
+```math
+\text{cossim}(x,y) = \frac{\left\langle x,y\right\rangle}{\|x\|\cdot\|y\|} = \left\langle\frac{x}{\|x\|},\frac{y}{\|y\|}\right\rangle
+```
 
 where ``\left\langle\cdot,\cdot\right\rangle`` is an inner product (e.g., dot product) and ``\|\cdot\|`` is the norm derived from that inner product. ``\text{cossim}(x,y)`` goes from ``-1`` to ``1``, where ``-1`` corresponds to low similarity and ``1`` corresponds to high similarity. To calculate cosine similarity, you can use the [`cossim`](@ref) function exported from the `LSH` module:
 
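
As a quick sketch of the definition above: `cossim` is just the normalized inner product, which you can check against `LinearAlgebra`'s `dot` and `norm`:

```julia
using LSHFunctions, LinearAlgebra

x = [1.0, 2.0, 3.0]; y = [4.0, 5.0, 6.0]

# cossim(x, y) is the inner product of x and y after normalizing both
cossim(x, y) ≈ dot(x, y) / (norm(x) * norm(y))   # true
```
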
@@ -96,12 +98,15 @@ julia> length(hashes)
 
 The probability of a hash collision (for a single hash) is
 
-``Pr[h(x) = h(y)] = 1 - \frac{\theta}{\pi}``
+```math
+Pr[h(x) = h(y)] = 1 - \frac{\theta}{\pi}
+```
 
 where ``\theta = \text{arccos}(\text{cossim}(x,y))`` is the angle between ``x`` and ``y``. This collision probability is shown in the plot below.
 
 ```@eval
 using PyPlot, LSHFunctions
+fig = figure()
 hashfn = SimHash()
 x = range(-1, 1; length=1024)
 y = [LSHFunctions.single_hash_collision_probability(hashfn, xii) for xii in x]
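
To make the collision probability concrete, here is a small worked sketch using only the exported `cossim`: two vectors at a 45° angle should collide on three out of four hashes on average.

```julia
using LSHFunctions

x = [1.0, 0.0]; y = [1.0, 1.0]   # vectors π/4 radians (45°) apart

θ = acos(cossim(x, y))           # θ = π/4
1 - θ/π                          # ≈ 0.75
```
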

docs/src/similarities/lp_distance.md

Lines changed: 161 additions & 0 deletions
@@ -2,3 +2,164 @@
 
 !!! warning "Under construction"
     This section is currently being developed. If you're interested in helping write this section, feel free to [open a pull request](https://github.com/kernelmethod/LSHFunctions.jl/pulls); otherwise, please check back later.
+
+## Definition
+``\ell^p`` distance is a generalization of our usual notion of distance between a pair of points. If you're not familiar with it, you can think of it as a generalization of the Pythagorean theorem: if we have two points ``(a_1,b_1)`` and ``(a_2,b_2)``, then the distance between them is
+
+```math
+\text{distance} = \sqrt{(a_1 - a_2)^2 + (b_1 - b_2)^2}
+```
+
+This is known as the *``\ell^2`` distance* (or Euclidean distance) between ``(a_1,b_1)`` and ``(a_2,b_2)``. In higher dimensions, the ``\ell^2`` distance between the points ``x = (x_1,\ldots,x_n)`` and ``y = (y_1,\ldots,y_n)`` is denoted as ``\|x - y\|_{\ell^2}`` (since ``\ell^2`` distance, and, for that matter, all ``\ell^p`` distances of order ``\ge 1``, are [norms](https://en.wikipedia.org/wiki/Norm_(mathematics))) and defined as[^1]
+
+```math
+\|x - y\|_{\ell^2} = \sqrt{\sum_{i=1}^n \left|x_i - y_i\right|^2}
+```
+
+More generally, the ``\ell^p`` distance between the two length-``n`` vectors ``x`` and ``y`` is given by
+
+```math
+\|x - y\|_{\ell^p} = \left(\sum_{i=1}^n \left|x_i - y_i\right|^p\right)^{1/p}
+```
+
+In the LSHFunctions module, you can calculate the ``\ell^p`` distance between two points using the function [`ℓp`](@ref). The functions [`ℓ1`](@ref ℓp) and [`ℓ2`](@ref ℓp) are also defined for ``\ell^1`` and ``\ell^2`` distance, respectively, since they're so commonly used:
+
+```jldoctest
+julia> using LSHFunctions;
+
+julia> x = [1, 2, 3]; y = [4, 5, 6];
+
+julia> ℓ1(x,y) == ℓp(x,y,1) == abs(1-4) + abs(2-5) + abs(3-6)
+true
+
+julia> ℓ2(x,y) == ℓp(x,y,2) == √(abs(1-4)^2 + abs(2-5)^2 + abs(3-6)^2)
+true
+```
+
+You can also compute the ``\ell^p``-norm of a vector (``\|x\|_{\ell^p}``, or equivalently ``\|x - 0\|_{\ell^p}``) by calling [`ℓ1_norm`](@ref ℓp_norm), [`ℓ2_norm`](@ref ℓp_norm), or [`ℓp_norm`](@ref):
+
+```jldoctest; setup = :(using LSHFunctions)
+julia> x = [1, 2, 3];
+
+julia> ℓ1_norm(x) == ℓ1(x,zero(x))
+true
+
+julia> ℓ2_norm(x) == ℓ2(x,zero(x))
+true
+
+julia> ℓp_norm(x,2.2) == ℓp(x,zero(x),2.2)
+true
+```
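
The general ``\ell^p`` formula translates almost literally into Julia. As a sketch of the definition only (`lp_distance` here is a hypothetical helper, not LSHFunctions' implementation):

```julia
# Direct transcription of the ℓ^p distance formula above
lp_distance(x, y, p) = sum(abs.(x .- y).^p)^(1/p)

lp_distance([1, 2, 3], [4, 5, 6], 1)   # 9.0, matching ℓ1(x,y)
lp_distance([1, 2, 3], [4, 5, 6], 2)   # √27 ≈ 5.196, matching ℓ2(x,y)
```
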
+
+## `LpHash`
+This module defines [`L1Hash`](@ref L1Hash) and [`L2Hash`](@ref L2Hash) to hash vectors on their ``\ell^1`` and ``\ell^2`` distances. It is based on Datar et al. (2004)[^Datar04], who use the notion of a [*``p``-stable distribution*](https://en.wikipedia.org/wiki/Stable_distribution) to construct their hash function. Such distributions exist for all ``p`` such that ``0 < p \le 2``; the LSH family of Datar et al. (2004)[^Datar04] is able to hash vectors on their ``\ell^p`` distance for all ``p`` in this range.
+
+!!! info "Limitations on p"
+    The LSHFunctions package currently only supports hashing ``\ell^p`` distances of order ``p = 1`` and ``p = 2`` due to some additional complexity involved with sampling ``p``-stable distributions of different orders. This problem has been filed under [issue #18](https://github.com/kernelmethod/LSHFunctions.jl/issues/18).
+
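
For intuition: Datar et al. hash a vector by projecting it onto a random direction whose entries are drawn from a ``p``-stable distribution, shifting, and bucketing, i.e. ``h(x) = \lfloor (\langle a, x \rangle + b)/r \rfloor``. A minimal sketch of that construction for ``p = 2`` (the standard Gaussian is 2-stable) — this mirrors the paper's scheme, not the internals of LSHFunctions:

```julia
# Sketch of a single Datar et al. hash for p = 2; Gaussian entries are
# 2-stable, and r is the bucket width.
struct SketchL2Hash
    a::Vector{Float64}   # random projection direction, entries ~ N(0,1)
    b::Float64           # random shift, uniform on [0, r)
    r::Float64           # bucket width
end

SketchL2Hash(dim::Integer, r::Real) = SketchL2Hash(randn(dim), rand() * r, r)

# Project, shift, and bucket: nearby points tend to land in the same bucket
(h::SketchL2Hash)(x) = floor(Int, (h.a' * x + h.b) / h.r)

h = SketchL2Hash(10, 1.0)
h(rand(10))   # a signed integer hash value
```
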
+### Using [`L1Hash`](@ref) and [`L2Hash`](@ref)
+You can construct hash functions for ``\ell^1`` distance and ``\ell^2`` distance using [`L1Hash`](@ref) and [`L2Hash`](@ref), respectively:
+
+```jldoctest; setup = :(using LSHFunctions)
+julia> hashfn = L1Hash();
+
+julia> n_hashes(hashfn)
+1
+
+julia> hashfn = L2Hash(10);
+
+julia> n_hashes(hashfn)
+10
+```
+
+To hash a vector, simply call `hashfn(x)`. Note that the hashes returned by an `LpHash` type such as [`L1Hash`](@ref) or [`L2Hash`](@ref) are signed integers:
+
+```jldoctest; setup = :(using LSHFunctions)
+julia> hashfn = L2Hash(128);
+
+julia> hashtype(hashfn)
+Int32
+
+julia> x = rand(20);
+
+julia> hashes = hashfn(x);
+
+julia> typeof(hashes)
+Array{Int32,1}
+```
+
+`L1Hash` and `L2Hash` support a keyword parameter called `scale`, which affects the collision probability: when `scale` is large, hash collisions are more likely (even between distant points); when `scale` is small, collisions are less likely (even between close points).
+
+```jldoctest; setup = :(using LSHFunctions, Random; Random.seed!(0))
+julia> x = rand(10); y = rand(10);
+
+julia> hashfn_1 = L1Hash(128; scale=0.1); # Small value of scale
+
+julia> n_collisions_1 = sum(hashfn_1(x) .== hashfn_1(y));
+
+julia> hashfn_2 = L1Hash(128; scale=10.); # Large value of scale
+
+julia> n_collisions_2 = sum(hashfn_2(x) .== hashfn_2(y));
+
+julia> n_collisions_2 > n_collisions_1
+true
+```
+
+Good values of `scale` will depend on your dataset. If your data points are very far apart, then you will likely want to choose a large value of `scale`; if they're tightly packed together, then a small value is generally better. You can use the [`collision_probability`](@ref) function to help you choose a good value of `scale`, as sketched below.
+
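
For instance, you can compare the single-hash collision probability at a representative distance for a few candidate scales (a sketch using the same `collision_probability(hashfn, distance)` call as the plot code further down):

```julia
using LSHFunctions

# Collision probability at a fixed distance ‖x - y‖ = 1.0 for a few
# candidate scales; larger scales make collisions more likely
for scale in (0.25, 1.0, 4.0)
    hashfn = L1Hash(; scale=scale)
    println("scale = ", scale, " => ", collision_probability(hashfn, 1.0))
end
```
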
+### Collision probability
+The probability that two vectors ``x`` and ``y`` collide under a hash function sampled from the `LpHash` family is
+
+```math
+Pr[h(x) = h(y)] = \int_0^r \frac{1}{c}f_p\left(\frac{t}{c}\right)\left(1 - \frac{t}{r}\right) \hspace{0.15cm} dt
+```
+
+where
+
+- ``r`` is the `scale` factor used by `LpHash`;
+- ``c = \|x - y\|_{\ell^p}``; and
+- ``f_p`` is the p.d.f. of the **absolute value** of the ``p``-stable distribution used to construct the hash.
+
+The most important ideas to take away from this equation are that the collision probability ``Pr[h(x) = h(y)]`` increases as `scale` increases (or, equivalently, as ``r`` increases), and that it decreases as ``\|x - y\|_{\ell^p}`` increases. The figure below visualizes the relationship between ``\ell^p`` distance and collision probability for ``p = 1`` (left) and ``p = 2`` (right).
+
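
The integral is straightforward to evaluate numerically. A minimal sketch for ``p = 2``, assuming (as in Datar et al.) that the 2-stable distribution is the standard Gaussian, so ``f_2`` is the half-normal density:

```julia
# Midpoint-rule approximation of the collision probability integral for
# p = 2, where f₂(t) = √(2/π)·exp(-t²/2) is the density of |N(0,1)|
f2(t) = sqrt(2/π) * exp(-t^2 / 2)

function collision_prob_l2(c, r; n=10_000)
    dt = r / n
    sum((1/c) * f2(t/c) * (1 - t/r) * dt for t in (dt/2):dt:r)
end

collision_prob_l2(1.0, 1.0)   # ≈ 0.37
collision_prob_l2(2.0, 1.0)   # smaller: more distant pairs collide less often
```
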
+```@eval
+using PyPlot, LSHFunctions
+fig, axes = subplots(1, 2, figsize=(12,6))
+rc("font", size=12)
+x = range(0, 3; length=256)
+
+for scale in (0.25, 1.0, 4.0)
+    l1_hashfn = L1Hash(; scale=scale)
+    l2_hashfn = L2Hash(; scale=scale)
+
+    y1 = [collision_probability(l1_hashfn, xii) for xii in x]
+    y2 = [collision_probability(l2_hashfn, xii) for xii in x]
+
+    axes[1].plot(x, y1, label="\$r = $scale\$")
+    axes[2].plot(x, y2, label="\$r = $scale\$")
+end
+
+axes[1].set_xlabel(raw"$\|x - y\|_{\ell^1}$", fontsize=20)
+axes[1].set_ylabel(raw"$Pr[h(x) = h(y)]$", fontsize=20)
+axes[2].set_xlabel(raw"$\|x - y\|_{\ell^2}$", fontsize=20)
+
+axes[1].set_title("Collision probability for L1Hash")
+axes[2].set_title("Collision probability for L2Hash")
+
+for ax in axes
+    ax.legend(fontsize=14)
+end
+
+savefig("lphash_collision_probability.svg")
+```
+
+![Probability of collision for L1Hash and L2Hash](lphash_collision_probability.svg)
+
+For further information about the collision probability, see Section 3.2 of the reference paper[^Datar04].
+
+### Footnotes
+
+[^1]: In general, ``x`` and ``y`` are allowed to be complex vectors. We sum over ``\left|x_i - y_i\right|^2`` (the squared magnitude of ``x_i - y_i``) instead of ``(x_i - y_i)^2`` to guarantee that ``\|x - y\|_{\ell^2}`` is a real number even when ``x`` and ``y`` are complex.
+
+[^Datar04]: Datar, Mayur & Indyk, Piotr & Immorlica, Nicole & Mirrokni, Vahab. (2004). *Locality-sensitive hashing scheme based on p-stable distributions*. Proceedings of the Annual Symposium on Computational Geometry. DOI: 10.1145/997817.997857.
+
