docs/src/similarities/cosine.md
24 additions & 0 deletions
@@ -94,6 +94,30 @@ julia> length(hashes)
 10
 ```
 
+The probability of a hash collision (for a single hash) is
+
+```math
+Pr[h(x) = h(y)] = 1 - \frac{\theta}{\pi}
+```
+
+where ``\theta = \text{arccos}(\text{cossim}(x,y))`` is the angle between ``x`` and ``y``. This collision probability is shown in the plot below.
+
+```@eval
+using PyPlot, LSH
+hashfn = SimHash()
+x = range(-1, 1; length=1024)
+y = [LSH.single_hash_collision_probability(hashfn, xii) for xii in x]
+
+plot(x, y)
+title("Probability of hash collision for SimHash")
+xlabel(raw"$cossim(x,y)$")
+ylabel(raw"$Pr[h(x) = h(y)]$")
+
+savefig("simhash_collision_probability.svg")
+```
+
+
+
 ### Footnotes
 
 [^1]: Moses S. Charikar. *Similarity estimation techniques from rounding algorithms*. In Proceedings of the Thirty-Fourth Annual ACM Symposium on Theory of Computing, STOC '02, pages 380–388, New York, NY, USA, 2002. Association for Computing Machinery. doi:10.1145/509907.509965.
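
As a rough sanity check on the collision probability added in this diff, the closed form `1 - θ/π` can be compared against a Monte Carlo estimate. The sketch below does not call LSH.jl: `simhash_collision_probability`, `simhash_bit`, and `empirical_collision_probability` are hypothetical helpers written only for illustration. It assumes that a single SimHash bit is the sign of a projection onto a random Gaussian direction, following the random-hyperplane construction described by Charikar[^1]; the library's own implementation may differ in detail.

```julia
using LinearAlgebra, Random

# Closed-form single-hash collision probability from the docs:
# Pr[h(x) = h(y)] = 1 - θ/π, with θ = arccos(cossim(x, y)).
# clamp guards against cosine values that drift slightly outside [-1, 1].
simhash_collision_probability(c) = 1 - acos(clamp(c, -1, 1)) / π

# Hypothetical helper (NOT the LSH.jl API): one hash bit is the sign of the
# projection of x onto a random direction r.
simhash_bit(r, x) = dot(r, x) >= 0

# Estimate the collision probability by drawing `trials` random directions
# and counting how often x and y land on the same side of the hyperplane.
function empirical_collision_probability(x, y; trials = 100_000,
                                         rng = MersenneTwister(0))
    collisions = 0
    for _ in 1:trials
        r = randn(rng, length(x))
        collisions += simhash_bit(r, x) == simhash_bit(r, y)
    end
    return collisions / trials
end

# Two unit vectors separated by an angle of π/3, so cossim(x, y) = 0.5.
x = [1.0, 0.0]
y = [cos(π / 3), sin(π / 3)]

similarity = dot(x, y) / (norm(x) * norm(y))
predicted = simhash_collision_probability(similarity)  # 1 - (π/3)/π, about 2/3
estimated = empirical_collision_probability(x, y)

println("predicted = ", predicted, ", estimated = ", estimated)
```

For vectors at an angle of π/3 the closed form gives roughly 2/3, and the empirical estimate should agree with it up to sampling noise, which shrinks as `trials` grows.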