
Commit 6920d48

Merge branch 'master' into docs

2 parents 7fe03f8 + c062507

File tree: 1 file changed (+55 −7 lines)

README.md

Lines changed: 55 additions & 7 deletions
@@ -6,11 +6,30 @@

[![codecov](https://codecov.io/gh/kernelmethod/LSH.jl/branch/master/graph/badge.svg)](https://codecov.io/gh/kernelmethod/LSH.jl)
DOI to cite this code: [![DOI](https://zenodo.org/badge/197700982.svg)](https://zenodo.org/badge/latestdoi/197700982)
-Implementations of different [locality-sensitive hash functions](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) in Julia.
+A Julia package for [locality-sensitive hashing](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) to accelerate similarity search.
-**Installation**: `julia> Pkg.add("https://github.com/kernelmethod/LSH.jl")`

- [What's LSH?](#whats-lsh)
- [Installation](#installation)
- [Supported similarity functions](#supported-similarity-functions)
- [Examples](#examples)
-So far, there are hash functions for the following measures of similarity:

## What's LSH?

Traditionally, if you have a data point `x` and want to find the most similar point(s) to `x` in your database, you would compute the similarity between `x` and every point in the database, keeping whichever points were the most similar. This type of approach is used, for instance, by the classic [k-nearest neighbors algorithm](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm). However, it has two major problems:

- The time to find the most similar point(s) to `x` is linear in the number of points in your database. This can make similarity search prohibitively expensive for even moderately large datasets.
- In addition, the time complexity of computing the similarity between two data points is typically linear in the number of dimensions of those points. If your data are high-dimensional (i.e., in the thousands to millions of dimensions), every similarity computation you perform can be fairly costly.

**Locality-sensitive hashing** (LSH) is a technique for accelerating these kinds of similarity searches. Instead of measuring how similar your query point is to every point in your database, you calculate a few hashes of the query point and only compare it against the points with which it has a hash collision. Locality-sensitive hash functions are randomly generated, with the fundamental property that the probability of a hash collision between `x` and `y` increases as the similarity between `x` and `y` increases.
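
The collision property described above can be demonstrated with a small, self-contained Julia sketch of the random-hyperplane scheme commonly used for cosine-similarity LSH. This is purely illustrative (it is not LSH.jl's implementation, and every name in it is made up for the example):

```julia
# Illustrative random-hyperplane LSH for cosine similarity (a sketch,
# not LSH.jl's actual implementation).
using LinearAlgebra, Random, Statistics

Random.seed!(42)

dim, n_hashes = 128, 1_000
planes = randn(n_hashes, dim)       # one random hyperplane per hash function

# Signature: for each hyperplane, record which side of it the vector lies on
signature(v) = planes * v .> 0

# Fraction of hash functions on which two vectors collide
collision_rate(a, b) = mean(signature(a) .== signature(b))

x = randn(dim)
y = x + 0.1 * randn(dim)            # highly similar to x
z = randn(dim)                      # unrelated to x

collision_rate(x, y)                # usually close to 1.0
collision_rate(x, z)                # usually close to 0.5
```

With 1,000 hyperplanes the collision-rate estimates are quite stable; a practical index would use far fewer hashes per function and trade accuracy for speed.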
## Installation

You can install LSH.jl from the Julia REPL with

```
pkg> add https://github.com/kernelmethod/LSH.jl
```
## Supported similarity functions

So far, there are hash functions for the following similarity functions:

- Cosine similarity (`SimHash`)
- Jaccard similarity (`MinHash`)

@@ -19,12 +38,41 @@

- Inner product
  - `SignALSH` (recommended)
  - `MIPSHash`
- Function-space hashes (supports L1, L2, and cosine similarity)
  - `MonteCarloHash`
  - `ChebHash`
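
As a quick illustration of one entry in this list, a MinHash for Jaccard similarity might be drawn through the same `LSHFunction` interface shown in the examples below. Note that the `jaccard` similarity name and the set-input convention here are assumptions about the API, so treat this as a sketch rather than verified usage:

```julia
# Sketch (unverified against the package): draw 5 MinHash functions for
# Jaccard similarity. Assumes `jaccard` is exported by LSH.jl and that
# MinHash accepts set-like inputs.
using LSH

hashfn = LSHFunction(jaccard, 5)

A = Set([1, 2, 3, 4])
hashes = hashfn(A)    # one hash value per drawn hash function
```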

This package still needs a lot of work, including improvements to the documentation and API.

## Examples

The easiest way to start constructing new hash functions is by calling `LSHFunction` with the following syntax:

```
hashfn = LSHFunction(similarity function,
                     number of hash functions to generate;
                     [LSH family-specific keyword arguments])
```

For example, the following snippet generates 10 locality-sensitive hash functions (bundled together into a single `SimHash` struct) for cosine similarity:

```julia
julia> using LSH;

julia> hashfn = LSHFunction(cossim, 10);

julia> n_hashes(hashfn)
10

julia> similarity(hashfn)
cossim
```

You can then start hashing new vectors by calling `hashfn()`:

```julia
-hashfn = LSHFunction(similarity; [LSH family-specific keyword arguments])
julia> x = randn(128);

julia> x_hashes = hashfn(x);
```
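
Assuming `hashfn` is the 10-hash `SimHash` from the earlier snippet, and that `hashfn(x)` returns one value per hash function (an assumption for this sketch), you can compare signatures to see the locality-sensitive property at work. Exact counts vary from run to run:

```julia
julia> y = x + 0.1 * randn(128);       # a vector very similar to x

julia> z = randn(128);                 # an unrelated vector

julia> count(hashfn(x) .== hashfn(y))  # collisions with the similar vector:
                                       # usually close to all 10 hashes

julia> count(hashfn(x) .== hashfn(z))  # collisions with the unrelated vector:
                                       # usually only about half of them
```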

For more details, [check out the LSH.jl documentation](https://kernelmethod.github.io/LSH.jl/dev/).
