# Introduction

Document matching is computationally intensive. Methods such as MinHash and locality-sensitive hashing (LSH) were developed to solve this problem efficiently by approximating similarity rather than comparing documents exhaustively. Good resources are available online, including recorded lectures by Jeffrey Ullman, co-author of Mining of Massive Datasets, which covers these topics well.

# Method

## MinHash

Hashing is a technique for encoding string data as integers by mapping arbitrary values to a uniform representation. It requires a hash table and a hash function: the hash table is a collection of items indexed by integer values, and the hash function maps values or items to positions in that table. The result is a simplified representation of a string as an integer. With hashing, it is no longer necessary to use the entire dictionary to calculate string metrics.
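As an illustration, here is a toy hash function (an invented example for this post, not a production hash such as MurmurHash) that maps a string to an integer bucket by summing character codes modulo the table size:

```r
# Toy hash function: NOT a real hash, just an illustration of the idea.
# Sum the character codes of the string, modulo the hash table size.
toy_hash <- function(s, table_size = 101) {
  sum(utf8ToInt(s)) %% table_size
}

toy_hash("pokemon")  # 54
toy_hash("Lugia")    # 94
```

Any string now reduces to a single integer in `0..100`, which is much cheaper to store and compare than the string itself.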

MinHash is a fast approximation of the Jaccard Index. Conceptually, it takes the union of two sets, shuffles the order, and selects the minimum hashed value. MinHash starts by reducing each document to a set of randomly selected shingles. These shingles are hashed with a hash function, and the minimum value is taken as one component of the MinHash signature. The process is repeated with different hash functions (or random permutations) to build the full signature.

Suppose hashing the shingles of two documents yields the following two sets:

• set 1: 4, 25, 78, 65, 3, 26
• set 2: 78, 3, 4, 75, 25, 65

Taking the union of these two sets gives:

• union of 1 and 2: 4, 25, 78, 65, 3, 26, 75

The probability that one of the common elements (4, 25, 78, 65, 3) appears first in a random shuffle of this union is 5/7, or about 0.7142857. This is exactly the Jaccard Index of the two sets (5 shared elements out of 7 in the union), which is why MinHash estimates it.
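A quick Monte Carlo check (a sketch; the seed and trial count are arbitrary choices) confirms that a uniformly random first element of the union is a shared element about 5/7 of the time:

```r
# Estimate by simulation: draw the first element of a random shuffle of the
# union, and check whether it is one of the 5 shared values (out of 7 total).
set.seed(42)
union_12 <- c(4, 25, 78, 65, 3, 26, 75)
shared   <- c(4, 25, 78, 65, 3)

trials <- 100000
hits <- sum(replicate(trials, sample(union_12, 1)) %in% shared)
hits / trials  # close to 5/7 ~= 0.714
```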

Let’s now try this with the following three strings:

• sentence 1: “The best pyschic pokemon is Lugia”
• sentence 2: “The greatest pyschic pokemon is Lugia”
• sentence 3: “The best physical pokemon is Geodude”
```r
sentence1 <- "The best pyschic pokemon is Lugia"
sentence2 <- "The greatest pyschic pokemon is Lugia"
sentence3 <- "The best physical pokemon is Geodude"

corpus <- c(sentence1, sentence2, sentence3)
corpus <- sapply(corpus, function(x) strsplit(x, " "))
corpus_all <- unique(unlist(corpus))
corpus_table <- sapply(corpus, function(x) table(factor(x, levels = corpus_all)))

minhash_index <- 1:length(corpus_all)

corpus_table * minhash_index
```
|          | The best pyschic pokemon is Lugia | The greatest pyschic pokemon is Lugia | The best physical pokemon is Geodude |
|----------|---:|---:|---:|
| The      | 1 | 1 | 1 |
| best     | 2 | 0 | 2 |
| pyschic  | 3 | 3 | 0 |
| pokemon  | 4 | 4 | 4 |
| is       | 5 | 5 | 5 |
| Lugia    | 6 | 6 | 0 |
| greatest | 0 | 7 | 0 |
| physical | 0 | 0 | 8 |
| Geodude  | 0 | 0 | 9 |

Here we are concerned with the smallest nonzero value within each column, i.e., the position of the earliest vocabulary word that occurs in each sentence. This is the value taken as the MinHash signature component. We can repeat this $$N$$ times on random permutations of the vocabulary to create the final MinHash signature for each string.
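Extracting that column-wise minimum is a one-liner; the following self-contained sketch repeats the setup above and pulls out the smallest nonzero index per document:

```r
sentences <- c("The best pyschic pokemon is Lugia",
               "The greatest pyschic pokemon is Lugia",
               "The best physical pokemon is Geodude")
corpus <- lapply(sentences, function(x) strsplit(x, " ")[[1]])
corpus_all <- unique(unlist(corpus))
corpus_table <- sapply(corpus, function(x) table(factor(x, levels = corpus_all)))
hashed <- corpus_table * seq_along(corpus_all)

# MinHash value per document: the smallest nonzero index in each column.
minhash_values <- apply(hashed, 2, function(x) min(x[x > 0]))
minhash_values  # all 1 here, since every sentence starts with "The"
```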

```r
for (i in 1:6) {
  print(paste('Iteration:', i))
  corpus_all <- sample(corpus_all)  # random permutation of the vocabulary
  corpus_table <- sapply(corpus, function(x) table(factor(x, levels = corpus_all)))
  print(knitr::kable(corpus_table * minhash_index))
}
```

#### Iteration: 1

|          | The best pyschic pokemon is Lugia | The greatest pyschic pokemon is Lugia | The best physical pokemon is Geodude |
|----------|---:|---:|---:|
| pyschic  | 1 | 1 | 0 |
| The      | 2 | 2 | 2 |
| Lugia    | 3 | 3 | 0 |
| greatest | 0 | 4 | 0 |
| is       | 5 | 5 | 5 |
| best     | 6 | 0 | 6 |
| pokemon  | 7 | 7 | 7 |
| physical | 0 | 0 | 8 |
| Geodude  | 0 | 0 | 9 |

#### Iteration: 2

|          | The best pyschic pokemon is Lugia | The greatest pyschic pokemon is Lugia | The best physical pokemon is Geodude |
|----------|---:|---:|---:|
| physical | 0 | 0 | 1 |
| Lugia    | 2 | 2 | 0 |
| Geodude  | 0 | 0 | 3 |
| The      | 4 | 4 | 4 |
| greatest | 0 | 5 | 0 |
| is       | 6 | 6 | 6 |
| pokemon  | 7 | 7 | 7 |
| pyschic  | 8 | 8 | 0 |
| best     | 9 | 0 | 9 |

#### Iteration: 3

|          | The best pyschic pokemon is Lugia | The greatest pyschic pokemon is Lugia | The best physical pokemon is Geodude |
|----------|---:|---:|---:|
| Geodude  | 0 | 0 | 1 |
| pokemon  | 2 | 2 | 2 |
| greatest | 0 | 3 | 0 |
| best     | 4 | 0 | 4 |
| Lugia    | 5 | 5 | 0 |
| physical | 0 | 0 | 6 |
| The      | 7 | 7 | 7 |
| pyschic  | 8 | 8 | 0 |
| is       | 9 | 9 | 9 |

#### Iteration: 4

|          | The best pyschic pokemon is Lugia | The greatest pyschic pokemon is Lugia | The best physical pokemon is Geodude |
|----------|---:|---:|---:|
| greatest | 0 | 1 | 0 |
| is       | 2 | 2 | 2 |
| The      | 3 | 3 | 3 |
| Geodude  | 0 | 0 | 4 |
| pyschic  | 5 | 5 | 0 |
| physical | 0 | 0 | 6 |
| best     | 7 | 0 | 7 |
| Lugia    | 8 | 8 | 0 |
| pokemon  | 9 | 9 | 9 |

#### Iteration: 5

|          | The best pyschic pokemon is Lugia | The greatest pyschic pokemon is Lugia | The best physical pokemon is Geodude |
|----------|---:|---:|---:|
| pyschic  | 1 | 1 | 0 |
| physical | 0 | 0 | 2 |
| is       | 3 | 3 | 3 |
| best     | 4 | 0 | 4 |
| greatest | 0 | 5 | 0 |
| Geodude  | 0 | 0 | 6 |
| Lugia    | 7 | 7 | 0 |
| pokemon  | 8 | 8 | 8 |
| The      | 9 | 9 | 9 |

#### Iteration: 6

|          | The best pyschic pokemon is Lugia | The greatest pyschic pokemon is Lugia | The best physical pokemon is Geodude |
|----------|---:|---:|---:|
| The      | 1 | 1 | 1 |
| best     | 2 | 0 | 2 |
| is       | 3 | 3 | 3 |
| Lugia    | 4 | 4 | 0 |
| Geodude  | 0 | 0 | 5 |
| physical | 0 | 0 | 6 |
| pyschic  | 7 | 7 | 0 |
| greatest | 0 | 8 | 0 |
| pokemon  | 9 | 9 | 9 |

Our final MinHash signatures are as follows:

```
'The best pyschic pokemon is Lugia':     [1, 2, 2, 2, 1, 1]
'The greatest pyschic pokemon is Lugia': [1, 2, 2, 1, 1, 1]
'The best physical pokemon is Geodude':  [2, 1, 1, 2, 2, 1]
```

We can now identify which strings are similar by comparing their signatures: the fraction of positions on which two signatures agree estimates their Jaccard Index. The estimate for sentences 1 & 2 (5/6 ≈ 0.833) is much larger than for sentences 2 & 3 (1/6 ≈ 0.167), and a quick eyeball comparison of the sentences confirms this.
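The estimate is just the element-wise agreement rate of the signature vectors:

```r
# Estimated Jaccard Index = fraction of MinHash positions that agree.
sig1 <- c(1, 2, 2, 2, 1, 1)  # 'The best pyschic pokemon is Lugia'
sig2 <- c(1, 2, 2, 1, 1, 1)  # 'The greatest pyschic pokemon is Lugia'
sig3 <- c(2, 1, 1, 2, 2, 1)  # 'The best physical pokemon is Geodude'

mean(sig1 == sig2)  # 5/6 ~= 0.833
mean(sig2 == sig3)  # 1/6 ~= 0.167
```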

## Locality-sensitive hashing

MinHash still leaves us with many signatures that need to be compared further. Locality-sensitive hashing reduces this work by “bucketing” the MinHash signatures. Here we group – or bucket – each MinHash signature into 2 groups with 3 elements each:

```
'The best pyschic pokemon is Lugia':     [122, 211]
'The greatest pyschic pokemon is Lugia': [122, 111]
'The best physical pokemon is Geodude':  [211, 221]
```

What we have here resembles a hash table: the MinHash signatures are transformed into a representation of groups and elements that can be treated as hashes. But unlike other hashing techniques, where different inputs are deliberately placed into different buckets, LSH attempts to place different inputs into the same bucket when they are reasonably similar to one another.
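One simple way to implement this bucketing (a sketch, using an invented `band_keys` helper) is to use each group's position and contents as a bucket key; two documents become candidates if any key collides:

```r
# Split a signature into groups of n values; the bucket key is
# "group position + group contents", since groups are only compared
# position-by-position.
band_keys <- function(sig, n = 3) {
  bands <- split(sig, ceiling(seq_along(sig) / n))
  paste0(seq_along(bands), "-", sapply(bands, paste, collapse = ""))
}

sig1 <- c(1, 2, 2, 2, 1, 1)
sig2 <- c(1, 2, 2, 1, 1, 1)
sig3 <- c(2, 1, 1, 2, 2, 1)

length(intersect(band_keys(sig1), band_keys(sig2))) > 0  # TRUE: shared group "122"
length(intersect(band_keys(sig2), band_keys(sig3))) > 0  # FALSE: no shared group
```

Only the pair that shares a bucket (sentences 1 and 2) needs a full similarity check; the dissimilar pairs are never compared.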

## How likely are we to detect a match using LSH?

The probability of detecting a match using LSH depends on the number of elements and groups and on the Jaccard Index threshold. Say we set the Jaccard Index threshold to 0.75, using the same numbers of groups and elements as above. The probability of detecting a match is then given by the following formula:

$$1 - (1 - J^n)^b$$

Where

• $$J$$ is the Jaccard index
• $$n$$ is the number of elements in a group
• $$b$$ is the number of groups

In this case, the probability of LSH detecting a match is $$1 - (1 - 0.75^3)^2 \approx 0.6657715$$. If the documents were less similar, the probability of a match would be much lower. In general, the relationship between the Jaccard Index of the documents being matched, the number of MinHash signature elements and groups, and the probability of LSH detecting a match follows an S-shaped curve, resembling a diffusion curve.
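The formula is straightforward to evaluate directly:

```r
# Probability that LSH flags a pair as a candidate match:
# J = Jaccard Index, n = elements per group, b = number of groups.
lsh_match_prob <- function(J, n, b) 1 - (1 - J^n)^b

lsh_match_prob(J = 0.75, n = 3, b = 2)  # 0.6657715
lsh_match_prob(J = 0.25, n = 3, b = 2)  # ~0.031: dissimilar pairs rarely match
```

Evaluating `lsh_match_prob` over a grid of `J` values traces out the S-shaped curve described above.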