# Introduction

String metrics can help match documents without needing a lot of resources. Imagine finding the time to sit and tabulate all the words within thousands of texts. The trade-off, however, with computational methods is that string metric algorithms can be computationally intensive. Thankfully, recently introduced techniques can help get around this problem such as MinHash and locality-sensitive hashing.

Many resources are available on the internet, including recorded videos by Dr. Ullman, co-author of Mining of Massive Datasets, which can be downloaded here (see chapter 3).

# How are we making comparisons?

## Jaccard index

The Jaccard index is a string metric that accounts for the intersection over a union. The calculation is simply the count of common elements between two sets divided by the number of elements that belong to either set. In this example, the size of the intersection is 5 and the size of the union is 7. The Jaccard Index is 5 / 7 = 0.7142857.

``````# comparison table of our strings
sentence1 <- "The greatest pyschic pokemon is Mewtwo"
sentence2 <- "The best pyschic pokemon is Mewtwo"

corpus <- c(sentence1, sentence2)
corpus <- sapply(corpus, function(x) strsplit(x, " "))
corpus_all <- unique(unlist(corpus))
corpus_table <- sapply(corpus, function(x) table(factor(x, levels=corpus_all)))
``````
The greatest pyschic pokemon is Mewtwo The best pyschic pokemon is Mewtwo
The 1 1
greatest 1 0
pyschic 1 1
pokemon 1 1
is 1 1
Mewtwo 1 1
best 0 1
``````# calculating the Jaccard Index
library(sets)

common_words <- intersect(corpus[[1]], corpus[[2]])
M1 <- as.set(corpus[[1]])
I <- length(intersect(corpus[[1]], corpus[[2]]))
S <- I/(length(corpus[[1]])+length(corpus[[2]])-I)
> S
[1] 0.7142857
``````

## Shingling

Shingling is a process of taking subsets from a string. It is an encoded representation of word content and order. We can think of shingles as being similar to n-grams. For example, if the shingle size k was set to 2 and the string was “The best pyschic pokemon is Lugia”, then we would see the follow list of shingles.

{“the best”, “best psychic”, “psychic pokemon”, “pokemon is”, “is Lugia”}.

As you can now imagine, string data and shingles can potentially be very unwieldy to compute string metrics.

# Why do we hash?

## Hashing

We can encode string data as an integer type through a technique called hashing where arbitrary values are mapped to a uniformed representation. This requires a hash table, or a collection of items stored and named by an integer value, and a hash function, which maps values or items to the hash table. The result is a simplified representation of a string as an integer. Through hashing, it is now unnecessary to use the entire dictionary to calculate string metrics.

## MinHash

MinHash starts with reducing sets by randomly selecting shingles. These shingles are then hashed using a hash function and the minimum value is taken to create MinHash signatures. This process is repeated using a different hash function.

So what is happening during this process? MinHash is a fast approximation of Jaccard Index. It is taking the union of two sets, shuffling the order, and selecting the minimum hashed value.

Let’s pretend that we have the following two sets after MinHash:

• set 1: 4 25 78 65 3 26.
• set 2: 78 3 4 75 25 65.

And we took the union of these two sets:

• union of 1 and 2: 4 25 78 65 3 26 75

What is the probability that the common elements (e.g., 4, 25, 65, 3, 78) appear first in the list? The odds is 57 or 0.7142857. The way we estimated the similarity using MinHash is similar to the Jaccard Index.

Let’s now try this with the following three strings:

• sentence 1: “The best pyschic pokemon is Lugia”
• sentence 2: “The greatest pyschic pokemon is Lugia”
• sentence 3: “The best physical pokemon is Geodude”
``````sentence1 <- "The best pyschic pokemon is Lugia"
sentence2 <- "The greatest pyschic pokemon is Lugia"
sentence3 <- "The best physical pokemon is Geodude"

corpus <- c(sentence1, sentence2, sentence3)
corpus <- sapply(corpus, function(x) strsplit(x, " "))
corpus_all <- unique(unlist(corpus))
corpus_table <- sapply(corpus, function(x) table(factor(x, levels=corpus_all)))

minhash_index <- 1:length(corpus_all)

corpus_table * minhash_index
``````
The best pyschic pokemon is Lugia The greatest pyschic pokemon is Lugia The best physical pokemon is Geodude
The 1 1 1
best 2 0 2
pyschic 3 3 0
pokemon 4 4 4
is 5 5 5
Lugia 6 6 0
greatest 0 7 0
physical 0 0 8
Geodude 0 0 9

As you review the table, note that we are only concerned with the first occurrence within each row. This is the value that is taken to create the MinHash signature. We can run this again N times on a random permutation to create the final MinHash signature for each string.

``````for(i in 1:6) {
print(paste('Iteration:', i))
corpus_all <- sample(corpus_all)
corpus_table <- sapply(corpus, function(x) table(factor(x, levels=corpus_all)))
print(knitr::kable(corpus_table * minhash_index))
}
``````

[1] “Iteration: 1”

The best pyschic pokemon is Lugia The greatest pyschic pokemon is Lugia The best physical pokemon is Geodude
pyschic 1 1 0
The 2 2 2
Lugia 3 3 0
greatest 0 4 0
is 5 5 5
best 6 0 6
pokemon 7 7 7
physical 0 0 8
Geodude 0 0 9

[1] “Iteration: 2”

The best pyschic pokemon is Lugia The greatest pyschic pokemon is Lugia The best physical pokemon is Geodude
physical 0 0 1
Lugia 2 2 0
Geodude 0 0 3
The 4 4 4
greatest 0 5 0
is 6 6 6
pokemon 7 7 7
pyschic 8 8 0
best 9 0 9

[1] “Iteration: 3”

The best pyschic pokemon is Lugia The greatest pyschic pokemon is Lugia The best physical pokemon is Geodude
Geodude 0 0 1
pokemon 2 2 2
greatest 0 3 0
best 4 0 4
Lugia 5 5 0
physical 0 0 6
The 7 7 7
pyschic 8 8 0
is 9 9 9

[1] “Iteration: 4”

The best pyschic pokemon is Lugia The greatest pyschic pokemon is Lugia The best physical pokemon is Geodude
greatest 0 1 0
is 2 2 2
The 3 3 3
Geodude 0 0 4
pyschic 5 5 0
physical 0 0 6
best 7 0 7
Lugia 8 8 0
pokemon 9 9 9

[1] “Iteration: 5”

The best pyschic pokemon is Lugia The greatest pyschic pokemon is Lugia The best physical pokemon is Geodude
pyschic 1 1 0
physical 0 0 2
is 3 3 3
best 4 0 4
greatest 0 5 0
Geodude 0 0 6
Lugia 7 7 0
pokemon 8 8 8
The 9 9 9

[1] “Iteration: 6”

The best pyschic pokemon is Lugia The greatest pyschic pokemon is Lugia The best physical pokemon is Geodude
The 1 1 1
best 2 0 2
is 3 3 3
Lugia 4 4 0
Geodude 0 0 5
physical 0 0 6
pyschic 7 7 0
greatest 0 8 0
pokemon 9 9 9

Our final MinHash signatures are as follows:

``````'The best pyschic pokemon is Lugia':    [1, 2, 2, 2, 1, 1]
'The best greatest pokemon is Lugia':   [1, 2, 2, 1, 1, 1]
'The best physical pokemon is Geodude': [2, 1, 1, 2, 2, 1]
``````

We can now identify which strings are similar by calculating their Jaccard Index. For example, we can see that the Jaccard Index between sentence 1 & 2 (JI : 0.833) is larger than between sentence 2 & 3 (JI : 0.167), and a quick eye-ball test verifies our findings.

## Locality-sensitive hashing

MinHash still leaves us with many signatures that need to further be compared. Locality-sensitive hashing achieves this by “bucketing” the MinHash signatures. Here we group –or bucket– our MinHash signature into 2 groups with 3 elements each:

``````'The best pyschic pokemon is Lugia':    [122, 211]
'The best greatest pokemon is Lugia':   [122, 111]
'The best physical pokemon is Geodude': [211, 221]
``````

What we have here resembles a hash table. In other words, the MinHash signatures are transformed into a different representation of groups and elements, which can be thought of as hashes. But unlike other hashing techniques where different inputs are placed into different buckets, LSH attempts to place different inputs into the same buckets if they are reasonably similar to one another.

## How likely are we to detect a match using LSH?

The probability of detecting a match using LSH is dependent on the number elements and groups and the Jaccard Index threshold. Say we wanted to set the Jaccard Index threshold to 0.75 using the same number of groups and elements above. The probability of detecting a match can then be solved using the following formula:

\$ 1 - (1 - J^n)^b \$

Where

• J is the Jaccard index
• n is the number of elements in a group
• b is the number of groups

In this case, the probability of LSH detecting a match is 0.6657715. If the documents were less similar, then the probability of LSH matching would be much lower. In many examples, this relationship between the number of MinHash signatures, the Jaccard Index of the documents being matched and the probability of LSH detecting a match resembles a diffusion curve.

# Conclusion

It is still recommended that a set of eyes passes through the results to ensure that LSH did decently well or if the bands and groups needs adjusting. Is MinHashing and LSH a silver bullet? Probably not. But I’m willing to bet it is definitely a lot better than what many analysts have and don’t have (e.g., time).

# Toy example: cleaning campaign contribution

MinHash and LSH are useful techniques when wanting to churn through a large amount of string data. In this example, I will use these methods to normalize employer names from a campaign contribution file.

``````library(data.table)
library(stringr)
library(textreuse)
library(textstem)
library(ggplot2)
``````

## Data

The campaign contribution file was sourced from Hawaii Open Data’s portal.

``````data_file <-'Campaign_Contributions_Received_By_Hawaii_State_and_County_Candidates_From_November_8__2006_Through_December_31__2017.csv'

df_names <- make.names(names(sample_df))
names(sample_df) <- tolower(df_names)

sample_df <- sample_df[election.period == '2016-2018' & office == 'Governor', ]
``````

## Preprocess

Here I employ some common string cleaning methods.

``````preprocessNGram <- function(x) {
# stop_words = c('inc', 'llc', 'llp', 'ltd', 'inc', '&', 'and')
x = str_trim(x)
x = tolower(x)
x = gsub('[[:punct:] ]+',' ', x)
x = str_replace_all(x, '\\.{2,}', ' ')
x = str_replace_all(x, '\\.', ' ')
x = str_replace(gsub("\\s+", " ", x), "B", "b")
# x = str_replace_all(x, paste0(stop_words, collapse = "|"), '')
x = stem_strings(x)
x = str_replace_all(x, ' ', '_')
x = str_replace_all(x, '\\_{2,}', '_')
return(x)
}
sample_df[, c('employer2') := lapply(.SD, preprocessNGram), .SDcols = c('employer')]
sample_df = sample_df[employer2!='n_a' & employer2!='',  ]
``````

## LSH

### shingles

I am creating character-level shingles. The reason for this is that employer names are short and often either misspelled or presented in various forms.

``````shingles <- lapply(1:nrow(sample_df), function(i) {
shingle_n = 2
x = sample_df\$employer2[i]
x = substring(x, seq(1, nchar(x), shingle_n), seq(shingle_n ,nchar(x), shingle_n))
x = x[!x %in% c('')]
x = data.table(shingle = paste(x, collapse = ' '))
x
})
shingles <- rbindlist(shingles)
``````

### minhashing

Here I set the total number of MinHashes. As shown above, the more permutations we do, the closer the approximations get.

``````minhash <- minhash_generator(n = 240, seed = 808)

corpus <- TextReuseCorpus(
text = shingles\$shingle,
tokenizer = tokenize_ngrams,
n = 1,
minhash_func = minhash,
progress = TRUE)

length(minhashes(corpus[[1]]))
``````

### LSH

Given that I am using 240 hashes, I set my bands to 80. Please note that the number of bands need to be evenly divisible by the number of elements. Look above where the MinHash signature is “bucketed” into groups by an even number of elements. Given these parameters, LSH should detect any pairs of matches above the Jaccard Index of 0.2320794.

``````lsh_threshold(h = 240, b = 80)

buckets <- lsh(corpus, bands = 80, progress = FALSE)
candidates <- lsh_candidates(buckets)
res <- lsh_compare(candidates, corpus, jaccard_similarity, progress = TRUE)
``````

### Create match table

Finally we can create a matched table on the results. The new ID serves as a key that groups similar strings of employer names together. This table can then be merged back to the original set.

``````setDT(res)
employer_list <- data.table(employer = sample_df\$employer, employer2 = sample_df\$employer2)

pairs_list <- unique(c(unique(res\$a), unique(res\$b)))

id = 0
for (i in unique(res\$a)) {
id <- id + 1
x <- res[a == i, ]
y <- res[b %in% x\$b, ]
x_i <- unique(c(unique(y\$a), unique(y\$b)))
x_i <- as.numeric(str_replace_all(x_i, 'doc-', ''))

if(length(x_i) != 0) {
employer_list[x_i]
employer_list[x_i, group := id]
pairs_list <- pairs_list[! pairs_list %in% x_i]
}
}
employer_list\$employer <- NULL
employer_list <- unique(employer_list)

# sample
sample_df2 <- merge(sample_df, employer_list, by=c('employer', 'employer2'))
sample_df2[, ':='(
amount    = as.numeric(str_replace_all(amount, '\\\$','')),
aggregate = as.numeric(str_replace_all(aggregate, '\\\$',''))
) ]

sample_df2[group==810, ] # 28000.00
sample_df2[group==880, ] # 74700.00

``````

### Original vs new employer column

As we can see, if we relied on the original columns we would have understated the total contribution by donors associated with various employers in Hawaii. For example, we can see that donors who listed Bower & Kubota Consulting as their employer actually donated a sum of \$74,700. We would also have understated the sum by donors who listed R.M. Towill Corporation by \$8,000.

original

employer V1
Retired 72150
Not Employed 60000
Self Employed 36075
Bower & Kubota Consulting 35900
R.M. Towill Corporation 20000

new

employer V1
Bower & Kubota Consulting 74700
R.M. Towill Corporation 28000