# Introduction

At the Hawaii Machine Learning Meetup, we have stressed the importance of participating in Kaggle. It is a great place to learn new techniques and grow as a data scientist. I wrote this post to demonstrate one example of how I adapted my Kaggle experience into my everyday work. In the not-so-distant past, my friend Matt Motoki and I worked on identifying duplicate questions asked on Quora. The competition exposed us to some interesting methods, including string metrics.

# String metrics: elements versus content

## elements

String metrics are useful when we want a quantifiable measure of closeness between two strings. Many string metrics satisfy the triangle inequality, which states that for any triangle, the sum of the lengths of any two sides must be greater than or equal to the length of the remaining side. Conceptually: if A is close to B, and B is close to C, then A is close to C. A popular string metric that satisfies this property is the Levenshtein distance, which counts the character-level edits (insertions, deletions, and substitutions) needed to turn one string into the other. It is natural, then, to think that the more edits needed to create a match, the further apart two strings are.

``````# the number of deletions, insertions and substitutions needed to turn word2 into word1
word1 <- "apples"
word2 <- "lappes"

stringdist::stringdist(word1, word2, method='lv')
[1] 2

# including an extra character that needs to be deleted
word1 <- "apples"
word2 <- "lappesp"

stringdist::stringdist(word1, word2, method='lv')
[1] 3

``````
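Since Levenshtein distance is a true metric, we can check the triangle inequality directly. A minimal sketch using base R's `adist()` (which also computes Levenshtein distance), reusing the strings above:

``````# d(A, C) <= d(A, B) + d(B, C) should hold for Levenshtein distance
a <- "apples"
b <- "lappes"
c_ <- "lappesp"

d_ab <- drop(adist(a, b))   # 2
d_bc <- drop(adist(b, c_))  # 1
d_ac <- drop(adist(a, c_))  # 3

d_ac <= d_ab + d_bc
[1] TRUE
``````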

## content

Although the triangle inequality provides a simple intuition about string metrics, popular measures such as cosine similarity do not satisfy the triangle inequality (see proof). Unlike Levenshtein distance, cosine similarity treats each string as a vector of word counts and cares only about the angle between the two vectors; in other words, cosine similarity is a measurement of orientation and will yield a stronger measure of “closeness” when two strings share the same words. Rather than focusing on a string’s elements, we are concerned with its content. To elucidate this point, I have created the example below.

``````# calculating the cosine similarity between two strings
sentence1 <- "I like apples more than Bob likes apples"
sentence2 <- "I like bananas more than my friend but we both like apples"

corpus <- c(sentence1, sentence2)
corpus <- sapply(corpus, function(x) strsplit(x, " "))
corpus_all <- unique(unlist(corpus))
corpus_table <- sapply(corpus, function(x) table(factor(x, levels=corpus_all)))
colnames(corpus_table) <- c("sentence1", "sentence2")

corpus_table
        sentence1 sentence2
I               1         1
like            1         2
apples          2         1
more            1         1
than            1         1
Bob             1         0
likes           1         0
bananas         0         1
my              0         1
friend          0         1
but             0         1
we              0         1
both            0         1

a <- corpus_table[,1]
b <- corpus_table[,2]

(a %*% b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
         [,1]
[1,] 0.591608

# calculating the cosine similarity where the sentences share more words
sentence1 <- "I like apples more than Bob likes apples"
sentence2 <- "I like apples more than Bob likes pineapple apple pens"

corpus <- c(sentence1, sentence2)
corpus <- sapply(corpus, function(x) strsplit(x, " "))
corpus_all <- unique(unlist(corpus))
corpus_table <- sapply(corpus, function(x) table(factor(x, levels=corpus_all)))
colnames(corpus_table) <- c("sentence1", "sentence2")

corpus_table
          sentence1 sentence2
I                 1         1
like              1         1
apples            2         1
more              1         1
than              1         1
Bob               1         1
likes             1         1
pineapple         0         1
apple             0         1
pens              0         1

a <- corpus_table[,1]
b <- corpus_table[,2]

(a %*% b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
     [,1]
[1,]  0.8

``````
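Incidentally, the failure of the triangle inequality mentioned above is easy to see with cosine *distance* (1 minus cosine similarity). A minimal counterexample, using made-up one- and two-word "documents" as count vectors:

``````cos_dist <- function(x, y) 1 - sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))

a <- c(1, 0)   # counts for the string "apples"
b <- c(1, 1)   # "apples bananas"
c_ <- c(0, 1)  # "bananas"

cos_dist(a, b) + cos_dist(b, c_)
[1] 0.5857864

cos_dist(a, c_)
[1] 1
``````

Here d(a, c) = 1 is strictly larger than d(a, b) + d(b, c) ≈ 0.586, so cosine distance breaks the triangle inequality.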

# Using string distance to clean addresses

Knowing that string metrics quantify “closeness”, we can leverage them to “clean” free text such as street addresses. This can be achieved by matching each street address against a prepared reference dataset. In this case, I am using a dataset of TMK (Tax Map Key) parcels in Hawaii, where the matched string address serves as a key to extract the property centroid as a “cheap” geocoding workaround.
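The core idea can be sketched in a few lines before scaling it up: snap a messy address to its nearest entry in a reference list by edit distance (here with base R's `adist()`; the two reference addresses are the dummy examples used below):

``````dictionary <- c("100 n beretania st 120 honolulu 96817",
                "3353 waialae ave 305 honolulu 96816")
messy <- "100 nort beretania st 120 hnolulu 96817"

d <- drop(adist(messy, dictionary))
dictionary[which.min(d)]
[1] "100 n beretania st 120 honolulu 96817"
``````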

## Create dummy data

``````library(data.table)
library(sf)
library(text2vec)
library(Matrix)
library(stringr)
library(stringdist)

FILE_DIR <- 'SET YOUR WORKING DIRECTORY HERE'

# dummy data: the true address alongside a messy, free-text version of it
addresses_df <- data.table(
  true_address = c(
    '100 N BERETANIA ST 120 HONOLULU 96817',
    '3353 WAIALAE AVE 305 HONOLULU 96816',
    '94750 HIKIMOE ST WAIPAHU 96797'
  ),
  input_address = c(
    '100 NORT BERETANIA ST 120 HNOLULU 96817',
    '3353 WAIALE AVE 305 HONOLULU 96816',
    '94750 HIKIME ST WAIPAHU 96707'
  )
)

# lowercase both columns before matching
addresses_df[, names(addresses_df) := lapply(.SD, tolower)]
``````
100 n beretania st 120 honolulu 96817   100 nort beretania st 120 hnolulu 96817
3353 waialae ave 305 honolulu 96816     3353 waiale ave 305 honolulu 96816
94750 hikimoe st waipahu 96797          94750 hikime st waipahu 96707

## Prepare reference table

``````# address_pts: TMK parcel address points (with centroid geometry),
# read in beforehand, e.g. via sf::st_read() from FILE_DIR
setDT(address_pts)

# CITY sometimes holds both a city and a town name separated by " / "
address_pts[, c("City2", "Town") := tstrsplit(CITY, " / ", fixed=TRUE)]

# match_address (shown below) is the lowercased concatenation of the parcel's
# street number, street name, city and zip, built to mirror the input format

``````
STREETNAME         CITY                      ZIP   match_address
MALUHIA ST         Honolulu / Waialae Kahala 96816 3415 maluhia st honolulu 96816
SIERRA DR          Honolulu / Waialae Kahala 96816 3777 sierra dr honolulu 96816
MELEINOA PL        Waipahu                   96797 941172 meleinoa pl waipahu 96797
WAIANAE VALLEY RD  Waianae                   96792 851412 waianae valley rd waianae 96792
HAIEA PL           Ewa Beach                 96706 910121 haiea pl ewa beach 96706

## Match and score distance

``````# for each input address, return the closest dictionary address and its distance
matchDistance <- function(input_address, dictionary_address){
  dist <- stringdist(input_address, dictionary_address, method='lv')
  x <- paste(dictionary_address[which.min(dist)], min(dist), sep='_')
  return(x)
}

addresses_df[, dist_match_address := matchDistance(input_address, address_pts$match_address),
             by=input_address
            ][, c('dist_match_address', 'score') := tstrsplit(dist_match_address, '_', fixed=TRUE)]

``````
``````geocodes_df <- address_pts[match_address %in% addresses_df$dist_match_address, .(match_address, geometry)]
``````