String Metrics

Mar 22, 2018 00:00 · 1038 words · 5 minute read

Introduction

At the Hawaii Machine Learning Meetup, we have stressed the importance of participating in Kaggle. It is a great place to learn new techniques and grow as a data scientist. I wrote this post to demonstrate one example of how I adapted my Kaggle experience to my everyday work. In the not-so-distant past, my friend Matt Motoki and I worked on identifying duplicate questions asked on Quora. The competition exposed us to some interesting methods, including string metrics.

String metrics: elements versus content

elements

String metrics are useful when we want a quantifiable measure of closeness between two strings. Many string metrics satisfy the triangle inequality, which states that for any triangle, the sum of the lengths of any two sides must be greater than or equal to the length of the remaining side. Conceptually: if A is close to B, and B is close to C, then A cannot be too far from C. A popular string metric with this property is the Levenshtein distance, which compares two strings at the character level and counts the edits needed to turn one into the other. Naturally, the more edits needed to create a match, the further apart the two strings are.

# calculating the number of deletions, insertions and substitutions necessary to turn word2 into word1
word1 <- "apples"
word2 <- "lappes"

stringdist::stringdist(word1, word2, method='lv')
[1] 2

# including an extra character that needs to be deleted
word1 <- "apples"
word2 <- "lappesp"

stringdist::stringdist(word1, word2, method='lv')
[1] 3
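As a quick sanity check, we can verify the triangle inequality directly using the three strings above (a minimal sketch; the variable names are mine):

```r
library(stringdist)

# triangle inequality: d(a, c) <= d(a, b) + d(b, c)
a <- "apples"
b <- "lappes"
c <- "lappesp"

d_ac <- stringdist(a, c, method = 'lv')  # 3
d_ab <- stringdist(a, b, method = 'lv')  # 2
d_bc <- stringdist(b, c, method = 'lv')  # 1 (one insertion)
d_ac <= d_ab + d_bc                      # TRUE
```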

content

Although the triangle inequality provides a simple way to reason about string metrics, popular measures such as cosine similarity do not satisfy it (cosine similarity is not a true metric). Unlike the Levenshtein distance, cosine similarity cares only about the angle between the two strings' word-count vectors; in other words, it is a measure of orientation and will report stronger "closeness" when two strings share the same words. Rather than focusing on the strings' elements, we are concerned with their content. To elucidate this point, I have created the example below.

# calculating the cosine similarity between two strings
sentence1 <- "I like apples more than Bob likes apples"
sentence2 <- "I like bananas more than my friend but we both like apples"

corpus <- c(sentence1, sentence2)
corpus <- sapply(corpus, function(x) strsplit(x, " "))
corpus_all <- unique(unlist(corpus))
corpus_table <- sapply(corpus, function(x) table(factor(x, levels=corpus_all)))

> corpus_table
        I like apples more than Bob likes apples I like bananas more than my friend but we both like apples
I                                              1                                                          1
like                                           1                                                          2
apples                                         2                                                          1
more                                           1                                                          1
than                                           1                                                          1
Bob                                            1                                                          0
likes                                          1                                                          0
bananas                                        0                                                          1
my                                             0                                                          1
friend                                         0                                                          1
but                                            0                                                          1
we                                             0                                                          1
both                                           0                                                          1

a <- corpus_table[,1]
b <- corpus_table[,2]

(a %*% b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
         [,1]
[1,] 0.591608
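To see why the cosine measure fails the triangle inequality, here is a minimal counterexample with a two-word vocabulary (made-up sentences, purely for illustration):

```r
# cosine distance (1 - cosine similarity) between word-count vectors
cos_dist <- function(x, y) 1 - sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))

# vocabulary: (apples, bananas)
a <- c(1, 0)  # "apples"
b <- c(1, 1)  # "apples bananas"
c <- c(0, 1)  # "bananas"

cos_dist(a, c)                   # 1, but
cos_dist(a, b) + cos_dist(b, c)  # ~0.586, so d(a, c) > d(a, b) + d(b, c)
```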

# calculating the cosine similarity between two strings whose content is more "similar"
sentence1 <- "I like apples more than Bob likes apples"
sentence2 <- "I like apples more than Bob likes pineapple apple pens"

corpus <- c(sentence1, sentence2)
corpus <- sapply(corpus, function(x) strsplit(x, " "))
corpus_all <- unique(unlist(corpus))
corpus_table <- sapply(corpus, function(x) table(factor(x, levels=corpus_all)))

> corpus_table
          I like apples more than Bob likes apples I like apples more than Bob likes pineapple apple pens
I                                                1                                                      1
like                                             1                                                      1
apples                                           2                                                      1
more                                             1                                                      1
than                                             1                                                      1
Bob                                              1                                                      1
likes                                            1                                                      1
pineapple                                        0                                                      1
apple                                            0                                                      1
pens                                             0                                                      1

a <- corpus_table[,1]
b <- corpus_table[,2]

(a %*% b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

     [,1]
[1,]  0.8
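For convenience, the stringdist package also exposes a cosine method. Note that it is computed over character q-grams rather than word counts, so its values will not match the word-level numbers above:

```r
# q-gram cosine distance (a distance, so 0 means identical strings)
stringdist::stringdist(sentence1, sentence2, method = 'cosine', q = 2)
```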

Using string distance to clean addresses

Knowing that string metrics quantify "closeness", we can leverage them to clean free text such as street addresses by matching each address against a prepared reference dataset. Here I use a dataset of TMK parcels in Hawaii; the matched address then serves as a key for extracting property centroids as a "cheap" geocoding workaround.

Create dummy data

library(data.table)
library(sf)
library(text2vec)
library(Matrix)
library(stringr)
library(stringdist)

FILE_DIR <- 'SET YOUR WORKING DIRECTORY HERE'

address_pts <- st_read(paste0(FILE_DIR, 'input/Address_Points/Address_Points.shp'))

addresses_df <- data.table(
    real_address = c(
        '100 N BERETANIA ST 120 HONOLULU 96817',
        '3353 WAIALAE AVE 305 HONOLULU 96816',
        '94750 HIKIMOE ST WAIPAHU 96797'
    ),
    test_address = c(
        '100 NORT BERETANIA ST 120 HNOLULU 96817',
        '3353 WAIALE AVE 305 HONOLULU 96816',
        '94750 HIKIME ST WAIPAHU 96707'
    )
)

addresses_df[,':='(
    real_address = tolower(real_address),
    test_address = tolower(test_address)
)]

knitr::kable(addresses_df)
real_address                           test_address
100 n beretania st 120 honolulu 96817  100 nort beretania st 120 hnolulu 96817
3353 waialae ave 305 honolulu 96816    3353 waiale ave 305 honolulu 96816
94750 hikimoe st waipahu 96797         94750 hikime st waipahu 96707

Prepare reference table

setDT(address_pts)

address_pts[, c("City2", "Town") := tstrsplit(CITY, " / ", fixed=TRUE)]
address_pts[, match_address := paste(GEOCODEADD, City2, ZIP)]
address_pts[, match_address := tolower(match_address)]

STREETNAME         CITY                       ZIP    match_address
MALUHIA ST         Honolulu / Waialae Kahala  96816  3415 maluhia st honolulu 96816
SIERRA DR          Honolulu / Waialae Kahala  96816  3777 sierra dr honolulu 96816
MELEINOA PL        Waipahu                    96797  941172 meleinoa pl waipahu 96797
WAIANAE VALLEY RD  Waianae                    96792  851412 waianae valley rd waianae 96792
HAIEA PL           Ewa Beach                  96706  910121 haiea pl ewa beach 96706

Match and score distance

matchDistance <- function(input_address, dictionary_address){
    # compare a single input address against every address in the dictionary
    stopifnot(length(input_address) == 1)
    stopifnot(length(dictionary_address) > 1)
    # q-gram cosine distance between the input and each dictionary entry
    dist <- stringdist(input_address, dictionary_address, method='cosine')
    # return the best match and its distance, '_'-separated so both
    # survive a single-column assignment
    x <- paste(dictionary_address[which.min(dist)], min(dist), sep='_')
    return(x)
}
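A quick sanity check on a toy dictionary (made-up addresses, just to show the shape of the return value):

```r
dictionary <- c(
    '100 n beretania st honolulu 96817',
    '3353 waialae ave honolulu 96816'
)
# returns the closest dictionary entry and its distance, '_'-separated
matchDistance('3353 waiale ave honolulu 96816', dictionary)
```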

addresses_df[, dist_match_address := sapply(test_address, function(x)
  matchDistance(x, address_pts$match_address))][
    , c('dist_match_address', 'score') := tstrsplit(dist_match_address, '_', fixed=TRUE)]

real_address                           test_address                             dist_match_address                 score
100 n beretania st 120 honolulu 96817  100 nort beretania st 120 hnolulu 96817  100 n beretania st honolulu 96817  0.0196210645149207
3353 waialae ave 305 honolulu 96816    3353 waiale ave 305 honolulu 96816       3353 waialae ave honolulu 96816    0.0275772212632714
94750 hikimoe st waipahu 96797         94750 hikime st waipahu 96707            940750 hikimoe st waipahu 96797    0.0145417478534373

Find the geocode information for my cleaned addresses

geocodes_df <- address_pts[match_address %in% addresses_df$dist_match_address, .(match_address, geometry)]
match_address                      geometry
940750 hikimoe st waipahu 96797    c(-158.005220330059, 21.3849512708822)
3353 waialae ave honolulu 96816    c(-157.804440999228, 21.285100658459)
100 n beretania st honolulu 96817  c(-157.860876821049, 21.3142336937747)
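To attach the coordinates back onto the original rows, one final merge would do it (a sketch, assuming the column names defined above):

```r
# join the centroid geometry back onto the cleaned addresses by the
# matched address key
result <- merge(
    addresses_df, geocodes_df,
    by.x = 'dist_match_address', by.y = 'match_address'
)
```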