String Distances

Mar 22, 2018

Introduction

A string distance gives a quantifiable measure of how close two strings are to one another.

String metrics: elements versus content

elements-approach

A commonly used string metric is the Levenshtein distance, which compares two strings character by character and counts the minimum number of single-character edits (insertions, deletions and substitutions) needed to turn one into the other. The more edits needed to create a match, the further apart the two strings are. In the example below, the misspellings of “apples” are still closer to it than the word “oranges” is.

# counting the deletions, insertions and substitutions necessary to turn word2 into word1
word1 <- "apples"
word2 <- "lappes"

stringdist::stringdist(word1, word2, method='lv')
[1] 2

# including an extra character that needs to be deleted
word1 <- "apples"
word2 <- "lappesp"

stringdist::stringdist(word1, word2, method='lv')
[1] 3

# comparing against a different word altogether
word3 <- "oranges"
stringdist::stringdist(word1, word3, method='lv')
[1] 5
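
To make the edit counting concrete, here is a minimal base-R sketch of the classic dynamic-programming approach to Levenshtein distance. The function name `levenshtein` is made up for illustration, and this is not how the stringdist package is implemented internally; it only assumes unit cost for every insertion, deletion and substitution.

```r
# Hypothetical helper (not part of stringdist): Levenshtein distance
# via dynamic programming, with every edit costing 1.
levenshtein <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  n <- length(x)
  m <- length(y)

  # d[i + 1, j + 1] is the distance between the first i characters of a
  # and the first j characters of b
  d <- matrix(0L, nrow = n + 1, ncol = m + 1)
  d[, 1] <- 0:n  # delete all i characters
  d[1, ] <- 0:m  # insert all j characters

  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      cost <- if (x[i] == y[j]) 0L else 1L
      d[i + 1, j + 1] <- min(
        d[i, j + 1] + 1L,  # deletion
        d[i + 1, j] + 1L,  # insertion
        d[i, j] + cost     # substitution, or a free match
      )
    }
  }
  d[n + 1, m + 1]
}

levenshtein("apples", "lappes")   # 2, as in the example above
levenshtein("apples", "lappesp")  # 3
levenshtein("apples", "oranges")  # 5
```

Base R's `utils::adist` computes a generalized Levenshtein distance with unit costs by default, so it can serve as a cross-check when stringdist is not installed.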

content-approach

In many cases we want to compare strings of several words, such as sentences, because we care about their content. The Levenshtein distance is inadequate here: it sums character-level edits across the entire string, so it reacts to spelling and word order rather than to content, and we end up comparing apples to oranges. When comparing sentences, it is more natural to calculate the cosine similarity between them. Each sentence is represented as a vector of word counts, and cosine similarity depends only on the angle between the two vectors, not on their lengths; in other words, it is a measurement of orientation and yields a stronger measure of “closeness” the more words two strings share.
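
As a small illustration of why character-level edits can mislead for sentences, the sketch below compares two made-up strings that contain exactly the same words in a different order. It uses base R's `utils::adist` for the edit distance; the example strings and variable names are assumptions for illustration only.

```r
# Two made-up strings with identical words in a different order
s1 <- "apples and oranges"
s2 <- "oranges and apples"

# Character-level edit distance: greater than zero, because many
# characters have to change to turn one string into the other
char_distance <- drop(utils::adist(s1, s2))

# Word-level view: once order is ignored, the two strings are identical
words1 <- sort(strsplit(s1, " ")[[1]])
words2 <- sort(strsplit(s2, " ")[[1]])
same_words <- identical(words1, words2)

char_distance
same_words
```

A measure built on word counts treats these two strings as a perfect match, which is closer to what we mean by comparing content.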

# calculating the cosine similarity between two strings
sentence1 <- "I like apples more than Bob likes apples"
sentence2 <- "I like bananas more than my friend but we both like apples"

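# split each sentence into words
corpus <- c(sentence1, sentence2)
corpus <- sapply(corpus, function(x) strsplit(x, " "))
# vocabulary: every distinct word across both sentences
corpus_all <- unique(unlist(corpus))
# count how often each vocabulary word occurs in each sentence
corpus_table <- sapply(corpus, function(x) table(factor(x, levels=corpus_all)))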

> corpus_table
        I like apples more than Bob likes apples I like bananas more than my friend but we both like apples
I                                              1                                                          1
like                                           1                                                          2
apples                                         2                                                          1
more                                           1                                                          1
than                                           1                                                          1
Bob                                            1                                                          0
likes                                          1                                                          0
bananas                                        0                                                          1
my                                             0                                                          1
friend                                         0                                                          1
but                                            0                                                          1
we                                             0                                                          1
both                                           0                                                          1

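# word-count vectors for the two sentences
a <- corpus_table[,1]
b <- corpus_table[,2]

# cosine similarity: dot product divided by the product of the vector lengths
(a %*% b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))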
         [,1]
[1,] 0.591608

# calculating the cosine similarity between two strings where the sentences are more "similar"
sentence1 <- "I like apples more than Bob likes apples"
sentence2 <- "I like apples more than Bob likes pineapple apple pens"

corpus <- c(sentence1, sentence2)
corpus <- sapply(corpus, function(x) strsplit(x, " "))
corpus_all <- unique(unlist(corpus))
corpus_table <- sapply(corpus, function(x) table(factor(x, levels=corpus_all)))

> corpus_table
          I like apples more than Bob likes apples I like apples more than Bob likes pineapple apple pens
I                                                1                                                      1
like                                             1                                                      1
apples                                           2                                                      1
more                                             1                                                      1
than                                             1                                                      1
Bob                                              1                                                      1
likes                                            1                                                      1
pineapple                                        0                                                      1
apple                                            0                                                      1
pens                                             0                                                      1

a <- corpus_table[,1]
b <- corpus_table[,2]

(a %*% b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
     [,1]
[1,]  0.8
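
Since both examples repeat the same steps, they could be wrapped in a small helper. The sketch below is one possible way to do that in base R; the name `cosine_similarity` is made up for illustration, and it assumes words are separated by single spaces with no punctuation or case normalization, as in the sentences above.

```r
# Hypothetical helper: cosine similarity between the word-count
# vectors of two strings, splitting on single spaces.
cosine_similarity <- function(x, y) {
  words_x <- strsplit(x, " ", fixed = TRUE)[[1]]
  words_y <- strsplit(y, " ", fixed = TRUE)[[1]]

  # shared vocabulary so both count vectors have the same positions
  vocab <- unique(c(words_x, words_y))
  a <- as.numeric(table(factor(words_x, levels = vocab)))
  b <- as.numeric(table(factor(words_y, levels = vocab)))

  # dot product divided by the product of the vector lengths
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

cosine_similarity(
  "I like apples more than Bob likes apples",
  "I like bananas more than my friend but we both like apples"
)  # about 0.5916, matching the first example

cosine_similarity(
  "I like apples more than Bob likes apples",
  "I like apples more than Bob likes pineapple apple pens"
)  # 0.8, matching the second example
```

Note that this treats “apple” and “apples” as different words; whether to normalize case, punctuation or word forms first is a separate design choice.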