Google Knows My Pokemon Habits

Apr 5, 2018 00:00 · 785 words · 4 minute read

Introduction

Google recently allowed its users to download the data it has collected on them. This service is called Takeout. The data covers the Google ecosystem: photos, Hangouts, email, and, surprisingly, locations. The exercises in this notebook focus on the location data, and may stimulate a few conspiracy theories.

Data

To obtain your personal data from Google:

  1. Visit the Takeout site.
  2. Select sets to extract (e.g., location history).
  3. Choose how you would like to receive the data.
  4. Confirm the request and wait awhile for the download to be available.

The location data is delivered in JSON format by default. We can read it into R using the jsonlite package.
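To make the shape of the file concrete, here is a minimal sketch that parses a tiny hand-written snippet. The field names (`locations`, `timestampMs`, `latitudeE7`, `longitudeE7`, `accuracy`) are the ones used elsewhere in this post; the values are invented, and storing `timestampMs` as text is an assumption. A real export will contain more fields.

```r
library(jsonlite)

# Synthetic stand-in for 'Location History.json'; all values are invented.
sample_json <- '{
  "locations": [
    {"timestampMs": "1509890417000", "latitudeE7": 213069440,
     "longitudeE7": -1578583330, "accuracy": 6},
    {"timestampMs": "1509890401000", "latitudeE7": 213069450,
     "longitudeE7": -1578583340, "accuracy": 4}
  ]
}'

# fromJSON() accepts a JSON string as well as a file path.
sample <- fromJSON(sample_json)

# An array of records with matching keys is simplified to a data.frame,
# which is why the result can later be handed to setDT().
str(sample$locations)
```

Because the records come back as one data.frame, no manual looping over list elements is needed before the preprocessing step.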

library(jsonlite)
library(data.table)
library(lubridate)
library(zoo)
library(ggplot2)
library(ggmap)
library(viridis)

FILE_DIR <- 'F:'
loc_json <- fromJSON(sprintf('%s/Location History.json', FILE_DIR))

Preprocess

The data needs light preprocessing. The following code converts the timestamps and geocodes into usable units (epoch milliseconds to date-times, E7 integers to decimal degrees) and derives a few time features.

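# Convert epoch milliseconds to date-times and E7 integers to decimal degrees,
# then derive the calendar features used for grouping in the plots.
loc_df <- setDT(loc_json$locations)[, ':='(
  dt   = ymd_hms(as.POSIXct(as.numeric(timestampMs)/1000, origin = "1970-01-01")),
  lat  = latitudeE7 / 1e7,
  lon  = longitudeE7 / 1e7
)][, ':='(
  date  = as.Date(dt),
  year  = year(dt),
  month = month(dt),
  hour  = hour(dt),
  wday  = wday(dt, label=TRUE)
)]

# Drop the raw columns that are no longer needed.
loc_df[, ':='(
  timestampMs = NULL,
  latitudeE7 = NULL,
  longitudeE7 = NULL,
  activity = NULL
  )]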

head(loc_df)
accuracy velocity heading altitude verticalAccuracy                  dt lat lon       date year month hour wday
       6        0      NA       24               NA 2017-11-05 14:00:17 XXX XXX 2017-11-05 2017    11   14  Sun
       4        9     115       22               NA 2017-11-05 14:00:01 XXX XXX 2017-11-05 2017    11   14  Sun
      12        8     117       18               NA 2017-11-05 13:59:45 XXX XXX 2017-11-05 2017    11   13  Sun

I obfuscated the geocodes here for privacy reasons.
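As a worked example of the two unit conversions, the sketch below pushes a single made-up record through base R only. The values are invented, and setting the time zone explicitly to UTC is my assumption for reproducibility; the preprocessing code in this post does not set one.

```r
# One synthetic record; the values are invented for illustration.
timestamp_ms <- "1509890417000"  # milliseconds since 1970-01-01 UTC, as text
latitude_e7  <- 213069440        # decimal degrees multiplied by 1e7
longitude_e7 <- -1578583330

# Milliseconds -> seconds -> POSIXct (time zone fixed to UTC by assumption).
dt <- as.POSIXct(as.numeric(timestamp_ms) / 1000,
                 origin = "1970-01-01", tz = "UTC")

# E7 integers -> decimal degrees.
lat <- latitude_e7 / 1e7
lon <- longitude_e7 / 1e7

format(dt, "%Y-%m-%d %H:%M:%S")  # "2017-11-05 14:00:17"
c(lat = lat, lon = lon)
```

Dividing by 1e7 keeps seven decimal places of a degree, which is finer than the accuracy a phone typically reports.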

Exploratory data analysis

Exploratory data analysis, or EDA, can be used here to understand what data Google collects and when this collection occurs.

Points sent to Google (year-month)

My phone regularly sent location data to Google. Broken down by time of day and day of week, it sent the most data on weekends. The heat map also reflects my working hours, which is usually when I have my phone on me.

loc_ts <- loc_df[year > 2013, .(N = .N), by=.(date)]
ggplot(loc_ts, aes(x=date, y=N, group=1)) +
  geom_line() +
  geom_smooth() +
  scale_x_date(date_breaks=c('months'), date_labels=c('%Y %m')) +
  labs(x='Year Month', y = 'Total Points') +
  theme_bw()

timeseries_points_sent

Points sent to Google (day of week and hour)

loc_heat <- loc_df[, .(N = .N), by=.(wday, hour)]
ggplot(loc_heat, aes(x=hour, y=wday, fill=N)) +
  geom_tile(color='white', size=0.1) +
  scale_fill_viridis()

heatmap_points_sent
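The weekend-versus-working-day reading of the heat map can also be checked numerically. The following is a minimal sketch on synthetic, evenly spaced timestamps (so both groups come out equal by construction); the `toy`, `wday_num`, and `weekend` names are hypothetical and not part of the original analysis.

```r
library(data.table)

# Synthetic timestamps (invented): one fix every 20 minutes for two weeks,
# starting on a Sunday.
toy <- data.table(
  dt = seq(as.POSIXct("2017-11-05 00:00:00", tz = "UTC"),
           by = "20 min", length.out = 3 * 24 * 14)
)

# Grouping keys derived with base R, so no locale-specific labels are involved.
toy[, hour := as.integer(format(dt, "%H"))]
toy[, wday_num := as.POSIXlt(dt)$wday]     # 0 = Sunday ... 6 = Saturday
toy[, weekend := wday_num %in% c(0L, 6L)]

# Same aggregation pattern as loc_heat: .N is the row count within each group.
toy_heat <- toy[, .(N = .N), by = .(wday_num, hour)]

# Average number of points per calendar day, weekend vs. working day.
per_day <- toy[, .(points_per_day = .N / uniqueN(as.Date(dt))), by = weekend]
per_day
```

Normalising by the number of distinct days matters here: there are five working days for every two weekend days, so raw totals alone would understate weekend activity.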

Where I played Pokemon Go

The geospatial map was somewhat humorous. The high density areas are locations of Pokemon Go gyms. Our family spent significant time playing the game while our son went through a Pokemon phase.

honolulu <- get_map(location = 'Honolulu', zoom=12)
ggmap(honolulu) +
  geom_density2d(data=loc_df, aes(x=lon, y=lat), size = 0.2) +
  stat_density2d(data=loc_df, aes(x=lon, y=lat, fill=..level.., alpha=..level..),
                 size=0.01, bins=16, geom = "polygon") +
  scale_fill_viridis() +
  scale_alpha(range = c(0, 0.3), guide = FALSE) +
  geom_point(data=loc_df, aes(x=lon, y=lat), alpha=0.0025, color='red') +
  theme(legend.position = "right") +
  labs(
    x = "Longitude",
    y = "Latitude",
    title = "Historical points",
    subtitle = "Honolulu, HI",
    caption = "\nDensity map")

density_map
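The contours drawn by `stat_density2d` come from a two-dimensional kernel density estimate, which ggplot2 computes with `MASS::kde2d`. As a rough sketch of what "high density area" means, the code below runs the same estimator directly on synthetic points and reads off the grid cell with the highest density. The hotspot coordinates and point counts are invented.

```r
library(MASS)  # provides kde2d(); bundled with standard R distributions

set.seed(42)
# Synthetic points (invented): a tight cluster plus diffuse background noise.
hotspot_lon <- -157.85
hotspot_lat <- 21.30
lon <- c(rnorm(500, hotspot_lon, 0.002), runif(200, -157.95, -157.75))
lat <- c(rnorm(500, hotspot_lat, 0.002), runif(200, 21.25, 21.40))

# Kernel density estimate evaluated on a 100 x 100 grid.
# dens$z[i, j] is the estimated density at (dens$x[i], dens$y[j]).
dens <- kde2d(lon, lat, n = 100)

# Locate the grid cell with the highest estimated density.
peak <- which(dens$z == max(dens$z), arr.ind = TRUE)[1, ]
peak_xy <- c(lon = dens$x[peak[["row"]]], lat = dens$y[peak[["col"]]])
peak_xy
```

On real data, sorting the grid cells by `dens$z` would give a ranked list of candidate hotspots to compare against known places.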

My commute route

As you can see, I tend to stay near where I reside. The data was accurate enough to reveal my commute pattern once working days are compared with weekends.

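# Note: `aiea` (a ggmap base map) and `holidays` (a vector of dates to exclude)
# are not defined anywhere in this post. The two definitions below are
# placeholder assumptions so the block can run; substitute your own.
aiea <- get_map(location = 'Aiea, HI', zoom = 12)
holidays <- as.Date(c('2017-12-25'))
ggmap(aiea) +
  geom_point(data=loc_df[!date %in% holidays & month==12,], aes(x=lon, y=lat, color=velocity), alpha=0.3) +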
  theme(legend.position = 'right') +
  facet_wrap(~wday) +
  labs(
    x = 'Longitude',
    y = 'Latitude',
    title = 'Historical points',
    subtitle = 'Honolulu, HI',
    caption = '\nPlot map') +
  theme(axis.text.x=element_text(angle = -90, hjust = 0))

point_map_velocity
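One way to put a number on "staying near where I reside" is to measure each fix's distance from a reference point. Below is a small sketch using the haversine formula in base R. The `home` coordinates, the sample points, and the 5 km threshold are all hypothetical; this is an illustration, not part of the original analysis.

```r
# Great-circle distance in kilometres via the haversine formula.
# Inputs are in decimal degrees; r is the mean Earth radius in km.
haversine_km <- function(lat1, lon1, lat2, lon2, r = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  # pmin() guards against rounding pushing sqrt(a) slightly above 1.
  2 * r * asin(pmin(1, sqrt(a)))
}

# Hypothetical reference point and synthetic fixes (all invented).
home <- c(lat = 21.38, lon = -157.93)
pts <- data.frame(
  lat = c(21.38, 21.39, 21.31, 21.33),
  lon = c(-157.93, -157.94, -157.86, -157.92)
)

pts$km_from_home <- haversine_km(home[["lat"]], home[["lon"]], pts$lat, pts$lon)

# Share of fixes within 5 km of the reference point.
share_near <- mean(pts$km_from_home <= 5)
share_near
```

Applied to `loc_df`, the same function could be evaluated per weekday to compare how far the working-day and weekend fixes spread.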

What did I do at Ka Makana Alii mall

The four separate clusters correspond to real events. The right-most cluster is from having dinner with my in-laws at California Pizza Kitchen; its higher density of points probably reflects the longer time spent at that location. The cluster near Bath and Body Works is from my partner spending a lot of time at the store over the Christmas holidays purchasing gifts. I also spent an entire morning having coffee with a friend at Coffee Bean and Tea Leaf nearby. The top-left cluster is where I got a haircut at Supercuts, and the bottom-left cluster marks the entryway I use most often when entering this mall. Okay. Officially creeped out.

ggmap(aiea) +
  geom_point(data=loc_df[!date %in% holidays & month==12,], aes(x=lon, y=lat, color=velocity), alpha=0.3) +
  theme(legend.position = 'right') +
  facet_wrap(~wday) +
  labs(
    x = 'Longitude',
    y = 'Latitude',
    title = 'Historical points',
    subtitle = 'Honolulu, HI',
    caption = '\nPlot map') +
  theme(axis.text.x=element_text(angle = -90, hjust = 0))

point_map_velocity
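The clusters above were identified by eye. As a hedged alternative, the sketch below groups synthetic fixes with hierarchical clustering (`hclust` plus `cutree`, a technique I am swapping in, not something the original analysis used) and uses the share of points per cluster as a rough proxy for dwell time, on the assumption that fixes arrive at a roughly constant rate. All coordinates and sizes are invented.

```r
set.seed(1)
# Synthetic fixes (invented): four well-separated groups of different sizes.
centers <- data.frame(lon = c(-158.086, -158.084, -158.086, -158.082),
                      lat = c(21.334, 21.334, 21.332, 21.333))
sizes <- c(20, 40, 60, 120)
toy <- data.frame(
  lon = rnorm(sum(sizes), rep(centers$lon, sizes), 1e-4),
  lat = rnorm(sum(sizes), rep(centers$lat, sizes), 1e-4)
)

# Complete-linkage hierarchical clustering on Euclidean distances in degrees.
# Cutting at k = 4 mirrors the four clusters described in the text.
cluster <- cutree(hclust(dist(toy[, c("lon", "lat")])), k = 4)

# Points per cluster, and each cluster's share of all fixes.
counts <- table(cluster)
counts
round(prop.table(counts), 2)
```

Treating degrees as planar distances is a simplification that is reasonable only over an area as small as a single mall; over larger areas a proper geographic distance would be the safer choice.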

Conclusion

There are other ways to slice and dice the data; however, further exploration started to feel a little “too close to home”. Nonetheless, the code provided here should be more than sufficient to give you the heebie-jeebies.