Complex Surveys with R

Dec 15, 2016 00:00 · 1060 words · 5 minute read


Public health is a broad and sometimes overreaching field. Consequently, I am asked, “what is public health?” more so than “what school did you go to (in Hawaii)?” With that said, I would like to start off with a quote by C.E. Taylor:

“The focus of public health is the community. The ‘patient’ is a whole population unit. The shift in professional orientation which occurs as the unit of attention moves from the individual to the group must be clearly recognized and explicitly stated because it has led to many misunderstandings in the past.”

Once it is clearly understood that public health is not brain surgery and according to the CDC more about zombie invasions, the next question asked is “so what does public health do if it doesn’t cure diabetes?” Such a loaded question. In a nutshell, the four core functions of public health are as follows:

  • Assessment
  • Policy development
  • Assurance
  • Communication

In other words, public health professionals rely on studies and surveys, such as the Behavioral Risk Factors Surveillance System (BRFSS), to understand health trends. Consequently, campaigns and/or policies are developed to break trends when needed.

# importing data

# data wrangling

# complex survey analysis

# eda


In laymen’s terms, many public health organizations and professionals cite the BRFSS when interested in health risk behaviors, health access and chronic disease prevalence. Introductions and examples have already been created on BRFSS for R, and have been featured on R-bloggers. Anthony Damico maintains a GitHub repository on downloading and storing complex surveys. I recommend his solution for those who are new to R and/or complex survey analysis and because there is a significant speed increase when working on analyzing the BRFSS alongside MonetDB.

In this example, I analyzed overweight or obese risk in the 2015 BRFSS series. The question I wanted to answer was if Hawaii should be written off as the healthiest state in the nation.

tf <- tempfile()
td <- tempdir()

xpt15 <- ""
download.file(xpt15, tf, mode='wb')
local.fn <- unzip(tf, exdir=td)

brfss15 <- read.xport(local.fn)

The BRFSS data set contains over 300 variables. In this case, we are only interested in looking at respondents’ reported BMI category and preferred race. We also need to extract additional variables such as weight, residency and strata.

# load(file="input/brfss15.Rda")
data_1 <- setDT(brfss15)[, .(X_PSU, X_LLCPWT, X_STSTR, X_STATE, X_RFBMI5, X_PRACE1)]

kable(head(data_1, 10), format="markdown")
2.015e+09 341.38485 11011 1 2 1
2.015e+09 108.06090 11011 1 2 1
2.015e+09 255.26480 11011 1 1 1
2.015e+09 341.38485 11011 1 2 1
2.015e+09 258.68222 11011 1 1 1
2.015e+09 256.51859 11011 1 2 1
2.015e+09 85.65975 11011 1 1 1
2.015e+09 545.78210 11011 1 2 1
2.015e+09 211.21030 11011 1 1 1
2.015e+09 215.47286 11011 1 2 1


The variables require some data wrangling. Both BMI category and preferred race are numerically encoded and may be difficult to interpret without a data dictionary. Discretizing variable levels with descriptive labels may also help with interpretation.

# source("01 Preprocess/Preprocess Features.R")

# subset data
data_1 <- setDT(brfss15)[, .(X_PSU, X_LLCPWT, X_STSTR, X_STATE, X_RFBMI5, X_PRACE1)]

# discretize BMI category
data_1 <- data_1[X_RFBMI5==2, X_RFBMI5_2:="Obese/Overweight"]
data_1 <- data_1[X_RFBMI5==1, X_RFBMI5_2:="Not Obese/Overweight"]
data_1 <- data_1[X_RFBMI5==9, X_RFBMI5_2:="Not Sure"]

data_1 <- data_1[X_RFBMI5_2!="Not Sure",]

data_1 <- data_1[, X_RFBMI5_2:=as.factor(X_RFBMI5_2)]

# discretize Race
data_1 <- data_1[X_PRACE1==1, X_PRACE1_2:="White"]
data_1 <- data_1[X_PRACE1==2, X_PRACE1_2:="Black"]
data_1 <- data_1[X_PRACE1==3, X_PRACE1_2:="AIAN"]
data_1 <- data_1[X_PRACE1==4, X_PRACE1_2:="Asian"]
data_1 <- data_1[X_PRACE1==5, X_PRACE1_2:="NHOPI"]
data_1 <- data_1[X_PRACE1==6, X_PRACE1_2:="Other"]
data_1 <- data_1[X_PRACE1==7, X_PRACE1_2:="No Preferred"]
data_1 <- data_1[X_PRACE1==8, X_PRACE1_2:="Multiracial"]
data_1 <- data_1[X_PRACE1==77, X_PRACE1_2:="Not Sure"]
data_1 <- data_1[X_PRACE1==99, X_PRACE1_2:="Refused"]

data_1 <- data_1[X_PRACE1_2!="Not Sure",]

data_1 <- data_1[, X_PRACE1_2:=as.factor(X_PRACE1_2)]


The survey design can now be specified with both the survey and preprocessed variables. This step may take some time given the size of this data set, and so the survey object is subset on the state of Hawaii for the rest of the analysis.

# source("02 Modeling/Create Survey Object.R")

dsurvey <- svydesign(id        = ~X_PSU,
                     strata    = ~X_STSTR,
                     weights   = ~X_LLCPWT,
                     nest      = TRUE,
                     data      = data_1)

dsurvey_2 <- subset(dsurvey, X_STATE==15)

Statewide Estimates

# calculate estimates
hawaii_est <- data.frame(
  svytotal(~X_RFBMI5_2, dsurvey_2, na.rm=TRUE),
  svymean(~X_RFBMI5_2, dsurvey_2, na.rm=TRUE, level=0.95),
  confint(svymean(~X_RFBMI5_2, dsurvey_2, na.rm=TRUE, level=0.95))

# create group column
hawaii_est$group <- str_replace(row.names(hawaii_est), "X_RFBMI5_2", "")
colnames(hawaii_est) <- c("N", "N (SE)", "%", "% (SE)", "% Lower CI (95%)", "% Upper CI (95%)", "Group")

# plot results
ggplot(data=hawaii_est, aes(x=Group, y=`%`, group=Group, fill=Group, ymax=`% Upper CI (95%)`, ymin=`% Lower CI (95%)`)) +
  geom_bar(stat="identity", position="dodge") +
  geom_errorbar(position=position_dodge(width=0.9), width=0.1) +
  scale_y_continuous(labels=percent) +
  geom_text(aes(y=0, label=comma(round(N, 0))), position=position_dodge(width=0.9), vjust=-1) +
  labs(title="Overweight or Obese, Statewide",
       subtitle="State of Hawaii BRFSS 2015",
       y="Point Estimate (%)") +

statewide estimates

Stratified By Preferred Race

hawaii_race_est <- merge(
  svyby(~X_RFBMI5_2, ~X_PRACE1_2, svymean, design=dsurvey_2, na.rm=TRUE, vartype=c('ci')),
  svyby(~X_RFBMI5_2, ~X_PRACE1_2, svytotal, design=dsurvey_2, na.rm=TRUE),

colnames(hawaii_race_est) <- c("Group",
                               "Not Obese/Overweight (%)",
                               "Obese/Overweight (%)",
                               "Not Obese/Overweight (%) Lower CI (95%)",
                               "Obese/Overweight (%) Lower CI (95%)",
                               "Not Obese/Overweight (%) Upper CI (95%)",
                               "Obese/Overweight (%) Upper CI (95%)",
                               "Not Obese/Overweight (N)",
                               "Obese/Overweight (N)",
                               "Not Obese/Overweight (SE)",
                               "Obese/Overweight (SE)")

ggplot(data=hawaii_race_est, aes(x=reorder(Group, `Obese/Overweight (%)`), y=`Obese/Overweight (%)`, group=Group, fill=Group,
                                 ymax=`Obese/Overweight (%) Upper CI (95%)`, ymin=`Obese/Overweight (%) Lower CI (95%)`)) +
  geom_bar(stat="identity", position="dodge") +
  geom_errorbar(position=position_dodge(width=0.9), width=0.1) +
  scale_y_continuous(labels=percent) +
  geom_text(aes(y=0, label=comma(round(`Obese/Overweight (N)`, 0))), position=position_dodge(width=0.9), size=3, vjust=-1) +
  labs(title="Overweight or Obese, by Preferred Race",
       subtitle="State of Hawaii BRFSS 2015",
       y="Point Estimate (%)") +
  geom_hline(yintercept=hawaii_est$`%`[2], col="black", lty=2, size=1) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle=90, hjust=1, vjust=0))

statewide estimates by preferred race

Although more than half of residents reported to be overweight or obese, Hawaii still has one of the lowest rates in the nation (0.57% vs. 64.6%). Yet when stratifying by preferred race, Hawaii actually has some of the highest prevalence rates for overweight and obesity risk factors in the nation. In 2015, Native Hawaiians and Other Pacific Islanders (NHOPI) overweight or obese risk was at 79.17% (95% confidence interval [CI]: 71.88-86.46%).


So why was this exercise important? The two charts tell a compelling story. If we were to accept only the first cut of the data, then policies and initiatives may overlook Hawaii. However, the second chart reveals that NHOPIs are at extreme risk for negative health outcomes.

Public health attempts to understand trends and/or driving factors that cause morbidity or mortality. These analyses then help to drive interventions and policies that may potentially make the greatest impact. Tools like R can assist with this process as it democratizes data and analysis so that different perspectives can be added to the overall discussion.