Statistically Significant Trend Analysis

Screenshot

Introduction

The purpose of this analysis is to quantify statistically significant changes in multi-year trends using complex survey design. The analysis was motivated by a CDC publication titled Conducting Trend Analyses of YRBSS Data. The goal of this analysis is to replicate exact statistics presented in the paper using open source methods. In this example, we examine change over time for smoking prevalence among youth.

Since originally completing this analysis in 2015, CDC recently updated the whitepaper in 2016 to assess a different question. The referenced report, therefore, was re-directed to an archived copy stored on the Wayback Machine website.

Data

The analysis was conducted using multiple years (i.e. 1991-2011) of Youth Risk Behavioral Surveillance System (YRBSS) microdata, which is a surveillance tool by the US CDC to monitor health-risk behaviors that contribute to causes of death and disability among youth and adults, such as tobacco smoking.

# harmonize and stack
y <- lapply(seq(1991, 2011, 2 ), function(y) {
  load(paste0("YRBSS/yrbs", y, ".rda"))
  x$year <- y
  x
})
y <- plyr::rbind.fill(y)

ELSE <- TRUE

Dependent Variable

The outcome for this analysis is the dichotomous risk behavior: did the respondent ever smoke?

y <- y %>% 
  select(q2, q3, q4, q23, q26, q27, q28, q29, year, psu, stratum, weight, raceeth) %>%
  mutate_all(funs(as.numeric), which(sapply(., is.character))) %>%
  mutate(
    smoking = case_when(
      year==1991 ~ q23,
      year %in% c(1993, 2001:2009) ~ q28,
      year %in% 1995:1997 ~ q26,
      year %in% 1999 ~ q27,
      year %in% 2011 ~ q29),
    # > table(y2$smoking)
    #     1     2 
    # 94679 58775
    raceeth = case_when(
      year %in% 1991:1997 & q4 %in% 1:3 ~ q4,
      year %in% 1991:1997 & q4 %in% 4:6 ~ 4,
      
      year %in% 1999:2005 & q4 %in% 6 ~ 1,
      year %in% 1999:2005 & q4 %in% 3 ~ 2,
      year %in% 1999:2005 & q4 %in% c(4, 7) ~ 3,
      year %in% 1999:2005 & q4 %in% c(1, 2, 5, 8) ~ 4,

      year %in% 2007:2011 & raceeth %in% 5 ~ 1,
      year %in% 2007:2011 & raceeth %in% 3 ~ 2,
      year %in% 2007:2011 & raceeth %in% c(6, 7) ~ 3,
      year %in% 2007:2011 & raceeth %in% c(1, 2, 4, 8) ~ 4),
    
    grade = case_when(
      q3==1 ~ 1,
      q3==2 ~ 2,
      q3==3 ~ 3,
      q3==4 ~ 4),
    
    sex = case_when(
      q2 %in% 1:2 ~ q2)
    ) %>%
  select(year, psu, stratum, weight, smoking, raceeth, sex, grade)

# recode ref groups
y <- y %>%
  mutate(
    sex = fct_relevel(factor(sex), "2", after = 0),
    smoking = fct_relevel(factor(smoking), "1", after = 0),
    raceeth = fct_relevel(factor(raceeth), "1", after = 0),
    grade = fct_relevel(factor(grade), "1", after = 0)
  )

Time Variable

Linear transformations of the time variable were constructed for the analysis to model various potential trend outcomes to be tested to ascertain if and when a change in trend occurred. If, for example, the p-value for the quadratic time variable is statistically significant at the 0.05 level, then the model suggests a quadratic change.

# extract a linear contrast vector of length equivalent to number of years analyzed
c11l <- contr.poly(11)[, 1]

# extract a quadratic (squared) contrast vector
c11q <- contr.poly(11)[, 2]

# extract a cubic contrast vector
c11c <- contr.poly(11)[, 3]

# for each record in the data set, tack on the linear, quadratic, and cubic contrast value
# these contrast values will serve as replacement for the linear `year` variable in any regression.

# year^1 term (linear)
y$t11l <- c11l[match(y$year, seq(1999, 2011, 2))]

# year^2 term (quadratic)
y$t11q <- c11q[match(y$year, seq(1999, 2011, 2))]

# year^3 term (cubic)
y$t11c <- c11c[match(y$year, seq(1999, 2011, 2))]

Model

Complex Survey Design

A nested strata survey design was created to account for the stacked set by year.

# construct a complex sample survey design object
# stacking multiple years and accounting for `year` in the nested strata
des <- svydesign(
        id=~psu, 
        strata=~interaction(stratum, year),
        data=y, 
        weights=~weight, 
        nest=T)

des_ns <- subset(des, !is.na(smoking))

Fit Trend Changes

A logistic regression was used to test for significant change over time because of the dichotomous outcome. Iterations of fit were conducted on the various time variables (e.g. the model was fit on the linear, quadratic and then cubic time variable) for a total of three models. All models included variables to control for sex, race/ethnicity, and grade. As noted by the authors, only the highest-order time variable should be considered as being valid and interpretable.

unadjusted_smok <- svyby(~smoking, ~year, svymean, design=des_ns, vartype=c('ci', 'se'))

# coerce that result into a `data.frame` object
data.frame(unadjusted_smok) %>%
  ggplot(aes(x=year, y=smoking1)) +
  geom_point() + 
  geom_text(aes(label=round(smoking1, 3)), check_overlap=T, vjust=-.5, size=3, fontface="bold", position=position_dodge(width=.9)) +
  geom_errorbar(aes(ymax=ci_u.smoking1, ymin=ci_l.smoking1), width=.2) +
  geom_line() +
  labs(title="Unadjusted smoking prevalence 1991-2011", subtitle="", x="Year", y="Rate") +
  theme_bw() +
  theme(plot.background   = element_rect(fill="transparent", color="transparent"),
        panel.background  = element_rect(fill='transparent'),
        panel.border      = element_blank(),
        panel.grid.major  = element_blank(),
        panel.grid.minor  = element_blank(),
        legend.key        = element_rect(colour=NA, fill=NA),
        legend.background = element_rect(fill="transparent"))

center

Linear
linyear <- svyglm(I(smoking==1) ~ sex+raceeth+grade+t11l, 
                  design=subset(des_ns, smoking %in% 1:2), 
                  family=quasibinomial)
summary(linyear)
## 
## Call:
## svyglm(formula = I(smoking == 1) ~ sex + raceeth + grade + t11l, 
##     design = subset(des_ns, smoking %in% 1:2), family = quasibinomial)
## 
## Survey design:
## subset(des_ns, smoking %in% 1:2)
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.41445    0.04563  -9.083  < 2e-16 ***
## sex1        -0.09318    0.02326  -4.005 7.99e-05 ***
## raceeth2    -0.05605    0.04929  -1.137  0.25647    
## raceeth3     0.19022    0.04298   4.426 1.39e-05 ***
## raceeth4    -0.14977    0.05298  -2.827  0.00505 ** 
## grade2       0.26058    0.03134   8.314 4.41e-15 ***
## grade3       0.39964    0.03708  10.779  < 2e-16 ***
## grade4       0.65188    0.03893  16.744  < 2e-16 ***
## t11l        -1.96550    0.11439 -17.183  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasibinomial family taken to be 1.002984)
## 
## Number of Fisher Scoring iterations: 4
Quadratic
quadyear <- svyglm(I(smoking==1) ~ sex+raceeth+grade+t11l+t11q, 
        design=subset(des_ns, smoking %in% 1:2), 
        family=quasibinomial)
summary(quadyear)
## 
## Call:
## svyglm(formula = I(smoking == 1) ~ sex + raceeth + grade + t11l + 
##     t11q, design = subset(des_ns, smoking %in% 1:2), family = quasibinomial)
## 
## Survey design:
## subset(des_ns, smoking %in% 1:2)
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.23972    0.07854  -3.052  0.00250 ** 
## sex1        -0.09288    0.02327  -3.991 8.45e-05 ***
## raceeth2    -0.05566    0.04935  -1.128  0.26037    
## raceeth3     0.19094    0.04253   4.489 1.06e-05 ***
## raceeth4    -0.16106    0.05307  -3.035  0.00264 ** 
## grade2       0.26041    0.03139   8.297 5.03e-15 ***
## grade3       0.39890    0.03716  10.736  < 2e-16 ***
## grade4       0.65077    0.03897  16.700  < 2e-16 ***
## t11l        -1.24235    0.28336  -4.384 1.66e-05 ***
## t11q         0.51001    0.19710   2.588  0.01019 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasibinomial family taken to be 1.003261)
## 
## Number of Fisher Scoring iterations: 4
Cubic
cubyear <-svyglm(I(smoking==1) ~ sex+raceeth+grade+t11l+t11q+t11c, 
        design=subset(des_ns, smoking %in% 1:2), 
        family=quasibinomial)
summary(cubyear)
## 
## Call:
## svyglm(formula = I(smoking == 1) ~ sex + raceeth + grade + t11l + 
##     t11q + t11c, design = subset(des_ns, smoking %in% 1:2), family = quasibinomial)
## 
## Survey design:
## subset(des_ns, smoking %in% 1:2)
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.28320    0.21756  -1.302  0.19412    
## sex1        -0.09284    0.02325  -3.993 8.41e-05 ***
## raceeth2    -0.05593    0.04944  -1.131  0.25899    
## raceeth3     0.19099    0.04253   4.490 1.05e-05 ***
## raceeth4    -0.16157    0.05350  -3.020  0.00277 ** 
## grade2       0.26036    0.03137   8.299 4.99e-15 ***
## grade3       0.39884    0.03715  10.734  < 2e-16 ***
## grade4       0.65072    0.03897  16.700  < 2e-16 ***
## t11l        -1.43510    0.96744  -1.483  0.13913    
## t11q         0.36885    0.70758   0.521  0.60260    
## t11c        -0.06335    0.31997  -0.198  0.84320    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasibinomial family taken to be 1.003279)
## 
## Number of Fisher Scoring iterations: 4

Break Point

A final model including the highest-order significant time variable is fit to the data and the marginal predictions are obtained for each datum (i.e. year). A datum point with a large standard error represents higher uncertainty and is assigned a lower weight for the piece-wise regression model. The segmented package utilizes bootstrap restarting to help the model escape local minima, which is especially true when the sample size and/or signal-to-noise ratio is low. Consequently, final estimates may slightly vary.

marginals <- svyglm(formula=I(smoking==1) ~ sex+raceeth+grade,
        design=des_ns, 
        family=quasibinomial)

means_for_joinpoint <- svypredmeans(marginals, ~factor(year))

means_for_joinpoint <- as.data.frame(means_for_joinpoint)

means_for_joinpoint$year <- as.numeric(rownames(means_for_joinpoint))

means_for_joinpoint <- means_for_joinpoint[order(means_for_joinpoint$year), ]

names(means_for_joinpoint) <- c("mean", "se", "yr")

another_plot <- means_for_joinpoint
another_plot$ci_l.mean <- another_plot$mean-(1.96 * another_plot$se)
another_plot$ci_u.mean <- another_plot$mean+(1.96 * another_plot$se)

ggplot(means_for_joinpoint, aes(x=yr, y=mean)) +
    geom_label_repel(aes(label=paste0('Mean:', round(mean, 3), ', SE:', round(se, 3))),
                     size=3.5,
                     box.padding=unit(1, "lines"),
                     color = 'white',
                     segment.color = 'gray',
                     fill='lightseagreen') +
    geom_line() +
    geom_point(aes(size=se)) +
    theme_bw() +
    labs(title="Standard Error at each time point", subtitle="", x="Year", y="Rate") +
  theme_bw() +
  theme(plot.background   = element_rect(fill="transparent", color="transparent"),
        panel.background  = element_rect(fill='transparent'),
        panel.border      = element_blank(),
        panel.grid.major  = element_blank(),
        panel.grid.minor  = element_blank(),
        legend.key        = element_rect(colour=NA, fill=NA),
        legend.background = element_rect(fill="transparent"),
        legend.position   = "top")

center

set.seed(123)
means_for_joinpoint$wgt <- with(means_for_joinpoint, (mean/se)^2) 

# add a segmented variable (`yr` in this example) with 1 breakpoint
o <- lm(log(mean) ~ yr, weights=wgt, data=means_for_joinpoint)

# add a segmented variable (`yr` in this example) with 1 breakpoint
os <- segmented(o, ~yr, control=seg.control(seed=2015))

summary(os)
## 
## 	***Regression Model with Segmented Relationship(s)***
## 
## Call: 
## segmented.lm(obj = o, seg.Z = ~yr, control = seg.control(seed = 2015))
## 
## Estimated Break-Point(s):
##      Est.   St.Err 
## 1998.704    0.386 
## 
## Meaningful coefficients of the linear terms:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.851180   4.734446  -1.025    0.340
## yr           0.002253   0.002374   0.949    0.374
## U1.yr       -0.042195   0.002903 -14.536       NA
## 
## Residual standard error: 0.7548 on 7 degrees of freedom
## Multiple R-Squared: 0.9936,  Adjusted R-squared: 0.9908 
## 
## Convergence attained in 2 iterations with relative change -5.698459e-16
(your_breakpoint <- round(as.vector(os$psi[, "Est."])))
## [1] 1999

Validation

The break point can be validated by separating and fitting two different regression models–one for each segment. The assumption is that there should be no significant changes in trend in either segment. The first model covers the years leading up to (and including) the changepoint (i.e., 1991 to 1999). The second model includes the years from the changepoint forward (i.e., 1999 to 2011). Inferential statistics confirm that the period 1991-1999 saw no significant change while a significant linear decrease in smoking prevalence was seen during 1999-2011 as seem in the term t7l.

Segment: 1991-1999
c5l <- contr.poly(5)[, 1]

des_ns <- update(des_ns, t5l=c5l[match(year, seq(1991, 1999, 2))])

pre_91_99 <- svyglm(I(smoking==1) ~ sex+raceeth+grade+t5l,
        design=subset(des_ns, smoking %in% 1:2 & year <= 1999), 
        family=quasibinomial)

summary(pre_91_99)
## 
## Call:
## svyglm(formula = I(smoking == 1) ~ sex + raceeth + grade + t5l, 
##     design = subset(des_ns, smoking %in% 1:2 & year <= 1999), 
##     family = quasibinomial)
## 
## Survey design:
## subset(des_ns, smoking %in% 1:2 & year <= 1999)
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.61609    0.05003  12.314  < 2e-16 ***
## sex1        -0.05856    0.02935  -1.995 0.047310 *  
## raceeth2    -0.12437    0.05412  -2.298 0.022561 *  
## raceeth3     0.18418    0.04781   3.852 0.000156 ***
## raceeth4    -0.16265    0.06497  -2.503 0.013082 *  
## grade2       0.27785    0.04689   5.926 1.29e-08 ***
## grade3       0.36458    0.05606   6.503 5.85e-10 ***
## grade4       0.50805    0.06209   8.183 2.84e-14 ***
## t5l          0.03704    0.05784   0.640 0.522639    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasibinomial family taken to be 0.9992789)
## 
## Number of Fisher Scoring iterations: 4
Segment: 1999-2011
c7l <- contr.poly(7)[, 1]

des_ns <- update(des_ns, t7l=c7l[match(year, seq(1999, 2011, 2))])

post_99_11 <- svyglm(I(smoking==1) ~ sex+raceeth+grade+t7l,
        design=subset(des_ns, smoking %in% 1:2 & year >= 1999), 
        family=quasibinomial)

summary(post_99_11)
## 
## Call:
## svyglm(formula = I(smoking == 1) ~ sex + raceeth + grade + t7l, 
##     design = subset(des_ns, smoking %in% 1:2 & year >= 1999), 
##     family = quasibinomial)
## 
## Survey design:
## subset(des_ns, smoking %in% 1:2 & year >= 1999)
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.03964    0.04287  -0.925  0.35595    
## sex1        -0.09318    0.02326  -4.005 7.99e-05 ***
## raceeth2    -0.05605    0.04929  -1.137  0.25647    
## raceeth3     0.19022    0.04298   4.426 1.39e-05 ***
## raceeth4    -0.14977    0.05298  -2.827  0.00505 ** 
## grade2       0.26058    0.03134   8.314 4.41e-15 ***
## grade3       0.39964    0.03708  10.779  < 2e-16 ***
## grade4       0.65188    0.03893  16.744  < 2e-16 ***
## t7l         -0.99165    0.05771 -17.183  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasibinomial family taken to be 1.000677)
## 
## Number of Fisher Scoring iterations: 4

Other Projects


© 2017. All rights reserved.

Powered by Hydejack v6.6.0