# Statistically Significant Trend Analysis

### Introduction

The purpose of this analysis is to quantify statistically significant changes in multi-year trends in data collected under a complex survey design. It was motivated by a CDC publication titled **Conducting Trend Analyses of YRBSS Data**, and the goal is to replicate the exact statistics presented in that paper using open-source methods. In this example, we examine change over time in smoking prevalence among youth.

This analysis was originally completed in 2015. In 2016, the CDC updated the white paper to assess a different question, so the reference now points to an archived copy of the original report on the Wayback Machine.

### Data

The analysis was conducted using multiple years (1991-2011) of Youth Risk Behavior Surveillance System (YRBSS) microdata. The YRBSS is a surveillance tool used by the US CDC to monitor health-risk behaviors, such as tobacco smoking, that contribute to causes of death and disability among youth and adults.

### Dependent Variable

The outcome for this analysis is the dichotomous risk behavior: did the respondent ever smoke?

### Time Variable

Linear, quadratic, and cubic transformations of the time variable were constructed to model the potential trend shapes and to test whether, and when, a change in trend occurred. If, for example, the p-value for the quadratic time variable is below 0.05, the model suggests a quadratic change.
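One common way to build such time variables is with orthogonal polynomial contrasts, so that the linear, quadratic, and cubic terms are uncorrelated with one another. The sketch below is an illustration only, not necessarily the construction used in this analysis: it assumes equally spaced biennial survey years from 1991 to 2011, and the function name `orthogonal_poly_contrasts` is hypothetical.

```python
import math

def orthogonal_poly_contrasts(n_points, degree=3):
    """Return orthonormal polynomial contrasts for equally spaced time points.

    Element k-1 of the result is the degree-k contrast
    (1 = linear, 2 = quadratic, 3 = cubic).
    """
    if not 1 <= degree < n_points:
        raise ValueError("degree must be between 1 and n_points - 1")
    mean = (n_points - 1) / 2
    centered = [i - mean for i in range(n_points)]
    # Start with the constant vector so every term is orthogonal to the intercept.
    basis = [[1 / math.sqrt(n_points)] * n_points]
    for power in range(1, degree + 1):
        vec = [c ** power for c in centered]
        # Gram-Schmidt: remove the part explained by each lower-order term.
        for b in basis:
            proj = sum(v * bi for v, bi in zip(vec, b))
            vec = [v - proj * bi for v, bi in zip(vec, b)]
        norm = math.sqrt(sum(v * v for v in vec))
        basis.append([v / norm for v in vec])
    return basis[1:]

# Assumption: one survey wave every two years, 1991 through 2011.
years = list(range(1991, 2012, 2))
linear, quadratic, cubic = orthogonal_poly_contrasts(len(years))

# Map each survey year to its (linear, quadratic, cubic) time terms.
time_terms = {y: (l, q, c) for y, l, q, c in zip(years, linear, quadratic, cubic)}
```

Each respondent would then receive the three values for their survey year, and those columns serve as the time variables in the regression models.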

### Model

#### Complex Survey Design

A survey design with nested strata was created to account for the yearly data sets being stacked into a single data set.

#### Fit Trend Changes

Because the outcome is dichotomous, logistic regression was used to test for significant change over time. The model was fit in turn on each time variable (linear, then quadratic, then cubic), for a total of three models. All models included variables to control for *sex*, *race/ethnicity*, and *grade*. As noted by the authors, only the highest-order time variable in a model should be considered valid and interpretable.
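Written out, the three models might look as follows. This is a sketch under assumptions: the symbols are notation introduced here, and it assumes each higher-order model keeps the lower-order time terms, which the description above does not state explicitly.

$$
\begin{aligned}
\text{Linear:} \quad & \operatorname{logit} \Pr(y_i = 1) = \beta_0 + \beta_1 t_{L,i} + \mathbf{x}_i^\top \boldsymbol{\gamma} \\
\text{Quadratic:} \quad & \operatorname{logit} \Pr(y_i = 1) = \beta_0 + \beta_1 t_{L,i} + \beta_2 t_{Q,i} + \mathbf{x}_i^\top \boldsymbol{\gamma} \\
\text{Cubic:} \quad & \operatorname{logit} \Pr(y_i = 1) = \beta_0 + \beta_1 t_{L,i} + \beta_2 t_{Q,i} + \beta_3 t_{C,i} + \mathbf{x}_i^\top \boldsymbol{\gamma}
\end{aligned}
$$

Here $y_i$ indicates whether respondent $i$ ever smoked, $t_{L,i}$, $t_{Q,i}$, and $t_{C,i}$ are the linear, quadratic, and cubic time variables for that respondent's survey year, and $\mathbf{x}_i$ holds the indicators for sex, race/ethnicity, and grade. In each model, the test of interest is on the highest-order time coefficient ($\beta_1$, $\beta_2$, or $\beta_3$ respectively).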

##### Linear

##### Quadratic

##### Cubic

#### Break Point

A final model including the highest-order significant time variable is fit to the data, and marginal predictions are obtained for each data point (i.e. each survey year). A data point with a large standard error carries more uncertainty and is assigned a lower weight in the piecewise regression model. The segmented package uses bootstrap restarting to help the model escape local minima, which is especially helpful when the sample size and/or signal-to-noise ratio is low. Because the bootstrap is random, final estimates may vary slightly between runs.
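To make the weighting idea concrete, the sketch below fits a one-break piecewise line by weighted least squares. It is not the segmented package's algorithm: it swaps the iterative estimation and bootstrap restarting for a plain grid search over candidate years, and it assumes weights of 1 / SE². The function names are hypothetical, and the prevalence and standard-error values are synthetic, made up so that the break sits at 1999; they are not YRBSS estimates.

```python
def solve_linear_system(a, b):
    """Solve a small dense system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def weighted_piecewise_fit(x, y, se, candidates):
    """Grid-search a single break point, weighting each point by 1 / se**2."""
    w = [1 / s ** 2 for s in se]
    best = None
    for psi in candidates:
        # Columns: intercept, slope, and change in slope after the break point.
        rows = [[1.0, xi, max(xi - psi, 0.0)] for xi in x]
        xtwx = [[sum(wi * r[j] * r[k] for wi, r in zip(w, rows)) for k in range(3)]
                for j in range(3)]
        xtwy = [sum(wi * r[j] * yi for wi, r, yi in zip(w, rows, y)) for j in range(3)]
        beta = solve_linear_system(xtwx, xtwy)
        # Weighted residual sum of squares for this candidate break point.
        rss = sum(wi * (yi - sum(bj * rj for bj, rj in zip(beta, r))) ** 2
                  for wi, r, yi in zip(w, rows, y))
        if best is None or rss < best[0]:
            best = (rss, psi, beta)
    return best[1], best[2]

# Synthetic illustration only: flat through 1999, then a steady decline.
years = list(range(1991, 2012, 2))
x = [year - 1991 for year in years]  # years since 1991 keeps the system well conditioned
y = [0.70 if xi <= 8 else 0.70 - 0.02 * (xi - 8) for xi in x]
se = [0.015, 0.012, 0.011, 0.010, 0.010, 0.011, 0.010, 0.012, 0.011, 0.013, 0.014]

psi, beta = weighted_piecewise_fit(x, y, se, candidates=x[1:-1])
break_year = 1991 + psi
```

A grid search can only return one of the candidate years, whereas segmented estimates the break point on a continuous scale, so this is best read as a way to see how the weights enter the fit rather than as a replacement.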

#### Validation

The break point can be validated by fitting two separate regression models, one for each segment. The assumption is that there should be no significant change in trend within either segment. The first model covers the years leading up to and including the break point (i.e., 1991 to 1999). The second model covers the years from the break point forward (i.e., 1999 to 2011). Inferential statistics confirm that the period 1991-1999 saw no significant change, while a significant linear decrease in smoking prevalence occurred during 1999-2011, as seen in the term *t7l*.
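The split itself can be sketched in a few lines. The example below assumes biennial survey years and rebuilds a centered, unit-length linear time term within each segment; the helper name `linear_contrast` is hypothetical. Under that assumption the second segment has seven time points, which may be what the `7` in *t7l* refers to, though that reading is an inference.

```python
import math

years = list(range(1991, 2012, 2))  # assumption: biennial survey years
break_year = 1999                   # break point reported in the text

# The break-point year belongs to both segments.
before = [y for y in years if y <= break_year]
after = [y for y in years if y >= break_year]

def linear_contrast(n_points):
    """Centered, unit-length linear time term for equally spaced points."""
    mean = (n_points - 1) / 2
    centered = [i - mean for i in range(n_points)]
    norm = math.sqrt(sum(c * c for c in centered))
    return [c / norm for c in centered]

# Linear time term for each segment, keyed by survey year.
before_terms = dict(zip(before, linear_contrast(len(before))))
after_terms = dict(zip(after, linear_contrast(len(after))))
```

Each segment's model would then regress the outcome on its own linear term plus the same control variables.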