Cross Validation

Dec 5, 2017 00:00 · 498 words · 3 minute read

Introduction

Cross validation (CV) is used to evaluate a model’s predictive performance and is preferred over measures of fit like \(R^2\) and goodness-of-fit tests of calibration. In the case against \(R^2\), it is widely understood that one can optimistically inflate \(R^2\) by increasing the degrees of freedom, which is easily achieved by adding more covariates. The result is an overfit model that looks good on paper but performs poorly in practice. In the case against goodness-of-fit tests such as the Hosmer-Lemeshow test, Dr. Frank Harrell details their shortfalls, including an inability to penalize extreme overfitting. In either case, model performance is judged on the same data used to fit the model. Cross validation, on the other hand, helps us gauge how the model may perform on “new” data.
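To see why \(R^2\) can be inflated, consider a toy simulation (a minimal sketch with made-up data, not the competition data used below): fitting a linear model on pure-noise covariates still pushes the in-sample \(R^2\) upward as more of them are added.

# toy illustration: R^2 rises as pure-noise covariates are added
set.seed(1)
n   <- 100
dat <- data.frame(y = rnorm(n), matrix(rnorm(n * 50), nrow = n))  # y is unrelated to the 50 covariates

r2 <- sapply(c(5, 25, 50), function(p) {
  summary(lm(y ~ ., data = dat[, 1:(p + 1)]))$r.squared
})
print(r2)  # in-sample R^2 grows with the number of noise covariates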

k-fold cross validation

Oftentimes, CV refers to the \(k\)-fold design, in which the data are split into \(k\) subsamples of approximately equal size called folds. In the \(k\)-fold design, one fold is held out and the remaining \(k-1\) folds are used to train the model. The model is then evaluated on the held-out fold. This process is repeated until a model has been evaluated on each of the \(k\) folds.
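In code, the \(k\)-fold loop is straightforward. Below is a minimal sketch using a stand-in linear model on hypothetical data; the xgboost version used in this post follows in the Method section.

# minimal k-fold sketch with a stand-in linear model (hypothetical data)
set.seed(1)
dat  <- data.frame(y = rnorm(100), x = rnorm(100))
k    <- 5
fold <- sample(1:k, nrow(dat), replace = TRUE)

cv_rmse <- sapply(1:k, function(i) {
  fit  <- lm(y ~ x, data = dat[fold != i, ])        # train on the k-1 remaining folds
  pred <- predict(fit, newdata = dat[fold == i, ])  # predict on the held-out fold
  sqrt(mean((dat$y[fold == i] - pred)^2))           # held-out RMSE
})
mean(cv_rmse)  # average error across the k held-out folds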

Method

library(data.table)
library(xgboost)
library(knitr)
library(MLmetrics)

Data

The data can be downloaded from Kaggle’s Porto Seguro’s Safe Driver Prediction competition.

path_dat <- '../input' # replace with your dir
train <- fread(sprintf("%s/train.csv", path_dat))
test <- fread(sprintf("%s/test.csv", path_dat))

Assigning folds

Here the sample() function is used to assign a fold index to each row. A seed is set to ensure reproducibility. An examination of the fold column shows the \(k=5\) folds.

cv_folds = 5

set.seed(808)
train[, fold := sample(1:cv_folds, nrow(train), replace=TRUE)]

kable(head(train[, c(1,ncol(train),2), with=FALSE], 10), format="markdown")
id fold target
7 5 0
9 1 0
13 2 0
16 4 0
17 5 0
19 3 0
20 1 0
22 5 0
26 2 0
28 2 1
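To confirm the folds are of approximately equal size, the fold column can be tabulated (a quick check, not part of the original pipeline):

# count rows per fold; sample() yields approximately equal fold sizes
kable(train[, .N, by = fold][order(fold)], format = "markdown")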

Fitting

Here we can see how the \(k\)-fold design works. The loop performs 5 iterations of training with an eXtreme Gradient Boosting (xgboost) model. At each iteration one fold is held out, and that held-out fold is used to evaluate that iteration's model, giving 5 independently trained and evaluated models in total.

# custom evaluation metric (Gini) passed to xgb.train via feval
normalizedGini <- function(preds, dtrain) {
  labels <- getinfo(dtrain, "label")
  err <- Gini(as.numeric(preds), as.numeric(labels))  # MLmetrics::Gini
  return(list(metric = "Gini", value = err))
}

# training
xtr=list()

f <- setdiff(names(train), c('id','target','fold'))

for (i in 1:cv_folds) {
  dtrain <- xgb.DMatrix(data=as.matrix(train[train$fold!=i, ..f]), label=train[train$fold!=i, ]$target)
  dtest <- xgb.DMatrix(data=as.matrix(train[train$fold==i, ..f]), label=train[train$fold==i, ]$target)

  watchlist <- list(train=dtrain, test=dtest)

  xtr[[i]] <- xgb.train(data          = dtrain,
                        watchlist     = watchlist,
                        objective     = "reg:logistic",
                        eta           = 0.1,
                        nrounds       = 100,
                        feval         = normalizedGini,
                        maximize      = TRUE,
                        early_stopping_rounds = 10,
                        verbose = FALSE)
}
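Since early stopping is enabled, each fitted booster records its best iteration and its best test-set Gini. A quick way to inspect them, assuming the best_iteration and best_score fields that xgb.train attaches when early stopping is used, is:

# per-fold best iteration and held-out Gini recorded by early stopping
data.table(fold      = 1:cv_folds,
           best_iter = sapply(xtr, function(m) m$best_iteration),
           best_gini = sapply(xtr, function(m) m$best_score))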

Evaluate

The xtr object now contains five models, each of which is evaluated on its respective held-out fold.

# out-of-fold prediction vector, initialised to the right length (overwritten below)
xtr.score <- train$target

for (i in 1:cv_folds) {
  xtr.score[train$fold==i] <- predict(xtr[[i]], as.matrix(train[train$fold==i, ..f]))
}

# overall normalized Gini on the out-of-fold predictions
print(MLmetrics::NormalizedGini(xtr.score, train$target))

xgb_train  <- data.table(id=train$id, target=xtr.score)
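The out-of-fold scores can also be summarised per fold to check that performance is stable across the folds:

# normalized Gini computed separately on each held-out fold
sapply(1:cv_folds, function(i) {
  MLmetrics::NormalizedGini(xtr.score[train$fold == i], train$target[train$fold == i])
})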

Conclusion

In general, preparing a CV design is important for evaluating how a model is likely to perform when we are interested in using it for prediction on new data.