Cross Validation

Dec 5, 2017 00:00 · 834 words · 4 minute read

Introduction

Cross validation (CV) is used to evaluate a model’s predictive performance and is preferred over in-sample measures of fit like $R^2$ and calibration checks such as goodness-of-fit tests. In the case against $R^2$, it is widely understood that one can optimistically inflate $R^2$ by increasing the model’s degrees of freedom, which is easily done by adding more covariates. The result is an overfit model that looks good on paper but performs poorly in practice. In the case against goodness-of-fit tests such as the Hosmer-Lemeshow test, I have never felt comfortable with the arbitrary binning, and I have found the significance tests to yield inconsistent results when toggling the number of bins.
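
To make the $R^2$ point concrete, here is a small simulated sketch (the data and variable names are my own illustration, not from the competition): even when the response is pure noise with respect to the covariates, the in-sample $R^2$ keeps climbing as more covariates are added.

# illustration with simulated data: in-sample R^2 inflates as noise covariates are added
set.seed(1)
n <- 100
y <- rnorm(n)                                     # response unrelated to the covariates
X <- as.data.frame(matrix(rnorm(n * 50), n, 50))  # 50 pure-noise covariates

sapply(c(5, 25, 50), function(p) {
  summary(lm(y ~ ., data = X[, 1:p, drop = FALSE]))$r.squared
})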

When it comes to Kaggle competitions, relying on a local CV score is almost always the better choice. The public leaderboard score is based on only a percentage of the full blinded test set. Many shared solutions may be tuned against this small subset of the test data, which means the model may not generalize well to the full data set. I personally experienced the hard lessons of ignoring CV in a competition. I retreated into the mountains for several years in shame, where I practiced CV self-discipline, only returning to society after redeeming myself. OK. Things weren’t that dramatic, but relying on my local CV prevented another slide down the leaderboard.

k-fold cross validation

In machine learning pop culture, CV often refers to k-fold designs. The data are split into $k$ subsamples of roughly equal size called folds. When fitting models, one fold is held out and the remaining $k-1$ folds are used to train the model. The model is then evaluated on the held-out fold. This process is repeated until the model has been evaluated on each of the $k$ folds. Although CV increases processing time, it offers several benefits: the ability to compare models using a quantifiable metric, a way to detect when a model may be overfit, and a way to quantify the prediction quality of a blend of several predictions.
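
Stated a bit more formally (my notation, not from the original post): if $\hat{f}^{(-j)}$ is the model trained with fold $j$ held out, $D_j$ is the data in fold $j$, and $L$ is the evaluation metric, then the k-fold estimate is the average of the per-fold scores,

$$\mathrm{CV}_k = \frac{1}{k} \sum_{j=1}^{k} L\big(\hat{f}^{(-j)}, D_j\big)$$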

Method

The example below was conducted in R, but it can easily be replicated in Python. There are also many packages for both R and Python that split the data into folds and/or build cross validation into the modeling pipeline. The purpose of doing this manually, however, is to gain a conceptual understanding of how folds are prepared and used in model fitting.

library(data.table)
library(xgboost)
library(knitr)
library(MLmetrics)

Data

The data can be downloaded from Kaggle’s Porto Seguro’s Safe Driver Prediction competition.

path_dat = '../input' # replace with your dir
train <- fread(sprintf("%s/train.csv", path_dat))
test <- fread(sprintf("%s/test.csv", path_dat))

Assigning folds

Here the sample() function is used to assign the fold index. A seed is set to ensure reproducibility. An examination of the fold column shows $k=5$ folds.

cv_folds = 5

set.seed(808)
train[, fold := sample(1:cv_folds, nrow(train), replace=TRUE)]

kable(head(train[, c(1,ncol(train),2), with=FALSE], 10), format="markdown")
| id| fold| target|
|--:|----:|------:|
|  7|    5|      0|
|  9|    1|      0|
| 13|    2|      0|
| 16|    4|      0|
| 17|    5|      0|
| 19|    3|      0|
| 20|    1|      0|
| 22|    5|      0|
| 26|    2|      0|
| 28|    2|      1|
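
Because sample() with replace=TRUE assigns each row a fold independently, the folds are only approximately equal in size. A quick check of the fold sizes and event rates (a small sketch using data.table) can confirm nothing looks badly skewed:

# rows and positive-class rate per fold
train[, .(rows = .N, target_rate = mean(target)), by = fold][order(fold)]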

Fitting

A k-fold cross validation design is used here with eXtreme Gradient Boosting (XGBoost). Within each iteration, the held-out fold serves as the test set in the watchlist, so the model evaluates its performance on data it has not trained on and stops once the evaluation metric has not improved for 10 rounds. This process runs five times, once for each fold. The maximum number of rounds is set to 100 so the example runs quickly.

# custom evaluation metric: normalized Gini (direction is set via maximize=TRUE below)
normalizedGini <- function(preds, dtrain) {
  labels <- getinfo(dtrain, "label")
  err <- Gini(as.numeric(preds), as.numeric(labels))
  return(list(metric = "Gini", value = err))
}

# training: store one fitted model per fold
xtr <- list()

f <- setdiff(names(train), c('id','target','fold'))

for (i in 1:cv_folds) {
  dtrain <- xgb.DMatrix(data=as.matrix(train[train$fold!=i, ..f]), label=train[train$fold!=i, ]$target)
  dtest <- xgb.DMatrix(data=as.matrix(train[train$fold==i, ..f]), label=train[train$fold==i, ]$target)

  watchlist <- list(train=dtrain, test=dtest)

  xtr[[i]] <- xgb.train(data          = dtrain,
                        watchlist     = watchlist,
                        objective     = "reg:logistic",
                        eta           = 0.1,
                        nrounds       = 100,
                        feval         = normalizedGini,
                        maximize      = TRUE,
                        early_stopping_rounds = 10,
                        verbose = FALSE)
}
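
With early stopping in place, each fold’s model may stop at a different boosting round. A quick way to inspect this (a sketch that assumes your xgboost version stores best_iteration and best_score on the returned booster, as recent versions do when early stopping is enabled) is:

# stopping round and best validation Gini for each fold's model
data.table(fold       = 1:cv_folds,
           best_iter  = sapply(xtr, function(m) m$best_iteration),
           best_score = sapply(xtr, function(m) m$best_score))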

Evaluate

The xtr object now contains five models, each of which is used to predict on its respective held-out fold. The predictions are then scored. In addition to evaluating the model, the out-of-fold predictions on the training set are stored as xgb_train, which can be saved and used for stacking or blending with other models’ predictions. Although the final blend is applied to test predictions, the same blend applied to the training predictions can be scored against the labels to evaluate the blend’s quality.

xtr.score <- train$target  # placeholder of the right length; every element is overwritten below with an out-of-fold prediction

for (i in 1:cv_folds) {
  xtr.score[train$fold==i] <- predict(xtr[[i]], as.matrix(train[train$fold==i, ..f]))
}

print(MLmetrics::NormalizedGini(xtr.score, train$target))

xgb_train  <- data.table(id=train$id, target=xtr.score)
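
As a sketch of the blending idea, suppose a second model’s out-of-fold predictions were stored in a vector called other_oof (a hypothetical object, not created above). A weighted blend can then be scored against the training labels the same way:

# 'other_oof' is a hypothetical second model's out-of-fold predictions
blend <- 0.7 * xgb_train$target + 0.3 * other_oof
MLmetrics::NormalizedGini(blend, train$target)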

Create predictions

In this case, the Kaggle test set is scored with each of the five models stored in xtr. The predictions are then averaged and prepared for submission.

test.y <- list()

for (i in 1:cv_folds) {
  test.y[[i]] <- predict(xtr[[i]], as.matrix(test[, ..f]))  # reshape is only needed for multiclass output
}

submission <- data.table(id=test$id, target=Reduce('+',test.y)/cv_folds)
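
data.table’s fwrite can then write the averaged predictions to disk in the two-column id/target format Kaggle expects (the file name below is arbitrary):

# write the blended predictions in Kaggle's submission format
fwrite(submission, "xgb_cv_submission.csv")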

Conclusion

In general, preparing a CV design is important for evaluating model performance. It provides a consistent benchmark for comparing models and for quantifying improvement or degradation as you iterate.