k-fold Cross Validation

Using the same data to both construct a model and evaluate it can lead to overfitting and poor generalization when predicting on future data.

One way to cope is data splitting: split your data into two subsets (training data and testing data), train models using the training data only, then evaluate prediction error on the testing data.
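
For instance, a single split might look like the following sketch (the built-in mtcars data and the 80/20 proportion are arbitrary choices for illustration, not part of the example below):

# A minimal sketch of a single train/test split.
set.seed(1)
n <- nrow(mtcars)
train_idx <- sample(n, size = floor(0.8 * n))   # random 80% of the rows for training
train <- mtcars[train_idx, ]
test  <- mtcars[-train_idx, ]
fit <- lm(mpg ~ wt + hp, data = train)          # fit using the training data only
test_preds <- predict(fit, test)
mean((test_preds - test$mpg)^2)                 # prediction error on the testing data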

A slightly more sophisticated popular method is \(k\)-fold cross validation. Let \(\mathcal{D}\) denote the full dataset. We partition the dataset \(\mathcal{D}\) into \(k\) disjoint subsets \(\mathcal{D}_1,\ldots,\mathcal{D}_k\) of roughly equal size (typically chosen randomly).
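
For example, here is one way such a random partition could be built in base R (just a sketch; the createFolds() function from the caret package, used in the code below, does this for you):

# A sketch of a random partition of the observation indices into k folds.
k <- 5
n <- 38                                        # number of observations (matches the seatpos data used below)
fold_id <- sample(rep(1:k, length.out = n))    # shuffled fold labels of roughly equal size
folds <- split(seq_len(n), fold_id)            # list of index vectors, one per fold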

To compute the CV score for your model \(m\): for each \(i = 1, \ldots, k\), fit the model on \(\mathcal{D} \setminus \mathcal{D}_i\) (all the data except fold \(i\)), and let \(\text{MSE}_i\) denote the mean squared prediction error on the held-out fold \(\mathcal{D}_i\). Then average over the folds: \[\text{CV}(m) := \frac{1}{k} \sum_{i=1}^k \text{MSE}_i.\]

This quantity is a rough measure of prediction error of the model \(m\). You can compare \(\text{CV}(m)\) for your different candidate models, and select the model that has the smallest value.

Implementation considerations: the createFolds() function from the caret package randomly partitions the observation indices into \(k\) folds, and predict() computes predictions from a fitted lm object on new data (here, the held-out fold).

Here is some sample code showing how to use these two functions.

library(faraway)
library(caret)
data(seatpos)
head(seatpos)
##   Age Weight HtShoes    Ht Seated  Arm Thigh  Leg hipcenter
## 1  46    180   187.2 184.9   95.2 36.1  45.3 41.3  -206.300
## 2  31    175   167.5 165.5   83.8 32.9  36.5 35.9  -178.210
## 3  23    100   153.6 152.2   82.9 26.0  36.6 31.0   -71.673
## 4  19    185   190.3 187.4   97.3 37.4  44.1 41.0  -257.720
## 5  23    159   178.0 174.1   93.9 29.5  40.1 36.9  -173.230
## 6  47    170   178.7 177.0   92.4 36.0  43.2 37.4  -185.150
n <- dim(seatpos)[1]
n
## [1] 38
folds <- createFolds(seatpos$hipcenter, k=5, list=T)
folds
## $Fold1
## [1]  2  3  4  7 11 20 26 31
## 
## $Fold2
## [1]  6  9 17 22 24 25 27 34
## 
## $Fold3
## [1]  5  8 14 15 19 33 38
## 
## $Fold4
## [1] 12 18 21 29 32 35 36 37
## 
## $Fold5
## [1]  1 10 13 16 23 28 30
# Example for computing prediction error on fold 1,
# for the model that uses Ht, Age, Leg
mod <- lm(hipcenter ~ Ht + Age + Leg, data=seatpos[-folds[[1]],])
folds[[1]]
## [1]  2  3  4  7 11 20 26 31
head(mod$model, 10)
##    hipcenter    Ht Age  Leg
## 1    -206.30 184.9  46 41.3
## 5    -173.23 174.1  23 36.9
## 6    -185.15 177.0  47 37.4
## 8    -270.92 182.7  28 43.1
## 9    -151.78 165.0  23 33.4
## 10   -113.88 158.7  29 32.8
## 12   -125.55 152.5  41 31.1
## 13   -203.61 177.2  51 40.2
## 14   -163.22 162.7  30 33.8
## 15   -204.11 176.4  22 36.6
dim(mod$model)
## [1] 30  4
n - length(folds[[1]])
## [1] 30
# use help(predict.lm) to see how to use predict()
preds <- predict(mod, seatpos[folds[[1]], ])
preds
##         2         3         4         7        11        20        26 
## -160.6796 -104.2205 -243.9908 -162.1714 -165.5979 -172.0470 -218.4299 
##        31 
## -114.1710
yval <- seatpos[folds[[1]], "hipcenter"]
MSE <- 1 / length(folds[[1]]) * sum((preds - yval)^2)
MSE
## [1] 1038.14
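
The code above computes the prediction error on a single fold for a single model. The full CV score repeats this over all \(k\) folds and averages the \(\text{MSE}_i\). Here is one possible way to wrap that loop in a helper; the function cv_score() and the second candidate formula are illustrative choices, not part of the original example:

# A sketch of the full CV computation: loop over folds, fit on the
# remaining data, and average the per-fold MSEs.
cv_score <- function(formula, data, folds) {
  yname <- all.vars(formula)[1]                            # response variable name
  mse <- sapply(folds, function(test_idx) {
    fit <- lm(formula, data = data[-test_idx, ])           # train on all other folds
    preds <- predict(fit, data[test_idx, ])                # predict on the held-out fold
    mean((preds - data[test_idx, yname])^2)                # MSE_i for this fold
  })
  mean(mse)                                                # CV(m) = (1/k) * sum of MSE_i
}
# Compare candidate models by their CV scores (smaller is better):
cv_score(hipcenter ~ Ht + Age + Leg, seatpos, folds)
cv_score(hipcenter ~ Ht + Age + Leg + Weight, seatpos, folds)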

Leave-one-out cross validation

The special case \(k=n\) is leave-one-out cross validation; each fold involves training the model on all but one data point, and the model is evaluated on its prediction error at that single held-out point.

In this case, we have an explicit formula for the prediction error, since the predicted value for \(x_i\) is \(x_i^\top \widehat{\beta}_{[i]}(m)\), where \(\widehat{\beta}_{[i]}(m)\) is trained on the dataset with the \(i\)th point excluded. The error is precisely the predicted residual \[\widehat{e}_{[i]}(m) := y_i - x_i^\top \widehat{\beta}_{[i]}(m)\] so \(\text{MSE}_i\) from earlier is simply \(\widehat{e}^2_{[i]}(m)\) in this case. Recall the earlier formula \[\widehat{e}_{[i]}(m) = \frac{\widehat{e}_i(m)}{1 - h_i(m)}.\] Thus we can consider the Predicted REsidual Sum of Squares (PRESS), \[\text{PRESS}(m) := \sum_{i=1}^n \frac{\widehat{e}_i^2(m)}{(1 - h_i(m))^2},\] which is basically \(\sum_{i=1}^k \text{MSE}_i\) in this special case of \(k=n\) folds. In my definition of \(\text{CV}(m)\) above, I averaged, so \[\text{CV}(m) = \frac{1}{n} \text{PRESS}(m)\] in the case \(k=n\). But this factor of \(\frac{1}{n}\) does not matter when comparing across different models, as long as you are consistent.

Implementation considerations: by the formula above, \(\text{PRESS}(m)\) can be computed from a single fit of the model \(m\) on the full dataset, using only the ordinary residuals \(\widehat{e}_i(m)\) and the leverages \(h_i(m)\); there is no need to refit the model \(n\) times.
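
For example, a sketch of this computation in R using residuals() and hatvalues() on a single lm fit (the model formula is just the one used earlier):

# Compute PRESS (and the corresponding CV score) from one fit on the
# full dataset, via the ordinary residuals and the leverages.
full_mod <- lm(hipcenter ~ Ht + Age + Leg, data = seatpos)
e <- residuals(full_mod)                     # ordinary residuals e_i(m)
h <- hatvalues(full_mod)                     # leverages h_i(m)
PRESS <- sum((e / (1 - h))^2)                # sum of squared predicted residuals
PRESS / nrow(seatpos)                        # CV(m) in the k = n case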