k-fold Cross Validation

Using the same data to both construct a model and evaluate it can lead to overfitting and poor generalization when predicting on future data.

One way to cope is data splitting: split your data into two subsets (training data and testing data), train models using the training data only, then evaluate prediction error on the testing data.
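
For instance, a single split might look like the following sketch (the built-in mtcars data and the 80/20 proportion are arbitrary choices for illustration, not part of the example below):

# A minimal sketch of a single train/test split.
set.seed(1)
n <- nrow(mtcars)
train_idx <- sample(n, size = floor(0.8 * n))   # random 80% of the rows for training
train <- mtcars[train_idx, ]
test  <- mtcars[-train_idx, ]
fit <- lm(mpg ~ wt + hp, data = train)          # fit using the training data only
test_preds <- predict(fit, test)
mean((test_preds - test$mpg)^2)                 # prediction error on the testing data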

A slightly more sophisticated popular method is \(k\)-fold cross validation. Let \(\mathcal{D}\) denote the full dataset. We partition the dataset \(\mathcal{D}\) into \(k\) disjoint subsets \(\mathcal{D}_1,\ldots,\mathcal{D}_k\) of roughly equal size (typically chosen randomly).
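
For example, here is one way such a random partition could be built in base R (just a sketch; the createFolds() function from the caret package, used in the code below, does this for you):

# A sketch of a random partition of the observation indices into k folds.
k <- 5
n <- 38                                        # number of observations (matches the seatpos data used below)
fold_id <- sample(rep(1:k, length.out = n))    # shuffled fold labels of roughly equal size
folds <- split(seq_len(n), fold_id)            # list of index vectors, one per fold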

To compute the CV score for your model \(m\): for each \(i = 1, \ldots, k\), fit the model on \(\mathcal{D} \setminus \mathcal{D}_i\) (all the data except fold \(i\)), and let \(\text{MSE}_i\) denote the mean squared prediction error on the held-out fold \(\mathcal{D}_i\). Then average over the folds: \[\text{CV}(m) := \frac{1}{k} \sum_{i=1}^k \text{MSE}_i.\]

This quantity is a rough measure of prediction error of the model \(m\). You can compare \(\text{CV}(m)\) for your different candidate models, and select the model that has the smallest value.

Implementation considerations: the createFolds() function from the caret package randomly partitions the observation indices into \(k\) folds, and predict() computes predictions from a fitted lm object on new data (here, the held-out fold).

Here is some sample code showing how to use these two functions.

library(faraway)
library(caret)
data(seatpos)
head(seatpos)
##   Age Weight HtShoes    Ht Seated  Arm Thigh  Leg hipcenter
## 1  46    180   187.2 184.9   95.2 36.1  45.3 41.3  -206.300
## 2  31    175   167.5 165.5   83.8 32.9  36.5 35.9  -178.210
## 3  23    100   153.6 152.2   82.9 26.0  36.6 31.0   -71.673
## 4  19    185   190.3 187.4   97.3 37.4  44.1 41.0  -257.720
## 5  23    159   178.0 174.1   93.9 29.5  40.1 36.9  -173.230
## 6  47    170   178.7 177.0   92.4 36.0  43.2 37.4  -185.150
n <- dim(seatpos)[1]
n
## [1] 38
folds <- createFolds(seatpos$hipcenter, k=5, list=T)
folds
## $Fold1
## [1]  2  3  4  7 11 20 26 31
## 
## $Fold2
## [1]  6  9 17 22 24 25 27 34
## 
## $Fold3
## [1]  5  8 14 15 19 33 38
## 
## $Fold4
## [1] 12 18 21 29 32 35 36 37
## 
## $Fold5
## [1]  1 10 13 16 23 28 30
# Example for computing prediction error on fold 1,
# for the model that uses Ht, Age, Leg
mod <- lm(hipcenter ~ Ht + Age + Leg, data=seatpos[-folds[[1]],])
folds[[1]]
## [1]  2  3  4  7 11 20 26 31
head(mod$model, 10)
##    hipcenter    Ht Age  Leg
## 1    -206.30 184.9  46 41.3
## 5    -173.23 174.1  23 36.9
## 6    -185.15 177.0  47 37.4
## 8    -270.92 182.7  28 43.1
## 9    -151.78 165.0  23 33.4
## 10   -113.88 158.7  29 32.8
## 12   -125.55 152.5  41 31.1
## 13   -203.61 177.2  51 40.2
## 14   -163.22 162.7  30 33.8
## 15   -204.11 176.4  22 36.6
dim(mod$model)
## [1] 30  4
n - length(folds[[1]])
## [1] 30
# use help(predict.lm) to see how to use predict()
preds <- predict(mod, seatpos[folds[[1]], ])
preds
##         2         3         4         7        11        20        26 
## -160.6796 -104.2205 -243.9908 -162.1714 -165.5979 -172.0470 -218.4299 
##        31 
## -114.1710
yval <- seatpos[folds[[1]], "hipcenter"]
MSE <- 1 / length(folds[[1]]) * sum((preds - yval)^2)
MSE
## [1] 1038.14
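
The code above computes the prediction error on a single fold for a single model. The full CV score repeats this over all \(k\) folds and averages the \(\text{MSE}_i\). Here is one possible way to wrap that loop in a helper; the function cv_score() and the second candidate formula are illustrative choices, not part of the original example:

# A sketch of the full CV computation: loop over folds, fit on the
# remaining data, and average the per-fold MSEs.
cv_score <- function(formula, data, folds) {
  yname <- all.vars(formula)[1]                            # response variable name
  mse <- sapply(folds, function(test_idx) {
    fit <- lm(formula, data = data[-test_idx, ])           # train on all other folds
    preds <- predict(fit, data[test_idx, ])                # predict on the held-out fold
    mean((preds - data[test_idx, yname])^2)                # MSE_i for this fold
  })
  mean(mse)                                                # CV(m) = (1/k) * sum of MSE_i
}
# Compare candidate models by their CV scores (smaller is better):
cv_score(hipcenter ~ Ht + Age + Leg, seatpos, folds)
cv_score(hipcenter ~ Ht + Age + Leg + Weight, seatpos, folds)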

Leave-one-out cross validation

The special case \(k=n\) is leave-one-out cross validation; each fold involves training the model on all but one data point, and the model is evaluated on its prediction error at that single held-out point.

In this case, we have an explicit formula for the prediction error, since the predicted value for \(x_i\) is \(x_i^\top \widehat{\beta}_{[i]}(m)\), where \(\widehat{\beta}_{[i]}(m)\) is trained on the dataset with the \(i\)th point excluded. The error is precisely the predicted residual \[\widehat{e}_{[i]}(m) := y_i - x_i^\top \widehat{\beta}_{[i]}(m)\] so \(\text{MSE}_i\) from earlier is simply \(\widehat{e}^2_{[i]}(m)\) in this case. Recall the earlier formula \[\widehat{e}_{[i]}(m) = \frac{\widehat{e}_i(m)}{1 - h_i(m)}.\] Thus we can consider the Predicted REsidual Sum of Squares (PRESS), \[\text{PRESS}(m) := \sum_{i=1}^n \frac{\widehat{e}_i^2(m)}{(1 - h_i(m))^2},\] which is basically \(\sum_{i=1}^k \text{MSE}_i\) in this special case of \(k=n\) folds. In my definition of \(\text{CV}(m)\) above, I averaged, so \[\text{CV}(m) = \frac{1}{n} \text{PRESS}(m)\] in the case \(k=n\). But this factor of \(\frac{1}{n}\) does not matter when comparing across different models, as long as you are consistent.

Implementation considerations: by the formula above, \(\text{PRESS}(m)\) can be computed from a single fit of the model \(m\) on the full dataset, using only the ordinary residuals \(\widehat{e}_i(m)\) and the leverages \(h_i(m)\); there is no need to refit the model \(n\) times.
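
For example, a sketch of this computation in R using residuals() and hatvalues() on a single lm fit (the model formula is just the one used earlier):

# Compute PRESS (and the corresponding CV score) from one fit on the
# full dataset, via the ordinary residuals and the leverages.
full_mod <- lm(hipcenter ~ Ht + Age + Leg, data = seatpos)
e <- residuals(full_mod)                     # ordinary residuals e_i(m)
h <- hatvalues(full_mod)                     # leverages h_i(m)
PRESS <- sum((e / (1 - h))^2)                # sum of squared predicted residuals
PRESS / nrow(seatpos)                        # CV(m) in the k = n case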