1 Cross-Validation Handout

Note: Working definitions and graphs are taken from Ugarte, Militino, and Arnholt (2016)

1.1 The Validation Set Approach

The basic idea behind the validation set approach is to split the available data into a training set and a testing set. A regression model is developed using only the training set. Consider Figure 1.1 which illustrates a split of the available data into a training set and a testing set.

Figure 1.1: Validation set approach

The percentage of observations allocated to the training and testing sets may vary with the size of the available data. It is not unusual to allocate 70–75% of the available data to the training set and the remaining 25–30% to the testing set. The predictive performance of a regression model is assessed using the testing set. One of the more common measures of predictive performance is the mean square prediction error (MSPE), defined as

\[\begin{equation} \textrm{MSPE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 \tag{1.1} \end{equation}\]
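
A minimal sketch of Equation (1.1) in R, assuming a fitted model mod and a testing set testSET with response y (hypothetical names):

yhat <- predict(mod, newdata = testSET)  # predictions for the testing set
MSPE <- mean((testSET$y - yhat)^2)       # mean square prediction error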

1.2 Leave-One-Out Cross Validation

Leave-one-out cross-validation (LOOCV) eliminates the variability in the MSPE that arises from the random split used in the validation set approach. LOOCV is similar to the validation set approach in that the available \(n\) observations are split into training and testing sets. The difference is that \(n\) training sets and \(n\) testing sets are created, where each training set consists of \(n - 1\) observations and each testing set consists of a single, different observation from the original \(n\). Figure 1.2 provides a schematic display of the leave-one-out cross-validation process with testing sets (light shade) and training sets (dark shade) for a data set of \(n\) observations.

Figure 1.2: Leave-one-out cross validation

The MSPE is computed with each testing set resulting in \(n\) values of MSPE. The LOOCV estimate for the test MSPE is the average of these \(n\) MSPE values denoted as

\[\begin{equation} \textrm{CV}_{(n)} = \frac{1}{n}\sum_{i=1}^{n}\textrm{MSPE}_i \tag{1.2} \end{equation}\]
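
A minimal LOOCV sketch in R, assuming a data frame DAT with response y and predictor x (hypothetical names) and a simple linear model:

n <- nrow(DAT)
MSPE <- numeric(n)
for (i in 1:n) {
  fit <- lm(y ~ x, data = DAT[-i, ])        # train on all but observation i
  pred <- predict(fit, newdata = DAT[i, ])  # predict the held-out observation
  MSPE[i] <- (DAT$y[i] - pred)^2            # MSPE_i based on a single observation
}
CVn <- mean(MSPE)                           # CV_(n) from Equation (1.2)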

1.3 \(k\)-Fold Cross Validation

\(k\)-fold cross-validation is similar to LOOCV in that the available data are split into training and testing sets; however, instead of creating \(n\) different training and testing sets, \(k\) folds (groups) are created, where \(k < n\). Each fold places roughly \(n/k\) observations in the testing set and the remaining \(n - n/k\) observations in the training set. Figure 1.3 shows a schematic display of 5-fold cross-validation. The lightly shaded rectangles are the testing sets and the darker shaded rectangles are the training sets.

Figure 1.3: Five-fold cross-validation

The MSPE is computed on each of the \(k\) folds, using the testing set to evaluate the regression model built from the corresponding training set. Letting \(n_j\) denote the number of observations in the \(j\)th testing set, the weighted average of the \(k\) MSPE values is

\[\begin{equation} \textrm{CV}_{(k)} = \sum_{j=1}^{k}\frac{n_j}{n}\textrm{MSPE}_j \tag{1.3} \end{equation}\]

Note that LOOCV is a special case of \(k\)-fold cross-validation where \(k\) is set equal to \(n\). An important advantage \(k\)-fold cross-validation has over LOOCV is that \(\textrm{CV}_{(k)}\) with \(k = 5\) or \(k = 10\) often provides a more accurate estimate of the test error rate than \(\textrm{CV}_{(n)}\).
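
A minimal \(k\)-fold sketch in R, assuming a data frame DAT with response y and predictor x (hypothetical names) and \(k = 5\); Section 1.13 carries this out on the simulated data created below:

k <- 5
folds <- sample(rep(1:k, length.out = nrow(DAT)))    # assign each row to a fold
MSPE <- numeric(k)
for (j in 1:k) {
  fit <- lm(y ~ x, data = DAT[folds != j, ])         # train on the other k - 1 folds
  pred <- predict(fit, newdata = DAT[folds == j, ])  # predict fold j
  MSPE[j] <- mean((DAT$y[folds == j] - pred)^2)      # MSPE_j for fold j
}
CVk <- weighted.mean(MSPE, table(folds)/nrow(DAT))   # Equation (1.3)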

1.4 Creating some data

set.seed(24)
n <- 1000           # Number of observations to generate
SD <- 0.5           # Standard deviation of the added noise
xs <- sort(runif(n, 5, 9))       # Sorted predictor values
ys <- sin(xs) + rnorm(n, 0, SD)  # Sine signal plus Gaussian noise
DF <- data.frame(x = xs, y = ys)
rm(xs, ys)
library(DT)
datatable(DF)

1.5 Validation Set Approach

  • Create a training set using 75% of the observations in DF.
  • Sort the observations in the training and testing sets.
n <- nrow(DF)
train <- sample(n, floor(0.75 * n), replace = FALSE)
train <- sort(train) 
trainSET <- DF[train, ]
testSET <- DF[-train, ]
dim(trainSET)
[1] 750   2
dim(testSET)
[1] 250   2
  • Fit a quadratic model using the training set (trainSET).
library(ggplot2)
# Base R
plot(y ~ x, data = trainSET, pch = 19, cex = .25, col = "blue")
modq <- lm(y ~ poly(x, 2, raw = TRUE), data = trainSET)
yhat <- predict(modq, newdata = trainSET)  # fitted values for the training set
lines(trainSET$x, yhat, col = "red")

# ggplot2 approach
ggplot(data = trainSET, aes(x = x, y = y)) + 
  geom_point(color = "blue", size = 1) + 
  theme_bw() + 
  geom_smooth(method = "lm", formula = y ~ poly(x, 2, raw = TRUE), color = "red", se = FALSE)

# Summary of quadratic model
summary(modq)

Call:
lm(formula = y ~ poly(x, 2, raw = TRUE), data = trainSET)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.65985 -0.32154 -0.00695  0.33937  1.47986 

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)    
(Intercept)             -14.81533    0.74463  -19.90   <2e-16 ***
poly(x, 2, raw = TRUE)1   3.92073    0.21685   18.08   <2e-16 ***
poly(x, 2, raw = TRUE)2  -0.24569    0.01546  -15.90   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5078 on 747 degrees of freedom
Multiple R-squared:  0.6076,    Adjusted R-squared:  0.6065 
F-statistic: 578.3 on 2 and 747 DF,  p-value: < 2.2e-16

1.6 Compute the training MSPE

MSPE <- mean(resid(modq)^2)
MSPE
[1] 0.2567848

1.7 Compute the testing MSPE

yhtest <- predict(modq, newdata = testSET)
MSPEtest <- mean((testSET$y - yhtest)^2)
MSPEtest
[1] 0.2986236

1.8 Fit a cubic model

# Base R
plot(y ~ x, data = trainSET, pch = 19, cex = .25, col = "blue")
modc <- lm(y ~ poly(x, 3, raw = TRUE), data = trainSET)
yhat <- predict(modc, newdata = trainSET)  # fitted values for the training set
lines(trainSET$x, yhat, col = "red")

# ggplot2 approach
ggplot(data = trainSET, aes(x = x, y = y)) + 
  geom_point(color = "blue", size = 0.5) + 
  theme_bw() + 
  geom_smooth(method = "lm", formula = y ~ poly(x, 3, raw = TRUE), color = "red", se = FALSE)

# Summary of cubic model
summary(modc)

Call:
lm(formula = y ~ poly(x, 3, raw = TRUE), data = trainSET)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.53023 -0.32179 -0.01373  0.33392  1.43179 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)             15.93473    4.84463   3.289  0.00105 ** 
poly(x, 3, raw = TRUE)1 -9.73992    2.13842  -4.555 6.13e-06 ***
poly(x, 3, raw = TRUE)2  1.74229    0.31004   5.620 2.70e-08 ***
poly(x, 3, raw = TRUE)3 -0.09485    0.01477  -6.420 2.43e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4946 on 746 degrees of freedom
Multiple R-squared:  0.6281,    Adjusted R-squared:  0.6266 
F-statistic:   420 on 3 and 746 DF,  p-value: < 2.2e-16

1.9 Compute the training MSPE

MSPE <- mean(resid(modc)^2)
MSPE
[1] 0.2433419

1.10 Compute the testing MSPE

yhtest <- predict(modc, newdata = testSET)
MSPEtest <- mean((testSET$y - yhtest)^2)
MSPEtest
[1] 0.2845815
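
For this particular training/testing split, the cubic model's testing MSPE (roughly 0.285) is slightly smaller than the quadratic model's (roughly 0.299), so the additional term appears to improve prediction here; a different random split could lead to a different conclusion.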

1.11 Your Turn

  • Create a training set (80%) and testing set (20%) of the observations from the data frame HSWRESTLER from the PASWR2 package. Store the results from regressing hwfat onto abs and triceps in the object modf.

  • Compute the test MSPE.

  • Note how your answers differ from those of your classmates. The validation estimate of the test MSPE can be highly variable.

# Your code here
library(PASWR2)
#
#
#
#
#
#
#
#
#
#
#

1.12 Your Turn

The left side of Figure 5.2 on page 178 of James et al. (2013) shows the validation approach used on the Auto data set in order to estimate the test error that results from predicting mpg using polynomial functions of horsepower for one particular split of the original data. The code below creates a similar graph.

library(ISLR)
n <- nrow(Auto)
plot(1:10, type ="n", xlab = "Degree of Polynomial", ylim = c(15, 30), 
     ylab = "Mean Squared Prediction Error") 
IND <- sample(1:n, size = floor(n/2), replace = FALSE)
train <- Auto[IND, ]
test <- Auto[-IND, ]
MSPE <- numeric(10)
for(i in 1:10){
    mod <- lm(mpg ~ poly(horsepower, i), data = train)
    pred <- predict(mod, newdata = test)
    MSPE[i] <- mean((test$mpg - pred)^2)
  }
lines(1:10, MSPE, col = "red")
points(1:10, MSPE, col = "red", pch = 19)

# ggplot2 approach
DF2 <- data.frame(x = 1:10, MSPE = MSPE)
ggplot(data = DF2, aes(x = x, y = MSPE)) + 
  geom_point(color = "red", size = 2) + 
  geom_line(color = "red") + 
  theme_bw() + 
  ylim(15, 30) + 
  scale_x_continuous(breaks = 1:10) + 
  labs(x = "Degree of Polynomial", y = "Mean Squared Prediction Error")

  • Modify the code above to recreate a graph similar to the right side of Figure 5.2 on page 178 of James et al. (2013). Hint: Place a for loop before IND.
# Your code here
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
# ggplot2 approach
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#

1.13 \(k\)-Fold Cross Validation

  • Create \(k = 5\) folds.
  • Compute the \(\textrm{CV}_{k=5}\) for modq.
set.seed(1)
k <- 5
MSPE <- numeric(k)
folds <- sample(x = 1:k, size = nrow(DF), replace = TRUE)
xtabs(~folds)
folds
  1   2   3   4   5 
210 194 183 204 209 
# or
table(folds)
folds
  1   2   3   4   5 
210 194 183 204 209 
sum(xtabs(~folds))
[1] 1000
for(j in 1:k){
  modq <- lm(y ~ poly(x, 2, raw = TRUE), data = DF[folds != j, ])
  pred <- predict(modq, newdata = DF[folds ==j, ])
  MSPE[j] <- mean((DF[folds == j, ]$y - pred)^2) 
}
MSPE
[1] 0.2660482 0.2644834 0.2283972 0.3057784 0.2736351
weighted.mean(MSPE, table(folds)/nrow(DF))  # weight each fold's MSPE by n_k/n
[1] 0.2685451

1.13.1 Using caret

library(caret)
model <- train(
  form = y ~ poly(x, 2, raw = TRUE),
  data = DF,
  method = "lm",
  trControl = trainControl(
    method = "cv", number = 5
  )
)  
model  
Linear Regression 

1000 samples
   1 predictor

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 800, 800, 800, 800, 800 
Resampling results:

  RMSE       Rsquared   MAE      
  0.5163329  0.6103537  0.4117134

Tuning parameter 'intercept' was held constant at a value of TRUE
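
Because caret reports the cross-validated RMSE, squaring it converts the estimate to the MSPE scale used above:

model$results$RMSE^2  # 0.5163329^2, roughly 0.267

This is comparable to the hand-rolled \(\textrm{CV}_{(5)}\) of 0.2685 computed above; the small difference is due to caret generating its own random folds.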

1.14 Your Turn

  • Compute the \(\textrm{CV}_8\) for modf. Recall that modf was created from regressing hwfat onto abs and triceps.
# Your code here
set.seed(13)
k <- 8
MSPE <- numeric(k)
folds <- sample(x = 1:k, size = nrow(HSWRESTLER), replace = TRUE)
#
#
#
#
#
#
#
#
#

1.14.1 Using caret

model <- train(
  form = hwfat ~ abs + triceps,
  data = HSWRESTLER,
  method = "lm",
  trControl = trainControl(
    method = "cv", number = 8
  )
)  
model
Linear Regression 

78 samples
 2 predictor

No pre-processing
Resampling: Cross-Validated (8 fold) 
Summary of sample sizes: 68, 69, 68, 68, 68, 68, ... 
Resampling results:

  RMSE      Rsquared   MAE     
  3.141969  0.8494259  2.510904

Tuning parameter 'intercept' was held constant at a value of TRUE
model$results$RMSE^2
[1] 9.871968

1.15 Using cv.glm from boot

set.seed(1)
library(boot)
glm.fit <- glm(y ~ poly(x, 2, raw = TRUE), data = DF)
cv.err <- cv.glm(data = DF, glmfit = glm.fit, K = 5)$delta[1]
cv.err
[1] 0.2690022
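
The delta component returned by cv.glm is a vector of length two: the first element is the raw cross-validation estimate of prediction error and the second is a bias-adjusted estimate; only the raw estimate is extracted above.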

1.16 Your Turn

  • Compute \(\textrm{CV}_8\) for modf using cv.glm. Recall that modf was created from regressing hwfat onto abs and triceps.
# Your code here
glm.fit <- glm(hwfat ~ abs + triceps, data = HSWRESTLER)
#
#
  • Use caret and compare answers.
# Your Code Here
#
#
#
#
#
#
#
#
#
#

1.17 Your Turn

The right side of Figure 5.4 on page 180 of James et al. (2013) shows nine separate runs of 10-fold cross-validation on the Auto data set, each estimating the test error that results from predicting mpg using polynomial functions of horsepower. The code below creates a graph showing one particular run.

# Your code here
plot(1:10, type ="n", xlab = "Degree of Polynomial", ylim = c(16, 26), 
     ylab = "Mean Squared Prediction Error", main = "10-fold CV") 

k <- 10 # number of folds
MSPE <- numeric(k)
cv <- numeric(k)
#
#
#
#
#
#
#
#
#
#
#
  • Use a for loop to run the above code nine times. The result should look similar to the right side of Figure 5.4 on page 180 of James et al. (2013).
# Your Code Here
set.seed(123)
plot(1:10, type ="n", xlab = "Degree of Polynomial", ylim = c(16, 26), 
     ylab = "Mean Squared Prediction Error", main = "10-fold CV") 

k <- 10      # number of folds
MSPE <- numeric(k)
cv <- numeric(k)
#
#
#
#
#
#
#
#
#
#
#
#
#
# ggplot2 approach
set.seed(123)
cv <- matrix(NA, 10, 9)
k <- 10      # number of folds
MSPE <- numeric(k)
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
  • Use the function cv.glm to create a graph similar to the right side of Figure 5.4 on page 180 of James et al. (2013).
# Your Code Here
plot(1:10, type ="n", xlab = "Degree of Polynomial", ylim = c(16, 26), 
     ylab = "Mean Squared Prediction Error", main = "10-fold CV") 

#
#
#
#
#
#
#
#
#
# ggplot2 approach
cv.err <- matrix(NA, 10, 9)
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#

1.18 Leave-One-Out Cross-Validation

set.seed(1)
k <- nrow(DF)
MSPE <- numeric(k)
folds <- sample(x = 1:k, size = nrow(DF), replace = FALSE)  
# Note that replace changes to FALSE for LOOCV...can you explain why?
for(j in 1:k){
  modq <- lm(y ~ poly(x, 2, raw = TRUE), data = DF[folds != j, ])
  pred <- predict(modq, newdata = DF[folds ==j, ])
  MSPE[j] <- mean((DF[folds == j, ]$y - pred)^2) 
}
mean(MSPE)
[1] 0.2682352

1.19 Your Turn

  • Compute \(\textrm{CV}_n\) for modf. Recall that modf was created from regressing hwfat onto abs and triceps.
# Your Code Here
set.seed(1)
k <- nrow(HSWRESTLER)
MSPE <- numeric(k)
#
#
#
#
#
#
#

Recall the LOOCV shortcut formula for least squares fits, where \(h_i\) is the leverage (hat value) of the \(i\)th observation:

\[\textrm{CV}_{(n)}=\frac{1}{n}\sum_{i=1}^n\left(\frac{y_i - \hat{y}_i}{1 - h_i}\right)^2\]

modq <- lm(y ~ poly(x, 2, raw = TRUE), data = DF)
h <- hatvalues(modq)
CVn <- mean(((DF$y - predict(modq))/(1 - h))^2)
CVn
[1] 0.2682352

1.20 Your Turn

  • Compute \(\textrm{CV}_n\) for modf using the mathematical shortcut. Recall that modf was created from regressing hwfat onto abs and triceps.
# Your Code Here
modf <- lm(hwfat ~ abs + triceps, data = HSWRESTLER)
#
#
#

1.21 Using cv.glm from boot

  • Note: If the K argument (number of folds) is not supplied, cv.glm computes LOOCV.
library(boot)
glm.fit <- glm(y ~ poly(x, 2, raw = TRUE), data = DF)
cv.err <- cv.glm(data = DF, glmfit = glm.fit)$delta[1]
cv.err
[1] 0.2682352
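
Note that this value matches both the manual leave-one-out loop in Section 1.18 and the shortcut formula above (0.2682352): LOOCV involves no random splitting, so all three approaches return the same estimate.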

1.22 Your Turn

  • Compute \(\textrm{CV}_n\) for modf using cv.glm. Recall that modf was created from regressing hwfat onto abs and triceps.
# Your Code Here
glm.fit <- glm(hwfat ~ abs + triceps, data = HSWRESTLER)
#
#

1.23 Your Turn

  • Create a graph similar to the left side of Figure 5.4 on page 180 of James et al. (2013).
# Your Code Here
plot(1:10, type ="n", xlab = "Degree of Polynomial", ylim = c(16, 26), 
     ylab = "Mean Squared Prediction Error") 

k <- nrow(Auto) # number of folds
MSPE <- numeric(k)
cv <- numeric(10)
#
#
#
#
#
#
#
#
#
#
#
#

Using the shortcut formula:

# Your Code Here
plot(1:10, type ="n", xlab = "Degree of Polynomial", ylim = c(16, 26), 
     ylab = "Mean Squared Prediction Error")

cv <- numeric(10)
#
#
#
#
#
#
#

References

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani, eds. 2013. An Introduction to Statistical Learning: With Applications in R. Springer Texts in Statistics 103. New York: Springer.
Ugarte, María Dolores, Ana F. Militino, and Alan T. Arnholt. 2016. Probability and Statistics with R. Second edition. Boca Raton: CRC Press, Taylor & Francis Group.