Note: Working definitions and graphs are taken from Ugarte, Militino, and Arnholt (2016).
The basic idea behind the validation set approach is to split the available data into a training set and a testing set. A regression model is developed using only the training set. Consider Figure 1.1 which illustrates a split of the available data into a training set and a testing set.
The percentage of values allocated to training and testing may vary based on the size of the available data. It is not unusual to allocate 70–75% of the available data to the training set and the remaining 25–30% to the testing set. The predictive performance of a regression model is assessed using the testing set. One of the more common measures of the predictive performance of a regression model is the mean squared prediction error (MSPE), defined as
\[\begin{equation} \textrm{MSPE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 \tag{1.1} \end{equation}\]
Leave-one-out cross-validation (LOOCV) eliminates the variability in the MSPE that the validation set approach exhibits from one random split to the next. LOOCV is similar to the validation set approach in that the available \(n\) observations are split into training and testing sets. The difference is that \(n\) different training and testing sets are created, where each training set consists of \(n - 1\) observations and each testing set consists of a single, different observation from the original \(n\). Figure 1.2 provides a schematic display of the leave-one-out cross-validation process with testing sets (light shade) and training sets (dark shade) for a data set of \(n\) observations.
The MSPE is computed with each testing set, resulting in \(n\) values of MSPE. The LOOCV estimate for the test MSPE is the average of these \(n\) MSPE values, denoted as
\[\begin{equation} \textrm{CV}_{(n)} = \frac{1}{n}\sum_{i=1}^{n}\textrm{MSPE}_i \tag{1.2} \end{equation}\]
\(k\)-fold cross-validation is similar to LOOCV in that the available data is split into training sets and testing sets; however, instead of creating \(n\) different training and testing sets, \(k\) folds/groups of training and testing sets are created where \(k < n\) and each fold consists of roughly \(n/k\) values in the testing set and \(n - n/k\) values in the training set. Figure 1.3 shows a schematic display of 5-fold cross-validation. The lightly shaded rectangles are the testing sets and the darker shaded rectangles are the training sets.
The MSPE is computed on each of the \(k\) folds using the testing set to evaluate the regression model built from the training set. The weighted average of the \(k\) MSPE values, where \(n_j\) denotes the number of observations in the \(j\)th fold, is denoted as
\[\begin{equation} \textrm{CV}_{(k)} = \sum_{j=1}^{k}\frac{n_j}{n}\textrm{MSPE}_j \tag{1.3} \end{equation}\]
Note that LOOCV is a special case of \(k\)-fold cross-validation where \(k\) is set equal to \(n\). An important advantage \(k\)-fold cross-validation has over LOOCV is that \(\textrm{CV}_{(k)}\) for \(k = 5\) or \(k = 10\) often provides a more accurate estimate of the test error rate than \(\textrm{CV}_{(n)}\) because of the bias-variance trade-off, and it is computationally much cheaper.
set.seed(24)
n <- 1000                        # Number of observations to generate
SD <- 0.5                        # Standard deviation of the noise
xs <- sort(runif(n, 5, 9))       # Sorted uniform x values on [5, 9]
ys <- sin(xs) + rnorm(n, 0, SD)  # True mean sin(x) plus Gaussian noise
DF <- data.frame(x = xs, y = ys)
rm(xs, ys)
library(DT)
datatable(DF)
n <- nrow(DF)
train <- sample(n, floor(0.75 * n), replace = FALSE)  # 75% of row indices for training
train <- sort(train)
trainSET <- DF[train, ]   # Training set
testSET <- DF[-train, ]   # Testing set
dim(trainSET)
[1] 750 2
dim(testSET)
[1] 250 2
library(ggplot2)
# Base R
plot(y ~ x, data = trainSET, pch = 19, cex = .25, col = "blue")
modq <- lm(y ~ poly(x, 2, raw = TRUE), data = trainSET)
yhat <- predict(modq, newdata = trainSET)
lines(trainSET$x, yhat, col = "red")
# ggplot2 approach
ggplot(data = trainSET, aes(x = x, y = y)) +
geom_point(color = "blue", size = 1) +
theme_bw() +
geom_smooth(method = "lm", formula = y ~ poly(x, 2, raw = TRUE), color = "red", se = FALSE)
# Summary of quadratic model
summary(modq)
Call:
lm(formula = y ~ poly(x, 2, raw = TRUE), data = trainSET)
Residuals:
Min 1Q Median 3Q Max
-1.65985 -0.32154 -0.00695 0.33937 1.47986
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -14.81533 0.74463 -19.90 <2e-16 ***
poly(x, 2, raw = TRUE)1 3.92073 0.21685 18.08 <2e-16 ***
poly(x, 2, raw = TRUE)2 -0.24569 0.01546 -15.90 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5078 on 747 degrees of freedom
Multiple R-squared: 0.6076, Adjusted R-squared: 0.6065
F-statistic: 578.3 on 2 and 747 DF, p-value: < 2.2e-16
The MSPE computed on the training set for the quadratic model:
MSPE <- mean(resid(modq)^2)
MSPE
[1] 0.2567848
The MSPE computed on the testing set for the quadratic model:
yhtest <- predict(modq, newdata = testSET)
MSPEtest <- mean((testSET$y - yhtest)^2)
MSPEtest
[1] 0.2986236
# Base R
plot(y ~ x, data = trainSET, pch = 19, cex = .25, col = "blue")
modc <- lm(y ~ poly(x, 3, raw = TRUE), data = trainSET)
yhat <- predict(modc, newdata = trainSET)
lines(trainSET$x, yhat, col = "red")
# ggplot2 approach
ggplot(data = trainSET, aes(x = x, y = y)) +
geom_point(color = "blue", size = 0.5) +
theme_bw() +
geom_smooth(method = "lm", formula = y ~ poly(x, 3, raw = TRUE), color = "red", se = FALSE)
# Summary of cubic model
summary(modc)
Call:
lm(formula = y ~ poly(x, 3, raw = TRUE), data = trainSET)
Residuals:
Min 1Q Median 3Q Max
-1.53023 -0.32179 -0.01373 0.33392 1.43179
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 15.93473 4.84463 3.289 0.00105 **
poly(x, 3, raw = TRUE)1 -9.73992 2.13842 -4.555 6.13e-06 ***
poly(x, 3, raw = TRUE)2 1.74229 0.31004 5.620 2.70e-08 ***
poly(x, 3, raw = TRUE)3 -0.09485 0.01477 -6.420 2.43e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4946 on 746 degrees of freedom
Multiple R-squared: 0.6281, Adjusted R-squared: 0.6266
F-statistic: 420 on 3 and 746 DF, p-value: < 2.2e-16
The MSPE computed on the training set for the cubic model:
MSPE <- mean(resid(modc)^2)
MSPE
[1] 0.2433419
The MSPE computed on the testing set for the cubic model:
yhtest <- predict(modc, newdata = testSET)
MSPEtest <- mean((testSET$y - yhtest)^2)
MSPEtest
[1] 0.2845815
Create a training set (80%) and testing set (20%) of the observations from the data frame HSWRESTLER from the PASWR2 package. Store the results from regressing hwfat onto abs and triceps in the object modf. Compute the test MSPE.
Note how the answers of your classmates are all different. The validation estimate of the test MSPE can be highly variable.
# Your code here
library(PASWR2)
#
#
#
#
#
#
#
#
#
#
#
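One possible solution sketch; the seed (42 here) and the exact 80/20 split are arbitrary choices, which is why classmates' answers differ:
set.seed(42)                                     # arbitrary seed -- answers will vary
n <- nrow(HSWRESTLER)
train <- sort(sample(n, floor(0.80 * n), replace = FALSE))
trainSET <- HSWRESTLER[train, ]                  # 80% training set
testSET <- HSWRESTLER[-train, ]                  # 20% testing set
modf <- lm(hwfat ~ abs + triceps, data = trainSET)
yhtest <- predict(modf, newdata = testSET)
MSPEtest <- mean((testSET$hwfat - yhtest)^2)     # test MSPE
MSPEtest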
The left side of Figure 5.2 on page 178 of James et al. (2013) shows the validation approach used on the Auto data set in order to estimate the test error that results from predicting mpg using polynomial functions of horsepower for one particular split of the original data. The code below creates a similar graph.
library(ISLR)
n <- nrow(Auto)
plot(1:10, type ="n", xlab = "Degree of Polynomial", ylim = c(15, 30),
ylab = "Mean Squared Prediction Error")
IND <- sample(1:n, size = floor(n/2), replace = FALSE)
train <- Auto[IND, ]
test <- Auto[-IND, ]
MSPE <- numeric(10)
for(i in 1:10){
mod <- lm(mpg ~ poly(horsepower, i), data = train)
pred <- predict(mod, newdata = test)
MSPE[i] <- mean((test$mpg - pred)^2)
}
lines(1:10, MSPE, col = "red")
points(1:10, MSPE, col = "red", pch = 19)
# ggplot2 approach
DF2 <- data.frame(x = 1:10, MSPE = MSPE)
ggplot(data = DF2, aes(x = x, y = MSPE)) +
geom_point(color = "red", size = 2) +
geom_line(color = "red") +
theme_bw() +
ylim(15, 30) +
scale_x_continuous(breaks = 1:10) +
labs(x = "Degree of Polynomial", y = "Mean Squared Prediction Error")
Place a for loop before IND to run the above code several times, adding each new MSPE curve to the graph. The result should look similar to the right side of Figure 5.2 on page 178 of James et al. (2013).
# Your code here
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
# ggplot2 approach
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
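One possible solution sketch; the seed, the number of repetitions (ten), and the colors are illustrative choices:
set.seed(321)                                    # arbitrary seed
plot(1:10, type = "n", xlab = "Degree of Polynomial", ylim = c(15, 30),
     ylab = "Mean Squared Prediction Error")
MSPE <- numeric(10)
for(run in 1:10){                                # ten random 50/50 splits
  IND <- sample(1:n, size = floor(n/2), replace = FALSE)
  train <- Auto[IND, ]
  test <- Auto[-IND, ]
  for(i in 1:10){
    mod <- lm(mpg ~ poly(horsepower, i), data = train)
    pred <- predict(mod, newdata = test)
    MSPE[i] <- mean((test$mpg - pred)^2)
  }
  lines(1:10, MSPE, col = run)                   # one curve per split
}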
The code below computes the 5-fold cross-validation estimate of the test MSPE for the quadratic model modq.
set.seed(1)
k <- 5
MSPE <- numeric(k)
folds <- sample(x = 1:k, size = nrow(DF), replace = TRUE)
xtabs(~folds)
folds
1 2 3 4 5
210 194 183 204 209
# or
table(folds)
folds
1 2 3 4 5
210 194 183 204 209
sum(xtabs(~folds))
[1] 1000
for(j in 1:k){
modq <- lm(y ~ poly(x, 2, raw = TRUE), data = DF[folds != j, ])
pred <- predict(modq, newdata = DF[folds == j, ])
MSPE[j] <- mean((DF[folds == j, ]$y - pred)^2)
}
MSPE
[1] 0.2660482 0.2644834 0.2283972 0.3057784 0.2736351
weighted.mean(MSPE, table(folds)/sum(table(folds)))  # weights n_j/n from equation (1.3)
[1] 0.2685451
The same computation can be performed with the train function from the caret package.
library(caret)
model <- train(
form = y ~ poly(x, 2, raw = TRUE),
data = DF,
method = "lm",
trControl = trainControl(
method = "cv", number = 5
)
)
model
Linear Regression
1000 samples
1 predictor
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 800, 800, 800, 800, 800
Resampling results:
RMSE Rsquared MAE
0.5163329 0.6103537 0.4117134
Tuning parameter 'intercept' was held constant at a value of TRUE
Use 8-fold cross-validation to estimate the test MSPE for modf. Recall that modf was created from regressing hwfat onto abs and triceps.
# Your code here
set.seed(13)
k <- 8
MSPE <- numeric(k)
folds <- sample(x = 1:k, size = nrow(HSWRESTLER), replace = TRUE)
#
#
#
#
#
#
#
#
#
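A possible completion sketch, mirroring the 5-fold loop used on DF above:
for(j in 1:k){
  modf <- lm(hwfat ~ abs + triceps, data = HSWRESTLER[folds != j, ])
  pred <- predict(modf, newdata = HSWRESTLER[folds == j, ])
  MSPE[j] <- mean((HSWRESTLER[folds == j, ]$hwfat - pred)^2)
}
MSPE
weighted.mean(MSPE, table(folds)/sum(table(folds)))  # CV_(8) estimate of the test MSPE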
The same result can be checked with the caret package.
model <- train(
form = hwfat ~ abs + triceps,
data = HSWRESTLER,
method = "lm",
trControl = trainControl(
method = "cv", number = 8
)
)
model
Linear Regression
78 samples
2 predictor
No pre-processing
Resampling: Cross-Validated (8 fold)
Summary of sample sizes: 68, 69, 68, 68, 68, 68, ...
Resampling results:
RMSE Rsquared MAE
3.141969 0.8494259 2.510904
Tuning parameter 'intercept' was held constant at a value of TRUE
model$results$RMSE^2
[1] 9.871968
The function cv.glm from the boot package provides another way to compute the k-fold cross-validation estimate.
set.seed(1)
library(boot)
glm.fit <- glm(y ~ poly(x, 2, raw = TRUE), data = DF)
cv.err <- cv.glm(data = DF, glmfit = glm.fit, K = 5)$delta[1]
cv.err
[1] 0.2690022
Compute the k-fold cross-validation estimate of the test MSPE for modf using cv.glm. Recall that modf was created from regressing hwfat onto abs and triceps.
# Your code here
glm.fit <- glm(hwfat ~ abs + triceps, data = HSWRESTLER)
#
#
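A possible completion sketch; K = 8 and the seed are assumptions chosen to match the previous exercise:
set.seed(13)                     # assumed seed for reproducibility
cv.err <- cv.glm(data = HSWRESTLER, glmfit = glm.fit, K = 8)$delta[1]
cv.err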
Repeat the computation with caret and compare answers.
# Your Code Here
#
#
#
#
#
#
#
#
#
#
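A possible sketch with caret, assuming 8 folds to match the by-hand computation; the estimate varies with the random fold assignment:
model <- train(
  form = hwfat ~ abs + triceps,
  data = HSWRESTLER,
  method = "lm",
  trControl = trainControl(method = "cv", number = 8)
)
model$results$RMSE^2             # squared RMSE is comparable to the MSPE estimates above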
The right side of Figure 5.4 on page 180 of James et al. (2013) shows the 10-fold cross-validation approach used on the Auto data set in order to estimate the test error that results from predicting mpg using polynomial functions of horsepower, run nine separate times. The code below creates a graph showing one particular run.
# Your code here
plot(1:10, type ="n", xlab = "Degree of Polynomial", ylim = c(16, 26),
ylab = "Mean Squared Prediction Error", main = "10-fold CV")
k <- 10 # number of folds
MSPE <- numeric(k)
cv <- numeric(k)
#
#
#
#
#
#
#
#
#
#
#
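A possible completion sketch for a single 10-fold CV run; the seed is an arbitrary choice:
set.seed(11)                                     # arbitrary seed
folds <- sample(x = 1:k, size = nrow(Auto), replace = TRUE)
for(i in 1:10){                                  # polynomial degrees 1 to 10
  for(j in 1:k){
    mod <- lm(mpg ~ poly(horsepower, i), data = Auto[folds != j, ])
    pred <- predict(mod, newdata = Auto[folds == j, ])
    MSPE[j] <- mean((Auto[folds == j, ]$mpg - pred)^2)
  }
  cv[i] <- weighted.mean(MSPE, table(folds)/sum(table(folds)))
}
lines(1:10, cv, col = "blue")
points(1:10, cv, col = "blue", pch = 19)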
Use a for loop to run the above code nine times. The result should look similar to the right side of Figure 5.4 on page 180 of James et al. (2013).
# Your Code Here
set.seed(123)
plot(1:10, type ="n", xlab = "Degree of Polynomial", ylim = c(16, 26),
ylab = "Mean Squared Prediction Error", main = "10-fold CV")
k <- 10 # number of folds
MSPE <- numeric(k)
cv <- numeric(k)
#
#
#
#
#
#
#
#
#
#
#
#
#
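A possible completion sketch; each of the nine runs uses a new random fold assignment and its own color:
for(run in 1:9){
  folds <- sample(x = 1:k, size = nrow(Auto), replace = TRUE)
  for(i in 1:10){
    for(j in 1:k){
      mod <- lm(mpg ~ poly(horsepower, i), data = Auto[folds != j, ])
      pred <- predict(mod, newdata = Auto[folds == j, ])
      MSPE[j] <- mean((Auto[folds == j, ]$mpg - pred)^2)
    }
    cv[i] <- weighted.mean(MSPE, table(folds)/sum(table(folds)))
  }
  lines(1:10, cv, col = run)
}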
# ggplot2 approach
set.seed(123)
cv <- matrix(NA, 10, 9)
k <- 10 # number of folds
MSPE <- numeric(k)
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
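A possible ggplot2 sketch, filling cv column by column and reshaping for plotting; the data frame name DF4 is just an illustrative choice:
for(run in 1:9){
  folds <- sample(x = 1:k, size = nrow(Auto), replace = TRUE)
  for(i in 1:10){
    for(j in 1:k){
      mod <- lm(mpg ~ poly(horsepower, i), data = Auto[folds != j, ])
      pred <- predict(mod, newdata = Auto[folds == j, ])
      MSPE[j] <- mean((Auto[folds == j, ]$mpg - pred)^2)
    }
    cv[i, run] <- weighted.mean(MSPE, table(folds)/sum(table(folds)))
  }
}
DF4 <- data.frame(degree = rep(1:10, 9), MSPE = as.vector(cv),
                  run = factor(rep(1:9, each = 10)))
ggplot(data = DF4, aes(x = degree, y = MSPE, color = run)) +
  geom_line() +
  theme_bw() +
  ylim(16, 26) +
  scale_x_continuous(breaks = 1:10) +
  labs(x = "Degree of Polynomial", y = "Mean Squared Prediction Error") +
  theme(legend.position = "none")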
Use cv.glm to create a graph similar to the right side of Figure 5.4 on page 180 of James et al. (2013).
# Your Code Here
plot(1:10, type ="n", xlab = "Degree of Polynomial", ylim = c(16, 26),
ylab = "Mean Squared Prediction Error", main = "10-fold CV")
#
#
#
#
#
#
#
#
#
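A possible completion sketch using cv.glm with K = 10; colors distinguish the nine runs:
for(run in 1:9){
  cv <- numeric(10)
  for(i in 1:10){
    glm.fit <- glm(mpg ~ poly(horsepower, i), data = Auto)
    cv[i] <- cv.glm(data = Auto, glmfit = glm.fit, K = 10)$delta[1]
  }
  lines(1:10, cv, col = run)
}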
# ggplot2 approach
cv.err <- matrix(NA, 10, 9)
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
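A possible ggplot2 completion, filling cv.err column by column and reusing the reshaping pattern shown earlier; DF5 is an illustrative name:
for(run in 1:9){
  for(i in 1:10){
    glm.fit <- glm(mpg ~ poly(horsepower, i), data = Auto)
    cv.err[i, run] <- cv.glm(data = Auto, glmfit = glm.fit, K = 10)$delta[1]
  }
}
DF5 <- data.frame(degree = rep(1:10, 9), MSPE = as.vector(cv.err),
                  run = factor(rep(1:9, each = 10)))
ggplot(data = DF5, aes(x = degree, y = MSPE, color = run)) +
  geom_line() +
  theme_bw() +
  ylim(16, 26) +
  scale_x_continuous(breaks = 1:10) +
  labs(x = "Degree of Polynomial", y = "Mean Squared Prediction Error") +
  theme(legend.position = "none")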
set.seed(1)
k <- nrow(DF)
MSPE <- numeric(k)
folds <- sample(x = 1:k, size = nrow(DF), replace = FALSE)
# Note that replace changes to FALSE for LOOCV...can you explain why?
for(j in 1:k){
modq <- lm(y ~ poly(x, 2, raw = TRUE), data = DF[folds != j, ])
pred <- predict(modq, newdata = DF[folds == j, ])
MSPE[j] <- mean((DF[folds == j, ]$y - pred)^2)
}
mean(MSPE)
[1] 0.2682352
Compute the LOOCV estimate of the test MSPE for modf. Recall that modf was created from regressing hwfat onto abs and triceps.
# Your Code Here
set.seed(1)
k <- nrow(HSWRESTLER)
MSPE <- numeric(k)
#
#
#
#
#
#
#
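A possible completion sketch, mirroring the LOOCV loop used on DF above:
folds <- sample(x = 1:k, size = nrow(HSWRESTLER), replace = FALSE)
for(j in 1:k){
  modf <- lm(hwfat ~ abs + triceps, data = HSWRESTLER[folds != j, ])
  pred <- predict(modf, newdata = HSWRESTLER[folds == j, ])
  MSPE[j] <- mean((HSWRESTLER[folds == j, ]$hwfat - pred)^2)
}
mean(MSPE)                       # LOOCV estimate of the test MSPE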
Recall the LOOCV shortcut for linear models fit by least squares, where \(h_i\) is the leverage of the \(i\)th observation:
\[CV_{(n)}=\frac{1}{n}\sum_{i=1}^n\left(\frac{y_i - \hat{y}_i}{1 - h_i}\right)^2\]
modq <- lm(y ~ poly(x, 2, raw = TRUE), data = DF)
h <- hatvalues(modq)
CVn <- mean(((DF$y - predict(modq))/(1 - h))^2)
CVn
[1] 0.2682352
Compute the LOOCV estimate of the test MSPE for modf using the mathematical shortcut. Recall that modf was created from regressing hwfat onto abs and triceps.
# Your Code Here
modf <- lm(hwfat ~ abs + triceps, data = HSWRESTLER)
#
#
#
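A possible completion sketch using hatvalues:
h <- hatvalues(modf)             # leverage values from the full-data fit
CVn <- mean(((HSWRESTLER$hwfat - predict(modf))/(1 - h))^2)
CVn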
The function cv.glm from boot can also compute LOOCV. When no K argument for the number of folds is supplied, cv.glm will compute LOOCV.
library(boot)
glm.fit <- glm(y ~ poly(x, 2, raw = TRUE), data = DF)
cv.err <- cv.glm(data = DF, glmfit = glm.fit)$delta[1]
cv.err
[1] 0.2682352
Compute the LOOCV estimate of the test MSPE for modf using cv.glm. Recall that modf was created from regressing hwfat onto abs and triceps.
# Your Code Here
glm.fit <- glm(hwfat ~ abs + triceps, data = HSWRESTLER)
#
#
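A possible completion sketch; omitting K makes cv.glm perform LOOCV:
cv.err <- cv.glm(data = HSWRESTLER, glmfit = glm.fit)$delta[1]
cv.err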
Use LOOCV to create a graph of the estimated test MSPE for polynomial models of degree 1 through 10 fit to the Auto data, similar to the left side of Figure 5.4 on page 180 of James et al. (2013).
# Your Code Here
plot(1:10, type ="n", xlab = "Degree of Polynomial", ylim = c(16, 26),
ylab = "Mean Squared Prediction Error")
k <- nrow(Auto) # number of folds
MSPE <- numeric(k)
cv <- numeric(10)
#
#
#
#
#
#
#
#
#
#
#
#
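A possible completion sketch using the by-hand LOOCV loop; note that this fits n models per degree, so it is slow:
folds <- sample(x = 1:k, size = nrow(Auto), replace = FALSE)
for(i in 1:10){
  for(j in 1:k){
    mod <- lm(mpg ~ poly(horsepower, i), data = Auto[folds != j, ])
    pred <- predict(mod, newdata = Auto[folds == j, ])
    MSPE[j] <- mean((Auto[folds == j, ]$mpg - pred)^2)
  }
  cv[i] <- mean(MSPE)
}
lines(1:10, cv, col = "red")
points(1:10, cv, col = "red", pch = 19)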
Using the shortcut formula:
# Your Code Here
plot(1:10, type ="n", xlab = "Degree of Polynomial", ylim = c(16, 26),
ylab = "Mean Squared Prediction Error")
cv <- numeric(10)
#
#
#
#
#
#
#
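A possible completion sketch with the shortcut formula, which requires only one fit per degree:
for(i in 1:10){
  mod <- lm(mpg ~ poly(horsepower, i), data = Auto)
  h <- hatvalues(mod)
  cv[i] <- mean(((Auto$mpg - predict(mod))/(1 - h))^2)
}
lines(1:10, cv, col = "red")
points(1:10, cv, col = "red", pch = 19)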