1 The Validation Set Approach

We explore the use of the validation set approach in order to estimate the test error rates that result from fitting various linear models on the Auto data set.

Before we begin, we use the set.seed() function in order to set a seed for R’s random number generator, so that the reader will obtain precisely the same results as those shown below. It is generally a good idea to set a random seed when performing an analysis, such as cross-validation, that contains an element of randomness, so that the results can be reproduced precisely at a later time.
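
As a quick illustration (a minimal check, not part of the lab itself), resetting the seed reproduces a random draw exactly:

set.seed(1); a <- sample(10, 3)  # draw three values at random
set.seed(1); b <- sample(10, 3)  # reset the seed and draw again
identical(a, b)                  # TRUE: the same seed yields the same draw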

We begin by using the sample() function to split the set of observations into two halves, by selecting a random subset of 196 observations out of the original 392 observations. We refer to these observations as the training set.

library(ISLR)              # provides the Auto data set
set.seed(1)                # make the random split reproducible
train <- sample(392, 196)  # indices of the 196 training observations

(Here we use a shortcut in the sample() command; see ?sample for details.) We then use the subset argument in lm() to fit a linear regression using only the observations corresponding to the training set. For comparison, we also create explicit training and validation data frames and fit the same model a second way; the two summaries below are identical.

lm.fitA <- lm(mpg ~ horsepower, data = Auto, subset = train)  # fit restricted via subset
trainSET <- Auto[train, ]                                     # explicit training set
validSET <- Auto[-train, ]                                    # explicit validation set
lm.fitB <- lm(mpg ~ horsepower, data = trainSET)              # same fit on the training set
summary(lm.fitA)

Call:
lm(formula = mpg ~ horsepower, data = Auto, subset = train)

Residuals:
    Min      1Q  Median      3Q     Max 
-13.698  -3.085  -0.216   2.680  16.770 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 40.340377   1.002269   40.25   <2e-16 ***
horsepower  -0.161701   0.008809  -18.36   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.692 on 194 degrees of freedom
Multiple R-squared:  0.6346,    Adjusted R-squared:  0.6327 
F-statistic: 336.9 on 1 and 194 DF,  p-value: < 2.2e-16
summary(lm.fitB)

Call:
lm(formula = mpg ~ horsepower, data = trainSET)

Residuals:
    Min      1Q  Median      3Q     Max 
-13.698  -3.085  -0.216   2.680  16.770 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 40.340377   1.002269   40.25   <2e-16 ***
horsepower  -0.161701   0.008809  -18.36   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.692 on 194 degrees of freedom
Multiple R-squared:  0.6346,    Adjusted R-squared:  0.6327 
F-statistic: 336.9 on 1 and 194 DF,  p-value: < 2.2e-16

We now use the predict() function to estimate the response for all 392 observations, and we use the mean() function to calculate the MSE of the 196 observations in the validation set. Note that the -train index below selects only the observations that are not in the training set.

EMSE <- mean((Auto$mpg - predict(lm.fitA, newdata = Auto))[-train]^2)
EMSE
[1] 26.14142
# Or, equivalently, using the explicit validation set:
EMSE <- mean((validSET$mpg - predict(lm.fitA, newdata = validSET))^2)
EMSE
[1] 26.14142
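
As an optional sanity check, the two computations agree because -train indexes exactly the rows of the validation set:

all.equal(Auto[-train, "mpg"], validSET$mpg)  # TRUE by construction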

Therefore, the estimated test MSE for the linear regression fit is 26.14. We can use the poly() function to estimate the test error for the quadratic and cubic regressions.

lm.fit2 <- lm(mpg ~ poly(horsepower, 2), data = Auto, subset = train)
EMSE2 <- mean((Auto$mpg - predict(lm.fit2, newdata = Auto))[-train]^2)
EMSE2
[1] 19.82259
# Or
EMSE2 <- mean((validSET$mpg - predict(lm.fit2, newdata = validSET))^2)
EMSE2
[1] 19.82259
#
lm.fit3 <- lm(mpg ~ poly(horsepower, 3), data = Auto, subset = train)
EMSE3 <- mean((Auto$mpg - predict(lm.fit3, newdata = Auto))[-train]^2)
EMSE3
[1] 19.78252
# Or
EMSE3 <- mean((validSET$mpg - predict(lm.fit3, newdata = validSET))^2)
EMSE3
[1] 19.78252

These error rates are 19.82 and 19.78, respectively. If we choose a different training set instead, then we will obtain somewhat different errors on the validation set.

set.seed(2)
train <- sample(392, 196)
trainSET <- Auto[train, ]
validSET <- Auto[-train, ]
lm.fit1 <- lm(mpg ~ horsepower, data = Auto, subset = train)
EMSE1a <- mean((Auto$mpg - predict(lm.fit1, newdata = Auto))[-train]^2)
EMSE1a
[1] 23.29559
# Or
EMSE1a <- mean((validSET$mpg - predict(lm.fit1, newdata = validSET))^2)
EMSE1a
[1] 23.29559
lm.fit2 <- lm(mpg ~ poly(horsepower, 2), data = Auto, subset = train)
EMSE2a <- mean((Auto$mpg - predict(lm.fit2, newdata = Auto))[-train]^2)
EMSE2a
[1] 18.90124
# Or
EMSE2a <- mean((validSET$mpg - predict(lm.fit2, newdata = validSET))^2)
EMSE2a
[1] 18.90124
lm.fit3 <- lm(mpg ~ poly(horsepower, 3), data = Auto, subset = train)
EMSE3a <- mean((Auto$mpg - predict(lm.fit3, newdata = Auto))[-train]^2)
EMSE3a
[1] 19.2574
# Or
EMSE3a <- mean((validSET$mpg - predict(lm.fit3, newdata = validSET))^2)
EMSE3a
[1] 19.2574

Using this split of the observations into a training set and a validation set, we find that the validation set error rates for the models with linear, quadratic, and cubic terms are 23.30, 18.90, and 19.26, respectively.

These results are consistent with our previous findings: a model that predicts mpg using a quadratic function of horsepower performs better than a model that involves only a linear function of horsepower, and there is little evidence in favor of a model that uses a cubic function of horsepower.
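
Because the validation estimate depends on the particular split, one option is to average the estimates over several random splits. A minimal sketch (the names vs.mse, tr, and d are illustrative, not from the lab):

# Average validation-set MSE for polynomial degrees 1 through 3
# over ten different random splits of the Auto data
vs.mse <- sapply(1:10, function(s) {
  set.seed(s)
  tr <- sample(392, 196)
  sapply(1:3, function(d) {
    fit <- lm(mpg ~ poly(horsepower, d), data = Auto, subset = tr)
    mean((Auto$mpg - predict(fit, newdata = Auto))[-tr]^2)
  })
})
rowMeans(vs.mse)  # one averaged estimate per degree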

2 Leave-One-Out Cross-Validation

The LOOCV estimate can be automatically computed for any generalized linear model using the glm() and cv.glm() functions. If we use glm() to fit a model without passing in the family argument, then it performs linear regression, just like the lm() function. So for instance,

glm.fit <- glm(mpg ~ horsepower, data = Auto)
coef(glm.fit)
(Intercept)  horsepower 
 39.9358610  -0.1578447 

and

lm.fit <- lm(mpg ~ horsepower, data = Auto)
coef(lm.fit)
(Intercept)  horsepower 
 39.9358610  -0.1578447 

yield identical linear regression models. In this lab, we will perform linear regression using the glm() function rather than the lm() function because the former can be used together with cv.glm(). The cv.glm() function is part of the boot package.

library(boot)
glm.fit <- glm(mpg ~ horsepower, data = Auto)
cv.err <- cv.glm(data = Auto, glmfit = glm.fit)
cv.err$delta
[1] 24.23151 24.23114

The cv.glm() function produces a list with several components. The two numbers in the delta vector contain the cross-validation results. In this case the two numbers are identical up to two decimal places, and both correspond to the LOOCV statistic. Below, we discuss a situation in which the two numbers differ. Our cross-validation estimate for the test error is approximately 24.23.
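
The other components of the returned list can be inspected directly; in particular, K defaults to the number of observations, which is what makes this run LOOCV (a quick check, using the cv.err object above):

names(cv.err)  # includes "call", "K", "delta", and "seed"
cv.err$K       # 392 here, i.e. one fold per observation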

We can repeat this procedure for increasingly complex polynomial fits. To automate the process, we use the for() function to initiate a for loop which iteratively fits polynomial regressions for polynomials of order i = 1 to i = 5, computes the associated cross-validation error, and stores it in the i\(^{\text{th}}\) element of the vector cv.error. We begin by initializing the vector. This command will likely take a couple of minutes to run.

cv.error <- numeric(5)
for(i in 1:5){
  glm.fit <- glm(mpg ~ poly(horsepower, i), data = Auto)
  cv.error[i] <- cv.glm(data = Auto, glmfit = glm.fit)$delta[1]
}
cv.error
[1] 24.23151 19.24821 19.33498 19.42443 19.03321
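
As in the text, there is a sharp drop in the estimated test MSE between the linear and quadratic fits, but no clear improvement from the higher-order polynomials. A one-liner confirms the minimizing degree:

which.min(cv.error)  # degree 5 wins numerically, but only by a sliver over degree 2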

3 k-Fold Cross-Validation

The cv.glm() function can also be used to implement \(k\)-fold CV. Below we use k = 10, a common choice for \(k\), on the Auto data set. We once again set a random seed and initialize a vector in which we will store the CV errors corresponding to the polynomial fits of orders one to ten.

set.seed(17)
cv.error.10 <- numeric(10)
for(i in 1:10){
  glm.fit <- glm(mpg ~ poly(horsepower, i), data = Auto)
  cv.error.10[i] <- cv.glm(data = Auto, glmfit = glm.fit, K = 10)$delta[1]
}
cv.error.10
 [1] 24.20520 19.18924 19.30662 19.33799 18.87911 19.02103 18.89609
 [8] 19.71201 18.95140 19.50196

Notice that the computation time is much shorter than that of LOOCV. (In principle, the computation time for LOOCV for a least squares linear model should be faster than for \(k\)-fold CV, due to the availability of \[CV_{(n)} = \frac{1}{n}\sum_{i = 1}^n\left(\frac{y_i - \hat{y}_i}{1 - h_i}\right)^2,\] for LOOCV; however, unfortunately the cv.glm() function does not make use of this formula.) We still see little evidence that using cubic or higher-order polynomial terms leads to lower test error than simply using a quadratic fit.
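
For a least squares fit, the shortcut formula is easy to apply by hand using hatvalues() from base R; a minimal sketch using the lm.fit object defined earlier:

h <- hatvalues(lm.fit)  # leverage values h_i for lm(mpg ~ horsepower, data = Auto)
mean(((Auto$mpg - fitted(lm.fit)) / (1 - h))^2)  # ~24.23, matching cv.err$delta[1]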

We saw in the last section that the two numbers associated with delta are essentially the same when LOOCV is performed. When we instead perform \(k\)-fold CV, then the two numbers associated with delta differ slightly. The first is the standard \(k\)-fold CV estimate, as in \[CV_{(k)} = \frac{1}{k}\sum_{i=1}^kMSE_i.\] The second is a bias-corrected version. On this data set, the two estimates are very similar to each other.
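
For example, re-running a single k-fold fit and printing both components shows the slight difference (a sketch; kf is an illustrative name, and the exact numbers depend on the random folds):

set.seed(17)
kf <- cv.glm(data = Auto, glmfit = glm(mpg ~ poly(horsepower, 2), data = Auto), K = 10)
kf$delta  # standard k-fold estimate, then the bias-corrected version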

4 5-Fold Cross-Validation

We now compute 5-fold cross-validation errors for the models mod.be, mod.fs, mod1, mod2, and mod3 from Project 1. (Because k-fold CV partitions the data at random and no seed is set below, the exact numbers will vary slightly from run to run.)

site <- "http://ww2.amstat.org/publications/jse/datasets/homes76.dat.txt"
HP <- read.table(file = site, header = TRUE)
# Removing indicated columns
HP <- HP[ ,-c(1, 7, 10, 15, 16, 17, 18, 19)]
# Renaming columns
names(HP) <- c("price", "size", "lot", "bath", "bed", "year", "age", 
               "garage", "status", "active", "elem")
mod.be <- glm(price ~ size + lot + bed + status + elem, data = HP)
mod.fs <- glm(price ~ elem + garage + lot + active + size + bed, data = HP)
mod1 <- glm(price ~ . - status - year, data = HP)
mod2 <- update(mod1, .~. + bath:bed + I(age^2))
mod3 <- glm(price ~ size + lot + bath + bed + bath:bed + age + I(age^2) + 
              garage + active + I(elem == 'edison') + I(elem == 'harris'), 
            data = HP)
library(boot)
CV5mod.be <- cv.glm(data = HP, glmfit = mod.be, K = 5)$delta  # raw and bias-corrected estimates
CV5mod.be
[1] 2522.826 2424.758
CV5mod.be^0.5  # square root puts the error on the scale of price
[1] 50.22774 49.24183
CV5mod.fs <- cv.glm(data = HP, glmfit = mod.fs, K = 5)$delta
CV5mod.fs
[1] 2633.179 2516.025
CV5mod.fs^.5
[1] 51.31451 50.16000
CV5mod1 <- cv.glm(data = HP, glmfit = mod1, K = 5)$delta
CV5mod1
[1] 2885.120 2738.971
CV5mod1^.5
[1] 53.71331 52.33518
CV5mod2 <- cv.glm(data = HP, glmfit = mod2, K = 5)$delta
CV5mod2
[1] 2786.619 2609.500
CV5mod2^.5
[1] 52.78843 51.08327
CV5mod3 <- cv.glm(data = HP, glmfit = mod3, K = 5)$delta
CV5mod3
[1] 2560.953 2420.673
CV5mod3^.5
[1] 50.60586 49.20034
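
To compare the five models at a glance, the raw estimates can be collected into a single named vector and sorted (a sketch using the objects created above):

cv5 <- c(mod.be = CV5mod.be[1], mod.fs = CV5mod.fs[1],
         mod1 = CV5mod1[1], mod2 = CV5mod2[1], mod3 = CV5mod3[1])
sort(sqrt(cv5))  # root CV error on the scale of price, smallest first

With the folds drawn above, mod.be and mod3 have the smallest estimated errors.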