NOTE: Text in many places is verbatim from chapter 6 of James et al. (2013).

1 Read in the data:

Credit <- read.csv("http://www-bcf.usc.edu/~gareth/ISL/Credit.csv")
Credit$Utilization <- Credit$Balance / (Credit$Income*100)
summary(Credit)
       X             Income           Limit           Rating     
 Min.   :  1.0   Min.   : 10.35   Min.   :  855   Min.   : 93.0  
 1st Qu.:100.8   1st Qu.: 21.01   1st Qu.: 3088   1st Qu.:247.2  
 Median :200.5   Median : 33.12   Median : 4622   Median :344.0  
 Mean   :200.5   Mean   : 45.22   Mean   : 4736   Mean   :354.9  
 3rd Qu.:300.2   3rd Qu.: 57.47   3rd Qu.: 5873   3rd Qu.:437.2  
 Max.   :400.0   Max.   :186.63   Max.   :13913   Max.   :982.0  
     Cards            Age          Education        Gender    Student  
 Min.   :1.000   Min.   :23.00   Min.   : 5.00   Female:207   No :360  
 1st Qu.:2.000   1st Qu.:41.75   1st Qu.:11.00    Male :193   Yes: 40  
 Median :3.000   Median :56.00   Median :14.00                         
 Mean   :2.958   Mean   :55.67   Mean   :13.45                         
 3rd Qu.:4.000   3rd Qu.:70.00   3rd Qu.:16.00                         
 Max.   :9.000   Max.   :98.00   Max.   :20.00                         
 Married              Ethnicity      Balance         Utilization     
 No :155   African American: 99   Min.   :   0.00   Min.   :0.00000  
 Yes:245   Asian           :102   1st Qu.:  68.75   1st Qu.:0.01521  
           Caucasian       :199   Median : 459.50   Median :0.09976  
                                  Mean   : 520.01   Mean   :0.15120  
                                  3rd Qu.: 863.00   3rd Qu.:0.21266  
                                  Max.   :1999.00   Max.   :1.12156  
Credit <- Credit[, -1]  # drop the row-index column X
DT::datatable(Credit, rownames = FALSE)

2 Best subsets regression

To perform best subset selection, we fit a separate least squares regression for each possible combination of the p predictors. That is, we fit all p models that contain exactly one predictor, all \(\binom{p}{2}=p(p - 1)/2\) models that contain exactly two predictors, and so forth. We then look at all of the resulting models, with the goal of identifying the one that is best.

Algorithm 6.1 Best subset selection

  1. Let \(\mathcal{M}_0\) denote the null model, which contains no predictors. This model simply predicts the sample mean for each observation.

  2. For \(k = 1, 2, \ldots, p:\)

    1. Fit all \(\binom{p}{k}\) models that contain exactly \(k\) predictors.
    2. Pick the best among these \(\binom{p}{k}\) models, and call it \(\mathcal{M}_k\). Here best is defined as having the smallest RSS, or equivalently largest \(R^2\).
  3. Select a single best model from among the \(\mathcal{M}_0, \ldots, \mathcal{M}_p\) using cross-validated prediction error, \(C_p\) (AIC), BIC, or adjusted \(R^2\).
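As a concrete, deliberately naive illustration of Algorithm 6.1, the sketch below brute-forces step 2 with lm(), combn(), and reformulate(). The function name best_subset() is our own invention; the regsubsets() function used below performs the same search far more efficiently.

# Brute-force best subset selection (illustration only).
# For each size k, fit all choose(p, k) models and keep the one
# with the smallest RSS (Algorithm 6.1, step 2).
best_subset <- function(data, response) {
  preds <- setdiff(names(data), response)
  models <- vector("list", length(preds))
  for (k in seq_along(preds)) {
    combos <- combn(preds, k, simplify = FALSE)
    fits <- lapply(combos, function(vars) lm(reformulate(vars, response), data = data))
    rss <- vapply(fits, function(f) sum(resid(f)^2), numeric(1))
    models[[k]] <- fits[[which.min(rss)]]  # M_k
  }
  models  # step 3: choose among M_1, ..., M_p by CV, Cp, BIC, or adjusted R^2
}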

The regsubsets() function (part of the leaps package) performs best subset selection by identifying the best model that contains a given number of predictors, where best is quantified using RSS. The syntax is the same as for lm(). The summary() command outputs the best set of variables for each model size.

Note that the default for nvmax (maximum size of subsets to examine) is 8. To evaluate all of the variables in Credit, use nvmax = 12.

library(leaps)
regfit.full <- regsubsets(Balance ~ ., data = Credit, nvmax = 12)
summary(regfit.full)
Subset selection object
Call: regsubsets.formula(Balance ~ ., data = Credit, nvmax = 12)
12 Variables  (and intercept)
                   Forced in Forced out
Income                 FALSE      FALSE
Limit                  FALSE      FALSE
Rating                 FALSE      FALSE
Cards                  FALSE      FALSE
Age                    FALSE      FALSE
Education              FALSE      FALSE
Gender Male            FALSE      FALSE
StudentYes             FALSE      FALSE
MarriedYes             FALSE      FALSE
EthnicityAsian         FALSE      FALSE
EthnicityCaucasian     FALSE      FALSE
Utilization            FALSE      FALSE
1 subsets of each size up to 12
Selection Algorithm: exhaustive
          Income Limit Rating Cards Age Education Gender Male StudentYes
1  ( 1 )  " "    " "   "*"    " "   " " " "       " "         " "       
2  ( 1 )  " "    " "   "*"    " "   " " " "       " "         " "       
3  ( 1 )  "*"    " "   "*"    " "   " " " "       " "         "*"       
4  ( 1 )  "*"    "*"   " "    "*"   " " " "       " "         "*"       
5  ( 1 )  "*"    "*"   " "    "*"   " " " "       " "         "*"       
6  ( 1 )  "*"    "*"   "*"    "*"   " " " "       " "         "*"       
7  ( 1 )  "*"    "*"   "*"    "*"   "*" " "       " "         "*"       
8  ( 1 )  "*"    "*"   "*"    "*"   "*" " "       "*"         "*"       
9  ( 1 )  "*"    "*"   "*"    "*"   "*" " "       "*"         "*"       
10  ( 1 ) "*"    "*"   "*"    "*"   "*" " "       "*"         "*"       
11  ( 1 ) "*"    "*"   "*"    "*"   "*" " "       "*"         "*"       
12  ( 1 ) "*"    "*"   "*"    "*"   "*" "*"       "*"         "*"       
          MarriedYes EthnicityAsian EthnicityCaucasian Utilization
1  ( 1 )  " "        " "            " "                " "        
2  ( 1 )  " "        " "            " "                "*"        
3  ( 1 )  " "        " "            " "                " "        
4  ( 1 )  " "        " "            " "                " "        
5  ( 1 )  " "        " "            " "                "*"        
6  ( 1 )  " "        " "            " "                "*"        
7  ( 1 )  " "        " "            " "                "*"        
8  ( 1 )  " "        " "            " "                "*"        
9  ( 1 )  "*"        " "            " "                "*"        
10  ( 1 ) "*"        "*"            " "                "*"        
11  ( 1 ) "*"        "*"            "*"                "*"        
12  ( 1 ) "*"        "*"            "*"                "*"        

An asterisk indicates that a given variable is included in the corresponding model. For instance, this output indicates that the best two-variable model contains only Rating and Utilization.
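One way to extract this membership programmatically (a small sketch of ours, using the which matrix returned by summary()):

sum.full <- summary(regfit.full)
names(which(sum.full$which[2, ]))[-1]  # variables in the best two-variable model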

The summary() function also returns \(R^2\), RSS, adjusted \(R^2\), \(C_p\), and BIC.

names(summary(regfit.full))
[1] "which"  "rsq"    "rss"    "adjr2"  "cp"     "bic"    "outmat" "obj"   
summary(regfit.full)$rsq
 [1] 0.7458484 0.8855145 0.9498788 0.9535800 0.9546034 0.9551550 0.9555780
 [8] 0.9556808 0.9557625 0.9558333 0.9558954 0.9559304
reg.summary <- summary(regfit.full)
reg.summary$bic
 [1]  -535.9468  -848.9484 -1173.3585 -1198.0527 -1200.9789 -1199.8774
 [7] -1197.6764 -1192.6115 -1187.3586 -1182.0077 -1176.5787 -1170.9053
reg.summary$rss
 [1] 21435122  9655698  4227219  3915058  3828741  3782220  3746548
 [8]  3737880  3730984  3725014  3719780  3716824

2.1 Plotting

Plotting RSS, adjusted \(R^2\), \(C_p\), and BIC for all models at once will help us decide which model to select. The points() command works like the plot() command, except that it puts points on a plot that has already been created, instead of creating a new plot. The which.max() function can be used to identify the location of the maximum point of a vector.

par(mfrow = c(2, 2))
plot(reg.summary$rss, xlab = "Number of Variables", ylab = "RSS", type = "l")
plot(reg.summary$adjr2, xlab = "Number of Variables", ylab = "Adjusted RSq", type = "l")
points(which.max(reg.summary$adjr2), reg.summary$adjr2[which.max(reg.summary$adjr2)], col = "red", cex = 2, pch = 20)
plot(reg.summary$cp, xlab = "Number of Variables", ylab = "Cp", type = "l")
points(which.min(reg.summary$cp), reg.summary$cp[which.min(reg.summary$cp)], col = "red", cex = 2, pch = 20)
plot(reg.summary$bic, xlab = "Number of Variables", ylab = "BIC", type = "l")
points(which.min(reg.summary$bic), reg.summary$bic[which.min(reg.summary$bic)], col = "red", cex = 2, pch = 20)

Figure 2.1: Red dots indicate optimal number of variables

par(mfrow = c(1, 1))

2.2 Using the built-in leaps plotting functions

The top row of each plot contains a black square for each variable selected according to the optimal model associated with that statistic.

par(mfrow = c(2, 2))
plot(regfit.full, scale = "r2")
plot(regfit.full, scale = "adjr2")
plot(regfit.full, scale = "Cp")
plot(regfit.full, scale = "bic")

par(mfrow = c(1, 1))

What are the coefficients selected with BIC?

which.min(reg.summary$bic)
[1] 5
coef(regfit.full, which.min(reg.summary$bic))
 (Intercept)       Income        Limit        Cards   StudentYes 
-490.4886212   -6.8997534    0.2525404   21.1105879  406.3008916 
 Utilization 
 155.4748273 

2.3 Forward selection with leaps

Forward stepwise selection is a computationally efficient alternative to best subset selection. While the best subset selection procedure considers all \(2^p\) possible models containing subsets of the \(p\) predictors, forward stepwise considers a much smaller set of models. Forward stepwise selection begins with a model containing no predictors, and then adds predictors to the model, one at a time, until all of the predictors are in the model. In particular, at each step the variable that gives the greatest additional improvement to the fit is added to the model.

Algorithm 6.2 Forward stepwise selection

  1. Let \(\mathcal{M}_0\) denote the null model, which contains no predictors. This model simply predicts the sample mean for each observation.

  2. For \(k = 0, 1, \ldots, p-1:\)

    1. Consider all \(p-k\) models that augment the predictors in \(\mathcal{M}_k\) with one additional predictor.
    2. Choose the best among these \(p-k\) models, and call it \(\mathcal{M}_{k+1}\). Here best is defined as having the smallest RSS, or equivalently largest \(R^2\).
  3. Select a single best model from among the \(\mathcal{M}_0, \ldots, \mathcal{M}_p\) using cross-validated prediction error, \(C_p\) (AIC), BIC, or adjusted \(R^2\).

Unlike best subset selection, which involved fitting \(2^p\) models, forward stepwise selection involves fitting one null model, along with \(p-k\) models in the \(k\)th iteration, for \(k=0,1, \ldots,p-1\). This amounts to a total of \(1 + \sum_{k=0}^{p-1}(p-k)=1 + p(p + 1)/2\) models. This is a substantial difference: when \(p=12\), best subset selection requires fitting \(2^{12} = 4096\) models, whereas forward stepwise selection requires fitting only \(1 + 12(12 + 1)/2 = 79\) models.
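A quick sanity check of these counts in R (our own snippet):

p <- 12
2^p                  # 4096 models for best subset selection
1 + p * (p + 1) / 2  # 79 models for forward stepwise selection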

regfit.fwd <- regsubsets(Balance ~ ., data = Credit, nvmax = 12, method = "forward")
summary(regfit.fwd)
Subset selection object
Call: regsubsets.formula(Balance ~ ., data = Credit, nvmax = 12, method = "forward")
12 Variables  (and intercept)
                   Forced in Forced out
Income                 FALSE      FALSE
Limit                  FALSE      FALSE
Rating                 FALSE      FALSE
Cards                  FALSE      FALSE
Age                    FALSE      FALSE
Education              FALSE      FALSE
Gender Male            FALSE      FALSE
StudentYes             FALSE      FALSE
MarriedYes             FALSE      FALSE
EthnicityAsian         FALSE      FALSE
EthnicityCaucasian     FALSE      FALSE
Utilization            FALSE      FALSE
1 subsets of each size up to 12
Selection Algorithm: forward
          Income Limit Rating Cards Age Education Gender Male StudentYes
1  ( 1 )  " "    " "   "*"    " "   " " " "       " "         " "       
2  ( 1 )  " "    " "   "*"    " "   " " " "       " "         " "       
3  ( 1 )  " "    " "   "*"    " "   " " " "       " "         "*"       
4  ( 1 )  "*"    " "   "*"    " "   " " " "       " "         "*"       
5  ( 1 )  "*"    "*"   "*"    " "   " " " "       " "         "*"       
6  ( 1 )  "*"    "*"   "*"    "*"   " " " "       " "         "*"       
7  ( 1 )  "*"    "*"   "*"    "*"   "*" " "       " "         "*"       
8  ( 1 )  "*"    "*"   "*"    "*"   "*" " "       "*"         "*"       
9  ( 1 )  "*"    "*"   "*"    "*"   "*" " "       "*"         "*"       
10  ( 1 ) "*"    "*"   "*"    "*"   "*" " "       "*"         "*"       
11  ( 1 ) "*"    "*"   "*"    "*"   "*" " "       "*"         "*"       
12  ( 1 ) "*"    "*"   "*"    "*"   "*" "*"       "*"         "*"       
          MarriedYes EthnicityAsian EthnicityCaucasian Utilization
1  ( 1 )  " "        " "            " "                " "        
2  ( 1 )  " "        " "            " "                "*"        
3  ( 1 )  " "        " "            " "                "*"        
4  ( 1 )  " "        " "            " "                "*"        
5  ( 1 )  " "        " "            " "                "*"        
6  ( 1 )  " "        " "            " "                "*"        
7  ( 1 )  " "        " "            " "                "*"        
8  ( 1 )  " "        " "            " "                "*"        
9  ( 1 )  "*"        " "            " "                "*"        
10  ( 1 ) "*"        "*"            " "                "*"        
11  ( 1 ) "*"        "*"            "*"                "*"        
12  ( 1 ) "*"        "*"            "*"                "*"        
reg.summary <- summary(regfit.fwd)

2.4 Plotting

par(mfrow = c(2, 2))
plot(reg.summary$rss, xlab = "Number of Variables", ylab = "RSS", type = "l")
plot(reg.summary$adjr2, xlab = "Number of Variables", ylab = "Adjusted RSq", type = "l")
points(which.max(reg.summary$adjr2), reg.summary$adjr2[which.max(reg.summary$adjr2)], col = "red", cex = 2, pch = 20)
plot(reg.summary$cp, xlab = "Number of Variables", ylab = "Cp", type = "l")
points(which.min(reg.summary$cp), reg.summary$cp[which.min(reg.summary$cp)], col = "red", cex = 2, pch = 20)
plot(reg.summary$bic, xlab = "Number of Variables", ylab = "BIC", type = "l")
points(which.min(reg.summary$bic), reg.summary$bic[which.min(reg.summary$bic)], col = "red", cex = 2, pch = 20)

par(mfrow = c(1, 1))

2.5 Using the built-in leaps plotting functions

par(mfrow = c(2, 2))
plot(regfit.fwd, scale = "r2")
plot(regfit.fwd, scale = "adjr2")
plot(regfit.fwd, scale = "Cp")
plot(regfit.fwd, scale = "bic")

par(mfrow = c(1, 1))

What are the coefficients selected with BIC?

which.min(reg.summary$bic)
[1] 6
coef(regfit.fwd, which.min(reg.summary$bic))
 (Intercept)       Income        Limit       Rating        Cards 
-516.3833320   -6.9477878    0.1823182    1.0605783   15.9497347 
  StudentYes  Utilization 
 403.9421811  153.2844256 

2.6 Backward selection with leaps

Backward stepwise selection, like forward stepwise selection, provides a computationally efficient alternative to best subset selection. However, unlike forward stepwise selection, it begins with the full least squares model containing all p predictors, and then iteratively removes the least useful predictor, one at a time.

Algorithm 6.3 Backward stepwise selection

  1. Let \(\mathcal{M}_p\) denote the full model, which contains all p predictors.

  2. For \(k = p, p-1, \ldots, 1:\)

    1. Consider all k models that contain all but one of the predictors in \(\mathcal{M}_k\), for a total of \(k-1\) predictors.
    2. Choose the best among these \(k\) models, and call it \(\mathcal{M}_{k-1}\). Here best is defined as having the smallest RSS, or equivalently largest \(R^2\).
  3. Select a single best model from among the \(\mathcal{M}_0, \ldots, \mathcal{M}_p\) using cross-validated prediction error, \(C_p\) (AIC), BIC, or adjusted \(R^2\).

Like forward stepwise selection, the backward selection approach searches through only \(1 + p(p+1)/2\) models, and so can be applied in settings where p is too large to apply best subset selection. Also like forward stepwise selection, backward stepwise selection is not guaranteed to yield the best model containing a subset of p predictors.

Backward selection requires that the number of samples \(n\) is larger than the number of variables \(p\) (so that the full model can be fit). In contrast, forward stepwise can be used even when \(n < p\), and so is the only viable method when \(p\) is very large.

regfit.bwd <- regsubsets(Balance ~ ., data = Credit, nvmax = 12, method = "backward")
summary(regfit.bwd)
Subset selection object
Call: regsubsets.formula(Balance ~ ., data = Credit, nvmax = 12, method = "backward")
12 Variables  (and intercept)
                   Forced in Forced out
Income                 FALSE      FALSE
Limit                  FALSE      FALSE
Rating                 FALSE      FALSE
Cards                  FALSE      FALSE
Age                    FALSE      FALSE
Education              FALSE      FALSE
Gender Male            FALSE      FALSE
StudentYes             FALSE      FALSE
MarriedYes             FALSE      FALSE
EthnicityAsian         FALSE      FALSE
EthnicityCaucasian     FALSE      FALSE
Utilization            FALSE      FALSE
1 subsets of each size up to 12
Selection Algorithm: backward
          Income Limit Rating Cards Age Education Gender Male StudentYes
1  ( 1 )  " "    "*"   " "    " "   " " " "       " "         " "       
2  ( 1 )  "*"    "*"   " "    " "   " " " "       " "         " "       
3  ( 1 )  "*"    "*"   " "    " "   " " " "       " "         "*"       
4  ( 1 )  "*"    "*"   " "    "*"   " " " "       " "         "*"       
5  ( 1 )  "*"    "*"   " "    "*"   " " " "       " "         "*"       
6  ( 1 )  "*"    "*"   "*"    "*"   " " " "       " "         "*"       
7  ( 1 )  "*"    "*"   "*"    "*"   "*" " "       " "         "*"       
8  ( 1 )  "*"    "*"   "*"    "*"   "*" " "       "*"         "*"       
9  ( 1 )  "*"    "*"   "*"    "*"   "*" " "       "*"         "*"       
10  ( 1 ) "*"    "*"   "*"    "*"   "*" " "       "*"         "*"       
11  ( 1 ) "*"    "*"   "*"    "*"   "*" " "       "*"         "*"       
12  ( 1 ) "*"    "*"   "*"    "*"   "*" "*"       "*"         "*"       
          MarriedYes EthnicityAsian EthnicityCaucasian Utilization
1  ( 1 )  " "        " "            " "                " "        
2  ( 1 )  " "        " "            " "                " "        
3  ( 1 )  " "        " "            " "                " "        
4  ( 1 )  " "        " "            " "                " "        
5  ( 1 )  " "        " "            " "                "*"        
6  ( 1 )  " "        " "            " "                "*"        
7  ( 1 )  " "        " "            " "                "*"        
8  ( 1 )  " "        " "            " "                "*"        
9  ( 1 )  "*"        " "            " "                "*"        
10  ( 1 ) "*"        "*"            " "                "*"        
11  ( 1 ) "*"        "*"            "*"                "*"        
12  ( 1 ) "*"        "*"            "*"                "*"        
reg.summary <- summary(regfit.bwd)

2.7 Plotting

par(mfrow = c(2, 2))
plot(reg.summary$rss, xlab = "Number of Variables", ylab = "RSS", type = "l")
plot(reg.summary$adjr2, xlab = "Number of Variables", ylab = "Adjusted RSq", type = "l")
points(which.max(reg.summary$adjr2), reg.summary$adjr2[which.max(reg.summary$adjr2)], col = "red", cex = 2, pch = 20)
plot(reg.summary$cp, xlab = "Number of Variables", ylab = "Cp", type = "l")
points(which.min(reg.summary$cp), reg.summary$cp[which.min(reg.summary$cp)], col = "red", cex = 2, pch = 20)
plot(reg.summary$bic, xlab = "Number of Variables", ylab = "BIC", type = "l")
points(which.min(reg.summary$bic), reg.summary$bic[which.min(reg.summary$bic)], col = "red", cex = 2, pch = 20)

par(mfrow = c(1, 1))

2.8 Using the built-in leaps plotting functions

par(mfrow = c(2, 2))
plot(regfit.bwd, scale = "r2")
plot(regfit.bwd, scale = "adjr2")
plot(regfit.bwd, scale = "Cp")
plot(regfit.bwd, scale = "bic")

par(mfrow = c(1, 1))

What are the coefficients selected with BIC?

which.min(reg.summary$bic)
[1] 5
coef(regfit.bwd, which.min(reg.summary$bic))
 (Intercept)       Income        Limit        Cards   StudentYes 
-490.4886212   -6.8997534    0.2525404   21.1105879  406.3008916 
 Utilization 
 155.4748273 

2.8.1 Different Models?

The best five-variable models from the exhaustive search and from backward selection agree, but forward selection keeps Rating in place of Cards. By seven variables, all three approaches select the same model.

coef(regfit.full, 5)
 (Intercept)       Income        Limit        Cards   StudentYes 
-490.4886212   -6.8997534    0.2525404   21.1105879  406.3008916 
 Utilization 
 155.4748273 
coef(regfit.fwd, 5)
 (Intercept)       Income        Limit       Rating   StudentYes 
-506.3322347   -6.8374452    0.1165379    2.0189019  396.0568703 
 Utilization 
 181.5979490 
coef(regfit.bwd, 5)
 (Intercept)       Income        Limit        Cards   StudentYes 
-490.4886212   -6.8997534    0.2525404   21.1105879  406.3008916 
 Utilization 
 155.4748273 
#
coef(regfit.full, 7)
 (Intercept)       Income        Limit       Rating        Cards 
-487.7563318   -6.9233687    0.1822902    1.0649244   16.3703375 
         Age   StudentYes  Utilization 
  -0.5606190  403.9969037  145.4632091 
coef(regfit.fwd, 7)
 (Intercept)       Income        Limit       Rating        Cards 
-487.7563318   -6.9233687    0.1822902    1.0649244   16.3703375 
         Age   StudentYes  Utilization 
  -0.5606190  403.9969037  145.4632091 
coef(regfit.bwd, 7)
 (Intercept)       Income        Limit       Rating        Cards 
-487.7563318   -6.9233687    0.1822902    1.0649244   16.3703375 
         Age   StudentYes  Utilization 
  -0.5606190  403.9969037  145.4632091 

3 Choosing among models using the validation set approach

We just saw that it is possible to choose among a set of models of different sizes using \(C_p\), BIC, and adjusted \(R^2\). We will now consider how to do this using the validation set and cross-validation approaches.

In order for these approaches to yield accurate estimates of the test error, we must use only the training observations to perform all aspects of model fitting—including variable selection. Therefore, the determination of which model of a given size is best must be made using only the training observations. This point is subtle but important. If the full data set is used to perform the best subset selection step, the validation set errors and cross-validation errors that we obtain will not be accurate estimates of the test error.

In order to use the validation set approach, we begin by splitting the observations into a training set and a test set. We do this by creating a random vector, train, whose elements equal TRUE if the corresponding observation is in the training set, and FALSE otherwise. The vector test has a TRUE if the observation is in the test set, and a FALSE otherwise. Note that the ! in the command used to create test switches TRUEs to FALSEs and vice versa.

set.seed(134)
train <- sample(c(TRUE, FALSE), size = nrow(Credit), replace = TRUE)
test <- (!train)

Now, we apply regsubsets() to the training set in order to perform best subset selection. Note: this method is not viable for large p.

regfit.best <- regsubsets(Balance ~ ., data = Credit[train, ], nvmax = 12)

Before computing the validation set error for the best model of each model size, a model matrix for the test data is created.

test.mat <- model.matrix(Balance ~ ., data = Credit[test, ])
head(test.mat)
   (Intercept)  Income Limit Rating Cards Age Education Gender Male
2            1 106.025  6645    483     3  82        15           0
3            1 104.593  7075    514     4  71        11           1
5            1  55.882  4897    357     2  68        16           1
8            1  71.408  7114    512     2  87         9           1
10           1  71.061  6819    491     3  41        19           0
11           1  63.095  8117    589     4  30        14           1
   StudentYes MarriedYes EthnicityAsian EthnicityCaucasian Utilization
2           1          1              1                  0  0.08516859
3           0          0              1                  0  0.05545304
5           0          1              0                  1  0.05923195
8           0          0              1                  0  0.12211517
10          1          1              0                  0  0.18997762
11          0          1              0                  1  0.22299707

The model.matrix() function is used in many regression packages for building an “X” matrix from data. Now we run the loop, and for each size i, we extract the coefficients from regfit.best for the best model of that size, multiply them into the appropriate columns of the test model matrix to form the predictions, and compute the test MSE.

coef1 <- coef(regfit.best, id = 1)
coef1
(Intercept)      Rating 
 -362.85093     2.51319 
coef2 <- coef(regfit.best, id = 2)
coef2
(Intercept)      Rating Utilization 
-439.840217    2.256745 1005.014415 
names(coef1)
[1] "(Intercept)" "Rating"     
names(coef2)
[1] "(Intercept)" "Rating"      "Utilization"
head(test.mat[, names(coef1)])
   (Intercept) Rating
2            1    483
3            1    514
5            1    357
8            1    512
10           1    491
11           1    589
head(test.mat[, names(coef2)])
   (Intercept) Rating Utilization
2            1    483  0.08516859
3            1    514  0.05545304
5            1    357  0.05923195
8            1    512  0.12211517
10           1    491  0.18997762
11           1    589  0.22299707
val.errors <- numeric(12)
for(i in 1:12){
  coefi <- coef(regfit.best, id = i)
  pred <- test.mat[, names(coefi)]%*%coefi
  val.errors[i] <- mean((Credit$Balance[test] - pred)^2)
}
val.errors
 [1] 52287.14 25401.93 11521.79 11086.38 10868.55 10766.73 10707.39
 [8] 10747.02 10805.68 10859.13 10858.35 10825.10
which.min(val.errors)
[1] 7
coef(regfit.best, which.min(val.errors))
 (Intercept)       Income        Limit       Rating        Cards 
-505.8922107   -6.9285676    0.1843488    1.0535554   19.4199497 
         Age   StudentYes  Utilization 
  -0.6242167  413.1856810  145.8489745 

3.1 Creating a predict function for regsubsets

That last chunk of code was rather tedious, partly because there is no predict() method for regsubsets(). Since we will be using this function again, we can capture our steps above and write our own method.

predict.regsubsets <- function(object, newdata, id, ...){
  form <- as.formula(object$call[[2]])  # recover the formula from the regsubsets() call
  mat <- model.matrix(form, newdata)    # build the design matrix for the new data
  coefi <- coef(object, id = id)        # coefficients of the best id-variable model
  xvars <- names(coefi)
  mat[, xvars] %*% coefi                # predictions
}
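Because the object returned by regsubsets() has class "regsubsets", calling predict() now dispatches to this method. As a quick check (our own), reproducing the best 7-variable validation error from the loop above:

pred7 <- predict(regfit.best, Credit[test, ], id = 7)
mean((Credit$Balance[test] - pred7)^2)  # should match val.errors[7]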

Finally, we perform best subset selection on the full data set, and select the best 7-variable model. It is important that we make use of the full data set in order to obtain more accurate coefficient estimates. Note that we perform best subset selection on the full data set and select the best 7-variable model, rather than simply using the variables that were obtained from the training set, because the best 7-variable model on the full data set may differ from the corresponding model in the training set.

regfit.best <- regsubsets(Balance ~ ., data = Credit, nvmax = which.min(val.errors))
coef(regfit.best, id = which.min(val.errors))
 (Intercept)       Income        Limit       Rating        Cards 
-487.7563318   -6.9233687    0.1822902    1.0649244   16.3703375 
         Age   StudentYes  Utilization 
  -0.5606190  403.9969037  145.4632091 

3.2 Choosing among models of different sizes with cross-validation

Next we choose among the models of different sizes using cross-validation. This approach is somewhat involved, as we must perform best subset selection within each of the k training sets. Despite this, we see that with its clever subsetting syntax, R makes this job quite easy. First we create a vector that allocates each observation to one of \(k = 5\) folds, and we create a matrix in which we will store the results.

k <- 5
set.seed(1)
folds <- sample(1:k, size = nrow(Credit), replace = TRUE)
cv.errors <- matrix(NA, k, 12, dimnames = list(NULL, paste(1:12)))

Now we write a for loop that performs the cross-validation. In the \(j^{\text{th}}\) fold, the elements of folds that equal j are in the test set, and the remainder are in the training set. We make our predictions for each model size (using our new predict() method), compute the test errors on the appropriate subset, and store them in the appropriate slot in the matrix cv.errors.

for(j in 1:k){
  best.fit <- regsubsets(Balance ~ ., data = Credit[folds != j, ], nvmax = 12)
  for(i in 1:12){
    pred <- predict(best.fit, Credit[folds == j, ], id = i)
    cv.errors[j, i] <- mean((Credit$Balance[folds == j] - pred)^2)
  }
}

This has given us a \(5 \times 12\) matrix, of which the (\(i,j\))th element corresponds to the test MSE for the \(i^{\text{th}}\) cross-validation fold for the best \(j\)-variable model. We use the apply() function to average over the columns of this matrix in order to obtain a vector for which the \(j^{\text{th}}\) element is the cross-validation error for the \(j\)-variable model.

mean.cv.errors <- apply(cv.errors, 2, mean)
mean.cv.errors
       1        2        3        4        5        6        7        8 
54889.41 24995.29 11835.29 10887.59 10671.61 10279.28 10152.01 10403.26 
       9       10       11       12 
10427.87 10410.61 10402.15 10365.81 
which.min(mean.cv.errors)
7 
7 
plot(mean.cv.errors, type = "b")

Note that the best model contains 7 variables. We now perform best subset selection on the full data set in order to obtain the 7-variable model.

reg.best <- regsubsets(Balance ~ ., data = Credit, nvmax = 12)
coef(reg.best, which.min(mean.cv.errors))
 (Intercept)       Income        Limit       Rating        Cards 
-487.7563318   -6.9233687    0.1822902    1.0649244   16.3703375 
         Age   StudentYes  Utilization 
  -0.5606190  403.9969037  145.4632091 
coef(reg.best, 3) # The curve really does not drop much after 3...
(Intercept)      Income      Rating  StudentYes 
-581.078888   -7.874931    3.987472  418.760284 
mymod <- lm(Balance ~ Income + Rating +  Student, data = Credit)
summary(mymod)

Call:
lm(formula = Balance ~ Income + Rating + Student, data = Credit)

Residuals:
     Min       1Q   Median       3Q      Max 
-226.126  -80.445   -5.018   65.192  293.234 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -581.07889   13.83463  -42.00   <2e-16 ***
Income        -7.87493    0.24021  -32.78   <2e-16 ***
Rating         3.98747    0.05471   72.89   <2e-16 ***
StudentYes   418.76028   17.23025   24.30   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 103.3 on 396 degrees of freedom
Multiple R-squared:  0.9499,    Adjusted R-squared:  0.9495 
F-statistic:  2502 on 3 and 396 DF,  p-value: < 2.2e-16

4 Using stepAIC

The stepAIC() function from the MASS package performs stepwise model selection using AIC. Below we define the intercept-only (null) and full models to set the search scope, then step forward from the null model and backward from the full model; the argument test = "F" adds an approximate F test to each step's output.

library(MASS)
null <- lm(Balance ~ 1, data = Credit)
full <- lm(Balance ~ ., data = Credit)
mod.fs <- stepAIC(null, scope = list(lower = null, upper = full), direction = "forward", test = "F")
Start:  AIC=4905.56
Balance ~ 1

              Df Sum of Sq      RSS    AIC F Value     Pr(F)    
+ Rating       1  62904790 21435122 4359.6 1167.99 < 2.2e-16 ***
+ Limit        1  62624255 21715657 4364.8 1147.76 < 2.2e-16 ***
+ Utilization  1  27382381 56957530 4750.5  191.34 < 2.2e-16 ***
+ Income       1  18131167 66208745 4810.7  108.99 < 2.2e-16 ***
+ Student      1   5658372 78681540 4879.8   28.62 1.488e-07 ***
+ Cards        1    630416 83709496 4904.6    3.00   0.08418 .  
<none>                     84339912 4905.6                      
+ Gender       1     38892 84301020 4907.4    0.18   0.66852    
+ Education    1      5481 84334431 4907.5    0.03   0.87231    
+ Married      1      2715 84337197 4907.5    0.01   0.90994    
+ Age          1       284 84339628 4907.6    0.00   0.97081    
+ Ethnicity    2     18454 84321458 4909.5    0.04   0.95749    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Step:  AIC=4359.63
Balance ~ Rating

              Df Sum of Sq      RSS    AIC F Value     Pr(F)    
+ Utilization  1  11779424  9655698 4042.6  484.32 < 2.2e-16 ***
+ Income       1  10902581 10532541 4077.4  410.95 < 2.2e-16 ***
+ Student      1   5735163 15699959 4237.1  145.02 < 2.2e-16 ***
+ Age          1    649110 20786012 4349.3   12.40 0.0004798 ***
+ Cards        1    138580 21296542 4359.0    2.58 0.1087889    
+ Married      1    118209 21316913 4359.4    2.20 0.1386707    
<none>                     21435122 4359.6                      
+ Education    1     27243 21407879 4361.1    0.51 0.4776403    
+ Gender       1     16065 21419057 4361.3    0.30 0.5855899    
+ Limit        1      7960 21427162 4361.5    0.15 0.7011619    
+ Ethnicity    2     51100 21384022 4362.7    0.47 0.6233922    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Step:  AIC=4042.64
Balance ~ Rating + Utilization

            Df Sum of Sq     RSS    AIC F Value     Pr(F)    
+ Student    1   2671767 6983931 3915.1 151.493 < 2.2e-16 ***
+ Income     1   1025771 8629927 3999.7  47.069  2.65e-11 ***
+ Married    1     95060 9560638 4040.7   3.937   0.04791 *  
+ Age        1     50502 9605197 4042.5   2.082   0.14983    
<none>                   9655698 4042.6                      
+ Limit      1     42855 9612843 4042.9   1.765   0.18472    
+ Education  1     28909 9626789 4043.4   1.189   0.27616    
+ Gender     1      7187 9648511 4044.3   0.295   0.58735    
+ Cards      1      3371 9652327 4044.5   0.138   0.71017    
+ Ethnicity  2     13259 9642439 4046.1   0.272   0.76231    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Step:  AIC=3915.06
Balance ~ Rating + Utilization + Student

            Df Sum of Sq     RSS    AIC F Value   Pr(F)    
+ Income     1   2893712 4090219 3703.1 279.451 < 2e-16 ***
+ Limit      1     77766 6906165 3912.6   4.448 0.03557 *  
+ Age        1     58618 6925313 3913.7   3.343 0.06823 .  
<none>                   6983931 3915.1                    
+ Married    1     33686 6950245 3915.1   1.914 0.16725    
+ Education  1      2344 6981587 3916.9   0.133 0.71591    
+ Cards      1      1302 6982630 3917.0   0.074 0.78627    
+ Gender     1         9 6983922 3917.1   0.001 0.98212    
+ Ethnicity  2      1715 6982217 3919.0   0.048 0.95278    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Step:  AIC=3703.06
Balance ~ Rating + Utilization + Student + Income

            Df Sum of Sq     RSS    AIC F Value     Pr(F)    
+ Limit      1    178086 3912133 3687.3 17.9354 2.847e-05 ***
+ Age        1     34096 4056122 3701.7  3.3120   0.06953 .  
<none>                   4090219 3703.1                      
+ Married    1     15941 4074278 3703.5  1.5416   0.21512    
+ Gender     1      8880 4081339 3704.2  0.8572   0.35508    
+ Cards      1      4628 4085591 3704.6  0.4463   0.50447    
+ Education  1       445 4089774 3705.0  0.0428   0.83613    
+ Ethnicity  2     16108 4074111 3705.5  0.7769   0.46054    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Step:  AIC=3687.25
Balance ~ Rating + Utilization + Student + Income + Limit

            Df Sum of Sq     RSS    AIC F Value     Pr(F)    
+ Cards      1    129913 3782220 3675.7 13.4989 0.0002718 ***
+ Age        1     29075 3883058 3686.3  2.9427 0.0870572 .  
<none>                   3912133 3687.3                      
+ Gender     1     10045 3902089 3688.2  1.0116 0.3151296    
+ Married    1      8872 3903262 3688.3  0.8932 0.3451820    
+ Education  1      3501 3908633 3688.9  0.3520 0.5533444    
+ Ethnicity  2     12590 3899543 3690.0  0.6328 0.5316436    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Step:  AIC=3675.74
Balance ~ Rating + Utilization + Student + Income + Limit + Cards

            Df Sum of Sq     RSS    AIC F Value   Pr(F)  
+ Age        1     35671 3746548 3674.0  3.7323 0.05409 .
<none>                   3782220 3675.7                  
+ Gender     1      8945 3773275 3676.8  0.9293 0.33564  
+ Married    1      4801 3777419 3677.2  0.4982 0.48069  
+ Education  1      3733 3778487 3677.3  0.3873 0.53408  
+ Ethnicity  2     10981 3771239 3678.6  0.5693 0.56641  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Step:  AIC=3673.95
Balance ~ Rating + Utilization + Student + Income + Limit + Cards + 
    Age

            Df Sum of Sq     RSS    AIC F Value  Pr(F)
<none>                   3746548 3674.0               
+ Gender     1    8668.5 3737880 3675.0 0.90676 0.3416
+ Married    1    7191.5 3739357 3675.2 0.75197 0.3864
+ Education  1    3505.2 3743043 3675.6 0.36616 0.5455
+ Ethnicity  2    8615.0 3737933 3677.0 0.44943 0.6383
mod.be <- stepAIC(full, scope = list(lower = null, upper = full), direction = "backward", test = "F")
Start:  AIC=3680.77
Balance ~ Income + Limit + Rating + Cards + Age + Education + 
    Gender + Student + Married + Ethnicity + Utilization

              Df Sum of Sq     RSS    AIC F Value     Pr(F)    
- Ethnicity    2     11142 3727965 3678.0    0.58 0.5603551    
- Education    1      2957 3719780 3679.1    0.31 0.5793166    
- Married      1      8329 3725153 3679.7    0.87 0.3522921    
- Gender       1      8958 3725782 3679.7    0.93 0.3347539    
<none>                     3716824 3680.8                      
- Age          1     35042 3751866 3682.5    3.65 0.0568551 .  
- Rating       1     50626 3767450 3684.2    5.27 0.0222141 *  
- Utilization  1     69907 3786730 3686.2    7.28 0.0072828 ** 
- Cards        1    128547 3845371 3692.4   13.38 0.0002889 ***
- Limit        1    287249 4004073 3708.5   29.91 8.141e-08 ***
- Income       1   3046715 6763538 3918.2  317.23 < 2.2e-16 ***
- Student      1   4650015 8366839 4003.3  484.17 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Step:  AIC=3677.96
Balance ~ Income + Limit + Rating + Cards + Age + Education + 
    Gender + Student + Married + Utilization

              Df Sum of Sq     RSS    AIC F Value     Pr(F)    
- Education    1      3019 3730984 3676.3    0.31 0.5749597    
- Married      1      6271 3734237 3676.6    0.65 0.4190419    
- Gender       1      8509 3736474 3676.9    0.89 0.3466409    
<none>                     3727965 3678.0                      
- Age          1     37386 3765351 3680.0    3.90 0.0489614 *  
- Rating       1     48086 3776052 3681.1    5.02 0.0256547 *  
- Utilization  1     72849 3800814 3683.7    7.60 0.0061067 ** 
- Cards        1    131187 3859153 3689.8   13.69 0.0002468 ***
- Limit        1    294067 4022032 3706.3   30.68   5.6e-08 ***
- Income       1   3036882 6764847 3914.3  316.89 < 2.2e-16 ***
- Student      1   4668283 8396249 4000.7  487.12 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Step:  AIC=3676.29
Balance ~ Income + Limit + Rating + Cards + Age + Gender + Student + 
    Married + Utilization

              Df Sum of Sq     RSS    AIC F Value     Pr(F)    
- Married      1      6896 3737880 3675.0    0.72 0.3963978    
- Gender       1      8373 3739357 3675.2    0.88 0.3500974    
<none>                     3730984 3676.3                      
- Age          1     37726 3768710 3678.3    3.94 0.0477529 *  
- Rating       1     50282 3781266 3679.6    5.26 0.0224028 *  
- Utilization  1     74587 3805572 3682.2    7.80 0.0054920 ** 
- Cards        1    130839 3861823 3688.1   13.68 0.0002483 ***
- Limit        1    291132 4022117 3704.3   30.43  6.31e-08 ***
- Income       1   3035245 6766229 3912.4  317.27 < 2.2e-16 ***
- Student      1   4689629 8420613 3999.9  490.21 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Step:  AIC=3675.03
Balance ~ Income + Limit + Rating + Cards + Age + Gender + Student + 
    Utilization

              Df Sum of Sq     RSS    AIC F Value     Pr(F)    
- Gender       1      8668 3746548 3674.0    0.91 0.3415625    
<none>                     3737880 3675.0                      
- Age          1     35395 3773275 3676.8    3.70 0.0550578 .  
- Rating       1     47158 3785038 3678.0    4.93 0.0269210 *  
- Utilization  1     72879 3810759 3680.7    7.62 0.0060328 ** 
- Cards        1    135372 3873252 3687.3   14.16 0.0001936 ***
- Limit        1    303600 4041480 3704.3   31.76 3.347e-08 ***
- Income       1   3056864 6794744 3912.1  319.76 < 2.2e-16 ***
- Student      1   4770749 8508629 4002.1  499.04 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Step:  AIC=3673.95
Balance ~ Income + Limit + Rating + Cards + Age + Student + Utilization

              Df Sum of Sq     RSS    AIC F Value     Pr(F)    
<none>                     3746548 3674.0                      
- Age          1     35671 3782220 3675.7    3.73 0.0540906 .  
- Rating       1     46902 3793451 3676.9    4.91 0.0273163 *  
- Utilization  1     75071 3821620 3679.9    7.85 0.0053206 ** 
- Cards        1    136510 3883058 3686.3   14.28 0.0001817 ***
- Limit        1    303278 4049826 3703.1   31.73 3.383e-08 ***
- Income       1   3048196 6794744 3910.1  318.93 < 2.2e-16 ***
- Student      1   4765085 8511634 4000.2  498.57 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(mod.fs)

Call:
lm(formula = Balance ~ Rating + Utilization + Student + Income + 
    Limit + Cards + Age, data = Credit)

Residuals:
    Min      1Q  Median      3Q     Max 
-192.01  -77.03  -14.61   57.03  308.37 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -487.75633   24.70331 -19.745  < 2e-16 ***
Rating         1.06492    0.48072   2.215 0.027316 *  
Utilization  145.46321   51.90256   2.803 0.005321 ** 
StudentYes   403.99690   18.09320  22.329  < 2e-16 ***
Income        -6.92337    0.38768 -17.859  < 2e-16 ***
Limit          0.18229    0.03236   5.633 3.38e-08 ***
Cards         16.37034    4.33160   3.779 0.000182 ***
Age           -0.56062    0.29019  -1.932 0.054091 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 97.76 on 392 degrees of freedom
Multiple R-squared:  0.9556,    Adjusted R-squared:  0.9548 
F-statistic:  1205 on 7 and 392 DF,  p-value: < 2.2e-16
summary(mod.be) 

Call:
lm(formula = Balance ~ Income + Limit + Rating + Cards + Age + 
    Student + Utilization, data = Credit)

Residuals:
    Min      1Q  Median      3Q     Max 
-192.01  -77.03  -14.61   57.03  308.37 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -487.75633   24.70331 -19.745  < 2e-16 ***
Income        -6.92337    0.38768 -17.859  < 2e-16 ***
Limit          0.18229    0.03236   5.633 3.38e-08 ***
Rating         1.06492    0.48072   2.215 0.027316 *  
Cards         16.37034    4.33160   3.779 0.000182 ***
Age           -0.56062    0.29019  -1.932 0.054091 .  
StudentYes   403.99690   18.09320  22.329  < 2e-16 ***
Utilization  145.46321   51.90256   2.803 0.005321 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 97.76 on 392 degrees of freedom
Multiple R-squared:  0.9556,    Adjusted R-squared:  0.9548 
F-statistic:  1205 on 7 and 392 DF,  p-value: < 2.2e-16
car::vif(mod.be)
     Income       Limit      Rating       Cards         Age     Student 
   7.793671  232.919318  230.957276    1.472901    1.046060    1.233070 
Utilization 
   3.323397 
car::vif(mod.fs)
     Rating Utilization     Student      Income       Limit       Cards 
 230.957276    3.323397    1.233070    7.793671  232.919318    1.472901 
        Age 
   1.046060 

Note the very large variance inflation factors for Limit and Rating in both models: the two variables are nearly collinear, which is worth keeping in mind when interpreting their individual coefficients.

5 Shrinkage Methods

As an alternative to subset selection, we can fit a model containing all p predictors using a technique that constrains or regularizes the coefficient estimates, or equivalently, that shrinks the coefficient estimates towards zero. It turns out that shrinking the coefficient estimates can significantly reduce their variance. The two best-known techniques for shrinking the regression coefficients towards zero are ridge regression and the lasso.

5.1 Ridge Regression

The least squares fitting procedure estimates \(\beta_0, \beta_1, \ldots, \beta_p\) using the values that minimize Equation (5.1).

\[\begin{equation} \text{RSS} = \sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p}\beta_{j}x_{ij} \right)^2 \tag{5.1} \end{equation}\]

Ridge regression is very similar to least squares, except that the coefficients are estimated by minimizing a slightly different quantity. In particular, the ridge regression coefficient estimates \(\hat{\beta}^{R}\) are the values that minimize Equation (5.2).

\[\begin{equation} \sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p}\beta_{j}x_{ij} \right)^2 + \lambda \sum_{j=1}^{p}\beta_j^2 = \text{RSS} + \lambda \sum_{j=1}^{p}\beta_j^2 \tag{5.2} \end{equation}\]

where \(\lambda \geq 0\) is a tuning parameter, to be estimated separately. As with least squares, ridge regression seeks coefficient estimates that fit the data well, by making the RSS small. However, the second term \(\lambda \sum_{j=1}^{p}\beta_j^2\), called a shrinkage penalty, is small when \(\beta_1, \ldots, \beta_p\) are close to zero, and so has the effect of shrinking the estimates of \(\beta_j\) towards zero. The tuning parameter serves to control the relative impact of these two terms on the regression coefficient estimates. When \(\lambda = 0\), the penalty term has no effect, and ridge regression will produce the least squares estimates. However, as \(\lambda \rightarrow \infty\), the impact of the shrinkage penalty grows, and the ridge regression coefficient estimates will approach zero. Unlike least squares, which generates only one set of coefficient estimates, ridge regression will produce a different set of coefficient estimates \(\hat{\beta}^{R}_{\lambda}\), for each value of \(\lambda\).

Note that the shrinkage penalty is not applied to \(\beta_0\).
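A helpful special case (our aside, not from the text): when the columns of the design matrix are orthonormal, the minimizer of (5.2) has the closed form

\[\hat{\beta}^{R}_{\lambda, j} = \frac{\hat{\beta}_j}{1 + \lambda},\]

where \(\hat{\beta}_j\) is the least squares estimate. Every coefficient is shrunk toward zero by the same proportional factor, which makes the role of \(\lambda\) explicit.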

5.1.1 glmnet()

The glmnet() function has an alpha argument that determines what type of model is fit. If alpha = 0 then a ridge regression model is fit, and if alpha = 1 then a lasso model is fit. By default, the glmnet() function standardizes the variables so that they are on the same scale. To turn off this default setting, use the argument standardize = FALSE. Also by default, the glmnet() function performs ridge regression for an automatically selected range of \(\lambda\) values. However, below we have chosen to implement the function over a grid of values ranging from \(\lambda = 10^{10}\) to \(\lambda = 10^{-2}\), essentially covering the full range of scenarios from the null model containing only the intercept, to the least squares fit.

Associated with each value of \(\lambda\) is a vector of ridge regression coefficients, stored in a matrix that can be accessed by coef().

set.seed(1)
train <- sample(c(TRUE, FALSE), size = nrow(Credit), replace = TRUE)
test <- (!train)
library(glmnet)
x <- model.matrix(Balance ~ ., data = Credit)[, -1] # Remove (Intercept) column
y <- Credit$Balance
grid <- 10^seq(10, -2, length = 100)
ridge.mod <- glmnet(x[train,], y[train], alpha = 0, lambda = grid)
dim(coef(ridge.mod))
[1]  13 100

In this case, it is a \(13 \times 100\) matrix, with 13 rows (one for each predictor, plus an intercept) and 100 columns (one for each value of \(\lambda\)). We expect the coefficient estimates to be much smaller, in terms of \(\ell_2\) norm, when a large value of \(\lambda\) is used, as compared to when a small value of \(\lambda\) is used.

grid[c(25, 75)]
[1] 1.232847e+07 1.072267e+01

These are the coefficients when \(\lambda = 1.2328467\times 10^{7}\), along with their \(\ell_2\) norm.

ridge.mod$lambda[25]  # lambda
[1] 12328467
coef(ridge.mod)[, 25]
       (Intercept)             Income              Limit 
      5.647644e+02       2.092243e-04       6.603610e-06 
            Rating              Cards                Age 
      9.831725e-05       1.753870e-03      -2.066925e-05 
         Education        Gender Male         StudentYes 
     -7.532019e-05      -3.784780e-04       1.983148e-02 
        MarriedYes     EthnicityAsian EthnicityCaucasian 
     -1.443080e-03       2.488565e-03      -4.302792e-03 
       Utilization 
      5.293474e-02 
sqrt(sum(coef(ridge.mod)[-1, 25]^2)) # l_2 norm
[1] 0.05679298

In contrast, here are the coefficients when \(\lambda = 10.7226722\), along with their \(\ell_2\) norm. Note the much larger \(\ell_2\) norm of the coefficients associated with this smaller value of \(\lambda\).

ridge.mod$lambda[75]  # lambda
[1] 10.72267
coef(ridge.mod)[, 75]
       (Intercept)             Income              Limit 
     -421.28868705        -5.60611656         0.12599890 
            Rating              Cards                Age 
        1.58458657        11.45725580        -0.93602041 
         Education        Gender Male         StudentYes 
        0.02549979         0.60629997       392.94706116 
        MarriedYes     EthnicityAsian EthnicityCaucasian 
       -8.89359468         2.84991438       -23.31125983 
       Utilization 
      259.20170982 
coef(ridge.mod)[-1, 75]
            Income              Limit             Rating 
       -5.60611656         0.12599890         1.58458657 
             Cards                Age          Education 
       11.45725580        -0.93602041         0.02549979 
       Gender Male         StudentYes         MarriedYes 
        0.60629997       392.94706116        -8.89359468 
    EthnicityAsian EthnicityCaucasian        Utilization 
        2.84991438       -23.31125983       259.20170982 
sqrt(sum(coef(ridge.mod)[-1, 75]^2)) # l_2 norm
[1] 471.5825
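To see this shrinkage across the entire path at once, we can compute the \(\ell_2\) norm for every column of the coefficient matrix (a small sketch of ours):

l2.norms <- apply(coef(ridge.mod)[-1, ], 2, function(b) sqrt(sum(b^2)))
plot(log10(ridge.mod$lambda), l2.norms, type = "l",
     xlab = "log10(lambda)", ylab = "l2 norm of coefficients")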

We can use the predict() function for a number of purposes. For instance, we can obtain the ridge regression coefficients for a new value of \(\lambda\), say 50:

predict(ridge.mod, s = 50, type = "coefficients")[1:13, ]
       (Intercept)             Income              Limit 
     -353.14194183        -2.55438465         0.09435968 
            Rating              Cards                Age 
        1.34000809        11.24164536        -1.02204597 
         Education        Gender Male         StudentYes 
        0.75087290        -2.86712266       315.27778920 
        MarriedYes     EthnicityAsian EthnicityCaucasian 
      -15.31286768        -0.43648137       -24.46058752 
       Utilization 
      547.77292093 
predict(ridge.mod, s = 50, type = "coefficients")
13 x 1 sparse Matrix of class "dgCMatrix"
                               1
(Intercept)        -353.14194183
Income               -2.55438465
Limit                 0.09435968
Rating                1.34000809
Cards                11.24164536
Age                  -1.02204597
Education             0.75087290
Gender Male          -2.86712266
StudentYes          315.27778920
MarriedYes          -15.31286768
EthnicityAsian       -0.43648137
EthnicityCaucasian  -24.46058752
Utilization         547.77292093
sqrt(sum(predict(ridge.mod, s = 50, type = "coefficients")[2:13, ]^2))  # l_2 norm
[1] 632.7976

We can visualize the coefficients (for ridge regression) using the plot() function as shown in Figure 5.1.

plot(ridge.mod, xvar = "lambda", label = TRUE)

Figure 5.1: Coefficients for ridge regression model

Each curve corresponds to a variable and shows the path of its coefficient as \(\lambda\) varies (here plotted against \(\log \lambda\)).

The function glmnet() returns a sequence of models for the users to choose from. In many cases, users may prefer the software to select one of them. Cross-validation is perhaps the simplest and most widely used method for that task.

cv.glmnet() is the main function for cross-validation in the glmnet package. It returns a "cv.glmnet" object (cv.out below), a list with all the ingredients of the cross-validation fit. The object can be plotted as shown below. Note that the cv.glmnet() function performs ten-fold cross-validation by default. To change the number of folds, use the nfolds argument.
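For example (our own snippet, not run below), five-fold cross-validation would be requested with:

cv.out5 <- cv.glmnet(x[train, ], y[train], alpha = 0, nfolds = 5)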

cv.out <- cv.glmnet(x[train,], y[train], alpha = 0)
plot(cv.out)

The value of \(\lambda\) that results in the smallest cross-validation error is 43.0994974. What is the test MSE associated with this value of \(\lambda\)?

bestlambda <- cv.out$lambda.min
bestlambda
[1] 43.0995
ridge.pred <- predict(ridge.mod, s = bestlambda, newx = x[test, ])
RE <- mean((y[test] - ridge.pred)^2)  # MSPE
RE
[1] 13245.35

Finally, we refit our ridge regression on the full data set, using the value of \(\lambda\) chosen by cross-validation, and examine the coefficient estimates.

final <- glmnet(x, y, alpha = 0)
predict(final, type = "coefficients", s = bestlambda)
13 x 1 sparse Matrix of class "dgCMatrix"
                              1
(Intercept)        -398.8639153
Income               -2.2284220
Limit                 0.0910371
Rating                1.3277319
Cards                 8.9191408
Age                  -0.5766572
Education             0.2310996
Gender Male           1.4846991
StudentYes          295.3787993
MarriedYes          -14.6877715
EthnicityAsian        6.2207868
EthnicityCaucasian    3.4660732
Utilization         630.3674773

As expected, none of the coefficients are zero—ridge regression does not perform variable selection!
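A quick confirmation (our own check; for a ridge fit this should report that none of the 12 predictor coefficients is exactly zero):

ridge.coef <- predict(final, type = "coefficients", s = bestlambda)[-1, ]
sum(ridge.coef == 0)  # expect 0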

5.2 Compare to OLS model

For contrast with the penalized fits above, here is an unpenalized ordinary least squares fit, illustrated with the built-in mtcars data, along with its in-sample root mean squared error.

ols <- lm(hp ~ ., data = mtcars)
summary(ols)

Call:
lm(formula = hp ~ ., data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-38.681 -15.558   0.799  18.106  34.718 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  79.0484   184.5041   0.428  0.67270   
mpg          -2.0631     2.0906  -0.987  0.33496   
cyl           8.2037    10.0861   0.813  0.42513   
disp          0.4390     0.1492   2.942  0.00778 **
drat         -4.6185    16.0829  -0.287  0.77680   
wt          -27.6600    19.2704  -1.435  0.16591   
qsec         -1.7844     7.3639  -0.242  0.81089   
vs           25.8129    19.8512   1.300  0.20758   
am            9.4863    20.7599   0.457  0.65240   
gear          7.2164    14.6160   0.494  0.62662   
carb         18.7487     7.0288   2.667  0.01441 * 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 25.97 on 21 degrees of freedom
Multiple R-squared:  0.9028,    Adjusted R-squared:  0.8565 
F-statistic:  19.5 on 10 and 21 DF,  p-value: 1.898e-08
SMSPE <- sqrt(mean((mtcars$hp - predict(ols))^2))
SMSPE
[1] 21.03922

6 Lasso Regression

In order to fit a lasso model, we once again use the glmnet() function; however, this time we use the argument alpha = 1.

x <- model.matrix(Balance ~ ., data = Credit)[, -1]
y <- Credit$Balance
grid <- 10^seq(10, -2, length = 100)
lasso.mod <- glmnet(x[train,], y[train], lambda = grid, alpha = 1)
dim(coef(lasso.mod))
[1]  13 100
plot(lasso.mod, xvar = "lambda", label = TRUE)

Figure 6.1: Coefficients plot for lasso model

We can see from Figure 6.1 that some of the coefficients are zero depending on the choice of the tuning parameter (\(\lambda\)). We now perform cross-validation and compute the associated test error.

set.seed(321)                                           # set seed for reproducibility
cv.out <- cv.glmnet(x[train,], y[train], alpha = 1)
plot(cv.out)

bestlambda <- cv.out$lambda.min
bestlambda
[1] 2.145072
lasso.pred <- predict(lasso.mod, s = bestlambda, newx = x[test, ])
LE <- mean((y[test] - lasso.pred)^2)  # MSPE
LE
[1] 11386.52

Note that the lasso \(\text{MSPE}_{\text{test}} = 1.138652\times 10^{4}\), which is smaller than the ridge regression estimate using the best \(\lambda\) from cross-validation (\(1.324535\times 10^{4}\)). In this scenario, not only is \(\text{MSPE}_{\text{test}}\) smaller, but one of the coefficient estimates (EthnicityCaucasian) is exactly zero. Setting \(\lambda = 30\), eight of the 13 coefficient estimates are exactly zero.

final <- glmnet(x, y, alpha = 1, lambda = grid)
predict(final, type = "coefficients", s = bestlambda)[1:13, ]
       (Intercept)             Income              Limit 
     -480.48970570        -6.16429692         0.15291614 
            Rating              Cards                Age 
        1.33146243        12.48168277        -0.45170387 
         Education        Gender Male         StudentYes 
       -0.04740626         3.82109631       382.75326999 
        MarriedYes     EthnicityAsian EthnicityCaucasian 
       -5.92593599         3.43173753         0.00000000 
       Utilization 
      224.94976293 
predict(final, type = "coefficients", s = 30)
13 x 1 sparse Matrix of class "dgCMatrix"
                               1
(Intercept)        -359.24710094
Income                .         
Limit                 0.05209682
Rating                1.38295582
Cards                 .         
Age                   .         
Education             .         
Gender Male           .         
StudentYes          195.09880232
MarriedYes            .         
EthnicityAsian        .         
EthnicityCaucasian    .         
Utilization         808.04838639

7 Changing the problem now

7.1 Response is now Rating

  • Create a model that predicts Rating with Limit, Cards, Married, Student, and Education as features.
mod <- lm(Rating ~ Limit + Cards + Married + Student + Education, data = Credit)
summary(mod)

Call:
lm(formula = Rating ~ Limit + Cards + Married + Student + Education, 
    data = Credit)

Residuals:
     Min       1Q   Median       3Q      Max 
-22.3855  -6.9708  -0.8064   6.4644  26.0040 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 26.2320857  2.8284022   9.275   <2e-16 ***
Limit        0.0667736  0.0002212 301.902   <2e-16 ***
Cards        4.8520572  0.3725931  13.022   <2e-16 ***
MarriedYes   2.1732013  1.0509888   2.068   0.0393 *  
StudentYes   3.0880657  1.7086441   1.807   0.0715 .  
Education   -0.2598468  0.1641332  -1.583   0.1142    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10.19 on 394 degrees of freedom
Multiple R-squared:  0.9957,    Adjusted R-squared:  0.9957 
F-statistic: 1.832e+04 on 5 and 394 DF,  p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(mod)

par(mfrow = c(1, 1))
car::residualPlots(mod)

           Test stat Pr(>|t|)
Limit          2.157    0.032
Cards         -2.286    0.023
Married           NA       NA
Student           NA       NA
Education      1.174    0.241
Tukey test     2.115    0.034
The significant curvature tests for Limit and Cards (and the significant Tukey test) suggest adding quadratic terms:

modN <- lm(Rating ~ poly(Limit, 2, raw = TRUE) + poly(Cards, 2, raw = TRUE) + Married + Student + Education, data = Credit)
summary(modN)

Call:
lm(formula = Rating ~ poly(Limit, 2, raw = TRUE) + poly(Cards, 
    2, raw = TRUE) + Married + Student + Education, data = Credit)

Residuals:
     Min       1Q   Median       3Q      Max 
-27.8814  -6.8317  -0.3358   6.5136  25.9925 

Coefficients:
                              Estimate Std. Error t value Pr(>|t|)    
(Intercept)                  2.579e+01  3.816e+00   6.760 5.01e-11 ***
poly(Limit, 2, raw = TRUE)1  6.529e-02  7.506e-04  86.984  < 2e-16 ***
poly(Limit, 2, raw = TRUE)2  1.320e-07  6.297e-08   2.096   0.0368 *  
poly(Cards, 2, raw = TRUE)1  7.615e+00  1.301e+00   5.855 1.01e-08 ***
poly(Cards, 2, raw = TRUE)2 -3.972e-01  1.783e-01  -2.228   0.0264 *  
MarriedYes                   2.295e+00  1.043e+00   2.199   0.0285 *  
StudentYes                   3.159e+00  1.693e+00   1.866   0.0628 .  
Education                   -2.774e-01  1.627e-01  -1.705   0.0889 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10.09 on 392 degrees of freedom
Multiple R-squared:  0.9958,    Adjusted R-squared:  0.9957 
F-statistic: 1.334e+04 on 7 and 392 DF,  p-value: < 2.2e-16
car::residualPlots(modN)

                           Test stat Pr(>|t|)
poly(Limit, 2, raw = TRUE)        NA       NA
poly(Cards, 2, raw = TRUE)        NA       NA
Married                           NA       NA
Student                           NA       NA
Education                      1.271    0.204
Tukey test                    -0.782    0.434
car::vif(modN)
                               GVIF Df GVIF^(1/(2*Df))
poly(Limit, 2, raw = TRUE) 1.006987  2        1.001742
poly(Cards, 2, raw = TRUE) 1.011571  2        1.002880
Married                    1.014970  1        1.007457
Student                    1.012868  1        1.006413
Education                  1.012733  1        1.006346
  • Use your model to predict the Rating for an individual that has a credit card limit of $6,000, has 4 credit cards, is married, is not a student, and has an undergraduate degree (Education = 16).

  • Use your model to predict the Rating for an individual that has a credit card limit of $12,000, has 2 credit cards, is married, is not a student, and has an eighth grade education (Education = 8).

predict(modN, newdata = data.frame(Limit = 6000, Cards = 4, Married = "Yes", Student = "No", Education = 16))
       1 
444.2526 
### Should be the same as:
coef(modN)[1] + coef(modN)[2]*6000 + coef(modN)[3]*6000^2 + coef(modN)[4]*4 + coef(modN)[5]*4^2 + coef(modN)[6]*1 + coef(modN)[7]*0 + coef(modN)[8]*16
(Intercept) 
   444.2526 
predict(modN, newdata = data.frame(Limit = 12000, Cards = 2, Married = "Yes", Student = "No", Education = 8))
       1 
842.0091 

References

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning: With Applications in R. Springer Texts in Statistics 103. New York: Springer.