Here we apply the best subset selection approach to the Hitters data. We wish to predict a baseball player’s Salary on the basis of various statistics associated with performance in the previous year.
First of all, we note that the Salary variable is missing for some of the players. The is.na()
function can be used to identify the missing observations. It returns a vector of the same length as the input vector, with a TRUE
for any elements that are missing, and a FALSE
for non-missing elements. The sum()
function can then be used to count all of the missing elements.
library(ISLR)
names(Hitters)
[1] "AtBat" "Hits" "HmRun" "Runs" "RBI" "Walks"
[7] "Years" "CAtBat" "CHits" "CHmRun" "CRuns" "CRBI"
[13] "CWalks" "League" "Division" "PutOuts" "Assists" "Errors"
[19] "Salary" "NewLeague"
dim(Hitters)
[1] 322 20
sum(is.na(Hitters$Salary))
[1] 59
Hence we see that Salary
is missing for 59 players. The na.omit()
function
removes all of the rows that have missing values in any variable.
Hitters <- na.omit(Hitters)
dim(Hitters)
[1] 263 20
sum(is.na(Hitters))
[1] 0
The regsubsets()
function (part of the leaps
package) performs best subset selection by identifying the best model that contains a given number of predictors, where best is quantified using RSS. The syntax is the same as for lm()
. The summary()
command outputs the best set of variables for each model size.
library(leaps)
regfit_full <- regsubsets(Salary ~ ., data = Hitters)
summary(regfit_full)
Subset selection object
Call: regsubsets.formula(Salary ~ ., data = Hitters)
19 Variables (and intercept)
Forced in Forced out
AtBat FALSE FALSE
Hits FALSE FALSE
HmRun FALSE FALSE
Runs FALSE FALSE
RBI FALSE FALSE
Walks FALSE FALSE
Years FALSE FALSE
CAtBat FALSE FALSE
CHits FALSE FALSE
CHmRun FALSE FALSE
CRuns FALSE FALSE
CRBI FALSE FALSE
CWalks FALSE FALSE
LeagueN FALSE FALSE
DivisionW FALSE FALSE
PutOuts FALSE FALSE
Assists FALSE FALSE
Errors FALSE FALSE
NewLeagueN FALSE FALSE
1 subsets of each size up to 8
Selection Algorithm: exhaustive
AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI
1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " "*"
2 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
3 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
4 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
5 ( 1 ) "*" "*" " " " " " " " " " " " " " " " " " " "*"
6 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " " " "*"
7 ( 1 ) " " "*" " " " " " " "*" " " "*" "*" "*" " " " "
8 ( 1 ) "*" "*" " " " " " " "*" " " " " " " "*" "*" " "
CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
1 ( 1 ) " " " " " " " " " " " " " "
2 ( 1 ) " " " " " " " " " " " " " "
3 ( 1 ) " " " " " " "*" " " " " " "
4 ( 1 ) " " " " "*" "*" " " " " " "
5 ( 1 ) " " " " "*" "*" " " " " " "
6 ( 1 ) " " " " "*" "*" " " " " " "
7 ( 1 ) " " " " "*" "*" " " " " " "
8 ( 1 ) "*" " " "*" "*" " " " " " "
An asterisk indicates that a given variable is included in the corresponding model. For instance, this output indicates that the best two-variable model contains only Hits
and CRBI
. By default, regsubsets()
only reports results up to the best eight-variable model. But the nvmax
option can be used in order to return as many variables as are desired. Here we fit up to a 19-variable model.
regfit_full <- regsubsets(Salary ~ ., data = Hitters, nvmax = 19)
reg_summary <- summary(regfit_full)
The summary()
function also returns \(R^2\), RSS, adjusted \(R^2\), \(C_p\), and BIC. We can examine these to try to select the best overall model.
names(reg_summary)
[1] "which" "rsq" "rss" "adjr2" "cp" "bic" "outmat" "obj"
For instance, we see that the \(R^2\) statistic increases from 32.15%, when only one variable is included in the model, to 54.61%, when all variables are included. As expected, the \(R^2\) statistic increases monotonically as more variables are included.
reg_summary$rsq
[1] 0.3214501 0.4252237 0.4514294 0.4754067 0.4908036 0.5087146 0.5141227
[8] 0.5285569 0.5346124 0.5404950 0.5426153 0.5436302 0.5444570 0.5452164
[15] 0.5454692 0.5457656 0.5459518 0.5460945 0.5461159
Plotting RSS, adjusted \(R^2\), \(C_p\), and BIC for all of the models at once will help us decide which model to select. Note the type = "l"
option tells R
to connect the plotted points with lines. The points()
command works like the plot()
command, except that it puts points on a plot that has already been created, instead of creating a new plot. The which.max()
function can be used to identify the location of the maximum point of a vector. We will now plot a red dot to indicate the model with the largest adjusted \(R^2\) statistic. In a similar fashion we can plot the \(C_p\) and BIC statistics, and indicate the models with the smallest statistic using which.min()
.
par(mfrow = c(2, 2))
plot(reg_summary$rss ,xlab = "Number of Variables", ylab = "RSS", type = "l")
plot(reg_summary$adjr2 ,xlab = "Number of Variables", ylab = "Adjusted RSq", type = "l")
which.max(reg_summary$adjr2)
[1] 11
points(11, reg_summary$adjr2[11], col = "red", cex = 2, pch = 20)
plot(reg_summary$cp, xlab = "Number of Variables", ylab = "Cp", type = "l")
which.min(reg_summary$cp)
[1] 10
points (which.min(reg_summary$cp), reg_summary$cp[which.min(reg_summary$cp)],
col = "red", cex = 2, pch = 20)
which.min(reg_summary$bic)
[1] 6
plot(reg_summary$bic, xlab = "Number of Variables ", ylab = "BIC", type = "l")
points(which.min(reg_summary$bic), reg_summary$bic[which.min(reg_summary$bic)],
col = "red", cex = 2, pch = 20)
par(mfrow = c(1, 1))
The regsubsets()
function has a built-in plot()
command which can be used to display the selected variables for the best model with a given number of predictors, ranked according to the BIC, \(C_p\), adjusted \(R^2\), or AIC. To find out more about this function, type ?plot.regsubsets
.
par(mfrow = c(2, 2))
plot(regfit_full, scale = "r2")
plot(regfit_full, scale = "adjr2")
plot(regfit_full, scale = "Cp")
plot(regfit_full, scale = "bic")
par(mfrow = c(1, 1))
The top row of each plot contains a black square for each variable selected according to the optimal model associated with that statistic. For instance, we see that several models share a BIC close to −150. However, the model with the lowest BIC is the six-variable model that contains only AtBat
, Hits
, Walks
, CRBI
, DivisionW
, and PutOuts
. We can use the coef()
function
to see the coefficient estimates associated with this model.
coef(regfit_full, 6)
(Intercept) AtBat Hits Walks CRBI DivisionW
91.5117981 -1.8685892 7.6043976 3.6976468 0.6430169 -122.9515338
PutOuts
0.2643076
We can also use the regsubsets()
function to perform forward stepwise or backward stepwise selection, using the argument method = "forward"
or method = "backward"
.
regfitB_fwd <- regsubsets(Salary ~ ., data = Hitters , nvmax = 19, method = "forward")
summary(regfitB_fwd)
Subset selection object
Call: regsubsets.formula(Salary ~ ., data = Hitters, nvmax = 19, method = "forward")
19 Variables (and intercept)
Forced in Forced out
AtBat FALSE FALSE
Hits FALSE FALSE
HmRun FALSE FALSE
Runs FALSE FALSE
RBI FALSE FALSE
Walks FALSE FALSE
Years FALSE FALSE
CAtBat FALSE FALSE
CHits FALSE FALSE
CHmRun FALSE FALSE
CRuns FALSE FALSE
CRBI FALSE FALSE
CWalks FALSE FALSE
LeagueN FALSE FALSE
DivisionW FALSE FALSE
PutOuts FALSE FALSE
Assists FALSE FALSE
Errors FALSE FALSE
NewLeagueN FALSE FALSE
1 subsets of each size up to 19
Selection Algorithm: forward
AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI
1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " "*"
2 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
3 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
4 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " " "*"
5 ( 1 ) "*" "*" " " " " " " " " " " " " " " " " " " "*"
6 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " " " "*"
7 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " " " "*"
8 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " "*" "*"
9 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*" "*"
10 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*" "*"
11 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*" "*"
12 ( 1 ) "*" "*" " " "*" " " "*" " " "*" " " " " "*" "*"
13 ( 1 ) "*" "*" " " "*" " " "*" " " "*" " " " " "*" "*"
14 ( 1 ) "*" "*" "*" "*" " " "*" " " "*" " " " " "*" "*"
15 ( 1 ) "*" "*" "*" "*" " " "*" " " "*" "*" " " "*" "*"
16 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" " " "*" "*"
17 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" " " "*" "*"
18 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" " " "*" "*"
19 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
1 ( 1 ) " " " " " " " " " " " " " "
2 ( 1 ) " " " " " " " " " " " " " "
3 ( 1 ) " " " " " " "*" " " " " " "
4 ( 1 ) " " " " "*" "*" " " " " " "
5 ( 1 ) " " " " "*" "*" " " " " " "
6 ( 1 ) " " " " "*" "*" " " " " " "
7 ( 1 ) "*" " " "*" "*" " " " " " "
8 ( 1 ) "*" " " "*" "*" " " " " " "
9 ( 1 ) "*" " " "*" "*" " " " " " "
10 ( 1 ) "*" " " "*" "*" "*" " " " "
11 ( 1 ) "*" "*" "*" "*" "*" " " " "
12 ( 1 ) "*" "*" "*" "*" "*" " " " "
13 ( 1 ) "*" "*" "*" "*" "*" "*" " "
14 ( 1 ) "*" "*" "*" "*" "*" "*" " "
15 ( 1 ) "*" "*" "*" "*" "*" "*" " "
16 ( 1 ) "*" "*" "*" "*" "*" "*" " "
17 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
18 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
19 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
regfitB_bwd <- regsubsets(Salary ~., data = Hitters , nvmax = 19, method = "backward")
summary(regfitB_bwd)
Subset selection object
Call: regsubsets.formula(Salary ~ ., data = Hitters, nvmax = 19, method = "backward")
19 Variables (and intercept)
Forced in Forced out
AtBat FALSE FALSE
Hits FALSE FALSE
HmRun FALSE FALSE
Runs FALSE FALSE
RBI FALSE FALSE
Walks FALSE FALSE
Years FALSE FALSE
CAtBat FALSE FALSE
CHits FALSE FALSE
CHmRun FALSE FALSE
CRuns FALSE FALSE
CRBI FALSE FALSE
CWalks FALSE FALSE
LeagueN FALSE FALSE
DivisionW FALSE FALSE
PutOuts FALSE FALSE
Assists FALSE FALSE
Errors FALSE FALSE
NewLeagueN FALSE FALSE
1 subsets of each size up to 19
Selection Algorithm: backward
AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns CRBI
1 ( 1 ) " " " " " " " " " " " " " " " " " " " " "*" " "
2 ( 1 ) " " "*" " " " " " " " " " " " " " " " " "*" " "
3 ( 1 ) " " "*" " " " " " " " " " " " " " " " " "*" " "
4 ( 1 ) "*" "*" " " " " " " " " " " " " " " " " "*" " "
5 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " "*" " "
6 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " "*" " "
7 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " "*" " "
8 ( 1 ) "*" "*" " " " " " " "*" " " " " " " " " "*" "*"
9 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*" "*"
10 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*" "*"
11 ( 1 ) "*" "*" " " " " " " "*" " " "*" " " " " "*" "*"
12 ( 1 ) "*" "*" " " "*" " " "*" " " "*" " " " " "*" "*"
13 ( 1 ) "*" "*" " " "*" " " "*" " " "*" " " " " "*" "*"
14 ( 1 ) "*" "*" "*" "*" " " "*" " " "*" " " " " "*" "*"
15 ( 1 ) "*" "*" "*" "*" " " "*" " " "*" "*" " " "*" "*"
16 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" " " "*" "*"
17 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" " " "*" "*"
18 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" " " "*" "*"
19 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
1 ( 1 ) " " " " " " " " " " " " " "
2 ( 1 ) " " " " " " " " " " " " " "
3 ( 1 ) " " " " " " "*" " " " " " "
4 ( 1 ) " " " " " " "*" " " " " " "
5 ( 1 ) " " " " " " "*" " " " " " "
6 ( 1 ) " " " " "*" "*" " " " " " "
7 ( 1 ) "*" " " "*" "*" " " " " " "
8 ( 1 ) "*" " " "*" "*" " " " " " "
9 ( 1 ) "*" " " "*" "*" " " " " " "
10 ( 1 ) "*" " " "*" "*" "*" " " " "
11 ( 1 ) "*" "*" "*" "*" "*" " " " "
12 ( 1 ) "*" "*" "*" "*" "*" " " " "
13 ( 1 ) "*" "*" "*" "*" "*" "*" " "
14 ( 1 ) "*" "*" "*" "*" "*" "*" " "
15 ( 1 ) "*" "*" "*" "*" "*" "*" " "
16 ( 1 ) "*" "*" "*" "*" "*" "*" " "
17 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
18 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
19 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
For instance, we see that using forward stepwise selection, the best one variable model contains only CRBI
, and the best two-variable model additionally includes Hits
. For this data, the best one-variable through six variable models are each identical for best subset and forward selection. However, the best seven-variable models identified by forward stepwise selection, backward stepwise selection, and best subset selection are different.
coef(regfit_full, 7)
(Intercept) Hits Walks CAtBat CHits CHmRun
79.4509472 1.2833513 3.2274264 -0.3752350 1.4957073 1.4420538
DivisionW PutOuts
-129.9866432 0.2366813
coef(regfitB_fwd, 7)
(Intercept) AtBat Hits Walks CRBI CWalks
109.7873062 -1.9588851 7.4498772 4.9131401 0.8537622 -0.3053070
DivisionW PutOuts
-127.1223928 0.2533404
coef(regfitB_bwd, 7)
(Intercept) AtBat Hits Walks CRuns CWalks
105.6487488 -1.9762838 6.7574914 6.0558691 1.1293095 -0.7163346
DivisionW PutOuts
-116.1692169 0.3028847
We just saw that it is possible to choose among a set of models of different sizes using \(C_p\), BIC, and adjusted \(R^2\). We will now consider how to do this using the validation set and cross-validation approaches. In order for these approaches to yield accurate estimates of the test error, we must use only the training observations to perform all aspects of model-fitting—including variable selection. Therefore, the determination of which model of a given size is best must be made using only the training observations. This point is subtle but important. If the full data set is used to perform the best subset selection step, the validation set errors and cross-validation errors that we obtain will not be accurate estimates of the test error.
In order to use the validation set approach, we begin by splitting the observations into a training set and a test set. We do this by creating a random vector, train
, of elements equal to TRUE
if the corresponding observation is in the training set, and FALSE
otherwise. The vector test has a TRUE
if the observation is in the test set, and a FALSE
otherwise. Note the !
in the command to create test causes TRUE
s to be switched to FALSE
s and vice versa. We also set a random seed so that the user will obtain the same
training set/test set split.
set.seed(135)
train <- sample(c(TRUE, FALSE), size = nrow(Hitters), rep = TRUE)
test <- (!train)
Now, we apply regsubsets()
to the training set in order to perform best subset selection.
regfit_best <- regsubsets(Salary ~ ., data = Hitters[train ,], nvmax = 19)
Notice that we subset the Hitters
data frame directly in the call in order to access only the training subset of the data, using the expression Hitters[train,]
. We now compute the validation set error for the best model of each model size. We first make a model matrix from the test data.
test_mat <- model.matrix(Salary~., data = Hitters[test,])
The model.matrix()
function is used in many regression packages for building an “X” matrix from data. Now we run a loop, and for each size i
, we extract the coefficients from regfit.best
for the best model of that size, multiply them into the appropriate columns of the test model matrix to form the predictions, and compute the test MSE.
val_errors <- rep(NA , 19)
for(i in 1:19){
coefi = coef(regfit_best, id = i)
pred = test_mat[ ,names(coefi)]%*%coefi
val_errors[i] = mean((Hitters$Salary[test] - pred)^2)
}
val_errors
[1] 185212.6 162304.4 161602.8 154955.2 149492.5 148920.3 148320.6 143702.5
[9] 145153.8 146245.1 147595.8 148170.8 148943.6 148352.6 147640.9 147697.1
[17] 147768.8 147705.8 147529.7
We find the best model is the one that contains 8 variables.
which.min(val_errors)
[1] 8
coef(regfit_best, which.min(val_errors))
(Intercept) AtBat Hits Walks CHits CHmRun
165.0154261 -2.4218326 8.2240437 5.8201915 0.4353578 1.9097902
CWalks DivisionW PutOuts
-0.9216821 -83.7618566 0.1954191
coef(regfit_full, which.min(val_errors))
(Intercept) AtBat Hits Walks CHmRun CRuns
130.9691577 -2.1731903 7.3582935 6.0037597 1.2339718 0.9651349
CWalks DivisionW PutOuts
-0.8323788 -117.9657795 0.2908431
This was a little tedious, partly because there is no predict()
method
for regsubsets()
. Since we will be using this function again, we can capture
our steps above and write our own predict method.
predict.regsubsets <- function (object, newdata, id,...){
form = as.formula(object$call[[2]])
mat = model.matrix(form, newdata)
coefi = coef(object, id = id)
xvars = names(coefi)
mat[ ,xvars]%*%coefi
}
Our function pretty much mimics what we did above. The only complex part is how we extracted the formula used in the call to regsubsets()
. We demonstrate how we use this function below, when we do cross-validation.
Finally, we perform best subset selection on the full data set, and select the best 8-variable model. It is important that we make use of the full data set in order to obtain more accurate coefficient estimates. Note that we perform best subset selection on the full data set and select the best 8 variable model, rather than simply using the variables that were obtained from the training set, because the best 8-variable model on the full data set may differ from the corresponding model on the training set.
In fact, we see that the best 8-variable model on the full data set has a different set of variables than the best 8-variable model on the training set.
coef(regfit_full, which.min(val_errors)) # full data set
(Intercept) AtBat Hits Walks CHmRun CRuns
130.9691577 -2.1731903 7.3582935 6.0037597 1.2339718 0.9651349
CWalks DivisionW PutOuts
-0.8323788 -117.9657795 0.2908431
coef(regfit_best, which.min(val_errors)) # training data set
(Intercept) AtBat Hits Walks CHits CHmRun
165.0154261 -2.4218326 8.2240437 5.8201915 0.4353578 1.9097902
CWalks DivisionW PutOuts
-0.9216821 -83.7618566 0.1954191
We now try to choose among the models of different sizes using cross validation. This approach is somewhat involved, as we must perform best subset selection within each of the k training sets. Despite this, we see that with its clever subsetting syntax, R
makes this job quite easy. First, we create a vector that allocates each observation to one of k = 10
folds, and we create a matrix in which we will store the results.
k <- 10
set.seed(1)
folds <- sample(1:k, size = nrow(Hitters), replace = TRUE)
cv_errors <- matrix(NA, k, 19, dimnames = list(NULL, paste(1:19)))
Now we write a for loop that performs cross-validation. In the \(j^{th}\) fold, the elements of folds
that equal j
are in the test set, and the remainder are in the training set. We make our predictions for each model size (using our new predict()
method), compute the test errors on the appropriate subset, and store them in the appropriate slot in the matrix cv.errors
.
for(j in 1:k){
best_fit = regsubsets(Salary ~ ., data = Hitters[folds!=j, ], nvmax=19)
for(i in 1:19){
pred = predict(best_fit, Hitters[folds==j, ], id = i)
cv_errors[j, i] = mean((Hitters$Salary[folds==j] - pred)^2)
}
}
This has given us a \(10 \times 19\) matrix, of which the \((i, j)\)th element corresponds to the test MSE for the ith cross-validation fold for the best \(j\)-variable model. We use the apply()
function to average over the columns of this apply() matrix in order to obtain a vector for which the \(j\)th element is the cross validation error for the \(j\)-variable model.
mean_cv_errors <- apply(cv_errors, 2, mean)
mean_cv_errors
1 2 3 4 5 6 7 8
149821.1 130922.0 139127.0 131028.8 131050.2 119538.6 124286.1 113580.0
9 10 11 12 13 14 15 16
115556.5 112216.7 113251.2 115755.9 117820.8 119481.2 120121.6 120074.3
17 18 19
120084.8 120085.8 120403.5
which.min(mean_cv_errors)
10
10
plot(mean_cv_errors, type = "b")
We see that cross-validation selects an 10-variable model. We now perform best subset selection on the full data set in order to obtain the 10-variable model.
reg_best <- regsubsets(Salary~., data = Hitters, nvmax = 19)
coef(reg_best, which.min(mean_cv_errors))
(Intercept) AtBat Hits Walks CAtBat CRuns
162.5354420 -2.1686501 6.9180175 5.7732246 -0.1300798 1.4082490
CRBI CWalks DivisionW PutOuts Assists
0.7743122 -0.8308264 -112.3800575 0.2973726 0.2831680
caret
library(caret)
myControl <- trainControl(## 10-fold CV
method = "cv",
number = 10)
## repeated ten times
## repeats = 10)
set.seed(14)
regfit_seq <- train(Salary ~ ., data = Hitters,
method = "leapSeq",
trControl = myControl,
# tuneLength = 15
tuneGrid = expand.grid(nvmax = 2:15)
)
regfit_seq
Linear Regression with Stepwise Selection
263 samples
19 predictor
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 236, 236, 237, 237, 237, 237, ...
Resampling results across tuning parameters:
nvmax RMSE Rsquared MAE
2 345.4077 0.4511946 233.0780
3 348.9725 0.4397596 234.8597
4 364.8361 0.3798861 257.4740
5 349.5303 0.3981106 251.0870
6 345.0738 0.4393867 245.0293
7 332.9188 0.4848242 231.0050
8 335.2450 0.4773432 231.0535
9 321.9239 0.5076196 230.1277
10 330.4658 0.4873716 233.8023
11 329.1011 0.4900234 233.1403
12 338.0410 0.4623885 238.5765
13 334.7892 0.4799414 234.2109
14 337.2009 0.4740761 235.2804
15 337.8476 0.4691908 237.7640
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was nvmax = 9.
regfit_seq$results
nvmax RMSE Rsquared MAE RMSESD RsquaredSD MAESD
1 2 345.4077 0.4511946 233.0780 74.98867 0.2124110 34.40733
2 3 348.9725 0.4397596 234.8597 67.00939 0.1931219 26.75132
3 4 364.8361 0.3798861 257.4740 76.78370 0.2494808 46.07572
4 5 349.5303 0.3981106 251.0870 64.89880 0.1936790 32.21371
5 6 345.0738 0.4393867 245.0293 67.73890 0.2061439 40.72797
6 7 332.9188 0.4848242 231.0050 61.59632 0.1820805 29.07269
7 8 335.2450 0.4773432 231.0535 64.52943 0.1826086 31.84854
8 9 321.9239 0.5076196 230.1277 65.27428 0.1795997 30.87293
9 10 330.4658 0.4873716 233.8023 70.32401 0.1876931 32.88795
10 11 329.1011 0.4900234 233.1403 67.83144 0.1741916 31.83542
11 12 338.0410 0.4623885 238.5765 74.32991 0.1848022 37.51587
12 13 334.7892 0.4799414 234.2109 69.53887 0.1910405 37.06923
13 14 337.2009 0.4740761 235.2804 77.06601 0.1966817 42.85444
14 15 337.8476 0.4691908 237.7640 70.51702 0.1852818 36.74643
min(regfit_seq$results$RMSE)
[1] 321.9239
coef(regfit_seq$finalModel, id = 9)
(Intercept) AtBat Hits Walks CAtBat
146.24960033 -1.93676754 6.65672102 5.55204413 -0.09953904
CRuns CRBI CWalks DivisionW PutOuts
1.25067124 0.66176849 -0.77798498 -115.34950146 0.27773062
TEST <- Hitters[!train, ]
head(predict(regfit_seq, newdata = TEST))
-Alvin Davis -Andres Galarraga -Alfredo Griffin -Andres Thomas
724.83980 519.19903 537.33094 51.00533
-Andre Thornton -Alex Trevino
720.47199 278.04157
### Note that the following is the same output???
coef(reg_best, 9)
(Intercept) AtBat Hits Walks CAtBat
146.24960033 -1.93676754 6.65672102 5.55204413 -0.09953904
CRuns CRBI CWalks DivisionW PutOuts
1.25067124 0.66176849 -0.77798498 -115.34950146 0.27773062
head(predict(reg_best, id = 9, newdata = TEST))
[,1]
-Alvin Davis 724.83980
-Andres Galarraga 519.19903
-Alfredo Griffin 537.33094
-Andres Thomas 51.00533
-Andre Thornton 720.47199
-Alex Trevino 278.04157