Create a directory named HomePricesProject inside your private class GitHub repository. Store all work for this project in this directory.
Read the article Modeling Home Prices Using Realtor Data.
Create an Rmarkdown document named Project1.Rmd inside the HomePricesProject directory. Complete all subsequent directions in this document.
Read the data from http://ww2.amstat.org/publications/jse/datasets/homes76.dat.txt into an R object named HP.
Remove columns 1, 7, 10, 15, 16, 17, 18, and 19 from HP and store the result back in HP.
Name the columns in HP price, size, lot, bath, bed, year, age, garage, status, active, and elem, respectively.
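The three data-preparation steps above (read, drop columns, rename) can be sketched as one small helper. This is only a sketch: the helper name clean_homes is hypothetical, it assumes the raw file has exactly 19 columns, and header = TRUE in the commented read.table call is an assumption about the file layout that should be checked against the download.

```r
# Hypothetical helper: drop the unneeded columns and apply the names
# listed in the assignment. Assumes the raw data has exactly 19 columns.
clean_homes <- function(raw) {
  stopifnot(ncol(raw) == 19)
  out <- raw[, -c(1, 7, 10, 15:19)]  # negative indices remove columns
  names(out) <- c("price", "size", "lot", "bath", "bed", "year",
                  "age", "garage", "status", "active", "elem")
  out
}

# Intended use (needs internet access; header = TRUE is an assumption):
# HP <- read.table("http://ww2.amstat.org/publications/jse/datasets/homes76.dat.txt",
#                  header = TRUE)
# HP <- clean_homes(HP)

# Offline check on a made-up 19-column data frame (not the real data)
fake <- as.data.frame(matrix(seq_len(19 * 3), nrow = 3))
cleaned <- clean_homes(fake)
str(cleaned)
```

Wrapping the steps in a function keeps the column indices and names in one place, so a mismatch in the number of columns fails loudly instead of silently mislabeling variables.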
Use the function datatable from the DT package to display the data from HP. Your data display should look similar to the one below.
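A minimal sketch of the datatable call, assuming the DT package is installed. The built-in mtcars data is used here only as a stand-in so the code runs without the download; in the project, HP takes its place. The rownames = FALSE argument is an optional choice, not something the assignment requires.

```r
library(DT)  # assumes DT is installed

# mtcars is a stand-in for HP; datatable() returns an interactive HTML widget
widget <- datatable(head(mtcars, 20), rownames = FALSE)
widget  # printing the widget renders it in the knitted document
```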
Explore the data for variables that might help explain the price of a house. What are the units for price and size?
Use the function stepAIC from the MASS package to create models using forward selection and backward elimination. Store the model from backward elimination in an object named mod.be and the model from forward selection in an object named mod.fs.
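The two selection directions can be sketched as follows. The built-in mtcars data stands in for HP (with mpg playing the role of price), and the object names demo.be and demo.fs are placeholders rather than the required mod.be and mod.fs. Backward elimination starts from the full model; forward selection starts from the intercept-only model and needs a scope that names the largest model it may reach.

```r
library(MASS)  # provides stepAIC

# Stand-in data: mtcars instead of HP, mpg instead of price
null_mod <- lm(mpg ~ 1, data = mtcars)  # intercept only
full_mod <- lm(mpg ~ ., data = mtcars)  # all available predictors

# Backward elimination: start full, drop terms while AIC improves
demo.be <- stepAIC(full_mod, direction = "backward", trace = FALSE)

# Forward selection: start empty, add terms from the full model's formula
demo.fs <- stepAIC(null_mod,
                   scope = list(lower = ~ 1, upper = formula(full_mod)),
                   direction = "forward", trace = FALSE)

formula(demo.be)
formula(demo.fs)
```

The two searches need not end at the same model, which is why the next question asks for a comparison.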
Which model (mod.be or mod.fs) do you believe is better and why?
Create a model named mod1 that regresses price on all of the variables in HP with the exception of status and year. Produce a summary of mod1 and graph the residuals using residualPlots from the car package. Based on your residual plots, what might you do to mod1? Report the adjusted \(R^2\) value for mod1.
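One way to regress on "everything except a few variables" is the formula idiom . - var1 - var2. The sketch below shows that idiom, the adjusted \(R^2\) extraction, and the residualPlots call on a stand-in subset of mtcars (not HP); it assumes the car package is installed, and the names cars_demo and demo1 are placeholders.

```r
library(car)  # provides residualPlots; assumes car is installed

# Stand-in data: a few continuous columns of mtcars instead of HP
cars_demo <- mtcars[, c("mpg", "wt", "hp", "disp", "qsec", "drat")]

# "." means every other column; "- qsec - drat" excludes those two
demo1 <- lm(mpg ~ . - qsec - drat, data = cars_demo)

summary(demo1)
adj_r2 <- summary(demo1)$adj.r.squared  # adjusted R^2 as a number
adj_r2

# Residuals against each predictor and the fitted values, with curvature tests
residualPlots(demo1)
```

Curved patterns in these plots are the usual signal that a squared term or a transformation may be worth trying.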
Create a new model (mod2) by adding bath:bed and age\(^2\) to mod1. Report the adjusted \(R^2\) value for mod2.
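Adding an interaction and a squared term to an existing fit can be sketched with update(). Again mtcars stands in for HP, so wt:hp plays the role of bath:bed and I(hp^2) the role of age\(^2\); the names demo1 and demo2 are placeholders. The squared term must be wrapped in I() because ^ has a special meaning inside a model formula.

```r
# Stand-in base model on mtcars (not HP)
demo1 <- lm(mpg ~ wt + hp, data = mtcars)

# ". ~ ." keeps the old response and terms; the rest is appended
demo2 <- update(demo1, . ~ . + wt:hp + I(hp^2))

coef(demo2)
summary(demo2)$adj.r.squared
```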
Create a new model (mod3) from mod2 that keeps only the edison and harris levels of elem. Hint: use I(). Your estimated coefficients should agree with those in the article. Conduct a nested F-test (anova(mod3, mod2)). Does your p-value agree with the one presented in the article? Interpret this test. Report the adjusted \(R^2\) value for mod3.
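The I() hint can be illustrated on stand-in data: a logical comparison inside I() turns one level of a factor into a single indicator, which leaves the remaining levels pooled into the baseline. Below, the cylinder count in mtcars plays the role of elem, and the names cars_demo, demo_full, and demo_reduced are placeholders. For HP the analogous terms would presumably be I(elem == "edison") and I(elem == "harris"), assuming the levels are spelled that way in the data.

```r
# Stand-in data: cyl (4, 6, 8) as a three-level factor, like elem in HP
cars_demo <- mtcars
cars_demo$cyl_f <- factor(cars_demo$cyl)

# Full model: one dummy per non-baseline level of the factor
demo_full <- lm(mpg ~ wt + cyl_f, data = cars_demo)

# Reduced model: a single indicator; levels 4 and 6 share the baseline
demo_reduced <- lm(mpg ~ wt + I(cyl_f == "8"), data = cars_demo)

# Nested F-test: does the full factor explain more than the indicator alone?
ftab <- anova(demo_reduced, demo_full)
ftab
```

A large p-value in such a test suggests the dropped dummies add little, so the smaller model is adequate; a small p-value suggests the pooled levels really differ.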
Compute the training mean square prediction error for all five of the models. Which model has the smallest training mean square prediction error? Do you think this model will also have the smallest test mean square prediction error?
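Here the training mean square prediction error is taken to be the average squared residual on the data used to fit the model (an assumption about the intended definition, though a common one). A minimal sketch on stand-in mtcars models, with placeholder names:

```r
# Average squared residual on the training data
train_mse <- function(model) mean(residuals(model)^2)

# Stand-in models on mtcars; in the project the list would hold the five models
demo_models <- list(
  small = lm(mpg ~ wt, data = mtcars),
  big   = lm(mpg ~ wt + hp, data = mtcars)
)

mse <- sapply(demo_models, train_mse)
mse
```

For nested least-squares models the larger one can never have a higher training error, which is worth keeping in mind when judging whether the training winner will also win on test data.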
Use mod3 to create a 95% prediction interval for a home with the following features: 1879 square feet, lot size category 4, two and a half baths, three bedrooms, built in 1975, two-car garage, and near Parker Elementary School.
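A prediction interval for a new observation comes from predict() with interval = "prediction". The sketch below uses a stand-in mtcars model and made-up predictor values; for mod3 the newdata data frame needs one column per variable in the model, on the same scale the article uses, so a raw value such as 1879 or 1975 may need rescaling first.

```r
# Stand-in model on mtcars (not mod3)
demo_mod <- lm(mpg ~ wt + hp, data = mtcars)

# One row describing the new observation; values are illustrative only
new_obs <- data.frame(wt = 3, hp = 120)

pred <- predict(demo_mod, newdata = new_obs,
                interval = "prediction", level = 0.95)
pred  # columns: fit (point prediction), lwr and upr (interval bounds)
```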
EXTRA CREDIT: Install the package effects and run the following code:
library(effects)
plot(allEffects(mod2))
plot(effect("bath*bed", mod2))
plot(effect("bath*bed", mod2, xlevels=list(bed=2:5)))
plot(effect("bath*bed", mod2, xlevels=list(bath=1:3)))
Explain what each set of graphs is showing.