# Chapter 3 Working with the Data

When building a predictive model with a sufficiently large data set, it is common practice to hold out some fraction (usually less than 50%) of the data as a test set. It is difficult to provide a general rule for the size of the `training` and `testing` sets, as the ideal split depends on the signal-to-noise ratio in the data (Hastie, Tibshirani, and Friedman 2009). Figure 3.1 shows a schematic display of splitting the data into `training` and `testing` data sets. The process of holding out a portion of the data to be used as a `testing` set is commonly referred to as the validation set approach or the holdout method.
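The holdout idea can be sketched in a few lines of base `R`. This is a minimal illustration rather than the approach used in the rest of the chapter: the built-in `mtcars` data set is only a stand-in, and the object names `train_rows`, `toy_training`, and `toy_testing` are hypothetical.

```r
# Holdout split sketch in base R; mtcars is only a stand-in data set
set.seed(1)                                       # make the random split reproducible
n <- nrow(mtcars)                                 # number of observations (32)
train_rows <- sample(n, size = floor(0.8 * n))    # row numbers drawn for training
toy_training <- mtcars[train_rows, ]              # roughly 80% of the rows
toy_testing  <- mtcars[-train_rows, ]             # the remaining held-out rows
c(training = nrow(toy_training), testing = nrow(toy_testing))
```

Because the rows are drawn at random, a different seed gives a different partition of the same sizes.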

The `Boston` data set from the **MASS** package written by Ripley (2021) is used to illustrate various steps in predictive model building. The `Boston` help file tells the reader that the data set consists of 506 observations on 14 different variables for houses in Boston collected in 1978. To open the `Boston` help file, type `?Boston` at the `R` prompt once the **MASS** package has been loaded. In the `R` code below, the `Boston` data set is divided into a `training` set containing roughly 80% of the observations and a `testing` set containing roughly 20% of the observations. Before calling the `createDataPartition()` function, it is important to set a seed to ensure the data partition is reproducible.

```
library(caret) # load the caret package
library(MASS) # load MASS package
set.seed(3178) # set seed for reproducibility
trainIndex <- createDataPartition(y = Boston$medv,
                                  p = 0.80,
                                  list = FALSE,
                                  times = 1)
training <- Boston[trainIndex, ]
testing <- Boston[-trainIndex, ]
dim(training)
```

```
[1] 407  14
```

```
dim(testing)
```

```
[1] 99 14
```

The `training` data set consists of 407 rows, which are 80.43% of the original `Boston` data set, while the `testing` data set has 99 rows, which are 19.57% of the original `Boston` data set. The `createDataPartition()` function can also split on important categorical variables. Once a model has been trained with the `training` data set, some measure of the quality of fit is needed to assess the model. For regression settings, commonly used measures are mean squared error (MSE), root mean squared error (RMSE), and mean absolute error (MAE), defined in (3.1), (3.2), and (3.3), respectively.

\[\begin{equation} \text{MSE} = \frac{1}{n}\sum_{i=1}^n(y_i - \hat{f}(x_i))^2, \tag{3.1} \end{equation}\]

\[\begin{equation} \text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^n(y_i - \hat{f}(x_i))^2}, \tag{3.2} \end{equation}\]

\[\begin{equation} \text{MAE} = \frac{1}{n}\sum_{i=1}^n\left|y_i - \hat{f}(x_i)\right|, \tag{3.3} \end{equation}\]

where \(\hat{f}(x_i)\) is the prediction that \(\hat{f}\) returns for the \(i^{\text{th}}\) observation. Figure 2.1 highlights the idea that the RMSE the practitioner should minimize is the RMSE associated with the `testing` data set (\(\text{RMSE}_{\text{testing}}\)), not the RMSE associated with the `training` set (\(\text{RMSE}_{\text{training}}\)).
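The three measures in (3.1), (3.2), and (3.3) can be computed directly in base `R`. The sketch below uses made-up observed values `y` and predictions `y_hat`; in practice `y_hat` would come from applying a fitted model to the `testing` data set.

```r
# Toy observed responses and predictions (made-up numbers for illustration)
y     <- c(3, 5, 2, 7)
y_hat <- c(2, 5, 4, 6)
errors <- y - y_hat            # prediction errors: 1, 0, -2, 1
mse  <- mean(errors^2)         # (1 + 0 + 4 + 1) / 4 = 1.5
rmse <- sqrt(mse)              # square root of the MSE
mae  <- mean(abs(errors))      # (1 + 0 + 2 + 1) / 4 = 1
c(MSE = mse, RMSE = rmse, MAE = mae)
```

RMSE and MAE are on the same scale as the response, which often makes them easier to interpret than MSE.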

## 3.1 Visualizing and Checking the Data

Visualizing the data allows the model builder to see relationships between the predictors and the response and should always be done before deciding on a functional form for \(f\). In addition to visualizing the data, the model builder can use the function `summary()` to compute summary statistics on all the variables in a data set. If the data set has missing values, the user must decide how to deal with the incomplete observations, since many statistical learning algorithms will not accept missing values. Imputing missing values is possible with the `preProcess()` function from **caret**. Analyzing and possibly imputing missing values, re-coding values or factors, creating new features from existing data, applying appropriate transformations to variables, and arranging the data in a readable format are all part of what is known as data munging or data preparation. Zhu et al. (2013) point out that few undergraduate texts have data munging exercises, which creates a gap in the statistics curriculum. What is common is for textbooks to use clean “tidy” data from either a web page or an `R` package. Chapter 5 introduces a moderately small data set readers can use to practice their data munging skills.
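As a small illustration of checking for missing values, the base `R` sketch below builds a made-up data frame (the name `toy` and its columns are hypothetical), summarizes it, counts the missing values in each variable, and keeps only the complete rows. Dropping incomplete rows is just one option; imputation is another.

```r
# Toy data frame with missing values (made-up numbers for illustration)
toy <- data.frame(x1 = c(1.2, NA, 3.4, 4.1),
                  x2 = c(10, 20, NA, NA),
                  y  = c(5, 6, 7, 8))
summary(toy)                                  # summary statistics, with NA counts
na_counts <- colSums(is.na(toy))              # missing values per variable
na_counts
complete_rows <- toy[complete.cases(toy), ]   # keep only fully observed rows
nrow(complete_rows)
```

Here only the first row is fully observed, which shows how quickly row deletion can shrink a data set when several variables have missing values.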

A scatterplot matrix is an array of scatterplots used to examine the marginal relationships of the predictors and response and is often a good starting point for understanding relationships in the data. The `R` code below loads the **MASS** package and provides code to create a scatterplot matrix using the `ggpairs()` function from the **GGally** package written by Schloerke et al. (2021). Figure 3.2 shows scatterplots in the lower triangle of the matrix, density plots of the individual variables along the diagonal, and the correlations between variables in the upper triangle of the matrix. Based on Figure 3.2, the user might want to transform some of the variables in an attempt to make the scatterplots more linear in form. Additional techniques to visualize the multivariate linear regression model are described in Olive, Pelawa Watagoda, and Rupasinghe Arachchige Don (2015).

```
library(MASS) # load MASS package
library(GGally) # load GGally package
ggpairs(data = training,
        columns = c("medv", "dis", "lstat", "nox", "age", "rm"),
        mapping = aes(alpha = 0.01)) +
  theme_bw()
```
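The correlations displayed in the upper triangle of a scatterplot matrix can also be obtained numerically with the base `R` function `cor()`. The sketch below uses the built-in `mtcars` data set as a stand-in, with an arbitrary choice of variables stored in the hypothetical object `vars`.

```r
# Numerical companion to a scatterplot matrix; mtcars is only a stand-in data set
vars <- c("mpg", "wt", "hp", "disp")   # arbitrary variables chosen for illustration
cor_mat <- cor(mtcars[, vars])         # pairwise Pearson correlations
round(cor_mat, 2)                      # symmetric matrix with ones on the diagonal
```

Pearson correlation only measures linear association, so a scatterplot is still needed to spot curved relationships.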

## 3.2 Pre-Processing the Data

Some algorithms work better when the predictors are on the same scale. This section considers the `preProcess()` function from the **caret** package to find potentially helpful transformations of the `training` predictors. Three different transformations are considered: `center`, `scale`, and `BoxCox`. A `center` transform computes the mean of a variable and subtracts the computed mean from each value of the variable. A `scale` transform computes the standard deviation of a variable and divides each value of the variable by the computed standard deviation. Using both a `center` and a `scale` transform standardizes a variable. That is, using both `center` and `scale` on a variable creates a variable with a mean of 0 and a standard deviation of 1. When all values of a variable are positive, a `BoxCox` transform can reduce the skew of a variable, making it more Gaussian. The `R` code below estimates `center`, `scale`, and `BoxCox` transforms for all the predictors in `training` and stores the results in `pp_training`. The computed transformations are applied to both the `training` and the `testing` data sets using the `predict()` function, with the results stored in the objects `trainingTrans` and `testingTrans`, respectively. Note that the response (`medv`) is the last column (\(14^{\text{th}}\)) of the `training` data frame and is removed before pre-processing with `training[ , -14]`.

```
pp_training <- preProcess(training[ , -14],
                          method = c("center", "scale", "BoxCox"))
pp_training
```

```
Created from 407 samples and 13 variables
Pre-processing:
- Box-Cox transformation (11)
- centered (13)
- ignored (0)
- scaled (13)
Lambda estimates for Box-Cox transformation:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  -1.00   -0.15    0.20    0.40    0.90    2.00
```

```
trainingTrans <- predict(pp_training, training)
testingTrans <- predict(pp_training, testing)
```
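To make the `center` and `scale` definitions concrete, the base `R` sketch below standardizes a made-up predictor by hand. It is an illustration of the idea, not of how `preProcess()` is implemented; the object names are hypothetical. The mean and standard deviation are estimated from the training values only and then reused for the testing values.

```r
# Toy training and testing values of one predictor (made-up numbers)
x_train <- c(2, 4, 6, 8, 10)
x_test  <- c(3, 9)
train_mean <- mean(x_train)                     # estimated from training only
train_sd   <- sd(x_train)                       # estimated from training only
z_train <- (x_train - train_mean) / train_sd    # center, then scale
z_test  <- (x_test  - train_mean) / train_sd    # reuse the training estimates
c(mean = mean(z_train), sd = sd(z_train))       # 0 and 1 up to rounding error
z_test                                          # need not have mean 0 or sd 1
```

Reusing the training estimates keeps information from the `testing` data out of the pre-processing step.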

Figure 3.3 uses the same variables as Figure 3.2 after applying `center`, `scale`, and `BoxCox` transforms to the predictors. Note that the density plots found on the diagonal of Figure 3.3 appear less skewed than the density plots of the un-transformed predictors found on the diagonal of Figure 3.2.
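The effect of a Box-Cox transform on skew can be illustrated with simulated data. The sketch below assumes the usual Box-Cox form, \((x^{\lambda} - 1)/\lambda\) for \(\lambda \neq 0\) and \(\log(x)\) for \(\lambda = 0\), and fixes \(\lambda = 0\) by hand rather than estimating it from the data as `preProcess()` does. The helpers `box_cox()` and `skewness()` are hypothetical and written only for this example.

```r
# Hypothetical helper: Box-Cox transform for a fixed lambda (x must be positive)
box_cox <- function(x, lambda) {
  if (lambda == 0) log(x) else (x^lambda - 1) / lambda
}
# Hypothetical helper: moment-based sample skewness
skewness <- function(x) mean((x - mean(x))^3) / mean((x - mean(x))^2)^1.5

set.seed(1)                                    # reproducible simulation
x <- rlnorm(500)                               # right-skewed, strictly positive values
skew_raw <- skewness(x)                        # skewness before transforming
skew_bc  <- skewness(box_cox(x, lambda = 0))   # lambda = 0 is the log transform
c(raw = skew_raw, transformed = skew_bc)
```

For log-normal data the log transform is the natural choice, so the transformed skewness should be much closer to 0 than the raw skewness.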

### References

Hastie, Tibshirani, and Friedman. 2009. *The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition*. 2nd edition. Springer.

James, Witten, Hastie, and Tibshirani. 2013. *An Introduction to Statistical Learning: With Applications in R*. 1st ed. 2013, Corr. 7th printing 2017 edition. New York: Springer.

Olive, Pelawa Watagoda, and Rupasinghe Arachchige Don. 2015. *International Journal of Statistics and Probability* 4 (1): 126. https://doi.org/10.5539/ijsp.v4n1p126.

Ripley. 2021. *MASS: Support Functions and Datasets for Venables and Ripley’s MASS*. http://www.stats.ox.ac.uk/pub/MASS4/.

Schloerke et al. 2021. *GGally: Extension to Ggplot2*. https://CRAN.R-project.org/package=GGally.

Zhu et al. 2013. *The American Statistician* 67 (4): 235–41. https://doi.org/10.1080/00031305.2013.842498.