| hsgpa | collgpa |
|-------|---------|
| 2.7   | 2.2     |
| 3.1   | 2.8     |
| 2.1   | 2.4     |
| 3.2   | 3.8     |
| 2.4   | 1.9     |
| 3.4   | 3.5     |
| 2.6   | 3.1     |
| 2.0   | 1.4     |
| 3.1   | 3.4     |
| 2.5   | 2.5     |
Least Squares Regression
Correlation
The correlation coefficient, denoted by \(r\), measures the direction and strength of the linear relationship between two numerical variables. It is given by the equation
\[\begin{equation} \label{eq-cor} r = \frac{1}{(n-1)}\sum_{i=1}^n\left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right) \end{equation}\]
Following are the high school GPAs and the college GPAs at the end of the freshman year for ten different students from the Gpa data set of the BSDA package.
Create a scatterplot and then comment on the relationship between the two variables.
The college GPA is the response variable and is labeled on the vertical axis. The scatterplot in Figure 1 shows that the college GPA increases as the high school GPA increases. In fact, the dots appear to cluster along a straight line. The correlation coefficient is \(r = 0.844\), which indicates a strong positive linear relationship and suggests that a straight line is a reasonable model for the data.
- Compute the correlation coefficient using the equation presented earlier.
Note:

```r
library(ggplot2)
p1 <- ggplot(data = Gpa, aes(x = hsgpa, y = collgpa)) +
  geom_point() +
  theme_bw()
# values is assumed to be a data frame holding the standardized
# scores zx and zy
p2 <- ggplot(data = values, aes(x = zx, y = zy)) +
  geom_point() +
  theme_bw()
library(gridExtra)
grid.arrange(p1, p2, ncol = 1, nrow = 2)
# Or better yet
library(patchwork)
p1 / p2
```
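The correlation can also be computed directly from the defining equation. The base-R sketch below uses the ten GPA pairs from the table above as a stand-in for BSDA's Gpa data; `zx` and `zy` are the standardized scores that appear inside the sum.

```r
# GPA pairs from the table above (stand-in for BSDA's Gpa data)
hsgpa   <- c(2.7, 3.1, 2.1, 3.2, 2.4, 3.4, 2.6, 2.0, 3.1, 2.5)
collgpa <- c(2.2, 2.8, 2.4, 3.8, 1.9, 3.5, 3.1, 1.4, 3.4, 2.5)
n  <- length(hsgpa)
zx <- (hsgpa - mean(hsgpa)) / sd(hsgpa)        # standardized x scores
zy <- (collgpa - mean(collgpa)) / sd(collgpa)  # standardized y scores
r  <- sum(zx * zy) / (n - 1)                   # the defining equation
round(r, 3)   # 0.844
```

The built-in `cor(hsgpa, collgpa)` returns the same value.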
Least Squares Regression
The equation of a straight line is
\[y = b_0 + b_1x\]
where \(b_0\) is the \(y\)-intercept and \(b_1\) is the slope of the line. From the equation of the line that best fits the data,
\[\hat{y} = b_0 + b_1x\]
we can compute a predicted \(y\) for each value of \(x\) and then measure the error of the prediction. The error of the prediction, \(e_i\) (also called the residual) is the difference in the actual \(y_i\) and the predicted \(\hat{y}_i\). That is, the residual associated with the data point \((x_i, y_i)\) is
\[e_i = y_i - \hat{y}_i.\]
The least squares regression line is
\[\hat{y} = b_0 + b_1x\]
where
\[\begin{equation} b_1 = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sum(x_i - \bar{x})^2} = r\frac{s_y}{s_x} \end{equation}\]
and
\[\begin{equation} b_0 = \bar{y} - b_1\bar{x} \end{equation}\]
Find the least squares regression line \(\hat{y} = b_0 + b_1x\) for the Gpa data.
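As a sketch, the slope and intercept formulas above can be applied directly in base R, again using the table's GPA values as a stand-in for the Gpa data:

```r
hsgpa   <- c(2.7, 3.1, 2.1, 3.2, 2.4, 3.4, 2.6, 2.0, 3.1, 2.5)
collgpa <- c(2.2, 2.8, 2.4, 3.8, 1.9, 3.5, 3.1, 1.4, 3.4, 2.5)
b1 <- sum((hsgpa - mean(hsgpa)) * (collgpa - mean(collgpa))) /
  sum((hsgpa - mean(hsgpa))^2)            # slope
b0 <- mean(collgpa) - b1 * mean(hsgpa)    # intercept
c(b0 = round(b0, 4), b1 = round(b1, 4))
```

Note that `b1` also equals `r * sd(collgpa) / sd(hsgpa)`, the second form of the slope equation.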
The coefficients are also computed when using the lm() function.
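A minimal sketch of the lm() approach; since the BSDA package may not be loaded here, the data frame is rebuilt from the table above:

```r
Gpa <- data.frame(
  hsgpa   = c(2.7, 3.1, 2.1, 3.2, 2.4, 3.4, 2.6, 2.0, 3.1, 2.5),
  collgpa = c(2.2, 2.8, 2.4, 3.8, 1.9, 3.5, 3.1, 1.4, 3.4, 2.5)
)
mod1 <- lm(collgpa ~ hsgpa, data = Gpa)  # response ~ explanatory
coef(mod1)   # (Intercept) is b0, hsgpa is b1
```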
Find the residuals for mod1.
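A sketch of extracting the residuals, assuming mod1 was fit as above (the data frame is rebuilt so the chunk stands alone):

```r
Gpa <- data.frame(
  hsgpa   = c(2.7, 3.1, 2.1, 3.2, 2.4, 3.4, 2.6, 2.0, 3.1, 2.5),
  collgpa = c(2.2, 2.8, 2.4, 3.8, 1.9, 3.5, 3.1, 1.4, 3.4, 2.5)
)
mod1 <- lm(collgpa ~ hsgpa, data = Gpa)
e <- resid(mod1)   # e_i = y_i - yhat_i, one residual per data point
e
```

As a check, `e` agrees with `Gpa$collgpa - fitted(mod1)`, and the residuals of a least squares fit sum to zero.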
Add the least squares line to the scatterplot for collgpa versus hsgpa.
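One way to overlay the fitted line is geom_smooth() with method = "lm" (again rebuilding the data frame, since BSDA may not be available):

```r
library(ggplot2)
Gpa <- data.frame(
  hsgpa   = c(2.7, 3.1, 2.1, 3.2, 2.4, 3.4, 2.6, 2.0, 3.1, 2.5),
  collgpa = c(2.2, 2.8, 2.4, 3.8, 1.9, 3.5, 3.1, 1.4, 3.4, 2.5)
)
p <- ggplot(data = Gpa, aes(x = hsgpa, y = collgpa)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +  # least squares line
  theme_bw()
p
```

Setting se = FALSE suppresses the confidence band so only the line \(\hat{y} = b_0 + b_1x\) is drawn.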