Reproduce this document. The data frame `WATER` is from the `PASWR2` package written by Arnholt (2016). The example is from Ugarte, Militino, and Arnholt (2015).
A bottled water company acquires its water from two independent sources, `x` and `y`. The company suspects that the sodium content in the water from source `x` is less than the sodium content in the water from source `y`. An independent agency measures the sodium content in 20 samples from source `x` and 10 samples from source `y` and stores them in the data frame `WATER`. Is there statistical evidence to suggest the average sodium content in the water from source `x` is less than the average sodium content in the water from source `y`? The sodium measurements are in mg/L. Use an \(\alpha\) of 0.05 to test the appropriate hypotheses.
To solve this problem, start by verifying the reasonableness of the normality assumption. The side-by-side boxplots and normal quantile-quantile plots depicted in Figures 1.1 and 1.2, respectively, suggest it is reasonable to assume the sodium values for both sources follow normal distributions; however, the boxplots make it clear that the two variances are very different.
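library(PASWR2)  # provides the WATER data frame
# Side-by-side boxplots of sodium content by source
ggplot(data = WATER, mapping = aes(x = source, y = sodium)) +
  geom_boxplot() +
  theme_bw()
# Normal quantile-quantile plots of sodium content, colored by source
ggplot(data = WATER, mapping = aes(sample = sodium, color = source)) +
  stat_qq() +
  theme_bw()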
Step 1: Hypotheses — Because the question is whether the mean sodium content from source `x` is less than the mean sodium content from source `y`, use a lower one-sided alternative hypothesis as shown in Equation (1.1).
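Written out under the assumption that the hypothesized difference is \(\delta_0 = 0\) (consistent with the alternative reported in the `t.test()` output, "true difference in means is less than 0"), the hypotheses would take the following form:

\[\begin{equation} H_0: \mu_x - \mu_y = 0 \text{ versus } H_1: \mu_x - \mu_y < 0 \tag{1.1} \end{equation}\]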
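library(dplyr)
# Mean, standard deviation, and sample size of sodium for each source
NDF <- WATER %>%
  group_by(source) %>%
  summarize(MEAN = mean(sodium), SD = sd(sodium), n = n())
# Render the summary as a captioned table
knitr::kable(NDF, caption = "Summary statistics for the `WATER` data frame")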
| source | MEAN | SD | n |
|---|---|---|---|
| x | 76.4 | 11.080566 | 20 |
| y | 81.2 | 2.299758 | 10 |
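The summary table puts a number on the difference in spread seen in the boxplots. The following is a minimal sketch that squares the two reported standard deviations and takes their ratio; the variable names are illustrative, and the values are copied from the table rather than recomputed from `WATER`.

```r
# Standard deviations copied from the summary table (not recomputed from WATER)
sd_x <- 11.080566  # source x
sd_y <- 2.299758   # source y

# Sample variances and their ratio
var_x <- sd_x^2
var_y <- sd_y^2
var_ratio <- var_x / var_y

round(c(var_x = var_x, var_y = var_y, ratio = var_ratio), 2)
```

The sample variance for source `x` is roughly 23 times that for source `y`, which is why a procedure that does not pool the variances is the natural choice here.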
Step 2: Test Statistic — The test statistic chosen is \(\bar{X} - \bar{Y}\) because \(E\left[\bar{X} - \bar{Y} \right] = \mu_x - \mu_y\). The value of this test statistic is \(76.4 - 81.2 = -4.8\). The standardized test statistic under the assumption that \(H_0\) is true and its approximate distribution are given in Equation (1.2).
\[\begin{equation} \frac{\left[(\bar{X} - \bar{Y}) - \delta_0 \right]}{\sqrt{\left(\frac{S_x^2}{n_x} + \frac{S_y^2}{n_y}\right)}} \overset{\bullet}{\sim} t_{\nu} \tag{1.2} \end{equation}\]
t.test(sodium ~ source, data = WATER, alternative = "less")
Welch Two Sample t-test
data: sodium by source
t = -1.8589, df = 22.069, p-value = 0.03822
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
-Inf -0.3665724
sample estimates:
mean in group x mean in group y
76.4 81.2
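The numbers in this output can be checked by hand from the summary statistics. The sketch below evaluates Equation (1.2) with \(\delta_0 = 0\) (an assumption, matching the reported alternative) and uses the Welch-Satterthwaite approximation \(\nu = \left(\frac{s_x^2}{n_x} + \frac{s_y^2}{n_y}\right)^2 \Big/ \left[\frac{(s_x^2/n_x)^2}{n_x - 1} + \frac{(s_y^2/n_y)^2}{n_y - 1}\right]\) for the degrees of freedom, which is the approximation behind the fractional `df` that `t.test()` reports. The variable names are illustrative, and the inputs are the rounded values from the table rather than the raw data.

```r
# Summary statistics copied from the table (rounded; not recomputed from WATER)
mean_x <- 76.4; sd_x <- 11.080566; n_x <- 20
mean_y <- 81.2; sd_y <- 2.299758;  n_y <- 10
delta_0 <- 0  # hypothesized difference under H0 (assumed)

# Estimated variance of each sample mean
vx <- sd_x^2 / n_x
vy <- sd_y^2 / n_y

# Standardized test statistic from Equation (1.2)
t_obs <- ((mean_x - mean_y) - delta_0) / sqrt(vx + vy)

# Welch-Satterthwaite approximation to the degrees of freedom
nu <- (vx + vy)^2 / (vx^2 / (n_x - 1) + vy^2 / (n_y - 1))

# Lower-tail p-value for the one-sided alternative
p_value <- pt(t_obs, df = nu)

# Upper limit of the one-sided 95 percent confidence interval
upper <- (mean_x - mean_y) + qt(0.95, df = nu) * sqrt(vx + vy)

round(c(t = t_obs, df = nu, p.value = p_value, upper = upper), 4)
```

Up to rounding of the inputs, these values should agree with the `t`, `df`, `p-value`, and upper confidence limit printed by `t.test()`. Because the p-value of 0.03822 is below \(\alpha = 0.05\), the output supports rejecting \(H_0\) in favor of the alternative that the mean sodium content of source `x` is less than that of source `y`.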
Arnholt, Alan T. 2016. PASWR2: Probability and Statistics with R, Second Edition. https://CRAN.R-project.org/package=PASWR2.
Ugarte, Maria Dolores, Ana F. Militino, and Alan T. Arnholt. 2015. Probability and Statistics with R. 2nd ed. Boca Raton: Chapman & Hall/CRC.