Can A New Drug Reduce the Spread of Schistosomiasis?

Schistosomiasis (skis-tuh-soh-may’-uh-sis) is a disease in humans caused by parasitic flatworms called schistosomes (skis’-tuh-sohms). Schistosomiasis affects about 200 million people worldwide and is a serious problem in sub-Saharan Africa, South America, China, and Southeast Asia. The disease can cause death, but more commonly results in chronic and debilitating symptoms, arising primarily from the body’s immune reaction to parasite eggs lodged in the liver, spleen,and intestines.

Currently there is one drug, praziquantel (pray’-zee-kwan-tel), in common use for treatment of schistosomiasis; it is inexpensive and effective. However, many organizations are concerned about relying on a single drug to treat a serious disease that affects so many people worldwide. Drug resistance may have prompted an outbreak in the 1990s in Senegal, where cure rates were low. In 2007, several researchers published work on a promising drug called K11777, which, in theory, might treat schistosomiasis.

gender <- c(rep("Female", 10), rep("Male", 10)) 
group <- rep(rep(c("Treatment", "Control"), each = 5), 2)
worms <- c(1, 2, 2, 10, 7, 16, 10, 10, 7, 17, 3, 5, 9, 10, 6, 31, 26, 28, 13, 47)
schis <- data.frame(gender, group, worms)
rm(gender, group, worms)
head(schis, n = 3)

  gender     group worms
1 Female Treatment     1
2 Female Treatment     2
3 Female Treatment     2

schis

   gender     group worms
1  Female Treatment     1
2  Female Treatment     2
3  Female Treatment     2
4  Female Treatment    10
5  Female Treatment     7
6  Female   Control    16
7  Female   Control    10
8  Female   Control    10
9  Female   Control     7
10 Female   Control    17
11   Male Treatment     3
12   Male Treatment     5
13   Male Treatment     9
14   Male Treatment    10
15   Male Treatment     6
16   Male   Control    31
17   Male   Control    26
18   Male   Control    28
19   Male   Control    13
20   Male   Control    47

library(ggplot2)
p <- ggplot(data = schis, aes(group, worms)) + 
  geom_point(position = "jitter", aes(color = group)) + 
  facet_grid(. ~ gender) + 
  theme_bw()
p

ACTIVITY

Use the previous graph to compare visually the number of worms for the treatment and control groups for both the male and female mice. Does each of the four groups appear to have a similar center and similar spread? Are there any outliers (extreme observations that don’t fit with the rest of the data)?
Calculate appropriate summary statistics (e.g., the median, mean, and standard deviation) for each of the four groups. For the female mice, calculate the difference between the treatment and control means. Do the same for the male mice.

# Classical approach with order()
with(data = schis, 
schis[order(gender, group, worms),]
     )

   gender     group worms
9  Female   Control     7
7  Female   Control    10
8  Female   Control    10
6  Female   Control    16
10 Female   Control    17
1  Female Treatment     1
2  Female Treatment     2
3  Female Treatment     2
5  Female Treatment     7
4  Female Treatment    10
19   Male   Control    13
17   Male   Control    26
18   Male   Control    28
16   Male   Control    31
20   Male   Control    47
11   Male Treatment     3
12   Male Treatment     5
15   Male Treatment     6
13   Male Treatment     9
14   Male Treatment    10

# Tidyverse Approach
library(dplyr)
schis <- schis %>%
  arrange(gender, group, worms)
schis

   gender     group worms
1  Female   Control     7
2  Female   Control    10
3  Female   Control    10
4  Female   Control    16
5  Female   Control    17
6  Female Treatment     1
7  Female Treatment     2
8  Female Treatment     2
9  Female Treatment     7
10 Female Treatment    10
11   Male   Control    13
12   Male   Control    26
13   Male   Control    28
14   Male   Control    31
15   Male   Control    47
16   Male Treatment     3
17   Male Treatment     5
18   Male Treatment     6
19   Male Treatment     9
20   Male Treatment    10

with(data = schis, 
tapply(worms, list(gender, group), median)    
     )

       Control Treatment
Female      10         2
Male        28         6

with(data = schis, 
tapply(worms, list(gender, group), mean)   
     )

       Control Treatment
Female      12       4.4
Male        29       6.6

with(data = schis, 
tapply(worms, list(gender, group), sd)    
     )

         Control Treatment
Female  4.301163  3.911521
Male   12.186058  2.880972

# Tidyverse Approach
schis %>%
  group_by(gender, group) %>%
  summarize(Md = median(worms), Mean = mean(worms), SD = sd(worms), N = n())

Source: local data frame [4 x 6]
Groups: gender [?]

  gender     group    Md  Mean        SD     N
  <fctr>    <fctr> <dbl> <dbl>     <dbl> <int>
1 Female   Control    10  12.0  4.301163     5
2 Female Treatment     2   4.4  3.911521     5
3   Male   Control    28  29.0 12.186058     5
4   Male Treatment     6   6.6  2.880972     5

The descriptive analysis in Questions 1 and 2 points to a positive treatment effect: K11777 appears to have reduced the number of parasite worms in this sample. But descriptive statistics are usually only the first step in ascertaining whether an effect is real; we often conduct a significance test or create a confidence interval to determine if chance alone could explain the effect.

We will introduce the basic concepts of randomization tests in a setting where units (mice in this example) are randomly allocated to a treatment or control group. Using a significance test, we will decide if an observed treatment effect (the difference between the mean responses in the treatment and control) is “real” or if “random chance alone” could plausibly explain the observed effect. The null hypothesis states that “random chance alone” is the reason for the observed effect. In this initial discussion, the alternative hypothesis will be one-sided because we want to show that the true treatment mean(\(\mu_{\text{treatment}}\)) is less than the true control mean (\(\mu_{\text{control}}\)).

Statistical Inference Through a Randomization Test

Whether they take the form of significance tests or confidence intervals, inferential procedures rest on the fundamental question for inference: “What would happen if we did this many times?” Let’s unpack this question in the context of the female mice schistosomiasis. We observed a difference in means of 7.6 = 12 - 4.4 worms between the control and treatment groups. While we expect that this large difference reflects the effectiveness of the drug, it is possible that chance alone could explain this difference. This “chance alone” position is usually called the null hypothesis and includes the following assumptions:

The number of parasitic worms found in the liver naturally varies from mouse to mouse.
Whether or not the drug is effective, there clearly is variability in the responses of mice to the infestation of schistosomes.
Each group exhibits this variability, and even if the drug is not effective, some mice do better than others.
The only explanation for the observed difference of 7.6 worms in the means is that the random allocation randomly placed mice with larger numbers of worms in the control group and mice with smaller numbers of worms in the treatment group.

In this study, the null hypothesis is that the treatment has no effect on the average worm count, and it is denoted as

\(H_0:\mu_{\text{control}} = \mu_{\text{treatment}}\) Another way to write this hypothesis is \(H_0: \text{the treatment has no effect on average worm count}\)

Alternative hypotheses can be “one-sided, greater-than” (as in this investigation), “one-sided, less-than” (the treatment causes an increase in worm count), or “two-sided” (the treatment is different, in one direction or the other, from the mean). We chose to test a one-sided hypothesis because there is a clear research interest in one direction. In other words, we will take action (start using the drug) only if we can show that K11777 reduces worm count.

The fundamental question for inference: Every statistical inference procedure is based on the question “How does what we observed in our data compare to what would happen if the null hypothesis were actually true and we repeated the process many times?”

For a randomization test comparing responses for the two groups, this question becomes “How does the observed difference between groups compare to what would happen if the treatments actually had no effect on the individual responses and we repeated the random allocation of individuals to groups many times?”

Conducting a Randomization Test by Hand

To get the feel for the concept of a p-value, write each of the female worm counts on an index card. Shuffle the 10 index cards, and then draw five cards at random (without replacement). Call these five cards the treatment group and the five remaining cards the control group. Under the null hypothesis (i.e. the treatment has no effect on the worms), this allocation mimics precisely what actually happened in our experiment, since the only cause of group differences is the random allocation. Calculate the mean of the five cards representing the treatment group and the mean of the five cards representing the control group. Then find the difference between the control and treatment group means that you obtained in your allocation. To be consistent, take the control group mean minus the treatment group mean.
If you were to do another random allocation, would you get the same difference in means? Explain.
Now, perform nine more random allocations, each time computing and writing down the difference in mean worm count between the control group and the treatment group. Graphically represent the differences. What proportion of these differences are 7.6 or larger?
If you performed the simulation many times, would you expect a large percentage of the simulations to result in a mean difference greater than 7.6? Explain.

The reasoning in the previous activity leads us to the randomization test and an interpretation of the fundamental question for inference. The fundamental question for this context is as follows: “If the null hypothesis were actually true and we randomly allocated our 10 mice to treatment and control groups many times, what proportion of the time would the observed difference in means be as big or bigger than 7.6?” This long-run proportion is a probability that statisticians call the p-value of the randomization test. The p-values for most randomization tests are found through simulations. Despite the fact that simulations do not give exact p-values, they are usually preferred over the tedious and time consuming process of listing all possible outcomes. Researchers usually pick a round number such as 10,000 repetitions on the simulation and approximate the p-value accordingly. Since this p-value is an approximation, it is often referred to as the empirical p-value.

Key Concept: Assuming that nothing except the random allocation process is creating group differences, the p-value of a randomization test is the probability of obtaining a group difference as large as or larger than the group difference actually observed in the experiment.

Key Concept: The calculation of an empirical p-value requires these steps:

Repeat the random allocation process a number of times (N times).
Record, each time, whether or not the group difference exceeds or is the same as the one observed in the actual experiment (let X be the number of times the group difference exceeds or is the same as the one observed).
Compute (X + 1)/(N + 1) to get the p-value, the proportion of times the difference exceeds or is the same as the observed difference.

Note: Including the observed value as one of the possible allocations is a more conservative approach and protects against getting a p-value of 0. Our observation from the actual experiment provides evidence that the true p-value is greater than zero.

Performing a Randomization Test Using Computer Simulation

While physical simulations (such as the index cards activity) help us understand the process of computing an empirical p-value, using computer software is a much more efficient way of producing an empirical p-value based on a large number of iterations. If you are simulating 10 random allocations, it is just as easy to use index cards as a computer. However, the advantage of a computer simulation is that 10,000 random allocations can be conducted in almost the same time it takes to simulate 10 allocations.

Two-Sample Permutation Test

Pool the m + n values.

repeat

Draw a resample of size m without replacement.
Use the remaining n observations for the other sample.
Calculate the difference in means or another test statistic that compares samples.

until

we have enough samples

Calculate the p-value as the fraction of times the random statistics are more or as extreme as the original statistic. Optionally, plot a histogram of the random statistic values.

Definition 3.2 A test statistic is a numerical function of the data whose value determines the result of the test. The function itself is generally denoted T=T(X), where X represents the data. After being evaluated for the sample data x, the result is called an observed test statistic and is written in lowercase, t=T(x).

Using Computer Simulations to Conduct a Hypothesis Test

Write code to allocate randomly each of the female worm counts to either the treatment or the control group.
Take the control group average minus the K11777 treatment group average.
Write code that perform steps 1. and 2. N = 99,999 times. Compute and report the empirical p-value.
Create a histogram of the N simulated differences between group means and comment of the shape of the histogram.
Based on your results from 3. and 4. and assuming the null hypothesis is true, about how frequently do you think you would obtain a mean difference as large as or larger than 7.6 by random allocation alone?
Does your answer from 5. lead you to believe the “chance alone” position (i.e., the null hypothesis that the mean worm count is the same for both the treatment and the control), or does is lead you to believe that K11777 has a positive inhibitory effect on the schistosome worm in female mice? Explain.

Using R

ND <- schis[schis$gender=="Female", ]
ND

   gender     group worms
1  Female   Control     7
2  Female   Control    10
3  Female   Control    10
4  Female   Control    16
5  Female   Control    17
6  Female Treatment     1
7  Female Treatment     2
8  Female Treatment     2
9  Female Treatment     7
10 Female Treatment    10

tapply(ND$worms, ND$group, mean)

  Control Treatment 
     12.0       4.4

# OR
ANS1 <- with(data = ND,
             tapply(worms, group, mean)
             )
ANS1

  Control Treatment 
     12.0       4.4

observed <- ANS1[1] - ANS1[2]
observed

Control 
    7.6

names(observed) <- NULL
observed

[1] 7.6

# Tidyverse approach
ND2 <- schis %>%
  filter(gender == "Female")
ND2

   gender     group worms
1  Female   Control     7
2  Female   Control    10
3  Female   Control    10
4  Female   Control    16
5  Female   Control    17
6  Female Treatment     1
7  Female Treatment     2
8  Female Treatment     2
9  Female Treatment     7
10 Female Treatment    10

IA <- ND2 %>%
  group_by(group) %>%
  summarize(MeanWorms = mean(worms))
IA

# A tibble: 2 × 2
      group MeanWorms
     <fctr>     <dbl>
1   Control      12.0
2 Treatment       4.4

obs_diff <- IA[1, 2] - IA[2, 2]
obs_diff

  MeanWorms
1       7.6

Since we will be working with the worms variable for females only, we will create a vector holding these values. Then, we will draw a random sample of size 5 from the numbers 1 through 10 (there are 10 observations). The worms values corresponding to these positions will be values for the Control group and the remaining ones for the Treatment group. The mean difference of this permutation will be stored in result. This will be repeated many times.

Worms <- ND$worms
Worms

 [1]  7 10 10 16 17  1  2  2  7 10

# Another way:
Worms2 <- subset(ND, select = worms, drop = TRUE)
Worms2

 [1]  7 10 10 16 17  1  2  2  7 10

N <- 10^4 - 1         # number of times to repeat the process
result <- numeric(N)  # space to save the random differences
set.seed(5)
for (i in 1:N){
  # sample of size 5, from 1 to 10, without replacement
  index <- sample(10, size = 5, replace = FALSE)
  result[i] <- mean(Worms2[index]) - mean(Worms2[-index])
}
hist(result, col = "blue", freq = FALSE, main = "", breaks = "Scott")

d.res <- density(result)
plot(d.res, main ="", xlab = "", ylab = "")
polygon(d.res, col ="pink")
xs <- c(7.6, d.res$x[d.res$x >= 7.6])
ys <- c(0, d.res$y[d.res$x>=7.6])
polygon(xs, ys, col = "red")

pvalue <- (sum(result >= observed) + 1)/(N + 1) # p-value
pvalue  # results will vary

[1] 0.0288

############################################################
# Yet another approach - 
ND2

   gender     group worms
1  Female   Control     7
2  Female   Control    10
3  Female   Control    10
4  Female   Control    16
5  Female   Control    17
6  Female Treatment     1
7  Female Treatment     2
8  Female Treatment     2
9  Female Treatment     7
10 Female Treatment    10

N <- 10^4 - 1          # number of times to repeat the process
result2 <- numeric(N)  # space to save the random differences
set.seed(5)
for (i in 1:N){
  # sample of size 5, from 1 to 10, without replacement
  index <- sample(10, size = 5, replace = FALSE)
  result2[i] <- mean(ND2$worms[index]) - mean(ND2$worms[-index])
}
# Graph Now
# ggplot2 approach now
DF <- data.frame(x = result)
p <- ggplot(data = DF) + geom_density(aes(x = x, y = ..density..), fill = 'pink', alpha = 0.4) + 
  theme_bw()
p

x.dens <- density(result)
df.dens <- data.frame(x = x.dens$x, y = x.dens$y)
p + geom_area(data = subset(df.dens, x >= 7.6 & x <= max(DF$x)), aes(x = x, y = y), fill = 'blue', alpha = 0.4) + 
  labs(x = expression(bar(x)[Control] - bar(x)[Treatment]), y = '', title = "Randomization Distribution") + 
  theme_bw()

The code snippet result >= observed results in a vector of TRUE’s and FALSE’s depending on whether or not the mean difference computed for a resample is greater than the observed mean difference. sum(result >= observed) counts the number of TRUE’s. Thus, the computed p-value is just the proportion of statistics (including the original) that are as large or larger than the original mean difference. The empirical p-value is 0.0288.

Because the sample sizes in the schistosomiasis study are small, it is possible to apply mathematical methods to obtain an exact p-value for this randomization test. An exact p-value can be obtained by writing down the set of all possibilities (assuming each possible outcome is equally likely under the null hypothesis) and then calculating the proportion of the set for which the difference is at least as large as the observed difference. In the schistosomiasis study, this requires listing every possible combination in which five of the 10 female mice can be allocated to the treatment (and the other five mice are assigned to the control). There are \(\binom{10}{5}=\) 252 possible combinations. For each of these combinations, the difference between the treatment and control means is then calculated. The exact p-value is the proportion of times in which the difference in the means is at least as large as the observed difference of 7.6 worms. Of these 252 combinations, six have a mean difference of 7.6 and one has a mean difference greater than 7.6 (namely 8.8). Since all 252 of these random allocations are equally likely, the exact p-value in this example is 7/252 = 0.0278. However, most real studies are too large to list all possible samples. Randomization tests are almost always adequate, providing approximate p-values that are close enough to the true p-value.

Key Concept: The larger the number of randomizations within a simulation study, the more precise the p-value is. If the true p-value is p, the estimated p-value has variance approximately equal to \(p(1 - p)/N\), where \(N\) is the number of resamples.

Sometimes we have some threshold p-value at or below which we will reject the null hypothesis and conclude in favor of the alternative hypothesis. This threshold value is called a significance level and is usually denoted by the Greek letter alpha (\(\alpha\)). Common values are \(\alpha = 0.05\) and \(\alpha = 0.01\), but the value will depend heavily on context and the researcher’s assessment of the acceptable risk of stating an incorrect conclusion. When the study’s p-value is less than or equal to this significance level, we state that the results are statistically significant at level \(\alpha\). If you see the phrase “statistically significant” without a specification of \(\alpha\) the writer is most likely assuming \(\alpha = 0.05\), for reasons of history and convention alone. However, it is best to show the p-value instead of simply stating a result is significant at a particular \(\alpha\)-level.

Activity

Explain what each line of the following code is doing.

# PROGRAMMING IS THE BEST WAY TO DEBUG YOUR THINKING!
# Theoretical Answer
library(PASWR2)
DATA <- c(1, 2, 2, 10, 7, 16, 10, 10, 7, 17)
DATA

 [1]  1  2  2 10  7 16 10 10  7 17

OBS <- mean(DATA[6:10]) - mean(DATA[1:5])
OBS

[1] 7.6

#
# ANS <- t(Combinations(10, 5))
ANS <- t(combn(10, 5))
head(ANS)

     [,1] [,2] [,3] [,4] [,5]
[1,]    1    2    3    4    5
[2,]    1    2    3    4    6
[3,]    1    2    3    4    7
[4,]    1    2    3    4    8
[5,]    1    2    3    4    9
[6,]    1    2    3    4   10

nn <- dim(ANS)[1]
nn

[1] 252

means <- numeric(nn)
for (i in 1:nn){
means[i] <- mean(DATA[ANS[i,]]) - mean(DATA[-ANS[i,]])
}
sort(means)

  [1] -8.8 -7.6 -7.6 -7.6 -7.6 -7.6 -7.6 -6.4 -6.4 -6.4 -5.6 -5.6 -5.6 -5.6
 [15] -5.6 -5.6 -5.2 -5.2 -5.2 -5.2 -5.2 -4.8 -4.8 -4.4 -4.4 -4.4 -4.4 -4.4
 [29] -4.4 -4.4 -4.4 -4.4 -4.4 -4.4 -4.4 -4.0 -4.0 -4.0 -4.0 -4.0 -4.0 -4.0
 [43] -4.0 -4.0 -3.6 -3.6 -3.6 -3.2 -3.2 -3.2 -3.2 -2.8 -2.8 -2.8 -2.8 -2.4
 [57] -2.4 -2.4 -2.4 -2.0 -2.0 -2.0 -2.0 -2.0 -2.0 -2.0 -2.0 -2.0 -2.0 -2.0
 [71] -2.0 -2.0 -2.0 -2.0 -2.0 -2.0 -2.0 -1.6 -1.6 -1.6 -1.6 -1.6 -1.6 -1.6
 [85] -1.6 -1.6 -1.6 -1.6 -1.6 -1.6 -1.6 -1.6 -1.6 -1.6 -1.6 -1.2 -1.2 -1.2
 [99] -1.2 -1.2 -1.2 -1.2 -1.2 -1.2 -0.8 -0.8 -0.8 -0.8 -0.8 -0.8 -0.8 -0.8
[113] -0.8 -0.8 -0.4 -0.4 -0.4 -0.4 -0.4 -0.4 -0.4 -0.4 -0.4  0.0  0.0  0.0
[127]  0.0  0.0  0.0  0.4  0.4  0.4  0.4  0.4  0.4  0.4  0.4  0.4  0.8  0.8
[141]  0.8  0.8  0.8  0.8  0.8  0.8  0.8  0.8  1.2  1.2  1.2  1.2  1.2  1.2
[155]  1.2  1.2  1.2  1.6  1.6  1.6  1.6  1.6  1.6  1.6  1.6  1.6  1.6  1.6
[169]  1.6  1.6  1.6  1.6  1.6  1.6  1.6  2.0  2.0  2.0  2.0  2.0  2.0  2.0
[183]  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.4  2.4  2.4
[197]  2.4  2.8  2.8  2.8  2.8  3.2  3.2  3.2  3.2  3.6  3.6  3.6  4.0  4.0
[211]  4.0  4.0  4.0  4.0  4.0  4.0  4.0  4.4  4.4  4.4  4.4  4.4  4.4  4.4
[225]  4.4  4.4  4.4  4.4  4.4  4.8  4.8  5.2  5.2  5.2  5.2  5.2  5.6  5.6
[239]  5.6  5.6  5.6  5.6  6.4  6.4  6.4  7.6  7.6  7.6  7.6  7.6  7.6  8.8

#
sum(means >= OBS)

[1] 7

pvalue <- sum(means >= OBS)/nn
pvalue

[1] 0.02777778

# 7/252
DF <- data.frame(x = means)
p <- ggplot(data = DF) + 
  geom_density(aes(x = x, y = ..density..), fill = "pink", alpha = 0.4) + 
  theme_bw()
p

x.dens <- density(means)
df.dens <- data.frame(x = x.dens$x, y = x.dens$y)
p + geom_area(data = subset(df.dens, x >= 7.6 & x <= max(DF$x)), aes(x = x, y = y), fill = 'blue', alpha = 0.4) + 
  labs(x = '', y = '')

# Another approach ....
#
P3 <- t(srs(DATA, n = 5))
P2 <- t(srs(DATA, n = 5))
# Note need to reorder the P2 values
P2R <- P2[, 252:1]
apply(P2R, 2, mean)

  [1] 12.0 10.2 11.4 11.4 12.0 10.0 10.8 12.0 12.0 12.6 10.6 10.2 10.2 10.8
 [15]  8.8 11.4 12.0 10.0 12.0 10.0 10.6  9.2 10.4 10.4 11.0  9.0  8.6  8.6
 [29]  9.2  7.2  9.8 10.4  8.4 10.4  8.4  9.0  9.2  9.2  9.8  7.8 10.4 11.0
 [43]  9.0 11.0  9.0  9.6  8.6  9.2  7.2  9.2  7.2  7.8 10.4  8.4  9.0  9.0
 [57]  9.2 10.4 10.4 11.0  9.0  8.6  8.6  9.2  7.2  9.8 10.4  8.4 10.4  8.4
 [71]  9.0  9.2  9.2  9.8  7.8 10.4 11.0  9.0 11.0  9.0  9.6  8.6  9.2  7.2
 [85]  9.2  7.2  7.8 10.4  8.4  9.0  9.0  7.6  7.6  8.2  6.2  8.8  9.4  7.4
 [99]  9.4  7.4  8.0  7.0  7.6  5.6  7.6  5.6  6.2  8.8  6.8  7.4  7.4  7.6
[113]  8.2  6.2  8.2  6.2  6.8  9.4  7.4  8.0  8.0  7.6  5.6  6.2  6.2  7.4
[127]  9.0 10.2 10.2 10.8  8.8  8.4  8.4  9.0  7.0  9.6 10.2  8.2 10.2  8.2
[141]  8.8  9.0  9.0  9.6  7.6 10.2 10.8  8.8 10.8  8.8  9.4  8.4  9.0  7.0
[155]  9.0  7.0  7.6 10.2  8.2  8.8  8.8  7.4  7.4  8.0  6.0  8.6  9.2  7.2
[169]  9.2  7.2  7.8  6.8  7.4  5.4  7.4  5.4  6.0  8.6  6.6  7.2  7.2  7.4
[183]  8.0  6.0  8.0  6.0  6.6  9.2  7.2  7.8  7.8  7.4  5.4  6.0  6.0  7.2
[197]  7.4  7.4  8.0  6.0  8.6  9.2  7.2  9.2  7.2  7.8  6.8  7.4  5.4  7.4
[211]  5.4  6.0  8.6  6.6  7.2  7.2  7.4  8.0  6.0  8.0  6.0  6.6  9.2  7.2
[225]  7.8  7.8  7.4  5.4  6.0  6.0  7.2  5.8  6.4  4.4  6.4  4.4  5.0  7.6
[239]  5.6  6.2  6.2  5.8  3.8  4.4  4.4  5.6  6.4  4.4  5.0  5.0  6.2  4.4

apply(P3, 2, mean)

  [1]  4.4  6.2  5.0  5.0  4.4  6.4  5.6  4.4  4.4  3.8  5.8  6.2  6.2  5.6
 [15]  7.6  5.0  4.4  6.4  4.4  6.4  5.8  7.2  6.0  6.0  5.4  7.4  7.8  7.8
 [29]  7.2  9.2  6.6  6.0  8.0  6.0  8.0  7.4  7.2  7.2  6.6  8.6  6.0  5.4
 [43]  7.4  5.4  7.4  6.8  7.8  7.2  9.2  7.2  9.2  8.6  6.0  8.0  7.4  7.4
 [57]  7.2  6.0  6.0  5.4  7.4  7.8  7.8  7.2  9.2  6.6  6.0  8.0  6.0  8.0
 [71]  7.4  7.2  7.2  6.6  8.6  6.0  5.4  7.4  5.4  7.4  6.8  7.8  7.2  9.2
 [85]  7.2  9.2  8.6  6.0  8.0  7.4  7.4  8.8  8.8  8.2 10.2  7.6  7.0  9.0
 [99]  7.0  9.0  8.4  9.4  8.8 10.8  8.8 10.8 10.2  7.6  9.6  9.0  9.0  8.8
[113]  8.2 10.2  8.2 10.2  9.6  7.0  9.0  8.4  8.4  8.8 10.8 10.2 10.2  9.0
[127]  7.4  6.2  6.2  5.6  7.6  8.0  8.0  7.4  9.4  6.8  6.2  8.2  6.2  8.2
[141]  7.6  7.4  7.4  6.8  8.8  6.2  5.6  7.6  5.6  7.6  7.0  8.0  7.4  9.4
[155]  7.4  9.4  8.8  6.2  8.2  7.6  7.6  9.0  9.0  8.4 10.4  7.8  7.2  9.2
[169]  7.2  9.2  8.6  9.6  9.0 11.0  9.0 11.0 10.4  7.8  9.8  9.2  9.2  9.0
[183]  8.4 10.4  8.4 10.4  9.8  7.2  9.2  8.6  8.6  9.0 11.0 10.4 10.4  9.2
[197]  9.0  9.0  8.4 10.4  7.8  7.2  9.2  7.2  9.2  8.6  9.6  9.0 11.0  9.0
[211] 11.0 10.4  7.8  9.8  9.2  9.2  9.0  8.4 10.4  8.4 10.4  9.8  7.2  9.2
[225]  8.6  8.6  9.0 11.0 10.4 10.4  9.2 10.6 10.0 12.0 10.0 12.0 11.4  8.8
[239] 10.8 10.2 10.2 10.6 12.6 12.0 12.0 10.8 10.0 12.0 11.4 11.4 10.2 12.0

DiffMeans <- apply(P3, 2, mean) - apply(P2R, 2, mean)
sort(DiffMeans)

  [1] -8.8 -7.6 -7.6 -7.6 -7.6 -7.6 -7.6 -6.4 -6.4 -6.4 -5.6 -5.6 -5.6 -5.6
 [15] -5.6 -5.6 -5.2 -5.2 -5.2 -5.2 -5.2 -4.8 -4.8 -4.4 -4.4 -4.4 -4.4 -4.4
 [29] -4.4 -4.4 -4.4 -4.4 -4.4 -4.4 -4.4 -4.0 -4.0 -4.0 -4.0 -4.0 -4.0 -4.0
 [43] -4.0 -4.0 -3.6 -3.6 -3.6 -3.2 -3.2 -3.2 -3.2 -2.8 -2.8 -2.8 -2.8 -2.4
 [57] -2.4 -2.4 -2.4 -2.0 -2.0 -2.0 -2.0 -2.0 -2.0 -2.0 -2.0 -2.0 -2.0 -2.0
 [71] -2.0 -2.0 -2.0 -2.0 -2.0 -2.0 -2.0 -1.6 -1.6 -1.6 -1.6 -1.6 -1.6 -1.6
 [85] -1.6 -1.6 -1.6 -1.6 -1.6 -1.6 -1.6 -1.6 -1.6 -1.6 -1.6 -1.2 -1.2 -1.2
 [99] -1.2 -1.2 -1.2 -1.2 -1.2 -1.2 -0.8 -0.8 -0.8 -0.8 -0.8 -0.8 -0.8 -0.8
[113] -0.8 -0.8 -0.4 -0.4 -0.4 -0.4 -0.4 -0.4 -0.4 -0.4 -0.4  0.0  0.0  0.0
[127]  0.0  0.0  0.0  0.4  0.4  0.4  0.4  0.4  0.4  0.4  0.4  0.4  0.8  0.8
[141]  0.8  0.8  0.8  0.8  0.8  0.8  0.8  0.8  1.2  1.2  1.2  1.2  1.2  1.2
[155]  1.2  1.2  1.2  1.6  1.6  1.6  1.6  1.6  1.6  1.6  1.6  1.6  1.6  1.6
[169]  1.6  1.6  1.6  1.6  1.6  1.6  1.6  2.0  2.0  2.0  2.0  2.0  2.0  2.0
[183]  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.0  2.4  2.4  2.4
[197]  2.4  2.8  2.8  2.8  2.8  3.2  3.2  3.2  3.2  3.6  3.6  3.6  4.0  4.0
[211]  4.0  4.0  4.0  4.0  4.0  4.0  4.0  4.4  4.4  4.4  4.4  4.4  4.4  4.4
[225]  4.4  4.4  4.4  4.4  4.4  4.8  4.8  5.2  5.2  5.2  5.2  5.2  5.6  5.6
[239]  5.6  5.6  5.6  5.6  6.4  6.4  6.4  7.6  7.6  7.6  7.6  7.6  7.6  8.8

sum(DiffMeans >= 7.6)

[1] 7

# Note the following:
obs <- mean(DATA[6:10])
sort(apply(P3, 2, mean))

  [1]  3.8  4.4  4.4  4.4  4.4  4.4  4.4  5.0  5.0  5.0  5.4  5.4  5.4  5.4
 [15]  5.4  5.4  5.6  5.6  5.6  5.6  5.6  5.8  5.8  6.0  6.0  6.0  6.0  6.0
 [29]  6.0  6.0  6.0  6.0  6.0  6.0  6.0  6.2  6.2  6.2  6.2  6.2  6.2  6.2
 [43]  6.2  6.2  6.4  6.4  6.4  6.6  6.6  6.6  6.6  6.8  6.8  6.8  6.8  7.0
 [57]  7.0  7.0  7.0  7.2  7.2  7.2  7.2  7.2  7.2  7.2  7.2  7.2  7.2  7.2
 [71]  7.2  7.2  7.2  7.2  7.2  7.2  7.2  7.4  7.4  7.4  7.4  7.4  7.4  7.4
 [85]  7.4  7.4  7.4  7.4  7.4  7.4  7.4  7.4  7.4  7.4  7.4  7.6  7.6  7.6
 [99]  7.6  7.6  7.6  7.6  7.6  7.6  7.8  7.8  7.8  7.8  7.8  7.8  7.8  7.8
[113]  7.8  7.8  8.0  8.0  8.0  8.0  8.0  8.0  8.0  8.0  8.0  8.2  8.2  8.2
[127]  8.2  8.2  8.2  8.4  8.4  8.4  8.4  8.4  8.4  8.4  8.4  8.4  8.6  8.6
[141]  8.6  8.6  8.6  8.6  8.6  8.6  8.6  8.6  8.8  8.8  8.8  8.8  8.8  8.8
[155]  8.8  8.8  8.8  9.0  9.0  9.0  9.0  9.0  9.0  9.0  9.0  9.0  9.0  9.0
[169]  9.0  9.0  9.0  9.0  9.0  9.0  9.0  9.2  9.2  9.2  9.2  9.2  9.2  9.2
[183]  9.2  9.2  9.2  9.2  9.2  9.2  9.2  9.2  9.2  9.2  9.2  9.4  9.4  9.4
[197]  9.4  9.6  9.6  9.6  9.6  9.8  9.8  9.8  9.8 10.0 10.0 10.0 10.2 10.2
[211] 10.2 10.2 10.2 10.2 10.2 10.2 10.2 10.4 10.4 10.4 10.4 10.4 10.4 10.4
[225] 10.4 10.4 10.4 10.4 10.4 10.6 10.6 10.8 10.8 10.8 10.8 10.8 11.0 11.0
[239] 11.0 11.0 11.0 11.0 11.4 11.4 11.4 12.0 12.0 12.0 12.0 12.0 12.0 12.6

sum(apply(P3, 2, mean) >= obs)

[1] 7

Two-Sided Tests

The direction of the alternative hypothesis is derived from the research hypothesis. In this K11777 study, we enter the study expecting a reduction in worm counts and hoping the data will bear out this expectation. It is our expectation, hope, or interest that drives the alternative hypothesis and the randomization calculation. Occasionally, we enter a study without a firm direction in mind for the alternative, in which case we use a two-sided alternative. Furthermore, even if we hope that the new treatment will be better than the old treatment or better than the control, we might be wrong—it may be that the new treatment is actually worse than the old treatment or even harmful (worse than the control). Some statisticians argue that a conservative objective approach is to always consider the two-sided alternative. For a two-sided test, the p-value must take into account extreme values of the test statistic in either direction (no matter which direction we actually observe in our sample data).

Key Concept: The direction of the alternative hypothesis does not depend on the sample data, but instead is determined by the research hypothesis before the data are collected.

We will now make our definition of the p-value more general for a wider variety of significance testing situations. The p-value is the probability of observing a group difference as extreme as or more extreme than the group difference actually observed in the sample data, assuming that there is nothing creating group differences except the random allocation process.

This definition is consistent with the earlier definition for one-sided alternatives, as we can interpret extreme to mean either greater than or less than, depending on the direction of the alternative hypothesis. But in the two-sided case, extreme encompasses both directions. In the K11777 example, we observed a difference of 7.6 between control and treatment group means. Thus, the two-sided p-value calculation is a count of all instances among the N replications where the randomly allocated mean difference is either as small as or smaller than -7.6 worms (\(\leq -7.6\)) or as great or greater than 7.6 worms (\(\geq 7.6\)). This is often written as \(|\text{diff}| \geq 7.6\).

Activity: A two-sided hypothesis test

Run the simulation study again to fine the empirical p-value for a two-sided hypothesis test to determine if there is a difference between the treatment and control group means for female mice.

ND <- schis[schis$gender=="Female", ]
ND

   gender     group worms
1  Female   Control     7
2  Female   Control    10
3  Female   Control    10
4  Female   Control    16
5  Female   Control    17
6  Female Treatment     1
7  Female Treatment     2
8  Female Treatment     2
9  Female Treatment     7
10 Female Treatment    10

tapply(ND$worms, ND$group, mean)

  Control Treatment 
     12.0       4.4

# OR
ANS1 <- with(data = ND,
             tapply(worms, group, mean)
             )
ANS1

  Control Treatment 
     12.0       4.4

observed <- ANS1[1] - ANS1[2]
observed

Control 
    7.6

names(observed) <- NULL
observed

[1] 7.6

Worms2 <- subset(ND, select = worms, drop = TRUE)
Worms2

 [1]  7 10 10 16 17  1  2  2  7 10

N <- 10^5 - 1         # number of times fo repeat the process
set.seed(13)
result <- numeric(N)  # space to save the random differences
for (i in 1:N){
  # sample of size 5, from 1 to 10, without replacement
  index <- sample(10, size = 5, replace = FALSE)
  result[i] <- mean(Worms2[index]) - mean(Worms2[-index])
}
hist(result, col = "blue", main = "", freq = FALSE, breaks = "Scott")

d.res <- density(result)
plot(d.res, main ="", xlab = "", ylab = "")
polygon(d.res, col ="pink")
xsr <- c(7.6, d.res$x[d.res$x >= 7.6])
ysr <- c(0, d.res$y[d.res$x>=7.6])
xsl <- c(-7.6, d.res$x[d.res$x <= -7.6])
ysl <- c(0, d.res$y[d.res$x <= -7.6])
polygon(xsr, ysr, col = "red")
polygon(xsl, ysl, col = "red")

pvalue <- (sum(result >= observed) + sum(result <= -observed) + 1)/(N + 1) # p-value
pvalue  # results will vary

[1] 0.05649

# ggplot2 approach now
DF <- data.frame(x = result)
p <- ggplot(data = DF) + 
  geom_density(aes(x = x, y = ..density..), fill = 'pink', alpha = 0.4) + 
  theme_bw()
p

x.dens <- density(result)
df.dens <- data.frame(x = x.dens$x, y = x.dens$y)
p + geom_area(data = subset(df.dens, x >= 7.6 & x <= max(DF$x)), aes(x = x, y = y), fill = 'blue', alpha = 0.4) + 
  geom_area(data = subset(df.dens, x <= -7.6 & x >= min(DF$x)), aes(x = x, y = y), fill = 'blue', alpha = 0.4) +
  labs(x = expression(bar(x)[Control] - bar(x)[Treatment]), y = '', title = "Randomization Distribution") + 
  theme(plot.title = element_text(hjust = 0.5))

The empirical p-value for a two-sided test is 0.05649.

Is the number of simulations resulting in a difference greater than or equal to 7.6 identical to the number of simulations resulting in a difference less than or equal to -7.6? Explain why these values are likely to be close but not identical.

sum(result >= observed)

[1] 2819

sum(result <= -observed)

[1] 2829

Explain why you expect the p-value for the two-sided alternative to be about double that for the one sided alternative.

What Can We Conclude from the Schistosomiasis Study?

The key question in this study is whether K11777 will reduce the spread of a common and potentially deadly disease. The result that you calculated from the one-sided randomization hypothesis test should have been close to the exact p-value of 0.0278. This small p-value allows you to reject the null hypothesis and conclude that the worm counts are lower in the female treatment group than in the female control group. In every study, it is important to consider how random allocation and random sampling impact the conclusions.

Random allocation: The schistosomiasis stude was an experiment because the units (female mice) were randomly allocated to treatment or contol groups. To the best of our knowledge this experiment controlled for any outside influences and allows us to state that there is a cause and effect relationship between the treatment and response. Therefore, we can conclude that K11777 did cause a reduction in the average number of schistosome parasites in these female mice.

Random sampling: Mice for this study are typically ordered from a facility that breeds and rasies lab mice. It is possilbe that the mice in this study were biologically related or were exposed to something that caused their response to be different from that of other mice. Similarly, there are risks in simply assuming that male mice have the same response as females. Since our sample of 10 female mice was not selected at random from the population of all mice, we should question whether the results from this study hold for all mice.

More importantly, the results have not shown that this new drug will have the same impact on humans as it does on mice. In addition, even though we found that K11777 does cause a reduction in worm counts, we did not specifically show that it will reduce the spread of the disease. Is the disease less deadly if only two worms are in the body instead of 10? Statistical consultants aren’t typically expected to know the answers to these theoretical, biological, or medical types of questions, but they should ask questions to ensure that the study conclusions match the hypothesis that was tested. In most cases, drug tests require multiple levels of studies to ensure that the drug is safe and to show that the results are consistent across the entire population of interest. While this study is very promising, much more work is needed before we can conclude that K11777 can reduce the spread of schistosomiasis in humans.

Permutation Test versus Randomization Tests

The random allocation of experimental units (e.g. mice) to groups provides the basis for statistical inference in a randomized comparative experiment. In the schistosomiasis K11777 treatment study, we used a significance test to ascertain whether cause and effect was at work. In the context of the random allocation study design, we called our significance test a randomization test. In observational studies, subjects are not randomly allocated to groups. In this context, we apply the same inferential procedures as in the previous experiment, but we commonly call the significance test a permutation test rather than a a randomization test. More importantly, in observational studies, the results of the test cannot typically be used to claim cause and effect; a researcher should exhibit more caution in the interpretation of results.

Key Concept: Wheras in experiments units are randomly allocated to treatment groups, observational studies do no impose a treatment on a unit. Because the random allocation process protects against potential biases caused by extraneous variables, experiments are often used to show causation.

Age Discrimination Study

Westvaco is a company that produces paper products. IN 1991, Robert Martin was working in the engineering department of the company’s envelope division when he was laid off in Round 2 of several rounds of layoffs by the company. He sued the company, claiming to be a victim of age discrimination. The ages of the 10 workers involved in Round 2 were: 25, 33, 35, 38, 48, 55, 55, 56, and 64. The ages of the three people laid off were 55, 55, and 64.

ages <- c(25, 33, 35, 38, 48, 55, 55, 55, 56, 64)
status <- c(rep("Job", 6), rep("LaidOff", 2), "Job", "LaidOff")
west <- data.frame(ages, status)
rm(ages, status)
west

   ages  status
1    25     Job
2    33     Job
3    35     Job
4    38     Job
5    48     Job
6    55     Job
7    55 LaidOff
8    55 LaidOff
9    56     Job
10   64 LaidOff

require(ggplot2)
p <- ggplot(data = west, aes(status, ages)) + 
  geom_point(position = "jitter", aes(color = status), size = 5) + theme_bw()
p

* Conduct a permutation test to determine whether the observed difference between means is likely to occur just by chance. Here we are interested in only a one-sided test to determine if the mean age of people who were laid off is higher than the mean age of people who were not laid off.

OBS <- with(data = west,
tapply(ages, status, mean))
OBS

     Job  LaidOff 
41.42857 58.00000

obsDiff <- OBS[2] - OBS[1]
names(obsDiff) <- NULL
obsDiff

[1] 16.57143

Ages <- subset(west, select = ages, drop = TRUE)
Ages

 [1] 25 33 35 38 48 55 55 55 56 64

set.seed(12)
N <- 10^5 - 1         # number of times fo repeat the process
result <- numeric(N)  # space to save the random differences
for (i in 1:N){
  # sample of size 3, from 1 to 10, without replacement
  index <- sample(10, size = 3, replace = FALSE)
  result[i] <- mean(Ages[index]) - mean(Ages[-index])
}
hist(result, col = "blue", main = "", xlim =c(-25, 25), breaks = "Scott")

d.res <- density(result)
plot(d.res, main ="", xlab = "", ylab = "", xlim =c(-25, 25))
polygon(d.res, col ="pink")
xs <- c(obsDiff, d.res$x[d.res$x >= obsDiff])
ys <- c(0, d.res$y[d.res$x >= obsDiff])
polygon(xs, ys, col = "red")

# ggplot2 approach now
DF <- data.frame(x = result)
p <- ggplot(data = DF) + 
  geom_density(aes(x = x, y = ..density..), fill = 'pink', alpha = 0.4) + 
  theme_bw()
p

x.dens <- density(result)
df.dens <- data.frame(x = x.dens$x, y = x.dens$y)
p + geom_area(data = subset(df.dens, x >= obsDiff & x <= max(DF$x)), aes(x = x, y = y), fill = 'blue', alpha = 0.4) + 
  labs(x = expression(bar(x)[Job] - bar(x)[LaidOff]), y = '', title = "Randomization Distribution") + 
  theme(plot.title = element_text(hjust = 0.5)) # Center the title

pvalue <- (sum(result >= obsDiff) + 1)/(N + 1)  # p-value
pvalue  # results will vary

[1] 0.05005

The p-value is 0.05005.

Modify the code used in the previous problem to test if the median age of people who were laid off is higher than the median age of people who were not laid off. Report the p-value and compare your results to those in the previous question.

OBSM <- with(data = west,
tapply(ages, status, median))
OBSM

    Job LaidOff 
     38      55

obsDiffMedian <- OBSM[2] - OBSM[1]
names(obsDiffMedian) <- NULL
obsDiffMedian

[1] 17

Ages <- subset(west, select = ages, drop = TRUE)
Ages

 [1] 25 33 35 38 48 55 55 55 56 64

N <- 10^5 - 1         # number of times fo repeat the process
MedianDiff <- numeric(N)  # space to save the random differences
set.seed(11)
for (i in 1:N){
  # sample of size 3, from 1 to 10, without replacement
  index <- sample(10, size = 3, replace = FALSE)
  MedianDiff[i] <- median(Ages[index]) - median(Ages[-index])
}
hist(MedianDiff, col = "blue", main = "", breaks = "Scott")

# ggplot2 approach now
DF <- data.frame(x = MedianDiff)
p <- ggplot(data = DF) + 
  geom_density(aes(x = x, y = ..density..), fill = 'pink', alpha = 0.4) + 
  theme_bw()
p

x.dens <- density(MedianDiff)
df.dens <- data.frame(x = x.dens$x, y = x.dens$y)
p + geom_area(data = subset(df.dens, x >= obsDiffMedian & x <= max(DF$x)), aes(x = x, y = y), fill = 'blue', alpha = 0.4) + 
  labs(x = expression(tilde(x)[Job] - tilde(x)[LaidOff]), y = '', title = "Randomization Distribution") + 
  theme(plot.title = element_text(hjust = 0.5))  # Center title

#######
pvalueMED <- (sum(MedianDiff >= obsDiffMedian) + 1)/(N + 1) # p-value
pvalueMED  # results will vary

[1] 0.16754

The p-value is 0.16754.

Since there was no random allocation (i.e., people were not randomly assigned to layoff group), statistical significance does not give us the right to assert that greater age is causing a difference in being laid off. We “imagine” an experiment in which workers are randomly allocated to a layoff group and then determine if the observed average difference between the ages of laid-off workers and those not laid off is significantly larger than would be expected to occur by chance is a randomized comparative experiment.

While age could be the cause for the difference —hence proving the allegation of age discrimination—there are many other possibilities (i.e., extraneous variables), such as educational levels of the workers, their competence to do the job, and ratings on past performance evaluations. Rejecting the null hypothesis in a nonrandomized context can be a useful step toward establishing causality; however, it cannot establish causality unless the extraneous variables have been properly accounted for.

In the actual court case, data from all three rounds of layoffs were statistically analyzed. The analysis showed some evidence that older people were more likely to be laid off; however, Robert Martin ended up settling out of court.