Randomization Tests in R

A sampling distribution describes the behavior of a sample statistic across many, many samples from the population or random process. If we had the entire population, we could generate sampling distributions through simulation. In practice, however, we don't have the population (otherwise, why would we conduct inference at all?). Thus, we must either use mathematical probability results for sampling distributions, or use a chance model to simulate a sampling distribution under some null hypothesis.

Though online applets, e.g., StatKey or the Rossman/Chance ISI applets, can perform randomization and simulation-based hypothesis tests for us, we would like to program our own randomization/simulation-based tests in R. Doing so will also give us a deeper understanding of randomization/simulation-based inference methods.

Consider the following case study from class:

Swimming with dolphins can certainly be fun, but is it also therapeutic for patients suffering from clinical depression? To investigate this possibility, researchers recruited 30 subjects aged 18-65 with a clinical diagnosis of mild to moderate depression. Subjects were required to discontinue use of any antidepressant drugs or psychotherapy four weeks prior to the experiment, and throughout the experiment. These 30 subjects went to an island off the coast of Honduras, where they were randomly assigned to one of two treatment groups. Both groups engaged in the same amount of swimming and snorkeling each day, but one group (the animal care program) did so in the presence of bottlenose dolphins and the other group (the outdoor nature program) did not. At the end of two weeks, each subject's level of depression was evaluated, as it had been at the beginning of the study, and it was determined whether they showed substantial improvement (a reduction in their level of depression) by the end of the study (Antonioli and Reveley, 2005).

Observed data: The researchers found that 10 of 15 subjects in the dolphin therapy group showed substantial improvement, compared to 3 of 15 subjects in the control group.

The null hypothesis in this study is that swimming with dolphins does not improve depression symptoms in the population of people aged 18-65 with a clinical diagnosis of mild to moderate depression. Under this assumption, the 10 + 3 = 13 subjects who showed improvement would have improved under either condition, and the other 17 subjects would have failed to improve regardless of treatment. Thus, let's simulate what would happen if we randomly assigned 13 "improvers" and 17 "non-improvers" to two groups of 15:

subjects <- c(rep("improver",13), rep("nonimprover",17))
dolphin <- sample(subjects, 15, replace = FALSE)
table(dolphin)
# The number of "improvers" that ended up in the dolphin group due to random chance:
sum(dolphin == "improver")

Now, let's use a for loop to repeat this simulation many times and keep track of the number of "improvers" that ended up in the dolphin group each time:

counts <- NULL
for(i in 1:10000){
  dolphin <- sample(subjects, 15, replace = FALSE)
  counts[i] <- sum(dolphin == "improver")
}
table(counts)
# counts can range from 0 to 13, so the breaks must span -0.5 to 13.5
hist(counts, breaks = (0:14) - 0.5, col = "turquoise",
     xlab = "Number of improvers in dolphin group",
     main = "Distribution of number of improvers in dolphin group")
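
As an aside, the same simulation can be written more compactly with R's replicate() function. A sketch, equivalent to the for loop above:

```r
# Equivalent simulation using replicate() instead of a for loop
subjects <- c(rep("improver", 13), rep("nonimprover", 17))
counts <- replicate(10000, sum(sample(subjects, 15) == "improver"))
table(counts)
```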

Practice Questions

  1. Use the simulated counts to calculate a p-value, i.e., the proportion of the time we would see data like our observed data (10 of 15 improving in the dolphin group) or more extreme (more than 10 improving in the dolphin group), assuming that dolphins have no effect on depression symptoms. How does this p-value help you address the research question? What is your conclusion?
  2. Perform the same simulation, but instead of tracking the number of improvers in the dolphin group as your statistic, track the difference in proportion of improvers between the two groups. Do you get the same p-value?
  3. Try coding a simulation test for one of the other two cases in the case study handout from class.

R Functions for Inference

In Lab 1, we learned that the form of an R function is:

function.name( arg.name=value, ... )

R has many built-in functions, including ones that conduct statistical inference (hypothesis tests and confidence intervals) on a given data set. In this lab, we'll explore two R functions for inference:

binom.test
prop.test

The first function conducts exact inference using a binomial distribution, whereas the second function uses the normal approximation to conduct inference. To see the arguments to these functions, look at their help files.

?binom.test
?prop.test

Binomial Inference: Hypothesis Testing

Charles Darwin did an experiment comparing the heights of cross- and self-pollinated plants (Zea mays, or corn, to be precise; see "The Effects of Cross and Self Fertilisation in the Vegetable Kingdom", 1876). Enter the data into your R session by copying and pasting the following:

Darwin <- structure(
      list(Pot = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4),
       Crossed = c(23.5, 12, 21, 22, 19.125, 21.5, 22.125, 20.375, 
                   18.25, 21.625, 23.25, 21, 22.125, 23, 12), 
          Self = c(17.375, 20.375, 20, 20, 18.375, 18.625, 18.625, 
                   15.25, 16.5, 18, 16.25, 18, 12.75, 15.5, 18)), 
         .Names = c("Pot", "Crossed", "Self"), 
          class = "data.frame", 
       row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", 
                      "11", "12", "13", "14", "15"))

Remember - to look at an object just enter its name: Darwin. The resulting data table should look like this:

       Pot Crossed   Self 
     1   1  23.500 17.375
     2   1  12.000 20.375
     3   1  21.000 20.000
     4   2  22.000 20.000
     5   2  19.125 18.375
     6   2  21.500 18.625
     7   3  22.125 18.625
     8   3  20.375 15.250
     9   3  18.250 16.500
    10   3  21.625 18.000
    11   3  23.250 16.250
    12   4  21.000 18.000
    13   4  22.125 12.750
    14   4  23.000 15.500
    15   4  12.000 18.000

Darwin was interested in demonstrating that cross-pollinated plants were healthier than self-pollinated. The standard approach to questions like this one is to frame it as a hypothetical: if there were really no difference between the two groups, what would we expect to see?

To translate this into a question about the binomial distribution, start by converting the pairs of values above to Bernoulli trials. Call it a success (1) if the crossed plant is taller than the self-pollinated plant with which it is paired, and a failure (0) otherwise.

attach(Darwin)
X <- sum(Crossed > Self)

You should have 13 successes out of the 15 trials. Our hypothetical scenario, known as the null hypothesis, corresponds to the claim that the population proportion of 1's is 1/2. In other words, the cross-pollinated and self-pollinated plants are equally likely to be the taller of the two.

Assuming independence, we have a set of 15 Bernoulli trials, of which 13 were successes.

p-values

The p-value is an attempt to directly measure the improbability of an observed result, under the model of the null hypothesis. With Darwin's data, we could compute the probability of seeing exactly 13 successes in 15 trials if self-pollinated and cross-pollinated plants are equally likely to be taller:

dbinom(13,15,1/2)

However, for a large number of trials, even the most likely outcome can be quite improbable. For example, try computing the probability of tossing exactly 5000 heads in 10000 tosses of a fair coin.
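
A quick check of that claim: 5000 heads is the single most likely outcome, yet its probability is well under 1 percent.

```r
# Probability of tossing exactly 5000 heads in 10000 tosses of a fair coin
dbinom(5000, 10000, 1/2)   # about 0.008
```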

What we really want here is the probability of a result at least as unlikely as the one we observed. For a symmetric distribution, this is the set of outcomes at least as far from the expected count as the one we got. The expected value for n=15 and p=1/2 is 7.5, so we want the probability of seeing a result at least as far from 7.5 as 13 is. That includes 13, 14, and 15 in one direction, and 2, 1, and 0 in the other direction.

Note: In general, a result "at least as unlikely as the one we observed" should be defined by its probability under H0, not by its distance from the expected value. In a symmetric distribution these are the same, but in a skewed distribution they may differ! Can you see how to compute the p-value directly from the definition? For our example with n=15, p=1/2, and X=13, try

p <- dbinom(0:15,15,1/2)
sum(p[p <= dbinom(13,15,1/2)])

Here is a plot of the distribution with a line showing the probability of our observed result:

barplot(dbinom(0:15,15,1/2), names.arg=c(0:15), space=0)
abline(h = dbinom(13,15,1/2))  

Note: abline(a, b) adds a line with intercept a and slope b, while abline(h = y) (used above) adds a horizontal line at height y.

The desired calculation is easy to perform in R: just add up the binomial probabilities corresponding to those outcomes:

sum(dbinom(c(0:2,13:15),15,1/2))

Alternatively, we could use the pbinom function for this computation:

pbinom(2,15,1/2) + (1 - pbinom(12,15,1/2))

Either way, you should get the same answer: 0.00738..., less than a 1 percent chance. This quantity is called a p-value, which is shorthand for "the probability of seeing a result at least as unlikely as the one observed, if the null hypothesis were really true".

WARNING: the p-value is NOT "the probability that the null hypothesis is correct". That is not, in general, computable without more information: specifically, an initial probability (prior) for the truth of the null hypothesis, to be updated by Bayes' rule.

Rejection regions and significance level

Rather than calculating a p-value, the classical hypothesis test starts with construction of a rejection region - a set of possible values of the statistic for which we would reject the null hypothesis. The size of the rejection region is determined by the significance level, the probability of getting an outcome in the rejection region, computed under the null hypothesis.

Suppose we wish to choose a rejection region of size no bigger than alpha=0.05. To see which values to include, take a look at the binomial probabilities, rounded to 5 decimal places for ease of reading:

round(dbinom(0:15,15,1/2),5)

Let's try using 0:3 and 12:15 for our rejection region: R={0,1,2,3,12,13,14,15}. To compute the effective significance level for this rejection region, create a vector containing the values in the rejection region, and then sum the dbinom values for those outcomes:

R <- c(0:3,12:15)
sum(dbinom(R,15,1/2))

The significance level for this rejection region is 0.035. You should verify that if we make the rejection region any larger, then alpha will be bigger than 0.05.
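
One way to carry out that verification (a quick sketch): enlarge the region by the next outcome on each side and recompute alpha.

```r
# Adding 4 and 11 to the rejection region pushes alpha well past 0.05
sum(dbinom(c(0:4, 11:15), 15, 1/2))   # about 0.118
# Even adding just one of them is already too big
sum(dbinom(c(0:4, 12:15), 15, 1/2))   # about 0.077
```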

Now, we look at the observed value, 13, and note that it is in the rejection region. The conclusion: we reject the null hypothesis p=1/2 at the 0.035 significance level. Either p is not 1/2 or we were unlucky.

Power

It is easy to compute the power (probability of rejecting the null hypothesis, given an alternative value for p). We just need to sum the binomial probabilities for the rejection region, using the alternative value of p. Suppose that we were interested in the alternative "p=2/3":

sum(dbinom(R,15,2/3))

It is often useful to compute the power for many values of "p" at once, and plot against "p". Here is a function that will do that for a given rejection region RR or a given significance level alpha:

bin.pwr <- function(n, p0 = 1/2, RR = NULL, alpha = .05, COL = "red")
{
  # plot power vs p for H0: p = p0, default p0 = 1/2
  # RR, if present, is the chosen rejection region
  # alpha is the target significance level

  if(is.null(RR)){
    # build the rejection region from the least likely outcomes under H0,
    # adding outcomes until the total probability would exceed alpha
    X <- 0:n
    db <- dbinom(X, n, p0)
    sdb <- sort(db)
    cb <- cumsum(sdb)
    m <- max((1:(n + 1))[cb <= alpha])
    R <- X[db <= sdb[m]]
  }
  else{
    R <- RR
  }
  Alpha <- sum(dbinom(R, n, p0))     # the effective significance level
  p <- seq(.01, .99, .01)
  P <- p
  for(i in seq_along(p)){
    P[i] <- sum(dbinom(R, n, p[i])) # power at the alternative p[i]
  }
  plot(p, P, ylab = "Power", type = "l", col = COL)
  title(paste("Power curve for n=", n, "   alpha =", round(Alpha, 4)))
}

The type="l" in the plot command is an "ell", not a "one". It tells the plot function to make a line plot. To use the function with a specified rejection region, create a vector R with the values defining the rejection region, and enter the sample size n, for example:

R <- c(0:3,12:15)
bin.pwr(15,p0=1/2,RR=R)

The minimum point on this curve is the value at p=1/2, the null hypothesis. The power increases for alternatives farther away from 1/2. Does that make sense to you?
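
To see this numerically, evaluate the power at a few alternatives; a quick sketch using the same rejection region as above:

```r
# Power grows as the alternative p moves away from the null value 1/2
R <- c(0:3, 12:15)
sum(dbinom(R, 15, 0.5))   # the significance level, about 0.035
sum(dbinom(R, 15, 0.7))   # about 0.30
sum(dbinom(R, 15, 0.9))   # about 0.94
```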

You may also use the function with a specified target significance level without specifying a rejection region:

bin.pwr(15,p0=1/2,alpha=.05)

Try it with some different sample sizes and rejection regions:

par(mfrow=c(2,2))
R10 <- c(0:1,9:10)
bin.pwr(10,RR=R10)
R20 <- c(0:5,15:20)
bin.pwr(20,RR=R20)
R40 <- c(0:13,27:40)
bin.pwr(40,RR=R40)
R80 <- c(0:30,50:80)
bin.pwr(80,RR=R80)

How does the power curve change as the sample size gets larger? (We haven't held alpha exactly fixed, but all were close to .03)

Testing other null hypotheses

Finally, we could test some other hypothesis here, such as p=3/4 instead of p=1/2. The procedure is the same; the only change is that the calculations are now based on the assumption that p=3/4. What should we choose for the rejection region?

# Binomial(15,3/4) probabilities
cbind(0:15, round(dbinom(0:15,15,3/4),5))
      [,1]    [,2]
 [1,]    0 0.00000
 [2,]    1 0.00000
 [3,]    2 0.00000
 [4,]    3 0.00001
 [5,]    4 0.00010
 [6,]    5 0.00068
 [7,]    6 0.00340
 [8,]    7 0.01311
 [9,]    8 0.03932
[10,]    9 0.09175
[11,]   10 0.16515
[12,]   11 0.22520
[13,]   12 0.22520
[14,]   13 0.15591
[15,]   14 0.06682
[16,]   15 0.01336

barplot(dbinom(0:15,15,3/4),names.arg=c(0:15))
abline(h = dbinom(7,15,3/4))

The tails of the distribution are no longer the same: we need to choose a rejection region asymmetrically. For example, to achieve a significance level no bigger than .05, we could include in the rejection region {0:7, 15}. Verify that the significance level (the sum of the probabilities of outcomes in the rejection region, computed under the null hypothesis) is about .03. If we included 8 in the rejection region, the significance level would exceed .05; and if we let {0:8} be the rejection region, we would be including an outcome (8) with greater probability than the one we left out (15), which would not make sense. Outcomes with smaller probability must be stronger evidence against the null hypothesis!
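
A sketch of that verification:

```r
# Significance level for the asymmetric rejection region {0:7, 15} under p = 3/4
sum(dbinom(c(0:7, 15), 15, 3/4))   # about 0.031
# Including 8 as well would push the significance level past 0.05
sum(dbinom(c(0:8, 15), 15, 3/4))   # about 0.070
```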

Question: Back to Darwin's data: can we reject the null hypothesis that p=3/4? Verify that the p-value for testing the null hypothesis that p=3/4 is about .38.  

The binom.test() function

The R binom.test() function does the computations discussed above. The only disadvantage of using it is that you don't see explicitly what the rejection region is. Read the help file:

?binom.test

Try it with Darwin's data:

binom.test(13,15,p=1/2)

The prop.test() function

The R prop.test() function carries out a hypothesis test and a confidence interval for a proportion using the normal approximation to a sample proportion. Note that this approximation is only valid when the numbers of successes and failures are both large, i.e., both np and n(1-p) should be sufficiently large. Since this is only an approximation, for a single proportion there is little reason to prefer it over the exact inference produced by binom.test().

Try it with Darwin's data:

prop.test(13,15,p=1/2)

How do the results from using a normal approximation to the p-value compare to the exact p-value computed by binom.test()?

Comparing two proportions

The dolphin study described at the beginning of this lab is an example of a study that compares two proportions: the proportion that showed substantial improvement in depression symptoms for the dolphin group and for the control group. In this case, binom.test() cannot be used. Instead, we can use fisher.test() or prop.test().

An exact p-value can be computed using Fisher's Exact test:

dolphin.table <- matrix(c(10, 5, 3, 12), nrow=2, ncol=2, byrow = TRUE, 
dimnames = list(c("Dolphin","Control"), c("Improved","Not Improved")))
# Check to make sure table was created correctly:
dolphin.table

# Run Fisher's Exact Test:
fisher.test(dolphin.table)

Note that the statistic reported for Fisher's Exact Test is the odds ratio, (p1/(1-p1))/(p2/(1-p2)), not the difference in proportions, p1 - p2. Fisher's Exact Test computes the p-value from a Hypergeometric distribution (the exact distribution that the count in the first cell has under the null hypothesis).
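
For comparison, the raw sample odds ratio can be computed directly from the observed proportions. (Note that fisher.test() reports a conditional maximum-likelihood estimate of the odds ratio, which will differ slightly from this raw value.)

```r
# Raw sample odds ratio for the dolphin study
p1 <- 10/15   # proportion improved, dolphin group
p2 <- 3/15    # proportion improved, control group
(p1/(1 - p1)) / (p2/(1 - p2))   # = (10*12)/(5*3) = 8
```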


More commonly used (even though a randomization test or Fisher's Exact Test is more exact) is the normal approximation to a sample difference in proportions:

prop.test(dolphin.table)

Instead of reporting a z-statistic, this test reports a chi-squared statistic; without the continuity correction (which prop.test() applies by default), the chi-squared statistic equals the square of the z-statistic.
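
You can check the equivalence between the z- and chi-squared statistics directly; a sketch (the continuity correction, on by default in prop.test(), must be turned off for the two to match exactly):

```r
# Pooled two-sample z-statistic for the dolphin table
dolphin.table <- matrix(c(10, 5, 3, 12), nrow = 2, byrow = TRUE)
p.pool <- (10 + 3)/30                  # pooled proportion under H0
z <- (10/15 - 3/15) / sqrt(p.pool * (1 - p.pool) * (1/15 + 1/15))
z^2                                    # matches the chi-squared statistic below
prop.test(dolphin.table, correct = FALSE)$statistic
```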


This lab was adapted from Albyn Jones' Math 141 Lab Notes, Reed College.