Goodness of fit test for a categorical variable

In this lab, we are going to practice summarizing, visualizing, and analyzing categorical data using the chisq.test function. We will start with one-way tables and then move on to two-way tables.

Are the tulip colors in a field equally distributed? Consider the tulip data below from a random sample of tulips in the field:

Color Count
Red 123
White 102
Yellow 81

Let's summarize and visualize the data.

counts <- c(123, 102, 81)
n <- sum(counts)
# Sample proportions:
counts/n
barplot(counts/n, names.arg = c("Red","White","Yellow"), ylab="Proportion",
main = "Observed Distribution of Tulip Colors")

It seems that Red is the most common color, occurring about 40% of the time, and Yellow the least common at about 26%. Does this mean tulips in the entire field are not equally distributed, or did this just happen by chance? Let's test the hypotheses

H0: pred = 1/3, pwhite = 1/3, pyellow = 1/3

versus

Ha: At least one probability differs.

First, do the test calculations "by hand":

# Expected counts
expected <- n*c(1/3, 1/3, 1/3)
expected
# Chi-squared test statistic
stat <- sum( (counts - expected)^2/expected )
stat
# p-value
pchisq(stat, df = 2, lower.tail=FALSE)

With a p-value of 0.0132, there is fairly strong evidence that the tulip colors are not equally distributed (do not occur in the same proportions) across the entire field. Specifically, how do the observed counts differ from what we'd expect to see under the null?

residuals <- (counts - expected)/sqrt(expected)
residuals

We can also plot the expected counts versus the observed counts in a bar plot.

barplot(rbind(counts, expected), names.arg = c("Red","White","Yellow"), beside=TRUE,
ylab="Count", legend.text=c("Observed","Expected"), col=c(3,5),
main="Observed versus Expected Counts")

There were more Red tulips than we would expect, and fewer Yellow tulips than we would expect, if the colors were equally distributed (the White count happens to match its expected count exactly). Note that the squared residuals add up to the chi-squared test statistic.
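
As a quick check of that last claim, we can sum the squared residuals and compare to the test statistic computed earlier:

sum(residuals^2)   # should match the chi-squared statistic stored in stat
stat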

Note: The R function "rbind" stands for "row bind". It creates a matrix in which each argument becomes one row. When you give the "barplot" function a matrix, it draws one group of bars for each column; with beside=TRUE, the rows within a column are shown side by side.
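
If you want to see the matrix that barplot receives here, you can print it directly (using the counts and expected vectors created above):

rbind(counts, expected)   # 2 x 3 matrix: row 1 = observed counts, row 2 = expected counts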

Exercise 1

Based on the tulip data above, perform a goodness-of-fit test to determine if the distribution pred = .40, pwhite = .35, pyellow = .25 is a reasonable fit to the data.

We can use the chisq.test function to check our calculations above:

chisq.test(counts, p=c(1/3,1/3,1/3), correct=FALSE)
chisq.test(counts, p=c(.4,.35,.25), correct=FALSE)
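
The chisq.test function returns more than its printed summary. Saving the result in an object (here called gof, just a name chosen for this sketch) lets us pull out the quantities we computed by hand above:

gof <- chisq.test(counts, p = c(1/3, 1/3, 1/3), correct = FALSE)
gof$statistic    # chi-squared test statistic
gof$p.value      # p-value
gof$expected     # expected counts under H0
gof$residuals    # (observed - expected)/sqrt(expected)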

Goodness of fit test for a discrete quantitative variable

The Poisson distribution is a discrete probability distribution that models the number of events occurring in a fixed interval of time. It can be thought of as an approximation to the binomial distribution when the number of independent trials (n) is large and the probability of an event (p) is small. In other words, it models a scenario where we are counting the number of rare events in an interval of time. The R functions are:

dpois(k, mu)    # Poisson probability at k: Pr(X = k)
ppois(k, mu)    # Poisson CDF at k: Pr(X <= k)
rpois(n, mu)    # generate n Poisson(mu) random variables

In the above expressions, mu is the mean of the distribution, i.e., the expected number of events (R's name for this argument is lambda). It is a very important probability model, often useful when looking at counts of events like deaths per year, phone calls per minute, etc. The earliest known application is a dataset on deaths by horse kick in Prussian Army cavalry corps (Bortkiewicz, 1898).
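
Before turning to that dataset, here is a quick numerical sketch of the binomial approximation mentioned above (the values n = 1000 and p = 0.002 are arbitrary choices for illustration):

dbinom(0:4, size = 1000, prob = 0.002)   # exact binomial probabilities for 0-4 events
dpois(0:4, 2)                            # Poisson approximation with mu = 1000*0.002 = 2
ppois(1, 2)                              # Pr(X <= 1) under the Poisson model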

Deaths <- c(0,1,2,3,4,5)
Count <- c(109,65,22,3,1,0)
weighted.mean(Deaths, Count)

"Deaths" correspond to the outcomes for a single unit, "Count" is the number of cavalry units corresponding to each outcome (109 had 0 deaths, 65 had 1 death, etc.). Using the weighted.mean function is equivalent to computing the average of 109 0's, 65 1's, 22 2's, etc. You can verify this by creating the full dataset:

D1 <- rep(Deaths, Count)
mean(D1)

Exercise 2

Using the Bortkiewicz dataset, 

  1. Compare the observed data to the theoretical frequencies predicted by the Poisson distribution, using the observed sample mean (.61) as the Poisson parameter. The total count is 200 (the number of cavalry units), so multiply the Poisson probabilities by 200 to get the expected counts. Hint: A Poisson random variable can take on any value 0, 1, 2, 3, ...; thus, to get the expected count for the 5-deaths group, use 1-ppois(4,0.61). (A starter sketch of this calculation appears after the list.)
  2. Create a barplot comparing the observed and expected counts from 1.
  3. Carry out all steps of a chi-squared goodness of fit test to determine if a Poisson distribution is a reasonable model for these data: state the hypotheses, check assumptions, calculate the test statistic, calculate the p-value, state your decision, write a conclusion.
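
As a starting point for part 1, here is a sketch of the expected-count calculation described in the hint (the object names probs.pois and expected.pois are just labels chosen here):

probs.pois <- c(dpois(0:4, 0.61), 1 - ppois(4, 0.61))   # Pr(0), ..., Pr(4), Pr(5 or more)
expected.pois <- 200*probs.pois                          # expected counts for the 200 cavalry units
expected.pois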

Chi-squared tests of independence for two categorical variables

Returning to the tulip example, suppose we wanted to know whether the distribution of tulip colors differed across three fields, A, B, and C. We already looked at data from field A. Here is a summary of the entire data set:

        Field
Color     A    B    C
Red     123  240  100
White   102  200   35
Yellow   81  163   47

Let's first visualize the data.

red <- c(123,240,100)
white <- c(102,200,35)
yellow <- c(81,163,47)
tulips <- rbind(red, white, yellow)
tulips
barplot(tulips, names.arg=c("A","B","C"), col=c("red","white","yellow"),
ylab="Frequency", xlab="Field", beside=TRUE,
main="Number of tulips by color and field")

Note that a barplot of counts is not really a fair comparison since the sample sizes differ across fields. Instead, we'd like to look at the conditional proportions within fields:

# Number of tulips collected in each field
colSums(tulips)
tulips.prop <- rbind(red/colSums(tulips), white/colSums(tulips), yellow/colSums(tulips))
barplot(tulips.prop, names.arg=c("A","B","C"), col=c("red","white","yellow"),
ylab="Conditional Proportions", xlab="Field", beside=FALSE,
main="Distribution of Colors by Field")

The tulip color distribution seems very similar between fields A and B, but field C appears to have a larger proportion of red tulips and a smaller proportion of white tulips. Is this due to chance, or do the distributions truly differ?

H0: Field and tulip color are independent.

Ha: Field and tulip color are dependent.

The chisq.test function will do our test calculations for us. 

chisq.test(tulips, correct=FALSE)
chisq.test(tulips, correct=FALSE)$observed
chisq.test(tulips, correct=FALSE)$expected

With a small p-value (about 0.001), these data provide strong evidence that the distribution of tulip color depends on the field.
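
To see which cells drive this result, we can also look at the residuals stored in the chisq.test output, just as in the one-way example:

chisq.test(tulips, correct=FALSE)$residuals   # (observed - expected)/sqrt(expected) for each cell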

Exercise 3

Calculate the expected counts, the chi-squared test statistic, and the p-value for the previous example without using the chisq.test function.