The Situation:
- We will be considering problems involving count data that are
classified according to two variables (say, A and B ) and are
summarized in a two-way frequency table.
- Each variable will have 2 or more non-overlapping categories and each
response must fall into one of the categories.
- The rows of the two-way table correspond to the categories for variable
A and the columns of the two-way table correspond to the categories for
variable B .
- Each pairing of a row category with a column category defines a cell in the
table. If there are $R$ rows and $C$ columns then there are $R \times C$
cells in the table.
Example
- Failures of an electronic device are classified by the manufacturer
by type of failure and location of failure. There are two types of
failure: mechanical and electronic. There are three primary
locations within the device where failures occur, which we will denote
as locations 1, 2, and 3. Thus,
- Variable A is "type of failure".
- Variable B is "location of failure".
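- The data (once collected) can be summarized by filling in a
two-way table with the observed counts. For the electronic device
example, the data are displayed:

                   Location of Failure
Type of Failure      1      2      3    Total
Mechanical          50     16     31       97
Electronic          61     26     16      103
Total              111     42     47      200
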
- There are $R=2$ rows and $C=3$ columns, so there
are $R \times C = 2 \times 3 = 6$ cells in this table.
- The 6 cells in the electronic device example can be represented as a
pair (Type of Failure, Location of Failure). The 6 cells are
(Mechanical,1), (Mechanical,2), (Mechanical,3), (Electronic,1),
(Electronic,2), (Electronic,3).
- Also shown are marginal totals. Thus there were 50+16+31 =
97 mechanical failures and there were 16+26=42 failures (both
mechanical and electronic) at location 2.
- The counts alone can give a misleading picture of the
relationship between type of failure and location of failure. For
example, we see that there were 50 mechanical failures at location
1 and only 31 at location 3, but (50/111)100=45.0 % of the
failures at location 1 were mechanical while (31/47)100=66.0 % of
the failures at location 3 were mechanical. In general, percentages
give a more accurate picture of the relationship.
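The totals and percentages quoted above can be reproduced with a few lines of Python. This is only a sketch: the variable names are illustrative, and the (Electronic, 3) count of 16 is not quoted directly in the text but follows from the stated totals.

```python
# Observed counts from the electronic device example.
# Rows: type of failure; columns: locations 1, 2, 3.
counts = {
    "Mechanical": [50, 16, 31],
    "Electronic": [61, 26, 16],  # the 16 at location 3 is inferred from the totals
}

# Marginal totals: sum across each row, and down each column.
row_totals = {kind: sum(row) for kind, row in counts.items()}
col_totals = [sum(col) for col in zip(*counts.values())]
grand_total = sum(row_totals.values())
n_cells = len(counts) * len(col_totals)  # R x C

# Percentage of the failures at each location that were mechanical.
pct_mechanical = [
    round(100 * m / total, 1)
    for m, total in zip(counts["Mechanical"], col_totals)
]

print(row_totals)            # {'Mechanical': 97, 'Electronic': 103}
print(col_totals)            # [111, 42, 47]
print(grand_total, n_cells)  # 200 6
print(pct_mechanical)        # [45.0, 38.1, 66.0]
```

Location 1 has the most mechanical failures by count, yet location 3 has the highest percentage of mechanical failures, which is the point made above.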
Marginal Distributions
- The marginal distribution of the row variable summarizes the
percentage of items having each of the possible values of that
variable.
- To compute the marginal row distribution, compute the percentage
of each row based on the grand total (the total sample size).
- The marginal row distribution for Type of Failure in the
above example is

Type of Failure    Percent
Mechanical          48.5 %
Electronic          51.5 %

- The percentages were computed as shown below.
- Percent of Mechanical Failures: (97/200)100=48.5 %.
- Percent of Electronic Failures: (103/200)100=51.5 %.
- The marginal distribution of the column variable
summarizes the percentage of items having each of the possible
values of that variable.
- To compute the marginal column distribution, compute the
percentage of each column based on the grand total (the total sample
size).
- The marginal column distribution for Location of Failure in
the above example is

Location of Failure    Percent
1                       55.5 %
2                       21.0 %
3                       23.5 %

- The percentages were computed as shown below.
- Percent of failures at Location 1: (111/200)100=55.5 %.
- Percent of failures at Location 2: (42/200)100=21.0 %.
- Percent of failures at Location 3: (47/200)100=23.5 %.
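As a rough check on the arithmetic, both marginal distributions can be computed from the observed counts. The list-of-lists layout and the names below are assumptions made for illustration.

```python
# Observed counts: rows are Mechanical, Electronic; columns are locations 1-3.
counts = [
    [50, 16, 31],
    [61, 26, 16],
]

n = sum(sum(row) for row in counts)  # grand total

# Marginal distribution of the row variable: row totals over the grand total.
row_marginal = [round(100 * sum(row) / n, 1) for row in counts]

# Marginal distribution of the column variable: column totals over the grand total.
col_marginal = [round(100 * sum(col) / n, 1) for col in zip(*counts)]

print(row_marginal)  # [48.5, 51.5]
print(col_marginal)  # [55.5, 21.0, 23.5]
```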
Conditional Distributions
- The conditional distribution of the row variable, given the
column variable, is found by expressing the counts in the
frequency table as percentages of the column totals.
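- The distribution of Type of Failure given Location of Failure
is shown below (each column sums to 100 %).

                   Location of Failure
Type of Failure        1        2        3
Mechanical         45.0 %   38.1 %   66.0 %
Electronic         55.0 %   61.9 %   34.0 %
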
- The percentages were computed as follows.
- Percent of failures at Location 1 that were mechanical:
(50/111)100=45.0 %.
- Percent of failures at Location 2 that were mechanical:
(16/42)100=38.1 %.
- Percent of failures at Location 1 that were electronic:
(61/111)100=55.0 %.
- You should make sure you know how to compute the other
percentages shown in the table.
- The conditional distribution of the column variable, given
the row variable, is found by expressing the counts in the
frequency table as percentages of the row totals.
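- The conditional distribution of Location of Failure given
Type of Failure is shown below (each row sums to 100 %, up to rounding).

                   Location of Failure
Type of Failure        1        2        3
Mechanical         51.5 %   16.5 %   32.0 %
Electronic         59.2 %   25.2 %   15.5 %
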
- The percentages were computed as follows.
- Percent of mechanical failures occurring at Location 1:
(50/97)100=51.5 %.
- Percent of mechanical failures occurring at Location 2: (16/97)100=16.5 %.
- Percent of electronic failures occurring at Location 1:
(61/103)100=59.2 %.
- You should make sure you know how to compute the other
percentages shown in the table.
- It can help to draw bar graphs displaying the conditional
distributions.
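The two conditional distributions differ only in which total is used as the denominator. A minimal sketch, with hypothetical function names, assuming the counts are stored as a list of rows:

```python
def conditional_on_columns(counts):
    """Each count as a percentage of its column total (row variable given column)."""
    col_totals = [sum(col) for col in zip(*counts)]
    return [
        [round(100 * o / total, 1) for o, total in zip(row, col_totals)]
        for row in counts
    ]


def conditional_on_rows(counts):
    """Each count as a percentage of its row total (column variable given row)."""
    return [[round(100 * o / sum(row), 1) for o in row] for row in counts]


# Rows: Mechanical, Electronic; columns: locations 1, 2, 3.
counts = [[50, 16, 31], [61, 26, 16]]

type_given_location = conditional_on_columns(counts)
location_given_type = conditional_on_rows(counts)

print(type_given_location)  # [[45.0, 38.1, 66.0], [55.0, 61.9, 34.0]]
print(location_given_type)  # [[51.5, 16.5, 32.0], [59.2, 25.2, 15.5]]
```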
Simpson's Paradox
- An association that holds for all of several groups can
reverse direction when the data are combined to form a single
group. This reversal is called Simpson's Paradox.
- Below is a table of applicants to Upper Wabash Tech's 2
professional schools categorized by gender and admission decision.
There were 1200 applicants.

            Admitted   Not Admitted   Total
Male             490            210     700
Female           280            220     500
Total            770            430    1200

- Note that
- (490/700)100=70.0 % of Males were admitted.
- (280/500)100=56.0 % of Females were admitted.
There appears to be evidence of gender discrimination, with Males
being favored over Females.
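- Upper Wabash Tech has 2 professional schools, Business and
Law. There were 800 applicants for the Business School and 400 for the Law School.
Below are 2 frequency tables showing the relationship
between gender and admission decision for each school separately.

Business School
            Admitted   Not Admitted   Total
Male             480            120     600
Female           180             20     200
Total            660            140     800

Law School
            Admitted   Not Admitted   Total
Male              10             90     100
Female           100            200     300
Total            110            290     400
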
- Note that
- (480/600)100=80.0 % of Males were admitted to the
Business School and (180/200)100=90.0 % of Females were admitted to the
Business
School.
- (10/100)100=10.0 % of Males were admitted to the
Law School and (100/300)100=33.3 % of Females were admitted to the
Law School.
- There is no longer any evidence of gender discrimination in
favor of Males. In fact it appears that Females may be favored
over Males.
- What happened? The Law School is harder to get into than the
Business School. Only (110/400)100=27.5 % of applicants to the
Law School were admitted whereas (660/800)100=82.5 % of the applicants to the
Business School were admitted. Of the 500 Female applicants,
(300/500)100=60.0 % applied to the more selective Law School
whereas only (100/700)100=14.3 % of the Male applicants applied to the Law School. The
original frequency table appeared to show favoritism towards Males because
it failed to account for the particular school to which applicants
had applied.
- Numbers will lie to you if you are not careful.
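The reversal can be seen directly by computing admission rates within each school and then for the two schools combined. This is an illustrative sketch using the counts quoted above; the dictionary layout and names are assumptions.

```python
# (admitted, applied) for each school and gender, from the example above.
admissions = {
    "Business": {"Male": (480, 600), "Female": (180, 200)},
    "Law": {"Male": (10, 100), "Female": (100, 300)},
}


def rate(admitted, applied):
    """Admission rate as a percentage, rounded to one decimal place."""
    return round(100 * admitted / applied, 1)


# Rates within each school.
by_school = {
    school: {gender: rate(*pair) for gender, pair in groups.items()}
    for school, groups in admissions.items()
}

# Rates after pooling the two schools.
combined = {}
for gender in ("Male", "Female"):
    admitted = sum(admissions[school][gender][0] for school in admissions)
    applied = sum(admissions[school][gender][1] for school in admissions)
    combined[gender] = rate(admitted, applied)

print(by_school)  # Females have the higher rate in both schools
print(combined)   # {'Male': 70.0, 'Female': 56.0}
```

Within each school the Female rate is higher, but the pooled rates favor Males because most Female applicants applied to the more selective school.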
Testing for Significant Associations in Two-Way Tables: Chapter 13, Section 2
- We will be looking at two types of tests: a test of
homogeneity and a test of independence. Fortunately both
tests use exactly the same test statistic.
- Method: The chi-square test. The approach
taken is:
- For both tests the data consist of observed counts, the
number of observations that fall into each cell of a frequency
table.
- The counts we expect to see under $H_0$ will be computed.
These are referred to as expected counts. We will see how to
compute these below.
- If $H_0$ is true the observed counts ($O$) and expected
counts ($E$) should be close to one another. We will measure
how close using a chi-square test statistic.
- The test statistic, denoted $X^2$, compares the set of observed counts
with the set of expected counts. The formula for $X^2$ is
$$X^2 = \sum_{\text{all cells}} \frac{(O - E)^2}{E}$$
- Large values of $X^2$ are evidence against $H_0$. The question
remains: "How large is too large?" To answer this question
we refer to a new distribution called the chi-square
distribution. In Greek letter notation, we write $\chi^2$
distribution.
- Each $\chi^2$ distribution has an associated degrees of freedom. We
use the notation $\chi^2_{df}$ to denote a $\chi^2$ distribution
with $df$ degrees of freedom.
- For an $R \times C$ table, the test statistic $X^2$ will approximately
follow a $\chi^2$ distribution with $(R-1)(C-1)$
degrees of freedom when $H_0$ is true.
- Properties of the $\chi^2$ distribution:
- The distribution is skewed to the right.
- The values are nonnegative.
- There is a different $\chi^2$ distribution for different
degrees of freedom.
- The mean of a $\chi^2$ distribution is equal to its degrees
of freedom.
- The variance of a $\chi^2$ distribution is equal to twice
its degrees of freedom.
- Tabled values of $\chi^2$ distributions can be found in
Table VI on page 662.
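The mean and variance properties can be checked informally by simulation. The sketch below relies on a standard fact that is not stated in these notes: a chi-square variable with df degrees of freedom has the same distribution as the sum of df squared independent standard normal variables. The seed, the sample size, and the choice of 3 degrees of freedom are arbitrary illustration choices.

```python
import random
import statistics


def chi_square_draw(df, rng):
    """One chi-square draw: the sum of df squared standard normal draws."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df))


rng = random.Random(0)  # fixed seed so the run is repeatable
df = 3
draws = [chi_square_draw(df, rng) for _ in range(20000)]

sample_mean = statistics.mean(draws)
sample_variance = statistics.variance(draws)

# The sample mean should land near df and the sample variance near 2 * df.
print(sample_mean, sample_variance)
print(min(draws) >= 0)  # values are never negative
```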
Testing Hypotheses about Two-Way Tables
- There are 2 tests of interest.
1. Test of Homogeneity.
2. Test of Independence.
- To determine which test is appropriate you need to know how
the data were collected.
- For a test of homogeneity there are $C$ independent random
samples, one from each of the $C$ populations under study. The response
is a qualitative variable with $R$ possible categories. The data
are summarized in a two-way table with the columns referring to
the populations and the rows to the categories of the response.
- For a test of independence there is just one simple random sample (SRS) from
a single population. For each unit two qualitative responses are
recorded. One response has $R$ categories and the other has $C$
categories.
- Fortunately both hypotheses are tested in exactly the same
way.
The $\chi^2$ Test
- To test the null hypothesis $H_0$ in $R \times C$ tables, we compare the
observed cell counts with the expected
cell counts. The expected cell counts are calculated under the
assumption that $H_0$ is true.
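- The test statistic, denoted $X^2$, compares the set of observed counts
with the set of expected counts. We calculate $X^2$ as follows:
1. Calculate each expected cell count. The expected count for a cell is
$$E = \frac{(\text{row total}) \times (\text{column total})}{n},$$
where $n$ is the grand total (the total sample size).
2. Calculate $(O - E)^2 / E$ for each cell in the table.
3. The test statistic $X^2$ is the sum of all of the values calculated in
step 2. That is,
$$X^2 = \sum_{\text{all cells}} \frac{(O - E)^2}{E}$$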
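- Back to the electronic device example.
- For cell $(i,j)$ the expected
cell counts $E_{ij}$ are:
$$E_{11} = \frac{(97)(111)}{200} = 53.835, \quad E_{12} = \frac{(97)(42)}{200} = 20.370, \quad E_{13} = \frac{(97)(47)}{200} = 22.795$$
$$E_{21} = \frac{(103)(111)}{200} = 57.165, \quad E_{22} = \frac{(103)(42)}{200} = 21.630, \quad E_{23} = \frac{(103)(47)}{200} = 24.205$$
- Then the $(O-E)^2/E$ values for cell $(i,j)$ are:
$$\frac{(50 - 53.835)^2}{53.835} = 0.2732, \quad \frac{(16 - 20.370)^2}{20.370} = 0.9375, \quad \frac{(31 - 22.795)^2}{22.795} = 2.9534$$
$$\frac{(61 - 57.165)^2}{57.165} = 0.2573, \quad \frac{(26 - 21.630)^2}{21.630} = 0.8829, \quad \frac{(16 - 24.205)^2}{24.205} = 2.7813$$
- Then $X^2 = 0.2732 + 0.9375 + 2.9534 + 0.2573 + 0.8829 + 2.7813 = 8.0856$.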
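The three calculation steps translate directly into code. The sketch below is one possible implementation, not a library routine: the function names are illustrative, and the observed counts are those of the electronic device example (rows Mechanical and Electronic, columns locations 1 to 3).

```python
def expected_counts(observed):
    """Step 1: expected count = (row total * column total) / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_square_statistic(observed):
    """Steps 2 and 3: sum (O - E)^2 / E over every cell of the table."""
    expected = expected_counts(observed)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )


observed = [[50, 16, 31], [61, 26, 16]]

x2 = chi_square_statistic(observed)
df = (len(observed) - 1) * (len(observed[0]) - 1)  # (R - 1)(C - 1)

print(f"X^2 = {x2:.4f} with {df} degrees of freedom")
```

For these counts the statistic comes out at about 8.0856 with 2 degrees of freedom, the value used in the example later in this section.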
- Regardless of which of the 2 tests is being conducted, if $H_0$ is true
then the observed and expected cell frequencies should be similar.
When these frequencies are similar, the $X^2$ statistic will tend
to be small.
- On the other hand, if $H_0$ is false then we will tend to find observed and expected cell
frequencies that differ substantially. When these frequencies do
differ substantially, the $X^2$ statistic will tend to be large.
- For the hypothesis test, we can get bounds on the $p$-value from Table
VI (page 662) by doing the following:
1. Go to the $(R-1)(C-1)$ row in Table VI.
2. Scan the row until you find two consecutive values that $X^2$ falls
between.
3. Look at the column headings corresponding to tail probabilities.
These are used to bound the $p$-value. For example, if the column
headings are $\chi^2_{0.95}$ and $\chi^2_{0.975}$ then the $p$-value is
between 0.025 and 0.05. (If $X^2$ falls outside the first or last column we say the
$p$-value is greater than the tail probability of the first column
or less than the tail probability of the last column, respectively.)
- At this point, we follow the decision rule: Reject $H_0$ if $p$-value $\le \alpha$,
or Fail to Reject $H_0$ if $p$-value $> \alpha$.
- If a significant association is detected, it is recommended to view the
bar charts described earlier to gain some understanding of the
nature of the association.
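The table lookup can be mimicked in code. The row of critical values below is a hypothetical stand-in for the 2-degrees-of-freedom row of a chi-square table; the actual layout and columns of Table VI may differ. For 2 degrees of freedom the critical value with upper-tail probability p is exactly -2 ln(p), which is used here to generate the row. The significance level of 0.05 and the function name are also assumptions.

```python
import math
from bisect import bisect_right

# Hypothetical table row for df = 2; the real Table VI may use other columns.
# For df = 2 the critical value with upper-tail probability p is -2 * ln(p).
tail_probs = [0.10, 0.05, 0.025, 0.01, 0.005]
critical_values = [-2 * math.log(p) for p in tail_probs]  # increasing order


def p_value_bounds(x2, critical_values, tail_probs):
    """Bracket the p-value between two adjacent tail probabilities."""
    k = bisect_right(critical_values, x2)  # how many critical values are <= x2
    upper = tail_probs[k - 1] if k > 0 else 1.0  # left of the first column
    lower = tail_probs[k] if k < len(tail_probs) else 0.0  # right of the last
    return lower, upper


lower, upper = p_value_bounds(8.0856, critical_values, tail_probs)

alpha = 0.05  # assumed significance level
reject = upper <= alpha  # the whole bracket lies at or below alpha

print(lower, upper, reject)  # 0.01 0.025 True
```

When the bracket straddles the significance level, the table alone does not settle the decision and an exact $p$-value is needed.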
- Example. For the electronic devices example, the test statistic was
$X^2_{OBS}=8.0856$. Suppose a conventional significance level such as $\alpha = 0.05$
is used. If the null hypothesis
is true, i.e. location of failure is independent of type of
failure, then the test statistic is approximately distributed as
$\chi^2$ with $(R-1)(C-1)=(1)(2)=2$ degrees of freedom. Using
Table VI we see that the $p$-value is bounded between 0.01 and
0.025. Thus we would reject $H_0$ and conclude that Type of
Failure depends on Location of Failure.
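The table bounds can be cross-checked against an exact tail probability. This shortcut is specific to 2 degrees of freedom, where the chi-square upper-tail probability beyond x is exp(-x/2); it does not apply to other degrees of freedom, for which statistical software would normally be used.

```python
import math

x2_obs = 8.0856          # observed test statistic from the example
df = (2 - 1) * (3 - 1)   # (R - 1)(C - 1) = 2

# Valid only for df = 2: the upper-tail probability is exp(-x / 2).
p_value = math.exp(-x2_obs / 2)

print(round(p_value, 4))  # 0.0175
```

The result lies between 0.01 and 0.025, in agreement with the bounds read from the table.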
- Note that all this test tells us is the two variables are
dependent on one another. It tells us nothing about the nature of
the dependence. You should always look at the appropriate
marginal/conditional distributions to get some idea of the nature
of the dependence.
- The example was for a test of independence but exactly the same
procedure is followed for a test of homogeneity. Of course, the
statement of the null and alternative hypotheses will differ, as
will your conclusions, but the mechanical part of carrying out the
test is identical.