Chapter 22

Chi-Square and Tests of Contingency Tables

Hypothesis tests may be performed on contingency tables in order to decide whether or not effects are present. Effects in a contingency table are defined as relationships between the row and column variables; that is, are the levels of the row variable differentially distributed over the levels of the column variable? Significance in this hypothesis test means that interpretation of the cell frequencies is warranted. Non-significance means that any differences in cell frequencies could be explained by chance.

Hypothesis tests on contingency tables are based on a statistic called chi-square. Before we get into a discussion of chi-square, let's review contingency tables.

Frequency tables of two variables presented simultaneously are called contingency tables. Contingency tables are constructed by listing all the levels of one variable as rows in a table and the levels of the other variable as columns, then finding the joint or cell frequency for each cell. The cell frequencies are then summed across both rows and columns. The sums are placed in the margins, the values of which are called marginal frequencies. The value in the lower right-hand corner is the sum of either the row or the column marginal frequencies, both of which must equal N.

For example, suppose that a researcher studied the relationship between being HIV positive and the sexual preference of individuals. The study resulted in the following data for thirty male subjects:

HIV+ | Y | N | N | N | Y | N | N | N | Y | N | N | N | Y | N | N | N | N | N | N | N | Y | N | Y | Y | N | Y | N | Y | N | N |

SexPref | B | F | F | B | F | F | F | M | F | F | F | F | B | F | F | B | F | M | F | F | M | F | B | M | F | M | F | M | F | M |

HIV+ Y=Yes, N=No; SexPref F=Female, M=Male, and B=Both.
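The construction of a contingency table described above can be sketched in a few lines of Python. This is only an illustration using the standard library, not part of the SPSS procedure; the variable names (hiv, sexpref, cells, and so on) are assumptions introduced here, and the data are transcribed from the two rows listed above.

```python
from collections import Counter

# Raw data transcribed from the two rows above (30 male subjects).
hiv = "Y N N N Y N N N Y N N N Y N N N N N N N Y N Y Y N Y N Y N N".split()
sexpref = "B F F B F F F M F F F F B F F B F M F F M F B M F M F M F M".split()

# Joint (cell) frequencies: one count per (row level, column level) pair.
cells = Counter(zip(hiv, sexpref))

row_levels = ["Y", "N"]        # HIV+ status
col_levels = ["M", "F", "B"]   # sexual preference

# Marginal frequencies are the cell frequencies summed across rows and columns.
row_totals = {r: sum(cells[(r, c)] for c in col_levels) for r in row_levels}
col_totals = {c: sum(cells[(r, c)] for r in row_levels) for c in col_levels}
n = sum(row_totals.values())   # lower right-hand corner of the table

print("HIV+   " + "".join(f"{c:>5}" for c in col_levels) + " Total")
for r in row_levels:
    counts = "".join(f"{cells[(r, c)]:>5}" for c in col_levels)
    print(f"{r:<7}" + counts + f"{row_totals[r]:>6}")
print("Total  " + "".join(f"{col_totals[c]:>5}" for c in col_levels) + f"{n:>6}")
```

Both sets of marginal frequencies sum to the same N of 30, as the description of contingency tables requires.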

The data file, coded 0=No and 1=Yes for HIV and 1=Males, 2=Females, and 3=Both for SEXPREF, would appear as follows (partial view):

A contingency table and chi-square hypothesis test of independence could be generated in SPSS by selecting Analyze/Descriptive Statistics/Crosstabs, as the following figure shows.

Then select the options indicated in the following figure.

The resulting output tables should look like these:

The significance level for the Pearson chi-square, shown in the Asymp. Sig. (2-tailed) column, is .022. Because this value is less than .05, the hypothesis that the rows and columns of the contingency table are independent is rejected. Generally this means that it is worthwhile to interpret the cells in the contingency table. In this particular case it means that being HIV positive or not is not distributed similarly across the different levels of sexual preference. In other words, in this sample, males who prefer other males or who prefer both males and females are more likely to be HIV positive than males who prefer only females.

The procedure used to test the significance of contingency tables is similar to all other hypothesis tests. That is, a statistic is computed and then compared to a model of what the world would look like if the experiment were repeated an infinite number of times when there were no effects. In this case the statistic computed is called the chi-square statistic. This statistic will be discussed first, followed by a discussion of its theoretical distribution. Finding the significance level of chi-square and interpreting it will conclude the chapter.

The first step in computing the chi-square statistic is the computation of the contingency table. The preceding table is reproduced here:

The next step in computing the chi-square statistic is the computation of the expected cell frequency for each cell. This is accomplished by multiplying the marginal frequencies for the row and column (row and column totals) of the desired cell and then dividing by the total number of observations. The formula for computation can be represented as follows:

Expected Cell Frequency = (Row Total * Column Total) / N

For example, computation of the expected cell frequency for HIV+ Males would proceed as follows:

Expected Cell Frequency = (Row Total * Column Total) / N = (9 * 7) / 30 = 2.1

You can see the cell we're working with in the following table:

Using the same procedure to compute all the expected cell frequencies results in the following table:

Note that each expected row total is the same as the corresponding observed row total; the same holds true for the column totals.
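The expected cell frequencies can be checked with a short Python sketch. The observed counts below were tallied from the raw data listed earlier in the chapter (rows are HIV+ Yes and No; columns are Males, Females, and Both); the variable names are illustrative assumptions.

```python
# Observed cell frequencies tallied from the example data.
observed = [
    [4, 2, 3],    # HIV+ = Yes: Males, Females, Both
    [3, 16, 2],   # HIV+ = No:  Males, Females, Both
]

row_totals = [sum(row) for row in observed]         # [9, 21]
col_totals = [sum(col) for col in zip(*observed)]   # [7, 18, 5]
n = sum(row_totals)                                 # 30

# Expected Cell Frequency = (Row Total * Column Total) / N
expected = [[r * c / n for c in col_totals] for r in row_totals]

for row in expected:
    print([round(e, 1) for e in row])
# [2.1, 5.4, 1.5]
# [4.9, 12.6, 3.5]

# The expected frequencies reproduce the observed marginal totals.
expected_row_totals = [sum(row) for row in expected]
expected_col_totals = [sum(col) for col in zip(*expected)]
print([round(t, 1) for t in expected_row_totals])   # [9.0, 21.0]
print([round(t, 1) for t in expected_col_totals])   # [7.0, 18.0, 5.0]
```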

The next step is to subtract the expected cell frequency from the observed cell frequency for each cell. This value gives the amount of deviation or error for each cell. Adding these to the preceding table results in the following:

Note also that the Observed - Expected values sum to zero within every row and within every column.
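The zero-sum property of the deviations can be verified directly. The following is a minimal sketch, again assuming the observed counts tallied from the example data; small floating-point residue is rounded away before printing.

```python
observed = [[4, 2, 3], [3, 16, 2]]   # rows: HIV+ Yes, No; columns: Males, Females, Both

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Deviation (error) for each cell: Observed - Expected
deviations = [
    [o - e for o, e in zip(obs_row, exp_row)]
    for obs_row, exp_row in zip(observed, expected)
]

row_sums = [sum(row) for row in deviations]
col_sums = [sum(col) for col in zip(*deviations)]

for row in deviations:
    print([round(d, 1) for d in row])
# [1.9, -3.4, 1.5]
# [-1.9, 3.4, -1.5]
print([abs(round(s, 6)) for s in row_sums])   # [0.0, 0.0]
print([abs(round(s, 6)) for s in col_sums])   # [0.0, 0.0, 0.0]
```

Because the deviations always cancel in this way, they cannot simply be added up to measure overall error, which is why the next step squares them.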

Following this, the difference computed in the last step is squared, resulting in the following table:

Each of the squared differences is then divided by the expected cell frequency for each cell, resulting in the following table:

The chi-square statistic is computed by summing the (O-E)^{2}/E value, the last row of each cell in the preceding table, over all cells. The formula can be represented as follows:

chi-square = Σ (O-E)^{2}/E

This computation for the example table would result in the following:

chi-square = 1.72 + 2.14 + 1.50 + .74 + .92 + .64 = 7.66

Note that this value is within rounding error of the value for chi-square computed by SPSS in an earlier section of this chapter.
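The full sequence of steps, from observed frequencies to the chi-square statistic, can be collected into one small function. This is a sketch under the assumption that the table is supplied as a list of rows of observed counts; the function name chi_square is hypothetical and is not an SPSS or library routine.

```python
def chi_square(observed):
    """Return the Pearson chi-square statistic for a table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)

    total = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n   # expected cell frequency
            total += (o - e) ** 2 / e               # squared deviation over expected
    return total


# Example table: rows are HIV+ Yes and No; columns are Males, Females, Both.
statistic = chi_square([[4, 2, 3], [3, 16, 2]])
print(round(statistic, 2))   # 7.66
```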

The distribution of the chi-square statistic can be specified by imagining that the preceding experiment was conducted an infinite number of times when the effects were not real. The resulting distribution is called the chi-squared distribution. The chi-squared distribution is characterized by a parameter called the degrees of freedom (df) that determines the shape of the distribution. Two chi-squared distributions are presented here, each with a different value for the degrees of freedom parameter.
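The idea of repeating the experiment many times when no effects are present can be approximated by simulation. The sketch below is not the theoretical chi-squared distribution itself; it is a shuffling (permutation) simulation, a swapped-in technique that breaks any relationship between the two variables by randomly reordering the HIV values and recomputing chi-square each time. The seed, the number of repetitions, and all names are assumptions chosen for illustration.

```python
import random
from collections import Counter

hiv = "Y N N N Y N N N Y N N N Y N N N N N N N Y N Y Y N Y N Y N N".split()
sexpref = "B F F B F F F M F F F F B F F B F M F F M F B M F M F M F M".split()


def chi_square(rows, cols):
    """Pearson chi-square for two equal-length lists of category labels."""
    n = len(rows)
    row_totals = Counter(rows)
    col_totals = Counter(cols)
    cells = Counter(zip(rows, cols))
    total = 0.0
    for r in sorted(row_totals):
        for c in sorted(col_totals):
            e = row_totals[r] * col_totals[c] / n
            total += (cells[(r, c)] - e) ** 2 / e
    return total


observed_stat = chi_square(hiv, sexpref)

rng = random.Random(22)   # fixed seed so the run is repeatable
reps = 5000               # stands in for "an infinite number of times"
shuffled = hiv[:]
simulated = []
for _ in range(reps):
    rng.shuffle(shuffled)                      # no real effect by construction
    simulated.append(chi_square(shuffled, sexpref))

# Proportion of no-effect results at least as large as the observed statistic.
as_extreme = sum(1 for s in simulated if s >= observed_stat - 1e-9)
proportion = as_extreme / reps
print(f"observed chi-square = {observed_stat:.2f}")
print(f"proportion at least as large = {proportion:.3f}")
```

The proportion printed by a simulation of this kind varies from run to run and with the seed, but it estimates the same quantity that the chi-squared distribution supplies analytically.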

The degrees of freedom for the chi-square test of a contingency table is computed by multiplying the number of rows minus one by the number of columns minus one, or:

df = (#Rows - 1) * (#Columns - 1)

In the example problem the degrees of freedom is equal to (2-1)*(3-1)=1*2=2.

The exact significance level for a chi-square statistic can be found using the Probability Calculator. Select Chi-Square Distribution; enter 2 in the df box and 7.66 in the Value box; and then click the right-facing arrow, as the following figure illustrates.

The exact significance level computed by the Probability Calculator (.0217) agrees, within rounding error, with the value computed by SPSS (.022). In both cases the null hypothesis would be rejected.
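The significance level can also be checked without the Probability Calculator. The sketch below relies on a closed form for the upper-tail area of the chi-squared distribution that holds only when df is even, which is the case in the example (df = 2); it is not a general-purpose routine, and the function name chi_square_upper_tail is hypothetical.

```python
import math


def chi_square_upper_tail(value, df):
    """Area to the right of value under a chi-squared curve (even df only)."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this sketch handles positive even df only")
    half = value / 2.0
    # For df = 2k: P(X >= value) = exp(-value/2) * sum of (value/2)^i / i!, i = 0..k-1
    terms = sum(half ** i / math.factorial(i) for i in range(df // 2))
    return math.exp(-half) * terms


n_rows, n_cols = 2, 3
df = (n_rows - 1) * (n_cols - 1)          # (2-1)*(3-1) = 2
p = chi_square_upper_tail(7.66, df)
print(df, round(p, 4))                    # 2 0.0217
print("reject null" if p < 0.05 else "retain null")   # reject null
```

With df = 2 the formula reduces to exp(-value/2), which makes the example easy to confirm on a hand calculator.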

The interpretation of the cell frequencies may be guided by the amount each cell contributes to the chi-square statistic, as seen in the (O-E)^{2}/E value. In general, the larger the difference between the observed and expected values relative to the expected value, the greater this value. In the example data, it can be seen that the homosexual males had a greater incidence of being HIV positive (Observed = 4, Expected = 2.1) than would be expected by chance alone, while heterosexual males had a lesser incidence (Observed = 2, Expected = 5.4). This sort of evidence could direct the search for the causes of HIV.
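One way to organize this kind of interpretation is to list the cells in order of their contribution to chi-square, along with the direction of each difference. The following sketch assumes the observed counts tallied from the example data; the labels and variable names are illustrative.

```python
row_labels = ["HIV+ Yes", "HIV+ No"]
col_labels = ["Males", "Females", "Both"]
observed = [[4, 2, 3], [3, 16, 2]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

cells = []
for i, r_lab in enumerate(row_labels):
    for j, c_lab in enumerate(col_labels):
        o = observed[i][j]
        e = row_totals[i] * col_totals[j] / n
        direction = "more than expected" if o > e else "fewer than expected"
        cells.append((round((o - e) ** 2 / e, 2), r_lab, c_lab, o, round(e, 1), direction))

# Largest contributions first: these cells deserve the most attention.
ranked = sorted(cells, reverse=True)
for contribution, r_lab, c_lab, o, e, direction in ranked:
    print(f"{r_lab:<9}{c_lab:<8} O={o:<3} E={e:<5} (O-E)^2/E={contribution:<5} {direction}")
```

In the example, the three HIV positive cells make the largest contributions, with heterosexual males showing fewer cases than expected and the other two groups showing more.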

The chi-squared test of significance is useful as a tool to determine whether or not it is worth the researcher's effort to interpret a contingency table. A significant result of this test means that the cells of a contingency table should be interpreted. A non-significant result means that no effects were discovered and that chance could explain the observed differences in the cells; in that case, an interpretation of the cell frequencies is not useful.