All-or-none, categorical data are common: male or female; married, divorced, widowed, or single; success or failure; dead or alive. The basic data are frequencies of cases in each category.
Such categorical data are qualitative, not quantitative, as has been assumed in previous chapters on Anova–regression. With Anova, scores were magnitudes of response for individual cases. With categorical data, in contrast, the “score” for each case is the category to which it belongs. Such category scores do not generally represent magnitude of any response measure, as illustrated with married–divorced–widowed–single.
A new statistical technique is needed to handle categorical data: chi-square. One principal application of chi-square is to compare the pattern of observed frequencies in the several categories with the null hypothesis of equal (or proportional) frequencies.
To illustrate, consider the study of smoking prevention cited in the appendix to Chapter 3. Three treatment conditions were used, each with about 1000 subjects. There were two response categories: successful quit attempt and relapse into smoking. Quit frequencies after three months were approximately 190, 240, and 270 for the three treatments. This looks promising. But perhaps chance alone could readily produce this difference. Is the observed difference in successes large enough to infer a real difference between these treatment conditions?
This question is answered by the chi-square test. Chi-square compares the observed frequencies with those expected under the null hypothesis of no difference between treatments.
Chi-square uses a single, quite simple formula that applies universally. One simple rule suffices to get expected frequencies for the common two-way contingency tables, that is, tables of the form that describe the cited smoking data. This formula and this rule build directly on concepts developed in previous chapters.
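As a rough preview, the smoking data can be run through the standard Pearson chi-square calculation. This is only a sketch under stated assumptions: each treatment is taken to have exactly 1000 subjects (the text says "about 1000"), so the relapse frequencies of 810, 760, and 730 are derived rather than reported. The function names are illustrative. The expected-frequency rule used here is the usual one for two-way tables (row total times column total, divided by the grand total), and the formula is the usual sum of (observed − expected)² / expected over all cells.

```python
def expected_frequencies(observed):
    """Expected cell counts under the null hypothesis of no treatment difference.

    Usual two-way rule: (row total * column total) / grand total.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_square(observed):
    """Return the Pearson chi-square statistic and its degrees of freedom."""
    expected = expected_frequencies(observed)
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, df


# Rows: three treatments; columns: quit, relapse.
# ASSUMPTION: exactly 1000 subjects per treatment, so relapse = 1000 - quit.
observed = [
    [190, 810],
    [240, 760],
    [270, 730],
]

statistic, df = chi_square(observed)
print(f"chi-square = {statistic:.2f} on {df} df")
```

Under these assumed group sizes, the statistic comes to about 18.3 on 2 df, well above the conventional .05 critical value of 5.99 for 2 df. With the actual, slightly unequal group sizes the value would differ somewhat.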