The Problems
- We have two independent groups of individuals (e.g. homosexual men with and without a history of gonorrhoea). We want to know whether the proportions of individuals with a characteristic (e.g. infected with human herpesvirus-8, HHV-8) are the same in the two groups.
- We have two related groups, e.g. individuals may be matched, or measured twice in different circumstances (say, before and after treatment). We want to know whether the proportions with a characteristic (e.g. raised test result) are the same in the two groups.
Independent Groups: the Chi-Squared Test
Terminology
The data are obtained, initially, as frequencies, i.e. the numbers with and without the characteristic in each sample. A table in which the entries are frequencies is called a contingency table; when this table has two rows and two columns it is called a 2 × 2 table. Table 24.1 shows the observed frequencies in the four cells corresponding to each row/column combination, the four marginal totals (the frequency in a specific row or column, e.g. a + b), and the overall total, n. We can calculate (see Rationale below) the frequency that we would expect in each of the four cells of the table if H0 were true (the expected frequencies).
Assumptions
We have samples of sizes n1 and n2 from two independent groups of individuals. We are interested in whether the proportions of individuals who possess the characteristic are the same in the two groups. Each individual is represented only once in the study. The rows (and columns) of the table are mutually exclusive, implying that each individual can belong in only one row and only one column. The usual, albeit conservative, approach requires that the expected frequency in each of the four cells is at least five.
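The expected-frequency requirement can be checked before running the test. The following is a minimal sketch, assuming the 2 × 2 table is laid out with rows for characteristic present/absent and columns for Group 1/Group 2, so that the cells are a, b (top row) and c, d (bottom row). The function names and the counts are hypothetical and serve only as illustration; each expected frequency is taken as the row total times the column total divided by n.

```python
def expected_frequencies(a, b, c, d):
    """Expected counts for a 2 x 2 table [[a, b], [c, d]].

    Assumed layout: rows = characteristic present/absent,
    columns = Group 1/Group 2.
    """
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    # Each expected frequency = row total * column total / overall total
    return [[r * col / n for col in col_totals] for r in row_totals]


def all_expected_at_least(expected, minimum=5):
    """True if every cell's expected frequency reaches the minimum."""
    return all(e >= minimum for row in expected for e in row)


# Hypothetical counts, not taken from any study
expected = expected_frequencies(12, 30, 38, 120)
print(expected)
print(all_expected_at_least(expected))
```

If the check fails, the usual Chi-squared approach described in this section should not be relied on without further consideration.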
Rationale
If the proportions with the characteristic in the two groups are equal, we can estimate the overall proportion of individuals with the characteristic by p = (a + b)/n; we then expect n1 × p individuals with the characteristic in Group 1 and n2 × p in Group 2. We evaluate expected numbers without the characteristic similarly. Therefore, each expected frequency is the product of the two relevant marginal totals divided by the overall total. A large discrepancy between the observed (O) and the corresponding expected (E) frequencies is an indication that the proportions in the two groups differ. The test statistic is based on this discrepancy.
$$\chi^2 = \sum \frac{\left(|O - E| - \tfrac{1}{2}\right)^2}{E}$$
where O and E are the observed and expected frequencies, respectively, in each of the four cells of the table, and the sum is taken over the four cells. The vertical lines around O − E indicate that we ignore its sign. The 1/2 in the numerator is the continuity correction (Chapter 19). The test statistic follows the Chi-squared distribution with 1 degree of freedom.
Refer χ² to Appendix A3.
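The calculation described above can be sketched in a few lines of Python. This is an illustrative sketch, not a definitive implementation: the function name, the cell layout (a, b in the top row; c, d in the bottom row) and the counts are assumptions. In place of a table lookup, the P-value is obtained from the identity that a Chi-squared variable with 1 degree of freedom is the square of a standard Normal variable, so its upper tail probability equals erfc(sqrt(χ²/2)).

```python
import math


def chi_squared_2x2(a, b, c, d):
    """Continuity-corrected Chi-squared test for a 2 x 2 table.

    Assumed layout: [[a, b], [c, d]], rows = characteristic
    present/absent, columns = Group 1/Group 2.
    Returns (test statistic, P-value on 1 degree of freedom).
    """
    n = a + b + c + d
    observed = [[a, b], [c, d]]
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)

    statistic = 0.0
    for i in range(2):
        for j in range(2):
            # Expected frequency under H0: product of marginals / n
            e = row_totals[i] * col_totals[j] / n
            # (|O - E| - 1/2)^2 / E, summed over the four cells
            statistic += (abs(observed[i][j] - e) - 0.5) ** 2 / e

    # Upper tail of Chi-squared with 1 df, via the Normal relationship
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical counts, not taken from any study
statistic, p_value = chi_squared_2x2(12, 30, 38, 120)
print(f"chi-squared = {statistic:.3f}, P = {p_value:.3f}")
```

The sketch applies the formula exactly as written, so it assumes the expected frequencies are large enough for the approximation to hold.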