8.1 Introduction
The two most common types of observational study designs in epidemiology are cohort studies and case–control studies. The objective of both these types of studies is to learn about causal relations between antecedent exposures and subsequent health outcomes. They differ, however, in the manner in which they select subjects for study.
Cohort studies start by identifying disease-free study subjects from a source population. Individuals are then classified as exposed or nonexposed and are followed (either prospectively or retrospectively) to determine the incidence of relevant events.
In contrast, case–control studies begin by identifying individuals who have already experienced the study outcome. These individuals comprise the case series. The study then selects a series of individuals from the same source population that gave rise to the cases but who have not (yet) experienced the study outcome. These individuals comprise the control series. Exposures to prior risk factors are then ascertained retrospectively in the cases and controls. The odds of prior exposures to risk factors are then compared in case series and control series.
Note that the distinction between cohort studies and case–control studies is based on the manner in which subjects are selected for study. Cohort studies start with disease-free study subjects and “wait” for the outcome to develop. Case–control studies start by selecting diseased and non-diseased subjects and ascertain prior exposures to risk factors. By selecting study subjects in this manner, case–control studies gain statistical efficiencies that could not otherwise be achieved when the disease is rare. In so doing, case–control studies forfeit the ability to estimate rates and risks directly (because the denominators needed to estimate risks and rates are absent). Case–control studies are, however, able to estimate the effect of an exposure in relative terms through a statistic known as the exposure odds ratio.
Table 8.1 exhibits the notation we use for cross-tabulated counts based on independent groups. In this notation, A represents “case” and B represents “control.” The subscript 1 represents “exposure-positive” and the subscript 0 represents “exposure-negative.” As examples, A1 represents the number of exposed cases and B0 represents the number of nonexposed controls. Using this notation, the odds of exposure in the case series is A1/A0, while the odds of exposure in the control series is B1/B0. The exposure odds ratio estimate is therefore:
\[
\widehat{OR} = \frac{A_1 / A_0}{B_1 / B_0} \tag{8.1}
\]
This formula can be simplified to this algebraically equivalent form:
\[
\widehat{OR} = \frac{A_1 B_0}{A_0 B_1} \tag{8.2}
\]
Formula 8.2 is the cross-product ratio of the 2-by-2 cross-tabulation: multiply the counts in each pair of diagonally opposite cells (exposed cases by nonexposed controls, and nonexposed cases by exposed controls), then form the ratio of the two products.
It can be demonstrated that exposure odds ratios from case–control studies are direct estimates of the rate ratio in the underlying source population (see Section 8.5 for proof). Therefore, ORs are interpreted as if they were RRs. As examples, an OR of 1 indicates no association between the exposure and disease, whereas an OR of 2 indicates that the exposure doubles the risk.
A case–control study carried out in the Ille-et-Vilaine (Brittany) region of France identified 200 cases of esophageal cancer in men; 775 men without esophageal cancer were selected from electoral lists from the same region of France to serve as controls (Tuyns et al., 1977; Breslow and Day, 1980). Table 8.2 cross-tabulates counts from this study with alcohol consumption dichotomized at less than 80 g/day versus 80 g/day or more. Note that 96 (48%) of the 200 cases are classified as exposed. In contrast, 109 (14%) of the 775 controls are classified as exposed. Thus, the odds ratio estimate is (96 × 666)/(104 × 109) = 5.64, indicating 5.64 times the risk of esophageal cancer in the high-alcohol consuming group relative to the low-alcohol consumers.
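As a check on the arithmetic, the following minimal Python sketch computes the odds ratio for this example in both forms: as the ratio of exposure odds (formula 8.1) and as the cross-product ratio (formula 8.2). The function names are illustrative and not part of the source; the four cell counts are those implied by the text (96 of 200 cases and 109 of 775 controls exposed).

```python
def odds_ratio_from_odds(a1, a0, b1, b0):
    """Exposure odds in cases (a1/a0) divided by exposure odds in controls (b1/b0)."""
    return (a1 / a0) / (b1 / b0)


def odds_ratio_cross_product(a1, a0, b1, b0):
    """Cross-product ratio of the 2-by-2 table."""
    return (a1 * b0) / (a0 * b1)


# Cell counts implied by the esophageal cancer example
a1, a0 = 96, 200 - 96      # exposed and nonexposed cases
b1, b0 = 109, 775 - 109    # exposed and nonexposed controls

or_odds = odds_ratio_from_odds(a1, a0, b1, b0)
or_cross = odds_ratio_cross_product(a1, a0, b1, b0)
print(round(or_odds, 2), round(or_cross, 2))
```

Both forms agree up to floating-point rounding, which is the algebraic equivalence stated above.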
8.2 Identifying cases and controls
Ascertainment of cases
Before searching for cases, the study defines the diagnostic and epidemiologic criteria by which cases will be identified. These criteria are called the case definition. The case definition is then uniformly applied to screen for cases. At times it may be advantageous to establish several different case definitions to allow for separate examinations by the different criteria (see Chapter 16).
Case ascertainment may be limited to incident cases or may include prevalent cases. Incident cases are those with onset during the study period. Prevalent cases may have had their onset at any time before or during the study period. Incident cases are generally preferred because the survival of prevalent cases may depend on factors that are separate from the causes of the disease. In addition, use of prevalent cases may make it difficult to differentiate between past events that are causally related to the disease and events that have occurred consequent to disease onset.
There are times, however, when it is possible only to study prevalent cases. For example, when studying birth defects, the causes occur in utero, before birth. Since many birth defects are associated with high fetal death rates and both spontaneous and induced abortions, birth defects detected after birth represent cases that have survived until delivery (i.e., prevalent cases). Note that use of prevalent cases will not bias the results of a case–control comparison if survival is independent of the cause you are studying and the timing of the exposure in relation to disease onset can be accurately ascertained.
Selection of controls
Valid selection of controls is crucial when conducting case–control studies. It is best understood in terms of the function of the controls, which is to represent the exposure experience of the population that gave rise to the cases. In this regard, it is important to clearly define the underlying source population (study base) that gave rise to the cases. As disease events arise in the study base, they are identified as cases. For each case, one or more controls are selected from the same study base.
The study base for a case–control study can be an open or closed population (see Chapter 3). Cases for the study in Illustrative Example 8.1 (esophageal cancer and alcohol consumption), for example, came from hospitals in the open population of the Ille-et-Vilaine region (France). The catchment area served by these hospitals comprised the source population or study base. Therefore, controls were selected from this catchment area.
Case–control studies that use closed populations (cohorts) as their source population are called nested case–control studies (“case–control studies nested in a cohort”). As an example, a nested case–control study was carried out in a retrospective cohort of 223 292 men employed by three electric utility companies in France and Canada from 1970 to 1989 (Theriault et al., 1994). During this time, 4151 incident cancer cases were identified. For each incident cancer case, between one and four controls were selected from the underlying cohort. Exposure to electromagnetic field radiation for cases and controls was measured by dosimetry. Based on these data, workers who had more than the median cumulative exposure to magnetic fields had an increased risk of acute nonlymphoid leukemia (OR = 2.4), acute myeloid leukemia (OR = 3.2), and brain cancer (OR = 2.0). No elevation in risk was observed for any of the other 29 cancers that were studied. These data lend support to the hypothesis that occupational exposures to 60 Hz electromagnetic fields increase the risk of certain types of cancer.
When cases come from a restricted source, for example a particular clinic, the controls should be drawn from the same restricted source population. Here is an example in which an HMO population served as the source population for cases and controls.
A case–control study evaluated the relationship between vasectomy and prostate cancer in the Group Health Cooperative Health Maintenance Organization of Puget Sound (Zhu et al., 1996). Cases were 175 histologically confirmed cases of prostate cancer treated by the Health Maintenance Organization. The control series consisted of 258 similarly aged men selected at random from the Health Maintenance Organization’s general membership rolls. Data were collected from medical record reviews and via questionnaires. Table 8.3 presents data for prior vasectomy in cases and controls. The odds ratio calculated from these counts is close to unity and is therefore reasonably interpreted as “no association” between vasectomy and prostate cancer.
This example illustrates an important benefit of the case–control approach. The study was completed with a total sample size of 433 (175 cases and 258 controls). Because prostate cancer is rare, occurring on the order of 150 cases per 100 000 man-years, a very large cohort would be required to accrue 175 incident cases. This demonstrates the statistical efficiency of case–control studies.
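To put a rough number on "very large," the short Python sketch below divides the 175 cases by the quoted incidence rate to obtain the follow-up a cohort would need. It assumes a constant rate of 150 per 100 000 man-years and ignores losses to follow-up; the variable names are illustrative.

```python
cases_needed = 175
rate_per_100k_man_years = 150

# Expected cases = rate x man-years, so man-years = cases / rate
man_years = cases_needed / (rate_per_100k_man_years / 100_000)
print(round(man_years))
```

Under these assumptions a cohort would need on the order of 117 000 man-years of follow-up, compared with the 433 subjects in the case–control study.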
Number of controls per case
In conducting case–control studies, the investigator must decide on the number of controls to select per case. When large numbers of cases are available and the cost and difficulty of collecting information for cases and controls are equal, a control-to-case sampling ratio of 1:1 (equal numbers of cases and controls) maximizes the statistical efficiency of the study for a given total sample size.
However, when the number of cases in the source population is limited, or the collection of information from cases is relatively expensive, the precision of the study can be increased by raising the control-to-case sampling ratio to as much as 4:1. Increasing the control-to-case sampling ratio above 4:1, however, produces negligible increases in statistical power and precision (Gail et al., 1976).
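One way to see the diminishing return is through a commonly used approximation, introduced here as an assumption rather than taken from the source. With a fixed number of cases and no true association, the variance of the log odds ratio is proportional to 1 + 1/r, where r is the control-to-case ratio. The efficiency relative to an unlimited supply of controls is then r/(r + 1). The sketch below tabulates this quantity; the function name is illustrative.

```python
def relative_efficiency(r):
    """Efficiency of r controls per case relative to unlimited controls.

    Assumes a fixed number of cases and no true association, so that the
    variance of the log odds ratio is proportional to 1 + 1/r.
    """
    return r / (r + 1)


for r in (1, 2, 3, 4, 5, 10):
    print(f"{r}:1 -> {relative_efficiency(r):.0%}")
```

Under this approximation, moving from 1:1 to 4:1 raises relative efficiency from 50% to 80%, while moving from 4:1 to 5:1 adds only about three percentage points.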
Sample size considerations
The sample size requirements of case–control studies depend on the following factors: (a) the alpha level of the study, (b) the desired confidence interval length expressed on a logarithmic scale (or the desired statistical power of the significance test), (c) the control-to-case sampling ratio, (d) the expected proportion of controls that are classified as exposed, and (e) the expected odds ratio one is trying to detect.
Calculation of sample size requirements is facilitated with computer programs such as www.OpenEpi.com (Dean et al., 2009) and WinPEPI (Abramson, 2011).
Figure 8.1 is a screenshot from the program WinPEPI → Compare2 → Sample Size. This illustration indicates that a sample size of 208 cases and 208 controls is adequate to detect an odds ratio of 2 with 80% power at an alpha level of 5% when the prevalence of exposure to the risk factor in controls is 15%.
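For readers who want to see how these inputs combine, here is a minimal Python sketch of one standard approach: the normal-approximation formula for comparing two proportions, without continuity correction. This is an assumption chosen for illustration; the source does not state which formula WinPEPI applies, and other formulas (or a continuity correction) give somewhat different answers. The function name and defaults are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def cases_required(odds_ratio, p0, alpha=0.05, power=0.80, r=1.0):
    """Approximate number of cases for an unmatched case-control study.

    p0 is the expected exposure proportion in controls and r is the
    control-to-case ratio. Uses the two-proportion normal approximation
    without continuity correction (an assumption, see lead-in).
    """
    # Exposure proportion in cases implied by the odds ratio and p0
    p1 = odds_ratio * p0 / (1 + p0 * (odds_ratio - 1))
    # Weighted average exposure proportion across cases and controls
    p_bar = (p1 + r * p0) / (1 + r)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    numerator = (
        z_alpha * sqrt((1 + 1 / r) * p_bar * (1 - p_bar))
        + z_beta * sqrt(p1 * (1 - p1) + p0 * (1 - p0) / r)
    ) ** 2
    return ceil(numerator / (p1 - p0) ** 2)


n_cases = cases_required(odds_ratio=2, p0=0.15)
print(n_cases, "cases and", n_cases, "controls")
```

With an odds ratio of 2, 15% exposure in controls, 80% power, and a two-sided alpha of 5%, this approximation should land at or very near the 208 per group shown in Figure 8.1.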
8.3 Obtaining information on exposure
Information about exposure to potential risks in case–control studies can be derived by interviewing study subjects or their surrogates (e.g., family members), from the review of health care records, from vital statistics sources (e.g., death certificates), from employment records, from environmental records, and from biological specimens. Whatever the source, information should be obtained in a uniform and accurate way. For example, when deriving information by questionnaire, cases and controls should be questioned in an identical manner to obtain information of comparable accuracy and completeness.
The induction time between exposure to a risk factor and its effects can be substantial for chronic diseases. In the meantime, cases may have altered their exposure habits since developing disease, making current exposure status irrelevant. For example, cases with chronic lung disease may have stopped smoking after the damage had already been done. Therefore, it is important to focus on exposure histories from the relevant past.
Questionnaires and data collection forms should be thoughtfully worded and carefully designed. However, they need not be long and elaborate in order to be useful. For example, the questionnaire used to collect data for the landmark 1950 case–control study on smoking and lung cancer by Wynder and Graham (Figure 8.2) included only 15 items.
A key point in questionnaire design is to avoid misunderstanding of questions by study subjects and interviewers. We should not take for granted that either the interviewer or the interviewee will understand the question. The exposure being measured should be fully defined in each instance. Each question should be stated as precisely as possible. Items should be stated in a neutral way, taking care not to lead interviewees toward a particular response. If possible, the interviewer and interviewee should be kept unaware of the study hypotheses.
8.4 Data analysis
Dichotomous exposure
Illustrative Example 8.1 (Table 8.2) introduced the use of cross-tabulated data from a case–control study to derive the odds ratio associated with a dichotomous exposure. Recall that this illustration derived an OR of 5.6, indicating a strong positive association between esophageal cancer and alcohol consumption. Figure 8.3 is a screenshot of output for these data from the online application www.OpenEpi.com → Counts → two-by-two table. The odds ratio is reported as 5.627 with a 95% confidence interval of 3.992 to 7.947. This should be reported as either 5.6 (95% CI: 4.0–7.9) or 5.63 (95% CI: 3.99–7.95) to avoid a spurious impression of precision.
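A confidence interval of this kind can also be approximated by hand. The Python sketch below uses the Woolf (log-scale) method, which is an assumption made for illustration: the source does not say which method produced the OpenEpi figures quoted above, so small differences in the second decimal place are to be expected. The function name is hypothetical, and the counts are those of Illustrative Example 8.1.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def odds_ratio_ci(a1, a0, b1, b0, level=0.95):
    """Cross-product odds ratio with a Woolf (log-scale) confidence interval."""
    or_hat = (a1 * b0) / (a0 * b1)
    # Standard error of ln(OR): square root of the summed reciprocal counts
    se_log = sqrt(1 / a1 + 1 / a0 + 1 / b1 + 1 / b0)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    lower = exp(log(or_hat) - z * se_log)
    upper = exp(log(or_hat) + z * se_log)
    return or_hat, lower, upper


# Exposed cases, nonexposed cases, exposed controls, nonexposed controls
or_hat, lower, upper = odds_ratio_ci(96, 104, 109, 666)
print(f"OR = {or_hat:.2f} (95% CI: {lower:.2f} to {upper:.2f})")
```

The interval is computed on the logarithmic scale and then exponentiated, which is why it is not symmetric around the point estimate.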