4 The Study of Risk Factors and Causation
Epidemiologists are frequently involved in studies to determine causation—that is, to find the specific cause or causes of a disease. This is a more difficult and elusive task than might be supposed, and it leaves considerable room for obfuscation, as shown in a newspaper article on cigarette smoking.1 The article quoted a spokesman for the Tobacco Institute (a trade association for cigarette manufacturers) as saying that “smoking was a risk factor, though not a cause, of a variety of diseases.”
Is a risk factor a cause, or is it not? To answer this question, we begin with a review of the basic concepts concerning causation. Studies can yield statistical associations between a disease and an exposure; epidemiologists need to interpret the meaning of these relationships and decide if the associations are artifactual, noncausal, or causal.
Most scientific research seeks to identify causal relationships. The three fundamental types of causes, as discussed next in order of decreasing strength, are (A) sufficient cause, (B) necessary cause, and (C) risk factor (Box 4-1).
Box 4-1 Types of Causal Relationships
Noncausal association: The relationship between two variables is statistically significant, but no causal relationship exists because the temporal relationship is incorrect (the presumed cause comes after, rather than before, the effect of interest) or because another factor is responsible for the presumed cause and the presumed effect.
A sufficient cause precedes a disease and has the following relationship with the disease: if the cause is present, the disease will always occur. However, examples in which this proposition holds true are surprisingly rare, apart from certain genetic abnormalities that, if homozygous, inevitably lead to a fatal disease (e.g., Tay-Sachs disease).
Smoking is not a sufficient cause of bronchogenic lung cancer, because many people who smoke do not acquire lung cancer before they die of something else. It is unknown whether all smokers would eventually develop lung cancer if they continued smoking and lived long enough, but within the human life span, smoking cannot be considered a sufficient cause of lung cancer.
A necessary cause precedes a disease and has the following relationship with the disease: the cause must be present for the disease to occur, although it does not always result in disease. In the absence of the organism Mycobacterium tuberculosis, tuberculosis cannot occur. M. tuberculosis can thus be called a necessary cause, or prerequisite, of tuberculosis. It cannot be called a sufficient cause of tuberculosis, however, because it is possible for people to harbor the M. tuberculosis organisms all their lives and yet have no symptoms of the disease.
Cigarette smoking is not a necessary cause of bronchogenic lung cancer because lung cancer can and does occur in the absence of cigarette smoke. Exposure to other agents, such as radioactive materials (e.g., radon gas), arsenic, asbestos, chromium, nickel, coal tar, and some organic chemicals, has been shown to be associated with lung cancer, even in the absence of active or passive cigarette smoking.2
A risk factor is an exposure, behavior, or attribute that, if present and active, clearly increases the probability of a particular disease occurring in a group of people compared with an otherwise similar group of people who lack the risk factor. A risk factor, however, is neither a necessary nor a sufficient cause of disease. Smoking is the most important risk factor for bronchogenic carcinoma: men who are heavy smokers have 20 times the risk of lung cancer of men who are nonsmokers. Even so, smoking is neither a sufficient nor a necessary cause of lung cancer.
What about the previously cited quotation, in which the spokesman from the Tobacco Institute suggested that “smoking was a risk factor, though not a cause, of a variety of diseases”? If by “cause” the speaker included only necessary and sufficient causes, he was correct. However, if he included situations in which the presence of the risk factor clearly increased the probability of the disease, he was wrong. An overwhelming proportion of scientists who have studied the question of smoking and lung cancer believe the evidence shows not only that cigarette smoking is a cause of lung cancer, but also that it is the most important cause, even though it is neither a necessary nor a sufficient cause of the disease.
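The three definitions can be restated as simple checks on a table of exposure by disease. The following Python sketch is only an illustration: the function name `classify_exposure` and all counts are hypothetical, and the checks ignore temporality, chance, and confounding, which are taken up later in the chapter. A finite sample can only suggest, never prove, that a cause is sufficient or necessary.

```python
def classify_exposure(exposed_cases, exposed_noncases,
                      unexposed_cases, unexposed_noncases):
    """Label an exposure with the chapter's categories (hypothetical helper).

    Assumes both groups are non-empty and that exposure preceded disease.
    """
    risk_exposed = exposed_cases / (exposed_cases + exposed_noncases)
    risk_unexposed = unexposed_cases / (unexposed_cases + unexposed_noncases)

    labels = []
    # Sufficient: every exposed person develops the disease.
    if exposed_cases > 0 and exposed_noncases == 0:
        labels.append("sufficient cause")
    # Necessary: the disease never occurs without the exposure.
    if exposed_cases > 0 and unexposed_cases == 0:
        labels.append("necessary cause")
    # Risk factor: neither of the above, but risk is higher when exposed.
    if not labels and risk_exposed > risk_unexposed:
        labels.append("risk factor")
    return labels or ["no increased risk observed"]


# Invented counts chosen so that heavy smokers have 20 times the risk.
print(classify_exposure(200, 9800, 10, 9990))   # smoking-like pattern
# Invented counts in which disease occurs only among the exposed.
print(classify_exposure(100, 900, 0, 1000))     # tuberculosis-like pattern
```

With these invented counts the first call yields the risk factor label and the second the necessary cause label, matching the smoking and M. tuberculosis examples in the text.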
The first and most basic requirement for a causal relationship to exist is an association between the outcome of interest (e.g., a disease or death) and the presumed cause. The outcome must occur either significantly more often or significantly less often in individuals who are exposed to the presumed cause than in individuals who are not exposed. In other words, exposure to the presumed cause must make a difference, or it is not a cause. Because some differences would probably occur as a result of random variation, an association must be statistically significant, meaning that the difference must be large enough to be unlikely if the exposure really had no effect. As discussed in Chapter 10, “unlikely” is usually defined as likely to occur no more than 1 time in 20 opportunities (i.e., 5% of the time, or 0.05) by chance alone.
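As a minimal sketch of how such a test of association might be carried out, the Python code below applies the Pearson chi-square test (without continuity correction) to a 2 × 2 table and compares the resulting probability with the conventional 0.05 threshold. The function name `chi_square_2x2` and the counts are assumptions made for illustration; they are not data from any study, and other tests may be more appropriate for small samples.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test for a 2 x 2 table (illustrative helper).

    a = exposed with disease,   b = exposed without disease,
    c = unexposed with disease, d = unexposed without disease.
    Returns the chi-square statistic and its p value (1 degree of freedom).
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper-tail probability of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Invented counts: 30 of 100 exposed and 15 of 100 unexposed become ill.
chi2, p_value = chi_square_2x2(30, 70, 15, 85)
print(f"chi-square = {chi2:.2f}, p = {p_value:.4f}")
print("statistically significant at 0.05:", p_value < 0.05)
```

A significant result here says only that a difference this large would be unlikely if the exposure had no effect; it does not show that the association is causal.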
If an association is causal, the causal pathway may be direct or indirect. The classification depends on the absence or presence of intermediary factors, which are often called intervening variables, mediating variables, or mediators.
A directly causal association occurs when the factor under consideration exerts its effect without intermediary factors. A severe blow to the head would cause brain damage and death without other external causes being required.
An indirectly causal association occurs when one factor influences one or more other factors through intermediary variables. Poverty itself may not cause disease and death, but by preventing adequate nutrition, housing, and medical care, poverty may lead to poor health and premature death. In this case, the nutrition, housing, and medical care would be called intervening variables. Education seems to lead to better health indirectly, presumably because it increases the amount of knowledge about health, the level of motivation to maintain health, and the ability to earn an adequate income.
A statistical association may be strong but may not be causal. In such a case, it would be a noncausal association. An important principle of data analysis is that association does not prove causation. If a statistically significant association is found between two variables, but the presumed cause occurs after the effect (rather than before it), the association is not causal.
For example, studies indicated that estrogen treatments for postmenopausal women were associated with endometrial cancer, so that these treatments were widely considered to be a cause of the cancer. Then it was realized that estrogens often were given to control early symptoms of undiagnosed endometrial cancer, such as bleeding. In cases where estrogens were prescribed after the cancer had started, the presumed cause (estrogens) was actually caused by the cancer. Nevertheless, estrogens are sometimes prescribed long before symptoms of endometrial cancer appear, and some evidence indicates that estrogens may contribute to endometrial cancer.
As another example, quitting smoking is associated with an increased incidence of lung cancer. However, it is unlikely that quitting causes lung cancer or that continuing to smoke would be protective. What is much more likely is that smokers having early, undetectable or undiagnosed lung cancer start to feel sick because of their growing malignant disease. This sick feeling prompts them to stop smoking and thus, temporarily, they feel a little better. When cancer is diagnosed shortly thereafter, it appears that there is a causal association, but this is false. The cancer started before the quitting was even considered. The temporality of the association precludes causation.
Likewise, if a statistically significant association is found between two variables, but some other factor is responsible for both the presumed cause and the presumed effect, the association is not causal. For example, baldness may be associated with the risk of coronary artery disease (CAD), but baldness itself probably does not cause CAD. Both baldness and CAD are probably functions of age, gender, and dihydrotestosterone level.
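A small numeric illustration may help show how a third factor can produce such an association. In the Python sketch below, all counts are invented, and age group alone stands in for the shared factor (the text also mentions gender and dihydrotestosterone level). Within each age group, bald and nonbald men have the same risk of CAD, yet pooling the groups produces an apparent association because baldness and CAD are both more common in the older group.

```python
def risk_ratio(cases_exposed, n_exposed, cases_unexposed, n_unexposed):
    """Risk in the exposed group divided by risk in the unexposed group."""
    return (cases_exposed / n_exposed) / (cases_unexposed / n_unexposed)


# Hypothetical counts per age group:
# (CAD cases among bald, number bald, CAD cases among nonbald, number nonbald)
strata = {
    "younger": (2, 100, 18, 900),    # 2% risk in both groups
    "older": (60, 600, 40, 400),     # 10% risk in both groups
}

# Crude table: add the strata together, ignoring age.
crude_counts = tuple(sum(column) for column in zip(*strata.values()))
crude_rr = risk_ratio(*crude_counts)
stratum_rr = {name: risk_ratio(*counts) for name, counts in strata.items()}

print(f"crude risk ratio: {crude_rr:.2f}")
for name, rr in stratum_rr.items():
    print(f"risk ratio among {name} men: {rr:.2f}")
```

In this invented example the crude risk ratio is close to 2, whereas the risk ratio within each age group is 1. Examining the association separately within levels of the suspected third factor is one simple way to detect this kind of noncausal association.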
Finally, there is always the possibility of bidirectional causation. In other words, each of two variables may reciprocally influence the other. For example, there is an association between the density of fast-food outlets in neighborhoods and people’s purchase and consumption of fast foods. It is possible that people living in neighborhoods dense with sources of fast food consume more of it because fast food is so accessible and available. It is also possible that fast-food outlets choose to locate in neighborhoods where people’s purchasing and consumption patterns reflect high demand. In fact, the influence probably runs to some extent in both directions. This bidirectionality creates something of a feedback loop, reinforcing the placement of new outlets (and potentially the movement of new consumers) into neighborhoods already dense with fast food.
Investigators must have a model of causation to guide their thinking. The scientific method for determining causation can be summarized as having three steps, which should be considered in the following order3:
These steps in epidemiologic investigation are similar in many ways to the steps followed in an investigation of murder, as discussed next.
Investigations may test hypotheses about risk factors or protective factors. For causation to be identified, the presumed risk factor must be present significantly more often in persons with the disease of interest than in persons without the disease. To eliminate chance associations, this difference must be large enough to be considered statistically significant. Conversely, the presumed protective factor (e.g., a vaccine) must be present significantly less often in persons with the disease than in persons without it. When the presumed factor (either a risk factor or a protective factor) is not associated with a statistically different frequency of disease, the factor cannot be considered causal. It might be argued that an additional, unidentified factor, a “negative” confounder (see later), could be obscuring a real association between the factor and the disease. Even in that case, however, the principle is not violated, because proper research design and statistical analysis would show the real association.
The first step in an epidemiologic study is to show a statistical association between the presumed risk or protective factor and the disease. The equivalent early step in a murder investigation is to show a geographic and temporal association between the murderer and the victim—that is, to show that both were in the same place at the same time, or that the murderer was in a place from which he or she could have caused the murder.
The relationship between smoking and lung cancer provides an example of how an association can lead to an understanding of causation. The earliest epidemiologic studies showed that smokers had an average overall death rate approximately two times that of nonsmokers; the same studies also indicated that the death rate for lung cancer among all smokers was approximately 10 times that of nonsmokers.4 These studies led to further research efforts, which clarified the role of cigarette smoking as a risk factor for lung cancer and for many other diseases as well.
In epidemiologic studies the research design must allow a statistical association to be shown, if it exists. This usually means comparing the rate of disease before and after exposure to an intervention that is designed to reduce the disease of interest, or comparing groups with and without exposure to risk factors for the disease, or comparing groups with and without treatment for the disease of interest. Statistical analysis is needed to show that the difference associated with the intervention or exposure is greater than would be expected by chance alone, and to estimate how large this difference is. Research design and statistical analysis work closely together (see Chapter 5).
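To illustrate what estimating the size of a difference can look like when two groups are compared, the Python sketch below computes a risk ratio, an approximate 95% confidence interval based on the standard error of the log risk ratio, and the risk difference. The function name `risk_ratio_with_ci` and the counts are hypothetical, and this large-sample approximation is only one of several methods that could be used.

```python
import math


def risk_ratio_with_ci(cases_exposed, n_exposed,
                       cases_unexposed, n_unexposed, z=1.96):
    """Risk ratio, approximate 95% confidence limits, and risk difference.

    Uses the large-sample standard error of the log risk ratio;
    assumes every count is greater than zero.
    """
    risk_exposed = cases_exposed / n_exposed
    risk_unexposed = cases_unexposed / n_unexposed
    rr = risk_exposed / risk_unexposed
    se_log_rr = math.sqrt(1 / cases_exposed - 1 / n_exposed
                          + 1 / cases_unexposed - 1 / n_unexposed)
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper, risk_exposed - risk_unexposed


# Invented counts: 30 of 100 exposed and 15 of 100 unexposed become ill.
rr, lower, upper, risk_difference = risk_ratio_with_ci(30, 100, 15, 100)
print(f"risk ratio = {rr:.2f} (95% CI {lower:.2f} to {upper:.2f})")
print(f"risk difference = {risk_difference:.2f}")
```

An interval that excludes 1 corresponds to a difference larger than would be expected by chance alone at the 0.05 level, and the width of the interval indicates how precisely the size of the difference has been estimated.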
If a statistically significant difference in risk of disease is observed, the investigator must first consider the direction and extent of the difference. Did therapy make patients better or worse, on average? Was the difference large enough to be etiologically or clinically important? Even if the observed difference is real and large, statistical association does not prove causation. It may seem initially that an association is causal, when in fact it is not. For example, in the era before antibiotics were developed, syphilis was treated with arsenical compounds (e.g., salvarsan), despite their toxicity. An outbreak of fever and jaundice occurred in many of the patients treated with arsenicals.5 At the time, it seemed obvious that the outbreak was caused by the arsenic. Many years later, however, medical experts realized that such outbreaks were most likely caused by an infectious agent, probably hepatitis B or C virus, spread by inadequately sterilized needles during administration of the arsenical compounds. Any statistically significant association can only be caused by one of four possibilities: true causal association, chance (see Chapter 12), random error, or systematic error (bias or its special case, confounding, as addressed later).
Several criteria, if met, increase the probability that a statistical association is true and causal6 (Box 4-2). (These criteria are often attributed to the 19th-century philosopher John Stuart Mill.) In general, a statistical association is more likely to be causal if the criteria in Box 4-2 are met:
Box 4-2 Statistical Association and Causality
Factors that Increase Likelihood of Statistical Association Being Causal