10 Statistical Inference and Hypothesis Testing
Inference means the drawing of conclusions from data. Statistical inference can be defined as the drawing of conclusions from quantitative or qualitative information using the methods of statistics to describe and arrange the data and to test suitable hypotheses.
Because data do not come with their own interpretation, the interpretation must be put into the data by inductive reasoning (from Latin, meaning “to lead into”). This approach to reasoning is less familiar to most people than deductive reasoning (Latin, “to lead out from”), which is learned from mathematics, particularly from geometry.
Deductive reasoning proceeds from the general (i.e., from assumptions, propositions, or formulas considered true) to the specific (i.e., to specific members belonging to the general category). Consider the following two propositions:
If both propositions are true, then the following deduction must be true:
Deductive reasoning is of special use in science after hypotheses are formed. Using deductive reasoning, an investigator can say, “If the following hypothesis is true, then the following prediction or predictions also should be true.” If a prediction can be tested empirically, the hypothesis may be rejected or not rejected on the basis of the findings. If the data are inconsistent with the predictions from the hypothesis, the hypothesis must be rejected or modified. Even if the data are consistent with the hypothesis, however, they cannot prove that the hypothesis is true, as shown in Chapter 4 (see Fig. 4-2).
Clinicians often proceed from formulas accepted as true and from observed data to determine the values that variables must have in a certain clinical situation. For example, if the amount of a medication that can be safely given per kilogram of body weight is known, it is simple to calculate how much of that medication can be given to a patient weighing 50 kg. This is deductive reasoning because it proceeds from the general (a formula) to the specific (the patient).
Inductive reasoning, in contrast, seeks to find valid generalizations and general principles from data. Statistics, the quantitative aid to inductive reasoning, proceeds from the specific (i.e., from data) to the general (i.e., to formulas or conclusions about the data). By sampling a population and determining the age and the blood pressure of the persons in the sample (the specific data), an investigator, using statistical methods, can determine the general relationship between age and blood pressure (e.g., that, on average, blood pressure increases with age).
y = mx + b

This equation is the formula for a straight line in analytic geometry. It is also the formula for simple regression analysis in statistics, although the letters used and their order customarily are different.
In the mathematical formula the b is a constant and stands for the y-intercept (i.e., value of y when the variable x equals 0). The value m also is a constant and stands for the slope (amount of change in y for a unit increase in the value of x). The important point is that in mathematics, one of the variables (x or y) is unknown and needs to be calculated, whereas the formula and the constants are known. In statistics the reverse is true. The variables x and y are known for all persons in the sample, and the investigator may want to determine the linear relationship between them. This is done by estimating the slope and the intercept, which can be done using the form of statistical analysis called linear regression (see Chapter 11).
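The contrast between the two uses of the straight-line formula can be illustrated with a minimal Python sketch. The function names and the age and blood pressure values are hypothetical, invented only for illustration, and ordinary least squares is assumed as the estimation method.

```python
def predict_y(m, b, x):
    """Deduction: the formula and its constants are known; compute y for a given x."""
    return m * x + b


def estimate_line(xs, ys):
    """Induction: x and y are known for each subject; estimate m and b by least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sum of squares around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx              # estimated slope
    b = mean_y - m * mean_x    # estimated y-intercept
    return m, b


# Hypothetical ages (years) and systolic blood pressures (mm Hg).
ages = [20, 30, 40, 50, 60]
pressures = [110, 115, 120, 125, 130]

slope, intercept = estimate_line(ages, pressures)   # from the specific to the general
predicted = predict_y(slope, intercept, 70)         # from the general to the specific
```

In the first function the constants are inputs and y is the unknown. In the second the data are inputs and the constants are the unknowns.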
As a general rule, what is known in statistics is unknown in mathematics, and vice versa. In statistics the investigator starts from specific observations (data) to induce (estimate) the general relationships between variables.
Hypotheses are predictions about what the examination of appropriately collected data will show. This discussion introduces the basic concepts underlying common tests of statistical significance, such as t-tests. These tests estimate how probable an observed difference (e.g., between two means) would be if chance alone were operating; when that probability is small enough, the difference is called statistically significant (i.e., a difference probably not caused by chance). They do this by determining whether the observed difference is convincingly different from what was expected from the model. In basic statistics the model is usually a null hypothesis that there will be no difference between the means.
The discussion in this section focuses on the justification for, and interpretation of, the p value, which is the probability that a difference at least as large as the one observed would occur by chance alone if the null hypothesis were true. The p value is obtained by calculating one of the standard statistical tests. Significance testing is designed to limit the likelihood of making a false-positive conclusion. False-negative conclusions are discussed more fully in Chapter 12 in the section on sample size.
When deciding whether data are consistent or inconsistent with the hypotheses, investigators are subject to two types of error. An investigator could assert that the data support a hypothesis, when in fact the hypothesis is false; this would be a false-positive error, also called an alpha error or a type I error. Conversely, the investigator could assert that the data do not support the hypothesis, when in fact the hypothesis is true; this would be a false-negative error, also called a beta error or a type II error.
Based on the knowledge that scientists become attached to their own hypotheses, and the conviction that the proof in science (as in courts of law) must be “beyond a reasonable doubt,” investigators historically have been particularly careful to avoid false-positive error. This is probably best for theoretical science in general. It also makes sense for hypothesis testing related specifically to medical practice, where the greatest imperative is “first, do no harm” (Latin primum non nocere). Although it often fails in practice to avoid harm, medicine is dedicated to this principle, and the high standards for the avoidance of type I error reflect this. However, medicine is subject to the harms of error in either direction. False-negative error in a diagnostic test may mean missing a disease until it is too late to institute therapy, and false-negative error in the study of a medical intervention may mean overlooking an effective treatment. Therefore, investigators cannot feel comfortable about false-negative errors in either case.
Box 10-1 Process of Testing a Null Hypothesis for Statistical Significance
The first step consists of stating the null hypothesis and the alternative hypothesis. The null hypothesis states that there is no real (true) difference between the means (or proportions) of the groups being compared (or that there is no real association between two continuous variables). For example, the null hypothesis for the data presented in Table 9-2 is that, based on the observed data, there is no true difference between the percentage of men and the percentage of women who had previously had their serum cholesterol levels checked.
It may seem strange to begin the process by asserting that something is not true, but it is much easier to disprove an assertion than to prove that something is true. If the data are not consistent with a hypothesis, the hypothesis should be rejected and the alternative hypothesis accepted instead. Because the null hypothesis states that there is no difference between the means, rejecting it leads to the alternative hypothesis, which states that there is a true difference between the groups being compared. (If the data are consistent with a hypothesis, this still does not prove the hypothesis, because other hypotheses may fit the data equally well or better.)
Consider a hypothetical clinical trial of a drug designed to reduce high blood pressure among patients with essential hypertension (hypertension occurring without an organic cause yet known, such as hyperthyroidism or renal artery stenosis). One group of patients would receive the experimental drug, and the other group (the control group) would receive a placebo. The null hypothesis might be that, after the intervention, the average change in blood pressure in the treatment group will not differ from the average change in blood pressure in the control group. If a test of significance (e.g., t-test on average change in systolic blood pressure) forces rejection of the null hypothesis, the alternative hypothesis—that there was a true difference in the average change in blood pressure between the two groups—would be accepted. As discussed later, there is a statistical distinction between hypothesizing that a drug will or will not change blood pressure, versus hypothesizing whether a drug will or will not lower blood pressure. The former does not specify a directional inclination a priori (before the fact) and suggests a “two-tailed” hypothesis test. The latter does suggest a directional inclination and suggests a “one-tailed” test.
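The practical consequence of choosing a one-tailed or a two-tailed test can be sketched numerically. The following minimal Python example assumes, for simplicity, a test statistic z that follows the standard normal distribution (rather than the t distribution used in a t-test) and assumes that a positive z lies in the hypothesized direction. The function name and the value of z are illustrative only.

```python
from statistics import NormalDist


def one_and_two_tailed_p(z):
    """Return (one_tailed, two_tailed) p values for a standard normal statistic z.

    Assumption: the one-tailed hypothesis predicts a positive z.
    """
    std_normal = NormalDist()  # mean 0, standard deviation 1
    one_tailed = 1.0 - std_normal.cdf(z)                # area in the upper tail only
    two_tailed = 2.0 * (1.0 - std_normal.cdf(abs(z)))   # area in both tails
    return one_tailed, two_tailed


# An illustrative statistic that falls between the two critical values.
one_tailed_p, two_tailed_p = one_and_two_tailed_p(1.70)
```

With z = 1.70 the one-tailed p value is below 0.05, whereas the two-tailed p value is above it. This shows why the two-tailed test requires a more extreme result to reject the null hypothesis.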
Second, before doing any calculations to test the null hypothesis, the investigator must establish a criterion called the alpha level, which is the highest risk of making a false-positive error that the investigator is willing to accept. By custom, alpha is usually set at 0.05. This says that the investigator is willing to run a 5% risk (but no more) of being in error when rejecting the null hypothesis and asserting that the treatment and control groups truly differ. In choosing an arbitrary alpha level, the investigator inserts a value judgment into the process. Because that is done before the data are collected, however, it avoids the post hoc (after the fact) bias of adjusting the alpha level to make the data show statistical significance after the investigator has looked at the data.
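The meaning of alpha as a false-positive risk can be illustrated by simulation. In the following minimal Python sketch, both groups are drawn from the same hypothetical population, so the null hypothesis is true by construction and every rejection is a false-positive error. For simplicity the sketch assumes a z-test with a known population standard deviation of 1 rather than a t-test; the function name, sample sizes, and seed are arbitrary choices.

```python
import random
from statistics import NormalDist, mean


def false_positive_rate(n_trials=2000, n_per_group=30, alpha=0.05, seed=1):
    """Estimate how often a true null hypothesis is rejected at a given alpha."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    # Standard error of the difference between two means when each
    # observation has a known standard deviation of 1.
    se_diff = (2.0 / n_per_group) ** 0.5
    rejections = 0
    for _ in range(n_trials):
        group_a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        group_b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        z = (mean(group_a) - mean(group_b)) / se_diff
        p_value = 2.0 * (1.0 - std_normal.cdf(abs(z)))  # two-tailed
        if p_value <= alpha:
            rejections += 1  # a false-positive (type I) error
    return rejections / n_trials


rate = false_positive_rate()
```

Under these assumptions the proportion of rejections should come out close to the chosen alpha of 0.05, varying somewhat with the seed and the number of simulated trials.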
An everyday analogy may help to simplify the logic of the alpha level and the process of significance testing. Suppose that a couple were given instructions to buy a silver bracelet for a friend during a trip, if one could be bought for $50 or less. Any more would be too high a price to pay. Alpha is similar to the price limit in the analogy. When alpha has been set (e.g., at p ≤0.05, analogous to ≤$50 in the illustration), an investigator would buy the alternative hypothesis of a true difference if, but only if, the cost (in terms of the probability of being wrong in rejecting the null hypothesis) were no greater than 1 in 20 (0.05). The alpha is analogous to the amount an investigator is willing to pay, in terms of the risk of being wrong, if he or she rejects the null hypothesis and accepts the alternative hypothesis.
When the alpha level is established, the next step is to obtain the p value for the data. To do this, the investigator must perform a suitable statistical test of significance on appropriately collected data, such as data obtained from a randomized controlled trial (RCT). This chapter and Chapter 11 focus on some suitable tests. The p value obtained by a statistical test (e.g., t-test, described later) gives the probability of obtaining a result at least as extreme as the one observed by chance alone, that is, if the null hypothesis were true. When the probability of an outcome being caused by chance is sufficiently remote, the null hypothesis is rejected. The p value states specifically just how remote that probability is.
Usually, if the observed p value in a study is ≤0.05, members of the scientific community who read about an investigation accept the difference as being real. Although setting alpha at ≤0.05 is arbitrary, this level has become so customary that it is wise to provide explanations for choosing another alpha level or for choosing not to perform tests of significance at all, which may be the best approach in some descriptive studies. Similarly, two-tailed tests of hypothesis, which require a more extreme result to reject the null hypothesis than do one-tailed tests, are the norm; a one-tailed test should be well justified. When the directional effect of a given intervention (e.g., it can be neutral or beneficial, but is certain not to be harmful) is known with confidence, a one-tailed test can be justified (see later discussion).
If the p value is found to be greater than the alpha level, the investigator fails to reject the null hypothesis. Failing to reject the null hypothesis is not the same as accepting the null hypothesis as true. Rather, it is similar to a jury’s finding that the evidence did not prove guilt (or in the example here, did not prove the difference) beyond a reasonable doubt. In the United States a court trial is not designed to prove innocence. The defendant’s innocence is assumed and must be disproved beyond a reasonable doubt. Similarly, in statistics, a lack of difference is assumed, and it is up to the statistical analysis to show that the null hypothesis is unlikely to be true. The rationale for using this approach in medical research is similar to the rationale in the courts. Although the courts are able to convict the guilty, the goal of exonerating the innocent is an even higher priority. In medicine, confirming the benefit of a new treatment is important, but avoiding the use of ineffective therapies is an even higher priority (first, do no harm).
If the p value is found to be less than or equal to the alpha level, the next step is to reject the null hypothesis and to accept the alternative hypothesis, that is, the hypothesis that there is in fact a real difference or association. Although it may seem awkward, this process is now standard in medical science and has yielded considerable scientific benefits.
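The decision rule for comparing the p value with the alpha level can be written compactly. The following Python sketch is only an illustration; the function name and the wording of the returned conclusions are assumptions, not part of any standard library.

```python
def decide(p_value, alpha=0.05):
    """Compare a p value with a pre-specified alpha level and state the conclusion."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("a p value must lie between 0 and 1")
    if p_value <= alpha:
        # The alternative hypothesis of a real difference is accepted.
        return "reject the null hypothesis"
    # This is not the same as accepting the null hypothesis as true.
    return "fail to reject the null hypothesis"


conclusion_a = decide(0.03)   # p is below alpha
conclusion_b = decide(0.20)   # p is above alpha
```

Note that alpha is an argument fixed before the p value is examined, which mirrors the requirement that the alpha level be chosen before the data are analyzed.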
Most tests of significance relate to a difference between two means or proportions of a variable (e.g., a decrease in blood pressure). The two groups are often a treatment group and a control group. These tests help investigators decide whether an observed difference is real, which in statistical terms is defined as whether the difference is greater than would be expected by chance alone. In the example of the experimental drug to reduce blood pressure in hypertensive patients, the experimenters would measure the blood pressures of the study participants under experimental conditions before and after the new drug or placebo is given. They would determine the average change seen in the treatment group and the average change seen in the control group and pursue tests to determine whether the difference was large enough to be unlikely to have occurred by chance alone. The fundamental process in this particular test of significance would be to see if the mean blood pressure changes in the two study groups were different from each other.
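The idea of a difference "greater than would be expected by chance alone" can be made concrete with a permutation test. This is a different technique from the t-test discussed in this chapter and is used here only because it shows the logic directly: group labels are shuffled repeatedly to see how often chance alone produces a difference in means at least as large as the one observed. The data below are hypothetical changes in systolic blood pressure, and the function name, number of permutations, and seed are arbitrary.

```python
import random
from statistics import mean


def permutation_p_value(treatment, control, n_permutations=5000, seed=0):
    """Two-tailed permutation p value for a difference between two group means."""
    rng = random.Random(seed)
    observed = abs(mean(treatment) - mean(control))
    pooled = list(treatment) + list(control)
    n_treat = len(treatment)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign group labels at random
        diff = abs(mean(pooled[:n_treat]) - mean(pooled[n_treat:]))
        if diff >= observed:
            at_least_as_extreme += 1
    # Adding 1 to numerator and denominator keeps the estimate above 0.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Hypothetical changes in systolic blood pressure (mm Hg) after the intervention.
treatment_change = [-12, -9, -15, -11, -8, -14, -10, -13]
control_change = [-2, 1, -3, 0, -1, 2, -4, -1]

p_value = permutation_p_value(treatment_change, control_change)
```

Because the two hypothetical groups do not overlap at all, random relabeling almost never reproduces so large a difference, and the resulting p value is far below 0.05.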
Why not just inspect the means to see if they were different? This is inadequate because it is unknown whether the observed difference was unusual or whether a difference that large might have been found frequently if the experiment were repeated. Although the investigators examine the findings in particular patients, their real interest is in determining whether the findings of the study could be generalized to other, similar hypertensive patients. To generalize beyond the participants in the single study, the investigators must know the extent to which the differences discovered in the study are reliable. The estimate of reliability is given by the standard error, which is not the same as the standard deviation discussed in Chapter 9.
Chapter 9 focused on individual observations and the extent to which they differed from the mean. One assertion was that a normal (gaussian) distribution could be completely described by its mean and standard deviation. Figure 9-6 showed that, for a truly normal distribution, 68% of observations fall within the range described as the mean ± 1 standard deviation, 95.4% fall within the range of the mean ± 2 standard deviations, and 95% fall within the range of the mean ± 1.96 standard deviations. This information is useful in describing individual observations (raw data), but it is not directly useful when comparing means or proportions.
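The percentages quoted for the normal distribution can be checked with a short Python sketch using the standard library class `statistics.NormalDist`. The function name is illustrative.

```python
from statistics import NormalDist


def coverage_within(k):
    """Proportion of a normal distribution lying within k standard deviations of the mean."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return std_normal.cdf(k) - std_normal.cdf(-k)


within_1_sd = coverage_within(1.0)      # about 0.68
within_2_sd = coverage_within(2.0)      # about 0.954
within_1_96_sd = coverage_within(1.96)  # about 0.95
```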
Because most research is done on samples, rather than on complete populations, we need to have some idea of how close the mean of our study sample is likely to come to the real-world mean (i.e., the mean in the underlying population from which the sample came). If we took 100 samples (such as might be done in multicenter trials), the means in our samples would differ from each other, but they would cluster around the true mean. We could plot the sample means just as we could plot individual observations, and if we did so, these means would show their own distribution. For samples of reasonable size, this distribution of means is also approximately a normal (gaussian) distribution, with its own mean and standard deviation. The standard deviation of the distribution of means is given a different name, the standard error, because it helps us to estimate the probable error of our sample mean’s estimate of the true population mean. The standard error calculated from a sample is an estimate of the standard error in the entire population from which the sample was taken. (Technically, the sample variance is an unbiased estimator of the population variance, and the standard deviation, although not quite unbiased, is close enough to being unbiased that it works well.)
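The behavior of repeated sample means can be explored by simulation. The following Python sketch draws many samples from a hypothetical normal population (mean 120, standard deviation 12, both invented for illustration) and compares the standard deviation of the resulting sample means with the population standard deviation divided by the square root of the sample size. The function name, sample size, number of samples, and seed are arbitrary.

```python
import random
from math import sqrt
from statistics import mean, stdev


def simulate_sample_means(pop_mean, pop_sd, sample_size, n_samples, seed=0):
    """Draw repeated samples from a normal population and return their means."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_samples):
        sample = [rng.gauss(pop_mean, pop_sd) for _ in range(sample_size)]
        means.append(mean(sample))
    return means


# Hypothetical population of systolic blood pressures (mm Hg).
sample_means = simulate_sample_means(pop_mean=120, pop_sd=12,
                                     sample_size=36, n_samples=2000)

mean_of_means = mean(sample_means)   # clusters near the true mean of 120
observed_se = stdev(sample_means)    # standard deviation of the sample means
expected_se = 12 / sqrt(36)          # population SD / square root of sample size = 2.0
```

In such a simulation the standard deviation of the sample means comes out close to the expected standard error, and the mean of the means comes out close to the population mean.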
The standard error is a parameter that enables the investigator to do two things that are central to the function of statistics. One is to estimate the probable amount of error around a quantitative assertion (called “confidence limits”). The other is to perform tests of statistical significance. If the standard deviation and sample size of one research sample are known, the standard error can be estimated.
The data shown in Table 10-1 can be used to explore the concept of standard error. The table lists the systolic and diastolic blood pressures of 26 young, healthy, adult subjects. To determine the range of expected variation in the estimate of the mean blood pressure obtained from the 26 subjects, the investigator would need an unbiased estimate of the variation in the underlying population. How can this be done with only one small sample?
Although the proof is not shown here, an unbiased estimate of the standard error can be obtained from the standard deviation of a single research sample if the standard deviation was originally calculated using the degrees of freedom (N − 1) in the denominator (see Chapter 9). The formula for converting this standard deviation (SD) to a standard error (SE) is as follows:

SE = SD / √N
The larger the sample size (N), the smaller is the standard error, and the better the estimate of the population mean. At any given point on the x-axis, the height of the bell-shaped curve for the distribution of the sample means represents the relative probability that a single sample mean would have that value. Most of the time, the sample mean would be near the true mean, which would be estimated closely by the mean of the means. Less often, it would be farther away from the average of the sample means.
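The conversion from standard deviation to standard error, and the effect of sample size, can be shown in a few lines of Python. The function names and the five blood pressure values are hypothetical. The standard library function `statistics.stdev` uses N − 1 in the denominator, as the conversion requires.

```python
from math import sqrt
from statistics import stdev


def se_from_sd(sd, n):
    """Convert a standard deviation to a standard error for a sample of size n."""
    return sd / sqrt(n)


def standard_error(values):
    """Standard error of the mean computed directly from raw observations."""
    if len(values) < 2:
        raise ValueError("at least two observations are needed")
    return se_from_sd(stdev(values), len(values))  # stdev divides by N - 1


# Hypothetical systolic blood pressures (mm Hg).
observations = [110, 120, 130, 115, 125]
se_small_sample = standard_error(observations)

# The same SD with a larger sample gives a smaller standard error.
se_n25 = se_from_sd(10, 25)     # 2.0
se_n100 = se_from_sd(10, 100)   # 1.0
```

Quadrupling the sample size halves the standard error, because the sample size enters the formula through its square root.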
In the medical literature, means are often reported either as the mean ± 1 SD or as the mean ± 1 SE. Reported data must be examined carefully to determine whether the SD or the SE is shown. Either is acceptable in theory because an SD can be converted to an SE, and vice versa, if the sample size is known. Many journals have a policy, however, stating whether the SD or SE must be reported. The sample size should always be shown.
The SD shows the variability of individual observations, whereas the SE shows the variability of means. The mean ± 1.96 SD estimates the range in which 95% of individual observations would be expected to fall, whereas the mean ± 1.96 SE estimates the range in which 95% of the means of repeated samples of the same size would be expected to fall. If the value for the mean ± 1.96 SE is known, it can be used to calculate the 95% confidence interval, which is the range of values in which the investigator can be 95% confident that the true mean of the underlying population falls. Other confidence intervals, such as the 99% confidence interval, also can be determined easily. Box 10-2 shows the calculation of the SE and the 95% confidence interval for the systolic blood pressure data in Table 10-1.
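The calculation of a 95% confidence interval from a mean, a standard deviation, and a sample size can be sketched as follows. The summary values are hypothetical and are not the data of Table 10-1. The multiplier 1.96 follows the text; for small samples a slightly larger multiplier taken from the t distribution is often used instead, which this sketch does not attempt.

```python
from math import sqrt


def confidence_interval_95(sample_mean, sd, n):
    """Return (lower, upper) limits of the 95% confidence interval for a mean."""
    se = sd / sqrt(n)      # standard error of the mean
    margin = 1.96 * se     # half-width of the interval
    return sample_mean - margin, sample_mean + margin


# Hypothetical summary statistics: mean 120 mm Hg, SD 10 mm Hg, N = 25.
lower, upper = confidence_interval_95(sample_mean=120, sd=10, n=25)
# SE = 10 / 5 = 2, so the interval is 120 ± 3.92, or 116.08 to 123.92.
```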
Box 10-2 Calculation of Standard Error and 95% Confidence Interval for Systolic Blood Pressure Values of 26 Subjects