12 Applying Statistics to Trial Design
Sample Size, Randomization, and Control for Multiple Hypotheses
II. RANDOMIZING STUDY PARTICIPANTS
III. CONTROLLING FOR THE TESTING OF MULTIPLE HYPOTHESES
I Sample Size
The determination of sample size is critical in planning clinical research because sample size is usually the most important factor determining the time and funding necessary to perform the research. The sample size has a profound impact on the likelihood of finding statistical significance.
Members of committees responsible for evaluating and funding clinical studies look closely at the assumptions used to estimate the number of study participants needed and at the way in which calculations of sample size were performed. Part of their task when reviewing the sample size is to determine whether the proposed research is realistic (e.g., whether adequate participants are included in the intervention and control groups in a randomized clinical trial, or in the groups of cases and controls in a case-control study). In research reported in the literature, inadequate sample size may explain why apparently useful clinical results are not statistically significant.
Statisticians are probably consulted more often because an investigator wants to know the sample size needed for a study than for any other reason. Sample size calculations can be confusing, even for people who can do ordinary statistical analyses without trouble. As a test of intuition regarding sample size, try to answer the following three questions:
1. What size sample—large or small—would be needed if there was a very large variance in the outcome variable?
2. What size sample would be needed if the investigator wanted the answer to be extremely close to the true value (i.e., have narrow confidence limits or small p value)?
3. What size sample would be needed if the difference that the investigator wanted to be able to detect was extremely small?
If your intuition suggested that all these requirements would create the need for a large sample size, you would be correct. If intuition did not suggest the correct answers, review these questions again after reading the following information about how the basic formulas for sample size are derived.
Other factors affecting the number of participants required for a study include whether the:
1. Research design involves paired data (e.g., each subject has a pair of observations from two points in time—before treatment and after treatment) or unpaired data (e.g., observations are compared between an experimental group and a control group).
2. Investigator anticipates a large or small variance in the variable of interest.
3. Investigator wants to consider beta (β, type II or false-negative) errors in addition to alpha (α, type I or false-positive) errors.
4. Investigator chooses the usual alpha level (p value of 0.05 or confidence interval of 95%) or chooses a smaller level of alpha.
5. Alpha chosen is one-sided or two-sided.
6. Investigator wants to be able to detect a fairly small or extremely small difference between the means or proportions of the outcome variable.
A Derivation of Basic Sample Size Formula
To derive the basic formula for calculating the sample size, it is easiest to start with the formula for the paired t-test (see Chapter 10):
where tα is the critical ratio to determine the probability of a false-positive (α) error if the null hypothesis is rejected; is the mean difference that was observed in the outcome variable, sd is the standard deviation of the before-after differences, and N is the sample size.
To solve for N, several rearrangements and substitutions of terms in the equation must be made. First, everything can be squared and the equation rearranged so that N is in the numerator and s2 is the variance of the distribution of d:
Next, the terms can be rearranged so that the equation for N in a paired (before and after) study becomes:
Now the t in the formula must be replaced with z. This provides a solution to a circular problem: To know the value of t, the degrees of freedom (df) must be known. The df depends on N, however, which is what the investigator is initially trying to calculate. Because the value of z is not dependent on df, and because z is close to t when the sample size is large, z can be used instead of t. The formula becomes:
In theory, using z instead of t might produce a slight underestimate of the sample size needed. In practice, however, using z seems to work well, and its use is customary. The previous formula is for a study using the paired t-test, in which each participant serves as his or her own control. For a study using Student’s t-test, such as a randomized controlled trial (RCT) with an experimental group and a control group, it would be necessary to calculate N for each group. The previous formula considers only the problem of alpha error; to minimize the possibility of beta error, a z term for beta error must be introduced as well. Before these topics are discussed, however, the answers to the three questions posed earlier should be explored more fully in light of the information provided by the formula for the calculation of N.
1. The larger the variance (s2), the larger the sample size must be because the variance is in the numerator of the formula for N. This makes sense intuitively because with a large variance (and large standard error), a larger N is needed to compensate for the greater uncertainty of the estimate.
2. To have considerable confidence that a mean difference shown in a study is real, the analysis must produce a small p value for the observed mean difference, which implies that the value for tα or zα was large. Because zα is in the numerator of the sample size formula, the larger zα is, the larger the N (the sample size) that is needed. For a two-tailed test, a p value of 0.05 (the alpha level chosen) would require a zα of 1.96, which, when squared as in the formula, would equal 3.84. To be even more confident, the investigator might set alpha at 0.01. This would require a zα of 2.58, which equals 6.66 when squared, 73% greater than when alpha is set at 0.05. To decrease the probability of being wrong from 5% to 1% would require the sample size to be almost doubled.
3. If the investigator wanted to detect with confidence a very small difference between the mean values of two study groups (i.e., a small ), a very large N would be needed because the difference (squared) is in the denominator. The smaller the denominator is, the larger the ratio is, and the larger the N must be. A precise estimate and a large sample size are needed to detect a small difference.
Whether a small difference is considered clinically important often depends on the topic of research. Studies showing that a new treatment for hypertension reduces the systolic blood pressure by 2 to 3 mm Hg would be considered clinically trivial. Studies showing that a new treatment for pancreatic cancer improves the survival rate by 10% (0.1) would be considered a major advance. Clinical judgment is involved in determining the minimum difference that should be considered clinically important.
B Beta (False-Negative) Error
If a difference is examined with a t-test, and it is statistically significant at the prestated level of alpha (e.g., 0.05), beta error is not an issue. What if a reported finding seems to be clinically important, but it is not “statistically significant” in that study? Here the question of a possible false-negative (beta) error becomes important. Beta error may have occurred because the sample size was too small. When planning a study, investigators want to avoid the likelihood of beta (false-negative) error and the likelihood of alpha (false-positive) error, and readers of the literature should be on the lookout for this problem as well. The relationship between the results of a study and the true status can be seen in a “truth table” (Table 12-1). The similarity of Table 12-1 to the relationship between a test result and the disease status is obvious (compare with Table 7-1).
A seminal article illustrated the need to be concerned about beta error: in most of 71 negative RCTs of new therapies published in prominent medical journals, the sample sizes were too small “to provide reasonable assurance that a clinically meaningful ‘difference’ (i.e., therapeutic effect) would not be missed.”1 In the study, “reasonable assurance” was 90%. In 94% of these negative studies, the sample size was too small to detect a 25% improvement in outcome with reasonable (90%) assurance. In 75% of the studies, the sample size was too small to detect a 50% improvement in outcome with the same level of assurance. Evidence indicates that this problem has persisted over time.2
A study with a large beta error has a low sensitivity for detecting a true difference because, as discussed in Chapter 7:
When investigators speak of a research study versus a clinical test, however, they usually use the term “statistical power” instead of “sensitivity.” With this substitution in terms:
which means that statistical power is equal to (1 − beta error). When calculating a sample size, if the investigators accept a 20% possibility of missing a true finding (beta error = 0.2), the study should have a statistical power of 0.8, or 80%. That means the investigators are 80% confident that they would be able to detect a true mean difference of the size they specify with the sample size they determine. The best way to incorporate beta error into a study is to include it beforehand in the determination of sample size. Incorporating the statistical term for beta error (zβ) in the sample size calculation is simple but likely to increase the sample size considerably.
C Steps in Calculation of Sample Size
The first step in calculating sample size is to choose the appropriate formula to use, based on the type of study and the type of error to be considered. Four common formulas for calculating sample size are discussed in this chapter and listed in Table 12-2, and their use is illustrated in Boxes 12-1 through 12-4.3 The second step in calculating sample size requires that the investigators specify the following values: the variance expected (s2); the zα value for the level of alpha desired; the smallest clinically important difference ; and, usually, beta (measured as zβ). All values except the variance must come from clinical and research judgment, although the estimated variance should be based on knowledge of data. If the outcome variable being studied is continuous, such as blood pressure, the estimate of variance can be obtained from the literature or from a small pilot study.
Box 12-1 Calculation of Sample Size for a Study Using the Paired ± t-Test and Considering Alpha Error Only
Part 2 Calculation of Sample Size (N)
Interpretation: Only nine participants are needed for this study because each paired subject serves as his or her own control in a before-and-after study, thereby greatly reducing the variance (i.e., variance between a subject and him/herself before and after a particular treatment is almost certain to be much less than the variance between one person and another). When the estimated N is a fraction, the N used should be rounded up to be safe.
Box 12-2 Calculation of Sample Size for a Study Using Student’s t-Test and Considering Alpha Error Only
Part 2 Calculation of Sample Size (N)
Interpretation: For the type of study depicted in this box, 18 participants are needed in the experimental group and 18 in the control group, for a total N of 36 study participants. The N needed to be rounded up. The total N needed in this box is four times as large as the total N needed in Box 12-1, although the values for zα, s, and are the same in both boxes. One reason for the larger sample size for a randomized controlled trial is that there are two groups, and the N calculated is for the intervention group only. The other reason is the fact that there are two variances to consider (i.e., the intervention and control groups contribute to the overall variance), so the estimated variance must be multiplied by 2.