## Chapter Contents

Components of Sample Size Calculations

Effect of Selecting α Error and Power

Estimation of Population Parameters

Low Power With Limited Available Participants

Sample Size Samba

Sample Size Modification

Futility of *post hoc* Power Calculations

What Should Readers Look for in Sample Size Calculations?

Conclusion

Investigators should properly calculate sample sizes before the start of their randomised trials and adequately describe the details in their published report. In these *a priori* calculations, determining the effect size to detect (e.g., event rates in treatment and control groups) reflects inherently subjective clinical judgements. Furthermore, these judgements greatly affect sample size calculations. We question the branding of trials as unethical on the basis of an imprecise sample size calculation process. So-called underpowered trials might be acceptable if investigators use methodological rigour to eliminate bias, properly report to avoid misinterpretation, and always publish results to avert publication bias. Some shift of emphasis from a fixation on sample size to a focus on methodological quality would yield more trials with less bias. Unbiased trials with imprecise results trump no results at all.

Sample size calculations for randomised trials seem unassailable. Indeed, investigators should properly calculate sample sizes and adequately describe the key details in their published report. Research methodologists describe the approaches in books and articles. Protocol committees and ethics review boards require adherence. CONSORT reporting guidelines clearly specify the reporting of sample size calculations. Everyone agrees. An important impetus to this unanimity burst on the medical world more than a quarter of a century ago. A group of researchers, led by Tom Chalmers, published a landmark article detailing the lack of statistical power in so-called negative randomised trials published in premier general medical journals. In Chalmers’ long illustrious career, he published hundreds of articles. This article on sample size and power received many citations. Paradoxically, that troubled him. He regarded it as the most damaging paper that he had ever coauthored. Why? We will describe his concerns later, so stay tuned.

## Components of Sample Size Calculations

Calculating sample sizes for trials with dichotomous outcomes (e.g., sick versus well) requires four components: type I error (α), power, event rate in the control group, and a treatment effect of interest (or analogously an event rate in the treatment group). These basic components persist through calculations with other types of outcomes, although other assumptions can be necessary. For example, with quantitative outcomes and a typical statistical test, investigators might assume a difference between means and a variance for the means. One other component of sample size calculations is the allocation ratio for the probability of assignment to the treatment groups. Throughout this chapter, we have assumed that ratio to be 1:1, which means the probability of assignment to the two groups is equal. Most medical researchers choose this 1:1 allocation ratio for their trials. Although a 1:1 allocation ratio usually maximises trial power, ratios up to 2:1 reduce the power only minimally. In Chapter 12, we discuss the implications of altering the allocation ratio on trial costs, power, and sample size.
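The effect of the allocation ratio on power can be explored with a short calculation. The following is a minimal sketch, not the authors' method: it assumes a two-sided test of two proportions with an unpooled normal approximation, and the function name `approx_power`, its parameters, and the event rates and total sample size are illustrative choices. Under this approximation the exact loss of power depends on the event rates and on which group receives more participants.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(p_control, p_treat, n_total, ratio=1.0, alpha=0.05):
    """Approximate power of a two-sided test comparing two proportions.

    ratio is the treatment:control allocation ratio (hypothetical parameter);
    the unpooled normal approximation used here is an assumption.
    """
    n_control = n_total / (1.0 + ratio)
    n_treat = n_total - n_control
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    # Standard error of the difference in observed event rates.
    se = sqrt(p_control * (1 - p_control) / n_control
              + p_treat * (1 - p_treat) / n_treat)
    return std_normal.cdf(abs(p_control - p_treat) / se - z_alpha)


# Same total number of participants, different allocation ratios.
for ratio in (1.0, 1.5, 2.0, 3.0):
    power = approx_power(0.10, 0.06, n_total=1924, ratio=ratio)
    print(f"{ratio}:1 allocation -> approximate power {power:.3f}")
```

Holding the total sample size fixed while varying only `ratio` isolates the cost of unequal allocation from the other inputs.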

In clinical research, hypothesis testing risks two fundamental errors (Panel 11.1). First, researchers can conclude that two treatments differ when, in fact, they do not. The type I error (α) is the probability of making this false-positive conclusion. By convention, α is usually set at 0.05, meaning that investigators desire a less than 5% chance of making a false-positive conclusion. Second, researchers can conclude that two treatments do not differ when, in fact, they do (i.e., a false-negative conclusion). The type II error (β) is the probability of this false-negative conclusion. Conventionally, investigators set β at 0.20, meaning that they desire less than a 20% chance of making a false-negative conclusion.

Panel 11.1

| Term | Definition |
| --- | --- |
| Type I Error (α) | The probability of detecting a statistically significant difference when the treatments are in reality equally effective (i.e., the chance of a false-positive result). |
| Type II Error (β) | The probability of not detecting a statistically significant difference when a difference of a given magnitude in reality exists (i.e., the chance of a false-negative result). |
| Power (1 – β) | The probability of detecting a statistically significant difference when a difference of a given magnitude really exists. |

Power derives from β error. Mathematically, it is the complement of β (i.e., 1 – β) and represents the probability of avoiding a false-negative conclusion. For example, for β = 0.20, the power would be 0.80, or 80%. Stated alternatively, power represents the likelihood of detecting a difference (as significant, with *p* < α), assuming a difference of a given magnitude exists. For example, a trial with a power of 80% has an 80% chance of detecting a difference between two treatments if a real difference of assumed magnitude exists in the population.
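These definitions can be checked empirically by simulating many trials. The sketch below is illustrative only: it assumes a pooled two-proportion z test, and the function names, group size, event rates, number of simulated trials, and seed are all hypothetical choices. When the two true event rates are equal, the fraction of simulated trials with *p* < α estimates the type I error; when they differ, the same fraction estimates power.

```python
import random
from math import sqrt
from statistics import NormalDist


def two_sided_p(events_a, events_b, n):
    """Two-sided P value from a pooled two-proportion z test, n per group."""
    pooled = (events_a + events_b) / (2 * n)
    if pooled in (0.0, 1.0):
        return 1.0  # no variation observed, so no evidence of a difference
    se = sqrt(2 * pooled * (1 - pooled) / n)
    z = abs(events_a - events_b) / n / se
    return 2 * (1 - NormalDist().cdf(z))


def rejection_rate(p_control, p_treat, n, alpha=0.05, trials=1000, seed=1):
    """Fraction of simulated trials that reach p < alpha."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        events_c = sum(rng.random() < p_control for _ in range(n))
        events_t = sum(rng.random() < p_treat for _ in range(n))
        if two_sided_p(events_c, events_t, n) < alpha:
            rejections += 1
    return rejections / trials


# Equal true rates: the rejection rate estimates the type I error (alpha).
print("estimated type I error:", rejection_rate(0.10, 0.10, n=400))
# Different true rates: the rejection rate estimates power (1 - beta).
print("estimated power:", rejection_rate(0.10, 0.06, n=400))
```

Because the estimates come from a finite number of simulated trials, they fluctuate around the true values; increasing `trials` narrows that fluctuation at the cost of run time.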

Admittedly, understanding α error, β error, and power can be a challenge. Convention, however, usually guides investigators for inputs into sample size calculations. The other inputs cause lesser conceptual difficulties but produce pragmatic problems. Investigators estimate the true event rates in the treatment and control groups as inputs. Usually, we recommend estimating the event rate in the population and then determining a treatment effect of interest. For example, investigators might estimate an event rate of 10% in the controls. They would then specify an absolute change (e.g., an absolute reduction of 3 percentage points), a relative change (e.g., a relative reduction of 30%), or simply a 7% event rate in the treatment group. Using these assumptions, investigators calculate sample sizes. Standard texts describe the procedures, which cover binary, continuous, and time-to-event measures, among others. Commonly, investigators use sample size and power software (preferably with guidance from a statistician). Most hand calculations diabolically strain human limits, even for the easiest formula, such as we offer in Panel 11.2.

In this chapter, we address sample size and power for the most commonly used design, a parallel group randomised superiority trial. With this design, investigators seek to find whether one treatment is superior to another. Failure to demonstrate superiority does not demonstrate equivalence. In contrast, investigators can undertake a noninferiority trial, which strives to establish whether a new treatment is not worse than a standard (reference) treatment by more than a tolerable amount. Noninferiority of the new treatment relative to the standard treatment interests investigators because the new treatment has some other benefit, such as greater accessibility, reduced cost, less invasiveness, fewer adverse effects, or greater ease of administration. Noninferiority trials are much less commonly conducted in medicine, and we do not address calculating sample sizes for them in this chapter. However, we recommend the CONSORT extension paper on equivalence and noninferiority trials for interested readers.
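Returning to the effect-size inputs for a superiority trial: the three ways of stating the treatment effect described earlier (an absolute reduction, a relative reduction, or a treatment-group event rate) are interchangeable once the control event rate is fixed. A minimal sketch of the conversion, using the 10% control rate from the text; the function name `treatment_rate` and its parameters are hypothetical:

```python
def treatment_rate(control_rate, absolute_reduction=None, relative_reduction=None):
    """Return the treatment-group event rate implied by one effect-size input."""
    if (absolute_reduction is None) == (relative_reduction is None):
        raise ValueError("give exactly one of absolute_reduction or relative_reduction")
    if absolute_reduction is not None:
        # Absolute reduction is expressed on the same scale as the event rate.
        return control_rate - absolute_reduction
    # Relative reduction is a fraction of the control event rate.
    return control_rate * (1 - relative_reduction)


control = 0.10
from_absolute = treatment_rate(control, absolute_reduction=0.03)
from_relative = treatment_rate(control, relative_reduction=0.30)

print("treatment rate from absolute reduction:", round(from_absolute, 4))
print("treatment rate from relative reduction:", round(from_relative, 4))
print("implied risk ratio:", round(from_relative / control, 4))
```

Both routes lead to the same 7% treatment-group event rate, so the choice among them is a matter of which quantity is easiest to justify clinically.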

Panel 11.2

| Symbol | Meaning |
| --- | --- |
| $n$ | The sample size of each of the groups |
| $p_1$ | The event rate in the treatment group (not in formula, but implied when $R$ and $p_2$ are estimated) |
| $p_2$ | The event rate in the control group |
| $R$ | The risk ratio ($p_1 / p_2$) |

$$n = \frac{10.51\,[(R + 1) - p_2 (R^2 + 1)]}{p_2 (1 - R)^2}$$

For example, we estimate a 10% event rate in the control group ($p_2 = 0.10$) and determine that the clinically important difference to detect is a 40% reduction ($R = 0.60$) with the new treatment at α = 0.05 and power = 0.90. (Note: $R = 0.60$ equates to an event rate in the treatment group of $p_1 = 0.06$, i.e., $R = 6\%/10\%$.)

$$n = \frac{10.51\,[(0.60 + 1) - 0.10\,(0.60^2 + 1)]}{0.10\,(1 - 0.60)^2}$$

$n = 961.665 \approx 962$ in each group (PASS software version 6.0 [NCSS, Kaysville, UT], with a more accurate formula, yields 965).
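The worked example can be reproduced in a few lines. This is a sketch rather than a replacement for dedicated sample size software: the function names are hypothetical, and the helper `constant` rests on an assumption not stated in the panel, namely that the multiplier 10.51 equals $(z_{1-\alpha/2} + z_{1-\beta})^2$ for a two-sided α of 0.05 and power of 0.90. That expression does reproduce 10.51 to two decimal places.

```python
from math import ceil
from statistics import NormalDist


def constant(alpha=0.05, power=0.90):
    """Assumed origin of the multiplier: (z for two-sided alpha + z for power) squared."""
    std_normal = NormalDist()
    return (std_normal.inv_cdf(1 - alpha / 2) + std_normal.inv_cdf(power)) ** 2


def n_per_group(p2, R, k=10.51):
    """Sample size per group from the panel formula.

    p2 is the control event rate, R the risk ratio, k the multiplier.
    """
    return k * ((R + 1) - p2 * (R ** 2 + 1)) / (p2 * (1 - R) ** 2)


n = n_per_group(p2=0.10, R=0.60)
print("n per group:", round(n, 3), "rounded up to", ceil(n))
print("multiplier for alpha 0.05, power 0.90:", round(constant(), 2))
```

Rounding up with `ceil` reflects that a group cannot contain a fraction of a participant.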

This formula accommodates alternate α levels and power by replacing 10.51 with the appropriate value from the following table.