Errors in hypothesis testing




Making a Decision


Most hypothesis tests in medical statistics compare groups of people who are exposed to a variety of experiences. We may, for example, be interested in comparing the effectiveness of two forms of treatment for reducing 5-year mortality from breast cancer. For a given outcome (e.g. death), we call the comparison of interest (e.g. the difference in 5-year mortality rates) the effect of interest or, if relevant, the treatment effect. We express the null hypothesis in terms of no effect (e.g. the 5-year mortality from breast cancer is the same in the two treatment groups); the two-sided alternative hypothesis is that the effect is not zero. We perform a hypothesis test that enables us to decide whether we have enough evidence to reject the null hypothesis (Chapter 17). We can make one of two decisions: either we reject the null hypothesis, or we do not reject it.


Making the Wrong Decision


Although we hope we will draw the correct conclusion about the null hypothesis, we have to recognize that, because we only have a sample of information, we may make the wrong decision when we reject/do not reject the null hypothesis. The possible mistakes we can make are shown in Table 18.1.



  • Type I error: we reject the null hypothesis when it is true, and conclude that there is an effect when, in reality, there is none. The maximum chance (probability) of making a Type I error is denoted by α (alpha). This is the significance level of the test (Chapter 17); we reject the null hypothesis if our P-value is less than the significance level, i.e. if P < α.
    We must decide on the value of α before we collect our data. We usually assign a conventional value of 0.05 to it, although we might choose a more restrictive value such as 0.01 (if we are particularly concerned about the consequences of incorrectly rejecting the null hypothesis) or a less restrictive value such as 0.10 (if we do not want to miss a real effect). Our chance of making a Type I error will never exceed our chosen significance level, say α = 0.05, because we will only reject the null hypothesis if P < 0.05. If we find that P ≥ 0.05, we will not reject the null hypothesis, and, consequently, not make a Type I error.
  • Type II error: we do not reject the null hypothesis when it is false, and conclude that there is no evidence of an effect when one really exists. The chance of making a Type II error is denoted by β (beta); its complement, (1 − β), is the power of the test. The power, therefore, is the probability of rejecting the null hypothesis when it is false; i.e. it is the chance (usually expressed as a percentage) of detecting, as statistically significant, a real treatment effect of a given size.
    Ideally, we should like the power of our test to be 100%; we must recognize, however, that this is impossible because there is always a chance, albeit slim, that we could make a Type II error. Fortunately, we know which factors affect power, and thus we can control the power of a test by giving consideration to them. The simulation sketch after Table 18.1 illustrates both error rates for an unpaired t-test.

Table 18.1 The Consequences of Hypothesis Testing.

                Reject H0         Do not reject H0
H0 true         Type I error      No error
H0 false        No error          Type II error
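
We can make these two error rates concrete with a small simulation. The sketch below repeatedly draws two samples and applies an unpaired t-test (Chapter 21), first under the null hypothesis and then under an alternative. It is a minimal sketch in Python; every numerical setting in it (the means, standard deviation, group size and number of simulated trials) is an illustrative assumption, not a value taken from this chapter.

# Monte Carlo sketch of Type I and Type II error rates for an unpaired
# t-test. All numerical settings are assumptions chosen for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)
alpha, n, sims = 0.05, 30, 10_000

# Under H0 (no treatment effect) both groups share the same mean, so the
# proportion of rejections estimates the Type I error rate (close to alpha).
type_i = sum(
    stats.ttest_ind(rng.normal(10, 2, n), rng.normal(10, 2, n)).pvalue < alpha
    for _ in range(sims)
) / sims

# Under H1 (a real difference of 1.5 units) the proportion of rejections
# estimates the power, 1 - beta; its complement estimates beta.
power = sum(
    stats.ttest_ind(rng.normal(10, 2, n), rng.normal(11.5, 2, n)).pvalue < alpha
    for _ in range(sims)
) / sims

print(f"Type I error rate  ~ {type_i:.3f} (nominal alpha = {alpha})")
print(f"Power (1 - beta)   ~ {power:.3f}")
print(f"Type II error rate ~ {1 - power:.3f}")

With these assumed settings the estimated Type I error rate sits close to 0.05 and the estimated power close to 80%, the conventional minimum discussed in the next section.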

Power and Related Factors


It is essential that we know the power of a proposed test at the planning stage of our investigation. Clearly, we should only embark on a study if we believe that it has a ‘good’ chance of detecting a clinically relevant effect, if one exists (by ‘good’ we mean that the power should be at least 80%). It is ethically irresponsible, and wasteful of time and resources, to undertake a clinical trial that has, say, only a 40% chance of detecting a real treatment effect.


A number of factors have a direct bearing on power for a given test.



  • The sample size: power increases with increasing sample size. This means that a large sample has a greater ability than a small sample to detect a clinically important effect if it exists. When the sample size is very small, the test may have inadequate power to detect a particular effect. We explain how to choose sample size, with power considerations, in Chapter 36. The methods can also be used to evaluate the power of the test for a specified sample size.
  • The variability of the observations: power increases as the variability of the observations decreases (Fig. 18.1).
  • The effect of interest: the power of the test is greater for larger effects. A hypothesis test thus has a greater chance of detecting a large real effect than a small one.
  • The significance level: the power is greater if the significance level is larger. Increasing α (the maximum chance of a Type I error) decreases β (the chance of a Type II error) and therefore increases the power, 1 − β. So, we are more likely to detect a real effect if we decide at the planning stage that we will regard our P-value as significant if it is less than 0.05 rather than less than 0.01. We can see this relationship between power and the significance level in Fig. 18.2. The sketch following this list puts illustrative numbers on all four factors.
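
These four factors can be explored numerically. The following sketch uses the analytic power routines in the Python statsmodels library (TTestIndPower) to recompute the power of a two-sided unpaired t-test while each factor is varied in turn; the baseline figures (a difference in means of 2.5, SD of 3, 20 per group, α = 0.05) are assumptions chosen purely for illustration.

# Sketch: how each factor moves the power of a two-sided unpaired t-test.
# Baseline numbers are illustrative assumptions, not values from the text.
from statsmodels.stats.power import TTestIndPower

calc = TTestIndPower()

def power(diff=2.5, sd=3.0, n=20, alpha=0.05):
    # statsmodels works with the standardized effect size, difference/SD
    return calc.power(effect_size=diff / sd, nobs1=n, alpha=alpha,
                      ratio=1.0, alternative='two-sided')

print(f"baseline                 : {power():.2f}")
print(f"larger sample (n = 40)   : {power(n=40):.2f}")        # power rises
print(f"less variability (SD = 2): {power(sd=2.0):.2f}")      # power rises
print(f"larger effect (diff = 4) : {power(diff=4.0):.2f}")    # power rises
print(f"stricter alpha (0.01)    : {power(alpha=0.01):.2f}")  # power falls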


Figure 18.1 Power curves showing the relationship between power and the sample size in each of two groups for the comparison of two means using the unpaired t-test (Chapter 21). Each power curve relates to a two-sided test for which the significance level is 0.05, and the effect of interest (e.g. the difference between the treatment means) is 2.5. The assumed equal standard deviation of the measurements in the two groups is different for each power curve (see Example 1, Chapter 36).
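
The caption gives enough information to recompute curves of this shape. Below is a minimal sketch, again using statsmodels' TTestIndPower, with the stated effect of interest of 2.5 and significance level of 0.05; the two standard deviations used are assumed values, since the book's own come from Example 1, Chapter 36, which is not reproduced in this section.

# Recomputing power curves in the style of Fig. 18.1 for an unpaired
# t-test (two-sided, alpha = 0.05, effect of interest = 2.5).
# The SD values below are assumptions; the figure's own come from
# Example 1, Chapter 36.
from statsmodels.stats.power import TTestIndPower

calc = TTestIndPower()
for sd in (2.0, 4.0):
    print(f"assumed SD = {sd}")
    for n in (5, 10, 20, 40, 80):
        p = calc.power(effect_size=2.5 / sd, nobs1=n,
                       alpha=0.05, alternative='two-sided')
        print(f"  n per group = {n:3d}: power = {p:.2f}")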


