Why Bother?
Computer analysis of data offers the opportunity of handling large data sets that might otherwise be beyond our capabilities. However, do not be tempted to ‘have a go’ at statistical analyses simply because they are available on the computer. The validity of the conclusions drawn relies on the appropriate analysis being conducted in any given circumstance, and on the assumptions underlying the proposed statistical analysis being satisfied.
Are the Data Normally Distributed?
Many analyses make assumptions about the underlying distribution of the data. The following procedures verify approximate Normality, the most common of the distributional assumptions.
- We produce a dot plot (for small samples) or a histogram, stem-and-leaf plot (Fig. 4.2) or box plot (Fig. 6.1) to show the empirical frequency distribution of the data (Chapter 4). We conclude that the distribution is approximately Normal if it is bell-shaped and symmetrical. The median in a box plot should cut the rectangle defining the first and third quartiles in half, and the two whiskers should be of equal length if the data are Normally distributed.
- Alternatively, we can produce a Normal plot (preferably on the computer), which plots the Standard Normal deviate corresponding to the cumulative relative frequency of each observation against the sample values. Lack of Normality is indicated if the plotted points follow a curve rather than a straight line (Fig. 35.1).
Although both approaches are subjective, the Normal plot is more effective for smaller samples. The Kolmogorov–Smirnov and Shapiro–Wilk tests, both performed on the computer, can be used to assess Normality more objectively.
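The construction of a Normal plot can be sketched in a few lines of Python using only the standard library. This is an illustrative sketch rather than a definitive method: the function names, the plotting position (i - 0.5)/n and the two small data sets are assumptions made for the example, and the correlation between the sample values and their Standard Normal deviates is used only as a rough numerical summary of how straight the plot is, not as a formal test.

```python
from math import sqrt
from statistics import NormalDist, fmean


def normal_plot_points(sample):
    """Return (sample value, Standard Normal deviate) pairs for a Normal plot."""
    ordered = sorted(sample)
    n = len(ordered)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Assumed plotting position: (i - 0.5) / n for the i-th ordered value.
    # Other conventions exist and give very similar plots.
    deviates = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(ordered, deviates))


def normal_plot_correlation(sample):
    """Pearson correlation between the sample values and their Normal deviates."""
    points = normal_plot_points(sample)
    values = [value for value, _ in points]
    deviates = [z for _, z in points]
    mean_v, mean_z = fmean(values), fmean(deviates)
    sum_vz = sum((v - mean_v) * (z - mean_z) for v, z in zip(values, deviates))
    sum_vv = sum((v - mean_v) ** 2 for v in values)
    sum_zz = sum((z - mean_z) ** 2 for z in deviates)
    return sum_vz / sqrt(sum_vv * sum_zz)


# Hypothetical data, invented purely for illustration
symmetric = [4.1, 4.8, 5.0, 5.2, 5.5, 5.9, 6.0, 6.3, 6.8, 7.4]
skewed = [1, 1, 2, 2, 3, 4, 6, 10, 25, 60]

r_symmetric = normal_plot_correlation(symmetric)
r_skewed = normal_plot_correlation(skewed)
print(f"symmetric sample: r = {r_symmetric:.3f}")
print(f"skewed sample:    r = {r_skewed:.3f}")
```

A correlation close to 1 corresponds to points lying near a straight line; the skewed sample gives a noticeably lower value because its points follow a curve. For a formal assessment, statistical packages provide the tests named above; in Python, for example, `scipy.stats.shapiro` performs the Shapiro–Wilk test.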
Are Two or More Variances Equal?
We explained how to use the t-test (Chapter 21) to compare two means and ANOVA (Chapter 22) to compare more than two means. Underlying these analyses is the assumption that the variability of the observations in each group is the same, i.e. we require equal variances, described as homogeneity of variance or homoscedasticity. We have heterogeneity of variance if the variances are unequal.
- We can use Levene’s test, using a computer program, to test for homogeneity of variance in two or more groups. The null hypothesis is that all the variances are equal. Levene’s test has the advantage that it is not strongly dependent on the assumption of Normality. Bartlett’s test can also be used to compare more than two variances, but it is non-robust to departures from Normality.
- We can use the F-test (variance-ratio test) described in the following box to compare two variances, provided the data in each group are approximately Normally distributed (the test is non-robust to a violation of this assumption). The two estimated variances are $s_1^2$ and $s_2^2$, calculated from $n_1$ and $n_2$ observations, respectively. By convention, we choose $s_1^2$ to be the larger of the two variances, if they differ.
- We also assume homogeneity of variance of the residuals in simple and multiple regression (Chapters 28 and 29) and in random effects models (Chapter 42). We explained how to check this assumption in Chapters 28 and 29.
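The test statistics for the variance-ratio test and for Levene's test can be sketched as follows. This is a minimal illustration under stated assumptions: the function names and the two data sets are hypothetical, and the version of Levene's statistic shown uses absolute deviations from each group mean (some programs use the group median instead). The code returns only the statistic and its degrees of freedom; the P-value would then be obtained from the F-distribution, using tables or a statistical package.

```python
from statistics import fmean, variance


def variance_ratio(group_a, group_b):
    """Return (F, numerator df, denominator df), larger variance on top."""
    var_a, var_b = variance(group_a), variance(group_b)
    if var_a >= var_b:
        return var_a / var_b, len(group_a) - 1, len(group_b) - 1
    return var_b / var_a, len(group_b) - 1, len(group_a) - 1


def levene_statistic(*groups):
    """Return (W, numerator df, denominator df) for Levene's test.

    Assumption: deviations are measured from each group's mean.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    # Absolute deviation of every observation from its own group mean
    abs_devs = []
    for g in groups:
        centre = fmean(g)
        abs_devs.append([abs(x - centre) for x in g])
    group_means = [fmean(z) for z in abs_devs]
    grand_mean = fmean([z for zs in abs_devs for z in zs])
    # One-way ANOVA on the absolute deviations
    between = sum(len(zs) * (m - grand_mean) ** 2
                  for zs, m in zip(abs_devs, group_means))
    within = sum((z - m) ** 2
                 for zs, m in zip(abs_devs, group_means) for z in zs)
    w = (n_total - k) / (k - 1) * between / within
    return w, k - 1, n_total - k


# Hypothetical data, invented purely for illustration
group_1 = [10, 12, 14, 16, 18]
group_2 = [13, 14, 14, 15, 14]

f_ratio, df_num, df_den = variance_ratio(group_1, group_2)
w, w_df_num, w_df_den = levene_statistic(group_1, group_2)
print(f"F = {f_ratio:.2f} on ({df_num}, {df_den}) degrees of freedom")
print(f"Levene W = {w:.2f} on ({w_df_num}, {w_df_den}) degrees of freedom")
```

Because `variance_ratio` always places the larger estimated variance in the numerator, the order in which the groups are supplied does not matter. In practice a package routine would normally be used; for example, `scipy.stats.levene` implements Levene's test, with a `center` argument that selects mean-based or median-based deviations.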