Summarizing Data
If we are able to provide two summary measures of a continuous variable, one that gives an indication of the ‘average’ value and the other that describes the ‘spread’ of the observations, then we have condensed the data in a meaningful way. We explained how to choose an appropriate average in Chapter 5. We devote this chapter to a discussion of the most common measures of spread (dispersion or variability), which are compared in Table 6.1.
Measure of spread | Advantages | Disadvantages |
Range | | |
Ranges based on percentiles | | |
Variance | | |
Standard deviation | | |
The Range
The range is the difference between the largest and smallest observations in the data set; you may find these two values quoted instead of their difference. Note that the range provides a misleading measure of spread if there are outliers (Chapter 3), because it depends only on the two most extreme observations.
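As a minimal sketch of this definition, the Python snippet below computes the range and shows how a single outlier inflates it. The function name value_range and the readings are hypothetical, invented purely for illustration; they are not data from this book.

```python
def value_range(values):
    """Return (smallest, largest, largest - smallest) for a non-empty sequence."""
    smallest, largest = min(values), max(values)
    return smallest, largest, largest - smallest


# Hypothetical readings, made up for illustration only.
readings = [112, 118, 121, 125, 130, 134]
# The same readings with one extreme value (an outlier) added.
with_outlier = readings + [210]

print(value_range(readings))      # the two extremes and their difference
print(value_range(with_outlier))  # one outlier changes the range markedly
```

Returning the two extreme values alongside their difference mirrors the common practice of quoting the smallest and largest observations rather than a single number.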
Ranges Derived from Percentiles
What Are Percentiles?
Suppose we arrange our data in order of magnitude, starting with the smallest value of the variable, x, and ending with the largest value. The value of x that has 1% of the observations in the ordered set lying below it (and 99% of the observations lying above it) is called the 1st percentile. The value of x that has 2% of the observations lying below it is called the 2nd percentile, and so on. The values of x that divide the ordered set into 10 equally sized groups, that is the 10th, 20th, 30th, …, 90th percentiles, are called deciles. The values of x that divide the ordered set into four equally sized groups, that is the 25th, 50th and 75th percentiles, are called quartiles. The 50th percentile is the median (Chapter 5).
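The relationship between percentiles, deciles, quartiles and the median can be illustrated with a short Python sketch. It assumes the standard library function statistics.quantiles with its "inclusive" method; statistical packages use slightly different interpolation rules, so cut points from other software may differ a little in small samples. The data set (the integers 1 to 101) is artificial and chosen only so that the cut points fall exactly on observed values.

```python
import statistics

# Artificial data: the integers 1 to 101, already in order of magnitude.
data = list(range(1, 102))

# 99 cut points dividing the ordered set into 100 equally sized groups.
percentiles = statistics.quantiles(data, n=100, method="inclusive")
# 9 cut points (10 groups) and 3 cut points (4 groups).
deciles = statistics.quantiles(data, n=10, method="inclusive")
quartiles = statistics.quantiles(data, n=4, method="inclusive")

# percentiles[k - 1] holds the k-th percentile.
print("1st percentile:", percentiles[0])
print("deciles:", deciles)
print("quartiles:", quartiles)
print("50th percentile:", percentiles[49], "median:", statistics.median(data))
```

In this sketch the deciles are simply every tenth percentile, and the middle quartile, the 50th percentile and the median coincide.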
Using Percentiles
We can obtain a measure of spread that is not influenced by outliers by excluding the extreme values in the data set, and then determining the range of the remaining observations. The interquartile range is the difference between the 1st and the 3rd quartiles, i.e. between the 25th and 75th percentiles (Fig. 6.1). It contains the central 50% of the observations in the ordered set, with 25% of the observations lying below its lower limit, and 25% of them lying above its upper limit. The interdecile range contains the central 80% of the observations, i.e. those lying between the 10th and 90th percentiles. Often we use the range that contains the central 95% of the observations, i.e. it excludes 2.5% of the observations above its upper limit and 2.5% below its lower limit (Fig. 6.1). We may use this interval, provided it is calculated from enough values of the variable in healthy individuals, to diagnose disease. It is then called the reference interval, reference range or normal range (see Chapter 38).
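The following Python sketch shows one way the three percentile-based ranges might be computed, and why they are unaffected by an outlier. The helper central_interval is a hypothetical name, not a library function, and it assumes the "inclusive" interpolation rule of statistics.quantiles; other software may give slightly different limits. The data are artificial: the integers 1 to 1000 plus one extreme value.

```python
import statistics


def central_interval(values, coverage_pct):
    """Return (lower, upper) limits enclosing the central coverage_pct% of values.

    Hypothetical helper for illustration; assumes 'inclusive' interpolation.
    """
    # 999 cut points give a resolution of 0.1%, enough for 2.5% tails.
    cuts = statistics.quantiles(values, n=1000, method="inclusive")
    # Size of each excluded tail, in units of 0.1%.
    tail = round((100 - coverage_pct) / 2 * 10)
    return cuts[tail - 1], cuts[-tail]


# Artificial data: 1 to 1000, plus a single extreme outlier.
values = list(range(1, 1001)) + [1_000_000]

q1, q3 = central_interval(values, 50)      # 25th and 75th percentiles
d1, d9 = central_interval(values, 80)      # 10th and 90th percentiles
low, high = central_interval(values, 95)   # 2.5th and 97.5th percentiles

print("range:", max(values) - min(values))
print("interquartile range:", q3 - q1)
print("interdecile range:", d9 - d1)
print("central 95% interval:", (low, high))
```

The outlier dominates the ordinary range, but it lies beyond every limit computed here, so none of the percentile-based ranges changes when it is added. A reference interval used for diagnosis would be obtained in the same way, but only from a sufficiently large sample of healthy individuals.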