Describing data: the ‘spread’




Summarizing Data


If we can provide two summary measures of a continuous variable, one that indicates the ‘average’ value and another that describes the ‘spread’ of the observations, then we have condensed the data in a meaningful way. We explained how to choose an appropriate average in Chapter 5. We devote this chapter to the most common measures of spread (also called dispersion or variability), which are compared in Table 6.1.


Table 6.1 Advantages and disadvantages of measures of spread.
























Range
Advantages:
  • Easily determined
Disadvantages:
  • Uses only two observations
  • Distorted by outliers
  • Tends to increase with increasing sample size

Ranges based on percentiles
Advantages:
  • Usually unaffected by outliers
  • Independent of sample size
  • Appropriate for skewed data
Disadvantages:
  • Clumsy to calculate
  • Cannot be calculated for small samples
  • Uses only two observations
  • Not algebraically defined

Variance
Advantages:
  • Uses every observation
  • Algebraically defined
Disadvantages:
  • Units of measurement are the square of the units of the raw data
  • Sensitive to outliers
  • Inappropriate for skewed data

Standard deviation
Advantages:
  • Same advantages as the variance
  • Units of measurement are the same as those of the raw data
  • Easily interpreted
Disadvantages:
  • Sensitive to outliers
  • Inappropriate for skewed data

The Range


The range is the difference between the largest and smallest observations in the data set; you may find these two values quoted instead of their difference. Note that the range provides a misleading measure of spread if there are outliers (Chapter 3).
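As a rough illustration of the definition above, the following Python sketch computes the range of a small sample and shows how one outlier distorts it. The birth weights are hypothetical values chosen for the example, and `value_range` is an illustrative helper name, not something taken from the text.

```python
# Hypothetical birth weights in kg (illustrative values, not from the text).
weights = [2.9, 3.1, 3.2, 3.4, 3.5, 3.6, 3.8]


def value_range(values):
    """Return (smallest, largest, difference) for a non-empty sequence."""
    lo, hi = min(values), max(values)
    return lo, hi, hi - lo


lo, hi, spread = value_range(weights)
print(f"range: {lo} to {hi} (difference {spread:.1f})")

# Adding a single outlier changes the range even though the
# remaining observations are exactly the same as before.
with_outlier = weights + [5.9]
print(f"with outlier: difference {value_range(with_outlier)[2]:.1f}")
```

The helper returns the two extreme values as well as their difference, since either form may be quoted.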


Ranges Derived from Percentiles


What Are Percentiles?


Suppose we arrange our data in order of magnitude, starting with the smallest value of the variable, x, and ending with the largest value. The value of x that has 1% of the observations in the ordered set lying below it (and 99% of the observations lying above it) is called the 1st percentile. The value of x that has 2% of the observations lying below it is called the 2nd percentile, and so on. The values of x that divide the ordered set into 10 equally sized groups, that is the 10th, 20th, 30th, …, 90th percentiles, are called deciles. The values of x that divide the ordered set into four equally sized groups, that is the 25th, 50th and 75th percentiles, are called quartiles. The 50th percentile is the median (Chapter 5).
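The definitions above can be sketched with Python's standard `statistics` module. The data are hypothetical birth weights in grams, invented for the example. Note that statistical packages use slightly different interpolation rules when a percentile falls between two observations, so other software may give slightly different values; this sketch simply uses the default method of `statistics.quantiles`.

```python
import statistics

# Hypothetical birth weights in grams (illustrative values, not from the text).
weights = [2450, 2700, 2850, 2950, 3000, 3100, 3150, 3200, 3250, 3300,
           3350, 3400, 3450, 3500, 3600, 3700, 3800, 3950, 4200]

# quantiles(data, n=k) returns the k - 1 cut points that divide the
# ordered data into k equally sized groups.
quartiles = statistics.quantiles(weights, n=4)   # 25th, 50th, 75th percentiles
deciles = statistics.quantiles(weights, n=10)    # 10th, 20th, ..., 90th percentiles

print("quartiles:", quartiles)
print("deciles:  ", deciles)
print("median:   ", statistics.median(weights))
```

In this sample the second quartile, the fifth decile and the median coincide, as expected for the 50th percentile.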


Using Percentiles


We can obtain a measure of spread that is not influenced by outliers by excluding the extreme values in the data set and then determining the range of the remaining observations. The interquartile range is the difference between the 1st and the 3rd quartiles, i.e. between the 25th and 75th percentiles (Fig. 6.1). It contains the central 50% of the observations in the ordered set, with 25% of the observations lying below its lower limit and 25% lying above its upper limit. The interdecile range contains the central 80% of the observations, i.e. those lying between the 10th and 90th percentiles. Often we use the range that contains the central 95% of the observations, i.e. the range that excludes 2.5% of the observations above its upper limit and 2.5% below its lower limit (Fig. 6.1). Provided it is calculated from enough values of the variable in healthy individuals, we may use this interval to diagnose disease. It is then called the reference interval, reference range or normal range (see Chapter 38).
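The three percentile-based ranges can be sketched as follows, again using Python's standard `statistics` module. The sample is simulated (79 hypothetical birth weights in grams drawn with a fixed seed), `central_range` is an illustrative helper, and the cut points follow the default interpolation rule of `statistics.quantiles`, which may differ slightly from other software. A real reference interval would need far more observations, taken from healthy individuals.

```python
import random
import statistics

# Hypothetical sample: 79 simulated birth weights in grams (not real data).
rng = random.Random(0)
weights = sorted(round(rng.gauss(3300, 450)) for _ in range(79))


def central_range(values, n, lower, upper):
    """Return cut points number `lower` and `upper` of the n-quantiles."""
    cuts = statistics.quantiles(values, n=n)  # n - 1 cut points
    return cuts[lower - 1], cuts[upper - 1]


q1, q3 = central_range(weights, 4, 1, 3)          # 25th and 75th percentiles
d1, d9 = central_range(weights, 10, 1, 9)         # 10th and 90th percentiles
p2_5, p97_5 = central_range(weights, 40, 1, 39)   # 2.5th and 97.5th percentiles

iqr = q3 - q1
interdecile = d9 - d1
central_95 = p97_5 - p2_5
print(f"interquartile range: {q1} to {q3} (width {iqr})")
print(f"interdecile range:   {d1} to {d9} (width {interdecile})")
print(f"central 95% range:   {p2_5} to {p97_5} (width {central_95})")

# Replace the largest value with an extreme one: the full range changes,
# but the interquartile range is unaffected.
distorted = weights[:-1] + [9000]
print("full range before:", max(weights) - min(weights))
print("full range after: ", max(distorted) - min(distorted))
print("IQR after:        ", central_range(distorted, 4, 1, 3))
```

The last three lines illustrate why ranges based on percentiles are preferred when outliers are present: only the full range responds to the extreme value.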



Figure 6.1 A box-and-whisker plot of the baby’s weight at birth (Chapter 2). This figure illustrates the median, the interquartile range, the range that contains the central 95% of the observations and the maximum and minimum values.




Posted May 9, 2017 in General & Family Medicine
