Why Transform?
The observations in our investigation may not comply with the requirements of the intended statistical analysis (Chapter 35).
- A variable may not be Normally distributed, a distributional requirement for many different analyses.
- The spread of the observations in each of a number of groups may be different (constant variance is an assumption about a parameter in the comparison of means using the unpaired t-test and analysis of variance – Chapters 21 and 22).
- Two variables may not be linearly related (linearity is an assumption in many regression analyses – Chapters 27–33 and 42).
It is often helpful to transform our data to satisfy the assumptions underlying the proposed statistical techniques.
How Do We Transform?
We convert our raw data into transformed data by taking the same mathematical transformation of each observation. Suppose we have n observations (y1, y2, …, yn) on a variable, y, and we decide that the log transformation is suitable. We take the log of each observation to produce (log y1, log y2, …, log yn). If we call the transformed variable z, then zi = log yi for each i (i = 1, 2, …, n), and our transformed data may be written (z1, z2, …, zn).
We check that the transformation has achieved its purpose of producing a data set that satisfies the assumptions of the planned statistical analysis (e.g. by plotting a histogram of the transformed data – see Chapter 35), and proceed to analyse the transformed data (z1, z2, …, zn). We often back-transform any summary measures (such as the mean) to the original scale of measurement. Hypothesis tests (Chapter 17), however, are performed on the transformed data, and we rely on the conclusions drawn from them.
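The steps above (transform each observation, summarise on the transformed scale, back-transform the summary) can be sketched in a few lines of Python. This is a minimal illustration only: the three observations are hypothetical values chosen so that the arithmetic is easy to follow, and the variable names are not taken from the text.

```python
import math
import statistics

# Hypothetical raw observations (illustrative values, not from the text)
y = [1.0, 10.0, 100.0]

# Apply the same transformation to every observation: z_i = log10(y_i)
z = [math.log10(yi) for yi in y]

# Summarise on the transformed scale
mean_z = statistics.mean(z)

# Back-transform the summary measure to the original scale (antilog, base 10)
back_transformed_mean = 10 ** mean_z

# For comparison: the ordinary arithmetic mean of the raw data
arithmetic_mean = statistics.mean(y)

print(z, mean_z, back_transformed_mean, arithmetic_mean)
```

With these values the back-transformed mean is 10, whereas the arithmetic mean of the raw data is 37. The back-transformed mean of logged data is the geometric mean of the original observations, so it should not be expected to equal the arithmetic mean.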
Typical Transformations
The Logarithmic Transformation, z = log y
When log transforming data, we can choose to take logs either to base 10 (log10 y, the ‘common’ log) or to base e (loge y or ln y, the ‘natural’ or Napierian log), or to any other base, but must be consistent for a particular variable in a data set. Note that we cannot take the log of a negative number or of zero. The back-transformation of a log is called the antilog; the antilog of a Napierian log is the exponential function, exp(z), i.e. e raised to the power z.
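As a small sketch of the points above, the following Python fragment compares the two common bases, applies the matching antilog to each, and shows what happens with zero or negative values. The observation 50.0 and the helper name safe_log are hypothetical, introduced only for illustration.

```python
import math

y = 50.0  # hypothetical positive observation

log10_y = math.log10(y)  # common log (base 10)
ln_y = math.log(y)       # natural log (base e)

# Logs to different bases differ only by a constant factor:
# ln y = ln(10) * log10 y, so the choice of base does not change the shape
# of the transformed data, provided it is used consistently.
ratio = ln_y / log10_y

# Each log has its own antilog: 10**z for base 10, exp(z) for base e
back_from_log10 = 10 ** log10_y
back_from_ln = math.exp(ln_y)

def safe_log(value):
    """Return ln(value), or None when value is zero or negative."""
    try:
        return math.log(value)
    except ValueError:  # raised by math.log for non-positive input
        return None

print(ratio, back_from_log10, back_from_ln, safe_log(0.0), safe_log(-1.0))
```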
- If y is skewed to the right, z = log y is often approximately Normally distributed (Fig. 9.1a). Then y has a Lognormal distribution (Chapter 8).
- If there is an exponential relationship between y and another variable, x, so that the resulting curve bends upward when y (on the vertical axis) is plotted against x (on the horizontal axis), then the relationship between z = log y and x is approximately linear (Fig. 9.1b).
- Suppose we have different groups of observations, each comprising measurements of a continuous variable, y. We may find that the groups that have the higher values of y also have larger variances. In particular, if the coefficient of variation (the standard deviation divided by the mean) of y is constant for all the groups, the log transformation, z = log y, produces groups that have similar variances (Fig. 9.1c).
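The three properties listed above can be illustrated with a short Python sketch. All of the data are artificial: the simulated Lognormal sample, its seed, the coefficients of the exponential curve and the group values are assumptions chosen for illustration, and the skewness helper is a simple moment-based measure rather than a method described in this chapter.

```python
import math
import random
import statistics

def skewness(values):
    """Moment-based skewness (positive means a longer right tail)."""
    m = statistics.mean(values)
    s = statistics.pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

# (a) Right-skewed y: simulated Lognormal data (seed and parameters arbitrary)
rng = random.Random(1)
y_skewed = [rng.lognormvariate(0.0, 1.0) for _ in range(5000)]
z_skewed = [math.log(v) for v in y_skewed]
skew_y = skewness(y_skewed)  # clearly positive
skew_z = skewness(z_skewed)  # close to zero after taking logs

# (b) Exponential relationship: y = 2 * exp(0.5 * x) (hypothetical coefficients)
x = [0, 1, 2, 3, 4, 5]
y_exp = [2.0 * math.exp(0.5 * xi) for xi in x]
z_exp = [math.log(v) for v in y_exp]
# On the log scale, equal steps in x give equal steps in z: a straight line
z_steps = [z_exp[i + 1] - z_exp[i] for i in range(len(x) - 1)]

# (c) Groups with a constant coefficient of variation: each group is the same
# set of values multiplied by a different scale factor
base = [8.0, 9.0, 10.0, 11.0, 12.0]
groups = {scale: [scale * v for v in base] for scale in (1, 10, 100)}
cv = {k: statistics.stdev(g) / statistics.mean(g) for k, g in groups.items()}
var_raw = {k: statistics.variance(g) for k, g in groups.items()}
var_log = {k: statistics.variance([math.log(v) for v in g])
           for k, g in groups.items()}

print("skewness before/after:", skew_y, skew_z)
print("steps in log y:", z_steps)
print("CV by group:", cv)
print("raw variances:", var_raw)
print("log variances:", var_log)
```

In this constructed example the raw variances grow with the group mean while the variances of the logged values agree to within rounding error. Real data would show only approximate agreement.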