What Is It?
We may be interested in the effect of several explanatory variables, x1, x2, …, xk, on a response variable, y. If we believe that these x’s may be inter-related, we should not look, in isolation, at the effect on y of changing the value of a single x, but should simultaneously take into account the values of the other x’s. For example, as there is a strong relationship between a child’s height and weight, we may want to know whether the relationship between height and systolic blood pressure (Chapter 28) is changed when we take the child’s weight into account. Multiple linear regression allows us to investigate the joint effect of these explanatory variables on y; it is an example of a multivariable analysis where we relate a single outcome variable to two or more explanatory variables simultaneously. Note that, although the explanatory variables are sometimes called independent variables, this is a misnomer because they may be related.
We take a sample of n individuals, and measure the value of each of the variables on every individual. The multiple linear regression equation which estimates the relationships in the population is:
Y = a + b1x1 + b2x2 + … + bkxk
- xi is the ith explanatory variable or covariate (i = 1, 2, 3, …, k);
- Y is the estimated, predicted, expected, mean or fitted value of y, which corresponds to a particular set of values of x1, x2, …, xk;
- a is a constant term, the estimated intercept; it is the value of Y when all the x’s are zero;
- b1, b2, …, bk are the estimated partial regression coefficients; b1 represents the amount by which Y increases on average if we increase x1 by one unit but keep all the other x’s constant (i.e. adjust or control for them). If there is a relationship between x1 and the other x’s, b1 differs from the estimate of the regression coefficient obtained by regressing y on only x1, because the latter approach does not adjust for the other variables. b1 represents the effect of x1 on y that is independent of the other x’s.
Multiple linear regression analyses are invariably performed on the computer, and so we omit the formulae for these estimated parameters.
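For concreteness, here is a minimal sketch of how such an analysis might be run on the computer, assuming Python with the pandas and statsmodels packages and hypothetical data on children’s systolic blood pressure, height and weight:

```python
# A minimal sketch of fitting a multiple linear regression; the data and
# variable names (sbp, height, weight) are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

# One row per child: systolic blood pressure (mmHg), height (cm), weight (kg).
children = pd.DataFrame({
    "sbp":    [95, 100, 102, 98, 110, 105, 99, 108],
    "height": [110, 115, 120, 112, 130, 125, 114, 128],
    "weight": [18, 20, 22, 19, 28, 25, 20, 27],
})

# Y = a + b1*height + b2*weight
model = smf.ols("sbp ~ height + weight", data=children).fit()
print(model.params)     # a (Intercept), b1 (height) and b2 (weight)
print(model.summary())  # full output, including the adjusted R-squared
```

Because height and weight are related, the coefficient of height in this model is adjusted for weight and will generally differ from the coefficient obtained by regressing sbp on height alone.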
Why Do It?
We perform a multiple regression analysis to be able to:
- identify explanatory variables that are associated with the dependent variable in order to promote understanding of the underlying process;
- determine the extent to which one or more of the explanatory variables is/are linearly related to the dependent variable, after adjusting for other variables that may be related to it; and,
- possibly, predict the value of the dependent variable as accurately as possible from the explanatory variables.
Assumptions
The assumptions in multiple linear regression are the same as those in simple linear regression (Chapter 27), with ‘x’ replaced by each of the x’s in turn, and they are checked in the same way. Failure to satisfy the linearity or independence assumptions is particularly important. We can transform (Chapter 9) the y variable and/or some or all of the x variables if the assumptions are in doubt, and then repeat the analysis (including checking the assumptions) on the transformed data.
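A minimal sketch of checking the assumptions and refitting on transformed data, using the same hypothetical data and packages as in the sketch above (matplotlib is also assumed to be available):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

children = pd.DataFrame({
    "sbp":    [95, 100, 102, 98, 110, 105, 99, 108],
    "height": [110, 115, 120, 112, 130, 125, 114, 128],
    "weight": [18, 20, 22, 19, 28, 25, 20, 27],
})
model = smf.ols("sbp ~ height + weight", data=children).fit()

# Residuals vs fitted values: look for non-linearity or non-constant variance.
plt.scatter(model.fittedvalues, model.resid)
plt.axhline(0)
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()

# If the assumptions are in doubt, transform and repeat the analysis,
# e.g. taking logs of the response, then re-check the assumptions.
children["log_sbp"] = np.log(children["sbp"])
model_log = smf.ols("log_sbp ~ height + weight", data=children).fit()
```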
Categorical Explanatory Variables
We can perform a multiple linear regression analysis using categorical explanatory variables. In particular, if we have a binary variable, x1 (e.g. male = 0, female = 1), and we increase x1 by one unit, we are ‘changing’ from males to females. b1 thus represents the difference in the estimated mean values of y between females and males, after adjusting for the other x’s.
If we have a nominal explanatory variable (Chapter 1) that has more than two categories of response, we have to create a number of dummy or indicator variables. In general, for a nominal variable with k categories, we create k − 1 binary dummy variables. We choose one of the categories to represent our reference category, and each dummy variable allows us to compare one of the remaining k − 1 categories of the variable with the reference category. For example, we may be interested in comparing mean systolic blood pressure levels in individuals living in four countries in Europe (the Netherlands, UK, Spain and France). Suppose we choose our reference category to be the Netherlands. We generate one binary variable to identify those living in the UK; this variable takes the value 1 if the individual lives in the UK and 0 otherwise. We then generate binary variables to identify those living in Spain and France in a similar way. By default, those living in the Netherlands can then be identified since these individuals will have the value 0 for each of the three binary variables. In a multiple linear regression analysis, the regression coefficient for each of the other three countries represents the amount by which Y (systolic blood pressure) differs, on average, between those living in the relevant country and those living in the Netherlands. The intercept provides an estimate of the mean systolic blood pressure for those living in the Netherlands (when all of the other explanatory variables take the value zero). Some computer packages will create dummy variables automatically once it is specified that the variable is categorical.
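A minimal sketch of how the dummy variables might be set up, assuming Python with pandas and statsmodels and hypothetical blood pressure data; the explicit coding and the package’s automatic coding via C(...) specify the same model:

```python
import pandas as pd
import statsmodels.formula.api as smf

adults = pd.DataFrame({
    "sbp":     [118, 121, 125, 119, 130, 127, 122, 133],
    "country": ["Netherlands", "UK", "Spain", "France",
                "Netherlands", "UK", "Spain", "France"],
})

# Option 1: create the k - 1 = 3 binary dummy variables explicitly,
# with the Netherlands as the reference category (all three dummies = 0).
dummies = pd.get_dummies(adults["country"], dtype=int)[["UK", "Spain", "France"]]
adults = pd.concat([adults, dummies], axis=1)
explicit = smf.ols("sbp ~ UK + Spain + France", data=adults).fit()

# Option 2: let the package create the dummy variables; C(...) declares
# country as categorical, with the Netherlands as the reference level.
automatic = smf.ols(
    "sbp ~ C(country, Treatment(reference='Netherlands'))", data=adults
).fit()

print(explicit.params)  # Intercept = estimated mean sbp in the Netherlands;
                        # each coefficient = difference in mean sbp vs the Netherlands
```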
If we have an ordinal explanatory variable and its three or more categories can be assigned values on a meaningful linear scale (e.g. social classes 1–5), then we can either use these values directly in the multiple linear regression equation (see also Chapter 33), or generate a series of dummy variables as for a nominal variable (but this does not make use of the ordering of the categories).
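A minimal sketch of the two options, assuming hypothetical social class scores of 1 to 5 and the same packages as above:

```python
import pandas as pd
import statsmodels.formula.api as smf

d = pd.DataFrame({
    "sbp":          [118, 122, 126, 131, 136, 120, 128, 134],
    "social_class": [1, 2, 3, 4, 5, 2, 3, 5],
})

# Option 1: use the scores directly, assuming a linear trend across classes.
trend = smf.ols("sbp ~ social_class", data=d).fit()

# Option 2: treat social class as nominal via dummy variables,
# which ignores the ordering of the categories.
dummies = smf.ols("sbp ~ C(social_class)", data=d).fit()
```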
Analysis of Covariance
An extension of analysis of variance (ANOVA, Chapter 22) is the analysis of covariance, in which we compare the response of interest between groups of individuals (e.g. two or more treatment groups) when other variables measured on each individual are taken into account. Such data can be analysed using multiple linear regression techniques by creating one or more dummy binary variables to differentiate between the groups. So, if we wish to compare the mean values of y in two treatment groups, while controlling for the effect of variables x2, x3, …, xk (e.g. age, weight, etc.), we create a binary variable, x1, to represent ‘treatment’ (e.g. x1 = 0 for treatment A, x1 = 1 for treatment B). In the multiple linear regression equation, b1 is the estimated difference in the mean responses on y between treatments B and A, adjusting for the other x’s.
Analysis of covariance is the preferred analysis for a randomized controlled trial comparing treatments when each individual in the study has a baseline and post-treatment follow-up measurement. In this instance, the response variable, y, is the follow-up measurement and two of the explanatory variables in the regression model are a binary variable representing treatment, x1, and the individual’s baseline level at the start of the study, x2. This approach is generally better (i.e. has a greater power – see Chapter 36) than using either the change from baseline or the percentage change from baseline as the response variable.
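A minimal sketch of such an analysis of covariance, assuming Python with pandas and statsmodels and hypothetical trial data (treatment coded 0/1, with a baseline and a follow-up measurement on each individual):

```python
import pandas as pd
import statsmodels.formula.api as smf

trial = pd.DataFrame({
    "followup":  [140, 131, 150, 128, 145, 126, 138, 124],
    "baseline":  [150, 148, 160, 142, 155, 140, 149, 141],
    "treatment": [0, 1, 0, 1, 0, 1, 0, 1],   # 0 = treatment A, 1 = treatment B
})

ancova = smf.ols("followup ~ treatment + baseline", data=trial).fit()

# b1: estimated difference in mean follow-up between treatments B and A,
# adjusted for the baseline level (other covariates, e.g. age, could be
# added to the formula in the same way).
print(ancova.params["treatment"])
```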
Choice of Explanatory Variables
As a rule of thumb, we should not perform a multiple linear regression analysis if the number of variables is greater than the number of individuals divided by 10. Most computer packages have automatic procedures for selecting variables, e.g. stepwise selection (Chapter 33). These are particularly useful when many of the explanatory variables are related. A particular problem arises when collinearity is present, i.e. when pairs of explanatory variables are extremely highly correlated (Chapter 33).
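A minimal sketch of one common way to screen for collinearity, assuming Python with pandas and statsmodels and hypothetical height and weight data; the variance inflation factor (VIF) quantifies how much the variance of an estimated coefficient is inflated by the correlation of that x with the other x’s:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = pd.DataFrame({
    "height": [110, 115, 120, 112, 130, 125, 114, 128],   # cm
    "weight": [18, 20, 22, 19, 28, 25, 20, 27],            # kg
})

print(X.corr())   # very high pairwise correlations are a warning sign

# VIFs are computed from the design matrix (including the constant term);
# values much larger than 1 (a cut-off of 10 is often quoted) suggest collinearity.
design = sm.add_constant(X)
for i, name in enumerate(design.columns):
    if name != "const":
        print(name, variance_inflation_factor(design.values, i))
```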
Analysis
Most computer output contains the following items.
The adjusted R2 represents the proportion (often expressed as a percentage) of the variability of y which can be explained by its relationship with the x’s. R2 is adjusted so that models with different numbers of explanatory variables can be compared. If it has a low value (judged subjectively), the model is a poor fit. Goodness of fit is particularly important when we use the multiple linear regression equation for prediction.
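For reference, a sketch of one common form of the adjustment, assuming a model fitted to n individuals with k explanatory variables:

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    # Adjusts R-squared for the number of explanatory variables, so that
    # models with different numbers of x's can be compared.
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(adjusted_r2(0.75, n=100, k=3))   # approximately 0.742
```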