Whichever type of statistical model we choose, we have to make decisions about which explanatory variables to include in the model and the most appropriate way in which they should be incorporated. These decisions will depend on the type of explanatory variable (either nominal categorical, ordinal categorical or numerical) and the relationship between these variables and the dependent variable.
Nominal Explanatory Variables
It is usually necessary to create dummy or indicator variables (Chapter 29) to investigate the effect of a nominal categorical explanatory variable in a regression analysis. Note that when assessing the adequacy of fit of a model that includes a nominal variable with more than two categories, or when assessing the significance of that variable, it is important to include all of the dummy variables in the model at the same time; if we do not do this (i.e. if we only include one of the dummy variables for a particular level of the categorical variable), then we would only partially assess the impact of that variable on the outcome. For this reason, it is preferable to judge the significance of the variable using the likelihood ratio test statistic (LRS – Chapter 32), rather than by considering individual P-values for each of the dummy variables.
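A minimal sketch of this approach in Python (statsmodels), using hypothetical variable names (sbp, age, ethnicity) and synthetic data in place of a real data set, fits the model with the full set of dummy variables and judges the nominal variable with a single likelihood ratio test:

```python
# A minimal sketch, assuming hypothetical variable names (sbp, age, ethnicity);
# synthetic data stand in for a real data set.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

rng = np.random.default_rng(1)
n = 200
df = pd.DataFrame({
    'age': rng.uniform(30, 70, n),
    'ethnicity': rng.choice(['A', 'B', 'C'], n),
})
df['sbp'] = 100 + 0.5 * df['age'] + rng.normal(0, 10, n)

# Model including the full set of dummy variables for the nominal variable;
# C() asks statsmodels to create the indicator (dummy) variables automatically
full = smf.ols('sbp ~ age + C(ethnicity)', data=df).fit()

# Reduced model omitting the nominal variable entirely
reduced = smf.ols('sbp ~ age', data=df).fit()

# Likelihood ratio statistic (LRS) comparing the two nested models; its degrees
# of freedom equal the number of dummy variables (here 2)
lrs = 2 * (full.llf - reduced.llf)
df_diff = full.df_model - reduced.df_model
p_value = stats.chi2.sf(lrs, df_diff)
print(f'LRS = {lrs:.2f} on {df_diff:.0f} df, P = {p_value:.3f}')
```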
Ordinal Explanatory Variables
In the situation where we have an ordinal variable with more than two categories, we may take one of two approaches.
- Treat the categorical variable as a continuous numerical measurement by allocating a numerical value to each category of the variable. This approach makes full use of the ordering of the categories but it usually assumes a linear relationship (when the numerical values are equally spaced) between the explanatory variable and the dependent variable (or a transformation of it) and this should be validated.
- Treat the categorical variable as a nominal explanatory variable and create a series of dummy or indicator variables for it (Chapter 29). This approach does not take account of the ordering of the categories and is therefore wasteful of information. However, it does not assume a linear relationship with the dependent variable and so may be preferred.
The difference in the values of the LRS from these two models provides a test statistic for a test of linear trend (i.e. an assessment of whether the model that makes no assumption of linearity fits the data significantly better than the model that assumes a linear relationship). This test statistic follows a Chi-squared distribution with degrees of freedom equal to the difference in the number of parameters in the two models; a significant result suggests non-linearity. See also Chapter 25 for a test of a linear trend in proportions.
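A minimal sketch of this comparison in Python (statsmodels), using hypothetical variable names (died, stage) and synthetic data, fits the two models for an ordinal variable with four categories and compares their LRS values:

```python
# A minimal sketch, assuming hypothetical variable names (died, stage with
# ordered categories 1-4); synthetic data stand in for a real data set.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

rng = np.random.default_rng(2)
n = 400
stage = rng.integers(1, 5, n)
p = 1 / (1 + np.exp(-(-2 + 0.6 * stage)))          # log odds linear in stage
df = pd.DataFrame({'stage': stage, 'died': rng.binomial(1, p)})

# Approach 1: treat the ordinal variable as a numerical score (assumes linearity)
linear = smf.logit('died ~ stage', data=df).fit(disp=0)

# Approach 2: treat it as nominal, via a set of dummy variables (no linearity assumed)
nominal = smf.logit('died ~ C(stage)', data=df).fit(disp=0)

# Difference in the LRS of the two models, compared with a Chi-squared distribution
# whose degrees of freedom equal the difference in the number of parameters (3 - 1 = 2)
lrs = 2 * (nominal.llf - linear.llf)
df_diff = nominal.df_model - linear.df_model
p_value = stats.chi2.sf(lrs, df_diff)
print(f'LRS = {lrs:.2f} on {df_diff:.0f} df, P = {p_value:.3f}')  # significant => non-linearity
```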
Numerical Explanatory Variables
When we include a numerical explanatory variable in the model, the estimate of its regression coefficient provides an indication of the impact of a one-unit increase in the explanatory variable on the outcome. Thus, for simple and multiple linear regression, the relationship between each explanatory variable and the dependent variable is assumed to be linear. For Poisson and logistic regression, the parameter estimate provides a measure of the impact of a one-unit increase in the explanatory variable on the logₑ of the dependent variable (i.e. the model assumes an exponential relationship with the actual rate or odds). It is important to check the appropriateness of the assumption of linearity (see next section) before including numerical explanatory variables in regression models.
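For example, in logistic regression the coefficient of a numerical explanatory variable is the change in the log odds per one-unit increase, so exponentiating it gives the multiplicative effect on the odds. A minimal sketch in Python (statsmodels), using hypothetical variable names (event, age) and synthetic data:

```python
# A minimal sketch, assuming hypothetical variable names (event, age);
# synthetic data stand in for a real data set.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 500
age = rng.uniform(40, 80, n)
p = 1 / (1 + np.exp(-(-5 + 0.07 * age)))
df = pd.DataFrame({'age': age, 'event': rng.binomial(1, p)})

model = smf.logit('event ~ age', data=df).fit(disp=0)
b_age = model.params['age']

# The coefficient is the change in log odds per one-unit (one-year) increase in age;
# exponentiating it gives the multiplicative effect on the odds
print(f'Change in log odds per year: {b_age:.3f}')
print(f'Odds ratio per year:         {np.exp(b_age):.3f}')
```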
Assessing the Assumption of Linearity
To check the linearity assumption in a simple or multiple linear regression model, we plot the numerical dependent variable, y, against the numerical explanatory variable, x, or plot the residuals of the model against x (Chapter 28). The raw data should approximate a straight line and there should be no discernible pattern in the residuals. We may assess the assumption of linearity in logistic regression (Chapter 30) or Poisson regression (Chapter 31) by categorizing individuals into a small number (5–10) of equally sized subgroups according to their values of x. In Poisson regression, we calculate the log (to any base) of the rate of the outcome in each subgroup and plot this against the mid-point of the range of values for x for the corresponding subgroup (see Fig. 33.1). For logistic regression, we similarly calculate the log odds for each subgroup and plot this against the mid-point. In each case, if the assumption of linearity is reasonable, we would expect to see a similarly sized step-wise increase (or decrease) in the log of the rate or odds when moving between adjacent categories of x. Another approach to checking for linearity in a regression model is to consider higher order models (see polynomial regression in the next section).
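A minimal sketch of the grouping check for logistic regression in Python (pandas and matplotlib), using hypothetical variable names (event, age) and synthetic data:

```python
# A minimal sketch of the grouping check for logistic regression, assuming
# hypothetical variable names (event, age); synthetic data stand in for a real data set.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
n = 1000
age = rng.uniform(40, 80, n)
p = 1 / (1 + np.exp(-(-5 + 0.07 * age)))
df = pd.DataFrame({'age': age, 'event': rng.binomial(1, p)})

# Split individuals into 5 equally sized subgroups of age
df['age_group'] = pd.qcut(df['age'], q=5)

# Log odds of the outcome and mid-point of the age range within each subgroup
summary = df.groupby('age_group', observed=True).agg(
    prop=('event', 'mean'),
    mid=('age', lambda a: (a.min() + a.max()) / 2),
)
summary['log_odds'] = np.log(summary['prop'] / (1 - summary['prop']))

# Roughly equal steps between adjacent points support the linearity assumption
plt.plot(summary['mid'], summary['log_odds'], 'o-')
plt.xlabel('Mid-point of age subgroup')
plt.ylabel('Log odds of event')
plt.show()
```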
Dealing with Non-Linearity
If non-linearity is detected in any of these plots, there are a number of approaches that can be taken.
- Replace x by a set of dummy variables created by categorizing the individuals into three or four subgroups according to the magnitude of x (often defined using the tertiles or quartiles of the distribution). This set of dummy variables can be incorporated into the multivariable regression model as categorical explanatory variables (see Example).
- Transform the x variable in some way (e.g. by taking a logarithmic or square root transformation of x; Chapter 9) so that the resulting relationship between the transformed value of x and the dependent variable (or its logₑ for Poisson or its logit for logistic regression) is linear.
- Find some algebraic description that approximates the non-linear relationship using higher orders of x (e.g. a quadratic or cubic relationship). This is known as polynomial regression. We just introduce terms that represent the relevant higher orders of x into the equation. So, for example, if we have a cubic relationship, our estimated multiple linear regression equation is Y = a + b₁x + b₂x² + b₃x³. We fit this model, and proceed with the analysis in exactly the same way as if the quadratic and cubic terms represented different variables (x² and x³, say) in a multiple regression analysis. For example, we may fit a quadratic model that comprises the explanatory ‘variables’ height and height². We can test for linearity by comparing the LRS in the linear and quadratic models (Chapter 32), or by testing the coefficient of the quadratic term.
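A minimal sketch of the quadratic example in Python (statsmodels), using hypothetical variable names (fev1, height) and synthetic data, fits the linear and quadratic models and compares them:

```python
# A minimal sketch of polynomial (quadratic) regression, assuming hypothetical
# variable names (fev1, height); synthetic data stand in for a real data set.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

rng = np.random.default_rng(5)
n = 300
height = rng.uniform(150, 190, n)
fev1 = -20 + 0.25 * height - 0.0006 * height**2 + rng.normal(0, 0.4, n)
df = pd.DataFrame({'height': height, 'fev1': fev1})

# Treat the quadratic term as if it were a separate explanatory variable
df['height_sq'] = df['height'] ** 2

linear    = smf.ols('fev1 ~ height', data=df).fit()
quadratic = smf.ols('fev1 ~ height + height_sq', data=df).fit()

# Compare the nested models via the difference in their LRS values
# (1 degree of freedom for the single extra quadratic term)
lrs = 2 * (quadratic.llf - linear.llf)
p_value = stats.chi2.sf(lrs, df=1)
print(f'LRS = {lrs:.2f}, P = {p_value:.3f}')

# Equivalently, examine the estimated coefficient (and P-value) of the quadratic term
print(f"Quadratic term: b = {quadratic.params['height_sq']:.5f}, "
      f"P = {quadratic.pvalues['height_sq']:.3f}")
```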
Selecting Explanatory Variables
Even if a model is not saturated (Chapter 32), there is always a danger of over-fitting when a very large number of explanatory variables is included; this may lead to spurious results that are inconsistent with expectations, especially if the variables are highly correlated. For a multiple linear regression model, a usual rule of thumb is to ensure that there are at least 10 times as many individuals as explanatory variables. For logistic regression, there should be at least 10 times as many individuals in each of the two outcome categories as explanatory variables; for Poisson regression, at least 10 times as many events as explanatory variables.
Often, we have a large number of explanatory variables that we believe may be related to the dependent variable. For example, many factors may appear to be related to systolic blood pressure, including age, dietary and other lifestyle factors. We should only include explanatory variables in a model if there is reason to suppose, from a biological or clinical standpoint, that they are related to the dependent variable. We can eliminate some variables by performing a univariable analysis (perhaps with a less stringent significance level of 0.10 rather than the more conventional 0.05) for each explanatory variable to assess whether it is likely to be related to the dependent variable, e.g. if we have a numerical dependent variable, we may perform a simple regression analysis if the explanatory variable is numerical or an unpaired t-test if it is binary. We then consider only those explanatory variables that were significant at this first stage for our multivariable model (see the Example in Chapter 31).
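A minimal sketch of this univariable screening step in Python (statsmodels), using hypothetical variable names and synthetic data; the candidate variables are numerical or coded 0/1, so simple regression on a binary variable is equivalent to the unpaired t-test:

```python
# A minimal sketch of univariable screening at the 0.10 level, assuming
# hypothetical variable names; synthetic data stand in for a real data set.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(6)
n = 300
df = pd.DataFrame({
    'age': rng.uniform(30, 70, n),
    'bmi': rng.normal(27, 4, n),
    'alcohol': rng.poisson(5, n),
    'smoker': rng.integers(0, 2, n),
})
df['sbp'] = 100 + 0.5 * df['age'] + 0.8 * df['bmi'] + rng.normal(0, 10, n)

candidates = ['age', 'bmi', 'alcohol', 'smoker']
selected = []
for var in candidates:
    fit = smf.ols(f'sbp ~ {var}', data=df).fit()   # univariable (simple) regression
    if fit.pvalues[var] < 0.10:                    # less stringent screening level
        selected.append(var)
print('Carried forward to the multivariable model:', selected)
```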
Automatic Selection Procedures
When we have a large number of potential explanatory variables and are particularly interested in using the model for prediction, rather than in gaining insight into whether an explanatory variable influences the outcome or in estimating its effect, computer-intensive automatic selection procedures provide a means of identifying the optimal model by selecting some of these variables.
- All subsets – every combination of explanatory variables is considered; that which provides the best fit, as described by the model R2 (Chapter 27) or LRS (Chapter 32), is selected.
- Backward selection – all possible variables are included; those that are judged by the model to be least important (where this decision is based on the change in R2 or the LRS) are progressively removed until none of the remaining variables can be removed without significantly affecting the fit of the model.
- Forward selection – variables that contribute most to the fit of the model (based on the change in R2 or the LRS) are progressively added until no further variable significantly improves the fit of the model.
- Stepwise selection – a combination of forward and backward selection that starts by progressing forward and then, at the end of each ‘step’, checks backward to ensure that all of the included variables are still required.
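A minimal sketch of forward selection for a linear model in Python (statsmodels), using hypothetical variable names and synthetic data; at each step the candidate variable giving the smallest P-value from a likelihood ratio test is added, provided it significantly improves the fit:

```python
# A minimal sketch of forward selection, assuming hypothetical variable names;
# synthetic data stand in for a real data set.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

rng = np.random.default_rng(7)
n = 300
df = pd.DataFrame({
    'age': rng.uniform(30, 70, n),
    'bmi': rng.normal(27, 4, n),
    'alcohol': rng.poisson(5, n),
    'smoker': rng.integers(0, 2, n),
})
df['y'] = 100 + 0.5 * df['age'] + 0.8 * df['bmi'] + rng.normal(0, 10, n)

candidates = ['age', 'bmi', 'alcohol', 'smoker']
included = []

while True:
    base = smf.ols('y ~ ' + (' + '.join(included) or '1'), data=df).fit()
    best_var, best_p = None, 1.0
    for var in candidates:
        if var in included:
            continue
        extended = smf.ols('y ~ ' + ' + '.join(included + [var]), data=df).fit()
        lrs = 2 * (extended.llf - base.llf)          # change in fit from adding var
        p = stats.chi2.sf(lrs, df=1)
        if p < best_p:
            best_var, best_p = var, p
    if best_var is None or best_p >= 0.05:           # no further variable improves the fit
        break
    included.append(best_var)

print('Selected variables:', included)
```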
Disadvantages
Although these procedures remove much of the manual aspect of model selection, they have some disadvantages.
- It is possible that two or more models will fit the data equally well, or that changes in the data set will produce different models.
- Because of the multiple testing that occurs when repeatedly comparing one model to another within an automatic selection procedure, the Type I error rate (Chapter 18) is particularly high. Thus, some significant findings may arise by chance. This problem may be alleviated by choosing a more stringent significance level (say 0.01 rather than 0.05).
- If the model is refitted to the data set using the m, say, variables remaining in the final automatic selection model, its estimated parameters may differ from those of the automatic selection model. This is because the automatic selection procedure uses only those individuals who have complete information on all of the candidate explanatory variables, whereas the refitted model includes every individual with complete information on just the m selected variables, and so may be based on a larger sample.
- The resulting models, although mathematically justifiable, may not be sensible. In particular, when including a series of dummy variables to represent a single categorical variable (Chapter 29), automatic models may include only some of the dummy variables, leading to problems in interpretation.
Therefore, a combination of these procedures and common sense should be applied when selecting the best-fitting model. Models that are generated using automatic selection procedures should be validated on other external data sets where possible (see Chapter 46).
Interaction
What is It?
Statistical interaction (also known as effect modification, Chapter 13) between two explanatory variables in a regression analysis occurs when the relationship between one of the explanatory variables and the dependent variable is not the same for different levels of the other explanatory variable, i.e. the two explanatory variables do not act independently on the dependent variable. For example, suppose that we want to assess the association between an individual’s body weight (the explanatory variable) and the amount of a particular drug in his or her blood (the dependent variable). If we believe that this association is different for men and women in the study, we may wish to investigate whether there is an interaction between body weight and sex. If statistical testing reveals evidence of a significant interaction, we may be advised to describe the association between body weight and the amount of the drug in the blood separately in men and women.
Testing for Interaction
Testing for statistical interaction in a regression model is usually straightforward and many statistical packages allow you to request the inclusion of interaction terms. If the package does not provide this facility then an interaction term may be created manually by including the product of the relevant variables as an additional explanatory variable. Thus, to obtain the value of the variable which represents the interaction between two variables (both binary, both numerical or one binary and one numerical), we multiply the individual’s values of these two variables. If both variables are numerical, interpretation may be easier if we create an interaction term from the two binary variables obtained by dichotomizing each numerical variable. If one of the two variables is categorical with more than two categories, we create a series of dummy variables from it (Chapter 29) and use each of them, together with the second binary or numerical variable of interest, to generate a series of interaction terms. This procedure can be extended if both variables are categorical and each has more than two categories.
Interaction terms should only be included in the regression model after the main effects (the effects of the variables without any interaction) have been included. Note that statistical tests of interaction are usually of low power (Chapter 18). This is of particular concern when both explanatory variables are categorical and few events occur in the subgroups formed by combining each level of one variable with every level of the other, or if these subgroups include very few individuals.
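A minimal sketch of testing for an interaction between a numerical and a binary explanatory variable in Python (statsmodels), using hypothetical variable names (drug_level, weight, sex coded 0/1) and synthetic data; the interaction term is added only after both main effects:

```python
# A minimal sketch of testing for a weight-by-sex interaction, assuming
# hypothetical variable names; synthetic data stand in for a real data set.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

rng = np.random.default_rng(8)
n = 300
weight = rng.normal(75, 12, n)
sex = rng.integers(0, 2, n)
drug_level = 5 + 0.05 * weight + 1.0 * sex + 0.03 * weight * sex + rng.normal(0, 1, n)
df = pd.DataFrame({'weight': weight, 'sex': sex, 'drug_level': drug_level})

# Main effects first, then the interaction term (the product of weight and sex)
main_only   = smf.ols('drug_level ~ weight + sex', data=df).fit()
interaction = smf.ols('drug_level ~ weight + sex + weight:sex', data=df).fit()

# Likelihood ratio test for the single interaction parameter
lrs = 2 * (interaction.llf - main_only.llf)
p_value = stats.chi2.sf(lrs, df=1)
print(f'LRS = {lrs:.2f}, P = {p_value:.3f}')

# Equivalently, the interaction term can be created manually as a product variable
df['weight_x_sex'] = df['weight'] * df['sex']
```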
Collinearity
When two explanatory variables are highly correlated, it may be difficult to evaluate their individual effects in a multivariable regression model. As a consequence, while each variable may be significantly associated with the dependent variable in a univariable model (i.e. when there is a single explanatory variable), neither may be significantly associated with it when both explanatory variables are included in a multivariable model. This collinearity (also called multi-collinearity) can be detected by examining the correlation coefficients between each pair of explanatory variables (commonly displayed in a correlation matrix, and of particular concern if a coefficient, ignoring its sign, is greater than 0.8) or by inspecting the standard errors of the regression coefficients in the multivariable model (these will be substantially larger than those in the separate univariable models if collinearity is present). The easiest solution, if collinearity is detected between two variables, is to include only one of them in the model. In situations where many of the variables are highly correlated, it may be necessary to seek statistical advice.
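A minimal sketch of these two checks in Python (statsmodels and pandas), using hypothetical variable names (y, weight, bmi) and synthetic data:

```python
# A minimal sketch of checking for collinearity between two explanatory
# variables, assuming hypothetical variable names; synthetic data stand in
# for a real data set.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(9)
n = 200
weight = rng.normal(75, 12, n)
bmi = weight / 2.9 + rng.normal(0, 1, n)           # highly correlated with weight
y = 10 + 0.2 * weight + rng.normal(0, 3, n)
df = pd.DataFrame({'weight': weight, 'bmi': bmi, 'y': y})

# Correlation matrix of the explanatory variables: a coefficient greater than
# about 0.8 (ignoring its sign) is a warning sign
print(df[['weight', 'bmi']].corr().round(2))

# Compare the standard errors of each coefficient in the univariable models
# with those in the multivariable model; a substantial increase suggests collinearity
uni_weight = smf.ols('y ~ weight', data=df).fit()
uni_bmi    = smf.ols('y ~ bmi', data=df).fit()
multi      = smf.ols('y ~ weight + bmi', data=df).fit()

print('SE(weight):', round(uni_weight.bse['weight'], 3), '->', round(multi.bse['weight'], 3))
print('SE(bmi):   ', round(uni_bmi.bse['bmi'], 3), '->', round(multi.bse['bmi'], 3))
```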
Confounding
When two explanatory variables are both related to the outcome and to each other so that it is difficult to assess the independent effect of each one on the outcome, we say that the explanatory variables are confounded. We discuss confounding in detail in Chapter 34.