Statistical modelling includes the use of simple and multiple linear regression (Chapters 27–29), logistic regression (Chapter 30), Poisson regression (Chapter 31) and some methods that deal with survival data (Chapter 44). All these methods rely on generating a mathematical model that best describes the relationship between an outcome and one or more explanatory variables. Generation of such a model allows us to determine the extent to which each explanatory variable is related to the outcome after adjusting for all other explanatory variables in the model and, if desired, to predict the value of the outcome from these explanatory variables.
The generalized linear model (GLM) can be expressed in the form

g(Y) = a + b1x1 + b2x2 + … + bkxk

where:
- Y is the predicted, mean or expected value of the dependent variable, which follows a known probability distribution (e.g. Normal, Binomial, Poisson);
- g(Y), called the link function, is a transformation of Y which produces a linear relationship with x1, …, xk, the predictor or explanatory variables;
- b1, …, bk are estimated regression coefficients that relate to these explanatory variables; and
- a is a constant term.
Each of the regression models described in earlier chapters can be expressed as a particular type of GLM (see Table 32.1). The link function is the logit of the proportion (i.e. the loge of the odds) in logistic regression and the loge of the rate in Poisson regression. No transformation of the dependent variable is required in simple and multiple linear regression; the link function is then referred to as the identity link. Once we have specified which type of regression we wish to perform, most statistical packages incorporate the link function into the calculations automatically without any need for further specification.
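As an illustration (not taken from the chapter), the sketch below shows how this works in one such package, Python's statsmodels: choosing the family implies the default link function (identity, logit or log) without it being specified explicitly. The variable names and data are simulated purely for illustration.

```python
# Illustrative sketch: the GLM family determines the default link function.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 200
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
df["y_cont"] = 2 + 0.5 * df.x1 - 0.3 * df.x2 + rng.normal(size=n)                 # continuous outcome
df["y_bin"] = rng.binomial(1, 1 / (1 + np.exp(-(0.8 * df.x1 - 0.3 * df.x2))))     # binary outcome
df["y_count"] = rng.poisson(np.exp(0.2 + 0.4 * df.x1))                            # count outcome

linear   = smf.glm("y_cont ~ x1 + x2", data=df, family=sm.families.Gaussian()).fit()   # identity link
logistic = smf.glm("y_bin ~ x1 + x2",  data=df, family=sm.families.Binomial()).fit()   # logit link
poisson  = smf.glm("y_count ~ x1 + x2", data=df, family=sm.families.Poisson()).fit()   # log link

print(logistic.params)   # estimated regression coefficients b1, ..., bk (log odds)
```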
Table 32.1 Type of outcome and the type of GLM commonly used.

| Type of outcome | Type of GLM commonly used | See Chapter |
| --- | --- | --- |
| Continuous numerical | Simple or multiple linear | 28, 29 |
| Binary: incidence of disease in longitudinal study (patients followed for equal periods of time) | Logistic | 30 |
| Binary: outcome in cross-sectional study | Logistic | 30 |
| Binary: unmatched case–control study | Logistic | 30 |
| Binary: matched case–control study | Conditional logistic | 30 |
| Categorical outcome with more than two categories | Multinomial or ordinal logistic regression | 30 |
| Event rate or count | Poisson | 31 |
| Time to event* | Exponential, Weibull or Gompertz models | 44 |
* Time to event data may also be analysed using a Cox proportional hazards regression model (Chapter 44).
Which Type of Model Do We Choose?
The choice of an appropriate statistical model will depend on the outcome of interest (see Table 32.1). For example, if our dependent variable is a continuous numerical variable, we may use simple or multiple linear regression to identify factors associated with this variable. If we have a binary outcome (e.g. patient died or did not die) and all patients are followed for the same amount of time, then a logistic regression model would be the appropriate choice.
Note that we may be able to choose a different type of model by modifying the format of our dependent variable. In particular, if we have a continuous numerical outcome but one or more of the assumptions of linear regression are not met, we may choose to categorize our outcome variable into two groups to generate a new binary outcome variable. For example, if our dependent variable is systolic blood pressure (a continuous numerical variable) after a 6-month period of anti-hypertensive therapy, we may choose to dichotomize the systolic blood pressure as high or low using a particular cut-off, and then use logistic regression to identify factors associated with this binary outcome. While dichotomizing the dependent variable in this way may simplify the fitting and interpretation of the statistical model, some information about the dependent variable will usually be discarded. Thus the advantages and disadvantages of this approach should always be considered carefully.
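As a hedged sketch of this approach, the example below dichotomizes a simulated systolic blood pressure outcome at an illustrative cut-off (140 mmHg is an assumption, not a value taken from the text) and fits a logistic regression to the resulting binary variable using Python's statsmodels.

```python
# Illustrative sketch: dichotomize a continuous outcome and fit a logistic GLM.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 300
df = pd.DataFrame({
    "age": rng.normal(55, 10, n),             # hypothetical explanatory variables
    "baseline_sbp": rng.normal(150, 15, n),
})
df["sbp_6m"] = 0.6 * df.baseline_sbp + 0.3 * df.age + rng.normal(0, 12, n)

# Dichotomize the continuous outcome: 1 = "high" SBP at 6 months, 0 = "low"
df["high_sbp"] = (df.sbp_6m >= 140).astype(int)    # 140 mmHg cut-off is an assumption

model = smf.glm("high_sbp ~ age + baseline_sbp", data=df,
                family=sm.families.Binomial()).fit()
print(np.exp(model.params))   # odds ratios for each explanatory variable
```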
Likelihood and Maximum Likelihood Estimation
When fitting a GLM, we generally use the concept of likelihood to estimate the parameters of the model. For any GLM characterized by a known probability distribution, a set of explanatory variables and some potential values for each of their regression coefficients, the likelihood of the model (L) is the probability that we would have obtained the observed results had the regression coefficients taken those values. We estimate the coefficients of the model by selecting the values of the regression coefficients that maximize L (i.e. the values that are most likely to have produced our observed results); this process is known as maximum likelihood estimation (MLE) and the resulting estimates are maximum likelihood estimates. MLE is an iterative process and so specialized computer software is required. One exception is the case of simple and multiple linear regression models (with the identity link function), where we usually estimate the parameters using the method of least squares (the estimates are often referred to as ordinary least squares (OLS) estimates (Chapter 27)); the OLS and MLE estimates are identical in this situation.
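The equivalence of the OLS and MLE estimates for a linear model can be checked directly. The sketch below, using simulated data and Python's statsmodels, fits the same model by least squares and by maximum likelihood (Gaussian family, identity link).

```python
# Illustrative sketch: OLS and MLE give identical estimates for a linear model.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 100
df = pd.DataFrame({"x": rng.normal(size=n)})
df["y"] = 1.0 + 2.0 * df.x + rng.normal(size=n)

ols = smf.ols("y ~ x", data=df).fit()                                  # least squares
mle = smf.glm("y ~ x", data=df, family=sm.families.Gaussian()).fit()   # maximum likelihood

print(ols.params)
print(mle.params)   # identical (up to numerical precision) to the OLS estimates
```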
Assessing Adequacy of Fit
Although MLE maximizes L for a given set of explanatory variables, we can always improve L further by including additional explanatory variables. At its most extreme, a saturated model is one that includes a separate variable for each observation (i.e. individual) in the data set. While such a model would explain the data perfectly, it is of limited use in practice as the prediction of future observations from this model is likely to be poor. The saturated model does, however, allow us to calculate the value of L that would be obtained if we could model the data perfectly. Comparison of this value of L with the value obtained after fitting our simpler model with fewer variables provides a way of assessing the adequacy of the fit of our model. We consider the likelihood ratio, the ratio of the value of L obtained from the saturated model to that obtained from the fitted model, in order to compare these two models. More specifically, we calculate the likelihood ratio statistic (LRS) as

LRS = 2 × loge(Lsaturated/Lfitted) = −2 × loge(Lfitted/Lsaturated)

where Lsaturated and Lfitted are the values of L for the saturated and fitted models, respectively.
The LRS, often referred to as −2 log likelihood (see Chapters 30 and 31) or as the deviance, approximately follows a Chi-squared distribution with degrees of freedom equal to the difference in the number of parameters fitted in the two models (i.e. n − k − 1, where n is the number of observations in the data set and k is the number of parameters, apart from the intercept, in the simpler model). The null hypothesis is that the extra parameters in the larger saturated model are all zero; a high value of the LRS will give a significant result indicating that the goodness of fit of the model is poor.
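As an illustration of this goodness-of-fit test, the sketch below fits a Poisson model to simulated count data in Python's statsmodels; the deviance of the fitted model is compared with a Chi-squared distribution on n − k − 1 degrees of freedom. The adequacy of the Chi-squared approximation depends on the data, so treat this purely as a sketch.

```python
# Illustrative sketch: deviance (LRS vs the saturated model) as a goodness-of-fit check.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy import stats

rng = np.random.default_rng(6)
n = 200
df = pd.DataFrame({"x": rng.normal(size=n)})
df["count"] = rng.poisson(np.exp(0.5 + 0.3 * df.x))

fit = smf.glm("count ~ x", data=df, family=sm.families.Poisson()).fit()

# fit.deviance is the LRS comparing the fitted model with the saturated model;
# fit.df_resid equals n - k - 1 (here 200 - 1 - 1 = 198).
p_value = stats.chi2.sf(fit.deviance, fit.df_resid)
print(fit.deviance, fit.df_resid, p_value)   # a large deviance (small p) suggests poor fit
```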
The LRS can also be used in other situations. In particular, the LRS can be used to compare two models, neither of which is saturated, when one model is nested within another (i.e. the larger model includes all of the explanatory variables that are included in the smaller model, in addition to extra variables). In this situation, the test statistic is the difference between the value of the LRS from the model which includes the extra variables and that from the model which excludes these extra variables. The test statistic follows a Chi-squared distribution with degrees of freedom equal to the number of additional parameters included in the larger model, and is used to test the null hypothesis that the extra parameters in the larger model are all zero. The LRS can also be used to test the null hypothesis that all the parameters associated with the covariates of a model are zero by comparing the LRS of the model which includes the covariates with that of the model which excludes them. This is often referred to as the model Chi-square or the Chi-square for covariates (see Chapters 30 and 31).
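A minimal sketch of the nested-model comparison, again using simulated data and Python's statsmodels: the test statistic is the difference in −2 log likelihood between the smaller and larger models, referred to a Chi-squared distribution with degrees of freedom equal to the number of extra parameters.

```python
# Illustrative sketch: likelihood ratio test comparing two nested logistic models.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy import stats

rng = np.random.default_rng(4)
n = 500
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
p = 1 / (1 + np.exp(-(-0.2 + 0.7 * df.x1)))
df["y"] = rng.binomial(1, p)

small = smf.glm("y ~ x1", data=df, family=sm.families.Binomial()).fit()        # nested model
large = smf.glm("y ~ x1 + x2", data=df, family=sm.families.Binomial()).fit()   # adds x2

# LRS = (-2 log L_small) - (-2 log L_large) = 2 * (log L_large - log L_small)
lrs = 2 * (large.llf - small.llf)
extra_params = 1                              # one extra parameter in the larger model
p_value = stats.chi2.sf(lrs, extra_params)
print(lrs, p_value)   # tests H0: the coefficient of the extra variable (x2) is zero
```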
Regression Diagnostics
When performing any form of regression analysis, it is important to consider a series of regression diagnostics. These allow us to examine our fitted regression model and look for flaws that may affect our parameter estimates and their standard errors. In particular, we must consider whether the assumptions underlying the model are violated and whether our results are heavily affected by influential observations (Chapter 28).
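As one illustrative example (not an exhaustive set of diagnostics), the sketch below fits a simple linear regression to simulated data and extracts the residuals and Cook's distances to flag potentially influential observations; the threshold used is a common rule of thumb rather than a rule from this chapter.

```python
# Illustrative sketch: basic regression diagnostics for a fitted linear model.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
n = 100
df = pd.DataFrame({"x": rng.normal(size=n)})
df["y"] = 1 + 2 * df.x + rng.normal(size=n)

fit = smf.ols("y ~ x", data=df).fit()

residuals = fit.resid                        # inspect for non-linearity / non-constant variance
influence = fit.get_influence()
cooks_d = influence.cooks_distance[0]        # large values flag influential observations
print(df.index[cooks_d > 4 / n].tolist())    # rule-of-thumb threshold, for illustration only
```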