Introduction
Logistic regression is very similar to linear regression; we use it when we have a binary outcome of interest (e.g. the presence/absence of a symptom, or an individual who does/does not have a disease) and a number of explanatory variables. We perform a logistic regression analysis in order to do one or more of the following:
- Determine which explanatory variables influence the outcome.
- Evaluate the probability that an individual with a particular covariate pattern (i.e. a unique combination of values for the explanatory variables) will have the outcome of interest.
- Use this probability to assign the individual to an outcome group that reflects the individual’s risk of the outcome (we usually use a probability cut-off of 0.5 for this purpose, but may choose a different cut-off if it discriminates better between the outcomes).
- Analyse an unmatched case–control study (Chapter 16) when the two outcomes are ‘case’ and ‘control’.
Reasoning
We start by creating a binary variable to represent the two outcomes (e.g. ‘has disease’ = 1, ‘does not have disease’ = 0). However, we cannot use this as the dependent variable in a linear regression analysis since the Normality assumption is violated, and we cannot interpret predicted values that are not equal to zero or one. So, instead, we take the probability, p, that an individual is classified into the highest coded category (i.e. has disease) as the dependent variable, and, to overcome mathematical difficulties, use the logistic or logit transformation (Chapter 9) of it in the regression equation. The logit of this probability is the natural logarithm (i.e. to base e) of the odds of ‘disease’, i.e. logit(p) = ln[p/(1 − p)].
The Logistic Regression Equation
An iterative process, called maximum likelihood (Chapter 32), rather than ordinary least squares regression (so we cannot use linear regression software), produces, from the sample data, an estimated logistic regression equation of the form

logit(p) = ln[p/(1 − p)] = a + b1x1 + b2x2 + … + bkxk
where:
- xi is the ith explanatory variable (i = 1, 2, 3, …, k);
- p is the estimated value of the true probability that an individual with a particular set of values for x1, …, xk has the disease. p corresponds to the proportion with the disease; it has an underlying Binomial distribution (Chapter 8);
- a is the estimated constant term;
- b1, b2, …, bk are the estimated logistic regression coefficients.
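In practice the fitting is carried out by statistical software. The following minimal sketch (in Python, assuming the pandas and statsmodels packages, with an entirely made-up data set containing a binary outcome ‘disease’ and two explanatory variables) illustrates one way the equation might be estimated by maximum likelihood:

```python
import pandas as pd
import statsmodels.api as sm

# Made-up data: 'disease' is the binary outcome (1 = has disease, 0 = does not),
# 'age' (years) and 'smoker' (1 = yes, 0 = no) are the explanatory variables.
df = pd.DataFrame({
    "disease": [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
    "age":     [65, 60, 45, 70, 55, 40, 62, 58, 48, 52],
    "smoker":  [ 1,  1,  0,  0,  1,  0,  1,  1,  0,  0],
})

X = sm.add_constant(df[["age", "smoker"]])   # adds the constant term a
result = sm.Logit(df["disease"], X).fit()    # iterative maximum likelihood fit

print(result.params)   # a (const), b1 (age) and b2 (smoker): the estimated coefficients
print(result.bse)      # their standard errors
```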
The exponential of a particular coefficient, for example exp(b1), is an estimate of the odds ratio (Chapter 16). For a particular value of x1, it is the estimated odds of disease for (x1 + 1) relative to the estimated odds of disease for x1, while adjusting for all other x’s in the equation (it is therefore often referred to as an adjusted odds ratio). If the odds ratio is equal to one (unity), then these two odds are the same, i.e. increasing the value of x1 has no impact on the odds of disease. A value of the odds ratio above one indicates an increased odds of having the disease, and a value below one indicates a decreased odds of having the disease, as x1 increases by one unit. When the disease is rare, the odds ratio can be interpreted as a relative risk.
We can manipulate the logistic regression equation to estimate the probability that an individual has the disease. For each individual, with a set of covariate values for x1, …, xk, we calculate

z = a + b1x1 + b2x2 + … + bkxk
Then, the probability that the individual has the disease is estimated as

p = e^z/(1 + e^z)
Generating a series of plots of these probabilities against the values of each of a number of covariates is often useful as an aid to interpreting the findings.
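As a small illustration of this calculation (using made-up coefficient estimates purely for the sake of example; in practice the fitted model’s own prediction routine would be used):

```python
import numpy as np

# Made-up estimates for illustration: a = -7.0, b1 = 0.1 (per year of age), b2 = 0.9 (smoker).
a, b1, b2 = -7.0, 0.1, 0.9

def estimated_probability(age, smoker):
    z = a + b1 * age + b2 * smoker       # z = a + b1*x1 + b2*x2
    return np.exp(z) / (1 + np.exp(z))   # p = e^z / (1 + e^z)

print(estimated_probability(60, 1))      # estimated probability for a 60-year-old smoker
# Plotting estimated_probability over a range of ages (for smokers and for non-smokers)
# gives the kind of probability-against-covariate plot described above.
```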
As the logistic regression model is fitted on a log scale, the effects of the xi’s are multiplicative on the odds of disease. This means that their combined effect is the product of their separate effects. Suppose, for example, x1 and x2 are two binary variables (each coded as 0 or 1) with estimated logistic coefficients b1 and b2, respectively, so that the corresponding estimated odds ratios of disease for category 1 compared with category 0 are OR1 = exp(b1) and OR2 = exp(b2). To obtain the estimated odds of disease for an individual who has x1 = 1 and x2 = 1, compared with an individual who has x1 = 0 and x2 = 0, we multiply OR1 by OR2 (see Example). This concept extends to numerical explanatory variables. The multiplicative effect on the odds scale is unlike the situation in linear regression where the effects of the xi’s on the dependent variable are additive.
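A brief worked illustration (with made-up coefficients b1 = 0.7 and b2 = 1.1 for the two binary covariates):

```python
import numpy as np

b1, b2 = 0.7, 1.1                   # made-up logistic coefficients for two binary covariates
OR1 = np.exp(b1)                    # odds ratio for x1 = 1 vs x1 = 0 (about 2.0)
OR2 = np.exp(b2)                    # odds ratio for x2 = 1 vs x2 = 0 (about 3.0)
print(OR1 * OR2, np.exp(b1 + b2))   # identical: multiplying odds ratios = adding coefficients
```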
Note that some statistical packages will, by default, model the probability that p = 0 (does not have disease) rather than p = 1. This will lead to the estimates from the logistic regression model being inverted (i.e. the estimate provided will be 1/OR). If this is the case, it is usually straightforward to modify these settings to ensure that the correct estimates are displayed.
The Explanatory Variables
Computer output for a logistic regression analysis generally includes, for each explanatory variable, the estimated logistic regression coefficient with standard error, the estimated odds ratio (i.e. the exponential of the coefficient) and a confidence interval for its true value. We can determine whether each variable is related to the outcome of interest (e.g. disease) by testing the null hypothesis that the relevant logistic regression coefficient is zero, which is equivalent to testing the hypothesis that the odds ratio of ‘disease’ associated with this variable is unity. This is usually achieved by performing one of the following tests.
- The Wald test: the test statistic, which follows the Standard Normal distribution, is equal to the estimated logistic regression coefficient divided by its standard error. Its square approximates the Chi-squared distribution with 1 df.
- The likelihood ratio test (Chapter 32): the test statistic is the deviance (also referred to as the likelihood ratio statistic (LRS) or −2log likelihood) for the model excluding the relevant explanatory variable minus the deviance for the full model – this test statistic follows a Chi-squared distribution with one degree of freedom.
These tests give similar results if the sample size is large. Although the Wald test is less powerful (Chapter 18) and may produce biased results if there are insufficient data for each value of the explanatory variable, it is usually preferred because it is generally included in the computer output (which is not usually the case for the likelihood ratio test).
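As a sketch of both tests (assuming Python with statsmodels and scipy, and continuing the made-up fit ‘result’ and data frame ‘df’ from the earlier example):

```python
import statsmodels.api as sm
from scipy import stats

# Wald test for the coefficient of 'age' in the earlier fit 'result'.
wald_z = result.params["age"] / result.bse["age"]   # compare with the Standard Normal distribution
p_wald = 2 * stats.norm.sf(abs(wald_z))

# Likelihood ratio test for 'age': deviance of the model without 'age'
# minus the deviance of the full model.
reduced = sm.Logit(df["disease"], sm.add_constant(df[["smoker"]])).fit()
lr_stat = (-2 * reduced.llf) - (-2 * result.llf)
p_lr = stats.chi2.sf(lr_stat, df=1)                  # Chi-squared with 1 degree of freedom
```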
As in multiple linear regression, automatic selection procedures (Chapter 33) can be used to select the best combination of explanatory variables. As a rule of thumb, we should not perform a multiple logistic regression analysis if the number of responses in each of the two outcome categories (e.g. has disease/does not have disease) is fewer than 10 times the number of explanatory variables1.
Assessing the Adequacy of The Model
Usually, interest is centred on examining the explanatory variables and their effect on the outcome. This information is routinely available in all advanced statistical computer packages. However, there are inconsistencies between the packages in the way in which the adequacy of the model is assessed, and in the way it is described. The following provides an indication of what your computer output may contain (in one guise or another) for a logistic model with k covariates and a sample size of n (full details may be obtained from more advanced texts2 and examples are also shown in Appendix C).
Evaluating the Model and its Fit
- The value of the deviance (or LRS or −2log likelihood): on its own (i.e. without subtracting the deviance from that of an alternative model), this compares the likelihood of the model with k covariates to that of a saturated (i.e. a perfectly fitting) model. This test statistic approximately follows a Chi-squared distribution with (n − k − 1) degrees of freedom: a significant result suggests the model does not fit the data well. Thus the deviance is a measure of poorness of fit.
- The model Chi-square, the Chi-square for covariates or G: this tests the null hypothesis that all k regression coefficients in the model are zero by subtracting the deviance of the model from that of the null model which contains no explanatory variables (Chapter 32). G approximately follows a Chi-squared distribution with k degrees of freedom; a significant result suggests that at least one covariate is significantly associated with the dependent variable.
- The Hosmer–Lemeshow test (recommended only if n is large, say > 400) assesses goodness of fit: see Chapter 46.
Indices of goodness of fit, such as the pseudo R2 (analogous to R2 in linear regression, Chapter 27), may also be determined, although they are more difficult to interpret in logistic regression analysis.
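A sketch of where these quantities sit in typical output (again assuming statsmodels and the earlier fit ‘result’; attribute names and labels vary between packages):

```python
from scipy import stats

deviance = -2 * result.llf          # deviance (-2 log likelihood) of the fitted model
null_deviance = -2 * result.llnull  # deviance of the null model containing only the constant
G = null_deviance - deviance        # model Chi-square (Chi-square for covariates)
k = int(result.df_model)            # number of covariates in the model
p_value = stats.chi2.sf(G, df=k)    # significant result: at least one covariate is associated
pseudo_r2 = result.prsquared        # McFadden's pseudo R-squared
```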
Assessing Predictive Efficiency
- A 2 × 2 classification table: this illustrates the ability of the model to correctly discriminate between those who do and do not have the outcome of interest (e.g. disease): the rows often represent the predicted outcomes from the model (where an individual is predicted to have or not have the disease according to whether his/her predicted probability is greater or less than the (usual) cut-off of 0.5) and the columns represent the observed outcomes. The entries in all cells of the table are frequencies. If the logistic model is able to classify patients perfectly (i.e. there is no misclassification of patients), the only cells of the table that contain non-zero entries are those lying on the diagonal and the overall percent correct is 100%. Note that it is possible to have a high percent correctly predicted (say 70%) when, at its most extreme, 100% of the individuals are predicted to belong to the more frequently occurring outcome group (e.g. diseased) and 0% to the other group. Terms associated with the classification table are as follows (Chapter 38), and a short computational sketch appears after this list:
- Sensitivity: the percent correctly predicted to have the disease
- Specificity: the percent correctly predicted to be disease-free
- False positive rate: the percent incorrectly predicted to have the disease
- False negative rate: the percent incorrectly predicted to be disease-free.
- A histogram: this illustrates the observed outcomes (e.g. disease or no disease) of patients according to their predicted probability (p) of belonging to the outcome category of interest, e.g. has disease. The horizontal axis, with a scale from 0 to 1, represents the predicted probability that an individual has the disease. The column (or bar) for a particular predicted probability comprises 1’s and/or 0’s, each entry representing the observed outcome for one individual (the codes 1 and 0 indicate whether the individual does or does not have the disease, respectively). A good model will separate the symbols into two groups with little or no overlap – i.e. most or all of the 0’s will lie on the far left of the histogram and most or all of the 1’s will lie on the far right. Any 1’s on the left of the histogram (where p < 0.5) or 0’s on the right (where p > 0.5) will indicate individuals who have been misclassified.
- A receiver operating characteristic (ROC) curve: this plots the sensitivity of the model against 1 minus the specificity (Chapter 38) for different cut-offs of the predicted probability, p. Lowering the cut-off increases the sensitivity and raising the cut-off increases the specificity of the model. The closer the curve is to the upper left corner of the diagram, the better the predictive ability of the model. The greater the area under the curve (upper limit = 1), the better the model is at discriminating between outcomes.
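A minimal sketch of the classification-table quantities and the ROC curve (in Python, using made-up observed outcomes and predicted probabilities; the ROC part assumes the scikit-learn and matplotlib packages are available):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

# Made-up observed outcomes (1 = disease, 0 = no disease) and predicted probabilities.
y_true = np.array([1, 0, 1, 0, 1, 0, 1, 0, 1, 0])
p_hat  = np.array([0.81, 0.30, 0.62, 0.55, 0.70, 0.20, 0.48, 0.15, 0.90, 0.40])

# Classification table at the usual cut-off of 0.5.
y_pred = (p_hat >= 0.5).astype(int)
tp = np.sum((y_pred == 1) & (y_true == 1))
tn = np.sum((y_pred == 0) & (y_true == 0))
fp = np.sum((y_pred == 1) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))
sensitivity = 100 * tp / (tp + fn)            # % of diseased correctly predicted to have disease
specificity = 100 * tn / (tn + fp)            # % of disease-free correctly predicted disease-free
percent_correct = 100 * (tp + tn) / len(y_true)

# ROC curve: sensitivity against (1 - specificity) over all possible cut-offs.
fpr, tpr, cutoffs = roc_curve(y_true, p_hat)  # fpr = 1 - specificity, tpr = sensitivity
auc = roc_auc_score(y_true, p_hat)            # area under the curve (upper limit 1)
plt.plot(fpr, tpr)
plt.xlabel("1 - specificity")
plt.ylabel("Sensitivity")
plt.title(f"ROC curve, area under curve = {auc:.2f}")
plt.show()
```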
Investigating the Assumptions
We explain how to assess the linearity assumption in Chapter 33.
A logistic regression coefficient with a large standard error may indicate:
- collinearity (Chapter 33): the explanatory variables are highly correlated, or
- a zero cell count: this occurs when all of the individuals within a particular category for a qualitative explanatory variable have the same outcome (e.g. all have the disease), so that none of them has the other outcome (disease-free). In this situation, we should consider combining categories if the covariate has more than two categories or, if this is not possible, removing the covariate from the model. Similar procedures should be adopted when the data are ‘sparse’ (e.g. when the expected frequency is <5) in any category.
Deviance divided by the degrees of freedom (df = n − k − 1) is a ratio that has an expected value of 1 when the residual variance corresponds to that expected under a Binomial model. There is extra-Binomial variation indicating overdispersion if the ratio is substantially greater than 1 (the regression coefficients have standard errors which are underestimated, perhaps because of lack of independence – Chapters 41 and 42) and underdispersion if the ratio is substantially less than 1 (see also Chapters 31 and 42).
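A sketch of this check (continuing the earlier statsmodels fit ‘result’; attribute names are those of statsmodels):

```python
n = int(result.nobs)             # number of observations
k = int(result.df_model)         # number of covariates, excluding the constant
deviance = -2 * result.llf
ratio = deviance / (n - k - 1)   # expected to be about 1 under the Binomial model
# A ratio substantially greater than 1 suggests overdispersion,
# and substantially less than 1 suggests underdispersion.
print(ratio)
```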
Logistic Regression Diagnostics
Outliers and influential points in logistic regression are usually identified by constructing appropriate diagrams and looking for points in them which appear to lie apart from the main body of the data. Note that a ‘point’ in these circumstances relates to individuals with the same covariate pattern, not to a particular individual as in multiple regression (Chapter 29). For example, outliers may be detected by plotting the logistic residual (e.g. the Pearson or deviance residual) against the predicted probability, and influential points may be detected by plotting an influence statistic (e.g. the change in the deviance attributable to deleting an individual from the analysis) against the predicted probability2.
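As one way of producing such a plot (a sketch only, refitting the earlier made-up model as a Binomial generalised linear model in statsmodels so that Pearson residuals are readily available; ‘df’ and ‘X’ are from the first example):

```python
import statsmodels.api as sm
import matplotlib.pyplot as plt

glm = sm.GLM(df["disease"], X, family=sm.families.Binomial()).fit()

# Pearson residuals plotted against the predicted probabilities;
# points lying well apart from the rest may be outliers.
plt.scatter(glm.fittedvalues, glm.resid_pearson)
plt.xlabel("Predicted probability")
plt.ylabel("Pearson residual")
plt.show()
```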
Comparing the Odds Ratio and the Relative Risk
Although the odds ratio is often taken as an estimate of the relative risk, it will only give a similar value if the outcome is rare. Where the outcome is not rare, the odds ratio will be greater than the relative risk if the relative risk is greater than one, and it will be less than the relative risk otherwise. Although the odds ratio is less easily interpreted than the relative risk, it does have attractive statistical properties and thus is usually preferred (and must be used in a case–control study when the relative risk cannot be estimated directly (Chapter 16)).
Multinomial and Ordinal Logistic Regression
Multinomial (also called polychotomous) and ordinal logistic regression are extensions of logistic regression; we use them when we have a categorical dependent variable with more than two categories. When the dependent variable is nominal (Chapter 1) (e.g. the patient has one of three back disorders: lumbar disc hernia, chronic low-back pain, or acute low-back pain) we use multinomial logistic regression. When the dependent variable is ordinal or ranked (e.g. mild, moderate or severe pain) we use ordinal logistic regression. These methods are complex and so you should refer to more advanced texts3 and/or seek specialist advice if you want to use them. As a simple alternative, we can combine the categories in some appropriate way to create a new binary outcome variable, and then perform the usual two-category logistic regression analysis (recognizing that this approach may be wasteful of information). The decision on how to combine the categories should be made in advance, before looking at the data, in order to avoid bias.
Conditional Logistic Regression
We can use conditional logistic regression when we have matched individuals (as in a matched case–control study (Chapter 16)) and we wish to adjust for possible confounding factors. Analysis of a matched case–control study using ordinary logistic regression or the methods described in Chapter 16 is inefficient, may produce biased results and lacks power because neither acknowledges that cases and controls are linked to each other. Conditional logistic regression allows us to compare cases with controls in the same matched ‘set’ (i.e. each pair in the case of one-to-one matching). In this situation, the ‘outcome’ is defined by the patient being a case (usually coded 1) or a control (usually coded 0). While advanced statistical packages may sometimes allow us to perform conditional logistic regression directly, it may be necessary to use the Cox proportional hazards regression model (Chapter 44).
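As an indication of how such an analysis might be set up (a sketch only, assuming a recent version of statsmodels, which provides a ConditionalLogit class, and made-up one-to-one matched data with a single exposure variable):

```python
import pandas as pd
from statsmodels.discrete.conditional_models import ConditionalLogit

# Made-up matched pairs: 'case' (1 = case, 0 = control), one binary covariate 'exposure',
# and 'pair' identifying each matched set.
matched = pd.DataFrame({
    "case":     [1, 0, 1, 0, 1, 0, 1, 0],
    "exposure": [1, 0, 1, 1, 0, 1, 1, 0],
    "pair":     [1, 1, 2, 2, 3, 3, 4, 4],
})

fit = ConditionalLogit(matched["case"], matched[["exposure"]],
                       groups=matched["pair"]).fit()
print(fit.params)   # log odds ratio for exposure, conditioning on the matched pairs
```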