Statistical modelling includes the use of simple and multiple linear regression (Chapters 27–29), logistic regression (Chapter 30), Poisson regression (Chapter 31) and some methods that deal with survival data (Chapter 44). All these methods rely on generating a mathematical model that best describes the relationship between an outcome and one or more explanatory variables. Generation of such a model allows us to determine the extent to which each explanatory variable is related to the outcome after adjusting for all other explanatory variables in the model and, if desired, to predict the value of the outcome from these explanatory variables.
The generalized linear model (GLM) can be expressed in the form

g(Y) = a + b1x1 + b2x2 + … + bkxk

where:
- Y is the predicted, mean or expected value of the dependent variable, which follows a known probability distribution (e.g. Normal, Binomial, Poisson);
- g(Y), called the link function, is a transformation of Y which produces a linear relationship with x1, …, xk, the predictor or explanatory variables;
- b1, …, bk are estimated regression coefficients that relate to these explanatory variables; and
- a is a constant term.
Each of the regression models described in earlier chapters can be expressed as a particular type of GLM (see Table 32.1). The link function is the logit of the proportion (i.e. the loge of the odds) in logistic regression and the loge of the rate in Poisson regression. No transformation of the dependent variable is required in simple and multiple linear regression; the link function is then referred to as the identity link. Once we have specified which type of regression we wish to perform, most statistical packages incorporate the link function into the calculations automatically without any need for further specification.
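As an illustration (not taken from the chapter), the sketch below shows how this works in one such package, Python's statsmodels: choosing the family implies the default link function (identity, logit or log) without it being specified explicitly. The variable names and data are simulated purely for illustration.

```python
# Illustrative sketch: the GLM family determines the default link function.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 200
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
df["y_cont"] = 2 + 0.5 * df.x1 - 0.3 * df.x2 + rng.normal(size=n)                 # continuous outcome
df["y_bin"] = rng.binomial(1, 1 / (1 + np.exp(-(0.8 * df.x1 - 0.3 * df.x2))))     # binary outcome
df["y_count"] = rng.poisson(np.exp(0.2 + 0.4 * df.x1))                            # count outcome

linear   = smf.glm("y_cont ~ x1 + x2", data=df, family=sm.families.Gaussian()).fit()   # identity link
logistic = smf.glm("y_bin ~ x1 + x2",  data=df, family=sm.families.Binomial()).fit()   # logit link
poisson  = smf.glm("y_count ~ x1 + x2", data=df, family=sm.families.Poisson()).fit()   # log link

print(logistic.params)   # estimated regression coefficients b1, ..., bk (log odds)
```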
Table 32.1 Type of outcome and the type of GLM commonly used.

| Type of outcome | Type of GLM commonly used | See Chapter |
| --- | --- | --- |
| Continuous numerical | Simple or multiple linear | 28, 29 |
| Binary: incidence of disease in longitudinal study (patients followed for equal periods of time) | Logistic | 30 |
| Binary: outcome in cross-sectional study | Logistic | 30 |
| Binary: unmatched case–control study | Logistic | 30 |
| Binary: matched case–control study | Conditional logistic | 30 |
| Categorical outcome with more than two categories | Multinomial or ordinal logistic regression | 30 |
| Event rate or count | Poisson | 31 |
| Time to event* | Exponential, Weibull or Gompertz models | 44 |
* Time to event data may also be analysed using a Cox proportional hazards regression model (Chapter 44).
Which Type of Model Do We Choose?
The choice of an appropriate statistical model will depend on the outcome of interest (see Table 32.1). For example, if our dependent variable is a continuous numerical variable, we may use simple or multiple linear regression to identify factors associated with this variable. If we have a binary outcome (e.g. patient died or did not die) and all patients are followed for the same amount of time, then a logistic regression model would be the appropriate choice.
Note that we may be able to choose a different type of model by modifying the format of our dependent variable. In particular, if we have a continuous numerical outcome but one or more of the assumptions of linear regression are not met, we may choose to categorize our outcome variable into two groups to generate a new binary outcome variable. For example, if our dependent variable is systolic blood pressure (a continuous numerical variable) after a 6-month period of anti-hypertensive therapy, we may choose to dichotomize the systolic blood pressure as high or low using a particular cut-off, and then use logistic regression to identify factors associated with this binary outcome. While dichotomizing the dependent variable in this way may simplify the fitting and interpretation of the statistical model, some information about the dependent variable will usually be discarded. Thus the advantages and disadvantages of this approach should always be considered carefully.
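As a hedged sketch of this approach, the example below dichotomizes a simulated systolic blood pressure outcome at an illustrative cut-off (140 mmHg is an assumption, not a value taken from the text) and fits a logistic regression to the resulting binary variable using Python's statsmodels.

```python
# Illustrative sketch: dichotomize a continuous outcome and fit a logistic GLM.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 300
df = pd.DataFrame({
    "age": rng.normal(55, 10, n),             # hypothetical explanatory variables
    "baseline_sbp": rng.normal(150, 15, n),
})
df["sbp_6m"] = 0.6 * df.baseline_sbp + 0.3 * df.age + rng.normal(0, 12, n)

# Dichotomize the continuous outcome: 1 = "high" SBP at 6 months, 0 = "low"
df["high_sbp"] = (df.sbp_6m >= 140).astype(int)    # 140 mmHg cut-off is an assumption

model = smf.glm("high_sbp ~ age + baseline_sbp", data=df,
                family=sm.families.Binomial()).fit()
print(np.exp(model.params))   # odds ratios for each explanatory variable
```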
Likelihood and Maximum Likelihood Estimation
When fitting a GLM, we generally use the concept of likelihood to estimate the parameters of the model. For any GLM characterized by a known probability distribution, a set of explanatory variables and some potential values for each of their regression coefficients, the likelihood of the model (L) is the probability that we would have obtained the observed results had the regression coefficients taken those values. We estimate the coefficients of the model by selecting the values of the regression coefficients that maximize L (i.e. the values that are most likely to have produced our observed results); this process is known as maximum likelihood estimation (MLE) and the resulting estimates are maximum likelihood estimates. MLE is an iterative process and so specialized computer software is required. One exception is the case of simple and multiple linear regression models (with the identity link function), where we usually estimate the parameters using the method of least squares (the estimates are often referred to as ordinary least squares (OLS) estimates (Chapter 27)); the OLS and MLE estimates are identical in this situation.
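The equivalence of the OLS and MLE estimates for a linear model can be checked directly. The sketch below, using simulated data and Python's statsmodels, fits the same model by least squares and by maximum likelihood (Gaussian family, identity link).

```python
# Illustrative sketch: OLS and MLE give identical estimates for a linear model.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 100
df = pd.DataFrame({"x": rng.normal(size=n)})
df["y"] = 1.0 + 2.0 * df.x + rng.normal(size=n)

ols = smf.ols("y ~ x", data=df).fit()                                  # least squares
mle = smf.glm("y ~ x", data=df, family=sm.families.Gaussian()).fit()   # maximum likelihood

print(ols.params)
print(mle.params)   # identical (up to numerical precision) to the OLS estimates
```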
Assessing Adequacy of Fit
Although MLE maximizes L for a given set of explanatory variables, we can always improve L further by including additional explanatory variables. At its most extreme, a saturated model is one that includes a separate variable for each observation (i.e. individual) in the data set. While such a model would explain the data perfectly, it is of limited use in practice as the prediction of future observations from this model is likely to be poor. The saturated model does, however, allow us to calculate the value of L that would be obtained if we could model the data perfectly. Comparison of this value of L with the value obtained after fitting our simpler model with fewer variables provides a way of assessing the adequacy of the fit of our model. We consider the likelihood ratio, the ratio of the value of L obtained from the saturated model to that obtained from the fitted model, in order to compare these two models. More specifically, we calculate the likelihood ratio statistic (LRS) as

LRS = 2 × loge(Lsaturated/Lfitted) = −2 × loge(Lfitted/Lsaturated)

where Lsaturated and Lfitted are the values of L for the saturated and fitted models, respectively.
The LRS, often referred to as −2 log likelihood (see Chapters 30 and 31) or as the deviance, approximately follows a Chi-squared distribution with degrees of freedom equal to the difference in the number of parameters fitted in the two models (i.e. n − k − 1, where n is the number of observations in the data set and k is the number of parameters, apart from the intercept, in the simpler model). The null hypothesis is that the extra parameters in the larger saturated model are all zero; a high value of the LRS will give a significant result indicating that the goodness of fit of the model is poor.
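As an illustration of this goodness-of-fit test, the sketch below fits a Poisson model to simulated count data in Python's statsmodels; the deviance of the fitted model is compared with a Chi-squared distribution on n − k − 1 degrees of freedom. The adequacy of the Chi-squared approximation depends on the data, so treat this purely as a sketch.

```python
# Illustrative sketch: deviance (LRS vs the saturated model) as a goodness-of-fit check.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy import stats

rng = np.random.default_rng(6)
n = 200
df = pd.DataFrame({"x": rng.normal(size=n)})
df["count"] = rng.poisson(np.exp(0.5 + 0.3 * df.x))

fit = smf.glm("count ~ x", data=df, family=sm.families.Poisson()).fit()

# fit.deviance is the LRS comparing the fitted model with the saturated model;
# fit.df_resid equals n - k - 1 (here 200 - 1 - 1 = 198).
p_value = stats.chi2.sf(fit.deviance, fit.df_resid)
print(fit.deviance, fit.df_resid, p_value)   # a large deviance (small p) suggests poor fit
```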
The LRS can also be used in other situations. In particular, the LRS can be used to compare two models, neither of which is saturated, when one model is nested within another (i.e. the larger model includes all of the explanatory variables that are included in the smaller model, in addition to extra variables). In this situation, the test statistic is the difference between the value of the LRS from the model which includes the extra variables and that from the model which excludes these extra variables. The test statistic follows a Chi-squared distribution with degrees of freedom equal to the number of additional parameters included in the larger model, and is used to test the null hypothesis that the extra parameters in the larger model are all zero. The LRS can also be used to test the null hypothesis that all the parameters associated with the covariates of a model are zero by comparing the LRS of the model which includes the covariates with that of the model which excludes them. This is often referred to as the model Chi-square or the Chi-square for covariates (see Chapters 30 and 31).
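A minimal sketch of the nested-model comparison, again using simulated data and Python's statsmodels: the test statistic is the difference in −2 log likelihood between the smaller and larger models, referred to a Chi-squared distribution with degrees of freedom equal to the number of extra parameters.

```python
# Illustrative sketch: likelihood ratio test comparing two nested logistic models.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy import stats

rng = np.random.default_rng(4)
n = 500
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
p = 1 / (1 + np.exp(-(-0.2 + 0.7 * df.x1)))
df["y"] = rng.binomial(1, p)

small = smf.glm("y ~ x1", data=df, family=sm.families.Binomial()).fit()        # nested model
large = smf.glm("y ~ x1 + x2", data=df, family=sm.families.Binomial()).fit()   # adds x2

# LRS = (-2 log L_small) - (-2 log L_large) = 2 * (log L_large - log L_small)
lrs = 2 * (large.llf - small.llf)
extra_params = 1                              # one extra parameter in the larger model
p_value = stats.chi2.sf(lrs, extra_params)
print(lrs, p_value)   # tests H0: the coefficient of the extra variable (x2) is zero
```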
Regression Diagnostics
When performing any form of regression analysis, it is important to consider a series of regression diagnostics. These allow us to examine our fitted regression model and look for flaws that may affect our parameter estimates and their standard errors. In particular, we must consider whether the assumptions underlying the model are violated and whether our results are heavily affected by influential observations (Chapter 28).
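As one illustrative example (not an exhaustive set of diagnostics), the sketch below fits a simple linear regression to simulated data and extracts the residuals and Cook's distances to flag potentially influential observations; the threshold used is a common rule of thumb rather than a rule from this chapter.

```python
# Illustrative sketch: basic regression diagnostics for a fitted linear model.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
n = 100
df = pd.DataFrame({"x": rng.normal(size=n)})
df["y"] = 1 + 2 * df.x + rng.normal(size=n)

fit = smf.ols("y ~ x", data=df).fit()

residuals = fit.resid                        # inspect for non-linearity / non-constant variance
influence = fit.get_influence()
cooks_d = influence.cooks_distance[0]        # large values flag influential observations
print(df.index[cooks_d > 4 / n].tolist())    # rule-of-thumb threshold, for illustration only
```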