Specific Regression Techniques



Introduction





Regression is a tool used by statisticians and clinicians to establish or identify a relationship between one or more explanatory covariates and a response variable (outcome). Many textbooks are devoted to specific regression models and strategies. In particular, Regression Modeling Strategies with Applications to Linear Models, Logistic Regression and Survival Analysis by Harrell (1) is an excellent reference. In this chapter, we briefly summarize and exemplify the techniques that are fundamental to outcomes research.






First we address common, or “macro,” issues in regression modeling, regardless of the scale of outcome. We discuss the purpose of an analysis and the relationship to covariate selection, model complexity, and criteria for model performance. Focus is placed on interpretation of models on a “macro” scale. We then focus on features of specific models (linear, logistic, and hazards regression), paying particular attention to the hypotheses and data behind an analysis. Parameters and output are interpreted on a “micro” scale for each of the three regression methods.






General Strategies





Purpose



Before building a regression model, the statistician and clinician should discuss the modeling purpose, key hypotheses, and available data. Each of these will guide an appropriate regression analysis.



The purpose of an analysis may not be straightforward or singular. Some common purposes seen in practice are to establish a predictive model for an outcome (prediction), to identify covariate associations that can inform understanding of disease etiology (association), or to examine a specific variable of interest while adjusting for the effects of others (adjustment). If we knew the “true model,” including all important variables and their functional relationships with outcome, then that model would be useful for all purposes. In practice, we do not know the truth, and our decisions have trade-offs. Understanding the purpose of the model helps us create a model that is useful, albeit imperfect.

As its name suggests, a predictive model is used when clinicians are interested in predicting the outcome for future patients based on currently available data. Having a predicted risk of outcome can guide treatment and therapies as well as inform the patient. The goal is to achieve precise predictions of future outcomes rather than to interpret specific parameter estimates. There is no benefit from including covariates whose effects cannot be well estimated, and researchers might be willing to accept some bias in individual covariate estimates in exchange for greater precision. Hence, statisticians and clinicians must work together to identify the covariates that are most important to predicting outcome. Automated variable-selection techniques are commonly employed and are discussed below; however, any variable selection should still be guided, in part, by clinical practice. In a predictive study, interpretation can be largely focused on performance statistics, such as the C-index, which evaluate the model’s predictions as a whole; some of these predictive assessment tools are described below. Researchers often publish an entire predictive model or design a nomogram to aid practitioners in establishing patient risk using a predictive model (1).
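To make one such performance statistic concrete: for a binary outcome, the C-index is the proportion of event/nonevent pairs in which the event case received the higher predicted risk. The sketch below is a minimal numpy illustration on hypothetical risk scores, not data from any study cited in this chapter.

```python
import numpy as np

def c_index(risk, outcome):
    """Concordance index for a binary outcome: among all pairs with
    discordant outcomes, the fraction in which the event case was
    assigned the higher predicted risk (ties count as 1/2)."""
    risk = np.asarray(risk, dtype=float)
    outcome = np.asarray(outcome, dtype=int)
    events = risk[outcome == 1]
    nonevents = risk[outcome == 0]
    # Compare every event's risk against every nonevent's risk.
    diff = events[:, None] - nonevents[None, :]
    concordant = (diff > 0).sum()
    tied = (diff == 0).sum()
    return (concordant + 0.5 * tied) / diff.size

# Toy check: perfectly ranked risks give C = 1.0.
print(c_index([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))  # 1.0
```

A C-index of 0.5 corresponds to chance discrimination, which is why values such as 0.75 and 0.73 in the CRUSADE example below indicate similar, moderately good discrimination.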



Roe et al. (2) exemplify the dual purpose of studying association but emphasizing prediction. In this case, the outcome was long-term mortality in older patients with non-ST-segment elevation myocardial infarction (MI) enrolled in the Can Rapid risk stratification of Unstable angina patients Suppress ADverse outcomes with Early implementation of the ACC/AHA guidelines? (CRUSADE) registry. The authors initially emphasized a comparison of associations, and relative covariate importance, in a model with 22 statistically significant variables. Subsequently, they focused on prediction, using clinical judgment to define a reduced model with 13 covariates but nearly equivalent predictive performance (comparable discrimination [C-index] 0.75 for the derivation sample of the full model versus 0.73 for the reduced model). The reduced model was slightly inferior, and does not represent an optimal strategy for prediction; however, it achieves greater simplicity and convenience with relatively little compromise in performance. The focus on prediction was further emphasized by the creation of a risk score. Of interest, they did not use an automatic variable-selection procedure but instead included all 22 univariably significant (α = 0.05) covariates in the initial model.



Superior techniques exist for predictive modeling. However, in the very large CRUSADE registry, alternative techniques are unlikely to provide clinically different predictive performance because there is sufficient power to detect even marginally important covariates and little chance of overfitting.



At the other end of the spectrum from the prediction model is the adjustment model. Often arising when a clinician wants to examine the effect of a particular covariate, adjustment models need not employ automated variable selection. As discussed in Chapters 13 and 15, confounding can lead to biased coefficient estimates when the confounding variable is not properly adjusted for. Thus, in the adjustment model, the focus is on minimizing the bias of the estimated coefficient by including all known confounders. Ideally, these may be identified from previous publications, including larger studies and clinically similar populations. These variables need not achieve statistical significance in the dataset being analyzed, because researchers may be willing to accept some additional “noise” to minimize bias. Overall, there is less concern for overfitting, although an excessive number of inappropriate covariates should still not be included. For example, if a particular characteristic has been shown to be important in the literature but is not statistically significant in one’s sample, it should still be adjusted for. In an adjustment study, only the particular covariate of interest must be interpreted. In fact, adjustment studies need not publish the entire model; they may simply mention the adjustment covariates and display output only for the covariate of interest.



For an example of an adjustment model, consider Piccini et al. (3). The authors sought to identify the relationships that two drugs for treatment of ventricular arrhythmia—amiodarone and lidocaine—had with 30-day and 6-month mortality, using data from the Global Use of Strategies To Open occluded coronary arteries in acute coronary syndromes (GUSTO)-IIb and GUSTO-III randomized controlled trials. Partly because of the secondary nature of the analysis, the authors were concerned with confounding. Thus, attempting to identify the risk attributable to each medication required adjusting for clinical characteristics. A published model for mortality in a similar population (4), along with expert guidance from the investigators, was used to identify 17 adjustment covariates. The authors noted these adjustment variables but reported hazard ratios and results only for the treatment variables of interest.



Finally, some clinicians wish to explore associations among many covariates and outcome, with a focus on the biological processes driving outcomes. This common purpose is probably the hardest to achieve. As with adjustment models, researchers want unbiased estimates because they will interpret parameters; as with prediction, they are concerned about minimizing overfitting and want to include only important covariates. Essentially, researchers want to avoid including numerous nonsignificant covariates that would add noise to the system while complicating interpretation. Multicollinearity, which arises when two or more covariates are highly correlated with one another, is also a concern, and thought must be given to determining the appropriate causal and biological pathways to consider. Variable-selection techniques can be employed, but a firm understanding of how covariates interact and change together should inform the selection, rather than relying on an “out-of-the-box” variable-selection solution that might not be interpretable or biologically consistent. Interpretation is of the whole model, perhaps with a focus on particular biological systems of interest.



The report from Forman et al. (5) provides an example of associative modeling. (Note, however, that associative purposes vary and can be presented in many different ways.) Using data from the Heart Failure: A Controlled Trial Investigating Outcomes of exercise traiNing (HF-ACTION) study, the researchers sought an associative model for measures of exercise capacity that included age and 34 other candidate variables. The primary conclusion regarded age as a key to the pathophysiology and clinical management of reduced exercise performance in older persons with heart failure. Stepwise selection was used to select variables for their model, followed by a procedure to further isolate the most significant factors in assessing exercise capacity. Finally, for some of their models, key covariates that were considered clinically relevant were added to the models. Interactions with age were explored for each outcome. In particular, the researchers were interested in interpreting whether various covariates affecting outcomes would have differential associations with outcome at increased or decreased age values or independent of age. Estimated effects were provided for all covariates for one of the primary outcomes.






Variable-Selection Techniques



Many studies collect numerous variable measurements on subjects, some of which may be related to a particular outcome while others are not. In the presence of many candidate variables, variable selection may be employed to determine which variables to include in a final model. Traditional variable-selection techniques include forward, backward, and stepwise selection. Modern methods use penalized regression and shrinkage methods, such as the least absolute shrinkage and selection operator (LASSO) regression, adaptive LASSO, and modern variations in traditional methods, such as fast false selection rate (FSR) variable selection.



Forward, backward, and stepwise (collectively, FBS) selections are among the most common traditional variable-selection methods. They are iterative procedures that systematically introduce or remove variables from the statistical model one at a time until criteria are met.



In forward selection, researchers must specify an entry criterion, such as P < 0.05. Entry criteria can be based on a variety of statistics, such as adjusted R2, Akaike information criterion (AIC), Bayesian information criterion (BIC), or Mallows’ Cp, but commonly an F-test is employed when a corresponding P value cutoff is used. The statistical software begins with no variables in the model. One by one, it checks a model for each candidate variable, and the variable with the smallest P value is entered into the model if it meets the entry criterion. The remaining variables are then checked again using a model conditional on the first selected variable; again, the variable with the smallest P value is entered if it meets the entry criterion. This process is repeated until no more candidate variables meet the entry criterion. One consequence of this process is that the final model can contain nonsignificant variables that met the criterion for entry but do not remain significant after other variables are added.
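The loop just described can be sketched in a few lines. Because a dependency-free F-test is awkward, this sketch uses AIC (one of the alternative criteria mentioned above) as the entry rule rather than a P value cutoff; the data are simulated and the variable names are hypothetical.

```python
import numpy as np

def ols_aic(X, y):
    """AIC for an OLS fit with Gaussian errors (up to an additive constant)."""
    n = len(y)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    return n * np.log(rss / n) + 2 * X.shape[1]

def forward_select(X, y, names):
    """Greedy forward selection: at each step add the candidate that
    lowers AIC the most; stop when no addition improves AIC."""
    n = len(y)
    chosen, remaining = [], list(range(X.shape[1]))
    design = np.ones((n, 1))            # start from the intercept-only model
    best_aic = ols_aic(design, y)
    while remaining:
        scores = [(ols_aic(np.column_stack([design, X[:, j]]), y), j)
                  for j in remaining]
        aic, j = min(scores)
        if aic >= best_aic:             # no candidate improves the criterion
            break
        best_aic = aic
        design = np.column_stack([design, X[:, j]])
        chosen.append(names[j])
        remaining.remove(j)
    return chosen

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 2 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(size=200)  # only x0 and x2 matter
print(forward_select(X, y, ["x0", "x1", "x2", "x3", "x4"]))
```

With strong simulated signals, the two true predictors enter first; whether a pure-noise column sneaks in depends on the sample, which is exactly the overfitting risk discussed later in this section.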



Backward selection uses the reverse process. First, researchers must specify an exit criterion, for example, P > 0.20 based on a conditional F-test. The software fits a model that includes all possible candidate variables. The variable with the largest P value is selected and, if the exit criterion is satisfied, is dropped from the model. The model is then rerun without the dropped variable, and selection continues until no variable in the model meets the exit criterion.
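Backward elimination admits the same kind of sketch. As with the forward example, AIC stands in for the P-value exit rule to keep the code dependency-free, and the data are simulated rather than drawn from any study discussed here.

```python
import numpy as np

def ols_aic(X, y):
    """AIC for an OLS fit with Gaussian errors (up to an additive constant)."""
    n = len(y)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    return n * np.log(rss / n) + 2 * X.shape[1]

def backward_select(X, y, names):
    """Greedy backward elimination: start from the full model and drop,
    one at a time, the variable whose removal lowers AIC the most;
    stop when every remaining variable is worth keeping."""
    n = len(y)
    kept = list(range(X.shape[1]))
    design = lambda idx: np.column_stack([np.ones(n)] + [X[:, j] for j in idx])
    best_aic = ols_aic(design(kept), y)
    while kept:
        scores = [(ols_aic(design([k for k in kept if k != j]), y), j)
                  for j in kept]
        aic, j = min(scores)
        if aic >= best_aic:             # dropping anything would hurt the fit
            break
        best_aic = aic
        kept.remove(j)
    return [names[j] for j in kept]

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = 2 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(size=200)
print(backward_select(X, y, ["x0", "x1", "x2", "x3", "x4"]))
```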



Finally, stepwise selection employs elements of both forward and backward selection and is increasingly common in the medical literature. Both entry and exit criteria are specified. Starting with the null model, each variable is tested for entry. After a variable is entered, the model is checked for variables that meet the exit criterion. The entry–exit procedure is repeated until no remaining candidate variables meet the entry criterion and no included variables meet the exit criterion.



Some recent developments have focused on penalization or shrinkage methods. Among these modern methods, variations in LASSO regression can be applied (6). In short, LASSO estimates coefficients while optimizing model criteria such as AIC or BIC. Some coefficients are not worth estimating, because their estimates would be no better than zero according to the criteria of AIC or BIC. As a result, some estimates are set to zero while others are “shrunk” toward zero. In general, the best predictors will see very little shrinkage of estimates, whereas poor predictors will be shrunk substantially toward or even completely to zero. Variables with nonzero coefficients are interpreted as important, in the sense that they add to prediction.
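The shrinkage behavior can be made concrete with a minimal coordinate-descent implementation of the LASSO (soft-thresholding) on simulated data. In practice one would use a packaged routine with a carefully tuned penalty, so treat this purely as an illustration of how weak predictors are set exactly to zero while strong predictors are only mildly shrunk.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """LASSO via cyclic coordinate descent with soft-thresholding.
    Minimizes 0.5*||y - Xb||^2 + lam*||b||_1 (columns standardized)."""
    n, p = X.shape
    b = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0)           # per-column sum of squares
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ b + X[:, j] * b[j]  # partial residual excluding j
            rho = X[:, j] @ r
            # Soft-threshold: small signals are zeroed, large ones shrunk.
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0) / col_ss[j]
    return b

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
X = (X - X.mean(0)) / X.std(0)              # standardize the columns
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(size=100)
b = lasso_cd(X, y, lam=50.0)
print(np.round(b, 2))  # weak predictors are shrunk exactly to zero
```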



Other recent developments include fast FSR variable selection (7). Fast FSR is commonly applied as an adjunct to forward selection to improve model error and decrease false selection. The procedure intentionally adds noisy, uninformative variables to the list of potential covariates and tracks them to see when they enter the model. Doing so allows the procedure to stop before allowing many poor predictors into the model, which often occurs in traditional forward selection.
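The flavor of this idea, though not the published fast FSR algorithm itself, can be illustrated by appending permuted “phony” copies of each covariate and watching where they enter the forward-selection ordering; all data below are simulated.

```python
import numpy as np

def forward_order(X, y):
    """Order in which forward selection (by RSS reduction) adds columns."""
    n, p = X.shape
    design = np.ones((n, 1))
    remaining, order = list(range(p)), []
    while remaining:
        rss = []
        for j in remaining:
            Z = np.column_stack([design, X[:, j]])
            beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
            rss.append((np.sum((y - Z @ beta) ** 2), j))
        _, j = min(rss)
        design = np.column_stack([design, X[:, j]])
        order.append(j)
        remaining.remove(j)
    return order

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 6))
y = 2 * X[:, 0] - X[:, 3] + rng.normal(size=150)
phony = rng.permuted(X, axis=0)       # known-noise copies of each column
order = forward_order(np.column_stack([X, phony]), y)
# Keep only real variables that enter before the first phony one.
first_phony = min(order.index(j) for j in range(6, 12))
print([f"x{j}" for j in order[:first_phony]])
```

Because the phony columns are known to be noise, their entry points estimate how quickly uninformative variables begin to creep in, which suggests where the selection should stop.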



No variable-selection method is without criticism. LASSO regression and adaptive LASSO are often criticized because they provide neither P values nor confidence bounds for parameter estimates. Worse, FBS and fast FSR methods provide improper P values and confidence bounds. Often overlooked in the medical literature, the P value provided by these traditional methods is conditional on the final model and does not account for the variation induced by iterative searches and the multiple comparisons made along the way. Hence, reported P values are overly optimistic; taking these P values and confidence bounds literally can result in higher false-discovery rates, and estimates might not be validated in future studies. Such improper P values should be used only as an exploratory tool, particularly if the number of covariates tested during variable selection is large.



Biased estimates are a concern for all variable-selection methods. Under LASSO methods, all estimates are “shrunk” toward zero, some more than others. Hence, even in the most ideal scenario, LASSO estimates are biased toward zero; however, recall that for good predictors, this bias (“shrinkage”) will be minimal. FBS methods can often lead to overfit models, although recent work, such as the fast FSR procedure, is starting to address this issue. If an FBS method or fast FSR identifies the true but unknown model, then the estimates would be unbiased; however, this is often unrealistic in practice.



Despite the increased use of stepwise selection in the medical literature and the criticisms of each method noted above, statistical preference when developing a predictive model is typically for shrinkage methods such as LASSO and adaptive LASSO; the LASSO estimator has shown good predictive performance compared with traditional methods (8). When developing an associative or adjustment model, fast FSR may provide the best set of tools for minimizing biased estimates while controlling overfitting, more so than FBS methods alone. Again, no variable-selection method should be used haphazardly; each should be guided by clinical and statistical knowledge.






Model Complexity



The study purpose also can influence other regression modeling assumptions, including model complexity. Consider two primary modeling assumptions related to linearity and interactions (additivity). First, consider whether the model will employ only linear effects or whether quadratic terms or linear splines should be considered. In many cases, linear effects are sufficient, but this assumption should be tested for adequacy, and transformations should be made when appropriate. It is easy to show examples in which a particular variable has no linear effect, yet a significant effect emerges when splines are considered. For example, age can have a strong positive association with outcome for subjects younger than 50 but a negative association with outcome for those older than 50. A purely linear effect of age might be attenuated by the contradictory associations above and below the knot point.
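A simulation illustrates the point. Assuming a hypothetical outcome that rises with age until 50 and declines afterward, a straight line in age explains almost nothing, while adding a single hinge term, max(age − 50, 0), recovers the relationship.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400
age = rng.uniform(20, 80, n)
# Hypothetical truth: outcome rises with age until 50, then declines.
y = np.where(age < 50, 0.5 * age, 25 - 0.5 * (age - 50)) + rng.normal(0, 2, n)

def fit_r2(design, y):
    """R^2 of an OLS fit of y on the given design matrix."""
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return 1 - resid.var() / y.var()

ones = np.ones(n)
linear = np.column_stack([ones, age])
# Linear spline: a hinge term lets the slope change at the knot, age = 50.
spline = np.column_stack([ones, age, np.maximum(age - 50, 0)])
print(f"R^2 linear: {fit_r2(linear, y):.2f}, spline: {fit_r2(spline, y):.2f}")
```

In this simulated example the linear fit's R-squared is near zero because the opposing slopes cancel, exactly the attenuation described above, while the spline fit captures most of the variation.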



Researchers should also decide whether interactions between covariates should be considered. Interactions can quickly complicate a model that has many covariates, so interaction terms are frequently avoided unless the clinician has reason to anticipate unique subgroup associations. To illustrate the complexity, suppose researchers are considering 10 covariates (age, weight, etc.) for their main effects. In this case, there are 45 possible second-order interactions (say, between age and weight), in addition to numerous higher-order interactions of 3 or more variables. Testing all 45 interactions, refitting the model each time, while ignoring the multiple comparisons involved would be grossly irresponsible. Thus, interactions should be considered only when expert knowledge or a specific hypothesis supports them.
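These counts follow directly from binomial coefficients:

```python
from math import comb

p = 10  # candidate main effects
print(comb(p, 2))                                # 45 two-way interactions
print(sum(comb(p, k) for k in range(3, p + 1)))  # 968 higher-order terms
```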



When many different variables are tested and retested, multiple comparisons are being made. Without properly adjusting for the process, an overfit model can result. For example, consider 20 uncorrelated randomly distributed variables that are also completely uncorrelated with outcome. If all 20 variables were considered in a model for outcome, we would expect, by chance alone, an average of 1 of the 20 to have an observed P value <0.05. If one tests and retests many potential covariates in an unorganized and haphazard way, the probability of obtaining spurious results increases.
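This expectation is easy to verify by simulation. The sketch below uses purely random data and a large-sample t cutoff of roughly 1.98 (approximately P < 0.05 with about 98 degrees of freedom); on average, about 1 of the 20 pure-noise covariates appears "significant."

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, reps = 100, 20, 500
false_hits = []
for _ in range(reps):
    X = rng.normal(size=(n, p))   # 20 covariates, pure noise
    y = rng.normal(size=n)        # outcome unrelated to all of them
    # Univariable t-statistic for each covariate via sample correlations.
    Xs = (X - X.mean(0)) / X.std(0)
    ys = (y - y.mean()) / y.std()
    r = Xs.T @ ys / n
    t = r * np.sqrt((n - 2) / (1 - r ** 2))
    false_hits.append(np.sum(np.abs(t) > 1.98))  # count spurious "hits"
print(np.mean(false_hits))        # close to 20 * 0.05 = 1
```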



The big picture to consider is that trade-offs exist when making decisions regarding model complexity. Including interactions can lead to overfitting the model if no true interaction effect exists, but leaving interactions out of a model can result in some bias. Similarly, quadratic or spline terms are not always appropriate, but omission of important ones can lead to significant bias, say, by declaring age insignificant despite significant associations in certain subsets of age. At all times, one should be mindful of making multiple comparisons without proper adjustment. Again, expert opinion and biological processes should guide the use of interactions.


Jun 14, 2016 | Posted in PUBLIC HEALTH AND EPIDEMIOLOGY
