Performing a linear regression analysis


[Figure c28-fig-5002]


The Linear Regression Line


After selecting a sample of size n from our population and drawing a scatter diagram to confirm that the data approximate a straight line, we estimate the regression of y on x as:
Y = a + bx


where Y is the estimated fitted (predicted) value of y, a is the estimated intercept, and b is the estimated slope, which represents the average change in Y for a unit increase in x (Chapter 27).
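As a minimal sketch, the estimates a and b can be computed with the usual least-squares formulas (from Chapter 27); the data here are hypothetical and for illustration only:

```python
import numpy as np

# Hypothetical sample of n = 5 pairs (x, y); the values are illustrative only
x = np.array([35.0, 45.0, 55.0, 65.0, 75.0])
y = np.array([114.0, 124.0, 143.0, 158.0, 166.0])

# Least-squares estimates (the usual formulas, as in Chapter 27):
# b = Sxy / Sxx and a = ybar - b * xbar
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

Y = a + b * x  # fitted values, Y = a + bx
```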


Drawing the Line


To draw the line Y = a + bx on the scatter diagram, we choose three values of x (i.e. x1, x2 and x3) along its range. We substitute x1 in the equation to obtain the corresponding value of Y, namely Y1 = a + bx1; Y1 is our estimated fitted value for x1 which corresponds to the observed value, y1. We repeat the procedure for x2 and x3 to obtain the corresponding values of Y2 and Y3. We plot these points on the scatter diagram and join them to produce a straight line.
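The procedure above can be sketched numerically; a and b here are hypothetical estimates, and the three chosen x values are arbitrary points along the range:

```python
import numpy as np

a, b = 65.1, 1.38  # hypothetical estimates of the intercept and slope

x_points = np.array([40.0, 55.0, 70.0])  # x1, x2, x3 chosen along the range of x
Y_points = a + b * x_points              # the fitted values Y1, Y2, Y3

# The three points are exactly collinear, so joining them (e.g. with a
# plotting library) draws the fitted straight line on the scatter diagram.
```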


Checking the Assumptions


For each observed value of x, the residual is the observed y minus the corresponding fitted Y. Each residual may be either positive or negative. We can use the residuals to check the following assumptions underlying linear regression.



1 There is a linear relationship between x and y: Either plot y against x (the data should approximate a straight line) or plot the residuals against x (we should observe a random scatter of points rather than any systematic pattern).

2 The observations are independent: This is satisfied if there is no more than one pair of observations on each individual.

3 The residuals are Normally distributed with a mean of zero: Draw a histogram, stem-and-leaf plot, box-and-whisker plot (Chapter 4) or Normal plot (Chapter 35) of the residuals and ‘eyeball’ the result.

4 The residuals have the same variability (constant variance) for all the fitted values of y: Plot the residuals against the fitted values, Y, of y; we should observe a random scatter of points. If the scatter of residuals progressively increases or decreases as Y increases, then this assumption is not satisfied.

5 The x variable can be measured without error.
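The residual-based checks above can be sketched as follows (hypothetical data; note that with least squares the residuals sum to exactly zero, so it is the plots, not the mean, that need 'eyeballing'):

```python
import numpy as np

x = np.array([35.0, 45.0, 55.0, 65.0, 75.0])  # hypothetical data
y = np.array([114.0, 124.0, 143.0, 158.0, 166.0])

b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

Y = a + b * x          # fitted values
residuals = y - Y      # observed y minus fitted Y; each may be positive or negative

# Assumption 3: the residuals have mean zero (guaranteed by least squares);
# Normality is judged from a histogram or Normal plot of `residuals`.
# Assumptions 1 and 4: plot `residuals` against x and against Y and look
# for a random scatter of points with constant spread.
```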

Failure to Satisfy the Assumptions


If the linearity, Normality and/or constant variance assumptions are in doubt, we may be able to transform x or y (Chapter 9) and calculate a new regression line for which these assumptions are satisfied. It is not always possible to find a satisfactory transformation. The linearity and independence assumptions are the most important. If you are dubious about the Normality and/or constant variance assumptions, you may proceed, but the P-values in your hypothesis tests, and the estimates of the standard errors, may be affected. Note that the x variable is rarely measured without any error; provided the error is small, this is usually acceptable because the effect on the conclusions is minimal.
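As an illustration of such a transformation (using hypothetical, exactly exponential data for clarity), taking logs of y can restore linearity before refitting the line:

```python
import numpy as np

x = np.arange(1.0, 9.0)
y = 2.0 * np.exp(0.5 * x)  # hypothetical data: y grows multiplicatively with x

# A scatter of y on x is curved, but log y on x is linear, so we refit
# the regression line using the transformed response (Chapter 9)
log_y = np.log(y)
b = np.sum((x - x.mean()) * (log_y - log_y.mean())) / np.sum((x - x.mean()) ** 2)
a = log_y.mean() - b * x.mean()
# On the log scale the relationship is log y = log 2 + 0.5 x, exactly linear
```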


Outliers and Influential Points



  • An influential observation will, if omitted, alter one or both of the parameter estimates (i.e. the slope and/or the intercept) in the model. Formal methods of detection are discussed briefly in Chapter 29. If these methods are not available, you may have to rely on intuition.
  • An outlier (an observation that is inconsistent with most of the values in the data set (Chapter 3)) may or may not be an influential point, and can often be detected by looking at the scatter diagram or the residual plots (see also Chapter 29).

For both outliers and influential points, we fit the model with and without the suspect individual’s data and note the effect on the estimate(s). Do not discard outliers or influential points routinely because their omission may affect the conclusions. Always investigate the reasons for their presence and report them.
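A sketch of this refit-and-compare check (the data are hypothetical; the last individual plays the role of the suspect point):

```python
import numpy as np

def fit_line(x, y):
    """Least-squares intercept and slope."""
    b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    return y.mean() - b * x.mean(), b

x = np.array([35.0, 45.0, 55.0, 65.0, 75.0, 80.0])
y = np.array([114.0, 124.0, 143.0, 158.0, 166.0, 110.0])  # last value inconsistent

a_all, b_all = fit_line(x, y)            # model including the suspect individual
a_out, b_out = fit_line(x[:-1], y[:-1])  # model with the suspect individual omitted

# A large change in the slope and/or intercept marks the point as influential;
# investigate and report it rather than discarding it routinely.
```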


Assessing Goodness of Fit


We can judge how well the line fits the data by calculating R2 (usually expressed as a percentage), which is equal to the square of the correlation coefficient (Chapters 26 and 27). This represents the percentage of the variability of y that can be explained by its relationship with x. Its complement, (100 − R2), represents the percentage of the variation in y that is unexplained by the relationship. There is no formal test to assess R2; we have to rely on subjective judgement to evaluate the fit of the regression line.
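R2 can be obtained directly as the squared correlation coefficient; a sketch with the same hypothetical data:

```python
import numpy as np

x = np.array([35.0, 45.0, 55.0, 65.0, 75.0])  # hypothetical data
y = np.array([114.0, 124.0, 143.0, 158.0, 166.0])

r = np.corrcoef(x, y)[0, 1]   # Pearson correlation coefficient (Chapter 26)
R2 = 100.0 * r ** 2           # % of the variability of y explained by x
unexplained = 100.0 - R2      # % of the variation in y left unexplained
```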


Investigating the Slope


If the slope of the line is zero, there is no linear relationship between x and y: changing x has no effect on y. There are two approaches, with identical results, to testing the null hypothesis that the true slope, β, is zero.



  • Examine the F-ratio (equal to the ratio of the ‘explained’ to the ‘unexplained’ mean squares) in the analysis of variance table. It follows the F-distribution and has (1, n − 2) degrees of freedom in the numerator and denominator, respectively.
  • Calculate the test statistic t = b/SE(b), where SE(b) is the standard error of b; this statistic follows the t-distribution on n − 2 degrees of freedom.

In either case, a significant result (usually taken as P < 0.05) leads to rejection of the null hypothesis.
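The identity of the two approaches can be seen numerically: the F-ratio equals the square of the t statistic. A sketch with hypothetical data:

```python
import numpy as np

x = np.array([35.0, 45.0, 55.0, 65.0, 75.0])  # hypothetical data
y = np.array([114.0, 124.0, 143.0, 158.0, 166.0])
n = len(x)

Sxx = np.sum((x - x.mean()) ** 2)
b = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
a = y.mean() - b * x.mean()
resid_ms = np.sum((y - (a + b * x)) ** 2) / (n - 2)  # 'unexplained' mean square

se_b = np.sqrt(resid_ms / Sxx)   # SE(b), the standard error of b
t_stat = b / se_b                # follows the t-distribution on n - 2 df

explained_ms = b ** 2 * Sxx      # 'explained' mean square (1 df)
F = explained_ms / resid_ms      # follows the F-distribution on (1, n - 2) df
# Note that F = t_stat ** 2, so the two tests give identical P-values.
```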


We calculate the 95% confidence interval for β as b ± t0.05 × SE(b), where t0.05 is the percentage point of the t-distribution with n − 2 degrees of freedom which gives a two-tailed probability of 0.05. This interval contains the true slope with 95% certainty. For large samples, say n ≥ 100, we can approximate t0.05 by 1.96.
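A sketch of the interval calculation; b and SE(b) here are hypothetical values, and the t percentage point on 3 degrees of freedom (about 3.182) would come from tables or software such as scipy.stats.t.ppf(0.975, 3):

```python
# Hypothetical slope estimate and its standard error, with n = 5
b, se_b, n = 1.38, 0.1026, 5

# Two-tailed 5% point of the t-distribution on n - 2 = 3 df (from tables);
# for large samples (say n >= 100) this is approximately 1.96
t_05 = 3.182

ci_lower = b - t_05 * se_b
ci_upper = b + t_05 * se_b  # 95% confidence interval for the true slope beta
```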


Regression analysis is rarely performed by hand; computer output from most statistical packages will provide all of this information.


Using the Line for Prediction


We can use the regression line for predicting values of y for specific values of x within the observed range (never extrapolate beyond these limits). We predict the mean value of y for individuals who have a certain value of x by substituting that value of x into the equation of the line. So, if x = x0, we predict y as Y0 = a + bx0. We use this estimated predicted value, and its standard error, to evaluate the confidence interval for the true mean value of y in the population. Repeating this procedure for various values of x allows us to construct confidence limits for the line. This is a band or region that contains the true line with, say, 95% certainty. Similarly, we can calculate a wider region within which we expect most (usually 95%) of the observations to lie.
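A sketch of predicting the mean y at a given x0 with its confidence interval (hypothetical data; the standard-error formula for a fitted mean is the standard one, not stated explicitly in this chapter):

```python
import numpy as np

x = np.array([35.0, 45.0, 55.0, 65.0, 75.0])  # hypothetical data
y = np.array([114.0, 124.0, 143.0, 158.0, 166.0])
n = len(x)

Sxx = np.sum((x - x.mean()) ** 2)
b = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
a = y.mean() - b * x.mean()
s2 = np.sum((y - (a + b * x)) ** 2) / (n - 2)  # residual variance

x0 = 50.0                # must lie within the observed range of x
Y0 = a + b * x0          # predicted mean value of y when x = x0

# Standard error of the predicted MEAN value of y at x0 (usual formula)
se_Y0 = np.sqrt(s2 * (1.0 / n + (x0 - x.mean()) ** 2 / Sxx))
t_05 = 3.182             # t percentage point on n - 2 = 3 df (from tables)
ci = (Y0 - t_05 * se_Y0, Y0 + t_05 * se_Y0)
# Repeating this over a grid of x0 values traces out the confidence band for
# the line; the corresponding band for individual observations is wider.
```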





May 9, 2017 | Posted in GENERAL & FAMILY MEDICINE
