The Linear Regression Line
After selecting a sample of size n from our population and drawing a scatter diagram to confirm that the data approximate a straight line, we estimate the regression of y on x as:
Y = a + bx
where Y is the estimated (fitted or predicted) value of y, a is the estimated intercept, and b is the estimated slope, which represents the average change in Y for a unit change in x (Chapter 27).
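Although in practice a statistical package does the arithmetic (see below), a minimal Python sketch of the least-squares estimates, using invented data purely for illustration, is:

```python
# Minimal sketch (not from the text): least-squares estimates of the
# intercept a and slope b, using made-up illustrative data.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical explanatory values
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # hypothetical responses

x_bar, y_bar = x.mean(), y.mean()
b = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)  # slope
a = y_bar - b * x_bar                                             # intercept
print(f"Y = {a:.2f} + {b:.2f}x")
```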
Drawing the Line
To draw the line Y = a + bx on the scatter diagram, we choose three values of x (i.e. x1, x2 and x3) along its range. We substitute x1 into the equation to obtain the corresponding value of Y, namely Y1 = a + bx1; Y1 is our estimated fitted value for x1, which corresponds to the observed value, y1. We repeat the procedure for x2 and x3 to obtain the corresponding values of Y2 and Y3. We plot these points on the scatter diagram and join them to produce a straight line.
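A sketch of this plotting procedure, reusing the x and y arrays from the previous snippet:

```python
# Sketch of the drawing procedure above; x and y as defined earlier.
# np.polyfit with degree 1 returns (slope, intercept).
import numpy as np
import matplotlib.pyplot as plt

b, a = np.polyfit(x, y, 1)
x_chosen = np.array([x.min(), np.median(x), x.max()])  # x1, x2, x3
Y_fitted = a + b * x_chosen                            # Y1, Y2, Y3

plt.scatter(x, y, label="observed values")
plt.plot(x_chosen, Y_fitted, label="Y = a + bx")
plt.xlabel("x"); plt.ylabel("y"); plt.legend()
plt.show()
```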
Checking the Assumptions
For each observed value of x, the residual is the observed y minus the corresponding fitted Y. Each residual may be either positive or negative. We can use the residuals to check the following assumptions underlying linear regression:
- Linearity: there is a linear relationship between x and y. Check by plotting the residuals against x; the points should scatter randomly about zero with no pattern.
- Independence: the observations are independent of one another.
- Normality: the residuals are Normally distributed with a mean of zero. Check with a histogram or Normal plot of the residuals.
- Constant variance: the residuals have the same variability for all values of x. Check that the vertical spread of the residuals, plotted against x, is roughly constant.
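A minimal sketch of these residual checks, again assuming the x and y arrays from the earlier snippet (the plots guide a visual judgement; they are not a formal test):

```python
# Sketch of the residual checks; x and y as defined earlier.
import numpy as np
import matplotlib.pyplot as plt

b, a = np.polyfit(x, y, 1)
residuals = y - (a + b * x)            # observed y minus fitted Y

fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.scatter(x, residuals)              # look for patterns or fanning out
ax1.axhline(0, color="grey")
ax1.set_xlabel("x"); ax1.set_ylabel("residual")
ax2.hist(residuals)                    # look for rough Normality
plt.show()
```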
Failure to Satisfy the Assumptions
If the linearity, Normality and/or constant variance assumptions are in doubt, we may be able to transform x or y (Chapter 9) and calculate a new regression line for which these assumptions are satisfied. It is not always possible to find a satisfactory transformation. The linearity and independence assumptions are the most important. If you are dubious about the Normality and/or constant variance assumptions, you may proceed, but the P-values in your hypothesis tests, and the estimates of the standard errors, may be affected. Note that the x variable is rarely measured without any error; provided the error is small, this is usually acceptable because the effect on the conclusions is minimal.
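If, for example, the residual plots suggest non-linearity or variance that increases with x, one commonly tried remedy is a log transformation of y. A sketch, assuming the earlier x and y arrays:

```python
# Sketch: refit after a log transformation of y (requires y > 0);
# x and y as defined earlier.
import numpy as np

log_y = np.log(y)
b, a = np.polyfit(x, log_y, 1)         # regression of log(y) on x
residuals = log_y - (a + b * x)        # re-check the assumptions on these
```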
Outliers and Influential Points
- An influential observation will, if omitted, alter one or both of the parameter estimates (i.e. the slope and/or the intercept) in the model. Formal methods of detection are discussed briefly in Chapter 29. If these methods are not available, you may have to rely on intuition.
- An outlier, i.e. an observation that is inconsistent with most of the values in the data set (Chapter 3), may or may not be an influential point, and can often be detected by looking at the scatter diagram or the residual plots (see also Chapter 29).
For both outliers and influential points, we fit the model with and without the suspect individual’s data and note the effect on the estimate(s). Do not discard outliers or influential points routinely because their omission may affect the conclusions. Always investigate the reasons for their presence and report them.
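A sketch of this with-and-without comparison, assuming the earlier x and y arrays (the index of the suspect observation here is purely illustrative):

```python
# Sketch: refit the line with and without a suspect observation and
# compare the estimates; x and y as defined earlier.
import numpy as np

b_all, a_all = np.polyfit(x, y, 1)
mask = np.arange(len(x)) != 3                  # drop the suspect point
b_sub, a_sub = np.polyfit(x[mask], y[mask], 1)
print(f"with point:    a = {a_all:.2f}, b = {b_all:.2f}")
print(f"without point: a = {a_sub:.2f}, b = {b_sub:.2f}")
```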
Assessing Goodness of Fit
We can judge how well the line fits the data by calculating R2 (usually expressed as a percentage), which is equal to the square of the correlation coefficient (Chapters 26 and 27). This represents the percentage of the variability of y that can be explained by its relationship with x. Its complement, (100 − R2), represents the percentage of the variability of y that is unexplained by the relationship. There is no formal test to assess R2; we have to rely on subjective judgement to evaluate the fit of the regression line.
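A sketch of the calculation, with x and y as in the earlier snippets:

```python
# Sketch: R2 as the square of Pearson's correlation coefficient,
# expressed as a percentage; x and y as defined earlier.
import numpy as np

r = np.corrcoef(x, y)[0, 1]        # Pearson correlation coefficient
r_squared = 100 * r ** 2           # % of variability in y explained by x
print(f"R2 = {r_squared:.1f}%  (unexplained: {100 - r_squared:.1f}%)")
```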
Investigating the Slope
If the slope of the line is zero, there is no linear relationship between x and y: changing x has no effect on y. There are two approaches, which give identical results, to testing the null hypothesis that the true slope, β, is zero:
- Examine the F-ratio (equal to the ratio of the ‘explained’ to the ‘unexplained’ mean squares) in the analysis of variance table. It follows the F-distribution with 1 and n − 2 degrees of freedom in the numerator and denominator, respectively.
- Calculate the test statistic t = b/SE(b), which follows the t-distribution on n − 2 degrees of freedom, where SE(b) is the standard error of b.
In either case, a significant result (usually taken as P < 0.05) leads to rejection of the null hypothesis.
We calculate the 95% confidence interval for β as b ± t0.05 × SE(b), where t0.05 is the percentage point of the t-distribution with n − 2 degrees of freedom which gives a two-tailed probability of 0.05. This interval contains the true slope with 95% certainty. For large samples, say n ≥ 100, we can approximate t0.05 by 1.96.
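The following Python sketch (not from the text) illustrates the t-test approach and the confidence interval using scipy.stats.linregress, which returns the slope, its standard error and the two-sided P-value for the null hypothesis that β = 0; x and y are the arrays defined earlier:

```python
# Sketch: test of zero slope and 95% CI for beta; x, y as defined earlier.
import numpy as np
from scipy import stats

res = stats.linregress(x, y)
n = len(x)
t_stat = res.slope / res.stderr            # t on n - 2 df
# note: t_stat**2 equals the F-ratio with (1, n - 2) df in simple regression
t_crit = stats.t.ppf(0.975, df=n - 2)      # two-tailed 5% point
ci = (res.slope - t_crit * res.stderr, res.slope + t_crit * res.stderr)
print(f"b = {res.slope:.3f}, t = {t_stat:.2f}, P = {res.pvalue:.4f}")
print(f"95% CI for beta: {ci[0]:.3f} to {ci[1]:.3f}")
```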
Regression analysis is rarely performed by hand; computer output from most statistical packages will provide all of this information.
Using the Line for Prediction
We can use the regression line for predicting values of y for specific values of x within the observed range (never extrapolate beyond these limits). We predict the mean value of y for individuals who have a certain value of x by substituting that value of x into the equation of the line. So, if x = x0, we predict y as Y0 = a + bx0. We use this estimated predicted value, and its standard error, to evaluate the confidence interval for the true mean value of y in the population. Repeating this procedure for various values of x allows us to construct confidence limits for the line. This is a band or region that contains the true line with, say, 95% certainty. Similarly, we can calculate a wider region within which we expect most (usually 95%) of the observations to lie.
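A sketch of both intervals, computed from the standard formulas for the standard error of a fitted mean and of a single new observation (x and y as before; the value of x0 is illustrative and must lie within the observed range):

```python
# Sketch: 95% CI for the mean of y at x0, and the wider prediction
# interval for an individual new observation; x, y as defined earlier.
import numpy as np
from scipy import stats

n = len(x)
b, a = np.polyfit(x, y, 1)
resid = y - (a + b * x)
s = np.sqrt(np.sum(resid ** 2) / (n - 2))     # residual standard deviation
Sxx = np.sum((x - x.mean()) ** 2)

x0 = 3.5                                      # within the observed range
Y0 = a + b * x0
se_mean = s * np.sqrt(1 / n + (x0 - x.mean()) ** 2 / Sxx)
se_pred = s * np.sqrt(1 + 1 / n + (x0 - x.mean()) ** 2 / Sxx)
t_crit = stats.t.ppf(0.975, df=n - 2)
print(f"mean of y at x0:    {Y0:.2f} +/- {t_crit * se_mean:.2f}")
print(f"single new y at x0: {Y0:.2f} +/- {t_crit * se_pred:.2f}")
```

Evaluating se_mean over a grid of x0 values traces out the confidence band for the line; using se_pred instead gives the wider region expected to contain most individual observations.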