# 11 Bivariate Analysis

A variety of statistical tests can be used to analyze the relationship between two or more variables. Like Chapter 10, this chapter focuses on bivariate analysis, which is the analysis of the relationship between one independent (possibly causal) variable and one dependent (outcome) variable. Chapter 13 focuses on multivariable analysis, or the analysis of the relationship of more than one independent variable to a single dependent variable. (The term multivariate technically refers to analysis of multiple independent and multiple dependent variables, although it is often used interchangeably with multivariable.)

Statistical tests should be chosen only after the types of clinical data to be analyzed and the basic research design have been established. Steps in developing a research protocol include posing a good question, establishing a research hypothesis, establishing suitable measures, and deciding on the study design. The selection of measures in turn indicates the appropriate methods of statistical analysis. In general, the analytic approach should begin with a study of the individual variables, including their distributions and outliers, and a search for errors. Then bivariate analysis can be done to test hypotheses and probe for relationships. Only after these procedures have been done, and only if there is more than one independent variable to consider, should multivariable analysis be conducted.

# I Choosing an Appropriate Statistical Test

Table 11-1 shows the numerous tests of statistical significance that are available for bivariate (two-variable) analysis. The types of variables and the research design set the limits to statistical analysis and determine which test or tests are appropriate. The four types of variables are continuous data (e.g., levels of glucose in blood samples), ordinal data (e.g., rankings of very satisfied, satisfied, and unsatisfied), dichotomous data (e.g., alive vs. dead), and nominal data (e.g., ethnic group). An investigator must understand the types of variables and how the type of variable influences the choice of statistical tests, just as a painter must understand types of media (e.g., oils, tempera, watercolors) and how the different media influence the appropriate brushes and techniques to be used.

# II Making Inferences (Parametric Analysis) From Continuous Data

Studies often involve one variable that is continuous (e.g., blood pressure) and another variable that is not (e.g., treatment group, which is dichotomous). As shown in Table 11-1, a t-test is appropriate for analyzing these data. A one-way analysis of variance (ANOVA) is appropriate for analyzing the relationship between one continuous variable and one nominal variable. Chapter 10 discusses the use of Student’s and paired t-tests in detail and introduces the concept of ANOVA (see Variation between Groups versus Variation within Groups).
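The pairings named in the text so far can be summarized as a small lookup. This is only a sketch: the dictionary and the function name `suggest_tests` are hypothetical, and the entries are limited to the combinations this chapter's prose mentions, not the full contents of Table 11-1.

```python
# Lookup limited to the variable pairings named in this chapter's text;
# Table 11-1 covers more combinations than are listed here.
TESTS_BY_VARIABLE_TYPES = {
    frozenset(["continuous", "dichotomous"]): ["t-test"],
    frozenset(["continuous", "nominal"]): ["one-way ANOVA"],
    # Two continuous variables collapse to a one-element set.
    frozenset(["continuous"]): ["Pearson correlation", "simple linear regression"],
    frozenset(["dichotomous"]): [
        "chi-square test",
        "Fisher exact probability test",
        "McNemar chi-square test",
    ],
}


def suggest_tests(type_a, type_b):
    """Return the tests named in the text for a pair of variable types.

    An empty list means the pairing is not covered by this sketch.
    """
    return TESTS_BY_VARIABLE_TYPES.get(frozenset([type_a, type_b]), [])


print(suggest_tests("continuous", "dichotomous"))
print(suggest_tests("continuous", "continuous"))
```

The lookup ignores research design (for example, paired versus unpaired data), which also constrains the choice of test.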

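If a study involves two continuous variables, such as systolic blood pressure and diastolic blood pressure, the following questions may be answered:

1. Is there a real relationship between the variables?
2. If there is a real relationship, is it positive or negative, and is it linear?
3. If the relationship is linear, how strong is the association?
4. What is the probability that such a relationship could have occurred by chance?
5. Can the findings be generalized to other populations?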

## A Joint Distribution Graph

The raw data concerning the systolic and diastolic blood pressures of 26 young, healthy, adult participants were introduced in Chapter 10 and listed in Table 10-1. These same data can be plotted on a joint distribution graph, as shown in Figure 11-1. The data lie generally along a straight line, going from the lower left to the upper right on the graph, and all the observations except one are fairly close to the line.

As indicated in Figure 11-2, the correlation between two variables, labeled x and y, can range from nonexistent to strong. If the value of y increases as x increases, the correlation is positive; if y decreases as x increases, the correlation is negative. It appears from the graph in Figure 11-1 that the correlation between diastolic and systolic blood pressure is strong and positive. Based on Figure 11-1, the answer to the first question posed previously is that there is a real relationship between diastolic and systolic blood pressure. The answer to the second question is that the relationship is positive and is almost linear. The graph does not provide quantitative information about how strong the association is (although it looks strong to the eye), and the graph does not reveal the probability that such a relationship could have occurred by chance. To answer these questions more precisely, it is necessary to use the techniques of correlation and simple linear regression. Neither the graph nor these statistical techniques can answer the question of how general the findings are to other populations, however, which depends on research design, especially the method of sampling.

## B Pearson Correlation Coefficient

As with every test of significance, for any given level of strength of association, the larger the sample size, the more likely the result is to be statistically significant. A weak correlation in a large sample might be statistically significant even though it is not etiologically or clinically important (see later and Box 11-5). The converse may also be true; a result that is statistically weak still may be of public health and clinical importance if it pertains to a large portion of the population.
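The effect of sample size can be illustrated with the standard t statistic for testing whether a Pearson correlation differs from zero, t = r√(n − 2) / √(1 − r²), with n − 2 degrees of freedom. That formula is not stated in this section, and the correlation of 0.10 and the two sample sizes below are hypothetical values chosen only for illustration.

```python
import math


def t_statistic_for_r(r, n):
    """t statistic for H0: no linear correlation, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


weak_r = 0.10  # hypothetical weak correlation
t_small = t_statistic_for_r(weak_r, 30)    # small sample
t_large = t_statistic_for_r(weak_r, 1000)  # large sample

print(f"n = 30:   t = {t_small:.2f}")
print(f"n = 1000: t = {t_large:.2f}")
```

The same weak correlation gives t of about 0.53 with 30 participants and about 3.18 with 1000. The two-sided 0.05 critical value is roughly 2.05 for 28 degrees of freedom and roughly 1.96 for 998, so only the large-sample result would be declared statistically significant.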

For purposes of showing the calculation of r and r², a small set of data is introduced in Box 11-1. The data, consisting of the observed heights (variable x) and weights (variable y) of eight participants, are presented first in tabular form and then in graph form. When r is calculated, the result is 0.96, which indicates a strong positive linear relationship and provides quantitative information to confirm what is visually apparent in the graph. Given that r is 0.96, r² is (0.96)², or 0.92. A 0.92 strength of association means that 92% of the variation in weight is explained by height. The remaining 8% of the variation in this sample is presumed to be caused by factors other than height.
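A calculation of this kind might be sketched as follows, using the usual definition of the Pearson coefficient: the sum of cross-products of deviations from the means, divided by the square root of the product of the two sums of squared deviations. The eight height and weight pairs below are hypothetical stand-ins, not the Box 11-1 data, so the result will not match the 0.96 reported in the text.

```python
import math

# Hypothetical heights (cm) and weights (kg); NOT the Box 11-1 data.
heights = [160, 165, 170, 172, 175, 180, 183, 190]
weights = [55, 60, 66, 65, 72, 78, 80, 90]


def pearson_r(x, y):
    """Pearson product-moment correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares of deviations from the means.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


r = pearson_r(heights, weights)
r_squared = r ** 2  # proportion of variation in y explained by x

print(f"r = {r:.3f}, r squared = {r_squared:.3f}")
```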

Box 11-1 Analysis of Relationship between Height and Weight (Two Continuous Variables) in Eight Study Participants

Part 4 Calculation of the Slope (b) for a Regression of Weight (y) on Height (x)

Data from unpublished findings in a sample of eight professional persons in Connecticut.

## C Linear Regression Analysis

The formula for a straight line, as expressed in statistics, is y = a + bx (see Chapter 10). The y is the value of an observation on the y-axis; x is the value of the same observation on the x-axis; a is the regression constant (value of y when value of x is 0); and b is the slope (change in value of y for a unit change in value of x). Linear regression is used to estimate two parameters: the slope of the line (b) and the y-intercept (a). Most fundamental is the slope, which determines the impact of variable x on y. The slope can tell how much weight is expected to increase, on the average, for each additional centimeter of height.

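When the usual statistical notation is used for a regression of y on x, the least-squares formulas for the slope (b) and y-intercept (a) are as follows:

$$b = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} \qquad a = \bar{y} - b\bar{x}$$

Here $x_i$ and $y_i$ are the paired observations, and $\bar{x}$ and $\bar{y}$ are their means.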

Box 11-1 shows the calculation of the slope (b) for the observed heights and weights of eight participants. The graph in Box 11-1 shows the linear relationship between the height and weight data, with the regression line inserted. In these eight participants, the slope was 1.16, meaning that there was an average increase of 1.16 kg of weight for every 1-cm increase in height.
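The slope and intercept calculation can be sketched in a few lines. The code assumes the ordinary least-squares definitions (slope equal to the sum of cross-products of deviations divided by the sum of squared deviations of x; intercept equal to the mean of y minus the slope times the mean of x). The data are hypothetical heights and weights, not the Box 11-1 values, so the slope will differ from the 1.16 reported in the text.

```python
# Hypothetical heights (cm) and weights (kg); NOT the Box 11-1 data.
heights = [160, 165, 170, 172, 175, 180, 183, 190]
weights = [55, 60, 66, 65, 72, 78, 80, 90]


def fit_line(x, y):
    """Least-squares slope b and intercept a for the regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx            # change in y per unit change in x
    a = mean_y - b * mean_x  # value of y when x is 0
    return a, b


intercept, slope = fit_line(heights, weights)
print(f"weight = {intercept:.2f} + {slope:.2f} * height")

# Predicted weight for a hypothetical person 178 cm tall.
predicted = intercept + slope * 178
print(f"predicted weight at 178 cm: {predicted:.1f} kg")
```

The intercept is an extrapolation far outside the observed heights, so it has no useful physical interpretation here; the slope is the quantity of interest.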

Linear regression analysis enables investigators to predict the value of y from the values that x takes. The formula for linear regression is a form of statistical modeling, where the adequacy of the model is determined by how closely the value of y can be predicted from the other variable. For example, it may be of interest to see how much the systolic blood pressure increases, on the average, for each added year of age. Linear regression is useful in answering routine questions in clinical practice, such as, “How much exercise do I need to do to raise my HDL 10 points, or lose 10 pounds?” Such questions involve the magnitude of change in a given factor, y, for a specific change in behavior, or exposure, x.

Just as it is possible to set confidence intervals around parameters such as means and proportions (see Chapter 10), it is possible to set confidence intervals around the parameters of the regression (the slope and the intercept), using computations based on linear regression formulas. Most statistical computer programs perform these computations, and moderately advanced statistics books provide the formulas.² Multiple linear regression and other methods involved in the analysis of more than two variables are discussed in Chapter 13.
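As one illustration of such a computation, the sketch below builds a 95% confidence interval around the slope using the standard error formula commonly given for simple linear regression: the square root of the residual variance (residual sum of squares divided by n − 2) over the sum of squared deviations of x. The chapter does not spell out this formula, the data are hypothetical (not the Box 11-1 values), and the critical value 2.447 is the tabulated two-sided 95% t value for 6 degrees of freedom, hardcoded here because n = 8.

```python
import math

# Hypothetical heights (cm) and weights (kg); NOT the Box 11-1 data.
heights = [160, 165, 170, 172, 175, 180, 183, 190]
weights = [55, 60, 66, 65, 72, 78, 80, 90]

n = len(heights)
mean_x = sum(heights) / n
mean_y = sum(weights) / n
sxx = sum((x - mean_x) ** 2 for x in heights)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(heights, weights))

slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Residual variance: squared vertical distances from the fitted line, over n - 2.
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(heights, weights))
residual_variance = sse / (n - 2)
se_slope = math.sqrt(residual_variance / sxx)

T_CRIT_DF6 = 2.447  # two-sided 95% critical value of t for n - 2 = 6 df
ci_low = slope - T_CRIT_DF6 * se_slope
ci_high = slope + T_CRIT_DF6 * se_slope

print(f"slope = {slope:.3f}, 95% CI {ci_low:.3f} to {ci_high:.3f}")
```

An interval that excludes 0 corresponds to a slope that differs significantly from zero at the 0.05 level.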

# IV Making Inferences (Nonparametric Analysis) From Dichotomous and Nominal Data

As indicated in Table 11-1, the chi-square test, Fisher exact probability test, and McNemar chi-square test can be used in the analysis of dichotomous data, although they use different statistical theory. Usually, the data are first arranged in a 2 × 2 table, and the goal is to test the null hypothesis that the variables are independent.

## A 2 × 2 Contingency Table

Data arranged as in Box 11-2 form what is known as a contingency table because it is used to determine whether the distribution of one variable is conditionally dependent (contingent) on the other variable. More specifically, Box 11-2 provides an example of a 2 × 2 contingency table, meaning that it has two cells in each direction. In this case, the table shows the data for a study of 91 patients who had a myocardial infarction.⁹ One variable is treatment (propranolol vs. a placebo), and the other is outcome (survival for at least 28 days vs. death within 28 days).
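A test of independence on a table of this shape can be sketched as follows. The code computes the Pearson chi-square statistic (without a continuity correction) from observed and expected counts, where each expected count is row total × column total ÷ grand total. The cell counts are hypothetical and are not the Box 11-2 data; the function name `chi_square_2x2` is likewise illustrative.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]].

    Returns the statistic and its p value for 1 degree of freedom.
    """
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    observed = ((a, b), (c, d))
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Count expected in this cell if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom, the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts: rows = treatment vs. placebo, columns = survived vs. died.
chi2, p_value = chi_square_2x2(30, 10, 20, 20)
print(f"chi-square = {chi2:.2f}, p = {p_value:.3f}")
```

For these made-up counts the statistic is about 5.33, which exceeds 3.84, the 0.05 critical value for 1 degree of freedom, so the null hypothesis of independence would be rejected at that level.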

Box 11-2 Chi-Square Analysis of Relationship between Treatment and Outcome (Two Nonparametric Variables, Unpaired) in 91 Participants