CHAPTER 8 Distributions




The numbers in a data set represent values of a characteristic. Every member of the population has a value for that characteristic, and the values are not all the same. However, they often tend to cluster. For instance, in a population of 30-year-olds, systolic blood pressure varies from person to person but tends to cluster around 120 mmHg. This is the nature of biologic data: values are not uniform, but they do tend to congregate around a common value.


When we go on a vacation to a distant location, we want to be able to describe the experience to our friends. We remember the topography of the area, what the temperature was like, and the inhabitants. We take pictures to record different details such as the various plants and flowers, the relief of the land, and the way the native people look. It is much easier for our friends to get an idea of the type of place it is by looking at photos rather than listening to a verbal description.


Just as a collection of photos contains a lot of information about a place, a data set from a sample contains a lot of information about a population. However, it is difficult to see the way a characteristic is distributed by looking at a list of numbers. It is much easier to get a feel for the spread of the individual characteristics by transforming the data into pictures by creating graphs.


In the vacation metaphor, different types of photos are best-suited for representing a given characteristic about the place. For instance, close-ups of faces are more suitable for showing the features of the individuals, whereas wide-angle shots are better at capturing the geography of the area. In the same way, different types of graphs can be used to best represent the spread of the values of a characteristic in a sample. The graph illustrates the pattern of variability, which is known as the variable’s distribution.



There are several ways to demonstrate the pattern of a variable, depending on the type of variable, its scale of measurement, and the range of values. The following examples include the more common types of distributions and illustrate how the pattern of spread can be appreciated by a pictorial display. As we will see, distributions are an important link in the inference process between the data and the conclusion.



TYPES OF DISTRIBUTIONS


When we discussed the types of variables in Chapter 5, we stated that categorical variables may be assigned a number but the number represents a category rather than a true value. Table 8-1 is a data set that contains information on the marital status (a categorical variable) for all Americans age 18 or over. In this example the categories of marital status are listed instead of being assigned a number. The same data are displayed in a different format in Figure 8-1.


TABLE 8-1 Count and Percent of Marital Status of Americans Age 18 and Over

Marital Status    Count (millions)    Percent
Never married     43.9                22.9
Married           116.7               60.9
Widowed           13.4                7.0
Divorced          17.6                9.2

Data from Moore, D. S. and G. P. McCabe. 1999. Introduction to the practice of statistics, 3rd ed. New York: W. H. Freeman and Co., p. 6.
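
The percent column in Table 8-1 follows from the counts: each count is divided by the total for all categories. The short Python sketch below recomputes the percentages and prints a rough text version of a bar graph. The counts are copied from the table; the variable and function names are illustrative choices, not anything from the source.

```python
# Counts (millions) copied from Table 8-1; the names here are illustrative.
counts = {
    "Never married": 43.9,
    "Married": 116.7,
    "Widowed": 13.4,
    "Divorced": 17.6,
}


def percent_by_category(counts):
    """Return each category's share of the total, as a percentage."""
    total = sum(counts.values())
    return {category: 100 * n / total for category, n in counts.items()}


percents = percent_by_category(counts)

for category, pct in percents.items():
    # One '#' per 2 percentage points gives a crude text bar graph.
    bar = "#" * round(pct / 2)
    print(f"{category:<14}{pct:5.1f}%  {bar}")
```

Because the categories are mutually exclusive, the computed percentages add up to 100, which is what lets a pie chart divide a single circle among them.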


FIGURE 8-1 Pie chart of the same data as in Table 8-1. The human eye immediately compares the area contained within the different slices. The categories need to be mutually exclusive unless there is a slice devoted to a combination of the categories.


(From Moore, D. S. and G. P. McCabe. 1999. Introduction to the practice of statistics, 3rd ed. New York: W. H. Freeman and Co., p. 7, with permission.)


Another way to illustrate the distribution of these variables is with a bar graph, as in Figure 8-2. The height of the bar allows easy comparison among the different values of the characteristic. Keep in mind that the ordinate (y-axis) represents the frequency of the value. The higher the bar, the more members of the data set that have that value. When the bars represent ranges of a numerical variable rather than categories, this format is known as a histogram.




Variables measured on an ordinal scale can also be displayed in a bar graph, as in Figure 8-3. Recall that these variables have numerical values that indicate their relative place within the group, and that the mathematical differences between the numerical values are not consistent. The observed value for the variable has been replaced by a numerical rank, from lowest to highest. It is easy to see if and where the values tend to cluster.



Recall that interval variables have values that represent a continuous scale. These can be displayed most effectively as a graph that uses the interval scale as its abscissa (x-axis). The values increase along the x-axis in a continuous fashion. Just as with the bar graph, the height of the curve represents the frequency with which we observe that value. This type of graph is called a frequency distribution. It is a crucial component of inferential statistics, because we will use it to plot probabilities, which are the basis of inferential statistics.



A histogram is a type of frequency distribution. The height of each bar represents the relative frequency at which each value occurs. When the bars are placed close together, a pattern emerges. We see in Figure 8-4 that as the values along the abscissa are divided into smaller units, the tops of the bars can be connected to form a smooth line.
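
The effect of dividing the abscissa into smaller units can be tried out with a small Python sketch. The blood pressure values below are hypothetical, invented only for illustration, and the function names are likewise assumptions. The sketch counts how many values fall into each bin and prints the counts as rows of marks, first with wide bins and then with narrower ones.

```python
from collections import Counter

# Hypothetical systolic blood pressures (mmHg), invented for illustration.
pressures = [104, 108, 111, 113, 115, 116, 118, 118, 119, 120,
             120, 121, 122, 122, 124, 125, 127, 129, 133, 138]


def bin_counts(values, width):
    """Count how many values fall in each bin [start, start + width)."""
    counts = Counter((v // width) * width for v in values)
    # Bins that contain no values are simply absent from the result.
    return dict(sorted(counts.items()))


def print_histogram(values, width):
    """Print one row of '#' marks per non-empty bin (integer data assumed)."""
    for start, n in bin_counts(values, width).items():
        print(f"{start:>3}-{start + width - 1:<3} {'#' * n}")


print_histogram(pressures, 10)  # wide bins: coarse outline
print()
print_histogram(pressures, 5)   # narrower bins: finer outline
```

With the narrower bins, the row lengths rise and fall more gradually, which is the same pattern the text describes when the tops of the bars are joined into a line.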



A stem and leaf plot (also called a stemplot) is a creative way of displaying not only the relative frequency of a value, but the individual values as well. It also allows a side-by-side comparison of the distributions of a variable in two groups. This type of plot takes raw data and arranges them into a continuum based on the relative value of each measurement (lowest to highest). The stem holds what are called the leading digits, the higher-order digits shared by many of the values, such as the 100s and 10s. The 1s are plotted individually next to their stem to form the leaves. In Table 8-2, a comparison was made between students who attend class regularly versus those who do not. The total points for each individual in a psychology course are plotted on a stem and leaf diagram. On the right are the regular attendees and on the left are the truants.



The stem is the center of the graph. The first stem of 18 actually represents 180. The leaves represent the 1s. The first entry off to the left side is 8. When tacked on to the stem of 18, it represents a single entry of 188—the lowest total points anyone earned. This person happened to be in the truant group. There are two individuals who earned 195 points. These are plotted next to each other. You can see that they were both in the truant group. The lowest number of points in the attendee group was 241 points; the highest was 328, which was the highest overall.
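
The construction just described can be sketched in Python as follows. Only the scores 188, 195, 195, 241, and 328 come from the text; every other score is hypothetical filler, so the output is not a reproduction of Table 8-2. The function names are also illustrative assumptions.

```python
from collections import defaultdict

# 188, 195, 195 (truants) and 241, 328 (attendees) come from the text;
# all other scores are hypothetical and are NOT the data in Table 8-2.
truants = [188, 195, 195, 207, 213, 226, 244]
attendees = [241, 256, 263, 278, 291, 305, 328]


def leaves_by_stem(scores):
    """Map each stem (score // 10) to its sorted leaves (score % 10)."""
    groups = defaultdict(list)
    for score in sorted(scores):
        groups[score // 10].append(score % 10)
    return groups


def back_to_back(left, right):
    """Return the lines of a back-to-back stem and leaf plot."""
    left_groups = leaves_by_stem(left)
    right_groups = leaves_by_stem(right)
    lowest = min(min(left), min(right)) // 10
    highest = max(max(left), max(right)) // 10
    lines = []
    for stem in range(lowest, highest + 1):
        # Left leaves are reversed so the smallest leaf sits next to the stem.
        left_leaves = "".join(str(d) for d in reversed(left_groups.get(stem, [])))
        right_leaves = "".join(str(d) for d in right_groups.get(stem, []))
        lines.append(f"{left_leaves:>6} | {stem:>2} | {right_leaves}")
    return lines


plot = back_to_back(truants, attendees)
print("\n".join(plot))
```

Every stem between the lowest and highest score gets its own row, even when it has no leaves, so that gaps in the distribution remain visible.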


