NATURE OF DATA AND STATISTICS
CHAPTER THE FIRST
The Basics
In this chapter, we will introduce you to the concepts of variables and to the different types of data: nominal, ordinal, interval, and ratio.
STATISTICS: SO WHO NEEDS IT?
The first question most beginning students of statistics ask is, “Why do we need it?” Leaving aside the unworthy answer that it is required for you to get your degree, we have to address the issue of how learning the arcane methods and jargon of this field will make you a better person and leave you feeling fulfilled in ways that were previously unimaginable. The reason is that the world is full of variation, and sometimes it’s hard to tell real differences from natural variation. Statistics wouldn’t be needed if everybody in the world were exactly like everyone else;1 if you were male, 172 cm tall, had brown eyes and hair, and were incredibly good looking,2 this description would fit every other person.3 Similarly, if there were no differences and we knew your life expectancy, or whether or not a new drug was effective in eliminating your dandruff, or which political party you’d vote for in the next election (assuming that the parties finally gave you a meaningful choice, which is doubtful), then we would know this for all people.
Fortunately, this is not the case; people are different in all of these areas, as well as in thousands of other ways. The downside of all this variability is that it makes it more difficult to determine how a person will respond to some newfangled treatment regimen or react in some situation. We can’t look in the mirror, ask ourselves, “Self, how do you feel about the newest brand of toothpaste?” and assume everyone will feel the same way.
DESCRIPTIVE AND INFERENTIAL STATISTICS
It is because of this variability among people, and even within any one person from one time to another, that statistics were born. As we hope to show as you wade through this tome, statistics allow us to describe the “average” person, to see how well that description fits or doesn’t fit other people, and to see how much we can generalize our findings from studying a few people4 to the population as a whole. So statistics can be used in two ways: to describe data, and to make inferences from them.
Descriptive statistics are concerned with the presentation, organization, and summarization of data.
The realm of descriptive statistics, which we cover in this section, includes various methods of organizing and graphing the data to get an idea of what they show. Descriptive statistics also include various indices that summarize the data with just a few key numbers.
The bulk of the book is devoted to inferential stats.
Inferential statistics allow us to generalize from our sample of data to a larger group of subjects.
For instance, when a dermatologist gives a new cream, attar of eggplant, to 20 adolescents whose chances for true love have been jeopardized by acne, and compares them with 20 adolescents who remain untreated (and presumably unloved), he is not interested in just those 40 kids. He wants to know whether all kids with acne will respond to this treatment. Thus he is trying to make an inference about a larger group of subjects from the small group he is studying. We’ll get into the basics of inferential statistics in Chapter 6; for now, let’s continue with some more definitions.
VARIABLES
In the first few paragraphs, we mentioned a number of ways that people differ: gender,5 age, height, hair and eye color, political preference, responsiveness to treatment, and life expectancy. In the statistical parlance you’ll be learning, these factors are referred to as variables.
A variable is simply what is being observed or measured.
Variables come in two flavors: independent and dependent. The easiest way to start to think of them is in an experiment, so let’s return to those acned adolescents. We want to see if the degree of acne depends on whether or not the kids got attar of eggplant. The outcome (acne) is the dependent variable, which we hope will change in response to treatment. What we’ve manipulated is the treatment (attar of eggplant), and this is our independent variable.
The dependent variable is the outcome of interest, which should change in response to some intervention.
The independent variable is the intervention, or what is being manipulated.6
Sounds straightforward, doesn’t it? That’s a dead giveaway that it’s too simple. Once we get out of the realm of experiments, the distinction between dependent and independent variables gets a bit hairier. For instance, if we wanted to look at the growth of vocabulary as a kid grows up, the number of different words would be the dependent variable and age the independent one. That is, we’re saying that vocabulary is dependent on age, even though it isn’t an intervention and we’re not manipulating it. So, more generally, if one variable changes in response to another, we say that the dependent variable is the one that changes in response to the independent variable.
Both dependent and independent variables can take one of a number of specific values: for gender, this is usually limited to either male or female; hair color can be brown, black, blonde, red, gray, artificial, or missing; and a variable such as height can range from about 25 to 40 cm for premature infants to about 200 cm for basketball players and coauthors of statistics books.
TYPES OF DATA
Discrete versus Continuous Data
Although we referred to both gender and height as variables, it’s obvious that they are different from one another with respect to the type and number of values they can assume. One way to differentiate between types of variables is to decide whether the values are discrete or continuous.
Discrete variables can have only one of a limited set of values. Using our previous examples, this would include variables such as gender, hair and eye color, political preference, and which treatment a person received. Another example of a discrete variable is a number total, such as how many times a person has been admitted to hospital; the number of decayed, missing, or filled teeth; and the number of children. Despite what the demographers tell us, it’s impossible to have 2.13 children; kids come in discrete quantities.
Discrete data have values that can assume only whole numbers.
The situation is different for continuous variables. It may seem at first that something such as height, for example, is measured in discrete units: someone is 172 cm tall; a person slightly taller would be 173 cm, and a somewhat shorter person would measure in at 171 cm. In fact, though, the limitation is imposed by our measuring stick. If we used one with finer gradations, we may be able to measure in 1/2 cm increments. Indeed, we could get really silly about the whole affair and use a laser to measure the person’s height to the nearest thousandth of a millimeter. The point is that height, like weight, blood pressure, serum rhubarb, time, and many other variables, is really continuous, and the divisions we make are arbitrary to meet our measurement needs. The measurement, though, is artificial; if two people appear to have the same blood pressure when measured to the nearest millimeter of mercury, they will likely be different if we could measure to the nearest tenth of a millimeter. If they’re still the same, we can measure with even finer gradations until a difference finally appears.
Continuous data may take any value, within a defined range.
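To make the measuring-stick argument concrete, here is a minimal Python sketch. The two "true" heights are made-up values, and the function names are our own inventions for illustration. It shows two people who look identical on a coarse ruler and only come apart once the gradations are fine enough.

```python
# Made-up "true" heights in cm for two people; both values are hypothetical.
HEIGHT_A = 172.3841
HEIGHT_B = 172.4127

def reading(height_cm, step_cm):
    """Index of the ruler mark nearest the height, for marks every step_cm cm."""
    return round(height_cm / step_cm)

def same_reading(a_cm, b_cm, step_cm):
    """True if both heights land on the same mark of the ruler."""
    return reading(a_cm, step_cm) == reading(b_cm, step_cm)

# Try progressively finer rulers: 1 cm, 1/2 cm, 1 mm, and a tenth of a mm.
for step_cm in (1.0, 0.5, 0.1, 0.01):
    tied = same_reading(HEIGHT_A, HEIGHT_B, step_cm)
    print(f"ruler marked every {step_cm} cm: same reading? {tied}")
```

The underlying heights never change; only the ruler does. With these particular numbers the tie persists down to the millimeter and breaks at a tenth of a millimeter.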
We can illustrate this difference between discrete and continuous variables with two other examples. A piano is a “discrete” instrument. It has only 88 keys, and those of us who struggled long and hard to murder Paganini learnt that A-sharp was the same note as B-flat. Violinists (“fiddlers” to y’all south of the Mason-Dixon line), though, play a “continuous” instrument and are able to make a fine distinction between these two notes. Similarly, really cheap digital watches display only 4 digits and cut time into 1-minute chunks. Razzle-dazzle watches, in addition to storing telephone numbers and your bank balance, cut time into 1/100-second intervals. A physicist can do even better, dividing each second into 9,192,631,770 oscillations of a cesium atom. Even this, though, is only an arbitrary division. Only the hospital administrator, able to buy a Patek Philippe analogue chronometer, sees time as it actually is: as a smooth, unbroken progression.7
Many of the statistical techniques you’ll be learning about don’t really care if the data are discrete or continuous; after all, a number to them is just a number. There are instances, though, when the distinction is important. Rest assured that we will point these out to you at the appropriate times.
Nominal, Ordinal, Interval, and Ratio Data
We can think about different types of variables in another way. A variable such as gender can take only two values: male and female. One value isn’t “higher” or “better” than the other;8 we can list them by putting male first or female first without losing any information. This is called a nominal variable.
A nominal variable consists of named categories, with no implied order among the categories.
The simplest nominal categories are what Feinstein (1977) calls “existential” variables: a property either exists or it doesn’t exist. A person has cancer of the liver or doesn’t have it; someone has received the new treatment or didn’t receive it; and, most existential of all, the subject is either alive or dead. Nominal variables don’t have to be dichotomous; they can have any number of categories. We can classify a person’s marital status as Single/Married/Separated/Widowed/Divorced/Common-Law (six categories); her eye color into Black/Brown/Blue/Green/Mixed (five categories9); and her medical problem into one of a few hundred diagnostic categories. The important point is that you can’t say brown eyes are “better” or “worse” than blue. The ordering is arbitrary, and no information is gained or lost by changing the order.
Because computers handle numbers far more easily than they do letters, researchers commonly code nominal data by assigning a number to each value: Female could be coded as 1 and Male as 2; or Single = 1, Married = 2, and so on. In these cases, the numerals are really no more than alternative names, and they should not be thought of as having any quantitative value. Again, we can change the coding by letting Male = 1 and Female = 2, and the conclusions we draw will be identical (assuming, of course, that we remember which way we coded the data).10
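The claim that the numerals are only alternative names can be checked with a short Python sketch. The five-person sample and the helper name `count_by_label` are made up for illustration. Counting people gives the same answer under either coding, whereas averaging the codes gives a number that depends entirely on which coding we happened to pick, which is a strong hint that the average means nothing here.

```python
from collections import Counter

# A made-up sample of a nominal variable.
sample = ["Female", "Male", "Female", "Female", "Male"]

# The two codings mentioned in the text.
coding_a = {"Female": 1, "Male": 2}
coding_b = {"Male": 1, "Female": 2}

def count_by_label(values, coding):
    """Code the values as numbers, count the codes, then decode back to labels."""
    decode = {code: label for label, code in coding.items()}
    coded = [coding[v] for v in values]
    return {decode[code]: n for code, n in Counter(coded).items()}

counts_a = count_by_label(sample, coding_a)
counts_b = count_by_label(sample, coding_b)
print(counts_a == counts_b)  # the counts do not care which coding was used

# Treating the codes as quantities is where things go wrong.
mean_a = sum(coding_a[v] for v in sample) / len(sample)
mean_b = sum(coding_b[v] for v in sample) / len(sample)
print(mean_a, mean_b)  # two different "averages" for the same five people
```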
A student evaluation rating consisting of Excellent/Satisfactory/Unsatisfactory has three categories. It differs from a variable such as hair color in that there is an ordering of these values: “Excellent” is better than “Satisfactory,” which in turn is better than “Unsatisfactory.” However, the difference in performance between “Excellent” and “Satisfactory” cannot be assumed to be the same difference as exists between “Satisfactory” and “Unsatisfactory.” This is seen more clearly with letter grades; there is only a small division between a B+ and a B, but a large one, amounting to a ruined summer, between a D- and an F+. This is like the results of a horse race; we know that the horse who won ran faster than the horse who placed, and the one who showed came in third. But there could have been only a 1-second difference between the first two horses, with the third trailing by 10 seconds. So letter grades and the order of finishing a race are called ordinal variables.
An ordinal variable consists of ordered categories, where the differences between categories cannot be considered to be equal.
Many of the variables encountered in the health care field are ordinal in nature. Patients are often rated as Much improved/Somewhat improved/Same/Worse/Dead; or Emergent/Urgent/Elective.11 Sometimes numbers are used, as in Stage I through Stage IV cancer. Don’t be deceived by this use of numbers; it’s still an ordinal scale, with the numbers (Roman, this time, to add a bit of class) really representing nothing more than ordered categories. Use the difference test: Is the difference in disease severity between Stage I and Stage II cancer the same as exists between Stages II and III or between III and IV? If the answer is No, the scale is ordinal.
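One way to see what "ordered categories and nothing more" means is the following Python sketch, which uses the student evaluation ratings from earlier. The sample of ratings and both sets of code numbers are made up for illustration. Any coding that keeps the categories in the same order sorts the data identically, so the sizes of the gaps between the codes carry no information.

```python
# Two made-up codings of the same ordered categories, worst to best.
# Both keep the order; they disagree wildly about the spacing.
evenly_spaced = {"Unsatisfactory": 1, "Satisfactory": 2, "Excellent": 3}
oddly_spaced = {"Unsatisfactory": 1, "Satisfactory": 5, "Excellent": 100}

# A made-up set of ratings.
ratings = ["Satisfactory", "Excellent", "Unsatisfactory", "Satisfactory"]

sorted_even = sorted(ratings, key=lambda r: evenly_spaced[r])
sorted_odd = sorted(ratings, key=lambda r: oddly_spaced[r])
print(sorted_even == sorted_odd)  # same ordering under both codings

# The "distance" from Satisfactory to Excellent depends on the coding chosen,
# so it is not a property of the data.
gap_even = evenly_spaced["Excellent"] - evenly_spaced["Satisfactory"]
gap_odd = oddly_spaced["Excellent"] - oddly_spaced["Satisfactory"]
print(gap_even, gap_odd)
```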
If the distance between values is constant, we’ve graduated to what is called an interval variable.
An interval variable has equal distances between values, but the zero point is arbitrary.
Why did we add that tag on the end, “the zero point is arbitrary,” and what does it mean? We added it because, as we’ll see, it puts a limitation on the types of statements we can make about interval variables. What the phrase means is that the zero point isn’t meaningful and therefore can be changed. To illustrate this, let’s contrast intelligence, measured by some IQ test, with something such as weight, where the zero is meaningful. We all know what zero weight is.12 We can’t suddenly decide that from now on, we’ll subtract 10 kilos from everything we weigh and say that something that previously weighed 11 kilos now weighs 1 kilo. It’s more than a matter of semantics; if something weighed 5 kilos before, we would have to say it weighed -5 kilos after the conversion—an obvious impossibility.
An intelligence score is a different matter. We say that the average IQ is 100, but that’s only by convention. The next world conference of IQ experts can just as arbitrarily decide that from now on, we’ll make the average 500, simply by adding 400 to all scores. We haven’t gained anything, but by the same token, we haven’t lost anything; the only necessary change is that we now have to readjust our previously learned standards of what is average.
Now let’s see what the implications of this are. Because the intervals are equal, the difference between an IQ of 70 and an IQ of 80 is the same as the difference between 120 and 130. However, an IQ of 100 is not twice as high as an IQ of 50. The point is that if the zero point is artificial and moveable, then the differences between numbers are meaningful, but the ratios between them are not.
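The arithmetic behind this can be laid out in a few lines of Python, using the conference's hypothetical decision to add 400 to every score. The variable names are ours. Moving the zero point leaves every difference untouched but changes every ratio.

```python
OFFSET = 400  # the shift from the example in the text: the average moves from 100 to 500

# Differences survive the shift.
diff_low_before = 80 - 70
diff_high_before = 130 - 120
diff_low_after = (80 + OFFSET) - (70 + OFFSET)
diff_high_after = (130 + OFFSET) - (120 + OFFSET)
print(diff_low_before, diff_high_before, diff_low_after, diff_high_after)

# Ratios do not survive the shift.
ratio_before = 100 / 50
ratio_after = (100 + OFFSET) / (50 + OFFSET)
print(ratio_before, ratio_after)
```

Before the shift, an IQ of 100 looks like "twice" an IQ of 50; afterward the same two people score 500 and 450, and the apparent ratio has shrunk to about 1.11. Since nothing about the people changed, the ratio was never telling us anything.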
If the zero point is meaningful, then the ratios between numbers are also meaningful, and we are dealing with (not surprisingly) a ratio variable.
A ratio variable has equal intervals between values and a meaningful zero point.
Most laboratory test values are ratio variables, as are physical characteristics such as height and weight. A person who weighs 100 kilos is twice as heavy as a person weighing 50 kilos; even when we convert kilos to pounds, the ratio stays the same: 220 pounds to 110 pounds.
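A quick Python check of the kilos-to-pounds example follows. It assumes the rounded conversion factor of 2.2 pounds per kilo implied by the figures in the text, and the variable names are ours. Changing units multiplies every value by the same constant and leaves zero where it was, so the ratio comes through unchanged.

```python
import math

LB_PER_KG = 2.2  # rounded factor, assumed from the 220 and 110 pounds in the text

heavy_kg, light_kg = 100, 50
heavy_lb = heavy_kg * LB_PER_KG
light_lb = light_kg * LB_PER_KG

ratio_kg = heavy_kg / light_kg
ratio_lb = heavy_lb / light_lb
print(ratio_kg, ratio_lb)
print(math.isclose(ratio_kg, ratio_lb))  # same ratio in either unit
```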
That’s about enough for the difference between interval and ratio data. The fact of the matter is that, from the viewpoint of a statistician, they can be treated and analyzed the same way.
Notice that each step up the hierarchy from ordinal data to ratio data takes the assumptions of the step below it and then adds another restriction:13
Variable type | Assumptions |
Nominal | Named categories. |
Ordinal | Same as nominal plus ordered categories. |
Interval | Same as ordinal plus equal intervals. |
Ratio | Same as interval plus meaningful zero. |
Although the distinctions among nominal, ordinal, interval, and ratio data appear straightforward on paper, the lines between them occasionally get a bit fuzzy. For example, as we’ve said, intelligence is measured in IQ units, with the average person having an IQ of 100. Strictly speaking, we have no assurance that the difference between an IQ of 80 and one of 100 means the same as the difference between 120 and 140; that is, IQ most likely is an ordinal variable. In the real world outside of textbooks, though, most people treat IQ and many other such variables as if they were interval variables. As far as we know, they have not been arrested for doing so, nor has the sky fallen on their heads.
Despite this, the distinctions among nominal, ordinal, interval, and ratio are important to keep in mind because they dictate to some degree the types of statistical tests we can use with them. As we’ll see in the later chapters, certain types of graphs and what are called “parametric tests” can be used with interval and ratio data but not with nominal or ordinal data. By contrast, if you have nominal or ordinal data, you are, strictly speaking, restricted to “nonparametric” statistics. We’ll get into what these obscure terms mean later in the book.
PROPORTIONS AND RATES
So far, our discussion of types of numbers has dealt with single numbers—blood pressure, course grade, or counts. Sometimes, though, we deal with fractions. Even though this is stuff we learned in grade school, there’s still some confusion, owing, at least in part, to the sloppy English used by some statisticians. But, being purists, we’ll try to clear the air.
A proportion is a type of fraction in which the numerator is a subset of the denominator. That is, when we write 1/3, we mean that there are three objects, and we’re talking about one of them. Percentages are a form of proportions, where the denominator is jigged to equal 100. This may seem so elementary that you may wonder why we bother to mention it. There are two reasons. First, we’ll later encounter other fractions (e.g., odds) where the numerator is not part of the denominator; and second (here’s where statisticians often screw up), people sometimes call a proportion a “rate.”
But, strictly speaking, a rate is a fraction that also has a time component. If we say that 23% of children have blue eyes (a figure we just made up on the spot), that’s a proportion. But, if we say that 1 out of every 1,000 people will develop photonumerophobia this year, that’s a rate, because we’re specifying a time interval.
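The distinction can be sketched in Python as follows. The function names are our own, the counts are made up to match the figures in the text, and expressing the rate "per 1,000 people per year" is just one convenient convention, not the only one. The point to notice is that the rate function needs a time argument and the proportion function does not.

```python
import math

def proportion(part, whole):
    """Fraction whose numerator is a subset of its denominator."""
    if not 0 <= part <= whole:
        raise ValueError("the numerator must be a subset of the denominator")
    return part / whole

def percentage(part, whole):
    """A proportion with the denominator jigged to equal 100."""
    return 100 * proportion(part, whole)

def rate_per_1000(new_cases, population, years):
    """New cases per 1,000 people per year; the time span is part of the definition."""
    return 1000 * new_cases / population / years

one_of_three = proportion(1, 3)          # one object out of three
blue_eyed = percentage(23, 100)          # made-up count matching the 23% in the text
phobia_rate = rate_per_1000(1, 1000, 1)  # 1 new case per 1,000 people in one year

print(one_of_three, blue_eyed, phobia_rate)
print(math.isclose(phobia_rate, 1.0))
```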
So, with that as background, on to statistics!
EXERCISES
1. For the following studies, indicate which of the variables are dependent (DVs), independent (IVs), or neither.
a. ASA is compared against placebo to see if it leads to a reduction in coronary events. The IV is ____ The DV is ____
b. The relationship between hypocholesterolemia and cancer. The IV is ____ The DV is ____
c. We know that members of religious groups that ban drugs, alcohol, smoking, meat, and sex (because it may lead to dancing) live longer than the rest of us poor mortals, but is it worth it? How do they compare with us on a test of quality of life? The IV is ____ The DV is ____
d. One study (a real one, this time) found that bus drivers had higher morbidity rates of coronary heart disease than did conductors. The IV is ____ The DV is ____
2. State which of the following variables are discrete and which are continuous.
a. The number of hair-transplant sessions undergone in the past year.
b. The time since the last patient was grateful for what you did.
c. Your anticipated before-taxes income the year after you graduate.
d. Your anticipated after-taxes income in the same year.
e. The amount of weight you’ve put on in the last year.
f. The number of hairs you’ve lost in the same time.
3. Indicate whether the following variables are nominal, ordinal, interval, or ratio.
a. Your income (assuming it’s more than $0).
b. A list of the different specialties in your profession.
c. The ranking of specialties with regard to income.
d. Bo Derek was described as a “10.” What type of variable was the scale?
e. A range of motion in degrees.
f. A score of 13 out of 17 on the Schmedlap Anxiety Scale.
g. Staging of breast cancer as Type I, II, III, or IV.
h. ST depression on the ECG, measured in millimeters.
i. ST depression, measured as “1” = ≤ 1 mm, “2” = 1 to 5 mm, and “3” = > 5 mm.
j. ICD-9 classifications: 0295 = Organic psychosis, 0296 = Depression, and so on.
k. Diastolic blood pressure, in mm Hg.
l. Pain measurement on a seven-point scale.
4. Indicate whether the following are proportions or rates:
a. The increase in the price of household goods last year.
b. The ratio of males to females.
c. The ratio of new cases of breast cancer last month to the total number of women in the population.
d. The ratio of the number of women who have breast cancer to the total number of women in the population.
1 We also wouldn’t need dating services because it would be futile to look for the perfect mate; he or she would be just like the person sitting next to you. By the same token, it would mean the end of extramarital affairs, because what’s the use? But that’s another story.
2 Coincidentally, this perfectly describes the person writing this section.
3 Mind you, if everybody in the world were male (or female), we wouldn’t need statistics (or anything else) in about 70 years.
4 As we’ll see later, “a few” to a statistician can mean over 400,000 people, as in the Salk polio vaccine trial. So much for the scientific use of language.
5 Formerly referred to as “sex.”
6 These are different from the definitions offered by one of our students, who said that, “An undependable variable keeps changing its value, while a dependable variable is always the same.”
7 Actually, the escapement mechanism makes the second hand jump, but if you can afford a Patek, you’ll ignore this.
8 Although male chauvinist pigs and radical feminists would disagree, albeit for opposite reasons.
9 “Bloodshot” is usually only a temporary condition and so is not coded.
10 Other examples of numbers really being nominal variables and not reflecting measured quantities would be telephone numbers, social insurance or social security numbers, credit card numbers, and politicians’ IQs.
11 This is similar to the scheme used to evaluate employees: Walks on water/Keeps head above water under stress/Washes with water/Drinks water/Passes water in emergencies.
12 It’s a state aspired to by “high fashion” models.
13 A good mnemonic for remembering the order of the categories is the French word NOIR. Of course, this assumes you know French. Anglophones will just have to memorize the order.