Why Do We Sample?
In statistics, a population represents the entire group of individuals in whom we are interested. Generally it is costly and labour-intensive to study the entire population and, in some cases, may be impossible because the population may be hypothetical (e.g. patients who may receive a treatment in the future). Therefore we collect data on a sample of individuals who we believe are representative of this population (i.e. they have similar characteristics to the individuals in the population), and use them to draw conclusions (i.e. make inferences) about the population.
When we take a sample of the population, we have to recognize that the information in the sample may not fully reflect what is true in the population. We have introduced sampling error by studying only some of the population. In this chapter we show how to use theoretical probability distributions (Chapters 7 and 8) to quantify this error.
Obtaining a Representative Sample
Ideally, we aim for a random sample. A list of all individuals from the population is drawn up (the sampling frame), and individuals are selected randomly from this list, i.e. every possible sample of a given size from the population has an equal probability of being chosen. Sometimes, we may have difficulty in constructing this list or the costs involved may be prohibitive, and then we take a convenience sample. For example, when studying patients with a particular clinical condition, we may choose a single hospital, and investigate some or all of the patients with the condition in that hospital. Very occasionally, non-random schemes, such as quota sampling or systematic sampling, may be used. Although the statistical tests described in this book assume that individuals are selected for the sample randomly, the methods are generally reasonable as long as the sample is representative of the population.
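As an illustration of selecting a random sample from a sampling frame, the following is a minimal Python sketch. The sampling frame of 1000 patient identifiers, the sample size of 50 and the function name simple_random_sample are all hypothetical and chosen only for the example.

```python
import random


def simple_random_sample(sampling_frame, n, seed=None):
    """Select n distinct individuals from the sampling frame.

    random.Random.sample draws without replacement, so every possible
    sample of size n has the same probability of being chosen.
    """
    rng = random.Random(seed)  # a seed makes the selection reproducible
    return rng.sample(sampling_frame, n)


# Hypothetical sampling frame: identifiers for 1000 patients.
sampling_frame = list(range(1, 1001))

# Select 50 patients at random from the list.
sample = simple_random_sample(sampling_frame, n=50, seed=1)
print(sorted(sample))
```

A convenience sample, by contrast, would correspond to taking whichever identifiers happen to be easiest to obtain (for example, the first 50 in the list), which carries no guarantee of being representative.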
Point Estimates
We are often interested in the value of a parameter in the population (Chapter 7), such as a mean or a proportion. Parameters are usually denoted by letters of the Greek alphabet. For example, we usually refer to the population mean as μ and the population standard deviation as σ. We estimate the value of the parameter using the data collected from the sample. This estimate is referred to as the sample statistic and is a point estimate of the parameter (i.e. it takes a single value) as opposed to an interval estimate (Chapter 11) which takes a range of values.
Sampling Variation
If we were to take repeated samples of the same size from a population, it is unlikely that the estimates of the population parameter would be exactly the same in each sample. However, our estimates should all be close to the true value of the parameter in the population, and the estimates themselves should be similar to each other. By quantifying the variability of these estimates, we obtain information on the precision of our estimate and can thereby assess the sampling error. In reality, we usually only take one sample from the population. However, we still make use of our knowledge of the theoretical distribution of sample estimates to draw inferences about the population parameter.
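The idea of repeated sampling can be explored by simulation. The sketch below is illustrative only: the population (10 000 values drawn from a skewed exponential distribution with a mean of about 5), the sample size of 40 and the 2000 repetitions are all assumptions made for the example, as is the function name repeated_sample_means.

```python
import random
import statistics


def repeated_sample_means(population, n, n_samples, seed=None):
    """Take n_samples random samples of size n and return each sample mean."""
    rng = random.Random(seed)
    return [statistics.mean(rng.sample(population, n)) for _ in range(n_samples)]


# Hypothetical population: 10 000 skewed values with a mean of about 5.
rng = random.Random(0)
population = [rng.expovariate(1 / 5) for _ in range(10_000)]

# Repeatedly sample 40 individuals and estimate the mean each time.
means = repeated_sample_means(population, n=40, n_samples=2_000, seed=1)

true_mean = statistics.mean(population)         # the population parameter
mean_of_estimates = statistics.mean(means)      # centre of the estimates
spread_of_estimates = statistics.stdev(means)   # variability of the estimates

print(f"Population mean:       {true_mean:.2f}")
print(f"Mean of the estimates: {mean_of_estimates:.2f}")
print(f"SD of the estimates:   {spread_of_estimates:.2f}")
```

In a run of this kind, the individual estimates differ from sample to sample but cluster around the population mean, and the standard deviation of the estimates summarises how precise a single estimate is likely to be.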
Sampling Distribution of the Mean
Suppose we are interested in estimating the population mean; we could take many repeated samples of size n from the population, and estimate the mean in each sample. A histogram of these estimated means would show their distribution (Fig. 10.1); this is the sampling distribution of the mean. We can show that:
- If the sample size is reasonably large, the estimates of the mean follow an approximately Normal distribution, whatever the distribution of the original data in the population (this follows from a theorem known as the Central Limit Theorem).
- If the sample size is small, the estimates of the mean follow a Normal distribution provided the data in the population follow a Normal distribution.
- The mean of the estimates is an unbiased estimate of the true mean in the population, i.e. the mean of the sampling distribution equals the true population mean.
- The variability of the distribution is measured by the standard deviation of the estimates; this is known as the standard error of the mean (often denoted by SEM). If we know the population standard deviation (σ), then the standard error of the mean is given by SEM = σ/√n.