Rates
In any longitudinal study (Chapter 12) investigating the occurrence of an event (such as death), we should take into account the fact that individuals are usually followed for different lengths of time. This may be because some individuals drop out of the study or because individuals are entered into the study at different times, and therefore follow-up times from different people may vary at the close of the study. As those with a longer follow-up time are more likely to experience the event than those with shorter follow-up, we consider the rate at which the event occurs per person per period of time. Often the unit which represents a convenient period of time is a year (but it could be a minute, day, week, etc.). Then the event rate per person per year (i.e. per person-year of follow-up) is estimated by
$$\text{Rate} = \frac{\text{Number of events occurring}}{\text{Total number of years of follow-up for all individuals}}$$
Each individual’s length of follow-up is usually defined as the time from when he or she enters the study until the time when the event occurs or the study draws to a close if the event does not occur. The total follow-up time is the sum of all the individuals’ follow-up times.
The rate is called an incidence rate when the event is a new case (e.g. of disease) or the mortality rate when the event is death. When the rate is very small, it is often multiplied by a convenience factor such as 1000 and re-expressed as the rate per 1000 person-years of follow-up.
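As a minimal sketch of this calculation, the following Python snippet computes a rate per person-year and per 1000 person-years. The function name `event_rate` and the follow-up times are hypothetical, chosen only for illustration.

```python
def event_rate(n_events, follow_up_years, per=1):
    """Number of events per `per` person-years of follow-up."""
    total_person_years = sum(follow_up_years)
    return per * n_events / total_person_years

# Hypothetical data: five individuals followed for different lengths of time (years)
follow_up_years = [2.0, 3.5, 1.0, 4.0, 4.5]  # 15 person-years in total
n_events = 3

rate = event_rate(n_events, follow_up_years)                     # per person-year
rate_per_1000 = event_rate(n_events, follow_up_years, per=1000)  # per 1000 person-years
print(rate, rate_per_1000)
```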
Features of the Rate
- When calculating the rate, we do not distinguish between person-years of follow-up that occur in the same individual and those that occur in different individuals. For example, the person-years of follow-up contributed by 10 individuals, each of whom is followed for 1 year, will be the same as that contributed by 1 person followed for 10 years.
- Whether we also include multiple events from each individual (i.e. when the event occurs on more than one occasion) depends on the hypothesis of interest. If we are only interested in first events, then follow-up must cease at the point at which an individual experiences his or her first event as the individual is no longer at risk of a first event after this time. Where multiple events from the same individual are included in the calculation of the rate, we have a special form of clustered data (Chapter 41), and appropriate statistical methods must be used (Chapters 41 and 42).
- A rate cannot be calculated in a cross-sectional study (Chapter 12) since this type of study does not involve time.
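The first-event rule described above can be sketched as follows, assuming times are recorded in years since the start of the study. The function `follow_up_to_first_event` and the three records are hypothetical.

```python
def follow_up_to_first_event(entry, event_times, study_end):
    """Return (follow-up time, event indicator) for a first-event analysis.

    Follow-up runs from study entry to the first event, or to the close of
    the study if no event occurs; later events are ignored.
    """
    observed = [t for t in event_times if entry <= t <= study_end]
    if observed:
        return min(observed) - entry, 1
    return study_end - entry, 0

# Hypothetical records: (entry time, list of event times)
people = [(0.0, [2.0, 3.0]), (1.0, []), (0.5, [4.5])]
study_end = 5.0

results = [follow_up_to_first_event(entry, events, study_end)
           for entry, events in people]
total_person_years = sum(years for years, _ in results)
first_events = sum(event for _, event in results)
first_event_rate = first_events / total_person_years
print(results, first_event_rate)
```

The first individual contributes only 2 years of follow-up, because he or she is no longer at risk of a first event after that time.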
Comparing the Rate and the Risk
The risk of an event (Chapter 15) is simply the total number of events divided by the number of individuals included in the study at the start of the investigation, with no allowance for the length of follow-up. As a result, the risk of the event will be greater when individuals are followed for longer, since they will have more opportunity to experience the event. In contrast, the rate of the event should remain relatively stable in these circumstances, as the rate takes account of the duration of follow-up.
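To illustrate the contrast, the sketch below computes the risk and the rate for the same hypothetical cohort when the study closes after 2 years and after 5 years. All individuals are assumed to enter at time zero; the data and the function `risk_and_rate` are invented for illustration.

```python
def risk_and_rate(event_times, n_people, study_close):
    """Risk and rate when everyone enters at time 0 and the study closes at `study_close`."""
    observed = [t for t in event_times if t <= study_close]
    n_event_free = n_people - len(observed)
    # Those with an event are followed until the event; the rest until the study closes
    person_years = sum(observed) + n_event_free * study_close
    risk = len(observed) / n_people
    rate = len(observed) / person_years
    return risk, rate

# Hypothetical cohort of 10 people; five would experience the event at these times (years)
event_times = [0.5, 1.5, 2.5, 3.5, 4.5]

risk_2y, rate_2y = risk_and_rate(event_times, 10, study_close=2.0)
risk_5y, rate_5y = risk_and_rate(event_times, 10, study_close=5.0)
print(risk_2y, rate_2y)  # risk 0.2, rate about 0.11 per person-year
print(risk_5y, rate_5y)  # risk 0.5, rate about 0.13 per person-year
```

In this example the risk more than doubles with the longer follow-up, whereas the rate changes only slightly.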
Relative Rates
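We may be interested in comparing the rate of disease in a group of individuals exposed to some factor of interest ($\text{Rate}_{\text{exposed}}$) with that in a group of individuals not exposed to the factor ($\text{Rate}_{\text{unexposed}}$). Then
$$\text{Relative rate} = \frac{\text{Rate}_{\text{exposed}}}{\text{Rate}_{\text{unexposed}}}$$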
The relative rate (or rate ratio, sometimes referred to as the incidence rate ratio) is interpreted in a similar way to the relative risk (Chapter 15) and to the odds ratio (Chapters 16 and 30); a relative rate of 1 (unity) indicates that the rate of disease is the same in the two groups, a relative rate greater than 1 indicates that the rate is higher in those exposed to the factor than in those who are unexposed, and a relative rate less than one indicates that the rate is lower in the group exposed to the factor.
Although the relative rate is often taken as an estimate of the relative risk, the relative rate and the relative risk will only be similar if the event (e.g. disease) is rare. When the event is not rare and individuals are followed for varying lengths of time, the rate, and therefore the relative rate, will not be affected by the different follow-up times. This is not the case for the relative risk as the risk, and thus the relative risk, will change as individuals are followed for longer periods. Hence, the relative rate is always preferred when follow-up times vary between individuals in the study.
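A relative rate is simply the ratio of two rates, as in this short sketch. The event counts and person-years are hypothetical, as are the function names.

```python
def rate(n_events, person_years):
    """Events per person-year of follow-up."""
    return n_events / person_years

def relative_rate(events_exposed, py_exposed, events_unexposed, py_unexposed):
    """Rate in the exposed group divided by the rate in the unexposed group."""
    return rate(events_exposed, py_exposed) / rate(events_unexposed, py_unexposed)

# Hypothetical data: 30 events in 1000 person-years (exposed),
# 15 events in 1500 person-years (unexposed)
rr = relative_rate(30, 1000.0, 15, 1500.0)
print(round(rr, 2))  # rate is three times higher in the exposed group
```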
Poisson Regression
What Is It?
The Poisson distribution (named after a French mathematician) is a probability distribution (Chapter 8) of the count of the number of rare events that occur randomly over an interval of time (or space) at a constant average rate. This forms the basis of Poisson regression, which is used to analyse the rate of some event (e.g. disease) when individuals have different follow-up times. This contrasts with logistic regression (Chapter 30) which is concerned only with whether or not the event occurs and is used to estimate odds ratios. In Poisson regression, we assume that the rate of the event among individuals with the same explanatory variables (e.g. age and sex) is constant over the whole study period. We generally want to know which explanatory variables influence the rate at which the event occurs, and may wish to compare this rate in different exposure groups and/or predict the rate for groups of individuals with particular characteristics.
The Equation and its Interpretation
The Poisson regression model takes a very similar form to the logistic regression model (Chapter 30), each having a (usually) linear combination of explanatory variables on the right-hand side of the equation. Poisson regression analysis also mirrors logistic regression analysis in that we transform the outcome variable in order to overcome mathematical difficulties. We use the natural log transformation (ln) of the rate and an iterative process (maximum likelihood, Chapter 32) to produce an estimated regression equation from the sample data of the form
$$\ln(r) = a + b_1x_1 + b_2x_2 + \dots + b_kx_k$$
where:
- $x_i$ is the $i$th explanatory variable ($i = 1, 2, 3, \dots, k$);
- $r$ is the estimated value of the mean or expected rate for an individual with a particular set of values for $x_1, \dots, x_k$;
- $a$ is the estimated constant term providing an estimate of the log rate when all $x_i$’s in the equation take the value zero (the log of the baseline rate);
- $b_1, b_2, \dots, b_k$ are the estimated Poisson regression coefficients.
The exponential of a particular coefficient, for example $e^{b_1}$, is the estimated relative rate associated with the relevant variable. For a particular value of $x_1$, it is the estimated rate of disease for $(x_1 + 1)$ relative to the estimated rate of disease for $x_1$, while adjusting for all other $x_i$’s in the equation. If the relative rate is equal to one (unity), then the event rate is unchanged when $x_1$ increases by one unit. A value of the relative rate above one indicates an increased event rate, and a value below one indicates a decreased event rate, as $x_1$ increases by one unit.
As with logistic regression, Poisson regression models are fitted on the log scale. Thus, the effects of the $x_i$’s are multiplicative on the rate of disease.
We can manipulate the Poisson regression equation to estimate the event rate for an individual with a particular combination of values of $x_1, \dots, x_k$. For each set of covariate values for $x_1, \dots, x_k$, we calculate
$$z = a + b_1x_1 + b_2x_2 + \dots + b_kx_k$$
Then, the event rate for that individual is estimated as $e^z$.
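The calculation of an estimated rate from the fitted coefficients (the linear predictor $z$, followed by exponentiation) can be sketched as below. The coefficient values are hypothetical and chosen so that the multiplicative effect of the explanatory variables is easy to see.

```python
import math

def predicted_rate(a, coefficients, x):
    """Estimated event rate exp(z), where z = a + b1*x1 + ... + bk*xk."""
    z = a + sum(b * xi for b, xi in zip(coefficients, x))
    return math.exp(z)

# Hypothetical estimates: baseline rate 0.01 per person-year,
# relative rates of 2.0 and 1.5 for two binary explanatory variables
a = math.log(0.01)
b = [math.log(2.0), math.log(1.5)]

baseline = predicted_rate(a, b, [0, 0])  # baseline rate
both = predicted_rate(a, b, [1, 1])      # baseline rate x 2.0 x 1.5
print(round(baseline, 4), round(both, 4))
```

Because the model is linear on the log scale, the rate for an individual with both characteristics is the baseline rate multiplied by both relative rates.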
Use of an Offset
Although we model the rate at which the event occurs (i.e. the number of events divided by the person-years of follow-up), most statistical packages require the number of events occurring to be specified as the dependent variable rather than the rate itself. The log of each individual’s person-years of follow-up is then included as an offset in the model. Assuming that we are only interested in including a single event per person, the number of events occurring in each individual will either take the value 0 (if the event did not occur) or 1 (if the event did occur). This provides a slightly different formulation of the model which allows the estimates to be generated in a less computationally intensive way. The results from the model, however, are exactly the same as they would be if the rate were modelled.
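The role of the offset can be illustrated with a small, self-contained sketch. It fits a model with one explanatory variable by Newton-Raphson iteration, which is one common way of obtaining maximum likelihood estimates; in practice a statistical package would do this. The function `fit_poisson_offset` and the data are hypothetical.

```python
import math

def fit_poisson_offset(events, person_years, x, n_iter=25):
    """Fit ln(rate) = a + b*x by Newton-Raphson, with ln(person-years) as the offset.

    The dependent variable is the number of events (0 or 1 per individual);
    the expected number of events is exp(ln(T) + a + b*x) = T * exp(a + b*x).
    """
    a = math.log(sum(events) / sum(person_years))  # start from the overall log rate
    b = 0.0
    for _ in range(n_iter):
        mu = [t * math.exp(a + b * xi) for t, xi in zip(person_years, x)]
        # Score vector (first derivatives of the log likelihood)
        u0 = sum(y - m for y, m in zip(events, mu))
        u1 = sum((y - m) * xi for y, m, xi in zip(events, mu, x))
        # Information matrix (2 x 2, symmetric)
        i00 = sum(mu)
        i01 = sum(m * xi for m, xi in zip(mu, x))
        i11 = sum(m * xi * xi for m, xi in zip(mu, x))
        det = i00 * i11 - i01 * i01
        # Newton-Raphson step: add (information matrix)^-1 * score
        a += (i11 * u0 - i01 * u1) / det
        b += (i00 * u1 - i01 * u0) / det
    return a, b

# Hypothetical data: four unexposed (x = 0) and four exposed (x = 1) individuals
events       = [0, 1, 0, 0, 1, 1, 0, 1]
person_years = [5.0, 4.0, 6.0, 5.0, 3.0, 2.0, 4.0, 1.0]
exposed      = [0, 0, 0, 0, 1, 1, 1, 1]

a_hat, b_hat = fit_poisson_offset(events, person_years, exposed)
print(math.exp(a_hat), math.exp(b_hat))  # baseline rate and relative rate
```

With a single binary explanatory variable the estimates can be checked by hand: the baseline rate is 1 event in 20 person-years, and the relative rate is (3/10)/(1/20) = 6.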
Entering Data for Groups
Note that when all of the explanatory variables are categorical, we can simplify the data entry process by making use of the fact that the calculation of the rate does not distinguish between person-years of follow-up that occur in the same individual and those that occur in different individuals. For example, we may be interested in the effect of only two explanatory variables, sex (male or female) and age (<16, 16–20 and 21–25 years), on the rate of some event. Between them, these two variables define six groups (i.e. males aged < 16 years, females aged < 16 years, …, females aged 21–25 years). We can simplify the entry of these data by determining the total number of events for all individuals within the same sex/age group and the total person-years of follow-up for these individuals. The estimated rate in each group is then calculated as the total number of events divided by the person-years of follow-up in that group. Using this approach, rather than entering data for the n individuals one by one, we enter the data for each of the six groups, and do so by creating a model in which the explanatory variables are the binary and dummy variables (Chapter 29) for sex and age. Note that when entering data in this way, it is not possible to accommodate numerical covariates to define the groups or include an additional covariate in the model that takes different values for the individuals in a group.
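The grouping step can be sketched as follows, assuming each individual record holds sex, age group, an event indicator and years of follow-up. The function `aggregate_by_group` and the records are hypothetical.

```python
from collections import defaultdict

def aggregate_by_group(records):
    """Total events, total person-years and rate for each (sex, age group)."""
    totals = defaultdict(lambda: [0, 0.0])
    for sex, age_group, event, years in records:
        totals[(sex, age_group)][0] += event
        totals[(sex, age_group)][1] += years
    return {group: (d, t, d / t) for group, (d, t) in totals.items()}

# Hypothetical individual records: (sex, age group, event indicator, years of follow-up)
records = [
    ("male", "<16", 0, 2.0),
    ("male", "<16", 1, 1.0),
    ("female", "16-20", 0, 3.0),
    ("female", "16-20", 1, 2.0),
    ("female", "16-20", 0, 5.0),
]

grouped = aggregate_by_group(records)
for group, (n_events, person_years, group_rate) in grouped.items():
    print(group, n_events, person_years, round(group_rate, 3))
```

Each group then supplies one row of data: the total number of events as the dependent variable and the log of the total person-years as the offset.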
Incorporating Variables That Change Over Time
By splitting the follow-up period into shorter intervals, it is possible to incorporate variables that change over time into the model. For example, we may be interested in relating the smoking history of middle-aged men to the rate at which they experience lung cancer. Over a long follow-up period, many of these men may give up smoking and their rates of lung cancer may be lowered as a result. Thus, categorizing men according to their smoking status at the start of the study may give a poor representation of the impact of smoking status on lung cancer. Instead, we split each man’s follow-up into short time intervals in such a way that his smoking status remains constant in each interval. We then perform a Poisson regression analysis, treating the relevant information in each short time interval for each man (i.e. the occurrence/non-occurrence of the event, his follow-up time and smoking status) as if it came from a different man.
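A minimal sketch of this splitting is given below. It assumes that each man's smoking status is recorded as a list of (time, status) pairs and that any event occurs at the end of his follow-up; the function `split_follow_up` and the example are hypothetical.

```python
def split_follow_up(start, end, event, status_changes):
    """Split one man's follow-up into intervals of constant smoking status.

    status_changes: list of (time, status) pairs sorted by time; the first
    pair gives the status at `start`.
    Returns rows of (interval start, interval end, status, event indicator).
    """
    rows = []
    for i, (t, status) in enumerate(status_changes):
        seg_start = max(t, start)
        if i + 1 < len(status_changes):
            seg_end = min(status_changes[i + 1][0], end)
        else:
            seg_end = end
        if seg_end <= seg_start:
            continue  # no follow-up time under this status
        # The event, if any, is assigned to the interval in which follow-up ends
        event_here = event if seg_end == end else 0
        rows.append((seg_start, seg_end, status, event_here))
    return rows

# Hypothetical man: followed for 10 years, gives up smoking at year 4,
# develops lung cancer at year 10
rows = split_follow_up(0.0, 10.0, 1, [(0.0, "smoker"), (4.0, "ex-smoker")])
for row in rows:
    print(row)
```

Each row is then entered into the Poisson regression as if it came from a different man, with follow-up time equal to the length of the interval.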
Computer Output
Comprehensive computer output for a Poisson regression analysis includes, for each explanatory variable, the estimated Poisson regression coefficient with standard error, the estimated relative rate (i.e. the exponential of the coefficient) with a confidence interval for its true value, and a Wald test statistic (testing the null hypothesis that the regression coefficient is zero or, equivalently, that the relative rate of ‘disease’ associated with this variable is unity) and associated P-value. As with the output from logistic regression (Chapter 30), we can assess the adequacy of the model using −2log likelihood (LRS or deviance) and the model Chi-square or the Chi-square for covariates (see also Chapter 32).
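The quantities in such output can be reproduced from a coefficient and its standard error, as in the sketch below. It assumes an approximate 95% confidence interval based on the Normal distribution and expresses the Wald statistic as z = b/SE; some packages report z squared, referred to a Chi-squared distribution with one degree of freedom, which gives the same P-value. The values of `b` and `se` are hypothetical.

```python
import math

def wald_summary(b, se, z_crit=1.96):
    """Relative rate, approximate 95% CI, Wald z statistic and two-sided P-value."""
    relative_rate = math.exp(b)
    ci = (math.exp(b - z_crit * se), math.exp(b + z_crit * se))
    z = b / se
    # Two-sided P-value from the standard Normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return relative_rate, ci, z, p_value

# Hypothetical estimate: coefficient ln(2) with standard error 0.25
rr, (lower, upper), z, p = wald_summary(math.log(2.0), 0.25)
print(round(rr, 2), round(lower, 2), round(upper, 2), round(z, 2), round(p, 4))
```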
Extra-Poisson Variation
One concern when fitting a Poisson regression model is the possibility of extra-Poisson variation, which usually implies overdispersion. This occurs when the residual variance is greater than would be expected from a Poisson model, perhaps because an outlier is present (Chapter 3), because an important explanatory variable has not been included in the model, or because the data are clustered (Chapters 41 and 42) and the clustering has not adequately been taken into account. Then the standard errors are usually underestimated and, consequently, the confidence intervals for the parameters are too narrow and the P-values too small. A way to investigate the possibility of extra-Poisson variation is to divide −2log likelihood (LRS or deviance) by the degrees of freedom, n − k − 1, where n is the number of individuals in the data set and k is the number of explanatory variables in the model. This quotient should be approximately equal to 1 if there is no extra-Poisson variation; values substantially above 1 may indicate overdispersion. If there is overdispersion, then it is possible to use the scale parameter (which is usually assumed to equal 1 when there is no extra-Poisson variation) to fit a Poisson regression model that is appropriate for overdispersed data. Alternatively, it may be advisable to fit a regression model based on the negative Binomial distribution (another type of probability distribution that can be used for counts) instead of the Poisson distribution. Underdispersion, where the residual variance is less than would be expected from a Poisson model and where the ratio of −2log likelihood to n − k − 1 is substantially less than 1, may also occur (e.g. if high counts cannot be recorded accurately). Underdispersion and overdispersion may also be a concern when performing logistic regression (Chapter 30), when they are referred to as extra-Binomial variation.
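The check described above can be sketched as follows, taking the deviance in its usual Poisson form, 2 Σ [y ln(y/μ) − (y − μ)], where y is an observed count and μ the corresponding fitted count. The observed and fitted values are hypothetical and do not come from a real model fit.

```python
import math

def poisson_deviance(observed, fitted):
    """Poisson deviance: 2 * sum of [y*ln(y/mu) - (y - mu)], with y*ln(y/mu) = 0 when y = 0."""
    deviance = 0.0
    for y, mu in zip(observed, fitted):
        log_term = y * math.log(y / mu) if y > 0 else 0.0
        deviance += 2 * (log_term - (y - mu))
    return deviance

def dispersion_ratio(observed, fitted, n_explanatory):
    """Deviance divided by the degrees of freedom, n - k - 1."""
    degrees_of_freedom = len(observed) - n_explanatory - 1
    return poisson_deviance(observed, fitted) / degrees_of_freedom

# Hypothetical observed counts and fitted values from a model with k = 1 explanatory variable
observed = [0, 7, 1, 12, 0, 10]
fitted = [3.0, 4.0, 3.0, 8.0, 4.0, 8.0]

ratio = dispersion_ratio(observed, fitted, n_explanatory=1)
print(round(ratio, 2))  # well above 1 here, which may indicate overdispersion
```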
Alternative to Poisson Analysis
When a group of individuals is followed from a natural ‘starting point’ (e.g. an operation) until the time that the person develops an endpoint of interest, we may use an alternative approach known as survival analysis, which, in contrast to Poisson regression, does not assume that the ‘hazard’ (the rate of the event in a small interval) is constant over time. This approach is described in detail in Chapter 44.