Developing prognostic scores




Why Do We Do It?


Given a large number of demographic or clinical features of an individual, we may want to predict whether that individual is likely to experience an event of interest. This event may either reflect a positive outcome for the individual (e.g. a good response to treatment, a cure) or a negative outcome (e.g. disease, death). We generate a prognostic score (often referred to as a prognostic index or, when predicting a negative outcome, a risk score) for each individual that provides a graded measure of the likelihood that the individual will experience the event.



  •  At its simplest, if considering an event with well-established risk factors (e.g. cardiovascular disease), a score can be generated by counting the number of risk factors possessed by each individual (e.g. male sex, older age, current smoker, family history of cardiovascular disease, diabetes mellitus, dyslipidaemia, hypertension) – this score should provide a crude indication of an individual’s risk of the event (with a higher number indicating a higher risk of cardiovascular disease). However, this approach assumes that each factor contributes equally to the chance of experiencing the event.
  •  A preferred alternative is to use a formal statistical analysis (often a logistic regression (Chapter 30) or a similar method known as discriminant analysis) which identifies factors that are significantly associated with the event and provides an assessment of the relative importance of each of these factors in determining the chance of experiencing the event. The prognostic score for an individual can then be calculated as a weighted sum of his or her values of these factors, using the coefficients from the model as the weights (i.e. z in Chapter 30). Although the range of values of this score depends on how the score is derived, a higher score generally indicates a greater chance of experiencing the event.

Sometimes patients are categorized by their scores, e.g. into those at low, moderate or high risk of experiencing the event. Alternatively, if a logistic regression has been performed, we can use the generated score for an individual to obtain a direct estimate of his or her predicted probability of the event (Chapter 30); as this is a probability, it takes a value from 0 to 1.
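
As a concrete illustration of these calculations, the sketch below uses purely hypothetical logistic regression coefficients and risk-category cut-offs (none of these numbers come from this chapter) to compute the weighted score z for an individual, convert it to a predicted probability and assign a risk category.

```python
import math

# Hypothetical logistic regression coefficients for three risk factors:
# age (per year), current smoking (0/1) and systolic blood pressure (per mmHg).
INTERCEPT = -9.0
COEFFS = {"age": 0.08, "smoker": 0.70, "sbp": 0.02}

def prognostic_score(age, smoker, sbp):
    """Weighted sum of the individual's risk factors (the linear predictor z)."""
    return INTERCEPT + COEFFS["age"] * age + COEFFS["smoker"] * smoker + COEFFS["sbp"] * sbp

def predicted_probability(z):
    """Convert the score to a predicted probability of the event (between 0 and 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def risk_category(p, low=0.1, high=0.2):
    """Assign a risk category using hypothetical probability cut-offs."""
    if p < low:
        return "low"
    return "moderate" if p < high else "high"

z = prognostic_score(age=62, smoker=1, sbp=150)
p = predicted_probability(z)
print(f"score z = {z:.2f}, predicted probability = {p:.2f}, category = {risk_category(p)}")
```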


However, when using a regression model to generate a prognostic score, a model that explains a large proportion of the variability in the data may not necessarily be good at predicting which patients will develop the event. Furthermore, any score, even if based on known risk factors for the event, may provide misleading information on an individual’s prognosis. Therefore, once we have derived a predictive score based on a model, we should assess the validity of that score.


Assessing the Performance of a Prognostic Score


In order to demonstrate that our score will be useful, we should assess its performance by investigating whether it is accurate, able to discriminate between those who do and do not experience the event, correctly calibrated and transportable to other populations; we describe each of these qualities in the sections which follow (where we assume that a higher score indicates a greater chance of experiencing the event). In addition to good performance, a score should also demonstrate clinical value, i.e. it should lead to an improvement in the clinical management of patients. In other words, the score should provide prognostic information and demonstrate better performance than existing risk scores or the raw data. For example, a score based on a patient’s age, sex and blood pressure must demonstrate that it leads to clinical decisions that are different to (and more effective than) those that would have been made based on knowledge of these factors on their own.


1 How Accurate is the Score?


We wish to describe the extent to which the score is able to predict the event correctly.



  • We produce a classification table (Chapter 30 and Appendix C) showing the number of individuals in whom we correctly and incorrectly predict the event (similar to the table in Chapter 38) and calculate relevant measures such as:

    • the sensitivity and specificity;
    • the total accuracy of the score. This is equal to the number of individuals correctly predicted to experience or not experience the event, divided by the total number of individuals – the closer the value is to one, the better the accuracy (a perfect score would correctly predict 100% of individuals).

  • When we have used logistic regression to generate the score, we can calculate the mean Brier score for all n individuals in the sample. The Brier score for the ith individual is the squared difference between the predicted probability of that individual experiencing the event (Pi) and his or her observed outcome (Xi = 1 or 0 if he or she did or did not experience the event, respectively); the mean Brier score is Σ(Pi − Xi)²/n. It gives an indication of model accuracy: a value of 0 indicates that the score predicts the event perfectly, whereas a value of 0.25 indicates a score of no predictive value (equivalent to predicting a probability of 0.5 for every individual). The mean Brier score is closely related to the model R² (Chapter 27).
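
The sketch below illustrates these accuracy measures for a set of predicted probabilities and observed outcomes; the 0.5 probability cut-off used to build the classification table and the small data set are assumptions made purely for illustration.

```python
def performance_summary(probs, outcomes, cutoff=0.5):
    """Classification-table measures and the mean Brier score.

    probs    -- predicted probabilities Pi from the model
    outcomes -- observed outcomes Xi (1 = event, 0 = no event)
    cutoff   -- probability threshold for predicting the event (assumed 0.5 here)
    """
    tp = sum(1 for p, x in zip(probs, outcomes) if p >= cutoff and x == 1)
    fn = sum(1 for p, x in zip(probs, outcomes) if p < cutoff and x == 1)
    tn = sum(1 for p, x in zip(probs, outcomes) if p < cutoff and x == 0)
    fp = sum(1 for p, x in zip(probs, outcomes) if p >= cutoff and x == 0)
    n = len(outcomes)

    sensitivity = tp / (tp + fn)   # proportion of events correctly predicted
    specificity = tn / (tn + fp)   # proportion of non-events correctly predicted
    accuracy = (tp + tn) / n       # overall proportion correctly classified
    brier = sum((p - x) ** 2 for p, x in zip(probs, outcomes)) / n  # mean Brier score

    return {"sensitivity": sensitivity, "specificity": specificity,
            "accuracy": accuracy, "mean_brier": brier}

# Small made-up example
probs = [0.9, 0.7, 0.4, 0.2, 0.1, 0.8]
outcomes = [1, 1, 0, 0, 0, 1]
print(performance_summary(probs, outcomes))
```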

2 How Well Can the Score Discriminate Between Those Who Do and Do Not Experience the Event?


We wish to assess the ability of the score to rank individuals according to their chance of experiencing the event.



  • We categorize individuals according to their scores (e.g. into 5–10 equally sized groups determined by the relevant percentiles) and consider the event rates in each category (see Example). We should observe a trend towards increased event rates in those with higher scores.
  • We draw a receiver operating characteristic (ROC) curve, which is a plot of the sensitivity of the score against (1 − specificity). The curve for a score that has good discriminative ability lies in the upper left-hand quadrant of the plot and that for a score that is no better than chance at discriminating will lie along the 45° diagonal (Fig. 38.1, see also Chapters 30 and 38). The area under the ROC curve (sometimes referred to as AUROC) gives an indication of the ability of the score to discriminate between those who do and do not experience the event. If we randomly select two individuals from our sample, one of whom experiences the event and one of whom does not, AUROC gives the probability that the individual with the event has a higher score than the individual without the event; AUROC will equal 1 for a score which discriminates perfectly, but will equal 0.5 for a score that performs no better than chance.
  • We calculate Harrell’s c statistic, which is a measure of discrimination that is equivalent to AUROC. We select all ‘pairs’ of individuals in the sample with discordant events (i.e. we match every individual who experiences the event to every individual who does not experience the event) – the number of such pairs is our denominator – and calculate the percentage of these pairs in whom the predicted score is higher in the individuals with the event. Where the predicted score in the two individuals is equal, the numerator is increased by 0.5. The c statistic depends on the distribution of the score and/or predicted probabilities – if the sample is relatively homogeneous (i.e. the scores or predicted probabilities are all fairly similar to each other), then c will be close to 0.5.
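
A minimal sketch of the pairwise calculation of Harrell's c statistic described above, using made-up scores and outcomes, is shown below.

```python
def c_statistic(scores, outcomes):
    """Harrell's c statistic (equivalent to AUROC for a binary outcome).

    Every individual who experiences the event is paired with every individual
    who does not; pairs in which the event score is higher count 1 towards the
    numerator and tied scores count 0.5.
    """
    event_scores = [s for s, x in zip(scores, outcomes) if x == 1]
    nonevent_scores = [s for s, x in zip(scores, outcomes) if x == 0]

    pairs = len(event_scores) * len(nonevent_scores)  # denominator: all such pairs
    concordant = 0.0
    for se in event_scores:
        for sn in nonevent_scores:
            if se > sn:
                concordant += 1.0
            elif se == sn:
                concordant += 0.5
    return concordant / pairs

scores = [2.1, 1.4, 0.3, -0.5, 1.4, -1.2]
outcomes = [1, 1, 0, 0, 0, 1]
print(f"c statistic = {c_statistic(scores, outcomes):.2f}")
```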

3 Is the Score Correctly Calibrated?


Where we have used logistic regression to generate the predicted probabilities of the event, we may wish to know whether there is good agreement between these predicted probabilities and the observed probabilities (either 0 or 1) of the event occurring. It is possible for a prognostic score to discriminate well between individuals who do and do not experience the event (i.e. scores may be higher in those who experience the event) while still providing a poor estimate of the risk of the event occurring. This may occur when a prognostic score is applied in a different population to the one from which it was originally derived (e.g. when applying a cardiovascular risk score derived from a population in northern Europe to a population in southern Europe where the underlying risk of cardiovascular disease is much lower). This is of importance if clinical decisions are based on the predicted probability of the event, as poor calibration may result in patients receiving inappropriate care.


To determine model calibration we calculate the Hosmer–Lemeshow goodness of fit statistic which assesses the agreement between the observed event probabilities and those predicted by the score. Individuals in the sample are stratified into g groups (we usually take g = 10 and base the groups on the deciles of the distribution of predicted probabilities from the score; other classifications, e.g. using 8 groups, may result in different conclusions being drawn). The expected frequency of the event in each group is the sum of the predicted probabilities of the event for the individuals in that group. This is compared with the observed frequency of those with the event in the corresponding group by calculating a test statistic which follows a Chi-squared distribution with (g − 2) degrees of freedom (Chapter 8). A P-value < 0.05 suggests that the model is not well calibrated.
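
The following sketch, which assumes NumPy and SciPy are available, illustrates the Hosmer–Lemeshow calculation: individuals are grouped by their predicted probabilities, the observed and expected event frequencies are compared within each group, and the test statistic is referred to a Chi-squared distribution with (g − 2) degrees of freedom.

```python
import numpy as np
from scipy.stats import chi2

def hosmer_lemeshow(probs, outcomes, g=10):
    """Hosmer-Lemeshow goodness-of-fit statistic and P-value.

    Individuals are split into g roughly equal-sized groups after ranking them
    by their predicted probabilities; in each group the observed number of
    events is compared with the expected number (the sum of the predicted
    probabilities in that group).
    """
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    order = np.argsort(probs)                 # rank individuals by predicted probability
    groups = np.array_split(order, g)         # g roughly equal-sized groups

    statistic = 0.0
    for idx in groups:
        n_g = len(idx)
        obs_events = outcomes[idx].sum()
        exp_events = probs[idx].sum()
        statistic += (obs_events - exp_events) ** 2 / exp_events
        statistic += ((n_g - obs_events) - (n_g - exp_events)) ** 2 / (n_g - exp_events)

    df = g - 2
    p_value = chi2.sf(statistic, df)          # P < 0.05 suggests poor calibration
    return statistic, p_value
```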


4 Is the Score Transportable or Generalizable?


We wish to know whether the score will work well in populations that are different from the one from which it was derived. Any prognostic score will always perform well on the data set that was used to derive the score and estimates of model performance (i.e. measures of accuracy, discrimination and calibration) from this data set (internal validation) will be overly optimistic. Thus, we generally require validation on at least one independent data set (external validation) to give a true assessment of the performance of the score; good performance on this independent data set provides evidence that the score is transportable or generalizable.


Where external validation is impractical, a number of alternative methods of internal validation may be used:



  • We separate the data into two subsamples – the training sample, used to derive the score, and the validation sample, used to validate the score. Generally, the training sample is larger than the validation sample (e.g. the training sample may contain 70% of the individuals in the original sample).
  • We perform cross-validation, where we partition the data set into subsets, derive the risk score on some of these subsets and then validate it on those remaining. When performing k-fold cross-validation, we split the data set into k subsets; we derive the score using (k − 1) of the subsets and validate it on the remaining subset. After repeating this process so that each of the k subsets is used once for validation, we average the resulting risk score estimates and measures of model performance (e.g. AUROC) over all the subsets (a sketch of this procedure is given after this list). Leave-one-out cross-validation (analogous to jackknifing – Chapter 11) is similar, but we remove each individual from the data set one at a time, develop the score on the remaining (n − 1) individuals in the sample and validate it on the individual who was removed. Again, we then average the estimates.
  • We can use bootstrapping (Chapter 11) to estimate the prognostic score and assess its performance.
  • When the score is derived from a multicentre study (Chapter 12), we can perform an internal–external cross-validation, in which each centre in turn is excluded from the data set used to derive the score and the score is then validated on that excluded centre. Although the participating centres in a multicentre study generally follow the same study protocol, this approach will provide some evidence of model transportability as the centres are often in different settings.
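
As a sketch of internal validation by k-fold cross-validation (assuming scikit-learn is available; the simulated data and model settings are illustrative only), the score is derived on the training folds, its discrimination (AUROC) is assessed on each held-out fold, and the fold estimates are then averaged.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

def cross_validated_auroc(X, y, k=5, random_state=0):
    """Average AUROC over k cross-validation folds.

    In each fold the score is derived (the model is fitted) on the k - 1
    training folds and its discrimination is assessed on the held-out fold.
    """
    skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=random_state)
    aurocs = []
    for train_idx, test_idx in skf.split(X, y):
        model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
        probs = model.predict_proba(X[test_idx])[:, 1]  # predicted probability of the event
        aurocs.append(roc_auc_score(y[test_idx], probs))
    return float(np.mean(aurocs))

# Illustrative use with simulated data (three hypothetical risk factors)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_logit = X @ np.array([0.8, 0.5, 0.2]) - 0.5
y = (rng.random(200) < 1 / (1 + np.exp(-true_logit))).astype(int)
print(f"cross-validated AUROC = {cross_validated_auroc(X, y):.2f}")
```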

Developing Prognostic Indices and Risk Scores for Other Types of Data


While many of the methods that we have described are most suitable for a binary outcome, using logistic regression or discriminant analysis to estimate the model and produce a risk score, it is possible to generate prognostic scores based on other types of data (e.g. survival data with censoring (Chapter 44), Poisson regression models (Chapter 31)). Many of the tests have been modified to deal with these other types of data although some tests (e.g. the Hosmer–Lemeshow test) are inappropriate when using different models.




