Ethical Implications
What are the Potential Harms of Screening?
Criteria for Screening
If a Test is Available, Should it be Used?
Assessment of Test Effectiveness
Is the Test Valid?
Trade-Offs Between Sensitivity and Specificity
Where Should the Cut-Off for Abnormal be?
Prevalence and Predictive Values
Can Test Results be Trusted?
Tests in Combination
Should a Follow-Up Test be Done?
Benefit or Bias?
Does a Screening Programme Really Improve Health?
Lead-Time Bias
Length Bias
Guidelines for Test Evaluations
Despite extensive use of screening tests in contemporary practice, the underlying principles of screening are poorly understood by clinicians and the lay public alike. Screening is the testing of apparently well people to find those at increased risk of having a disease or disorder. Although an earlier diagnosis generally has intuitive appeal, earlier might not always be better or worth the cost. Four terms describe the validity of a screening test: sensitivity, specificity, predictive value positive, and predictive value negative. For tests with continuous variables (e.g., blood glucose), sensitivity and specificity are inversely related; where the cut-off for abnormal is placed should reflect the clinical consequences of wrong results. The prevalence of disease in a population affects screening test performance: in low-prevalence settings, even very good tests have a poor predictive value positive. Hence, knowledge of the approximate prevalence of disease is a prerequisite to interpreting screening test results. Tests are often done in sequence, as is true for syphilis and HIV-1 infection. Lead time, length, and other biases distort the apparent value of screening programmes; randomised controlled trials are the only way to avoid these biases. The STARD guidelines specify the steps needed to assess tests. Screening can improve health; for example, strong indirect evidence links cervical cytology programmes to declines in cervical cancer mortality. However, inappropriate application or interpretation of screening tests can rob people of their perceived health, initiate harmful diagnostic testing, and squander healthcare resources. Screening for ovarian cancer is a notable example.
Screening is a ‘double-edged sword’, sometimes wielded clumsily by the well-intended. Although ubiquitous in contemporary medical practice, screening remains widely misunderstood and misused. Indeed, most clinicians are unaware of the pitfalls of screening. Screening is defined as ‘The presumptive identification of unrecognised disease or defect by the application of tests, examinations, or other procedures which can be applied rapidly. Screening tests sort out apparently well persons who probably have a disease from those who probably do not’. Looking for additional illnesses in those with medical problems is termed case finding; screening is limited to those apparently well.
One screening fallacy is that if we simply do enough testing, then we can eradicate a disease such as cervical cancer. Fishing for cod explains why this optimism is naïve. Cod were so abundant in the Georges Bank on the continental shelf of North America that the Basques of Spain had established trade routes by AD 1000. The fish were so plentiful that transoceanic expeditions were profitable centuries before Columbus wandered across. The cost of catching cod was negligible. As the Bank was aggressively overfished, first by traditional hand lines and then by commercial trawlers, the fish population diminished greatly over the centuries. By the 1990s the Bank had few cod left (Fig. 8.1). Correspondingly, the cost of catching a fish escalated dramatically. As the frequency of cod (or a disease) decreases, the cost of finding one increases. To catch the last cod in the Atlantic Ocean (or the last case of cervical cancer worldwide) would take extraordinary resources.
The number needed to screen (NNS) reflects this aspect of screening effectiveness. For cancer, this is the number of persons who would have to be screened to prevent one premature death from cancer (usually in the range of 500–1110 persons). For mammography among women older than 50 years, an estimate is 543. For faecal occult blood testing for colorectal cancer, the corresponding number ranges from 600 to 1000 persons. For uncommon cancers, such as oral malignancies in developed countries, the NNS becomes prohibitively large. In the UK, the estimated NNS to prevent one death is >53,000, and to decrease oral cancer mortality rates by 1% is >1,125,000. As a disease becomes rarer, false-positive results dwarf the true positives, and chasing them often causes harm and wastes money. Stated alternatively, finding the last case of any cancer (or fish in the sea) becomes impossibly expensive.
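The arithmetic behind the NNS can be sketched as follows. This is a minimal illustration, assuming the common definition of the NNS as the reciprocal of the absolute reduction in disease-specific mortality attributable to screening. The function name and the mortality risks are hypothetical; they are not taken from the estimates quoted above.

```python
def number_needed_to_screen(risk_unscreened: float, risk_screened: float) -> float:
    """Return 1 / absolute risk reduction (assumed definition of the NNS)."""
    absolute_risk_reduction = risk_unscreened - risk_screened
    if absolute_risk_reduction <= 0:
        raise ValueError("screening must lower the risk for the NNS to be defined")
    return 1.0 / absolute_risk_reduction


# Hypothetical risks of death from the cancer over the follow-up period.
# Both scenarios assume the same 20% relative reduction from screening.
nns_common = number_needed_to_screen(0.0100, 0.0080)    # common cancer
nns_rare = number_needed_to_screen(0.00010, 0.00008)    # cancer 100 times rarer

print(f"common cancer: screen about {round(nns_common)} people per death prevented")
print(f"rare cancer:   screen about {round(nns_rare)} people per death prevented")
```

With the relative benefit held constant, the NNS in this sketch rises in direct proportion to the rarity of the disease, which is the cod-fishing point in numerical form.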
Screening can improve health. For example, strong indirect evidence supports cytology screening for cervical cancer. Insufficient use of this screening method accounts for a large proportion of invasive cervical cancers in industrialised nations. Other beneficial examples include screening for hypertension in adults; screening for hepatitis B and C virus antigen, HIV, chlamydia infection, and syphilis in pregnant women; routine urine culture in pregnant women at 12 to 16 weeks’ gestation; and phenylketonuria screening in newborns. However, inappropriate screening harms healthy individuals and squanders precious resources. Here, we review the purposes of screening, the selection of tests, measurement of validity, the effect of prevalence on test outcome, and some biases that can distort interpretation of tests.
Ethical Implications
What are the potential harms of screening?
Screening differs from the traditional clinical use of tests in several important ways. Ordinarily, patients consult with clinicians about complaints or problems; this prompts testing to confirm or exclude a diagnosis. Because the patient feels unwell and requests our help, the risk and expense of tests are usually deemed acceptable by the patient. By contrast, screening engages apparently healthy individuals who are not seeking medical help (and who might prefer to be left alone). Alternatively, consumer-generated demand for screening, such as for osteoporosis and ovarian cancer, might lead to expensive programmes of no clear value. Hence the cost, injury, and stigmatisation related to screening are especially important (though often ignored in our zeal for earlier diagnosis); the medical and ethical standards of screening should be, correspondingly, higher than with diagnostic tests. Bluntly put: every adverse outcome of screening is iatrogenic and inconsistent with the ethical principle of nonmaleficence.
Screening has a darker side that is often overlooked. It can be nauseating (oral glucose tolerance test for gestational diabetes), unpleasant (bowel preparation before colonoscopy), and both expensive and uncomfortable (mammography). Ovarian cancer screening is a prototype. The deadliest gynaecological cancer is usually detected when metastatic, and 5-year survival rates are grim. Hence some enthusiasts urged mounting screening programmes with vaginal ultrasound rather than waiting for empirical evidence of benefit. Fortunately, large randomised trials of ovarian cancer screening with ultrasound and CA-125 were subsequently done. In the UK trial, no significant survival benefit was found with screening. In the US trial of 78,216 women aged 55 to 74 years, no benefit in cancer mortality was found; this was confirmed by follow-up at a median of 15 years. Screening enthusiasts usually ignore the harms occasioned by screening. In the US trial, 3285 women had false-positive results; of these, 1080 had an operation as a result. Among these surgical patients, 163 had one or more serious complications, a surgical morbidity rate of 15%. Stated alternatively, screening for ovarian cancer in this age group caused net harm to women and wasted resources. That is unethical.
Cervical cancer screening, although likely useful, has important harms as well. In the late 1990s liquid-based cytology was developed as an alternative to the venerable Pap smear. Based on unsubstantiated claims of better sensitivity than a Pap smear, the new and more expensive screening test soon dominated the US market. Claims of superiority over the traditional Pap smear subsequently were refuted, by which time the liquid-based cytology had become firmly entrenched in practice. This change to liquid-based cytology drove up the cost of finding cervical cancer, a clear setback in public health. Other harms of cervical cancer screening include the stigma of labelling, anxiety about cancer and loss of childbearing, extended and frequent future surveillance, and operations on the cervix. Excisional procedures on the cervix are linked with adverse pregnancy outcomes. One report estimated that in the United States in 2007, more than 4 million women had health problems related to cervical cancer screening, 800,000 experienced anxiety, and more than 3 million had adverse events related to biopsy or treatment. Moreover, cervical cancer screening practices may have led to an estimated 5300 preterm births.
The appropriate role of mammography screening remains in flux, despite its wide use and aggressive promotion. An evaluation of three decades of mammography in the United States was discouraging. The rate of early-stage cancers detected doubled, while that of late-stage disease decreased only 8%. A by-product of screening was substantial overdiagnosis of cancer. Because the natural history of ductal carcinoma in situ of the breast remains unclear, a recommendation has been made to delete the anxiety-provoking word ‘carcinoma’ from the term. Until it metastasises, it is not cancer. The same holds true for in situ lesions of the prostate. More men die with prostate cancer than of prostate cancer. About half of all diagnosed cases of prostate cancer do not benefit from treatment.
Prenatal testing for foetal chromosomal abnormalities poses another emerging crisis in screening practice. Noninvasive testing uses cell-free foetal DNA found in the pregnant woman’s plasma. These tests first became commercially available in 2011 and have been aggressively promoted to the lay public. Direct-to-consumer advertising has touted ‘near-perfect accuracy’, and uptake skyrocketed in the absence of adequate evaluation and genetic counselling regarding interpretation of results. What consumers do not understand is that the predictive value positive, even of an accurate test, is poor for rare conditions such as genetic abnormalities. Frightened women are now bypassing the requisite confirmatory tests and proceeding directly to abortion based on a screening test. One report found that 6% of women who were informed of a foetus at high risk aborted their pregnancies without a confirmatory amniocentesis and karyotype. The use of diagnostic testing plummeted after introduction of noninvasive screening, raising concerns that clinicians’ skill with amniocentesis and chorionic villus sampling will dwindle as a result.
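Why an accurate test can still mislead when the condition is rare is easiest to see with numbers. The sketch below applies Bayes' theorem to compute the predictive value positive from sensitivity, specificity, and prevalence. The test characteristics (99% sensitivity, 99.9% specificity) and the prevalences are hypothetical values chosen for illustration; they do not describe any particular cell-free DNA test.

```python
def predictive_value_positive(sensitivity: float, specificity: float,
                              prevalence: float) -> float:
    """Probability that a person with a positive result truly has the condition."""
    true_positives = sensitivity * prevalence
    false_positives = (1.0 - specificity) * (1.0 - prevalence)
    return true_positives / (true_positives + false_positives)


# Hypothetical test: 99% sensitivity, 99.9% specificity (illustrative only).
for prevalence in (0.10, 0.01, 0.001):
    pvp = predictive_value_positive(0.99, 0.999, prevalence)
    print(f"prevalence {prevalence:.3f}: predictive value positive {pvp:.1%}")
```

Under these assumed figures, about half of the positive results at a prevalence of 1 per 1000 are false, which is why a positive screening result calls for a confirmatory diagnostic test before any irreversible decision.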
A second wave of injury can arise after the initial screening insult. Although the stigma associated with correct labelling of people as ill might be acceptable, those incorrectly labelled as sick suffer as well. For example, labelling productive steelworkers as being hypertensive led to increased absenteeism and adoption of a sick role, independent of treatment or disease severity. Women labelled as having gestational diabetes reported deterioration in their health and that of their children over the 5 years after diagnosis. Awareness, as opposed to ignorance of hypothyroidism, diabetes, and hypertension, is associated with worse self-reported health status. For some diseases, ‘ignorance may be bliss’.
Treatment of hyperlipidaemia with clofibrate several decades ago provides another sobering example. Treatment of the cholesterol count (a surrogate endpoint, rather than an illness itself) inadvertently led to a 17% increase in mortality among middle-aged men given the drug (Chapter 18). This screening misadventure cost the lives of more than 5000 men in the United States alone. Because of these mishaps, screening practices should be more selective.
Criteria for Screening
If a test is available, should it be used?
The availability of a screening test does not imply that it should be used. Indeed, before screening is done, the strategy must meet stringent criteria. Many of these were included in early guidance from the World Health Organization. The disease should be medically important and clearly defined, and its prevalence reasonably well known. The natural history should be known, and an effective intervention must exist. The screening programme must be cost-effective, facilities for diagnosis and treatment must be readily available, and the course of action after a positive result must be generally agreed on and acceptable to those screened.
Finally, the test must do its job. It should be safe, have a reasonable cut-off level defined, and be both valid and reliable. The latter two terms, often used interchangeably, are distinct. Validity is the ability of a test to measure what it sets out to measure, usually differentiating between those with and without the disease. By contrast, reliability indicates repeatability. For example, a bathroom scale that consistently measures 2 kg heavier than a hospital scale (‘the gold standard’) provides an invalid but highly reliable result.
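The bathroom-scale example can be put into numbers. In this minimal sketch, validity is summarised as the average difference from the reference value (bias) and reliability as the scatter among repeated readings. The readings and the reference weight are hypothetical, and summarising the two concepts with a mean difference and a standard deviation is an assumption made for illustration.

```python
from statistics import mean, stdev

# Hypothetical data: one person weighed once on the hospital scale
# (the gold standard) and five times on the bathroom scale.
reference_kg = 70.0
bathroom_readings_kg = [72.0, 72.1, 71.9, 72.0, 72.0]

# Validity: how far the readings sit, on average, from the gold standard.
bias_kg = mean(bathroom_readings_kg) - reference_kg

# Reliability: how closely repeated readings agree with one another.
spread_kg = stdev(bathroom_readings_kg)

print(f"bias: {bias_kg:+.2f} kg (large bias means poor validity)")
print(f"spread: {spread_kg:.2f} kg (small spread means good reliability)")
```

The scale in this sketch is consistently wrong by about 2 kg, so it is reliable but not valid.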
Although an early diagnosis generally has intuitive appeal, earlier might not always be better. For example, what benefit would accrue (and at what cost) from early diagnosis of Alzheimer disease, which to date has no effective treatment? What merit has earlier diagnosis of cervical cancer in developing countries if no treatment is available? The net effect of screening would be deprivation of one’s sense of well-being. Sackett and colleagues proposed a pragmatic checklist to help decide when (or if) seeking a diagnosis earlier than usual is worth the expense and bother. Does early diagnosis really benefit those screened, for example, in survival or quality of life? Can the clinician manage the additional time required to confirm the diagnosis and deal with those diagnosed before symptoms developed? Will those diagnosed earlier comply with the proposed treatment? Has the effectiveness of the screening strategy been established empirically rather than theoretically? Finally, are the cost and accuracy of the test clinically acceptable?
Assessment of Test Effectiveness
Is the test valid?
For over half a century, four indices of test validity have been widely used: sensitivity, specificity, predictive value positive, and predictive value negative. Although clinically useful (and far improved over clinical hunches), these terms are predicated on an assumption that is often clinically unrealistic (i.e., that all people can be dichotomised as ill or well). Indeed, one definition of an epidemiologist is a person who sees the entire world in a 2 × 2 table. Often, those tested simply do not fit neatly into these designations: they might be possibly ill, early ill, probably well, or some other variant. Likelihood ratios, which incorporate varying (not just dichotomous) degrees of test results, can be used to refine clinicians’ judgements about the probability of disease in a particular person (Chapter 9).
For simplicity, however, assume a population has been tested and assigned to the four mutually exclusive cells shown in Fig. 8.2. Sensitivity, sometimes termed the detection rate, is the ability of a test to find those with the disease. All those with disease are in the left column. Hence, the sensitivity is simply those correctly identified by the test (a) divided by all those sick (a + c). Specificity denotes the ability of a test to identify those without the condition. Calculation of this proportion is trickier, however. By analogy to sensitivity, some assume (incorrectly) that the formula here is b/(b + d). However, the numerator for specificity is cell d (the true negatives), which is divided by all those healthy (b + d).
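The two formulas can be checked with a short sketch. The cell labels follow the text: a (true positives) and c (false negatives) make up the diseased column, b (false positives) and d (true negatives) the healthy column. This layout is inferred from the formulas above, and the counts are hypothetical.

```python
def sensitivity(a: int, c: int) -> float:
    """True positives divided by all those with the disease: a / (a + c)."""
    return a / (a + c)


def specificity(b: int, d: int) -> float:
    """True negatives divided by all those without the disease: d / (b + d)."""
    return d / (b + d)


# Hypothetical 2 x 2 table: 100 people with the disease, 900 without.
a, b, c, d = 90, 50, 10, 850

print(f"sensitivity: {sensitivity(a, c):.3f}")
print(f"specificity: {specificity(b, d):.3f}")

# The tempting but incorrect formula b / (b + d) gives the false-positive
# rate, which is the complement of specificity.
false_positive_rate = b / (b + d)
print(f"false-positive rate: {false_positive_rate:.3f}")
```

In this sketch the specificity and the false-positive rate sum to 1, which is a quick way to confirm that the correct numerator, d, was used.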