Chapter 35
Concepts in assessment

Test theories

Test theories or psychometric models seek to explain what happens when an individual takes a test (Crocker & Algina 1986). They provide guidance about how to select items, how long tests need to be, the inferences that can be drawn from scores and the confidence that can be placed in the final results. Each model makes different assumptions and, based on those assumptions, different benefits accrue. There are many psychometric models, but three deserve attention here because they are used frequently in medical education.

Classical test theory (CTT)

CTT has been the dominant model in testing for decades, with roots in the late 19th and early 20th centuries (Lord & Novick 1968). It assumes that an individual’s score on a test has two parts: true score (what is intended to be measured) and error. To apply CTT to a practical testing situation, a series of very restrictive assumptions must be made. The bad news is that these assumptions are often violated in practice; the good news is that even when this happens, it seldom makes a practical difference (i.e. the model is robust to violations of its assumptions). A number of useful concepts and tools have been developed based on CTT (De Champlain 2010). Among the most powerful is reliability, which indicates the amount of error in observed scores. Also very useful has been the development of item statistics, which help with the process of test development. CTT has contributed significantly to the development of a series of excellent assessments, and it continues to be used and useful today.
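The chapter does not tie reliability or item statistics to any particular formula, so the following Python sketch is purely illustrative: with an invented matrix of item scores, it computes classical item difficulties and Cronbach's alpha, one common CTT-based estimate of reliability. The data and the choice of coefficient are assumptions made for the example.

```python
import numpy as np

# Hypothetical item-score matrix: rows are examinees, columns are items (1 = correct, 0 = incorrect).
scores = np.array([
    [1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0],
    [1, 0, 1, 1, 1],
    [1, 1, 1, 0, 1],
])

# Classical item difficulty: the proportion of examinees answering each item correctly.
difficulty = scores.mean(axis=0)

# Cronbach's alpha, a common CTT reliability estimate: higher values indicate
# a smaller share of observed-score variance attributable to error.
k = scores.shape[1]
item_variances = scores.var(axis=0, ddof=1)
total_variance = scores.sum(axis=1).var(ddof=1)
alpha = (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

print("Item difficulties:", np.round(difficulty, 2))
print(f"Estimated reliability (alpha): {alpha:.2f}")
```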
Generalizability theory (GT)

Although it has its roots in the middle of the 20th century, GT rose to prominence with the publication of a book by Cronbach et al in 1972. Like CTT (which is a special case of GT), it assumes that an individual’s score on a test has two parts: true score and error. However, compared to CTT, GT makes relatively weak assumptions. Consequently, it applies to a very broad range of assessments and, like CTT, even when these assumptions are violated it usually makes little practical difference. GT offers a number of advantages over CTT (Brennan 2001). For example, GT allows the error in a test to be divided among different sources. So in a rating situation, GT allows the error associated with the rater to be separated from the error associated with the rating form they are filling out. Likewise, GT supports a distinction between scores that are intended to rank individuals as opposed to scores that are intended to represent how much they know. Given these advantages, GT has special applicability to the types of assessment situations found in medical education.
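To make the partitioning of error concrete, here is a minimal generalizability analysis for a hypothetical, fully crossed persons-by-raters design, again in Python with invented ratings. It estimates variance components for persons, raters and their interaction from the usual ANOVA mean squares, and then reports a G coefficient (for relative, rank-ordering decisions) and a Phi coefficient (for absolute decisions), echoing the distinction described above. The design, the number of raters and the data are assumptions made for the illustration.

```python
import numpy as np

# Hypothetical ratings: rows = examinees (persons), columns = raters; fully crossed p x r design.
ratings = np.array([
    [7.0, 6.0, 8.0],
    [5.0, 4.0, 5.0],
    [9.0, 8.0, 9.0],
    [6.0, 5.0, 7.0],
    [8.0, 7.0, 7.0],
])
n_p, n_r = ratings.shape

grand = ratings.mean()
person_means = ratings.mean(axis=1)
rater_means = ratings.mean(axis=0)

# ANOVA mean squares for a crossed design with one observation per cell.
ms_p = n_r * ((person_means - grand) ** 2).sum() / (n_p - 1)
ms_r = n_p * ((rater_means - grand) ** 2).sum() / (n_r - 1)
resid = ratings - person_means[:, None] - rater_means[None, :] + grand
ms_pr = (resid ** 2).sum() / ((n_p - 1) * (n_r - 1))

# Estimated variance components.
var_pr = ms_pr                        # person-by-rater interaction (confounded with residual error)
var_p = max((ms_p - ms_pr) / n_r, 0)  # person (true-score) variance
var_r = max((ms_r - ms_pr) / n_p, 0)  # rater variance (leniency/stringency)

# Error variance for the average of n_r raters: relative (ranking) vs absolute (level of performance).
rel_error = var_pr / n_r
abs_error = var_r / n_r + var_pr / n_r
print(f"G coefficient (relative decisions): {var_p / (var_p + rel_error):.2f}")
print(f"Phi coefficient (absolute decisions): {var_p / (var_p + abs_error):.2f}")
```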
Item response theory (IRT)

With considerable interest starting in the 1970s, the use of IRT has grown substantially, especially among national testing agencies (Hambleton et al 1991). Unlike GT, IRT makes very strong assumptions about items, tests and individuals. These assumptions are difficult to meet, so there are a number of different IRT models, each with assumptions suited to particular assessment situations. When the assumptions are met, many practical benefits accrue (Downing 2003). For example, individual scores are independent of exactly which set of items is taken, and item statistics are independent of the individuals who take the test. So individuals can answer completely different test questions, yet their scores will still be comparable. As another example, IRT supports the creation of tests that are targeted to a particular score, often the pass–fail point. This permits a shorter test than would otherwise be the case.
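The claim that scores can be comparable even when examinees answer different questions can be illustrated with one of the simpler IRT models. The sketch below assumes a two-parameter logistic (2PL) model with invented, pre-calibrated item parameters for two hypothetical forms, and estimates ability by a simple grid-search maximum likelihood; operational programmes use more sophisticated estimation, so this is only a sketch of the idea.

```python
import numpy as np

def p_correct(theta, a, b):
    """Two-parameter logistic (2PL) item response function: probability of a correct
    answer given ability theta, item discrimination a and item difficulty b."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# Hypothetical calibrated item parameters for two different test forms (on a common scale).
form_a = {"a": np.array([1.2, 0.8, 1.5]), "b": np.array([-0.5, 0.0, 1.0])}
form_b = {"a": np.array([1.0, 1.3, 0.9]), "b": np.array([-1.0, 0.5, 1.5])}

def ability_estimate(responses, items):
    """Maximum-likelihood ability estimate by a simple grid search over theta."""
    grid = np.linspace(-4, 4, 801)
    p = p_correct(grid[:, None], items["a"], items["b"])
    log_lik = (responses * np.log(p) + (1 - responses) * np.log(1 - p)).sum(axis=1)
    return grid[np.argmax(log_lik)]

# Because the items are calibrated to a common scale, ability estimates based on
# different item sets are directly comparable.
print(ability_estimate(np.array([1, 1, 0]), form_a))
print(ability_estimate(np.array([1, 0, 0]), form_b))
```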
Score interpretation

A score is a letter or number that reflects how well an individual performs on an assessment. When a test is being developed, one of the first decisions to be made is how the scores will be interpreted: norm-referenced or criterion-referenced (Glaser 1963). This decision has implications for how the items or cases are chosen, what the scores mean when they are used by students, teachers and institutions, and how the reliability or reproducibility of scores is conceived (Popham & Husek 1969).
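As a small, invented illustration of the two interpretations, the same raw score can be reported against the performance of the cohort (norm-referenced) or against the content of the examination (criterion-referenced); the numbers below are hypothetical.

```python
import numpy as np

cohort_scores = np.array([52, 58, 61, 64, 66, 69, 71, 74, 78, 83])  # hypothetical percent-correct scores
student_score = 71

# Norm-referenced interpretation: the score is described relative to the cohort.
percentile = np.mean(cohort_scores < student_score) * 100
print(f"Norm-referenced: scored higher than {percentile:.0f}% of the cohort")

# Criterion-referenced interpretation: the score is described against the content tested.
print(f"Criterion-referenced: answered {student_score}% of the test content correctly")
```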
Score equivalence

There are a variety of important assessment situations in which scores are not equivalent but where adjustments can be made. For example, in an MCQ or SP examination, the test items or cases are often changed over time and, despite maintaining similar content, these versions or forms of the same assessment differ in difficulty. With these methods of assessment, the issue can be addressed through test equating (Kolen & Brennan 1995, van der Linden & Hambleton 1997). Equating is a set of procedures, designs and statistics used to adjust scores so that it is as if everyone took the same test. Although this provides a way to adjust scores, it is complicated, time intensive, and labour intensive. Consequently, it is used often in national assessments and less frequently for locally developed tests.
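Equating designs and statistics can be elaborate, but a minimal sketch conveys the idea. Assuming a randomly equivalent groups design and linear (mean-sigma) equating, scores on a new form are placed on the scale of a reference form by matching the means and standard deviations of the two score distributions; the scores below are invented and the method is only one of the simpler options in the equating literature.

```python
import numpy as np

# Hypothetical total scores from two randomly equivalent groups,
# each taking a different form of the same examination.
form_x = np.array([62, 70, 75, 58, 66, 71, 80, 64])   # new form (slightly harder)
form_y = np.array([68, 74, 79, 63, 72, 76, 85, 70])   # reference form

# Linear (mean-sigma) equating: match the mean and standard deviation of form X to form Y.
slope = form_y.std(ddof=1) / form_x.std(ddof=1)
intercept = form_y.mean() - slope * form_x.mean()

def equate(x_score):
    """Place a form X score on the form Y scale."""
    return slope * x_score + intercept

print(f"A score of 65 on form X is treated as equivalent to {equate(65):.1f} on form Y")
```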
There are also a variety of important assessment situations where scores are not equivalent and where good adjustment is not practical or possible. For example, virtually all methods of assessment based on observation of trainees’ encounters with real patients yield scores (or ratings) that are not equivalent, because patients vary in the level of challenge they present and observers differ in how hard they grade. Attempts to minimize these unwanted sources of influence on scores usually take the form of wide sampling of patients and faculty (in the hope of balancing out the easy and difficult patients), training of observers, and certain IRT-based methods that statistically minimize some of the differences among observers (Linacre 1989). None of these is wholly satisfactory, however, and although these types of assessment are essential to the training and credentialing of doctors, the results must be interpreted with some caution when used for summative purposes. They are well suited to formative assessment.

Standards

It is sometimes important to categorize performance on a test, usually as pass or fail (although there are times when more than these two categories are needed). The score that separates those who pass from those who fail is called the standard or pass–fail point. It is an answer to the question, ‘How much is enough?’ There are two types of standards: relative and absolute (Norcini 2003).
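As an illustration of the two types, the sketch below sets a cut score in two ways on invented data: a relative standard defined by the performance of the group (here, one standard deviation below the mean) and an absolute standard defined by how much examinees must know (here, the mean of judges' Angoff-style estimates for a borderline examinee). Both rules are examples chosen for the sketch, not recommendations.

```python
import numpy as np

scores = np.array([48, 55, 60, 62, 65, 67, 70, 72, 75, 81])  # hypothetical percent-correct scores

# Relative standard: defined in terms of the performance of the group,
# e.g. one standard deviation below the mean (the specific rule is illustrative).
relative_cut = scores.mean() - scores.std(ddof=1)

# Absolute standard: defined in terms of how much examinees must know,
# e.g. the mean of judges' Angoff-style estimates of the score a borderline
# examinee would achieve (the judgments are invented for illustration).
angoff_judgments = np.array([58, 62, 60, 64, 59])
absolute_cut = angoff_judgments.mean()

print(f"Relative standard: {relative_cut:.1f} -> {np.mean(scores < relative_cut):.0%} of the group fails")
print(f"Absolute standard: {absolute_cut:.1f} -> {np.mean(scores < absolute_cut):.0%} of the group fails")
```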
