Chapter 35
Concepts in assessment

Test theories

Test theories or psychometric models seek to explain what happens when an individual takes a test (Crocker & Algina 1986). They provide guidance about how to select items, how long tests need to be, the inferences that can be drawn from scores and the confidence that can be placed in the final results. Each model makes different assumptions and, based on those assumptions, different benefits accrue. There are many psychometric models, but three deserve attention here because they are used frequently in medical education.

Classical test theory (CTT)

CTT has been the dominant model in testing for decades, with roots in the late 19th and early 20th centuries (Lord & Novick 1968). It assumes that an individual’s score on a test has two parts: true score (what is intended to be measured) and error. To apply CTT to a practical testing situation, a series of very restrictive assumptions must be made. The bad news is that these assumptions are often violated in practice; the good news is that even when this happens, it seldom makes a practical difference (i.e. the model is robust to violations of its assumptions). A number of useful concepts and tools have been developed based on CTT (De Champlain 2010). Among the most powerful is reliability, which indicates the amount of error in observed scores. Also very useful has been the development of item statistics, which help with the process of test development. CTT has contributed significantly to the development of a series of excellent assessments, and it continues to be used and useful today.
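The chapter does not tie reliability or item statistics to any particular formula, so the following Python sketch is purely illustrative: with an invented matrix of item scores, it computes classical item difficulties and Cronbach's alpha, one common CTT-based estimate of reliability. The data and the choice of coefficient are assumptions made for the example.

```python
import numpy as np

# Hypothetical item-score matrix: rows are examinees, columns are items (1 = correct, 0 = incorrect).
scores = np.array([
    [1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0],
    [1, 0, 1, 1, 1],
    [1, 1, 1, 0, 1],
])

# Classical item difficulty: the proportion of examinees answering each item correctly.
difficulty = scores.mean(axis=0)

# Cronbach's alpha, a common CTT reliability estimate: higher values indicate
# a smaller share of observed-score variance attributable to error.
k = scores.shape[1]
item_variances = scores.var(axis=0, ddof=1)
total_variance = scores.sum(axis=1).var(ddof=1)
alpha = (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

print("Item difficulties:", np.round(difficulty, 2))
print(f"Estimated reliability (alpha): {alpha:.2f}")
```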
Generalizability theory (GT)

Although it has its roots in the middle of the 20th century, GT rose to prominence with the publication of a book by Cronbach et al in 1972. Like CTT (which is a special case of GT), it assumes that an individual’s score on a test has two parts: true score and error. However, compared to CTT, GT makes relatively weak assumptions. Consequently, it applies to a very broad range of assessments and, like CTT, even when these assumptions are violated it usually makes little practical difference. GT offers a number of advantages over CTT (Brennan 2001). For example, GT allows the error in a test to be divided among different sources. So in a rating situation, GT allows the error associated with the rater to be separated from the error associated with the rating form they are filling out. Likewise, GT supports a distinction between scores that are intended to rank individuals as opposed to scores that are intended to represent how much they know. Given these advantages, GT has special applicability to the types of assessment situations found in medical education.
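To make the partitioning of error concrete, here is a minimal generalizability analysis for a hypothetical, fully crossed persons-by-raters design, again in Python with invented ratings. It estimates variance components for persons, raters and their interaction from the usual ANOVA mean squares, and then reports a G coefficient (for relative, rank-ordering decisions) and a Phi coefficient (for absolute decisions), echoing the distinction described above. The design, the number of raters and the data are assumptions made for the illustration.

```python
import numpy as np

# Hypothetical ratings: rows = examinees (persons), columns = raters; fully crossed p x r design.
ratings = np.array([
    [7.0, 6.0, 8.0],
    [5.0, 4.0, 5.0],
    [9.0, 8.0, 9.0],
    [6.0, 5.0, 7.0],
    [8.0, 7.0, 7.0],
])
n_p, n_r = ratings.shape

grand = ratings.mean()
person_means = ratings.mean(axis=1)
rater_means = ratings.mean(axis=0)

# ANOVA mean squares for a crossed design with one observation per cell.
ms_p = n_r * ((person_means - grand) ** 2).sum() / (n_p - 1)
ms_r = n_p * ((rater_means - grand) ** 2).sum() / (n_r - 1)
resid = ratings - person_means[:, None] - rater_means[None, :] + grand
ms_pr = (resid ** 2).sum() / ((n_p - 1) * (n_r - 1))

# Estimated variance components.
var_pr = ms_pr                        # person-by-rater interaction (confounded with residual error)
var_p = max((ms_p - ms_pr) / n_r, 0)  # person (true-score) variance
var_r = max((ms_r - ms_pr) / n_p, 0)  # rater variance (leniency/stringency)

# Error variance for the average of n_r raters: relative (ranking) vs absolute (level of performance).
rel_error = var_pr / n_r
abs_error = var_r / n_r + var_pr / n_r
print(f"G coefficient (relative decisions): {var_p / (var_p + rel_error):.2f}")
print(f"Phi coefficient (absolute decisions): {var_p / (var_p + abs_error):.2f}")
```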
Item response theory (IRT)

With considerable interest starting in the 1970s, the use of IRT has grown substantially, especially among national testing agencies (Hambleton et al 1991). Unlike GT, IRT makes very strong assumptions about items, tests and individuals. These assumptions are difficult to meet, so there are a number of different IRT models, each with assumptions suited to particular assessment situations. When the assumptions are met, many practical benefits accrue (Downing 2003). For example, individual scores are independent of exactly which set of items is taken, and item statistics are independent of the individuals who take the test. So individuals can answer completely different test questions, yet their scores will still be comparable. As another example, IRT supports the creation of tests that are targeted to a particular score, often the pass–fail point. This permits a shorter test than would otherwise be the case.
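The claim that scores can be comparable even when examinees answer different questions can be illustrated with one of the simpler IRT models. The sketch below assumes a two-parameter logistic (2PL) model with invented, pre-calibrated item parameters for two hypothetical forms, and estimates ability by a simple grid-search maximum likelihood; operational programmes use more sophisticated estimation, so this is only a sketch of the idea.

```python
import numpy as np

def p_correct(theta, a, b):
    """Two-parameter logistic (2PL) item response function: probability of a correct
    answer given ability theta, item discrimination a and item difficulty b."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# Hypothetical calibrated item parameters for two different test forms (on a common scale).
form_a = {"a": np.array([1.2, 0.8, 1.5]), "b": np.array([-0.5, 0.0, 1.0])}
form_b = {"a": np.array([1.0, 1.3, 0.9]), "b": np.array([-1.0, 0.5, 1.5])}

def ability_estimate(responses, items):
    """Maximum-likelihood ability estimate by a simple grid search over theta."""
    grid = np.linspace(-4, 4, 801)
    p = p_correct(grid[:, None], items["a"], items["b"])
    log_lik = (responses * np.log(p) + (1 - responses) * np.log(1 - p)).sum(axis=1)
    return grid[np.argmax(log_lik)]

# Because the items are calibrated to a common scale, ability estimates based on
# different item sets are directly comparable.
print(ability_estimate(np.array([1, 1, 0]), form_a))
print(ability_estimate(np.array([1, 0, 0]), form_b))
```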
Score interpretation

A score is a letter or number that reflects how well an individual performs on an assessment. When a test is being developed, one of the first decisions to be made is how the scores will be interpreted: norm-referenced or criterion-referenced (Glaser 1963). This decision has implications for how the items or cases are chosen, what the scores mean when they are used by students, teachers and institutions, and how the reliability or reproducibility of scores is conceived (Popham & Husek 1969).
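As a small, invented illustration of the two interpretations, the same raw score can be reported against the performance of the cohort (norm-referenced) or against the content of the examination (criterion-referenced); the numbers below are hypothetical.

```python
import numpy as np

cohort_scores = np.array([52, 58, 61, 64, 66, 69, 71, 74, 78, 83])  # hypothetical percent-correct scores
student_score = 71

# Norm-referenced interpretation: the score is described relative to the cohort.
percentile = np.mean(cohort_scores < student_score) * 100
print(f"Norm-referenced: scored higher than {percentile:.0f}% of the cohort")

# Criterion-referenced interpretation: the score is described against the content tested.
print(f"Criterion-referenced: answered {student_score}% of the test content correctly")
```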
Score equivalence

There are a variety of important assessment situations in which scores are not equivalent but where adjustments can be made. For example, in an MCQ or SP examination, the test items or cases are often changed over time and, despite maintaining similar content, these versions or forms of the same assessment differ in difficulty. With these methods of assessment, the issue can be addressed through test equating (Kolen & Brennan 1995, van der Linden & Hambleton 1997). Equating is a set of procedures, designs and statistics used to adjust scores so that it is as if everyone took the same test. Although this provides a way to adjust scores, it is complicated, time intensive, and labour intensive. Consequently, it is used often in national assessments and less frequently for locally developed tests.
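Equating designs and statistics can be elaborate, but a minimal sketch conveys the idea. Assuming a randomly equivalent groups design and linear (mean-sigma) equating, scores on a new form are placed on the scale of a reference form by matching the means and standard deviations of the two score distributions; the scores below are invented and the method is only one of the simpler options in the equating literature.

```python
import numpy as np

# Hypothetical total scores from two randomly equivalent groups,
# each taking a different form of the same examination.
form_x = np.array([62, 70, 75, 58, 66, 71, 80, 64])   # new form (slightly harder)
form_y = np.array([68, 74, 79, 63, 72, 76, 85, 70])   # reference form

# Linear (mean-sigma) equating: match the mean and standard deviation of form X to form Y.
slope = form_y.std(ddof=1) / form_x.std(ddof=1)
intercept = form_y.mean() - slope * form_x.mean()

def equate(x_score):
    """Place a form X score on the form Y scale."""
    return slope * x_score + intercept

print(f"A score of 65 on form X is treated as equivalent to {equate(65):.1f} on form Y")
```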
There are also a variety of important assessment situations where scores are not equivalent and where good adjustment is not practical or possible. For example, virtually all methods of assessment based on observation of trainees’ encounters with real patients yield scores (or ratings) that are not equivalent, because patients vary in the level of challenge they present and observers differ in how hard they grade. Attempts to minimize these unwanted sources of influence on scores usually take the form of wide sampling of patients and faculty (in the hope of balancing out the easy and difficult patients), training of observers, and certain IRT-based methods that statistically minimize some of the differences among observers (Linacre 1989). None of these is wholly satisfactory, however, and although these types of assessment are essential to the training and credentialing of doctors, the results must be interpreted with some caution when used for summative purposes. They are well suited to formative assessment.

Standards

It is sometimes important to categorize performance on a test, usually as pass or fail (although there are times when more than these two categories are needed). The score that separates those who pass from those who fail is called the standard or pass–fail point. It is an answer to the question, ‘How much is enough?’ There are two types of standards: relative and absolute (Norcini 2003).
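As an illustration of the two types, the sketch below sets a cut score in two ways on invented data: a relative standard defined by the performance of the group (here, one standard deviation below the mean) and an absolute standard defined by how much examinees must know (here, the mean of judges' Angoff-style estimates for a borderline examinee). Both rules are examples chosen for the sketch, not recommendations.

```python
import numpy as np

scores = np.array([48, 55, 60, 62, 65, 67, 70, 72, 75, 81])  # hypothetical percent-correct scores

# Relative standard: defined in terms of the performance of the group,
# e.g. one standard deviation below the mean (the specific rule is illustrative).
relative_cut = scores.mean() - scores.std(ddof=1)

# Absolute standard: defined in terms of how much examinees must know,
# e.g. the mean of judges' Angoff-style estimates of the score a borderline
# examinee would achieve (the judgments are invented for illustration).
angoff_judgments = np.array([58, 62, 60, 64, 59])
absolute_cut = angoff_judgments.mean()

print(f"Relative standard: {relative_cut:.1f} -> {np.mean(scores < relative_cut):.0%} of the group fails")
print(f"Absolute standard: {absolute_cut:.1f} -> {np.mean(scores < absolute_cut):.0%} of the group fails")
```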
