Chapter 35

Concepts in assessment



Even a cursory review of the assessment literature reveals a bewildering array of dichotomies and concepts. These are often overlapping, and authors sometimes use them with less precision than is desirable, especially for the clinical teacher attempting to make sense of assessment for the first time. The purpose of this chapter is to identify some of these concepts and provide background to their meaning, development and use.



Test theories


Test theories or psychometric models seek to explain what happens when an individual takes a test (Crocker & Algina 1986). They provide guidance about how to select items, how long tests need to be, the inferences that can be drawn from scores and the confidence that can be placed in the final results. Each model makes different assumptions, and, based on those assumptions, different benefits accrue. There are many psychometric models, but three deserve attention here because they are used frequently in medical education.



Classical test theory (CTT)


CTT has been the dominant model in testing for decades, with roots in the late 19th and early 20th century (Lord & Novick 1968). It assumes that an individual’s score on a test has two parts: true score (or what is intended to be measured) and error. To apply CTT to a practical testing situation, a series of very restrictive assumptions need to be made. The bad news is that these assumptions are often violated in practice, but the good news is that even when this happens, it seldom makes a practical difference (i.e. the model is robust with respect to violations of the assumptions).
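In symbols (standard textbook notation, not specific to any one source), the model and the definition of reliability it supports can be written as:

    X = T + E        ρ = σ²_T / σ²_X = σ²_T / (σ²_T + σ²_E)

where X is the observed score, T the true score and E the error, with E assumed to average zero across repeated testing and to be uncorrelated with T. Reliability ρ is then the proportion of observed-score variance attributable to true scores.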


A number of useful concepts and tools have been developed based on CTT (De Champlain 2010). Among the most powerful is reliability, which indicates the amount of error in observed scores. Also very useful has been the development of item statistics, which help with the process of test development. CTT has contributed significantly to the development of a series of excellent assessments, and it continues to be used and useful today.
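As an illustration, the sketch below computes two classical item statistics (difficulty and item-total discrimination) and Cronbach's alpha, a common CTT reliability estimate. The response matrix is invented for illustration, and only numpy is assumed.

```python
import numpy as np

# Rows are examinees, columns are items; 1 = correct, 0 = incorrect.
# Invented toy data for illustration only.
responses = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
])

total = responses.sum(axis=1)

# Item difficulty (p-value): proportion answering each item correctly.
difficulty = responses.mean(axis=0)

# Item discrimination: correlation of each item with the total score
# (a simple item-total correlation; operational programs often use a
# corrected version that removes the item from the total first).
discrimination = np.array([
    np.corrcoef(responses[:, j], total)[0, 1]
    for j in range(responses.shape[1])
])

# Cronbach's alpha, a common CTT reliability estimate.
k = responses.shape[1]
item_var = responses.var(axis=0, ddof=1).sum()
total_var = total.var(ddof=1)
alpha = (k / (k - 1)) * (1 - item_var / total_var)

print("difficulty:", difficulty)
print("discrimination:", discrimination)
print("alpha:", round(alpha, 3))
```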



Generalizability theory (GT)


Although it has its roots in the middle of the 20th century, GT rose to prominence with the publication of a book by Cronbach et al in 1972. Like CTT (which is a special case of GT), it assumes that an individual’s score on a test has two parts: true score and error. However, compared to CTT, GT makes relatively weak assumptions. Consequently, it applies to a very broad range of assessments and, like CTT, even when these assumptions are violated it usually makes little practical difference.


GT offers a number of advantages over CTT (Brennan 2001). For example, GT allows the error in a test to be divided among different sources. So in a rating situation, GT allows the error associated with the rater to be separated from the error associated with the rating form they are filling out. Likewise, GT supports a distinction between scores that are intended to rank individuals as opposed to scores that are intended to represent how much they know. Given these advantages, GT has special applicability to the types of assessment situations found in medical education.
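A minimal sketch of the idea, assuming a fully crossed persons × raters design with invented ratings: the classic two-way ANOVA decomposition yields separate variance-component estimates for trainees, raters and residual error, plus a generalizability coefficient for a given number of raters. Operational G studies use dedicated software and richer designs.

```python
import numpy as np

# Ratings: rows = persons (trainees), columns = raters.
# Fully crossed persons x raters design; invented toy data.
ratings = np.array([
    [7, 6, 8],
    [5, 4, 5],
    [9, 8, 9],
    [6, 5, 7],
], dtype=float)

n_p, n_r = ratings.shape
grand = ratings.mean()
person_means = ratings.mean(axis=1)
rater_means = ratings.mean(axis=0)

# Two-way ANOVA sums of squares (one observation per cell).
ss_p = n_r * ((person_means - grand) ** 2).sum()
ss_r = n_p * ((rater_means - grand) ** 2).sum()
ss_tot = ((ratings - grand) ** 2).sum()
ss_res = ss_tot - ss_p - ss_r

ms_p = ss_p / (n_p - 1)
ms_r = ss_r / (n_r - 1)
ms_res = ss_res / ((n_p - 1) * (n_r - 1))

# Variance component estimates (expected mean square equations).
var_res = ms_res               # person x rater interaction + error
var_p = (ms_p - ms_res) / n_r  # persons (universe-score variance)
var_r = (ms_r - ms_res) / n_p  # rater severity differences

# Generalizability coefficient for relative decisions with n_r raters.
g_coef = var_p / (var_p + var_res / n_r)

print(f"persons: {var_p:.3f}  raters: {var_r:.3f}  residual: {var_res:.3f}")
print(f"G coefficient ({n_r} raters): {g_coef:.3f}")
```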



Item response theory (IRT)


Interest in IRT took hold in the 1970s, and its use has since grown substantially, especially among national testing agencies (Hambleton et al 1991). Unlike GT, IRT makes very strong assumptions about items, tests and individuals. These assumptions are difficult to meet, so there are a number of different IRT models, each with assumptions that are suitable for particular assessment situations.


When the assumptions are met, many practical benefits accrue (Downing 2003). For example, individual scores are independent of exactly which set of items is taken, and item statistics are independent of the individuals who take the test. So individuals can take completely different test questions, but their scores will still be comparable. As another example, IRT supports the creation of tests that are targeted to a particular score, often the pass–fail point. This permits a shorter test than would otherwise be the case.
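The sketch below illustrates one widely used IRT model, the two-parameter logistic (2PL); the item parameters here are invented. Because the probability of success is modelled as a function of ability, scores can be placed on a common scale even when examinees answer different items.

```python
import numpy as np

def p_correct(theta, a, b):
    """Two-parameter logistic (2PL) item response function:
    probability that an examinee of ability theta answers correctly,
    given item discrimination a and difficulty b."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# Two hypothetical items: an easy, less discriminating item and a
# harder, more discriminating one (parameter values invented).
items = [(0.8, -1.0), (1.6, 0.5)]

for theta in (-1.0, 0.0, 1.0):
    probs = [p_correct(theta, a, b) for a, b in items]
    print(f"theta={theta:+.1f}: " + "  ".join(f"{p:.2f}" for p in probs))
```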




Score interpretation


A score is a letter or number that reflects how well an individual performs on an assessment. When a test is being developed, one of the first decisions to be made is how the scores will be interpreted: norm-referenced or criterion-referenced (Glaser 1963). This decision has implications for how the items or cases are chosen, what the scores mean when they are used by students, teachers and institutions, and how the reliability or reproducibility of scores is conceived (Popham & Husek 1969).



Norm-referenced score interpretation


When scores are interpreted from a norm-referenced perspective, they are intended to provide information about how individuals perform compared with a group of test takers. For example, saying that a student’s performance was one standard deviation above the mean means that he or she did better than 84% of those who took the test. It says nothing about how many questions the student answered correctly.
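A small worked example of that arithmetic, assuming roughly normally distributed scores (the mean, standard deviation and raw score are invented):

```python
from statistics import NormalDist

# A raw score one standard deviation above the mean of the test-taker
# group (numbers invented for illustration).
mean, sd, raw = 60.0, 8.0, 68.0

z = (raw - mean) / sd
percentile = NormalDist().cdf(z)  # assumes roughly normal scores

print(f"z = {z:.1f}, better than about {percentile:.0%} of test takers")
# -> z = 1.0, better than about 84% of test takers
```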


Norm-referenced score interpretation is especially useful in situations where there are a limited number of positions and the need to select the best (or most appropriate) of the test takers. For example, in admissions decisions, there are often a limited number of seats and the goal is to pick the best of the applicants. Norm-referenced score interpretation aids in these types of decisions. However, it is not useful when the goal is to identify how much each individual knows or can do.



Criterion-referenced score interpretation


When scores are interpreted from a criterion-referenced perspective (sometimes called domain-referenced) they are intended to provide information about how much an individual knows or can do given the domain of the assessment. For example, saying that a student got 70% of the items right on a test means that he or she knows 70% of what is needed. It says nothing about how the student performed in comparison to others.


Criterion-referenced score interpretation is particularly useful in competency testing. For example, an assessment designed to provide feedback intended to improve performance should yield scores interpreted from a criterion-referenced perspective. Likewise, an end-of-course assessment should produce scores that indicate how much material students have learned. However, criterion-referenced score interpretation is not useful when the goal is to rank students.


One common variation on criterion-referenced score interpretation is called a mastery test. In a mastery test, a binary score (usually pass or fail) connotes whether the individual’s performance demonstrates sufficient command of the material for a particular purpose.



Score equivalence


When an assessment is given, there are many instances when it is desirable to compare scores among trainees, against a common pass–fail point and/or over time. Clearly, if all the trainees take exactly the same questions or encounter exactly the same patients, it is possible to compare scores and make equivalent pass–fail decisions. Some methods, like multiple-choice questions (MCQs) and standardized patients (SPs), were created in part to ensure that all trainees would face exactly the same challenges and their scores would mean exactly the same thing.


There are also a variety of important assessment situations when scores are not equivalent but where adjustments can be made. For example, in an MCQ or SP examination, the test items or cases are often changed over time and, despite maintaining similar content, these versions or forms of the same assessment differ in difficulty. With these methods of assessment, the issue can be addressed through test equating (Kolen & Brennan 1995, van der Linden & Hambleton 1997). Equating is a set of procedures, designs and statistics that are used to adjust scores so it is as if everyone took the same test. Although this provides a way to adjust scores, it is complicated, time intensive, and labour intensive. Consequently, it is used often in national assessments and less frequently for locally developed tests.
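As a simple illustration of the idea, the sketch below applies linear (mean-sigma) equating, one of the simplest equating procedures, to map scores from a harder Form B onto the Form A scale. The score data and the randomly-equivalent-groups design are assumed for illustration; operational equating typically uses more sophisticated designs and methods.

```python
import numpy as np

# Scores from two forms of the same test, taken by randomly equivalent
# groups (a common equating design); data invented for illustration.
form_a = np.array([55, 60, 62, 58, 65, 70, 59, 61], dtype=float)
form_b = np.array([50, 54, 57, 52, 60, 64, 53, 56], dtype=float)  # harder

# Linear (mean-sigma) equating: map Form B scores onto the Form A scale
# so that the two distributions have the same mean and spread.
slope = form_a.std(ddof=1) / form_b.std(ddof=1)
intercept = form_a.mean() - slope * form_b.mean()

def equate_b_to_a(score_b):
    return slope * score_b + intercept

print(f"A Form B score of 55 is equivalent to "
      f"{equate_b_to_a(55):.1f} on Form A")
```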


There are also a variety of important assessment situations where scores are not equivalent and where good adjustment is not practical or possible. For example, virtually all methods of assessment based on observing trainees’ encounters with real patients yield scores (or ratings) that are not equivalent, because patients vary in the level of challenge they present and observers differ in how hard they grade. Attempts to minimize these unwanted influences on scores usually take the form of wide sampling of patients and faculty (in the hope of balancing out the easy and difficult patients), training of observers, and certain IRT-based methods that statistically minimize some of the differences among observers (Linacre 1989). None of these is wholly satisfactory, however, and although these types of assessments are essential to the training and credentialing of doctors, the results must be interpreted with some caution when used for summative purposes. They are well suited to formative assessment.
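To make the statistical idea concrete, the sketch below applies a deliberately naive severity adjustment, centring each rater’s ratings on the overall mean. The data are invented, and this is not the many-facet Rasch approach of Linacre (1989), which estimates trainee ability and rater severity jointly on a common scale; it merely shows the kind of difference such methods try to remove.

```python
import numpy as np

# Ratings of trainees by different observers; NaN = not observed by
# that rater (invented data; real designs are rarely fully crossed).
ratings = np.array([
    [7.0, np.nan, 8.0],
    [5.0, 4.0, np.nan],
    [np.nan, 6.0, 8.0],
])

# Naive severity adjustment: shift each rater's column so its mean
# matches the overall mean, removing average leniency/harshness.
overall = np.nanmean(ratings)
rater_means = np.nanmean(ratings, axis=0)
adjusted = ratings - rater_means + overall

print(np.round(adjusted, 2))
```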



Standards


It is sometimes important to categorize performance on a test, usually as pass or fail (although there are times when more than these two categories are needed). The score that separates those who pass from those who fail is called the standard or pass–fail point. It is an answer to the question, ‘How much is enough?’ There are two types of standards: relative and absolute (Norcini 2003).



Relative standards


For relative standards, the pass–fail point is chosen to separate individuals based on how well they performed compared to each other. For example, a cutting score might be selected to pass the top 80% of students (i.e. the 80% of students with the best scores).
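A minimal sketch of that computation, with invented cohort scores: passing the top 80% amounts to placing the cut score at the 20th percentile of the observed score distribution.

```python
import numpy as np

# Exam scores for a cohort (invented). A relative standard passing the
# top 80% places the cut at the 20th percentile of observed scores.
scores = np.array([48, 52, 55, 58, 60, 61, 63, 65, 68, 72])

cut = np.percentile(scores, 20)
passers = scores >= cut

print(f"cut score: {cut:.1f}, pass rate: {passers.mean():.0%}")
```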


Relative standards are most appropriate in settings where a group of a particular size needs to be selected. For instance, in an admissions setting where the number of seats is limited and the purpose is to pick the best students, relative standards make the most sense. Relative standards are much less appropriate for assessments where the intention is to determine competence (i.e. whether a student knows enough for a particular purpose).

