Standard setting

J. Norcini and D.W. McKinley

Introduction

In medical education, it is common to need to identify knowledge or performance that is ‘just good enough’ for a particular purpose. One example is a pass/fail multiple-choice examination, where a single score is chosen as the cutoff. Passing examinees achieve that cutoff score or higher, while failing examinees do not. Passing implies sufficient knowledge or skill given the purpose of the test, while failing implies insufficient knowledge or skill. Standard setting is the process of defining the level of knowledge and skill that indicates proficiency and identifying the score on the examination that corresponds to it.

Unlike many medical tests, educational assessments only rarely have a gold standard against which to establish the validity of a cutoff score. The nature of a ‘competent’ physician or ‘unsatisfactory’ medical student varies over time, place and many other factors. Consequently, standards on educational tests are the expression of judgement in the context of a particular assessment, its purpose and the wider social-professional environment.

Because standards are based on judgement, methods for selecting them are not intended to discover an underlying truth. Instead, they are a means for gathering a variety of perspectives, blending them together and expressing them as a single score on a particular assessment. Consequently, the methods do not differ in the correctness of the standards they yield, but in their credibility and defensibility. This chapter describes the types of standards, specifies the important characteristics of the standard setters and the methods, reviews some of the common methods for setting standards and provides a framework for evaluating their credibility (Norcini & Shea 1997, Norcini 2003, Norcini & Guille 2002).

Types of standards

There are two types of standards:

Relative standards are expressed in terms of the performance of a group of examinees. For instance, a relative standard may be that the 120 examinees with the highest scores are admitted to medical school. This type of standard is appropriate for assessments intended to select a certain number or percentage of examinees, such as tests for admissions or placement.

Absolute standards are expressed in terms of the performance of examinees against the test material. For instance, a passing score may be that any examinee who correctly answers 75% or more of the questions knows enough anatomy to pass. This type of standard is appropriate for assessments intended to determine whether examinees have the necessary knowledge or skills for a particular purpose, such as course completion or graduation from medical school.
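The difference between the two types can be made concrete with a minimal sketch. The function names, the example cohorts and the 40-item test are assumptions made for illustration; they do not come from the chapter.

```python
def relative_cutoff(scores, n_selected):
    """Relative standard: the lowest score among the n_selected top scorers.

    Ties at the cutoff would admit more than n_selected examinees; how to
    break them is a policy decision that this sketch does not address.
    """
    if not 0 < n_selected <= len(scores):
        raise ValueError("n_selected must be between 1 and len(scores)")
    return sorted(scores, reverse=True)[n_selected - 1]


def absolute_cutoff(n_items, percent_correct):
    """Absolute standard: fewest correct answers that reach percent_correct.

    percent_correct is a whole-number percentage. Integer ceiling division
    avoids floating-point rounding surprises.
    """
    return -(-n_items * percent_correct // 100)


# Hypothetical cohorts: the relative cutoff moves with the group...
strong_cohort = [88, 92, 75, 60, 81, 79]
weak_cohort = [70, 65, 58, 50, 49, 40]
print(relative_cutoff(strong_cohort, 3))  # 81
print(relative_cutoff(weak_cohort, 3))    # 58

# ...while the absolute cutoff depends only on the test (here 40 items, 75%).
print(absolute_cutoff(40, 75))            # 30
```

With a relative standard the same three places are filled whatever the cohort's ability, whereas the absolute cutoff stays at 30 items correct regardless of who sits the test.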

Important characteristics of the standard setters and standard setting methods

The characteristics of the standard setters are likely to have the biggest impact on the credibility of a standard. The standard setters must understand the purpose of the test and the reason for establishing the cut score, know the content and be familiar with the examinees. In a low-stakes setting like a course, a single faculty member is credible, but standards will vary over time, and he or she has a conflict of interest in being both the teacher and assessor. In a high-stakes setting like licensure, a significant number of standard setters need to be involved because this increases the reproducibility of standards and reduces the effects of ‘hawks’ and ‘doves’. Ideally, the group would be free of conflicts of interest, include a mix of educators and practitioners and be balanced with regard to gender, race, geography and the like.

The specific method chosen to set standards is not as important as whether it produces results that are fit for the purpose of the test, relies on informed expert judgement, demonstrates due diligence, is supported by a body of research and is easy to explain and implement.

Fit for purpose

The method must produce standards that are consistent with the purpose of the assessment. Methods that yield relative standards should be used when the purpose is to select a specific number of examinees. Methods that yield absolute standards should be used when the purpose is to judge competence.

Based on informed judgement

Methods for setting standards can be based entirely on empirical results (e.g. consequences, performance on criteria), entirely on expert judgement or on a blend of the two. In medical education it is rarely possible to base a standard entirely on empirical results. The exceptions are a few admissions testing situations, where outcome data (such as successful completion of a course) are available and relative standards are being used.

Instead, most of the methods allow a standard to be based solely on the judgement of experts, without reference to performance data (e.g. the difficulty of the questions, the pass rate). Moreover, standard setters sometimes become uncomfortable when data are presented, believing that the data ‘bias’ their judgements.

In fact, methods for setting standards are not intended to discover an essential truth but to create a credible standard out of the judgements of experts. Such credibility derives from decisions that are based on all of the available information. Consequently, methods that permit and encourage expert judgement in the presence of performance data are preferable.

Demonstrates due diligence

Methods that require the standard setters to expend thoughtful effort demonstrate due diligence, and this lends credibility to the final result. Methods that require only quick, global judgements are less credible, while methods that demand several days of effort go beyond what is needed.

Supported by research

Methods supported by a research literature will produce more credible results. Ideally, studies should show that standards are reasonable compared to those produced by other methods, reproducible over groups of judges, insensitive to potentially biasing effects and sensitive to differences in test difficulty and content.

Easily explained and implemented

Credibility is enhanced if the method is easy to explain and implement. This decreases the amount of training required for the judges, increases the likelihood of their compliance and consistency and assures examinees that they are being treated fairly.

Methods for setting standards

There is a host of methods for setting standards, and many have variations. Reviews and descriptions are available elsewhere (Berk 1986, Cusimano 1996), but according to Livingston and Zieky (1982) they fall into four categories:

• relative methods

• absolute methods based on judgements about test questions (test-centred)

• absolute methods based on judgements about individual examinees (examinee-centred)

• compromise methods.

All of the methods require that several standard setters be selected and that they meet as a group. As the name implies, relative methods produce relative standards, so judgements are made about what proportion of the examinees should pass. The two groups of methods for setting absolute standards differ in the type of judgement collected. In the examinee-centred group, the standard setters consider whether individual examinees should pass, and these judgements are aggregated to derive the cutoff score. In the other group, the standard setters consider individual test questions, and these judgements are combined to calculate the cutoff score. The compromise methods require judgements about both what proportion of the examinees should pass and what score they need to achieve to do so. The final result is a compromise between these two types of judgement.

Relative methods

In the fixed-percentage method, each standard setter announces what percentage (or number) of examinees is qualified to pass. Their judgements are recorded for all to see, and they then engage in a discussion, often led off by those with the highest and lowest estimates. All are free to change their estimates, and when the discussion is over the estimates are averaged. The standard is the score that passes the average percentage (or number) of examinees.
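The final calculation in the fixed-percentage method can be sketched as follows. The function name, the judges' final estimates and the examinee scores are hypothetical, and rounding the pass count to a whole examinee is an assumption of this sketch, not a rule stated in the chapter.

```python
from statistics import mean


def fixed_percentage_cutoff(judge_percentages, scores):
    """Return (average percentage, cutoff score) for a fixed-percentage standard."""
    if any(not 0 <= pct <= 100 for pct in judge_percentages):
        raise ValueError("each estimate must be a percentage between 0 and 100")
    avg_pct = mean(judge_percentages)
    # Number of examinees that the averaged judgement would pass.
    n_pass = round(len(scores) * avg_pct / 100)
    if n_pass < 1:
        raise ValueError("the averaged judgement passes no examinees")
    ranked = sorted(scores, reverse=True)
    # The cutoff is the score of the last examinee who still passes.
    return avg_pct, ranked[n_pass - 1]


# Hypothetical final estimates from five standard setters, after discussion.
final_estimates = [80, 70, 75, 85, 90]
# Hypothetical scores of ten examinees.
exam_scores = [55, 62, 48, 71, 66, 59, 80, 45, 68, 74]

avg_pct, cutoff = fixed_percentage_cutoff(final_estimates, exam_scores)
# The estimates average 80%, so the 8 highest of 10 scores pass: cutoff is 55.
print(avg_pct, cutoff)
```

Tied scores at the cutoff would pass slightly more examinees than the agreed percentage; a real implementation would need an explicit rule for that case.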

In the reference group method, the process is exactly the same except that the standard setters have a particular group of examinees in mind (e.g. graduates of a certain set of schools or examinees with specific educational experiences). The selection of this reference group is based on the fact that the standard setters are most familiar with them and able to make good judgements about them. The cutting score established for this reference group is applied without modification to all other examinees.
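A minimal sketch of the reference group variant is shown below: the cutoff is derived from the reference group alone and then applied unchanged to everyone else. The function names, the agreed percentage and all scores are hypothetical.

```python
def cutoff_from_reference_group(agreed_percent, reference_scores):
    """Score that passes agreed_percent of the reference group."""
    ranked = sorted(reference_scores, reverse=True)
    # Rounding to a whole examinee is an assumption of this sketch.
    n_pass = round(len(ranked) * agreed_percent / 100)
    if not 1 <= n_pass <= len(ranked):
        raise ValueError("agreed_percent must pass at least one examinee")
    return ranked[n_pass - 1]


def pass_decisions(cutoff, scores):
    """Apply the same cutoff, without modification, to any group of scores."""
    return [score >= cutoff for score in scores]


# Hypothetical reference group the standard setters know well.
reference_scores = [72, 65, 58, 80, 69]
# Hypothetical examinees from outside the reference group.
other_scores = [50, 66, 65, 90]

cutoff = cutoff_from_reference_group(80, reference_scores)
decisions = pass_decisions(cutoff, other_scores)
print(cutoff)     # 65
print(decisions)  # [False, True, True, True]
```

The pass rate among the other examinees need not match the percentage agreed for the reference group, because only the score is carried over.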

These methods are quick and easy to use, need to be repeated only occasionally, call for judgements that the standard setters are comfortable making and apply equally well to all test formats. However, the resulting standards vary over time with the ability of the examinees, and they are independent of how much examinees know and of the content of the test.

Absolute methods based on judgements about test questions (test-centred)

The two most popular methods in this category have been proposed by Angoff and Ebel. Both methods require that the standard setters specify the characteristics of a borderline group of examinees. The borderline group excludes examinees who would clearly pass or fail and is composed of those about whom the standard setters are uncertain.

In Angoff’s method, the standard setters estimate the proportion of the borderline group that would respond correctly to an item. The estimates are discussed, with all standard setters free to change them, and the process is repeated for every item on the test. To calculate the standard, the estimates for each item are averaged and the averages are summed (see Table 36.1). Often, as a ‘reality check’, examinee performance is provided as well. In this example, the percentage of all examinees choosing the correct option (p value) is provided.

Table 36.1 Application of Angoff’s method to an eight-item test

The meeting of five standard setters begins with a discussion of the characteristics of a borderline group of students. When the standard setters reach consensus, they turn to a consideration of the first item. The standard setters each estimate aloud what proportion of the hypothetical borderline group would respond correctly to the question. Their estimates are written on a board for all to see and a discussion ensues, led by the standard setters with the highest and lowest estimates. All standard setters are free to change their estimates. The standard setters proceed in this manner through all of the items on the test. The cut score is taken as the sum of the standard setters’ mean estimates for each question.
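The calculation described above (average the estimates for each item, then sum the averages) can be sketched as follows. The ratings are invented for illustration and are not the values in Table 36.1; the function name is likewise hypothetical.

```python
from statistics import mean


def angoff_cut_score(estimates_by_item):
    """Sum over items of the mean estimated proportion correct.

    estimates_by_item holds one list per item, containing each standard
    setter's final estimate of the proportion of the borderline group
    that would answer that item correctly.
    """
    for item_estimates in estimates_by_item:
        if not item_estimates:
            raise ValueError("every item needs at least one estimate")
        if any(not 0.0 <= e <= 1.0 for e in item_estimates):
            raise ValueError("estimates must be proportions between 0 and 1")
    return sum(mean(item_estimates) for item_estimates in estimates_by_item)


# Hypothetical final estimates: eight items (rows), five standard setters (columns).
ratings = [
    [0.80, 0.70, 0.75, 0.85, 0.90],
    [0.60, 0.50, 0.55, 0.65, 0.70],
    [0.40, 0.50, 0.45, 0.35, 0.30],
    [0.90, 0.90, 0.85, 0.95, 0.90],
    [0.50, 0.60, 0.55, 0.45, 0.40],
    [0.70, 0.70, 0.65, 0.75, 0.70],
    [0.30, 0.40, 0.35, 0.25, 0.20],
    [0.60, 0.70, 0.65, 0.55, 0.50],
]

cut_score = angoff_cut_score(ratings)
# Item means are 0.8, 0.6, 0.4, 0.9, 0.5, 0.7, 0.3 and 0.6, which sum to
# about 4.8 of 8 items (60% correct).
print(round(cut_score, 2), round(100 * cut_score / len(ratings)))
```

Because the result is an expected number of items correct, it is usually not a whole number; whether to round it up or down before applying it is a separate decision that this sketch leaves open.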

Tags: A Practical Guide for Medical Teachers

Dec 9, 2016 | Posted by admin in GENERAL & FAMILY MEDICINE | Comments Off
