Introduction
Over the past few decades, the focus on professionalism has increased dramatically throughout the continuum of medical education. It is now considered an essential component of competence in a variety of countries including Canada (CanMEDS roles), India (Medical Council of India Regulations on Graduate Medical Education, 2012), the United Kingdom (Good Medical Practice), and the United States (ACGME competencies).1–4 The emphasis on competencies such as professionalism reflects a shift to a model that starts with the desired educational outcomes and works backward to define the educational process.
Assessment is central to outcomes-based education. It constitutes the means by which stakeholders are assured that learners have achieved the competencies necessary to meet the needs of the community. For students, it offers guidance regarding milestones in their development.5 In addition, there is a growing appreciation of the critical role of formative assessment and feedback in both learning and identity formation.6 This chapter will address the assessment of professionalism by (1) outlining the challenges, (2) citing reasons for assessing it, (3) using Miller’s pyramid as a framework for describing some of the methods of assessing professionalism and the research that supports them, and (4) suggesting some principles for developing an assessment system for professionalism. We conclude with brief consideration of lessons learned and future directions.
At the outset we note that many chapters in this book are focused on professional identity formation, rather than professionalism. Identity formation as a process has been studied for decades.7,8 Recently, the literature has explicitly addressed the question of how the process of medical education supports and enhances professional and clinical identity formation.9–11 We see professionalism and professional identity formation as two largely overlapping bodies of work. Because our task in this chapter is to address assessment, and there is a rich literature in the area, we chose that body of work for our focus. The goals and methods of assessment generally align with how students perform – as yet they are not well developed (and perhaps they should not be?) for judging what a person is. Nevertheless, many of the tools and processes we review, and the ideas behind them, are readily adaptable to the concerns underlying professional identity formation, such as reflections, attitudes, and behaviors.
Challenges in assessing professionalism and professional identity
The assessment of professionalism is challenging for a number of reasons, but four are particularly salient.12 First, as noted elsewhere in this book, professionalism can be difficult to define. The lack of a single, well-accepted conceptualization and definition creates a significant challenge since good formative or summative assessment starts with a clear understanding of what is to be measured. Hodges et al.,13 in generating a consensus statement for the Ottawa Conference, identified three separate discourses for professionalism, each of which resulted in a somewhat different set of recommendations for its assessment. Clearly, some accommodations among these discourses must be made before developing an assessment system for professionalism.
Second, given a shared definition, actual opportunities to observe professionalism might be relatively rare. Certainly, some of them can be created in the form of test material for written exams or simulations. However, learners are on their best behavior under these conditions and, especially in an area like professionalism, social desirability will play a significant role. Of course, if examinees cannot behave well in circumstances when they know they are being assessed, there is little chance that they can behave well when unobserved. Regardless, creating and finding opportunities to make judgments about professionalism will be a significant challenge in the development of an assessment system.
Third, even when the definition is clear and relevant observations are made, assessors might operationalize the definition in different ways. They might not agree about whether particular behaviors signify a lapse in professionalism or whether they are appropriate for a specific point along the developmental course of a professional identity. In addition to the need for faculty development, this implies that it will be essential to involve multiple observers or assessors over multiple occasions for each learner.
Fourth, the current reconceptualization of professionalism within an identity formation framework offers a powerful way to support both the teaching and the learning of this competency. At its highest level, this identity is composed of values, attributes, and behaviors, and multiple identities are to be reinforced. This poses particular challenges for assessment and requires a system that draws on multiple assessment methods and is capable of capturing change over time. It also implies the need to develop new methods that offer different perspectives on this competency.
Reasons for assessing professionalism
There are at least four reasons to assess professionalism. First, it is widely accepted that assessment drives learning, and that including professionalism as part of high-stakes decisions signals a message of value. It motivates learners to prepare in order to be successful and it indicates to all stakeholders that professionalism is important.
Second, assessing professionalism supports the development of professional identity. Formative assessment, in particular, offers feedback intended to direct and catalyze learning. Properly done, this form of assessment will help students discover who they are and will offer guidance on how they can become who they want and need to be.
Third, it is an essential element of a quality improvement cycle. Assessment is needed at the start of such a cycle to recognize strengths and identify weaknesses that must be addressed as part of the planning phase. It is also needed at the end of the cycle to determine whether the educational interventions that have been implemented were successful.
Fourth, the assessment of professionalism is critical to the identification and remediation of learners whose performance may endanger patients; they have not yet incorporated into their professional identities the values and behaviors that support optimal patient care. This function is essential to protect patients and to improve the safety and quality of care rendered by healthcare systems. In doing this, assessment also establishes the accountability of the institution and the profession.
Methods of assessing professionalism
As with other competencies, there are a number of different ways of assessing professionalism and, like the assessment of other competencies, they all have strengths and weaknesses. Consequently, it is important to develop a system of assessment, composed of different methods, at different times, and for different contexts. The system needs to be responsive to the developmental needs of the learner, establish the accountability of the institution, and protect patients and the profession.
A useful way to think about the individual methods of assessment is through the lens of the traditional version of Miller’s pyramid.14 Miller developed the pyramid to classify methods of assessment based on what they require of the learner. The pyramid is composed of four levels: the lowest level is knows, followed upward by knows how, shows how, and does. Each level builds on the lower levels. For instance, learners need to know before they can know how, show how, and do. Similarly, learners must know, know how, and show how before they can do. The fact that Miller arranged the levels in a pyramid is not meant to imply that those methods at the top are better or more valued than those at the bottom; suitability depends on the purpose of assessment. If, for example, the purpose of the assessment is to ensure knowledge, methods higher than knows on the pyramid will generally be less appropriate and less efficient.
Our use of the traditional pyramid is not inconsistent with the identity formation perspective of this book. We do recognize that the focus of professional identity is on is rather than any of Miller’s four levels. In fact, in a recent paper,15 the editors of this volume argued that Miller’s pyramid should be modified to have a fifth level of is. We agree that this adaptation might be beneficial to education-focused endeavors. However, when the lens is on assessment, identity is a composite of attitudes, values, and behaviors that can best be assessed in the context of a system composed of several different methods consistent with Miller’s original pyramid.
Knows
Assessing knowledge is a common endeavor at all levels of professional education. Indeed, the bulk of assessment that we have done historically fits squarely into the knows level of the pyramid. Traditional assessments of knowledge focused on recall and recognition of basic science facts and topics such as abnormal and normal physiology and pathology. In the 1980s, there was a strong movement to require higher cognitive skills of test-takers, including the ability to synthesize information and exercise appropriate judgment.16 In addition, the content of these assessments was expanded to include topics such as basic biostatistics, epidemiology, public health, professionalism, and ethics.
While there is widespread consensus that the bulk of the assessment of professionalism should focus on what a person does and his or her attitudes and values, a good case can be made that there is also a body of core knowledge a physician must possess.17 For example, many of the attributes of professionalism outlined by Veloski18 such as ethics, multiculturalism, and confidentiality of patient data, can efficiently be measured by knowledge assessments. Kao19 outlines the interplay between ethics, law, and professionalism, and discusses topics such as securing informed consent, protecting patient confidentiality, disclosing difficult information, and withholding or withdrawing care. In each of these areas, there are ethical principles that can be assessed; these principles are the foundation for required certifications for clinical investigators such as the widely subscribed Collaborative Institutional Training Initiative (CITI) program.20 In each area, there are also legal standards. Healthcare professionals should be aware of the standards, and such awareness is well-suited to assessment by traditional multiple-choice examinations. Naturally, there are also instances when the ethical principle does not align with the legal standard. In these cases, it is important for a provider to know what to do. Patient scenarios and standardized patients are well-suited to these types of assessments (though, admittedly, when the prompt is what would or should you do rather than what is the best thing to do, the former questions slide into the levels of knows how and shows how).
For any assessment of knowledge, it would be desirable to follow standard procedures for instrument development, including a blueprint that specifies the content of the examination, good item-development practices, and pilot testing. After administration, score reliability should be calculated and evidence for validity should be gathered following procedures such as those outlined by Nunnally and Bernstein21 and Streiner and Norman.22 Each of the examples below shows strengths in some of these areas. While none of the instruments are widely used yet, each has utility as a model in prescribed circumstances.
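To make the reliability step concrete, coefficient alpha (Cronbach’s alpha) is one widely used index of internal consistency for knowledge tests of this kind. The following is a minimal sketch in Python; the examinee-by-item response matrix and all numbers are hypothetical rather than drawn from any instrument discussed here.

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Coefficient alpha for an examinees-by-items matrix of item scores."""
    k = scores.shape[1]                          # number of items
    item_vars = scores.var(axis=0, ddof=1)       # sample variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of examinees' total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical data: 8 examinees x 5 items, scored 1 = correct, 0 = incorrect
responses = np.array([
    [1, 1, 1, 0, 1],
    [1, 0, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1],
    [1, 1, 0, 1, 1],
    [0, 0, 0, 0, 1],
    [1, 1, 1, 1, 0],
])
print(f"alpha = {cronbach_alpha(responses):.2f}")
```

Validity evidence, in contrast, cannot be computed from a response matrix alone; it accumulates from the blueprint, item review, and score relationships described above.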
Barry Challenges to Professionalism questionnaire
This tool is a survey comprising six patient-based multiple-choice items, each with four options.23 The item topics cover the domains of conflict of interest, confidentiality, physician impairment, sexual harassment, honesty, and acceptance of gifts. These topics were chosen because they represent common issues that learners at all levels might encounter; three were adapted from the American Board of Internal Medicine Project Professionalism.24 Candidate items were reviewed by a multidisciplinary panel of experts. Importantly, these panel members agreed that each of the six scenarios had a “best answer” and five of the six had a second “acceptable” answer.
In the initial tool development, participants were students and postgraduate trainees at a large medical center and a random sample of physicians in the state. Item-level responses to the questions were within a reasonable range, and scores varied in expected ways across different levels of experience. Thus, this tool might be used as a brief assessment of professionalism knowledge across multiple domains and multiple levels of learners.
Test of Residents’ Ethics Knowledge for Pediatrics (TREK-P)
The TREK-P is a relatively new tool that has promise as a means to assess ethical knowledge.25 Although it is specialty-specific and its content would thus need to be adapted for other specialties, the processes followed in its development might serve as a model for other domains. Content for the tool was derived from earlier work on ethical challenges, supplemented by the ACGME definition of professionalism. The initial list was cross-validated against statements endorsed by the American Academy of Pediatrics. The first version of the TREK-P consisted of thirty-six knowledge questions testing professionalism, adolescent medicine, genetic testing and diagnosis, neonatology, end-of-life decisions, and decision-making for minors. American Academy of Pediatrics guidelines were used to establish the correct response for each item. Each question was asked as a true-or-false statement, and items were developed and reviewed by experts in education and ethics.
The first version of the TREK-P was given to novices (first-year medical students), third-year pediatric residents (the population of interest), and experts (pediatric clinical ethicists).25 The investigators removed thirteen poorly performing items. The resulting twenty-three-item TREK-P had reasonable reliability. Importantly, scores improved with increasing experience: 65% for students, 83% for residents, and 96% for ethicists. Future use might focus on smaller sets of items that generalize across specialties or, conversely, on developing parallel forms across specialties.
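The published report does not specify the criteria used to identify the thirteen poorly performing items. A common classical approach, sketched below under that assumption, flags items by difficulty (proportion correct) and corrected item-total discrimination; the thresholds shown are conventional rules of thumb, not values from the TREK-P study.

```python
import numpy as np

def flag_weak_items(scores: np.ndarray, min_p=0.2, max_p=0.9, min_disc=0.2):
    """Flag items by classical difficulty and corrected item-total discrimination.

    scores: examinees-by-items matrix of 0/1 responses.
    """
    flagged = []
    for j in range(scores.shape[1]):
        item = scores[:, j]
        p = item.mean()                        # difficulty: proportion correct
        if item.std() == 0:                    # no variance: item is uninformative
            flagged.append((j, float(p), None))
            continue
        rest = scores.sum(axis=1) - item       # total score excluding item j
        disc = np.corrcoef(item, rest)[0, 1]   # corrected point-biserial correlation
        if not (min_p <= p <= max_p) or disc < min_disc:
            flagged.append((j, round(float(p), 2), round(float(disc), 2)))
    return flagged  # (item index, difficulty, discrimination) for review or removal
```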
Matriculating medical students’ knowledge of professionalism
A study by Blue et al.26 aimed to examine three classes of matriculating students’ knowledge and attitudes toward professionalism at two different medical schools. The investigators relied on the Swick27 definition of professionalism and targeted five dimensions believed to be important and appropriate for medical students: subordinating self-interest, ethics and moral values, humanistic values, accountability, and self-reflection. All dimensions are clearly also a part of professional identity. Specific to the current discussion, knowledge was assessed with two instruments: medical vignettes and multiple-choice questions.
Analysis of the data revealed that there were five factors underlying the scores that are consistent with definitions of professionalism and professional identity: (1) subordinating self-interest, (2) professional responsibility, (3) managing complexity and uncertainty, (4) professional commitment, and (5) humanism. The highest knowledge score was for the attribute of humanism, followed by professional responsibility, subordinating self-interest, managing complexity and uncertainty, and professional commitment.
Reflection/critical incident technique
The critical incident technique (CIT) is widely used in the education and assessment of healthcare providers. In general, CIT begins by asking participants to reflect on an incident – often related to professionalism or professional identity – that they observed or participated in, and to write a brief summary.28 For example, Niemi asked medical students to keep a learning log after each visit to a primary care health center during the first year of medical school. The purpose was to record thoughts, feelings, and emotions, and to provide “a forum to reflect and wonder about your experiences.” Some users of the CIT give more explicit instructions, such as directing students to use a prescribed format for guided reflection.29 Then, if aimed at professionalism, there is an analysis and often a group discussion of what tenets of professionalism were involved. When formally graded, this instrument provides an assessment of a student’s knowledge and understanding of professionalism.29 In this context, CIT fits well under knows.
Knows how
Knows how is the second level of Miller’s pyramid and comprises a heterogeneous group of methods, usually written, that attempt to predict how the learner will act in situations that reflect professional values and identities. Some of these methods focus on measuring the attitudes and values of the learner; thus, they are important parts of a system of assessment aimed at identity formation. A systematic review identified many such methods30 and the attitudes they assessed fell into several categories, including professionalism as a whole, ethical issues, personal values, physician-patient relationships, sociocultural issues, and interprofessional relationships. Some investigators have argued that the development of a profile of attitudes has the advantage of providing detailed information that supports the provision of feedback.31 Other investigators believe strongly that there is a need for a single global measure of attitudes that captures an overall viewpoint.32 In either instance, there are many different instruments with acceptable characteristics.
Another set of methods seeks to understand whether the learner can reason in a sophisticated fashion about ways to behave in particular situations. As such, these methods are often predicated on a developmental model, and they paint a scenario to which the examinee must respond. Some of the methods have grown out of the moral judgment literature. Included are the Moral Judgment Interview (MJI), a semi-structured interview with complex scoring; the Sociomoral Reflection Measure, a paper-and-pencil version of the MJI with simplified scoring; and the Defining Issues Test (DIT), an MCQ-based version of the MJI.33–35 Some of these instruments build on refinements of this tradition, such as models of the growth of epistemic cognition; others, such as the Situational Judgment Test, eschew the focus on development and come out of an industrial–organizational psychology tradition.
As a group, the knows how methods are of particular importance within the discourse of professionalism as an interpersonal process or effect.13 Professional behavior grows out of the interaction of individuals’ attitudes and problem-solving skills within particular learning and practice contexts. Consequently, great stress is placed on understanding and developing attitudes and thinking in a variety of situations. This can also be said for identity formation. However, within a few broad categories there are so many of these scales, and the differences among them are so minor, that future work should be focused on refining what is available rather than developing new assessment methods. Additional focus on qualitative methods, such as the MJI or those proposed by Rees and Knight,36 might also be of interest for supporting the formative parts of a system of assessment aimed at identity formation.
The Jefferson scales
Educators at Jefferson Medical College have created a series of instruments designed to assess attitudes toward empathy, teamwork, and lifelong learning.37–39 All of these constructs have been identified as important aspects of professionalism, and all of the instruments are similar in that they are composed of a series of questions for which the respondent supplies a rating.32
These instruments were developed in a similar and careful fashion. Starting from a definition and previous work in the area, a content outline and a pool of questions were developed. These went through an iterative process in which the questions were reviewed and edited by groups of faculty experts. This process resulted in the creation of one or more forms, some of which were tailored to specific populations. For example, one version of the empathy scale was designed especially for students, and another for healthcare professionals. These scales were field tested and refined further.
Extensive research has been done with the Jefferson scales, and they have performed well.32 Factor analyses have produced meaningful results, and measures of internal consistency have been high. More importantly, the scales have reasonable relationships with a variety of criterion measures. For example, a cross-cultural study showed that nurses and physicians from countries with a complementary model of professional roles had more positive attitudes toward collaboration.40
Moral Judgment Interview (MJI) and the Defining Issues Test (DIT)
In the 1950s, Kohlberg refined the work of Piaget and developed a six-stage model of the development of moral judgment.41,42 To assess development through these stages, he created the MJI.43 In this method, an interviewer presents the student with a number of moral dilemmas and the student is asked to resolve them. Each dilemma is followed by a series of open-ended questions that probe the reasoning of the student. The session is recorded and transcribed; the performances are assigned scores that are associated with the stages of moral development.
The DIT was developed by Rest41 and is similar to the MJI in that students are presented with moral dilemmas to which they must respond. Instead of answering open-ended questions, however, the students respond to a series of multiple-choice questions. Their answers are scored as the percentage of responses from each stage of development. Because of its MCQ format, the DIT can more easily be administered in a variety of settings.
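As a simplified illustration of this scoring logic, the sketch below tallies the percentage of a respondent’s selections attributable to each stage. Operational DIT scoring is more elaborate (options are rated and ranked, and the widely reported “P” score is derived from ranked stage 5 and 6 items), so the stage tags and data here are hypothetical.

```python
from collections import Counter

# Hypothetical stage tags for the options one respondent selected across dilemmas
selected_stages = [3, 4, 5, 5, 4, 6, 3, 5, 4, 5, 2, 5]

counts = Counter(selected_stages)
total = len(selected_stages)
for stage in sorted(counts):
    print(f"Stage {stage}: {100 * counts[stage] / total:.0f}% of responses")

# Simplified principled-reasoning ("P") score: share of stage 5 and 6 responses
p_score = 100 * (counts[5] + counts[6]) / total
print(f"P score (simplified): {p_score:.0f}")
```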
Baldwin and Self44 point out that there are hundreds of studies supporting the use of both the MJI and the DIT as valid and reliable measures of moral reasoning. They also indicate that these assessments are related to many of the attributes of professionalism. Consequently, they suggest that these measures can be a useful component of an evaluation system for professionalism. It is not a stretch to argue that these tools are also relevant to the assessment of professional identity.
Situational Judgment Test (SJT)
The SJT is predicated on the idea that there is more to doing a job well than simply knowing what to do. Regardless of profession, there is a need to interact well with other people, solve problems, work in teams, organize and plan, cope with pressure, and so on. The SJT offers realistic scenarios that are ideally based on “critical incidents,” typically presented in a paper-and-pencil or video format, and followed by a series of potential responses. Respondents are asked to select the best of these or rank them.
As reported by McDaniel et al.,45 assessments of situational judgment date back nearly a century. Through the late 1940s, they were criticized as measures of general intelligence and not unique assessments of social intelligence. Starting in the 1950s, they were used to assess the potential to supervise and to predict managerial success. They were studied extensively during this period; the data provided modest support for the method and it continues to be used and studied across a variety of occupations.
By the early 1990s, the SJT found its way into medicine. Of particular note is its use in admissions settings, where it is now being used as one basis for selection in a variety of countries (e.g., United Kingdom, Belgium, Canada, Australia). This is justified by specific studies indicating incremental validity over cognitive measures.46
Groningen Reflection Ability Scale
In a different type of assessment, Aukes et al.47 developed a scale to measure personal reflection ability, the Groningen Reflection Ability Scale (GRAS). The GRAS assesses skills related to self-reflection, empathetic reflection, and reflective communication. Review of the item content suggests this scale falls somewhere between knows and knows how; for example, “I can see an experience from different standpoints” is an example of the former, whereas “I test my own judgments against those of others” is an example of the latter.
Professional self-identity questionnaire
Consistent with the notion of identity formation, this questionnaire was developed to monitor how curricula contribute to identity development across a range of healthcare professions.48 Crossley and Vivekananda-Schmidt conducted a content analysis of curricula in several fields to develop the initial content for a questionnaire. The resulting nine-item form was tested in ten professional groups, refined, and then given to a group of student doctors. Those who had more experience in healthcare or social care roles had higher scores than their peers. In addition, the scores increased in conjunction with clinical experiences.
Shows how
The third level of the pyramid focuses on how the learner actually performs when being observed, and it incorporates a variety of simulation and workplace-based methods of assessment. Many of these are predicated on the fact that the encounter between the healthcare provider and the patient or another healthcare provider is one of the critical situations in which professionalism manifests itself. Both workplace-based assessments and simulations have strengths and weaknesses, and neither offers an assessment of all aspects of professionalism. In the following paragraphs, the relative strengths of each class of methods are reviewed, followed by discussion of a commonly used simulation method (standardized patients) and a commonly used workplace-based method (the mini-CEX).
In workplace-based assessment, learners are observed in real encounters with patients or colleagues. The observer makes judgments about the quality of the performance. In simulation, real patients or colleagues are replaced with realistic but artificial experiences. The person being assessed interacts with the recreations and judgments are made about his or her performance. The methods can be stratified by how faithful they are to reality, with some having very high fidelity (e.g., human-patient simulators, virtual reality, standardized patients).
There is a significant research literature indicating that, regardless of method, performance is case or patient specific. How one interacts with one patient or team is not necessarily related to interactions with the next patient or team. Consequently, good assessment requires broad sampling across different encounters. For simulation, this is usually accomplished by creating a test composed of several different encounters through which learners rotate. For workplace-based assessment, different encounters are observed over a period of time, often weeks or months.49,50 Simulation provides good fidelity and good content coverage and has the advantage of allowing the assessment of unusual circumstances. It also ensures that no harm comes to patients – this is especially relevant when learners are near the beginning of their training. Workplace methods have excellent fidelity and excellent content coverage. In addition, they allow difficult-to-simulate encounters to be included in assessment.
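Although the chapter does not present a formula, the standard psychometric rationale for broad sampling is the Spearman-Brown prophecy formula: if a single encounter yields score reliability r, a score averaged over k comparable encounters has projected reliability kr / (1 + (k - 1)r). A brief sketch with illustrative numbers:

```python
def spearman_brown(r_single: float, k: int) -> float:
    """Projected reliability of a score averaged over k comparable encounters."""
    return k * r_single / (1 + (k - 1) * r_single)

# If a single observed encounter has reliability 0.25 (hypothetical),
# broad sampling raises the reliability of the aggregate score:
for k in (1, 4, 8, 12):
    print(f"{k:2d} encounters -> projected reliability {spearman_brown(0.25, k):.2f}")
```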
For examinations in which it is important to compare learners, simulation has a distinct advantage. Different examinees can be given the same cases and assessors and, when different forms of the test must be administered (e.g., over time), statistical techniques can be used to adjust the scores and make it ‘as if’ all students took exactly the same test.50 In contrast, variability due to different encounters and assessors can be reduced to some degree in workplace-based methods, but it cannot be eliminated. In high-stakes settings, security is a concern for simulation but not for workplace methods.
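The statistical adjustment mentioned above is not tied to a particular technique in the chapter. One simple classical option is mean-sigma linear equating, which places scores from a new test form on the scale of a reference form by matching means and standard deviations; the sketch below assumes an equivalent-groups design, and all form statistics are hypothetical.

```python
def linear_equate(x: float, mean_new: float, sd_new: float,
                  mean_ref: float, sd_ref: float) -> float:
    """Mean-sigma linear equating: map score x from a new form to the reference scale."""
    return mean_ref + sd_ref * (x - mean_new) / sd_new

# Hypothetical statistics: the new form is slightly harder (lower mean),
# so a raw 70 on it corresponds to roughly 73.8 on the reference form.
print(linear_equate(70, mean_new=68.0, sd_new=9.0, mean_ref=72.0, sd_ref=8.0))
```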
Where there are significant resource constraints, workplace methods have an advantage. Faculty development is required and the logistics can sometimes be challenging but these methods can be feasibly implemented, even in the setting of relatively small training programs. Simulation, on the other hand, often requires access to considerable amounts of space, equipment, and personnel. In addition, the development of test material requires significant resources.49
Standardized patients (SPs)
SPs are actors who have been trained to play the role of a patient.51 They are given scripts and are expected to perform the same way each time they play the role (within the confines of the actions of the person being assessed). A string of encounters, often ten to twenty-five minutes each, is typically administered in round-robin format to as many examinees as there are SPs. A score is developed across all stations.52
SPs cannot assess all aspects of professionalism, but they can get at the attributes associated with it, as well as with professional identity. Van Zanten et al.53 mapped the competencies measured by an SP-based examination onto the definitions of professionalism offered by the American Board of Internal Medicine and the Medical School Objectives Project (MSOP). They found overlap in areas such as honor, integrity, abuse of power, respect for patients, and so on. Similarly, Klamen and Williams54 argued that the core professional values of compassion, responsibility, and integrity could all be demonstrated through communication skills, and assessment of these skills is a strength of SPs.
In testing a large group of international medical graduates, Van Zanten et al.53 found that SPs offered a reliable and valid means of assessing certain aspects of professionalism. Incorporating a much broader group of studies, Klamen and Williams reached a similar conclusion – SPs provide a valid and reliable measure of communication skills among a variety of different examinees (e.g., medical students, postgraduates, practicing doctors both individually and in teams).54
Mini-CEX and the Professionalism Mini-Evaluation Exercise (P-MEX)
The mini-CEX can be used to assess whether examinees behave in a professional manner.55 The trainee conducts a brief, observed clinical encounter in any one of a number of different clinical settings. At the end of the encounter, the observer completes a rating form and offers the trainee feedback and a plan for improvement. A number of different rating forms have been developed, but the original form asked for an assessment of humanistic qualities and professionalism on a nine-point scale.
The mini-CEX captures some information on professionalism but is not solely focused on it. Consequently, Cruess et al.56 developed a variation on the mini-CEX called the Professionalism Mini-Evaluation Exercise. At a workshop, the authors identified 142 observable behaviors that were reflective of professionalism. A subset of these was converted into a rating form that was designed for use in a variety of different settings. As in the mini-CEX, this instrument is used in the context of an observed clinical encounter.
There is extensive research concerning the mini-CEX, with several reviews of the literature having been published. For example, Kogan et al.57 concluded that, among the methods of direct observation, the mini-CEX had the strongest validity evidence. Likewise, Ansari and Donnon58 concluded that the validity of the mini-CEX is supported by small to large effect sizes. Using Kane’s framework for validity, Hawkins et al.59 found that the scoring component yielded the most concerns but that evidence for other aspects of validity (i.e., generalization and extrapolation) was supportive.
Although less work has been done with the P-MEX, that which has been done is supportive of the measure. Cruess et al.56 found evidence of construct and content validity as well as reasonable reliability with multiple encounters. A multi-center trial in Japan replicated these findings across cultures.60 And of course, the findings associated with the mini-CEX are relevant to the P-MEX as well.