
Designing and Using Computer Simulations in Medical Education and Training

SPECIAL ISSUE October 2013 Supplement to Military Medicine Volume 178, Number 10

Guest Editors

Harold F. O’Neil | Kevin Kunkler | Karl E. Friedl | Ray S. Perez

The National Center for Research on Evaluation, Standards, and Student Testing (CRESST) at the University of California, Los Angeles (UCLA) has extensive experience in research, development, program management, and evaluation. A core capability is an integrated view of learning, instruction, human performance assessment, and psychometrics. CRESST is a pioneer of ontology-based designs for domain knowledge, assessment, and 21st century skills. CRESST is creating such ontologies for game-based STEM (Science, Technology, Engineering, and Mathematics) learning interventions funded by DARPA. For the Telemedicine and Advanced Technology Research Center (TATRC), CRESST has completed a state-of-the-art assessment of the use of medical simulation in medical education and training, which is the topic of this special issue. With Office of Naval Research (ONR) funding, CRESST examined learning and assessment issues in Navy training (e.g., Surface Warfare Officers School simulation-based training).

The Department of the Navy's Office of Naval Research (ONR) provides the science and technology necessary to maintain the Navy's and Marine Corps' technological advantage. Through its affiliates, ONR is a leader in science and technology with engagement in 50 states, 70 countries, 1,035 institutions of higher learning, and 914 industry partners. ONR employs approximately 1,400 people, comprising uniformed, civilian, and contract personnel, with additional employees at the Naval Research Lab in Washington, D.C.

The Armed Forces Simulation Institute for Medicine (AFSIM) was established in 2010 as the laboratory activity for medical simulation at the Telemedicine and Advanced Technology Research Center (TATRC). AFSIM runs the TATRC Innovation Lab for prototype evaluation and conducts science activities in the field of medical education and medical simulation technology. AFSIM also has advisory groups to bring industry, academia, and tri-service input into medical simulation technology development.

Military Medicine International Journal of AMSUS

1891-2013 PUBLISHER VADM Mike Cowan, MC, USN (Ret), Executive Director

ASSOCIATE PUBLISHER CDR John Class, MSC, USN (Ret), Deputy Executive Director

EDITOR *CAPT William H.J. Haffner, M.D. USPHS (Ret)

ASSOCIATE EDITOR *CAPT Trueman W. Sharp, MC, USN

ASSISTANT EDITOR CAPT Melvin Lessing, USPHS (Ret)

JOURNAL ADMINISTRATOR Tonya Lira

ISSN 0026-4075 ADVERTISING REPRESENTATIVE SLACK Incorporated 6900 Grove Rd Thorofare, NJ 08086 (800) 257-8290 (856) 848-1000 Fax (856) 848-6091 National Account Manager – Kathy Huntley Recruitment Manager – Monique McLaughlin Administrator – Michele Lewandowski MILITARY MEDICINE is the official monthly journal of AMSUS - The Society of Federal Health Professionals. The objective of the Journal is to advance the knowledge of federal medicine by providing a forum for responsible discussion of common ideas and problems relevant to federal healthcare. Its mission is: healthcare education; to bring scientific and other information to its readers; to facilitate communication; and to offer a prestige publication for members’ writings. MILITARY MEDICINE is available on the Internet via the AMSUS website (www.amsus.org) to members and subscribers who purchase online access. MILITARY MEDICINE is indexed by the United States National Library of Medicine (NLM) and is included in the MEDLARS system, ISSN 0026-4075. MILITARY MEDICINE is also available online through Ingenta web site: www.ingentaconnect.com/content/amsus.

EDITORIAL BOARD RDML Michael S. Baker, MC, USNR (Ret) CAPT Philip Coyne, USPHS COL Robert A. De Lorenzo, MC, USA LTC Lawrence V. Fulton, USA (Ret) COL Joel C. Gaydos, MC, USA (Ret) Lt Col Anthony Gelish, USAF, MSC (Ret) Col James L. Greenstone, CAP Maureen N. Hood, PhD, RN RADM Joyce M. Johnson, USPHS (Ret) Col Joseph F. Molinari, USAFR, BSC (Ret) Frances M. Murphy, MD, MPH CDR Karen Near, USPHS (Ret.) RADM Carol A. Romano, USPHS (Ret) Col Laura Talbot, USAFR, NC (Ret)

INTERNATIONAL CONSULTANTS TO THE EDITORIAL BOARD Col Sergei Bankoul, Switzerland Col Marcel de Picciotto, MC, France

AMSUS BOARD OF MANAGERS Terms Expiring in 2013 Maj Gen Barbara Brannon, USAF, NC (Ret) MG David A. Rubenstein, MS, USA (Ret) CMSgt Charles R. Cole, USAF (Ret)

Terms Expiring in 2014 Col Ben P. Daughtry, USAF, MSC (Ret) Maj Gen Gar Graham, USAF, DC (Ret) MG Robert Kasulke, MC, USAR (Ret)

Terms Expiring in 2015 BG Michael J. Kussman, MC, USA (Ret) RADM William McDaniel, MC, USN (Ret) MG George Weightman, MC, USA (Ret) Col James Young, USAF, BSC (Ret)

REPRINTS for article reprints and Eprints, contact Tamara Smith at Sheridan Reprints, [emailprotected] ; Fax: (717) 633-8929 PHOTOCOPYING PERMISSION Prior to photocopying items for internal or personal use, the internal or personal use of specific clients, or for educational classroom use, please contact the Copyright Clearance Center, Customer Service, (978) 750-8400, 222 Rosewood Drive, Danvers, MA 01923 USA, or check CCC Online at the following address: www.copyright.com MEMBERSHIP INFORMATION – To update contact information or to discontinue delivery of Military Medicine please contact AMSUS at [emailprotected]. Please include your five-digit member number from the mailing label. SUBSCRIPTION INFORMATION – All subscription rates are listed on our website, www.amsus.org. Checks should be made payable to AMSUS. Publisher reserves the right to restrict subscribers to those in the healthcare field. The addresses of members and subscribers are not changed except upon request. Requests for change of address must reach the Association office 15 days before change is to be effective. Subscription rates are subject to change without notice. CLAIMS FOR MISSING ISSUES – Claims for missing issues should be sent to [emailprotected] within 3 months of the issue date. After the three-month period, payment of $30 is required to replace the issue. AMSUS - The Society of Federal Health Professionals Founded 1891, Incorporated by Act of Congress 1903 9320 Old Georgetown Road Bethesda, Maryland 20814-1653 Telephone: (301) 897-8800 or (800) 761-9320 FAX: (301) 530-5446 E-mail: [emailprotected] (journal); [emailprotected] (other) Copyright © AMSUS - The Society of Federal Health Professionals, 2013 Printed in U.S.A. • All rights reserved.

ex officio (non-voting) VADM Mike Cowan, MC, USN (Ret) - Secretary

LEGAL COUNSEL COL Herbert N. Harmon, USMCR (Ret.)

http://www.amsus.org

*Under a cooperative enterprise agreement between AMSUS and USUHS

MILITARY MEDICINE (ISSN 0026-4075) is published monthly by AMSUS - The Society of Federal Health Professionals, 9320 Old Georgetown Rd., Bethesda, MD 20814-1653. Periodicals postage paid at Bethesda, MD, and additional mailing offices. POSTMASTER: Send address changes to AMSUS, 9320 Old Georgetown Rd., Bethesda, MD 20814-1653.

MILITARY MEDICINE VOLUME 178

OCTOBER 2013

SUPPLEMENT

DESIGNING AND USING COMPUTER SIMULATIONS IN MEDICAL EDUCATION AND TRAINING

Guest Editors
Harold F. O'Neil, University of Southern California/CRESST
Kevin Kunkler, Telemedicine and Advanced Technology Research Center
Karl E. Friedl, Telemedicine and Advanced Technology Research Center
Ray S. Perez, Office of Naval Research

Foreword (page iv)
Ray S. Perez

Introduction
Designing and Using Computer Simulations in Medical Education and Training: An Introduction (page 1)
Karl E. Friedl and Harold F. O'Neil

Design Issues
Cognitive Task Analysis-Based Design and Authoring Software for Simulation Training (page 7)
Allen Munro and Richard E. Clark

Using Cognitive Task Analysis to Develop Simulation-Based Training for Medical Tasks (page 15)
Jan Cannon-Bowers, Clint Bowers, Renee Stout, Katrina Ricci, and Annette Hildabrand

Use of Cognitive Task Analysis to Guide the Development of Performance-Based Assessments for Intraoperative Decision Making (page 22)
Carla M. Pugh and Debra A. DaRosa

Balancing Physiology, Anatomy and Immersion: How Much Biological Fidelity Is Necessary in a Medical Simulation? (page 28)
Thomas B. Talbot

Cost Considerations in Using Simulations for Medical Training (page 37)
J. D. Fletcher and Alexander P. Wind

Assessment and Evaluation Issues
Assessment Methodology for Computer-Based Instructional Simulations (page 47)
Alan Koenig, Markus Iseli, Richard Wainess, and John J. Lee

Application of National Testing Standards to Simulation-Based Assessments of Clinical Palpation Skills (page 55)
Carla M. Pugh

Evaluation of Medical Simulations (page 64)
William L. Bewley and Harold F. O'Neil


Instructional Strategies Issues
Prevention of Surgical Skill Decay (page 76)
Ray S. Perez, Anna Skinner, Peter Weyhrauch, James Niehaus, Corinna Lathan, Steven D. Schwaitzberg, and Caroline G. L. Cao

Effects of Simulation-Based Practice on Focused Assessment With Sonography for Trauma (FAST) Window Identification, Acquisition, and Diagnosis (page 87)
Gregory K. W. K. Chung, Ruth G. Gyllenhammer, Eva L. Baker, and Eric Savitsky

Adaptive and Perceptual Learning Technologies in Medical Education and Training (page 98)
Philip J. Kellman

Psychometric Issues
Evidence-Centered Design for Simulation-Based Assessment (page 107)
Robert J. Mislevy

Potential Applications of Latent Variable Modeling for the Psychometrics of Medical Simulation (page 115)
Li Cai

Use of the Assessment–Diagnosis–Treatment–Outcomes Model to Improve Patient Care (page 121)
Kevin F. Spratt

Special Issue Editors and Acknowledgements (page 132)

MILITARY MEDICINE, 178, 10:iv, 2013

Foreword

The use of advanced education and training technologies has been common and widespread in the military for decades. Over two decades ago, the Department of Defense began large, continuing investments in modeling and simulation (M&S) programs to improve training effectiveness while reducing reliance on actual, costly hardware and software systems. The medical and health sciences community began the systematic use of M&S for the education and training of medical personnel relatively recently. The Defense Advanced Research Projects Agency (DARPA) and the Office of Naval Research (ONR) were among the first agencies to provide substantial government funding for this purpose. Industry quickly recognized its value, picked up on this initiative, and began to offer medical procedure simulators for specific training applications. The use of M&S in medical and health sciences training and education has progressed rapidly, and this community is now one of the more enthusiastic adopters of M&S technologies for initial training as well as skill maintenance. Skill retention has recently become a critical patient safety concern in the scientific literature and among medical and health sciences trainers.

M&S provides the means to educate military personnel, families, and colleagues about the impact of physical and psychological injury. We are seeing evidence that these technologies can also be deployed on stand-alone PCs, the Internet, game consoles, and a variety of mobile devices. Simulation is also needed across both civilian and military domains to refresh medical and first responder skills, to test competencies, and to provide training anywhere and anytime. For military and health sciences care professionals, medical simulators provide a unique opportunity to train skills that are not readily available within civilian health care systems. Today's medical simulators can serve as on-the-job training in combat zones and during humanitarian relief efforts. As M&S technologies continue to become more robust, acceptable, and affordable, the development of valid representations of the human patient will also expand to meet this demand. The medical and health sciences community recognizes these advances in M&S and the potential for computer graphics, natural language processing, and artificial intelligence technologies to deliver high-fidelity training.

While these advances in M&S for medical education and training have supported the development of medical simulations, there has not been a similar focus on measuring and assessing performance while using these tools. There has been less research on the evaluation of simulation-based interventions and on the validation of metrics to assess training outcomes and the transfer of training from virtual to real-world tasks. This research will require extensive attention to educational design principles and human factors issues, and rigorous attention to validation, to ensure that M&S platforms are both safe and efficacious. Such evaluation of performance outcomes is critical to ensuring positive medical and health sciences procedure outcomes and patient safety.

ONR and the Telemedicine and Advanced Technology Research Center (TATRC) asked the National Center for Research on Evaluation, Standards, and Student Testing (CRESST) to conduct a series of workshops focused on the state of the art in teaching and assessing medical skills using M&S technology. The results of these workshops are presented in this supplement of Military Medicine. It is our belief that these findings are just the beginning of a robust military and civilian enterprise approach to examining and implementing meaningful, effective, and affordable metrics for assessing medical simulations and simulation-based training outcomes.

Ray S. Perez, PhD
Program Officer, Cognitive Science of Learning Program
Office of Naval Research


MILITARY MEDICINE, 178, 10:1, 2013

Designing and Using Computer Simulations in Medical Education and Training: An Introduction

COL Karl E. Friedl, PhD, MS USA*; Harold F. O'Neil, PhD†

ABSTRACT Computer-based technologies informed by the science of learning are becoming increasingly prevalent in education and training. For the Department of Defense (DoD), this presents a great potential advantage to the effective preparation of a new generation of technologically enabled service members. Military medicine has broad education and training challenges ranging from first aid and personal protective skills for every service member to specialized combat medic training; many of these challenges can be met with gaming and simulation technologies that this new generation has embraced. However, comprehensive use of medical games and simulation to augment expert mentorship is still limited to elite medical provider training programs, but can be expected to become broadly used in the training of first responders and allied health care providers. The purpose of this supplement is to review the use of computer games and simulation to teach and assess medical knowledge and skills. This review and other DoD research policy sources will form the basis for development of a research and development road map and guidelines for use of this technology in military medicine.

*Telemedicine and Advanced Technology Research Center, U.S. Army Medical Research and Materiel Command, Fort Detrick, Maryland 21702.
†University of Southern California Rossier School of Education/National Center for Research on Evaluation, Standards, and Student Testing (CRESST), 15366 Longbow Drive, Sherman Oaks, CA 91403.
The findings, opinions, and assertions in the article are those of the authors and do not constitute an official position or view of the Department of the Army, the Department of the Navy, or the Department of Defense.
This work was partially supported by a grant from the Office of Naval Research (Award Number N00014-10-1-0978).
doi: 10.7205/MILMED-D-13-00209

INTRODUCTION

Computer-based technologies informed by the science of learning are becoming increasingly prevalent in education and training. For the Department of Defense (DoD), this presents a great potential advantage to the effective and efficient preparation of a new generation of service members. Military medicine, in particular, has broad education and training challenges that range from combat medic training to personal protective measures required for every service member. Many of these challenges can be met with gaming and simulation technologies that this new generation has embraced.1,2 The purpose of this supplement is to review the state of the science in computer games and simulations that could be applied to military education and training in medicine. Based on this supplement and DoD research policy, we plan to develop a road map for research and development in this area as well as develop a set of What Works guidelines for the design and evaluation of these technologies in military medicine.

THE NEED FOR MILITARY MEDICAL SIMULATION TRAINING

Key reasons why the DoD is invested in computer games and simulation technology go beyond the obvious advantages of modern training effectiveness and patient safety.3,4


There is also a need to dramatically reduce training costs, especially through better preparation of personnel before costly field training exercises and through reduction in total training time. This technology should deliver training in a form that the current generation of recruits expects. There is a need for just-in-time training and refresher training, which potentially would reduce the decay of critical skills. There is a need to reach individuals wherever they are (e.g., away from their duty station, mobilizing for disaster relief, or in remote areas), including team training of individuals even before the team is geographically assembled. The use of virtual patients is particularly important as a replacement for standardized patients. Such virtual patients could also be used to teach interpersonal skills and diagnostic and treatment skills based on symptom presentation and interviews.5 The use of virtual patients could be applied broadly, from pharmacy technicians to psychiatry residents learning to interact with standardized patients. This technology can improve the quality of medical training by providing training adapted to the needs and ability of the individual, updating relevant scenarios to provide adaptive training in response to changing health threats, and providing a large number of variations of conditions and scenarios to expand the military capability of the individual or team. These militarily unique computer training scenarios are critical to the integration of classroom and other medical training that will have to be applied in high-stress, complicated environments (e.g., noise, low light, airframe or vehicle vibration, threat of attack). An example is the training provided through the Army's Medical Simulation Training Centers currently established at 18 military installations (http://www.peostri.army.mil/PRODUCTS/MSTC/). Most importantly, if appropriate competency outcome measures have been determined, computer-based games and simulation training can test and objectively score individuals for their competence.

There are other challenges in the application of computer-based training technologies that are important to the DoD.


Service members cannot become dependent on these technologies in an operational environment; rather, the technology should truly improve the independent capabilities of the trained individual. This becomes important for agility and survivability, especially with modern-day threats aimed at the technology itself (e.g., electromagnetic pulse weapons that would shut down all electronics). Thus, this military medical training must balance the use of technology in the training context with its probable absence in operational environments. This is especially important for the DoD and others who may operate in remote and austere environments where electronic systems of all kinds, including telementoring and decision support tools, may not be available. For the DoD, there is also a need to provide elements of science, technology, engineering, and math (STEM) training for the pool of motivated recruits who may have inadequate background education in many areas critical to their skill set (e.g., math and physics for radiology technicians). Thus, STEM proficiency6 is also a military medical need (aside from the national security implications of a generation of underperforming U.S. high school students). One example of engaging game-based STEM training would be a game that incorporates positive role models and military scenarios; for example, a game could use physics and mathematical principles to secretly capture an enemy submarine and its code machines, built around a display such as the U-505 at the Chicago Museum of Science and Industry (http://www.msichicago.org/whats-here/exhibits/u-505/learning-tools/learning-games/). These games would help prepare future generations of K-12 students for critical STEM technical jobs, including possible interest in military jobs. The DoD must also produce qualified individuals within a constrained period of time, to reduce manpower costs and to improve availability of the needed specialists. Over-the-horizon concepts of more efficient and effective learning through electrical stimulation and pharmacological activation of specific brain centers may ultimately serve to explain how current technologies used in the entertainment industry are so effective in promoting behaviors that facilitate learning. For example, the same learning activation in classrooms may be purposefully accomplished through the use of engaging game technology if we understand the scientific underpinnings. The requirement to reduce training time through more effective training strategies also requires an appreciation of the training "dosage" as well as evaluation criteria that will determine the probability that medical personnel are properly trained. Such medical training for enlisted service members is conducted in San Antonio, Texas (http://www.metc.mil/). This command supports the largest single medical training campus in the world, the Medical Education & Training Center. There are 21,000 students trained annually in more than 60 major medical specialty programs to serve the needs of the Army, Navy, and Air Force.7,8 In addition to this training campus for enlisted service members, the Uniformed Services University of the Health Sciences (USUHS) in Bethesda, Maryland, trains physicians, nurses, and allied medical professionals (http://www.usuhs.mil/).

The Army alone has approximately 5,000 physicians and 11,000 nurses, along with many other medical professionals. Since 1980, USUHS has trained 4,700 physicians, providing approximately 20% of the military medical corps physician accessions each year. These service members, and many more medical and allied science officers recruited from other sources, require specialty training in military medicine and refresher training within each of their specialties throughout their careers. Beyond the specific specialty training needs of DoD medical personnel, all service members require common skills training in health topics to maximize their readiness status (i.e., their preparedness to perform their trained mission today) as well as to instill individual responsibility for their own health. This training includes a wide range of topic areas such as understanding signs and symptoms associated with chemical, biological, radiological, nuclear, and explosives attacks, recognizing behavioral health issues in themselves and their buddies, proper use of effective personal protective measures against disease-carrying vectors (e.g., malarial mosquitoes), and avoidance of personal health-damaging behaviors (e.g., smoking and excessive alcohol consumption). One example of a computerized approach to providing this fundamental training is the small business grants awarded under the topic "Micro Games for Proactive Preventive Medicine" (http://www.sbir.gov/sbirsearch/detail/242049). This latter topic of promoting individual responsibility for one's own health is a key theme of the Army Surgeon General (LTG Horoho), who has described the need to develop training tools to help develop the "LifeSpace" of the soldier—that majority of the soldier's time when they are not in contact with the medical health care system.9 This LifeSpace can be filled in, in part, through the use of medical education and training delivered via mobile health technologies, especially for health behaviors such as weight regulation, activity and fitness habits, and smoking cessation.10 Everyday off-the-shelf technologies such as personally owned smart phones and laptop computers can be an effective means to reach and educate today's young service member. Technologies developed first in the entertainment industry, such as individual and multiplayer electronic games,11,12 interactive virtual worlds,13,14 virtual human technologies,5,15 and 3-dimensional interactive capabilities such as new motion sensing technologies that create a Star Trek-like "holodeck" (e.g., Kinect),16 can be harnessed and tested for military medical training. The challenge is not just the technology but the science of learning that should underlie technology use in medical education and training. Military use of medical simulation technologies has to date relied heavily on a family of commercially available off-the-shelf human manikins and part-task training simulators (e.g., chest tube simulators, intubation and colonoscopy simulators, anesthesia simulators, etc.).17,18


Less than a decade ago, everyone had a "simulator in a closet": typically an expensive, difficult to operate and maintain simulator with unknown training effectiveness that was kept in the storage closet rather than actually used in any curriculum. Many military medical training simulation centers still have a myriad of different systems intended for different medical task training, most of which do not operate on other hardware or software systems and which are in various stages of development and validation. Unlike the Food and Drug Administration approval process for new medical devices, simulation systems used in medical education and training do not fall under one governing body that would ensure that such training systems are efficient and effective. Because these systems are intended for training rather than for medical diagnosis or treatment, they are not reviewed by the Food and Drug Administration. Thus, medical education and training simulators are currently a wild west of unregulated simulation systems and claims. There should be guidelines for the use of computer simulation in education and training. Medical training goals, curricula, and performance standards, as well as test standards,19,20 should dictate the appropriate insertion points where more affordable and usable training simulations could greatly improve medical training. This overall goal is, in part, the focus of a new DoD intramural consortium led by the USUHS National Capital Area Medical Simulation Center (http://simcen.usuhs.edu/Pages/default.aspx). For example, standards for physician training are most advanced in the surgical community, where testing and training of the fundamentals of laparoscopic surgery have been pioneered.21–23 Laparoscopic surgery is well suited to the early adoption of virtual training and testing technologies because of the similarity between virtual and real procedures.4 The training emphasizes the development of the psychomotor skills and dexterity necessary to master laparoscopic surgery. Another DoD goal in the use of simulation technologies in medicine is to dramatically reduce the use of live animals in military medical training, an objective of the DoD medical simulation training research program.24 Simulations now provide a high level of training engagement and realism,25,26 which may allow simulation of some tasks to replace training with live animals. Development of such simulations needs to be guided by a science of learning approach27 that has been lacking in the earlier development of medical training, including training involving live animals and cadavers. Other significant benefits of simulation vs. live animals are the ability to train to multiple presentations of a medical problem and with many repeated practice trials that would not be practical with animals. We believe that this simulation technology is most useful in the initial phase of skills development that leads up to apprentice-level competence, after which a trainee can then participate and be mentored in real medical procedures using real patients. The current state of the art in either simulation or live tissue training is not an adequate substitute for real patients in the training of fully competent medical providers.28

For all of the reasons above, it is important for the DoD to determine what criteria must be applied in the design and evaluation of computer-based medical training systems. The goal is to use new simulation opportunities to change the old discomfiting concept of training medical procedures from "see one, do one, teach one" to "see one, simulate many, do one competently, teach many".29 This supplement is based on the first of three workshops co-organized with the Office of Naval Research and the Telemedicine and Advanced Technology Research Center (TATRC) by UCLA/CRESST. The first workshop reviewed the state of the science in medical simulation training, design, and evaluation from researchers in these areas. The second workshop focused on specific technology applications that could be designed and delivered to support the specific needs of the Medical Education & Training Center in combat medic training and other enlisted medical specialty training, as well as training for Aeromedical Evacuation personnel. It also focused on the development of a road map and topics for what works in gaming and simulations. It is expected that the resulting book of guidelines will be published by a commercial publisher. The third workshop will finalize the road map and recommended guidelines for simulations developed for military medical education and training. This work directly supports DoD training requirements as dictated by DoD Instruction 1322.24 (Medical Readiness Training), which states that "Medical readiness training programs shall include realistic individual and collective medical skills training and shall maximize the use of emerging technology, including distance learning, simulation, and virtual reality." It also supports another DoD Instruction, DoDI 3216.01 (Use of Animals in DoD Programs), which states that methods other than animal use shall be considered and used whenever possible to attain the objectives of training "if such alternative methods produce scientifically or educationally valid or equivalent results."

CURRENT DOD RESEARCH INVESTMENTS IN MEDICAL SIMULATION

The current investment in military medical simulation education and training research is summarized in detail elsewhere,30,31 and some military studies on medical simulation training have been conducted and reported.32–34 In brief, new core funding has been made available in the DoD to support an organized research effort that includes at least four principal thrust areas. The most mature of these four areas is a fully funded, multiyear, academically based effort under the Combat Casualty Training Consortium initiative.31 This carefully crafted consortium specifically addresses issues that are currently highest priority in military medical training—i.e., combat medic skills training. To the disappointment of some, however, this funding was not directed to high-powered laboratories specializing in surgical specialty simulation training, but instead focused on uniquely military problems to improve medic training and to reduce reliance on live animals.


In the civilian sector, there has been a great deal of good analytic work on the tradeoffs of using live animals. For example, the impact of international animal research regulations on the neurosciences was documented by the Institute of Medicine and the National Research Council.35 Three other DoD initiatives are partially funded and in development, including the Medical Practice Initiative for medical specialty training (e.g., mobile platforms for just-in-time training and refresher training for humanitarian missions and disaster response in the Mobile Learning Environment project, http://www.mole-project.net/the-project/global-medaid-app); the Patient Focused Initiative, which addresses the Army Surgeon General's high-priority LifeSpace initiative (e.g., the virtual human interactive behavioral health coaching [SimCoach] project)5,9; and the Developer Tools Initiative, which will fill a critical need for open-source physiology engines (models and artificial intelligence tools) to drive realistic and accurate medical simulation trainers.31

In addition to new DoD funding support, high interest in medical simulation and training is spurred in part by the recognition that current technologies can be brought to bear on critical needs in medical education and training. Service members today tend to be equipped with personal smart phones and expect to find and receive information through these everyday technologies. The Defense Advanced Research Projects Agency has entered the field to promote the development of game-based medical training software, including the development of a deeper understanding of the physiological principles that support a broader ability to deal with real-life medical problems.36 In addition, the DoD has partnered with allies in a NATO workgroup on Advanced Training Technologies for Medical Healthcare Professionals (NATO HFM-215); the results of this workshop will be reported elsewhere. There is also much to learn from the rest of the DoD and the Human Systems Integration community, which has embraced and continues to develop new training technologies as well as to define how electronic systems integrate into military internet security systems.37 In particular, the Office of Naval Research has been a leader in advancing research on more effective education and training technologies, and the results of this effort provide a springboard to the development of guidelines for standardization and evaluation of new military medical simulations for education and training. Such military medical applications will also have many dual-use benefits in civilian medical training.

The science of learning specifically targeted toward the use of computer simulations in medical education and training is also a focus of this supplement. Mayer27 conceptualized the science of learning as the scientific study of how people learn. It is supported by the science of instruction (how to teach), where teaching is accomplished by humans or technologies. Mayer conceptualizes the science of assessment as the determination of what people know.

Baker38 provides an excellent overview of assessment with a focus on reliable and valid measures. The classification of learning outcomes as a family of cognitive demands has been provided by Baker and Mayer.39 These cognitive outcomes are the following: content or domain understanding, problem solving, self-regulation, communication, and collaboration/teamwork. The utility of this classification for computer games is demonstrated by O'Neil et al.40 A complementary taxonomy of learning outcomes is provided by Anderson and Krathwohl.41 This taxonomy replaces Bloom's taxonomy.42 An application of the science of learning to science education is provided by the National Research Council.43 An interesting meta-level look at how to conceptualize learning is provided by Bransford et al.44 They suggest an integration of three types of learning: informal learning, implicit learning, and formal learning. They define informal learning as learning that happens in nonschool public settings such as museums, zoos, and after-school clubs, or the learning that occurs in homes, on playgrounds, among peers, and in other situations (p. 216). Their definition of implicit learning is information that is acquired effortlessly and sometimes without conscious recollection of the learned information or of having acquired it (p. 210). Formal learning is the learning that goes on in formal environments such as schools, where the science of instruction should be explicit.

ORGANIZATION OF THE SUPPLEMENT

The organization of this supplement was partially based on the organization of a workshop organized by TATRC/ONR/CRESST on "Designing and Using Computer Simulations in Medical Education and Training." This supplement documents that workshop, which involved presentations followed by the development of manuscripts. The rationale for the selection of authors was based on three underlying challenges: (1) there are different research communities in simulation research for medical education and training (e.g., researchers from medical vs. nonmedical environments and from defense vs. civilian research communities); (2) research in simulation for medical education and training has rarely been informed by a science of learning perspective; and (3) lessons learned from the use of simulation in other, nonmedical fields (e.g., aviation) have not been carried over. Unfortunately, there is minimal scientific contact between these different communities. For example, researchers tend to have specific professional identities contextualized in either military or civilian research organizations, which results in their reporting research in different journals and at different conferences. To partially address these challenges, we organized this journal supplement. Thus, this supplement presents work by authors from multiple disciplines whose expertise is using computer simulation to teach or assess in both medical and nonmedical environments. Experts from both military and civilian organizations are also represented.


Two disciplines, i.e., the science of learning and psychometrics, were deliberately overrepresented because the science of learning has had minimal impact on medical uses of simulation and there is little psychometric knowledge for estimating the reliability and validity of simulations used for teaching or assessment. The articles in the supplement also include original studies, reviews, and conceptual analyses. The supplement itself is organized into four major sections, representing four different sets of issues: Design Issues, Assessment and Evaluation Issues, Instructional Strategies Issues, and Psychometric Issues.

Design Issues
We viewed the critical design issues in computer simulation for medical education and training as the use of cognitive task analysis, how much fidelity is desirable, and costs. The cognitive task analysis focus is represented by three different but complementary cognitive task analysis methodologies (Munro et al, Cannon-Bowers et al, and Pugh et al). Fidelity and cost considerations are represented by the articles by Talbot and by Fletcher et al, respectively.

Assessment and Evaluation Issues
We viewed these issues as another area where nonmedical research has useful lessons learned for military medical education and training. This section is represented by three articles: an assessment methodology derived from a research project focused on Navy but nonmedical training (Koenig et al), the application of national test standards to simulation-based assessment of clinical palpation skills (Pugh), and lessons learned from nonmedical simulation evaluations applied to the evaluation of medical simulations (Bewley et al).

Instructional Strategies Issues
How to teach and assess is represented by three articles: the instructional and assessment strategies that are effective in reducing skill decay (Perez et al); an instructional strategy using virtual trainers for FAST window identification, acquisition, and diagnosis (Chung et al); and the strategies effective in adaptive and perceptual learning (Kellman).

Psychometric Issues
Psychometrics (the study of psychological measurement) deals with the technical quality of measures, which is the overall basis for trustworthy interpretation of measurement results. It includes validity, reliability, comparability, equating, scaling, and standardization, as well as theoretical frameworks such as classical test theory, item response theory, and generalizability theory.45 This section is represented by three different approaches: an evidence-centered design approach (Mislevy), a latent variable modeling approach (Cai), and the Assessment–Diagnosis–Treatment–Outcomes model, which includes psychometric issues (Spratt).
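To make one of these frameworks concrete (an orienting example added here, not material drawn from the supplement articles), the two-parameter logistic item response theory model expresses the probability that a trainee with latent proficiency theta succeeds on a scored simulation item j that has discrimination a_j and difficulty b_j:

\[
P(X_j = 1 \mid \theta) = \frac{1}{1 + \exp\left[-a_j(\theta - b_j)\right]}
\]

Fitting models of this kind to simulator performance records is one route to the reliability and validity evidence that the articles in this section discuss.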

In summary, this supplement presents an overview of the key considerations in simulation design and assessment for medical education and training. Multiple experts from multiple disciplines have contributed their specific ideas on how thoughtful simulation design should be planned and executed. These articles suggest how medical education and training simulations can be developed that are scientifically based, outcomes driven, and cost conscious.

REFERENCES

1. Ahn J: Digital divides and social network sites: which students participate in social media? J Educ Computing Res 2011; 45: 147–63.
2. Orvis KA, Moore JC, Belanich J, Murphy JS, Horn DB: Are soldiers gamers? Videogame usage among soldiers and implications for the effective use of serious videogames for military training. Mil Psychol 2010; 22: 143–57.
3. Networking and Information Technology Research and Development Program (NITRD): High-Confidence Medical Devices: Cyber-Physical Systems for 21st Century Health Care—A Research and Development Needs Report, February 2009, 88 pp. Available at http://www.whitehouse.gov/files/documents/cyber/NITRD%20-%20High-Confidence%20Medical%20Devices.pdf; accessed October 14, 2012.
4. Kunkler K: The role of medical simulation: an overview. Int J Med Robot 2006; 2: 203–10.
5. Rizzo A, Lange B, Buckwalter JG, et al: SimCoach: an intelligent virtual human for providing healthcare information and support. Proceedings of 8th International Conference on Disability, Virtual Reality and Associated Technologies. Vina del Mar, Valparaiso, Chile, August 31–September 2, 2010. Available at http://www.dtic.mil/cgi-bin/GetTRDoc?AD=ADA541005; accessed September 3, 2012.
6. Aleven VA, Koedinger KR: An effective metacognitive strategy: learning by doing and explaining with a computer-based Cognitive Tutor. Cogn Sci 2002; 26: 147–79.
7. Kiser WR, Hanson L, Stevens A, Miller R, Denton-Price T: Joint enlisted training through DoD—the new paradigm. Presentation at the 2011 Military Health System Conference, National Harbor, Maryland, January 26, 2011. Available at http://www.dtic.mil/cgi-bin/GetTRDoc?AD=ADA556307; accessed May 12, 2012.
8. Kirby SN, Marsh JA, Thie HJ: Establishing a Research and Evaluation Capability for the Joint Medical Education and Training Campus. RAND Corporation, Santa Monica, CA, NTIS ADA 545349, 2011. 132 pp.
9. Horoho P: Plenary remarks at the 2012 Military Health System Conference, National Harbor, Maryland, January 31, 2012. Available at http://www.armymedicine.army.mil/news/docs/MHS2012PlenaryLTGHorohoArmySurgeonGeneral_31_JAN_12_Remarks.pdf; accessed May 3, 2013.
10. Ershow AG, Friedl KE, Peterson CM, Riley WT, Rizzo A, Wansink B (editors): Virtual reality technologies for research and education in obesity and diabetes. A National Institutes of Health–Department of Defense Symposium. J Diab Sci Technol 2011; 5: 212–344.
11. Tobias S, Fletcher JD: Reflections on "a review of trends" in serious gaming. Rev Educ Res 2012; 82: 233–7.
12. Tobias S, Fletcher JD (editors): Computer Games and Instruction. Charlotte, NC, Information Age Publishing, 2011.
13. Heinrichs WL, Youngblood P, Harter PM, Dev P: Simulation for team training and assessment: case studies of online training with virtual worlds. World J Surg 2008; 32: 161–70.
14. Riedl MO, Stern A, Dini DM, Alderman JM: Dynamic experience management in virtual worlds for entertainment, education, and training. Int Trans Syst Sci Appl 2008; 3: 23–42.
15. Kenny P, Hartholt A, Gratch J, et al: Building interactive virtual humans for training environments. Paper No. 7105, 16 pp. Interservice/Industry Training, Simulation, and Education Conference (I/ITSEC), Orlando, Florida, 2007. Available at http://ict.usc.edu/pubs/Building%20Interactive%20Virtual%20Humans%20for%20Training%20Environments.pdf; accessed September 2, 2012.
16. Lange B, Koenig S, McConnell E, et al: Interactive game-based rehabilitation using the Microsoft Kinect. In: Virtual Reality Workshops, 2012 IEEE Transactions on Visualization and Computer Graphics, Washington DC: IEEE Computer Society, 2012, pp 171–2. Available at http://ieeexplore.ieee.org/xpl/articleDetails.jsp?reload=true&arnumber=6180935&contentType=Conference+Publications; accessed October 14, 2012.
17. Cooper JB, Taqueti VR: A brief history of the development of mannequin simulators for clinical education and training. Qual Saf Health Care 2004; 13(Suppl 1): i11–i18.
18. Passiment M, Sacks H, Huang G: Medical Simulation in Medical Education: Results of an AAMC Survey. Washington, DC, Association of American Medical Colleges, 2011. Available at https://members.aamc.org/eweb/upload/Medical%20Simulation%20in%20Medical%20Education%20Results%20of%20an%20AAMC%20Survey.pdf; accessed June 20, 2012.
19. American Educational Research Association (AERA), American Psychological Association (APA), and National Council for Measurement in Education (NCME): Standards for Educational and Psychological Testing. Washington, DC, American Educational Research Association, 1999.
20. Baker EL: Standards for educational and psychological testing. In: Encyclopedia of Diversity in Education. Edited by JA Banks. Thousand Oaks, CA, Sage, 2012.
21. Rehrig ST, Powers K, Jones DB: Integrating simulation in surgery as a teaching tool and credentialing standard. J Gastrointest Surg 2008; 12: 222–33.
22. Cotin S, Stylopoulos N, Ottensmeyer M, Neumann P, Rattner D, Dawson S: Metrics for laparoscopic skills trainers: the weakest link. In: Medical Image Computing and Computer-Assisted Intervention. Lecture Notes in Computer Science, Vol 2488, pp 35–43. Springer-Verlag, London, UK, 2002.
23. Rosen J, Chang L, Brown JD, Hannaford B, Sinanan M, Satava R: Minimally invasive surgery task decomposition—etymology of endoscopic suturing. In: Studies in Health Technology and Informatics—Medicine Meets Virtual Reality, Newport Beach, CA, January 2003, 7 pp. Available at http://brl.ee.washington.edu/BRL_Pubs/Pdfs/Rep166.pdf; accessed September 3, 2012.
24. House Report 112-078, National Defense Authorization Act for Fiscal Year 2012: Title VII. Health Care Provisions. Items of Special Interest. Use of Simulation Technology in Medical Training, 2011. Available at http://www.gpo.gov/fdsys/pkg/CRPT-112hrpt78/pdf/CRPT-112hrpt78.pdf; accessed May 3, 2013.
25. Gordon JA, Oriol NE, Cooper JB: Bringing good teaching cases "to life": a simulator-based medical education service. Acad Med 2004; 79: 23–7.
26. Gordon JA: As accessible as a book on a library shelf: the imperative of routine simulation in modern health care. Chest 2012; 141: 12–6.
27. Mayer RE: Applying the Science of Learning. Boston, MA, Pearson, 2011.
28. Sohn VY, Miller JP, Koeller CA, et al: From the combat medic to the forward surgical team: the Madigan model for improving trauma readiness of brigade combat teams fighting the Global War on Terror. J Surg Res 2007; 138: 25–31.
29. Vozenilek J, Huff JS, Reznek M, Gordon JA: See one, do one, teach one: advanced technology in medical education. Acad Emerg Med 2004; 11: 1149–54.
30. Pugh CM, Bevan MG, Duve RJ, White HL, Magee JH, Wiehagen GB: A retrospective review of TATRC funding for medical modeling and simulation technologies. J Soc Sim Healthcare 2011; 6: 218–25.
31. Friedl KE, Talbot TB, Steffensen S: Information science and technology—a new paradigm in military medical research. Technical Report 12-1. Telemedicine and Advanced Technology Research Center, Fort Detrick, Maryland, 2012.
32. Holcomb JB, Dumire RD, Crommett JW, et al: Evaluation of trauma team performance using an advanced human patient simulator for resuscitation training. J Trauma 2002; 52: 1078–86.
33. Ritter EM, Bowyer MW: Simulation for trauma and combat casualty care. Minim Invasive Ther Allied Technol 2005; 14: 224–34.
34. Hackett M, Norfleet J, Pettitt B: Usability analysis of prototype partial task tourniquet trainers. Presentation at The Interservice/Industry Training, Simulation & Education Conference (I/ITSEC), 2011. Available at http://ntsa.metapress.com/link.asp?id=44g7p077w7617224; accessed October 14, 2012.
35. Institute of Medicine (IOM), National Research Council (NRC): International animal research regulations: impact on neuroscience research: Workshop summary. Washington, DC, The National Academies Press, 2012.
36. Kenyon H: DARPA wants gamers to design medical training software. Government Computer News, May 3, 2012. Available at http://gcn.com/articles/2012/05/03/darpa-game-designers-medical-training-software.aspx; accessed October 14, 2012.
37. Fletcher JD: Education and training technology in the military. Science 2009; 323: 72–5.
38. Baker EL: The Chimera of Validity. Teachers Coll Rec 2013; 115. Available at http://www.tcrecord.org/library, ID Number 17106; accessed August 29, 2013.
39. Baker EL, Mayer RE: Computer-based assessment of problem solving. Comput Human Behav 1999; 15: 269–82.
40. O'Neil HF, Wainess R, Baker EL: Classification of learning outcomes: evidence from the computer games literature. The Curriculum Journal 2005; 16(4): 455–74.
41. Anderson LW, Krathwohl DR (editors): A Taxonomy for Learning, Teaching, and Assessing: A Revision of Bloom's Taxonomy of Educational Objectives. Boston, MA, Allyn & Bacon, 2001.
42. Bloom BS (editor): Taxonomy of Educational Objectives: Book I: Cognitive Domain. New York, Longmans, Green, 1956.
43. National Research Council: Learning Science Through Computer Games and Simulations. Washington, DC, The National Academies Press, 2011.
44. Bransford J, Vye N, Stevens R, et al: Learning theories and education: Toward a decade of synergy. In: Handbook of Educational Psychology, Ed 2, pp 209–44. Edited by PA Alexander, PH Winne. Mahwah, NJ, Erlbaum, 2006.
45. Jones LV, Thissen DA: History and Overview of Psychometrics. In: Handbook of Statistics, Vol 26: Psychometrics, pp 45–79. Edited by CR Rao, S Sinharay. Amsterdam, The Netherlands, Elsevier Science B.V., 2007.


MILITARY MEDICINE, 178, 10:7, 2013

Cognitive Task Analysis-Based Design and Authoring Software for Simulation Training

Allen Munro, PhD; Richard E. Clark, EdD

ABSTRACT The development of more effective medical simulators requires a collaborative team effort where three kinds of expertise are carefully coordinated: (1) exceptional medical expertise focused on providing complete and accurate information about the medical challenges (i.e., critical skills and knowledge) to be simulated; (2) instructional expertise focused on the design of simulation-based training and assessment methods that produce maximum learning and transfer to patient care; and (3) software development expertise that permits the efficient design and development of the software required to capture expertise, present it in an engaging way, and assess student interactions with the simulator. In this discussion, we describe a method of capturing more complete and accurate medical information for simulators and combine it with new instructional design strategies that emphasize the learning of complex knowledge. Finally, we describe three different types of software support (Development/Authoring, Run Time, and Post Run Time) required at different stages in the development of medical simulations and the instructional design elements of the software required at each stage. We describe the contributions expected of each kind of software and the different instructional control authoring support required.
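As a rough orientation to the three kinds of software support named in the abstract, the sketch below shows one hypothetical way the Development/Authoring, Run Time, and Post Run Time stages could be separated; the class and method names are illustrative assumptions, not a description of the authors' actual software.

# Illustrative sketch only: the three stage names come from the abstract above;
# the class and method names are hypothetical, not the authors' software.
from dataclasses import dataclass, field
from typing import List

@dataclass
class AuthoredStep:
    """One expert-derived step captured during development/authoring."""
    description: str
    is_decision: bool = False  # decision steps are the ones experts tend to omit

@dataclass
class AuthoringSupport:
    """Development/authoring stage: capture expert content for the simulator."""
    steps: List[AuthoredStep] = field(default_factory=list)

    def add_step(self, description: str, is_decision: bool = False) -> None:
        self.steps.append(AuthoredStep(description, is_decision))

@dataclass
class RunTimeSupport:
    """Run-time stage: present authored content and log trainee interactions."""
    log: List[str] = field(default_factory=list)

    def present(self, protocol: AuthoringSupport) -> None:
        for step in protocol.steps:
            # A real simulator would render the step; this sketch only records it.
            self.log.append(f"presented: {step.description}")

class PostRunTimeSupport:
    """Post-run-time stage: assess the logged interactions after a session."""
    def summarize(self, run_time: RunTimeSupport) -> str:
        return f"{len(run_time.log)} steps presented and logged for assessment"

if __name__ == "__main__":
    authoring = AuthoringSupport()
    authoring.add_step("Decide whether the airway is patent", is_decision=True)
    authoring.add_step("Insert the chest tube")
    runtime = RunTimeSupport()
    runtime.present(authoring)
    print(PostRunTimeSupport().summarize(runtime))

Keeping the three stages separate in this way reflects the abstract's point that capturing expertise, presenting it in an engaging way, and assessing student interactions impose different software requirements.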

INTRODUCTION

The traditional medical teaching method was based on an outdated apprenticeship model.1 Hand-in-hand with this limitation was the necessity for experts to be the teachers and for teaching moments to be moved along by the progress of a case, often without the option of redoing an error or stopping at a critical moment to ask questions. Recent advances in medical pedagogy, such as problem-based learning, have apparently not addressed these concerns.2 Increasing concerns about medical mistakes, patient safety, and demands on teaching faculty have led the medical community to supplement mentor-based training practices with medical simulators.3,4 The introduction of simulators offered some relief for medical faculty by providing a convenient place for students to gain medical experience without putting patients at risk or delaying treatment for a teaching moment.5 As a result, medical educators have tended to simply insert simulators into medical training wherever possible. As simulators have been put into place and tested, we have realized that we must improve the way we design the software that supports or delivers the simulations and put more careful emphasis on the development and coordination of the instructional content and teaching methods presented by simulators to produce maximum learning efficiency and effectiveness.5,6 Simulators are susceptible to the GIGO (garbage in, garbage out) problem when they present wrong or incomplete information and, in some cases, present it in ways that risk cognitively overloading students and so make learning ineffective and inefficient.7

Center for Cognitive Technology, Rossier School of Education, University of Southern California, 250 N. Harbor Drive, Suite 309, Redondo Beach, CA 90277.
The findings and opinions expressed here do not necessarily reflect the positions or policies of the Office of Naval Research.
doi: 10.7205/MILMED-D-13-00265


The development of more effective medical simulators requires a collaborative team effort where three kinds of expertise are carefully coordinated: (1) exceptional medical expertise focused on providing complete and accurate information about the medical challenges to be simulated; (2) instructional expertise focused on the design of simulation-based training and assessment methods that produce maximum learning and transfer to patient care; and (3) software development expertise that permits the efficient design and development of the software required to capture expertise, present it in an engaging way, and assess student interactions with the simulator. This article focuses on each of these three essential components of the design of effective and efficient medical simulations.

MEDICAL EXPERTISE FOCUSED ON THE MEDICAL CHALLENGES TO BE SIMULATED

Medicine relies heavily on instructional content drawn from the clinical expertise of medical experts. Physicians with successful clinical experience both teach and provide the medical expertise necessary for simulations. If their descriptions of medical procedures were complete and accurate, the designers of simulations would only have to incorporate them faithfully. Yet Clark and Elen8 have provided compelling evidence that the experiential knowledge acquired by all experts is largely "implicit," automated, and nonconscious. When experts attempt to describe medical protocols to students or to the designers of simulations, evidence suggests that they unintentionally omit approximately 70% of the critical decisions and analysis required to succeed. A number of experiments in various medical areas have validated this "70% rule."8
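As a back-of-envelope illustration of what the 70% rule implies for knowledge capture (our arithmetic, under an independence assumption that the article does not make), if each expert's unaided description surfaces only about 30% of the critical decisions, then pooling several independently interviewed experts raises expected coverage quickly at first and then with diminishing returns, which is consistent with the multiple-expert CTA interviews discussed below.

# Illustrative arithmetic only: assumes each expert independently reports about
# 30% of the critical decisions (an assumption of this sketch, not of the authors).
def expected_coverage(num_experts: int, recall_per_expert: float = 0.30) -> float:
    """Expected fraction of critical decisions captured by pooling interviews."""
    return 1.0 - (1.0 - recall_per_expert) ** num_experts

for k in range(1, 5):
    print(f"{k} expert(s): about {expected_coverage(k):.0%} of critical decisions")
# Prints roughly 30%, 51%, 66%, and 76%: rapid gains at first, then diminishing returns.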


practiced successfully, they gradually automate to make way for novel information that must be processed consciously.8,9 As experts describe to students or simulation specialists what they do during a procedure, their descriptions tend to be accurate about actions that they can observe themselves or others perform, because those actions are stored in declarative memory; decisions, however, are not directly observable and so can only be inferred, because they are stored in nonconscious procedural memory. Because the decisions experts make tend to be largely implicit and nonconscious, experts are only about 30% accurate when describing the critical decisions they must make to succeed at a medical procedure. Yet it appears that different experts are consciously aware of somewhat different decisions, so interviewing multiple experts with cognitive task analysis (CTA) gradually increases the capture of critical decisions and reduces, but does not completely eliminate, omitted decisions.10 Omissions may be a source of medical mistakes as students "fill in the blanks" through trial and error when treating patients, and implicit expert knowledge may also cause incomplete information to be programmed into medical simulations. The only exception to the 70% rule identified to date was found in an experiment on a trauma procedure that was controversial because of documented medical mistakes and so was being discussed openly.9 Previous studies of this procedure had found the 70% omissions,11 but in a later follow-up study conducted after the textbook approach to the procedure had been questioned and was being discussed widely, the omissions fell from 70% to 40%.9 Even in this unusual case, many critical decisions were not reported. Medicine obviously requires a strategy for identifying and capturing critical but nonconscious expert knowledge, and one viable option that is compatible with the design of simulators is a technique called CTA.12

CTA
CTA refers to over 100 interview and observation techniques used to elicit and represent the knowledge, goals, strategies, and decisions that underlie observable task performance.13 Although there are many types of CTA methods, all share a common goal of capturing the knowledge of subject-matter experts (SMEs) who have shown reliable proficiency in performing a task over a long period of time. Yates13 reviewed all published CTA approaches and identified only five that seemed viable for capturing both the conceptual and implicit procedural knowledge required by medical simulators. An example of one of the five viable methods was developed and has been validated in multiple studies by one of the authors.10 The validation of other viable methods has been described by Catrambone14 and Lipshitz et al.15 The CTA approach described here is called the Concepts, Processes, and Principles CTA10,16 and has proven to be effective both in capturing expert knowledge and in contributing to the development of training simulators that reduce the

time trainees need to meet their learning objectives and decrease the errors they make during transfer after training (e.g., see the study by Velmahos et al11). Concepts, Processes, and Principles CTA is most commonly performed in the following six stages, in which SMEs with recent and consistently successful experience are interviewed:
(1) Ask about the sequence of tasks that must be performed to complete the medical procedure, to "outline" the procedure.
(2) Capture "when and how" decision steps and all action steps for all tasks captured in step 1, and develop a document describing each step in the sequence in which the steps are performed.
(3) Ask SMEs to edit their own and other experts' procedure documents to correct and add missing steps, so that analysts can combine different approaches into one "gold standard" approach that can be learned and performed by students.
(4) For each task and procedure identified, describe the critical concepts, processes, or principles that need to be learned to explain and perform the procedure.
(5) Collect additional information about the various types of problems the procedure will solve and the materials and equipment required, as well as the indications and contraindications needed to support their use with patients.
(6) Pull the CTA protocol into an instructional design system that will reside inside the simulation.
The effort required to produce a viable final gold standard CTA is determined in large part by two factors: (1) the time required to conduct and analyze full interviews with each SME to capture all of the knowledge required for training novices; and (2) the number of SMEs who must experience full interviews to capture the most complete and accurate description of the skills and knowledge required to execute a given task. Research conducted by Chao and Salvendy17 and more recently by Crispen18 and Bartholio19 has provided evidence that three to four experts are necessary to capture the optimal amount of knowledge before reaching "the point of diminishing utility." The payoff from this level of investment is an explicit record of nearly all critical decisions and actions, described at a level appropriate for students with adequate background knowledge. The CTA can be changed easily as the procedure is improved over time, without medical faculty having to invest wasted effort repeatedly describing (an incomplete version of) the procedure. A CTA-based protocol is the first stage in the design of an effective simulation. In the second stage, we need to format the expert-based CTA information into an instructional design that will transmit the content to students in a simulation.
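To make the structure of such a protocol concrete, the sketch below shows one way a gold standard CTA for a single task might be represented in software. This is a minimal illustration in Python under our own assumptions, not the authors' tooling; the class names, field names, and the crude union-based merge are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Set

@dataclass
class Step:
    """One step captured during a CTA interview."""
    description: str
    is_decision: bool = False                          # "when and how" decision step vs. observable action step
    cues: List[str] = field(default_factory=list)      # what must be observable to take or check the step
    actions: List[str] = field(default_factory=list)   # what the performer must be able to do

@dataclass
class TaskProtocol:
    """A task's steps plus the concepts, processes, and principles that explain it."""
    task: str
    steps: List[Step] = field(default_factory=list)
    concepts: List[str] = field(default_factory=list)

def merge_into_gold_standard(drafts: List[TaskProtocol]) -> TaskProtocol:
    """Combine several SMEs' edited drafts of the same task into one protocol by
    taking the union of their steps (a crude stand-in for the analyst's judgment)."""
    gold = TaskProtocol(task=drafts[0].task)
    seen: Set[str] = set()
    for draft in drafts:
        for step in draft.steps:
            if step.description not in seen:
                gold.steps.append(step)
                seen.add(step.description)
        for concept in draft.concepts:
            if concept not in gold.concepts:
                gold.concepts.append(concept)
    return gold

def required_observables(protocol: TaskProtocol) -> Set[str]:
    """Everything a simulator must make observable to support this task."""
    return {cue for step in protocol.steps for cue in step.cues}

def required_actions(protocol: TaskProtocol) -> Set[str]:
    """Every action a simulator must allow the learner to carry out."""
    return {action for step in protocol.steps for action in step.actions}
```

The two helper functions anticipate the design-tool role discussed later in this article: a CTA-aware design tool would derive lists of required observables and actions from the protocol rather than leaving them to the developer's memory.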


INSTRUCTIONAL SIMULATION METHODS THAT PRODUCE MAXIMUM LEARNING AND TRANSFER TO PATIENT CARE
Whether simulators are used for instruction and/or for practice of skills that have been taught before the use of the simulator must influence the design of the simulations. If they are used to teach new skills and knowledge, the evidence clearly supports the use of fully guided, explicit instruction.2,20–22 Most evidence-based guided instructional design processes implement the list of features described by Merrill,23 which includes the following:
(1) Provide realistic field-based problems for students to solve (note: available from a CTA).
(2) Give students analogies and examples that relate their relevant prior knowledge to new learning as the procedure is being introduced.
(3) Offer clear and complete demonstrations of how to perform key tasks and solve authentic problems (note: available from a CTA).
(4) Insist on frequent practice opportunities during training to apply what is being learned (by performing tasks and solving problems) while receiving corrective feedback.
(5) Require application practice that includes "part tasks" (practicing small chunks of larger tasks) but also "whole tasks" (applying as much of what is learned as possible to solve the complex problems that represent challenges encountered in operational environments) both during and after instruction.
Items 1 to 3 are required for simulators that offer instruction, whereas items 4 and 5 are necessary for simulators that only provide opportunities for practice, feedback, testing, and transfer. The problems and tasks provided during practice exercises must be representative of the population of problems and tasks trainees will be expected to tackle after instruction. As most transfer environments require task performance rather than the recall of facts, practice must follow a demonstration or worked example and require the application of the demonstrated procedure to complete a task and/or solve a problem. Rosenshine24 has also described the research base supporting guided practice and has provided guidelines for constructing demonstration and practice exercises, such as the following "17 Principles for Effective Instruction" (p. 19):
(1) Begin a lesson with a short review of previous learning.
(2) Present new material in small steps with student practice after each step.
(3) Limit the amount of material students receive at one time.
(4) Give clear and detailed instructions and explanations.
(5) Ask a large number of questions and check for understanding.
(6) Provide a high level of active practice for all students.
(7) Guide students as they begin to practice.
(8) Think aloud and model steps.
(9) Provide models of worked out problems.
(10) Ask students to explain what they have learned.
(11) Check the responses of all students.
(12) Provide systematic feedback and corrections.
(13) Use more time to provide explanations.
(14) Provide many examples.

(15) Reteach material when necessary.
(16) Prepare students for independent practice.
(17) Monitor students when they begin independent practice.
With elements of Rosenshine's 17 principles and Merrill's knowledge types in place, a "blueprint" of the simulation is available to support all of the stages in the development of computer or live simulations. We turn next to a discussion of the ways that simulation software can incorporate this kind of design and support learning, practice, and transfer.
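As a small, purely illustrative check of such a blueprint, the sketch below encodes the distinction drawn above between simulators that teach (Merrill items 1 to 3) and simulators that only provide practice (items 4 and 5). The feature labels are paraphrases, and the function names are our own.

```python
# Paraphrases of the five evidence-based features listed above (after Merrill).
MERRILL_FEATURES = {
    1: "realistic field-based problems to solve",
    2: "analogies and examples linking prior knowledge to new learning",
    3: "clear, complete demonstrations of key tasks and problems",
    4: "frequent application practice with corrective feedback",
    5: "part-task and whole-task practice during and after instruction",
}

def required_features(offers_instruction: bool) -> dict:
    """Items 1-3 are required when the simulator teaches new skills;
    items 4-5 when it only provides practice, feedback, testing, and transfer."""
    wanted = (1, 2, 3) if offers_instruction else (4, 5)
    return {item: MERRILL_FEATURES[item] for item in wanted}

def missing_features(blueprint_items: set, offers_instruction: bool) -> set:
    """Return the required feature numbers a given blueprint does not yet cover."""
    return set(required_features(offers_instruction)) - set(blueprint_items)

# Example: a practice-only simulator whose blueprint so far covers only item 4.
print(missing_features({4}, offers_instruction=False))   # {5}
```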

SOFTWARE TO SUPPORT THE DEVELOPMENT OF MEDICAL SIMULATIONS FOR TRAINING
Classically, the development of simulations could be divided into five largely sequential phases: requirements analysis, design, implementation, testing, and maintenance. This model of development arose from common practice in systems analysis and software development, where it was sometimes called the systems development lifecycle. It was first formally described by Royce,25 who pointed out that the approach "is risky and invites failure" (p 2). More recent approaches have emphasized rapid prototyping, using methods of successive approximation and partial products, with the goal of integrating the customers into the development process throughout. Extreme Programming, or XP,26 is a prominent example of a rapid prototyping approach to software development. In practice, many large-scale development projects, including the development of simulations for military training, are driven by the procurement process to include elements of the classic systems development lifecycle approach. When a rapid prototyping approach is used, the phases are repeated, in varying degrees of formality, for each prototype. These phases are ordinarily all carried out (except in the case of single-use simulations, which would not have a maintenance phase), whether the simulations are computer-based or are conducted by human actors.
— Requirements Analysis. This phase of development should be guided by a CTA or by a similar characterization of targeted performance goals. The CTA determines what must be observable to the learner, what actions the learner can take, what actions the learner should take, what conditions may pertain to the medical task that is being learned, and what the consequences of possible actions would be under the various conditions.
— Design. The CTA guides the selection or construction of learning experiences based on the types of knowledge or skill that are required to perform a task. In simulation contexts, practice opportunities can be provided that appropriately exercise the learner's decision-making opportunities.
— Implementation. In the case of actor-delivered simulations or simulated patients, implementation requires the development of scripts that may include conditional


branching, so that the simulated patient (and, in some cases, the simulated laboratory) understands how to appropriately respond to most of the possible actions that the learner can take in a simulated medical episode. The actors must then learn the script. In computer-based simulations, the CTA-based design guides the development of an interactive simulation that provides ways to carry out all the necessary observations during a simulated medical episode, together with ways to carry out all the actions (medical interventions, etc.) that could take place. It also guides the development of appropriate causal representations of the effects of those actions. Implementation of computer-based simulations can be accomplished using general-purpose programming languages, but there are also specialized tools for building interactive graphical simulations for training.27–36
— Testing. The implementation phase of a computer-based simulation typically includes repeated cycles of detailed design, coding, and in-house testing. Once such a simulation is determined to be "running," it is time for testing with representative learners. This phase typically reveals aspects of the user interface or the behavior of the simulation that are not well understood by some portion of the learner population. This can also happen with actor-based simulations, and the implementation phase may have to be revisited to make modifications based on experience with the actual learner population.
— Maintenance. If simulations are to be used more than once, they will require maintenance. Medical practices and procedures evolve, and medical simulations must evolve with them. Tools that support the maintenance of simulations can contribute to a cost-effective simulation life cycle.
Specialized software tools can support each of the four areas of simulation development (design, implementation, testing, and maintenance). For some of these phases of development, useful tools are already available.34,36 For other phases, new tools are called for.

SOFTWARE TOOLS FOR SIMULATION DESIGN
In an ideal world, software tools for simulation design would be tightly integrated with an effective system for CTA. The analysts producing a CTA for a medical procedure, for example, would enter information about the procedure into a database for CTAs. The CTA data would include the goals and subgoals of the procedure, the steps required to achieve each subgoal, the conditions under which each step can be taken, the alternative steps to be used when a condition is not met, and the observations that need to be made to check each condition and each step outcome. A simulation design system would tap into this database to create lists of observables that the simulation must have to support the procedure, lists

of basic actions that must be possible to carry out the steps of the procedure, and the combinations of observable values that define the conditions that are important for the procedure. At a minimum, the implementers of the simulation would then have comprehensive lists of the essential objects and states required to realistically implement the behaviors of the simulation that would be essential to carrying out the procedure that is to be learned. It would be even better, of course, if these automated outputs of the simulation design tool could be fed directly into the next-phase tool, the implementation system, to directly support the production of the simulation from the CTA-based design. We are not aware of any tools that offer this feature at this time. A tool such as iRides Author34,36 would be significantly enhanced if it automatically used elements of the work product from a CTA in the simulation development process, rather than requiring the simulation developer to manually confirm that elements identified as necessary by the CTA are in fact present in the simulation.

SOFTWARE TOOLS FOR THE IMPLEMENTATION OF SIMULATIONS
A wide variety of tools support the implementation of simulations for training. One way of characterizing these tools is in terms of their specificity of purpose. The least special-purpose implementation tools are general-purpose programming languages, such as C, LISP, Smalltalk, or Java. Specialized programming languages for developing interactive simulations are more targeted. Some languages are particularly oriented to event-based simulation models, rather than interactive training simulations; these include Simula, GPSS, and SIMSCRIPT, which are chronicled by Nance.37 There are also authoring tools that permit the specification of simulation behaviors without conventional computer programming, such as programming-by-example systems, also known as programming-by-demonstration systems.38,39 Still other authoring tools are designed to support the development of a special domain of simulations, such as surface warfare tactics simulations.40 iRides Author36,41 is a tool of intermediate specialization for creating training simulations: it supports the development of interactive graphical simulations for a variety of purposes. Example products have included simulations for air traffic ground control, C-30 missile attack countermeasures, and Navy surface warfare tactics planning.42 Because iRides Author is not specific to a particular simulation domain, it requires that developers who use it are able to think abstractly and code relational expressions that express the values of attributes as functions of the values of other attributes. Some attribute values are automatically set when users take actions, such as pressing a mouse button while pointing to a simulation object, and all the expressions that refer to those attributes are automatically evaluated when such actions are taken. The iRides simulation language is an example of a constraint-based language.43
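To give a flavor of that constraint-based style without reproducing actual iRides syntax (which we do not attempt here), the toy Python sketch below defines some attributes as expressions over other attributes; for brevity it evaluates formulas lazily on read, whereas a constraint system like the one described above propagates changes automatically when an action sets a value. All object and attribute names are invented.

```python
class ConstraintModel:
    """Toy spreadsheet-like model: an attribute is either a stored value
    (set by user actions) or a formula over other attributes."""

    def __init__(self):
        self._values = {}    # attribute name -> stored value
        self._formulas = {}  # attribute name -> function(model) -> value

    def set_value(self, name, value):
        """Called when a user action (e.g., a mouse press on an object) sets an attribute."""
        self._values[name] = value

    def define(self, name, formula):
        """Define an attribute as a relational expression over other attributes."""
        self._formulas[name] = formula

    def get(self, name):
        # The author never writes flow-of-control code; dependent values
        # simply reflect whatever the input attributes currently are.
        if name in self._formulas:
            return self._formulas[name](self)
        return self._values[name]


# Example: a simulated oxygen valve and flowmeter (hypothetical attributes).
model = ConstraintModel()
model.set_value("valve.handle_turned", True)
model.set_value("supply.pressure_kpa", 240.0)
model.define("valve.open", lambda m: m.get("valve.handle_turned"))
model.define("flowmeter.reading",
             lambda m: m.get("supply.pressure_kpa") * 0.1 if m.get("valve.open") else 0.0)

print(model.get("flowmeter.reading"))          # 24.0
model.set_value("valve.handle_turned", False)  # a "user action"
print(model.get("flowmeter.reading"))          # 0.0 -- no explicit update code needed
```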


This constraint-based approach makes "programming" a simulation in iRides Author something like writing the formulas of a spreadsheet. Just as a spreadsheet author does not have to worry about the conditions under which a formula will be evaluated, a developer using iRides Author does not have to be concerned with the flow of control in a simulation. When a user takes an action, the flow of evaluation of relational expressions is automatic.
There are several motivations for special-purpose implementation tools for simulation development. The more adapted the tool is to a subject-matter domain, the more likely it is that development, or some significant subset of development, can be carried out by SMEs, rather than only by a professional cadre of development experts who may understand little of the target domain. This development approach has several advantages. First, the risk of communication errors is reduced, because there is no problem of programmer misunderstandings about the behavior of targeted systems. Second, required changes to the simulation may be carried out much more quickly and less expensively; instead of having to find the programmer who coded the simulation in the first place (or, worse, finding another programmer to puzzle through the first programmer's code) to make a change, an SME can directly access the relevant elements of the simulation and make the necessary changes.
A tool that is much more special-purpose than iRides Author is the TAO Sandbox.42 This tool lets Navy tactics SMEs build tactical problems that can be played out, simply by choosing maps or charts, dragging units such as ships, subs, missile bases, and airfields into position, selecting the behavior of hostile units, and, optionally, scheduling future events, such as a change in the hostile posture of an opposing unit or the appearance of a new unit in the scenario at a specified time. The domain specificity of the TAO Sandbox has made it possible for SMEs to build moderately complex tactical problems in less than an hour.40 Frequently, the most time-consuming part of such scenario construction is writing the mission briefing for the problem, rather than any of the steps that actually determine how the simulated scenario will play.
Even more important than the advantages of SME implementation is the potential that specialized implementation tools offer for automatically supporting elements of instruction in the context of the simulation. Simulations for training, unlike other software applications such as word processors, spreadsheets, or drawing programs, are intended primarily to teach about processes and procedures in simulated worlds. There are many types of user interactions specifically intended to support instruction that such simulations can use, such as "highlight" an object to direct the student's attention, "demonstrate" an action, "require" a specific action, "detect" a simulation state of pedagogical interest, and "replay" a recorded procedure. An implementation tool that is designed to produce simulations that provide these types of instructional interactions has an advantage over more generic tools, such as conventional programming languages, which require that such services be created "de novo" for every new simulation project.
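The sketch below illustrates what such a set of tool-supplied instructional interactions might look like as a programming interface. It is hypothetical; it is not Munro's published proposal for simulation services (cited later in this section), and the method names simply echo the interaction types listed above.

```python
from typing import Callable, List

class InstructionalServices:
    """Hypothetical interface of instructional interactions that an authoring or
    run-time tool could supply to every simulation, instead of each project
    re-implementing them from scratch."""

    def highlight(self, object_id: str) -> None:
        """Visually direct the student's attention to a simulation object."""
        print(f"[highlight] {object_id}")

    def demonstrate(self, action: str, object_id: str) -> None:
        """Carry out an action on the student's behalf as a demonstration."""
        print(f"[demonstrate] {action} -> {object_id}")

    def require(self, action: str, object_id: str) -> None:
        """Ask for a specific student action before the lesson proceeds."""
        print(f"[require] student must {action} {object_id}")

    def detect(self, state_name: str, predicate: Callable[[], bool]) -> bool:
        """Report whether a simulation state of pedagogical interest currently holds."""
        return predicate()

    def replay(self, journal: List[str]) -> None:
        """Replay a recorded sequence of simulation events."""
        for event in journal:
            print(f"[replay] {event}")
```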

Although efficiencies in development are one advantage of such tool-supplied capabilities, a more important advantage is that their presence will increase the probability that instruction will actually be used in the simulation context. Without such capabilities in a simulation, it is likely that instructor feedback will be the only possible source of correction. Automated instruction not only offers cost savings, it also helps to ensure that instruction is consistent and based on evidence about the kind of feedback that supports learner performance in the simulation, rather than on the feedback decisions that often vary among instructors. Reviews and meta-analyses of studies have clearly indicated that a majority of feedback strategies used by instructors either have no impact or make performance worse.44,45
In many cases, simulation-authoring tools are designed to work in conjunction with run-time tools for simulation training, as described below in the section "Software to Support the Run-time Delivery of Medical Simulations." For example, the iRides Author system produces simulation specifications that are interpreted by a run-time component, called iRides. Every simulation built with iRides Author is actually delivered by iRides.

SOFTWARE TOOLS FOR TESTING SIMULATIONS
The testing phase of simulation development can also benefit from a close relationship with the results of CTA. Testing should ensure that all the observables specified by the CTA, and all the action types, are available in the simulation. The correct portrayal of the effects of undesirable or incorrect actions in context must be ascertained, as must, of course, the accurate portrayal of the effects of correct actions. In a truly integrated CTA-based development process, one of the products of the CTA would be a test plan for determining the correctness (i.e., validity) and utility of the simulation. In the case of computer-based simulations, an integrated suite of tools could actually construct an automatically executable test of the simulation produced, based on the CTA.
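Under the assumptions of the CTA sketch given earlier (the TaskProtocol structure, which is ours and not an existing tool's), such an automatically generated test could be as simple as checking that every cue and action the protocol requires is actually exposed by the implemented simulation:

```python
def generate_simulation_test_report(protocol, simulation):
    """Check an implemented simulation against a CTA-derived protocol.
    `simulation` is assumed (hypothetically) to expose two sets:
    `observables` and `supported_actions`."""
    failures = []
    for step in protocol.steps:
        for cue in step.cues:
            if cue not in simulation.observables:
                failures.append(f"missing observable '{cue}' for step: {step.description}")
        for action in step.actions:
            if action not in simulation.supported_actions:
                failures.append(f"missing action '{action}' for step: {step.description}")
    return failures  # an empty list means the simulation covers the protocol
```

A fuller test plan would also exercise the causal behavior of the simulation (the effects of correct and incorrect actions under each condition), which the coverage checks above do not attempt.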


SOFTWARE TOOLS TO SUPPORT MAINTENANCE OF SIMULATIONS
Medicine is not a closed knowledge corpus. New diseases are discovered. New techniques of medical intervention are also developed. New instructional or assessment strategies are also discovered. The conditions under which certain medical procedures should be carried out are refined. As medical procedures evolve, simulations and simulation-based training must evolve in parallel, ideally very quickly, so that there is no significant gap between approved new practice standards and the suitability of a simulation environment for accurately representing and assessing adherence to those standards. Again, a CTA-based approach offers hope for cost-effective simulation maintenance. When a medical procedure or process is revised, the previous task analysis can most often be modified to reflect the revisions quite easily. If there is a CTA-based simulation design tool, it can provide lists of observables, actions, and relevant conditions. These lists can be compared with those produced by the previous analysis of the procedure. New observables and new actions can be added to the simulation, and new or revised conditions can be added or edited. If the analysis tool produces a test plan, that plan can be used to drive the testing/validation phase of the new implementation of the simulation.

SOFTWARE TO SUPPORT THE RUN-TIME DELIVERY OF MEDICAL SIMULATIONS
Implementation of a computer-based simulation for training can be carried out in either of two ways. First, it could produce a stand-alone executable "program," compiled in the machine language of the processor; this is the product of a conventional computer programming process. Alternatively, it could result in a "specification" that is interpreted and delivered by a software tool that can deliver a variety of training simulations. Clearly, in the second case, the run-time simulation interpreter constitutes a software tool to support run-time delivery. However, simulations produced using the conventional programming language approach can also contain software components that support simulation delivery. Most modern computer programs make copious use of "libraries" of preexisting code. Conventional code libraries exist to support interapplication communication, printing, specialized graphics packages, and so on. Possible libraries to support simulation training could include:
— Journaling packages, which write out records of student actions and other significant simulation events.
— Replay packages, which support the reading in of journal records to replay student sessions, using the simulation.
— Performance measurement packages, which report actions and events that can be evaluated during simulation sessions.
Of course, these capabilities can all be included in the second type of run-time software, the run-time interpreter. In addition, a run-time simulation interpreter can provide instructional services, such as highlighting objects and carrying out actions in a demonstration, as described above. If all training simulations supported a standard set of services for instruction, they could provide those services to instructional applications. Munro34 has proposed a universal set of such simulation services for instruction.
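As one concrete possibility for the journaling and replay packages listed above, the sketch below writes one JSON record per student action and reads the records back for replay; the file format and function names are illustrative only.

```python
import json
import time

def journal_event(path, actor, action, details=None):
    """Append one student action or other significant simulation event to a journal file."""
    record = {"time": time.time(), "actor": actor, "action": action, "details": details or {}}
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")  # one JSON record per line

def replay_journal(path, apply_event):
    """Read journal records back in order and hand each one to the simulation.
    `apply_event` stands in for whatever callback the run-time provides for replay."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            apply_event(json.loads(line))

# Example usage (hypothetical event names):
# journal_event("session.jsonl", actor="student", action="clamp_applied", details={"site": "femoral"})
# replay_journal("session.jsonl", apply_event=print)
```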

SOFTWARE TO SUPPORT ANALYSIS OF PERFORMANCE
To automatically assess learner performance in the context of a game or simulation, there must be an assessment component that observes user performance in that context and makes judgments. In traditional simulation-based training, the assessor is a human being with expertise in the required behaviors. If the evaluator were instead an automated software component, assessments could be made at lower cost and with greater consistency. Assessment and instructional interactions can take place either during a simulation session or post hoc. To avoid the negative effects of excessive cognitive load,22 simulations that require rapid reactions to an unfolding scenario sometimes provide assessments in after-action reviews or other postsimulation activities. In many cases, assessments cannot be made based on immediate in-simulation activities, but must be the result of cumulated observations.
Software can support such analysis of performance for learner assessment. The TAO Sandbox reports user actions and other performance-related events (e.g., the destruction of a friendly surface unit by a hostile submarine's torpedo) to a separate application, the CRESST Assessment Application (CAA),46 using socket-based communications. The CAA is an application developed at the Center for Research on Evaluation, Standards, and Student Testing (CRESST) at UCLA. The CAA contains a detailed model of user knowledge at a higher level of representation than actions and events. When a performance announcement is made to the CAA, it informs a Bayesian network that represents the student's exhibited competencies. Each such announcement results in modifications to the Bayes net model of student knowledge about the domain and about how to carry out procedures effectively. A third software application, the "CAA Monitor," tracks the CAA's changes to the Bayes net and relates those changes to performance events. This application can chart changes in student knowledge as students perform in one or a whole series of simulation training sessions.
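The sketch below caricatures that reporting-and-updating pattern. The CAA itself maintains a Bayesian network over competencies; the crude running-average update here is only a stand-in to show the data flow from performance events to a student model, and every name in it is invented.

```python
class ToyCompetencyModel:
    """Toy student model: each competency holds an estimate in [0, 1] that is
    nudged toward 1 on successes and toward 0 on failures. A real assessment
    application would instead update a Bayesian network."""

    def __init__(self, competencies, prior=0.5, step_size=0.2):
        self.estimates = {name: prior for name in competencies}
        self.step_size = step_size

    def announce(self, competency, success):
        """Process one performance announcement reported by the simulation."""
        target = 1.0 if success else 0.0
        current = self.estimates[competency]
        self.estimates[competency] = current + self.step_size * (target - current)
        return self.estimates[competency]


# Example: events a tactics simulation might report (the event-to-competency mapping is invented).
model = ToyCompetencyModel(["contact_classification", "asset_protection"])
model.announce("contact_classification", success=True)
model.announce("asset_protection", success=False)  # e.g., a friendly unit was lost
print(model.estimates)
```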


FUTURE DIRECTIONS
It is critical that we implement a collaborative team approach to the future development of health care simulations. The application of evidence-based training and focused CTA methods with the most successful health care SMEs will produce more complete and accurate representations of the expert cognitive and behavioral strategies used in complex and challenging health care settings. This information will guide the development of simulations that provide efficient and effective practice of the medical skills acquired during training. There is also scope for research on tools that will better integrate the CTA process with simulation development. With appropriate software support, CTA-based strategies will provide vital data that could be imported into the simulation authoring environment to guide the production of simulations that support effective learning experiences. Simulators provide an environment for health care professionals to repeatedly and deliberately practice new skills while receiving immediate, corrective feedback based on the desired expert performance captured by the CTA and included in the instructional design embedded in a simulation. Current medical simulators largely focus on incomplete descriptions of technical skills, since about 70% of critical decisions are missing. Future research should focus on the use of CTA to expose the critical decision-making skills necessary to perform complex health care tasks. Incorporating "gold standard" CTAs as the basis for teaching and assessing technical and decision-making skills should also be examined.

ACKNOWLEDGMENTS

The Office of Naval Research supported the development of iRides, iRides Author, and the TAO Sandbox (N00014-02-1-0179, N00014-06-1-0711, N00014-08-1-0126, N00014-08-C-0563, N00014-09-C-0813, N00014-10C-0265). Research on the use of Cognitive Task Analysis and Simulators was supported, in part, by the U.S. Army Medical Research and Materiel Command (W81XWH-04-C-0093) and the U.S. Army Training and Doctrine Command (W911NF-04-D-0005). The work reported herein was also partially supported by a grant from the Office of Naval Research, Award Number N00014-10-1-0978.

REFERENCES
1. Halsted WS: The training of the surgeon. Bulletin of the Johns Hopkins Hospital 1904; 15: 267–76.
2. Kirschner P, Sweller J, Clark RE: Why minimally guided learning does not work: an analysis of the failure of discovery learning, problem-based learning, experiential learning and inquiry-based learning. Educ Psychol 2006; 41(2): 75–86.
3. Bell RH: Surgical council on resident education: a new organization devoted to graduate surgical education. J Am Coll Surg 2007; 204(3): 341–6.
4. Reznick RK, MacRae H: Teaching surgical skills: changes in the wind. N Engl J Med 2006; 355(25): 2664–9.
5. Bradley P: The history of simulation in medical education and possible future directions. Med Educ 2006; 40(3): 254–62.
6. McGaghie WC, Issenberg SB, Petrusa ER, Scalese RJ: A critical review of simulation-based medical education research: 2003–2009. Med Educ 2010; 44: 50–63.
7. Clark RE, Pugh CM, Yates K, Sullivan M: Use of cognitive task analysis and simulators for after action review of medical events in Iraq (Technical Report 5-21-2008). Los Angeles, Center for Cognitive Technology, Rossier School of Education, University of Southern California, 2008. Available at http://cogtech.usc.edu/publications/clark_etal_surgical_aar_2008_06_21.pdf; accessed June 3, 2013.
8. Clark RE, Elen J: When less is more: research and theory insights about instruction for complex learning. In: Handling Complexity in Learning Environments: Theory and Research. Edited by Elen J, Clark RE. Oxford, UK, Elsevier Science Limited, 2006.
9. Sullivan ME, Yates KA, Baker CJ, Clark RE: Cognitive task analysis and its role in teaching technical skills. In: Textbook of Simulation, Skills and Team Training. Edited by Tsueda S, Scott D, Jones D. Woodbury, CT, Cine-Med, 2010.
10. Clark RE: Cognitive task analysis for expert-based instruction in healthcare. In: Handbook of Research on Educational Communications and Technology, Ed 4, pp 541–51. Edited by Spector JM, Merrill MD, Elen J, Bishop MJ. New York, Springer, 2014.
11. Velmahos GC, Toutouzas KG, Sillin LF, et al: Cognitive task analysis for teaching technical skills in an inanimate surgical skills laboratory. Am J Surg 2004; 18: 114–9.
12. Clark RE, Estes F: Cognitive task analysis. Int J Educ Res 1996; 25(5): 403–17.
13. Yates KA: Towards a taxonomy of cognitive task analysis methods: a search for cognition and task analysis interactions. Doctoral dissertation, University of Southern California, 2007. Available at http://www.cogtech.usc.edu/publications/yates_dissertation_2007.pdf; accessed May 6, 2013.
14. Catrambone R: Task analysis by problem solving (TAPS): uncovering expert knowledge to develop high-quality instructional materials and training. Paper presented at the 2011 Learning and Technology Symposium, Columbus, GA, 2011. Available at http://cunningham.columbusstate.edu/technologysymposium/docs/Catrambone%20white%20paper.pdf; accessed May 6, 2013.

15. Lipshitz R, Klein G, Orasanu J, Salas E: Taking stock of naturalistic decision making. J Behav Decis Mak 2001; 14: 331–52.
16. Feldon DF, Clark RE: Instructional implications of cognitive task analysis as a method for improving the accuracy of experts' self-report. In: Avoiding Simplicity, Confronting Complexity: Advances in Studying and Designing (Computer-Based) Powerful Learning Environments, pp 109–16. Edited by Clarebout G, Elen J. Rotterdam, The Netherlands, Sense Publishers, 2006.
17. Chao CJ, Salvendy G: Percentage of procedural knowledge acquired as a function of the number of experts from whom knowledge is acquired for diagnosis, debugging and interpretation tasks. Int J Hum Comput Interact 1994; 6: 221–33.
18. Crispen PD: Identifying the point of diminishing marginal utility for cognitive task analysis: surgical subject matter expert interviews. Doctoral dissertation, University of Southern California, 2010. Available at http://gradworks.umi.com/3403725.pdf; accessed May 6, 2013.
19. Bartholio CW: The use of cognitive task analysis to investigate how many experts must be interviewed to acquire the critical information needed to perform a central venous catheter placement. Doctoral dissertation, University of Southern California, 2010. Available at http://digitallibrary.usc.edu/cdm/ref/collection/p15799coll127/id/385767; accessed May 6, 2013.
20. Ericsson KA: Deliberate practice and the acquisition and maintenance of expert performance in medicine and related domains. Acad Med 2004; 79(10): S1–S12.
21. Sweller J, Clark RE, Kirschner PA: Why minimally guided teaching techniques do not work: a reply to commentaries. J Educ Psychol 2007; 43(2): 115–21.
22. van Merriënboer JJG, Sweller J: Cognitive load theory in health professional education: design principles and strategies. Med Educ 2010; 44: 85–93.
23. Merrill MD: Hypothesized performance on complex tasks as a function of scaled instructional strategies. In: Handling Complexity in Learning Environments: Research and Theory, pp 265–82. Edited by Elen J, Clark RE. Oxford, UK, Elsevier Science Limited, 2006.
24. Rosenshine B: Principles of instruction: research-based strategies that all teachers should know. American Educator 2012; 12–9.
25. Royce WW: Managing the development of large software systems. In: Proceedings, IEEE WESCON, pp 1–9, August 1970. Reprinted in: ICSE '87 Proceedings of the 9th International Conference on Software Engineering, pp 328–38. Los Alamitos, CA, IEEE Computer Society Press, 1987. Available at http://www.cs.umd.edu/class/spring2003/cmsc838p/Process/waterfall.pdf; accessed May 6, 2013.
26. Beck K: Embracing change with extreme programming. IEEE Computer 1999; 70–7.
27. Towne DM, Munro A: The intelligent maintenance training system. In: Simulators IV, pp 277–84. Edited by Fairchild BT. San Diego, CA, Society for Computer Simulation, 1987.
28. Towne DM, Munro A: The intelligent maintenance training system. In: Intelligent Tutoring Systems: Lessons Learned. Edited by Psotka J, Massey LD, Mutter SA. Hillsdale, NJ, Erlbaum, 1988.
29. Towne DM, Munro A: Simulation-based instruction of technical skills. Hum Factors 1991; 33: 325–41.
30. Towne DM, Munro A: Two approaches to simulation composition for training. In: Intelligent Instruction by Computer: Theory and Practice. Edited by Farr M, Psotka J. London, Taylor and Francis, 1992.
31. Towne DM, Munro A, Pizzini QA, Surmon DS, Coller LD, Wogulis JL: Model-building tools for simulation-based training. Interact Learn Environ 1990; 1: 33–50.
32. Munro A, Towne DM: Productivity tools for simulation centered training development. Educ Technol Res Dev 1992; 40: 65–80.
33. Munro A: Authoring interactive graphical models. In: The Use of Computer Models for Explication, Analysis and Experiential Learning. Edited by de Jong T, Towne DM, Spada H. New York, Springer Verlag, 1994.


34. Munro A: Foundations for software support of instruction in game contexts. In: Computer Games and Team and Individual Learning, pp 55–74. Edited by O'Neil HF, Perez RS. Amsterdam, Elsevier, 2008.
35. Munro A, Breaux R, Patrey J, Sheldon B: Cognitive aspects of virtual environments design. In: Handbook of Virtual Environments. Edited by Stanney K. Mahwah, NJ, Erlbaum, 2002.
36. Munro A, Surmon D, Pizzini QA: Teaching procedural knowledge in distance learning environments. In: Web-Based Learning: Theory, Research, and Practice. Edited by Perez RS, O'Neil HF. Mahwah, NJ, Erlbaum, 2006.
37. Nance RE: A history of discrete event simulation programming languages. Technical Report 93-21. Blacksburg, VA, Department of Computer Science, Virginia Polytechnic Institute and State University, 1993. Available at http://eprints.cs.vt.edu/archive/00000363/01/TR-93-21.pdf; accessed May 6, 2013.
38. Cypher A: Watch What I Do: Programming by Demonstration. Cambridge, MA, MIT Press, 1993. Available at http://acypher.com/wwid/; accessed May 6, 2013.
39. Lieberman H: Your Wish is My Command: Programming by Example. San Francisco, Morgan Kaufmann, 2001.


40. Munro A, Pizzini QA, Bewley W: Learning anti-submarine warfare in the context of a game-like tactical planner. In: Proceedings of the Interservice/Industry Training, Simulation & Education Conference (I/ITSEC), 2009. Available at http://ntsa.metapress.com/link.asp?id=u1q691546361167p; accessed May 6, 2013.
41. Munro A, Pizzini Q, Johnson MC: The iRides Simulation Language: Authored Simulations for Distance Learning. Behavioral Technology Lab Final Technical Report, University of Southern California, 2004.
42. Munro A, Pizzini Q: The TAO Sandbox Instructor Guide. Working paper, Center for Cognitive Technology, University of Southern California, 2011.
43. Borning A: The programming language aspects of ThingLab, a constraint-oriented simulation laboratory. ACM Trans Program Lang Syst 1981; 3(4): 353–87.
44. Kluger A, DeNisi A: Feedback interventions: toward the understanding of a double-edged sword. Curr Dir Psychol Sci 1998; 7(3): 67–72.
45. Shute V: Focus on formative feedback. Rev Educ Res 2008; 78(1): 153–89.
46. Munro A, Surmon D, Koenig A, Iseli M, Lee J, Bewley W: Detailed modeling of student knowledge in a simulation context. In: Advances in Applied Modeling and Simulation, pp 212–21. Edited by Duffy V. Boca Raton, FL, CRC Press, Taylor and Francis, 2012.


MILITARY MEDICINE, 178, 10:15, 2013

Using Cognitive Task Analysis to Develop Simulation-Based Training for Medical Tasks Jan Cannon-Bowers, PhD*; Clint Bowers, PhD†; Renee Stout, PhD‡; Katrina Ricci, PhD§; COL Annette Hildabrand, AVC USA∥ ABSTRACT Pressures to increase the efficacy and effectiveness of medical training are causing the Department of Defense to investigate the use of simulation technologies. This article describes a comprehensive cognitive task analysis technique that can be used to simultaneously generate training requirements, performance metrics, scenario requirements, and simulator/simulation requirements for medical tasks. On the basis of a variety of existing techniques, we developed a scenario-based approach that asks experts to perform the targeted task multiple times, with each pass probing a different dimension of the training development process. In contrast to many cognitive task analysis approaches, we argue that our technique can be highly cost effective because it is designed to accomplish multiple goals. The technique was pilot tested with expert instructors from a large military medical training command. These instructors were employed to generate requirements for two selected combat casualty care tasks—cricothyroidotomy and hemorrhage control. Results indicated that the technique is feasible to use and generates usable data to inform simulation-based training system design.

*USF Health, Center for Advanced Medical Learning and Simulation, University of South Florida, 12901 Bruce B. Downs Boulevard, MDC 46, Tampa, FL 33612.
†Department of Psychology, University of Central Florida, 4000 Central Florida Boulevard, Orlando, FL 32816.
‡Renee Stout, Inc., 2630 Fallbrook Drive, Oviedo, FL 32765.
§Naval Air Warfare Center Training Systems Division, 12350 Research Parkway, Orlando, FL 32826.
∥DoD Animal Use Programs, Human Performance Training and Biosystems, Office of the Secretary of Defense for Research and Engineering, 4800 Mark Center Drive, Suite 17E08, Alexandria, VA 22350-3600.
The views expressed in this article are those of the authors and do not necessarily represent the views or official position of the organizations with which they are affiliated, including the Department of Defense, the Department of the Navy, or the Naval Air Warfare Center Training Systems Division. The findings and opinions expressed here do not necessarily reflect the positions or policies of the Office of Naval Research. doi: 10.7205/MILMED-D-13-00211

INTRODUCTION
The importance to the Department of Defense of a well-trained and prepared military medical capability is unquestionable. However, compared with training civilian health care professionals (which is itself highly complex and costly), meeting the training needs imposed by military medicine presents several significant challenges. First, military medical training occurs across a wider spectrum of learners, with heavy reliance on non-physician paraprofessionals as the first-line caregiver. Second, these caregivers are often required to perform life-saving surgical procedures in uncertain, stressful environments. Third, the length of the available training period is often not optimal: given the military's need for skilled caregivers, training must often occur in a very compressed time frame. This places a high priority on training approaches that can efficiently prepare personnel to be effective in the field. To respond to these challenges, military medical training has relied on skill-based training of critical procedures. Such


training emphasizes the ability to learn and perform a series of specific procedures that comprise a critical life-saving task. This is often accomplished through hands-on practice in increasingly realistic simulation settings. Attempts to prioritize potentially high-impact interventions are needed. That is, "which tasks are most likely to benefit from simulation-based training?" and "what needs to be done to advance simulations so that they are most effective?" are the questions that need to be addressed. There is also a need to develop and validate surrogate measures of performance that can serve as criterion measures for determining whether effective training can be accomplished. Toward that end, the present study used cognitive task analysis (CTA) techniques to identify important task cues associated with two critical combat medical procedures. By specifying these cues, we believe we can create a set of standards against which simulations can be compared, allowing estimates of their potential effectiveness for use in training and assessing critical medical procedures. Moreover, the CTA results (once validated) provide a set of critical task cues that should be included in any simulation-based training system. Finally, we sought to determine what the likely trainee errors are on major task steps so that initial training requirements could be identified. The following sections lay out our rationale for using CTA to elucidate critical cues, errors, and simulation requirements. We then describe the specific protocol and methods we used. Finally, we present a summary of what we found for two important combat casualty care tasks (cricothyroidotomy and hemorrhage control) and discuss the implications of this work for future efforts.

BACKGROUND: CTA
Traditionally, the most common approach to describing worker behavior has been the task analysis. Task analysis


decomposes jobs into tasks and component subtasks until each activity the operator must perform is documented. The resulting analysis is then distilled into a set of training needs and approaches.1 However, as the nature of work has changed, traditional task analysis methods, which focus primarily on observable behavior, have become inadequate for the modern workplace. The key shortcoming is that traditional task analysis techniques yield no data about the thought processes involved in the work. As workers began to take on tasks that required not only physical activity but also data monitoring and analysis, decision making, problem solving, and other higher-order tasks, there arose a requirement to better understand the cues and cognitive processes that were components of successful performance.2 This gave rise to the development of techniques designed to elicit knowledge from experts. CTA is a term that refers to the use of some, or all, of these techniques, in combination with traditional task analysis methods, to fully understand the behavioral and cognitive processes needed to perform a specific job or task.2
CTA appears well suited to the study of specific combat casualty care tasks. Certain procedures performed by medical personnel, especially surgical procedures, clearly require the operator to perform a series of behavioral and cognitive tasks. In these cases, making good decisions is just as critical to performance as proper technique. Fully describing these tasks is an essential aspect of training development, and it appears that CTA can be helpful in this regard. For example, it has been shown that training needs analyses using CTA yield 35% more information than non-CTA-based analyses.3 It has also been shown that surgical laboratory courses based on CTA results lead to better outcomes for surgical students.4 In addition, CTA has been shown to be effective in identifying critical decision errors that may be targets for training.5
Nevertheless, CTA has been slow to be adopted as a training development technique in medicine. This may be due to the high costs associated with the CTA process. A CTA of a single surgical procedure typically requires several hours of an expert's time. Moreover, to ensure the reliability of the results, CTAs are often conducted with several experts. For example, it has been suggested that a minimum of four experts are required to fully articulate the subtasks of one surgical procedure.1 Finally, because CTAs are often designed for specific training purposes, they may focus on only one aspect of training, such as training needs analysis, so that additional CTAs must be repeated to accomplish other goals, such as simulator design, scenario creation, or performance measurement. Consequently, the expense and inconvenience of CTAs are often perceived as exceeding the value of the procedure.
In this article, we assert that a properly designed CTA can anticipate and satisfy the needs of the entire training spectrum. In this fashion, one can obtain the benefits of CTA without enduring the cost of multiple analyses. Specifically, we will describe four content areas that should be considered

in designing a CTA for use in military medicine: identifying training needs, developing training scenarios, developing performance metrics, and identifying simulator requirements.

CTA to Identify Training Needs
The most typical application of CTA is to elicit information for use in establishing training needs. This approach seems particularly important for medical training, given its emphasis on the apprenticeship model. Learners often obtain much of their training by observing the performance of a subject matter expert. However, although this observation may be useful for acquiring procedural skills, there is a substantial risk that learners may misunderstand the cognitive processes that cue the initiation of a procedure, or the decisions that are made during the procedure.6,7 CTA for the identification of training needs relies on the observational approaches of traditional task analysis. The most common addition is to ask experts to "think aloud" while performing the task.8 The expert's description yields information about their cognitive processes without intervention by the analyst. These protocols are often videotaped and transcribed for later analysis. Given the complexity of many military medical situations, it may be beneficial to augment the traditional think-aloud protocol to more fully understand the range of cognitive activities required for successful performance. Specifically, there appears to be a benefit to having the expert perform the task on a Human Patient Simulator (HPS) rather than simply imagining performance.9

CTA to Create Training Scenarios
Training scientists have advocated scenario-based training as a technique that is useful for acquiring complex skills and abilities.10 This type of training is particularly important for the acquisition of skills that require substantial practice to reach mastery. This approach is being adopted in medical training, especially to aid the transition from novice to expert. However, the key determinant of success in this training approach is the scenario provided to the trainee. The scenario must be created so that it (a) is appropriate to the trainee's current level of knowledge, (b) provides a reasonable level of challenge, (c) allows the practice of skills that have been taught previously, and (d) allows measurement of performance and opportunities for feedback.11 A CTA can be a rich source of information for the analyst who anticipates the need to create training scenarios. One method of obtaining information for this purpose is to use the Critical Decision Method.12 This CTA approach uses probes to have experts describe past situations (or critical incidents) in which specific skills were critical. The analyst uses probes to identify the common patterns of cues the expert uses, the conditions under which various strategies are useful, and the expectations he/she may have for a situation; these are all important to the expert's decision-making process. This information can


then be used as a basis for developing training scenarios that mimic the actual performance environment.13 Feedback is an important element in determining training outcomes in scenario-based training. In addition to creating the training scenarios, CTA can also be useful in the provision of feedback to trainees. For example, McCloskey et al14 trained instructors in some CTA procedures as a way to improve their feedback to trainees. Specifically, they trained the instructors to use probes to elicit information about the cues that were used, the interpretation of those cues, the decision-making processes that made use of them, and the manner in which they influenced the final decision. The results indicated that instructors were much more confident in their ability to provide meaningful feedback to the trainees.

CTA to Create Performance Metrics
One challenge for training professionals in this field is to assess the degree to which the trainee has acquired the knowledge, skills, and abilities targeted for training. This is especially difficult as the trainee progresses from performing observable skills to higher levels of expertise, when cognitive elements such as decision making come into play. Professionals in this area have responded to this challenge by articulating a scenario-based approach to training assessment that is similar to that applied to training. In this approach, scenarios are designed so as to elicit the targeted behaviors as evidence that the knowledge and skills have been mastered. However, there is often not a body of empirical evidence to serve as a foundation for these judgments. Therefore, we must rely on the information possessed by the task experts. Eliciting assessment information from experts is not as easy as simply asking them. It has been shown that, in the process of becoming an expert, learners tend to re-organize information about the task so that it becomes "automatic." This allows them to act very quickly, but makes it difficult to articulate the individual processes that may be part of that action.15 As such, CTA tools are often used in the process of "knowledge elicitation." This refers to the application of these tools to assist experts in articulating otherwise inaccessible knowledge.16 Knowledge elicitation techniques are critical to the use of CTA for designing assessments.
Consistent with this position, Mislevy et al17 described a process for using CTA in developing an assessment of dental assistants. The key elements of this approach are to: select a set of cases that evoke the targeted skills; obtain think-aloud protocols from performers representing a continuum of expertise levels; analyze the protocols to identify those situations and behaviors that discriminate among performance levels; and construct assessment scenarios that use those situations that are most effective in discriminating among levels of expertise. This approach differs from other CTA approaches in a few important ways. First, it places a responsibility on the examiner to select specific elements of the task for analysis. Rather

than have the performers conduct an entire task, the analyst selects only specific parts of the larger task that are presumed to discriminate among expertise levels. Mislevy et al17 accomplished this by articulating a set of subtasks based on training manuals, licensure exams, and so forth. They then had experts provide judgments about which tasks were likely to discriminate among levels of expertise. A second important difference is in the creation of the assessment scenarios. The purpose of this step is to take the tasks selected above and use them to create a reasonable, realistic evaluation scenario. This scenario must contain all of the cues necessary to allow behaviors to occur that are likely to generalize to the actual performance situation. As Mislevy et al17 point out, the scenario must include not only the critical cues, but also the "context, expectations, instructions, affordances, and constraints the employee will encounter" (p. 340). Further, the scenario must be "seeded" with events that provide an appropriate challenge to the examinee so that he/she exhibits the targeted behaviors.1 Hence, there is a need to create specific events that provide the opportunity for the examinee to display the targeted behaviors, if he/she has actually mastered them. A final important difference in the CTA approach for developing assessments is the manner in which the data are analyzed. Rather than attempting to catalog all of the necessary tasks and subtasks, the analysis phase for this purpose focuses only on those elements that discriminate between the behaviors of novices and experts. These elements are used to create an assessment instrument that will be used by trained observers. The items are limited to the key discriminating behaviors to make the scale manageable for raters. It may also be helpful to provide cues to the rater to focus their attention on important simulator events or behaviors that they should watch for.1

CTA to Identify Simulator Requirements
CTA is often used in the process of developing new systems, particularly to understand the needs of the operator for tasks with significant cognitive demands.1 Interestingly, however, there is little guidance about how to use these methods in the design of training simulators. There is clearly an opportunity to optimize simulator design by using CTA methods as part of the requirements generation process. Training simulators are designed to provide opportunities for learners to practice newly acquired knowledge, skills, and abilities in a realistic environment. To accomplish this goal, a simulator must satisfy several requirements in creating an effective practice environment. One such requirement is that the simulator provide the task cues required for trainees to select the appropriate course of action. Scientists in the area of expertise have described the development of appropriate cue-pattern associations as a key element that discriminates between experts and novices1; that is, the connection of the critical task conditions (cues) with appropriate

CTA to Develop Simulation-Based Training for Medical Tasks

responses. To achieve this, task cues must be presented at an adequate level of fidelity to allow transfer from the simulation to the actual task. Cues that are absent, or poorly presented, may not trigger appropriate behaviors when needed. Furthermore, if cues are improperly presented, there is a risk that trainees might develop incorrect associations, impeding their ability to transfer learning to the operational setting. This could lead to errors of varying magnitude in performance. Given the above, it is tempting to include every possible cue in a simulation at the highest possible level of fidelity. However, matters of cost largely prohibit this. Therefore, there is a need to identify those cues that are required for experts to make their decisions, and to identify the minimum level of fidelity required to trigger realistic behavior. In that fashion, one can achieve the most cost-effective approach for training outcomes. Using a CTA is crucial here as research suggests that some simulated elements that were thought to have importance by design engineers were perceived to have little value by learners.18 Above and beyond issues of cue fidelity, CTA is also essential for establishing the needs for new simulation technologies and capabilities. For example, Low-Beer et al19 described a CTA-based approach to teaching digital rectal examination procedures. The authors determined that the existing simulation technologies did not give the instructors the proper cues to make a determination of expertise. Consequently, they created a modified version of the simulator that allowed instructors to better see the targeted behaviors. They concluded that the new simulator not only enabled more effective evaluation, but in subsequent simulator-assisted CTAs, they identified several more behaviors that needed to be trained. The following section documents a study whose purpose was a CTA in a medical domain.

METHOD

Participants

The participants for this study were nine instructors from a military training command (for the sake of clarity, we will refer to them as "instructors" throughout the remainder of this article) where the targeted procedures are trained. These instructors all had several years of experience and substantial training in the two techniques of interest. It should be noted that, although there are some variations in the steps that specific training commands use for each procedure, we do not believe that the critical cues should be different for any particular set of steps.

Procedure

The Institutional Review Board–approved protocol lasted for 2 days, 1 day per procedure (e.g., cricothyroidotomy, hemorrhage control). Before starting the CTA, instructors completed an informed consent form. Then, for each procedure, instructors were asked to complete a think-aloud protocol while they performed the targeted procedure on an HPS. They were asked to imagine that they were doing the procedure in a combat situation. After their free response, they reviewed the procedures again while the interviewer provided specific probes to elicit the critical cues for performance. Next, instructors were asked specific questions about the existing simulator and its differences from live tissue. Instructors were then asked to perform the procedure with an emphasis on behaviors that discriminate between levels of performance in their students (e.g., novices vs. experts).

RESULTS

On the basis of our design, the categories for which we report responses and observations relevant to each of the major steps in each task were the following:
(1) Cues used to perform the step
(2) What is different or lacking in terms of cues on the HPS relevant to performing the step and/or what instructors want the HPS to have
(3) Typical trainee or student errors in performing the step
(4) What the instructors can observe when evaluating trainees' performance of the step and/or how they can observe this
(5) The decisions made in the process of performing each procedure

Cricothyroidotomy Task

Results from the cricothyroidotomy task are shown in Table I. Inspection of this table reveals first that we concluded that nine critical steps describe the task (shown in column 1). The second column indicates examples of the major cues that experts reported as crucial to each step. The next column delineates the current deficiencies in the simulator (which can also be considered requirements) used to train this task. Several of these relate to the simulated tissue (skin, membranes), absence of blood, and lack of realism in several other aspects of the task. The fourth column indicates typical errors that novices are likely to make at each step. These errors can be used as a basis to assess trainee performance and to specify training content. The fifth column displays the observable behaviors at each step. This information can be used as the basis to develop a performance assessment tool or observational protocol (e.g., checklist) that can be used by instructors or raters. Finally, there is a description of the decision making demands. For example, returning to the first step of the cricothyroidotomy task (controlling the trachea), results indicated that it is important for trainees to understand the necessity for grasping the trachea between the thumb and middle finger of their non-dominant hand to free their index finger to palpate for landmarks, and to understand that if the trachea is not properly secured it will likely move and they can lose their landmark. Results also showed that simulations designed to train this step should allow trainees to actually grasp the trachea in this manner, include an average trachea width and variations to instruct proper grasping of the trachea, and allow
the trachea to move if not properly controlled. Results further indicated that metrics could include determining whether the trainee had grasped the trachea between the thumb and middle finger of her non-dominant hand, had lost her grasp, or had allowed the trachea to move.

TABLE I. CTA Results for the Cricothyroidotomy Task

Major Step: Control Trachea
Cues Used to Perform: Tactile cues from thyroid cartilage: grasp with nondominant hand between thumb and middle finger and secure it from moving
Simulator Deficiencies or Requirements: Simulated skin is too rigid; thyroid rings are sometimes missing; cannot actually grasp trachea and trachea does not move; include trachea of varying widths
Typical Errors: Insufficient control of trachea; not grasping it between thumb and middle finger of non-dominant hand to free index finger to palpate
Observable Trainee Behaviors: Trachea movement because of releasing grip or inadequate grip; vertical location of grip
Decision Making Demands: Trachea control adequate; release trachea

Major Step: Palpate for Landmarks
Cues Used to Perform: Tactile cues of depression over cricothyroid membrane; feeling the groove below a notch (point of Adam's apple or thyroid cartilage)
Simulator Deficiencies or Requirements: Simulator skin is too rigid; "membrane" opening is too large, causing the depression to feel too large; missing anatomical landmarks (e.g., trachea rings)
Typical Errors: Missing landmarks; misinterpreting "false" landmarks; incorrect use of finger for palpation (e.g., using two hands to palpate); feeling too high or too low on trachea
Observable Trainee Behaviors: Finger dip into depression over cricothyroid membrane; index finger usage and vertical location on trachea
Decision Making Demands: Interpretation of physical cues

Major Step: Make Vertical Incision
Cues Used to Perform: Visual cues from opening of skin layers; seeing underlying structures; tactile cues of underlying structures; tactile cues of scalpel blade and visual cues of blade's depth
Simulator Deficiencies or Requirements: Simulator skin is difficult to incise and causes too much resistance; skin does not open in proper layers; absence of underlying structures and blood
Typical Errors: "Freezing up"; too timid or aggressive with incision; cutting too shallow or too deep; incorrect scalpel grip and use of tip of blade vs. belly of blade; cutting too high or too low
Observable Trainee Behaviors: Location of incision; depth of incision; scalpel grip, angle, and whether or not tip or belly of blade is used
Decision Making Demands: Location of incision; choice of scalpel technique

Major Step: Reacquire Landmark (Cricothyroid Membrane)
Cues Used to Perform: Visual cue of seeing finger in incision; tactile cue of palpating cricothyroid membrane, such as graininess
Simulator Deficiencies or Requirements: Simulator membrane does not look or feel like actual membrane
Typical Errors: No results obtained
Observable Trainee Behaviors: Whether or not landmark is reacquired; placement of finger in incision
Decision Making Demands: Reacquisition of membrane successful

Major Step: Penetrate Cricothyroid Membrane
Cues Used to Perform: Tactile cue of "poking through" membrane; feeling of bone resistance if cutting horizontally vs. just poking through
Simulator Deficiencies or Requirements: Lack of supporting structures reduces realism; absence of redundant vasculature, vocal cords, and blood; absence of bone resistance
Typical Errors: Severing surrounding blood vessels or vocal cords
Observable Trainee Behaviors: Depth and location of "poke"; blade angle; width of horizontal movement
Decision Making Demands: Location and depth of incision

Major Step: Dilate Area
Cues Used to Perform: Visual cues from opening with hemostat or scalpel; tactile cues of feeling the area is opening
Simulator Deficiencies or Requirements: No results obtained
Typical Errors: Losing control of incision by removing scalpel before securing it with hemostats or a cric hook; tearing of ET tube balloon with tools
Observable Trainee Behaviors: Proper use and location of tools; proper incision control
Decision Making Demands: No results obtained

Major Step: Insert ET Tube
Cues Used to Perform: Tactile and kinesthetic cues when inserting ET tube, including guiding it with tools
Simulator Deficiencies or Requirements: Simulator imposes much more resistance than live patient; lack of natural lubrication (e.g., saliva, mucous, blood)
Typical Errors: Improper cric hook use, including orienting it toward the chin; excessive force on or twisting of ET tube; improper location of tube, including going subcutaneous
Observable Trainee Behaviors: Proper use and orientation of hook; pulling hook to make anatomical structures more pronounced; rotation and position of ET tube
Decision Making Demands: Tube successfully placed

Major Step: Inflate Cuff
Cues Used to Perform: Visual and tactile connection of syringe and cuff; seeing that syringe is at proper cc line; seeing that it is removed
Simulator Deficiencies or Requirements: Simulator cuff leaks more than in human because of inadequate seal
Typical Errors: Failing to make actual connection; leaving syringe attached, allowing leakage and deflation of cuff
Observable Trainee Behaviors: Number of cc's inserted; checking of rigidity of cuff; detachment of syringe
Decision Making Demands: Cuff appropriately rigid

Major Step: Bag Patient; Assess Respirations and Epigastric Noises
Cues Used to Perform: Tactile cues indicate proper attachment; visual signs of misting in tube; chest sounds; seeing rising and falling of chest
Simulator Deficiencies or Requirements: Absence of misting; improper pressure on cuff; absence of chest sounds and movement
Typical Errors: Insufficient force when squeezing tube; squeezing at incorrect rate; excessive pressure on BVM when attaching to ET tube
Observable Trainee Behaviors: Proper attachment; proper squeezing
Decision Making Demands: Interpretation of cues

TABLE II. CTA Results for the Hemorrhage Control Task

Major Step: Locate Source of Bleeding
Cues Used to Perform: Visual appearance of wound, including depth and seeing severed artery along long bone; feel of wound, including depth and feeling severed artery and long bone; blood flow, including spurting vs. oozing, and rate of cavity filling; color and viscosity of blood
Simulator Deficiencies or Requirements: Absence of realistic artery and long bones; absence of realistic blood flow, including rate, spurting, and pooling; absence of realistic blood characteristics, including color and viscosity
Typical Errors: Failure to locate source of bleeding; improper wound cleaning; inadequate application of pressure because of incorrect location or insufficient force
Observable Trainee Behaviors: Location of pressure being applied; whether blood flow stopped; wound cleaning and drying; direction of palpation applied
Decision Making Demands: Interpretation of cues

Major Step: Pack Wound
Cues Used to Perform: Tactile cue of finger against bleeding source; feeling of pressure as gauze fills cavity; feeling of intentionally packing with some force; feeling of wound as spongy or like raw steak
Simulator Deficiencies or Requirements: Absence of realistic cavity and blood vessels; absence of slowing of flow of blood as gauze contacts bleeder; absence of separation of muscle and skin if packed too aggressively
Typical Errors: Failure to maintain pressure; failure to dry wound before packing; failure to make direct contact with wound; use of insufficient packing material; packing that is not aggressive enough or is too aggressive
Observable Trainee Behaviors: Whether or not gauze is in direct contact with bleeder and whether blood flow is slowed; amount of gauze used; depth of gauze; force of packing and whether or not skin and muscle are separating
Decision Making Demands: Interpretation of cues; correct force for packing; amount of packing

Major Step: Apply Pressure Dressing
Cues Used to Perform: Visual cues for placement of dressing, including whether it is placed directly centered over wound
Simulator Deficiencies or Requirements: N/A
Typical Errors: No results obtained
Observable Trainee Behaviors: Placement of dressing and whether it is applied as a ball on top of wound
Decision Making Demands: Interpretation of cues; dressing placement

Major Step: Wrap Wound and Dressing With Elastic Bandage
Cues Used to Perform: Visual and kinesthetic cues indicating tightness of bandage
Simulator Deficiencies or Requirements: Simulator skin is unrealistically non-pliable
Typical Errors: Failure to secure wound tightly enough because of using too long of throws when wrapping or insufficient force
Observable Trainee Behaviors: Student exertion; length of throws used; tautness of bandage and whether it is shiny
Decision Making Demands: Interpretation of cues; tightness of wrapping; amount of wrapping

Major Step: Document
Cues Used to Perform: Documentation of type of gauze used and whether or not wound is bleeding through tape by writing on or marking tape
Simulator Deficiencies or Requirements: N/A
Typical Errors: Improper or incomplete documentation
Observable Trainee Behaviors: Improper or incomplete documentation
Decision Making Demands: N/A

Hemorrhage Control Task

Results from the hemorrhage control task are shown in Table II. Consistent with Table I, the major steps in the hemorrhage control task appear in the first column, followed by cues needed to perform the step and implications for
simulation, training, and measurement. These results can be used to determine training requirements, inform the design of simulations necessary to train the procedure, and develop assessment protocols, as can the results for the cricothyroidotomy task. For example, regarding the first step of the hemorrhage control task (locating the source of bleeding), results indicated that trainees need to understand the anatomy of the body and how an arterial bleed will tend to follow the long bones; that it will be deeper than a venous bleed; that it will flow more rapidly and more quickly fill the cavity; and that it will actually spurt from the severed artery, which can be palpated and sometimes seen. Results also showed that simulations designed to train this step should include long bones and the bleeding artery near it; a proper flow of blood; spurting blood; and a rapidly filling cavity. Results further indicated that metrics could include determining whether the trainee: had applied adequate pressure to stop the flow of blood, had cleaned and dried the wound to allow visual inspection, or had palpated in the correct direction to feel the source of the bleed.

CONCLUSIONS

Developing effective training and assessment for combat casualty care tasks is an imperative. Given the cognitive and behavioral demands of such tasks, this can be a complex and expensive undertaking. CTA techniques hold promise as a means to develop a comprehensive understanding of the major steps to be performed in a task, along with detailed information about how best to train those tasks, and important insights into how to develop sound metrics for them. On the basis of past work into CTA and our experience in conducting one for the combat casualty care tasks discussed above, we offer the following guidelines for conducting a comprehensive CTA that enables multiple goals to be accomplished.
(1) Select multiple subject matter experts (SMEs) (instructors are an excellent source) who can give detailed accounts of task/training requirements. Use an HPS as a reference point for SMEs to recount how the task is performed.
(2) Elicit the major tasks and subtasks (steps) needed to accomplish the task using a think-aloud protocol.
(3) Have SMEs perform the task again, using probe questions to elicit critical cues at each step.
(4) Ask specific questions pertaining to deficiencies in simulation and other training devices.
(5) Elicit specific errors likely to be made by novices at each step and determine what specific observable behaviors typify each step.

ACKNOWLEDGMENTS

This work was funded by the Naval Air Warfare Center Training Systems Division, Orlando, Florida, under contract PO SEA1333. The work reported herein was also partially supported by a grant from the Office of Naval Research, Award Number N00014-10-1-0978.

REFERENCES 1. Crandall B, Klein G, Hoffman R: Working Minds: A Practitioner’s Guide to Cognitive Task Analysis. Cambridge, MA, MIT Press, 2006. 2. Cooke NJ: Varieties of knowledge elicitation techniques. Int J Hum Comput Stud 1994; 41: 801–49. 3. Clark RE, Estes F: Cognitive task analysis for training. Int J Educ Res 1996; 25(5): 403–17. 4. Velmahos GC, Toutouzas KG, Sillin LF, et al: Cognitive task analysis for teaching technical skills in an inanimate surgical skills laboratory. Am J Surg 2004; 187(1): 114–9. 5. Craig C, Klein MI, Griswold J, Gaitonde K, McGill T, Halldorsson A: Using cognitive task analysis to identify critical decisions in the laparoscopic environment. Hum Factors 2012; 54(3): 1–25. 6. Johnson S, Healy A, Evans J, Murphy M, Crawshaw M, Gould D: Physical and cognitive task analysis in interventional radiology. Clin Radiology 2006; 61: 97–103. 7. Yates K, Sullivan M, Clark R: Integrated studies on the use of cognitive task analysis to capture surgical expertise for central venous catheter placement and open cricothyrotomy. Am J Surg 2012; 203(1): 76–80. 8. Ericsson KA, Simon HA: Verbal Protocols: Verbal Reports as Data. Cambridge, MA, MIT Press, 1993. 9. Clark RE: The Use of Cognitive Task Analysis and Simulators for the After Action Review of Medical Events in Iraq. University of Southern California Technical Report, DTIC Accession Number ADA466686. Los Angeles, CA, University of Southern California, 2005. Available at http://www.dtic.mil/cgi-bin/GetTRDoc?Location=U2&doc=GetTRDoc .pdf&AD=ADA466686; accessed May 3, 2013. 10. Cannon-Bowers JA: Recent advances in scenario-based training for medical education. Curr Opin Anaesthesiol 2008; 21(6): 784–9. 11. Salas E, Wilson KA, Burke CS, Priest HA: Using simulation-based training to improve patient safety: what does it take? Jt Comm J Qual Patient Saf 2005; 31(7): 363–71. 12. Klein G, Calderwood R, MacGregor D: Critical decision method for eliciting knowledge. IEEE Trans Syst Man Cybern 1989; 19(3): 462–72. 13. Fowlkes J, Dwyer DJ, Oser RL, Salas E: Event-based approach to training (EBAT). Int J Aviat Psychol 1998; 8(3): 209–21. 14. Phillips J, McDermott PL, Thordsen M, McCloskey M, Klein G: Cognitive requirements for small unit leaders in military operations in urban terrain (No. ARI-RR-1728). Fairborn, OH, Klein Associates, 1998. Available at http://www.dtic.mil/dtic/tr/fulltext/u2/a355505.pdf; accessed May 8, 2013. 15. Glaser R: Expertise and learning: how do we think about instructional processes now that we have discovered knowledge structures? In:. Complex Information Processing: The Impact of Herbert A. Simon. Edited by Simon H, Klahr D, Kotovsky K. NY, Psychology Press, 1999. 16. Diaper D: Knowledge Elicitation: Principle, Techniques, & Applications. NY, Springer-Verlag, 1989. 17. Mislevy RJ, Steinberg LS, Breyer FJ, Almond RG, Johnson L: A cognitive task analysis with implications for designing simulationbased performance assessment. Comput Human Behav 1999; 15(3–4): 335–74. 18. Grant T, McNeil MA, Luo X: Absolute and relative value of patient simulator features as perceived by medical undergraduates. Simul Healthc 2008; 3(3): 133–7. 19. Low-Beer N, Kinnison T, Baillie S, Bello F, Kneebone R, Higham J: Hidden practice revealed: using task analysis and novel simulator design to evaluate the teaching of digital rectal examination. Am J Surg 2011; 201(1): 46–53.


MILITARY MEDICINE, 178, 10:22, 2013

Use of Cognitive Task Analysis to Guide the Development of Performance-Based Assessments for Intraoperative Decision Making

Carla M. Pugh, MD, PhD*; Debra A. DaRosa, PhD†

ABSTRACT Background: There is a paucity of performance-based assessments that focus on intraoperative decision making. The purpose of this article is to review the performance outcomes and usefulness of two performance-based assessments that were developed using cognitive task analysis (CTA) frameworks. Methods: Assessment-A used CTA to create a "think aloud" oral examination that was administered while junior residents (PGY 1-2's, N = 69) performed a porcine-based laparoscopic cholecystectomy. Assessment-B used CTA to create a simulation-based, formative assessment of senior residents' (PGY 4-5's, N = 29) decision making during a laparoscopic ventral hernia repair. In addition to survey-based assessments of usefulness, a multiconstruct evaluation was performed using eight variables. Results: When comparing performance outcomes, both approaches revealed major deficiencies in residents' intraoperative decision-making skills. Multiconstruct evaluation of the two CTA approaches revealed assessment method advantages for five of the eight evaluation areas: (1) Cognitive Complexity, (2) Content Quality, (3) Content Coverage, (4) Meaningfulness, and (5) Transfer and Generalizability. Conclusions: The two CTA performance assessments were useful in identifying significant training needs. While there are pros and cons to each approach, the results serve as a useful blueprint for program directors seeking to develop performance-based assessments for intraoperative decision making.

INTRODUCTION There is a need to assess intraoperative decision making. However, there is a paucity of training curricula and performance assessments addressing this important aspect of surgical skill. Previous work on intraoperative decision making has largely focused on surgical outcomes relating to procedure choices or technical approaches.1–3 In the article, “Intra-operative decision making in the treatment of shoulder instability,” the authors discuss surgical outcomes for patients who had an open arthrotomy versus a minimally invasive arthroscopic procedure as the treatment of choice.2 In another article, the authors discuss the use of a minimally invasive versus open approach to empyema.3 These studies address intraoperative decision making as it relates to surgical planning based on intraoperative findings. Additional work focusing on intraoperative decision making has taken advantage of cognitive psychology script theory.4 In these studies, researchers developed, implemented, and evaluated script concordance tests (SCTs) as a means of assessing operative judgment. SCTs are a form of problem/case-based pencil-and-paper examinations. However, compared to standard multiple choice questions where there is one correct answer, SCT items are based on a Likert scale. As such, there is no single correct answer, and scoring is based on deviation

from the expert range. SCTs have been used to assess clinical judgment in surgery, urology, and radiology.5–7 When used to assess intraoperative judgment, SCTs were noted to correlate with years in training but not the number of cases performed in the operating room.8,9 The complexity of intraoperative judgment calls for a variety of research methods to better understand the range and types of decisions made in the operating room.10 In addition to innovative pencil-and-paper tests like the SCTs, performance-based assessments may provide critical insight. Performance-based assessments are considered to be a more direct measure of task performance compared to oral and pencil-and-paper examinations. Pencil-and-paper examinations are indirect indicators or correlates of performance in specific task domains.11,12 In general, performance-based assessments involve the observation of trainees performing tasks in various contexts. Because of the complexity of performance-based assessments, validity must be evaluated from several different constructs. The purpose of this research project was to perform a critical, multiconstruct evaluation of two, independently developed, performance-based assessments that focus on residents' intraoperative decision-making skills.

METHODS

*Associate Professor of Surgery, Vice Chair of Education and Patient Safety, University of Wisconsin, Department of Surgery, 600 Highland Ave–CSC 785B, Madison, WI 53792.
†Vice Chair of Education, Professor of Surgery, Northwestern University Department of Surgery, 251 E. Huron St., Galter 3-150, Chicago, IL 60611.
The findings and opinions expressed here do not necessarily reflect the positions or policies of the Office of Naval Research or the United States Army Medical Research and Materiel Command.
doi: 10.7205/MILMED-D-13-00207
Methodology

This research project used a mixed-methods qualitative approach. The methods were based on case-oriented comparative research (COCR) and variable-oriented comparative research (VOCR).13 The primary goal was to compare two, newly developed, performance-based assessments that focus on intraoperative decision making. A secondary goal was to provide program directors and researchers a conceptual blueprint that would facilitate the development, implementation, and evaluation of performance-based assessments for intraoperative decision making. To develop the blueprint, comparison points (independent variables) were defined and used to guide the case-oriented comparison and operationalize blueprint components. COCR has been used for over 40 years.14–16 This research method often employs qualitative research techniques and maintains the overarching goal of discovery. The COCR is often compared with variable-oriented research, which is traditionally based in hypothesis-driven and quantitative realms. Both approaches seek causal relationships; however, the approach and communication of such relationships differs.14–16 Although implementation of both methods varies depending on context and investigator preference, a commonly stated difference is that COCR is reserved for comparing small case numbers (2–4) and that conclusions drawn from a causal perspective should be done within cases and not across cases.17 Our methods followed these guidelines.

Study Cases

Two independent research projects used cognitive task analysis (CTA) techniques to guide the development of assessments geared toward intraoperative decision making. CTA techniques are used to discover the cognitive activities and knowledge that experts use to perform complex tasks.18,19 The outputs are used as the gold standard by which nonexperts are compared.

The first independent project used CTA to create a skills lab curriculum for laparoscopic cholecystectomy. The curriculum included a pencil-and-paper test and a "think aloud" oral examination that took place while junior residents, PGY-1 and PGY-2 (N = 69), performed a porcine-based laparoscopic cholecystectomy. Faculty facilitators prompted think aloud moments using a set of standardized questions that correlated to laparoscopic cholecystectomy procedural steps. The questions, as shown in Table I, were posed at the appropriate times throughout the simulated procedure and scored for accuracy and completeness on anchored scales of 1 (low) to 5 (high). Participants were asked to discuss their knowledge of the critical procedural steps (e.g., How will you decide where to place the subxiphoid trocar?), error recognition (e.g., Describe special risks of inserting a Veress needle in thin or muscular patients), and error management (e.g., List potential complications associated with port sites and describe how to prevent them).20

TABLE I. Standardized Questions Posed by Faculty Facilitators to Prompt "Think Aloud" Moments During the Laparoscopic Cholecystectomy Simulation

Procedural steps: Choice of Laparoscopic Approach; Initial Port Management; Placement of Subxiphoid Port; Indication for Conversion; Indication for Cholangiography; Extraction of Gallbladder; Closing of Trocar Sites

1. What factors influence the initial decision to use a laparoscopic approach for removing the gall bladder?
2. Describe the special risks of inserting the Veress needle in thin or muscular patients.
3. What structures should be inspected for injury immediately after initial trocar insertion?
4. How do you decide where to place the subxiphoid trocar?
5. What factors should determine whether a laparoscopic operation should be converted to open?
6. What are the criteria for using intraoperative cholangiography?
7. When should a specimen bag be used for extraction?
8. When should the fascia for trocar sites be closed?
9. List potential complications associated with port sites and describe how to prevent them.

The second independent project used CTA to create a simulation-based assessment. The assessment focused on decision making during a laparoscopic ventral hernia repair. This assessment was geared toward senior level residents in the fourth and fifth postgraduate years, PGY-4 and PGY-5 (N = 29). The residents were tasked with repairing a 10 cm × 10 cm ventral hernia using a newly developed, decision-based simulator (Fig. 1). Residents were expected to independently make all of the decisions necessary in completing the hernia repair using laparoscopic techniques. Operative performance was video recorded and later scored using a checklist evaluation. Outcomes were based on task errors and task completion rates. In addition, their hernia repairs were graded based on mesh placement and coverage of the hernia defect.21

FIGURE 1. The laparoscopic ventral hernia simulator.

Data Collection and Analysis

For the case-oriented comparison, descriptive data were collected regarding project goals, performance assessment development, implementation, and outcome-related details.13 These data were used to guide case comparison based on commonalities and disparities. Additional data were collected using eight variables that have previously been shown to be important in assessing validity of performance-based assessments. These variables were used as part of the variable-oriented comparison and included (a) Fairness, (b) Cognitive Complexity, (c) Content Quality, (d) Content Coverage, (e) Meaningfulness, (f) Consequences, (g) Cost and Efficiency, and (h) Transferability and Generalizability.11 Data analysis was qualitative in nature. Between-case comparisons were used to provide insight on the pros and cons of each case regarding approach and outcomes.13 Within-case comparisons and assumptions were evaluated to gain a better understanding of individual project outcomes and what this means for trainees and program directors. Student surveys and subjective evaluations based on review of individual project methods and project outcomes were used to generate summative evaluations for each of the eight variables. Analysis goals were descriptive and qualitative in nature as opposed to causally related. In essence, the independent variables served as a framework for descriptive analysis.

RESULTS

Using the case-oriented comparative method, commonalities and disparities in process and outcomes were reviewed.
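The two scoring formats compared in this study (anchored 1-to-5 ratings of the standardized think-aloud questions, and a video-review checklist of completion, decision-based errors, and mesh placement for the simulator task) can be made concrete with a minimal sketch. The sketch below is illustrative only; the class names, fields, and example entries are assumptions introduced for clarity and are not the actual instruments or data used in either project.

```python
from dataclasses import dataclass, field
from statistics import mean

# Illustrative stand-in for Assessment-A: each standardized question is rated
# on anchored 1 (low) to 5 (high) scales for accuracy and completeness.
@dataclass
class ThinkAloudRating:
    question: str
    accuracy: int       # anchored 1-5 scale
    completeness: int   # anchored 1-5 scale

def think_aloud_summary(ratings):
    """Average the anchored ratings across all prompted questions."""
    return {
        "mean_accuracy": mean(r.accuracy for r in ratings),
        "mean_completeness": mean(r.completeness for r in ratings),
    }

# Illustrative stand-in for Assessment-B: a checklist completed from video
# review, recording completion, decision-based errors, and repair quality.
@dataclass
class SimulatorChecklist:
    completed_repair: bool
    critical_errors: list = field(default_factory=list)
    mesh_covers_defect: bool = False

def simulator_outcome(c):
    return {
        "task_completed": c.completed_repair,
        "error_count": len(c.critical_errors),
        "adequate_mesh_coverage": c.mesh_covers_defect,
    }

if __name__ == "__main__":
    ratings = [ThinkAloudRating("Indication for conversion to open", 4, 3),
               ThinkAloudRating("Placement of subxiphoid port", 5, 4)]
    print(think_aloud_summary(ratings))
    print(simulator_outcome(SimulatorChecklist(
        completed_repair=False,
        critical_errors=["poor port planning", "mesh undersized for defect"])))
```

The contrast in the two record types mirrors the contrast discussed below: the oral examination yields graded ratings per question, whereas the simulator assessment yields observed events (completion, errors, mesh coverage) that are tallied rather than rated.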


Important commonalities included (1) use of performance-based assessments to assess intraoperative decision-making skills of surgical trainees; (2) use of CTA research methods as a means to guide development of the performance-based assessments; (3) use of high fidelity, hands-on tasks that allow the use of surgical instruments commonly and currently being used in today's operating rooms; and (4) use of outcome measures that require completion of scoring rubrics/checklists based on the observation of experienced raters. Important disparities included (1) experience level of the study participants, (2) content domain, (3) type of performance-based assessment, and (4) domain- and implementation-specific differences in assessment outcomes (Table II). Overall, the disparities did not outweigh the ability to perform an in-depth comparison. For the variable-guided assessment, case-related advantages included (1) Cognitive Complexity, (2) Content Quality, (3) Content Coverage, (4) Meaningfulness, and (5) Transfer and Generalizability. In contrast, both projects had some disadvantages in the three remaining evaluation areas (Table III). When comparing outcomes, both approaches revealed major deficiencies in residents' intraoperative decision-making skills. For Study A, during the think aloud exercise, residents struggled to formulate and verbalize their decisions and were not able to talk and operate at the same time. For Study B, 89% of residents made critical decision-based errors that prevented them from successfully completing the hernia repair.

DISCUSSION

There is a paucity of training curricula and performance assessments that focus on intraoperative decision making. Intraoperative decisions, related to procedure choices or technical approaches, are known to affect surgical outcomes.1–3 As such, preoperative and intraoperative judgments are critical to patient welfare. Prior works relating to the assessment of intraoperative decision making have

TABLE II. Summary of Case Commonalities and Disparities (bolded entries)
(Assessment-A = think-aloud laparoscopic cholecystectomy; Assessment-B = simulation-based laparoscopic ventral hernia repair)

Grounding Theory: A = Cognitive Task Analysis; B = Cognitive Task Analysis
Task Fidelity: A = Hands-on, With Real World Tools; B = Hands-on, With Real World Tools
Outcome Instrument: A = Observation-Based Checklist; B = Observation-Based Checklist
Task Analysis Development Goal: A = Training Curriculum, Pre- and Post-Test Assessments, Assessment Checklist; B = Decision-Making Simulators, Formative Assessment, Assessment Checklist
Content Focus: A = Laparoscopic Cholecystectomy; B = Laparoscopic Ventral Hernia Repair
Learner Group: A = PGY 1 and 2 (N = 63); B = PGY 4 and 5 (N = 29)
Decision-Making Assessment Method: A = Real-Time Think Aloud, Pencil-and-Paper Exam; B = Hands-on Simulated Task
Data Collection: A = Onsite, Multiple Faculty Raters, Pre- and Post-Test Grading; B = Video Review, Single Faculty Rater, Grading of Simulator Skins
Assessment-Related Outcome: A = Residents Experienced Difficulty in Formulating and Verbalizing Intraoperative Decisions; B = 89% of Residents Made Critical Decision-Based Errors That Prevented Task Completion


TABLE III. Variable-Oriented Comparison of the Two Cases
(Approach-1 = "Think Aloud," DaRosa et al20; Approach-2 = Simulation-Based Assessment, Pugh et al21)

Fairness: Approach-1 = Formative Assessment–Yes, Summative–No; Approach-2 = Formative Assessment–Yes, Summative–No
Cognitive Complexity: Approach-1 = Broad; Approach-2 = Broad
Content Quality: Approach-1 = PRO: Realistic Tissue, CON: Minimal Variation; Approach-2 = PRO: Wide Clinical Variation, CON: Unrealistic Tissues
Content Coverage: Approach-1 = PRO: Allows Major Steps, CON: Minimal Variation; Approach-2 = PRO: Allows Major Steps, CON: No Cautery Scenarios
Meaningfulness: Approach-1 = Resident: 4.63/5.0, SD = 0.52 (Teaching Effectiveness); Approach-2 = Resident: 4.63/5.0, SD = 0.47 (Usefulness), Faculty: 4.57/5.0, SD = 0.49 (Usefulness)
Consequences: Approach-1 = Possible Increase in Faculty and Resident Time Commitment; Approach-2 = Possible Increase in Faculty and Resident Time Commitment
Cost and Efficiency: Approach-1 = Porcine Liver, $30–$45.00, Faculty Presence Required; Approach-2 = Laparoscopic Ventral Hernia Simulator, $10.00, Faculty Presence Not Required, but Laparoscopic Camera Assistant Needed
Transfer/Generalizability: Approach-1 = Highly Likely the Assessment Can Be Used in Other Programs; Approach-2 = Highly Likely the Assessment Can Be Used in Other Programs


used SCTs.8,9 Using this method, two studies have shown no correlation between number of procedures performed in the operating room and SCT ratings. There is, however, a strong correlation to training level. Despite the reported validity and reliability of SCTs for assessing intraoperative decision making, performance-based assessments are considered to be a more direct measure of task performance. However, because of the complexity of performance-based assessments, validity must be established based on several different constructs. The purpose of this research project was to perform a critical, multiconstruct evaluation of 2 independently developed, performance-based assessments geared toward resident’s intraoperative decision-making skills. The first performance-based assessment included a think aloud exercise during a porcine-based laparoscopic cholecystectomy. This assessment involved both PGY-1 and PGY-2 residents (N = 69). Faculty facilitators prompted think aloud moments using a set of standardized questions. The results of this study revealed that residents experienced difficulty formulating and verbalizing their intraoperative decisions and were unable to talk and operate at the same time.20 The second performance-based assessment required senior level residents, PGY-4 and PGY-5 (N = 29), to repair a 10 cm 10 cm ventral hernia using a newly developed, decision-based simulator. Residents were expected to independently make all of the decisions necessary in completing the hernia repair using laparoscopic techniques. Results of this study showed that 89% of the residents made critical decision errors that prevented them from completing the hernia repair.21 Key commonalities that allowed for case comparison included (1) use of performance-based assessments to assess intraoperative decision-making skills of surgical trainees and (2) use of high fidelity, hands-on tasks that allow the use of surgical instruments commonly and currently being used in today’s operating rooms. A preferred requirement of performance-based assessments is that they are conducted in MILITARY MEDICINE, Vol. 178, October Supplement 2013

a venue that is as close to the real environment as possible.11 It is this key factor that sets performance-based assessments apart from pencil-and-paper and oral examinations and allows for a more direct evaluation of task performance skills. A variable-oriented approach was used to guide the case comparison. In 1991, Linn, Baker, and Dunbar published a review that summarized eight key factors that affect the validity of performance-based assessments: (a) Fairness, (b) Cognitive Complexity, (c) Content Quality, (d) Content Coverage, (e) Meaningfulness, (f) Consequences, (g) Cost and Efficiency, and (h) Transferability and Generalizability.11 These eight factors were used as independent variables to frame the data collection and discussion from a case comparison standpoint. These independent variables were not used to generate a hypothesis that sought effects on a common dependent variable, rather qualitative methods were used. Qualitative and descriptive analyses based on the previously listed variables showed that both approaches showed major advantages in five of the eight evaluation areas including (1) Cognitive Complexity, (2) Content Quality, (3) Content Coverage, (4) Meaningfulness, and (5) Transfer and Generalizability. In contrast, both projects showed some disadvantages in the remaining three evaluation areas (Table III). A commonality in the disadvantages related to the novelty of performance-based assessments for intraoperative decision making. Currently, there are no standard curricula in wide use. As such, it may not be fair to use a performance-based assessment on this topic as a summative evaluation in residency training. However, a formative assessment may be appropriate. An additional disadvantage includes the potentially high-time requirements for implementation of performance-based assessments. This also affects cost and efficiency. From a cognitive standpoint, both user groups clearly showed deficiencies at levels that were unexpected. Junior residents struggled to formulate and verbalize decisions while operating. However, when the performance tasks were 25


separated (performing an operation vs. verbalizing intraoperative decisions) the residents performed well. Similarly, it was not expected that 89% of senior level residents would fail to complete an operation that they had performed before in the live operating room setting. Cognitive load theory offers an explanation for both of these observations. This theory is based on the assumption that the brain has limited working memory for dynamic processing combined with partly independent processing units for visual and auditory inputs.22 At any given time, these inputs are in competition with working memory. When experiencing novel situations, with a variety of new inputs, the capacity for working memory is less and may result in an inability to multitask (junior residents unable to talk and operate at the same time) or forgetfulness (junior residents forgetting concepts they had recently verbalized or senior residents forgetting the steps of a surgical procedure when allowed to be independent for the first time). These findings underscore a key weakness in using pencil-and-paper tests to assess intraoperative decision making: the inability to dynamically assess unforeseen knowledge and performance deficits that affect cognitive load, hence task completion. For both cases, it must be underscored that there were no trick questions or rarely experienced scenarios. Table I shows the standardized questions used in the think aloud study. These questions represent basic surgical knowledge for laparoscopic cholecystectomy. However, this is new knowledge for the participants in this study (PGY 1-2), hence they are not as experienced with recalling and verbally communicating the facts. For the second study with the senior residents (PGY4-5), there were several points in the procedure where errors were committed including (1) poor laparoscopic port planning, (2) failure to repurpose a port to facilitate visualization or retraction, (3) failure to match mesh size to hernia defect, and (4) improper mesh prep on the back table. Each of these steps is critical in successful completion of a laparoscopic hernia repair. Forgetting or making errors in any one of these steps will significantly affect successful and timely completion of the repair. Although many of the residents had performed this procedure several times before, it appears that 50 cases (the highest case experience for this group of residents) over a 5-year period is not enough to reach mastery or independent performance. A major limitation of this study was access to only two performance-based assessments that focused on intraoperative decision making. Although the case-oriented comparative methods are designed to study two to four cases, limitations in causality and outcome generalizability are duly noted.13 In addition, when using a mixed-method approach, there will always be some difficulty in data fitting, analyzing, and reporting. Study strengths include topic importance and the need to reliably assess intraoperative decision making. Although more traditional assessments such as pencil-and-paper tests lend themselves to robust item analysis, performance-based assessments require a multiconstruct approach. Our findings 26

serve as a useful blueprint for developing and evaluating performance-based assessments that focus on intraoperative decision making.

ACKNOWLEDGMENTS

This research was supported in part by the Department of Defense, United States Army Medical Research and Materiel Command, USAMRMCW81XWH0710190, and the Association for Surgical Education—Center for Excellence in Surgical Education, Research and Training (CESERT) Grant Award. The work reported herein was also partially supported by a grant from the Office of Naval Research, Award Number N00014-10-1-0978.

REFERENCES 1. Portis AJ, Laliberte MA, Holtz C, Ma W, Rosenberg MS, Bretzke CA: Confident intraoperative decision making during percutaneous nephrolithotomy: does this patient need a second look? Urology 2008; 71(2): 218–22. 2. Sisto DJ, Cook DL: Intraoperative decision making in the treatment of shoulder instability. Arthroscopy 1998; 14(4): 389–94. 3. Roberts JR: Minimally invasive surgery in the treatment of empyema: intraoperative decision making. Ann Thorac Surg 2003; 76(1): 225–30. discussion 229–30. 4. Charlin B, Tardif J, Boshuizen HPA: Scripts and medical diagnostic knowledge: theory and applications for clinical reasoning instruction and research. Acad Med 2000; 75: 182–90. 5. Meterissian SH: A novel method of assessing clinical reasoning in surgical residents. Surg Innov 2006; 13: 115–9. 6. Sibert L, Darmoni SJ, Dahamna B, Hellot MF, Weber J, Charlin B: On line clinical reasoning assessment with script concordance test in urology: results of a French pilot study. BMC Med Educ 2006; 6: 45. 7. Charlin B, Brailovsky CA, Brazeau-Lamontagne L, Samson L, Van der Vleuten CP: Script questionnaires: their use for assessment of diagnostic knowledge in radiology. Med Teach 1998; 20: 567–71. 8. Park AJ, Barber MD, Bent AE, et al: Assessment of intraoperative judgment during gynecologic surgery using the Script Concordance Test. Am J Obstet Gynecol 2010; 203(3): e1–6. [Epub] PMID: 20494330. 9. Meterissian S, Zabolotny B, Gagnon R, Charlin B: Is the script concordance test a valid instrument for assessment of intraoperative decisionmaking skills? Am J Surg 2007; 193(2): 248–51. 10. Pugh CM, Santacaterina S, Darosa DA, Clark RE: Intraoperative decision making: more than meets the eye. J Biomed Inform 2011; 44(3): 486–96. [Epub] 2010 January 10. 11. Linn RL, Baker EL, Dunbar SB: Complex, performance-based assessment: expectations and validation criteria. Educ Res 1991; 20(8): 15–21. 12. Levine HG, McGuire CH, Nattress LW Jr.: The validity of multiple choice achievement tests as measures of competence in medicine. Am Educ Res J 1970; 7(1): 69–82. 13. Ragin C: Turning the tables: how case-oriented research challenges variable-oriented research. In: Rethinking Social Inquiry: Diverse Tools, Shared Standards, pp 123–38. Edited by HE Brady, D Collier. Lanham, MD, Rowman & Littlefield, 2004. 14. Ragin C, Zaret D: Theory and method in comparative research: two strategies. Social Forces 1983; 61(3): 731–54. 15. Weber M: The Methodology of the Social Sciences. New York, Free Press, 1949. 16. Jones RA: Emile Durkheim: An Introduction to Four Major Works, pp 60–81. Beverly Hills, CA, Sage Publications, Inc., 1986. 17. Ragin C: Comparative methods. In: Handbook of Social Science Methodology, pp 67–81. Edited by S Turner, W Outhwaite. Thousand Oaks, CA, Sage, 2007. 18. Klein GA, Calderwood R, MacGregor D: Critical decision method for eliciting knowledge. IEEE Trans on Syst, Man, and Cybern 1989; 19: 462–72.


Cognitive Task Analysis for the Development of Performance-Based Assessments 19. Clark RE, Estes F: Cognitive task analysis for training. Int J Educ Res 1996; 25(5): 403–17. 20. DaRosa D, Rogers DA, Williams RG, et al: Impact of a structured skills laboratory curriculum on surgery residents’ intraoperative decisionmaking and technical skills. Acad Med 2008; 83(10 Suppl): S68–71.


21. Pugh CM, Plachta S, Auyang E, Pryor A, Hungness E: Outcome measures for surgical simulators: is the focus on technical skills the best approach? Surgery 2010; 147(5): 646–54. 22. Chandler P, Sweller J: Cognitive load theory and the format of instruction. Cog Instru 1991; 8: 293–332.


MILITARY MEDICINE, 178, 10:28, 2013

Balancing Physiology, Anatomy and Immersion: How Much Biological Fidelity Is Necessary in a Medical Simulation?

Thomas B. Talbot, MD, MS, FAAP*†

ABSTRACT Physiology and anatomy can be depicted at varying levels of fidelity in a medical simulation or training encounter. Another factor in a medical simulation concerns design features intended to engage the learner through a sense of immersion. Physiology can be simulated by various means including physiology engines, complex state machines, simple state machines, kinetic models, and static readouts. Each approach has advantages in terms of complexity of development and impact on the learner. Such factors are detailed within the article. Various other biological, hardware-based, and virtual models are used in medical training with varying levels of fidelity. For many medical simulation-based educational experiences, low-fidelity approaches are often adequate if not preferable.

INTRODUCTION

There has been a tremendous shift in medical training over the last decade. The venerable approach of passive observation, trial and error, slavish working hours, and lengthy "rounds" is moving toward medical simulation as a mainstay of training for physicians, nurses, and medics. Animal-based laboratories are moving to computer-driven training experiments that replicate the physiology of humans in an environment that encourages experimentation and repetition over one-off opportunities. As computer, material science, and electromechanical technologies advance, there is a concurrent desire to create simulation experiences with ever higher fidelity. Given this, it is important to ask how much fidelity is optimal. This article explores various aspects of biological fidelity with emphasis on physiology and anatomy. The intent is to contrast approaches based on type of training, development effort, and impact on the learner.

PHYSIOLOGICAL FIDELITY

Many training scenarios involve demonstrations of physiological action with an expectation that the learner diagnose a condition based upon the demonstrated physiology, make interventions and view a realistic physiological response as would be seen in a patient encounter. A variety of mechanisms exist that can do this with varying levels of fidelity, dynamism, and effort involved in their creation.

COMPLEX FIDELITY: ANIMAL PHYSIOLOGY

Complex human physiology can be simulated with animal surrogates. Live animals possess analogous physiology that

*Institute for Creative Technologies, University of Southern California, 12015 Waterfront Drive, Playa Vista, CA 90094-2536. †Telemedicine and Advanced Technology Research Center, 1054 Patchel Street, Fort Detrick, MD 21702-5012. The findings and opinions expressed here do not necessarily reflect the positions or policies of the Office of Naval Research. doi: 10.7205/MILMED-D-13-00212


is highly reliable, responds to just about any medication or physical intervention, and allows for realistic interaction with learners. The drawbacks to using live animals include the high cost and effort to maintain the animals, the need to avoid suffering, infectious disease considerations, dissimilar anatomy, lack of repeatability, and the possible or likely loss of the animal.1

COMPLEX FIDELITY: PHYSIOLOGY ENGINES

A replicable experience with complex fidelity can be achieved through physiology engines. Physiology engines are computer-coded mathematical models that simulate body systems. Basic physiology engines replicate the cardiovascular system and the effects of hemorrhage, fluids, and medications on the model. Some manikins include such engines.2 More complex physiology engines are multisystem with large pharmacology libraries and multidrug interactions. An example of a multisystem model is HumMod, created by the University of Mississippi Medical Center.3 HumMod can readily simulate a wide variety of conditions such as hemorrhage, heart failure, ketoacidosis, or hyperaldosteronism. The results of HumMod outputs are in the form of graphs or data that will closely match results from physiology research studies and textbooks. Finally, biological modeling can simulate biological processes down to the molecular level, though this requires intense computing power and is used for research purposes rather than for education. Physiology models are a high-end solution that can run in real or accelerated time. They have the capability to mimic realistic physiological activity and can gracefully manage unexpected user inputs. They can cope with the effects of multiple interventions even if those interventions are antagonistic to each other. One problem is that realistic changes in physiology may be too gradual or subtle for the learner to notice unless on-screen indicators readily depict historical trends. Some physiology processes, such as sepsis or chemistry changes, unfold too slowly to be observed during an educational scenario. In these cases, the simulation will appear
insufficiently responsive and fail to engage the learner. Ironically, gradual responses to user inputs can reduce the user's impression of biological fidelity. The need to closely observe monitor displays while tracking ongoing changes distracts the learner from observing the patient.4 It is often difficult for a simulation to correlate virtual patient verbal behavior or appearance with the state of the physiology engine. Because manikins are constructed mostly out of rigid plastic and few are motorized, they have little capability to change their general appearance. Efforts to show changes in patient appearance in Virtual Reality (VR) simulations based on physiological parameters have been attempted and are maturing (Fig. 1).5 Physiology engines are often present in high-end medical simulations. Higher end manikins such as the METI iStan6 and Laerdal SimMan 3G7 use physiology models that focus on the respiratory and cardiovascular systems. With these systems, changes in pulse, blood pressure, and respiratory rate are concretely accessible from the physical examination of the manikin as well as on a monitoring display. These engines will respond appropriately to artificial ventilation, chest compressions, and cardiac medications, for example. The high interactivity and close linkage of the physiological response closely replicates an actual critical care encounter. A shortcoming of manikins is the limited behavioral repertoire and the presence of monitoring displays that are often watched over by learners who neglect the physical examination. Manikin features rated by medical students to be most useful include chest rise, palpable pulses, interactive voice, and the vital signs display.8 Anesthesia simulation is a common use of physiology engines and it is conducted with manikins, virtual patient avatars,9 or with a simulated patient monitor (Fig. 2). The most sophisticated manikins can simulate gas exchange on real anesthesia equipment. Because anesthesia training is

FIGURE 1. Patient with maxillofacial trauma shows the graphical quality of typical high-end VR training systems. Newer models in development can separate out physiological data such as pulse, respiration, temperature, and capillary refill and tie it to relevant animations for pallor, flushing, sweating, anxiety, and distress. Image courtesy of TruSim, a division of Blitz Game Studios.


FIGURE 2. A VR high-fidelity anesthesia trainer uses a physiology engine for precise determination of medication effects. Physiology engines permit combinations of user inputs that may be unexpected by the developer yet still produce reliably lifelike results. HumanSim image courtesy of the Virtual Heroes division of Applied Research Associates.

heavily biased toward pharmacological effects and subtle trends, physiology engines are ideal for this application.10 Physiology engines are also strongly suited to detailed exploratory activities, especially for advanced learners. They are unique in that they permit repetitive explorations while attempting different approaches. This empowers the learner to discover relationships between interventions experimentally. Because of the expense and formidable effort to create physiology engines, the Telemedicine and Advanced Technology Research Center (TATRC), Armed Forces Simulation Institute for Medicine (AFSIM), and the Defense Medical Research Program are sponsoring the creation of an open-source physiology engine as a public resource for all to use freely.11 It is hoped that the public physiology engine will increase the adoption of this technology and the available corpus of high-fidelity medical simulation content. COMPLEX FIDELITY: KINETIC MODELS A different approach to simulated physiology is the SIMapse Nerve Agent Laboratory. SIMapse visually shows cholinergic neurotransmission, the effects of nerve agents, and actions of various nerve agent antidotes on different body systems. The approach of SIMapse is less mathematical and more kinematic; neurotransmitter molecules are graphically depicted in three-dimensional (3D) space and interact with receptors, destructive enzymes, and other agents. The physiological process is graphically showed as actions and behaviors. The results are a very close approximation to known science even though the simulation does not use actual 29


FIGURE 3. The SIMapse Nerve Agent Laboratory v3 provides a realistic portrayal of nerve agent pharmacology through motion and sound for educational purposes. The program simulates physiological behavior without referring to actual physiological data. It teaches accurate physiology by showing physiology mechanisms, relationships, and pharmacology behavior. Trends over time are depicted as colored sparklines at the bottom of the display.

MODERATE FIDELITY: COMPLEX STATE MACHINES
In practice, physiology engines are neither necessary nor desirable in many educational situations because simpler methods for depicting physiology states are often more practical. Moderate-fidelity approaches to depicting physiology are often managed by complex state machines (CSMs), also known as hierarchical finite state machines.13 CSMs are computer programs consisting of logical rules and decision-tree-based logic that responds to user activity. User actions, simulation timers, and other events trigger different states that change the patient presentation and vital signs. CSMs are designed for scenario-based training and respond to well-defined, known possible interactions. A major advantage of state machines is that the expression of different states is often a marked change that is readily noticed by the learner.

It is also easy to indicate changes in patient behavior, vital signs, or appearance because these changes can be concretely tied to a change in the state machine. The CSM approach is ideally suited for training scenarios with limited depth and scope. Simulations appear more responsive to the learner because they provide immediate and visible responses to learner interaction. The major disadvantage of state machines is that they do not respond well to unexpected, complex, or combinatorial inputs. Undesirable inputs and program complexity are addressed by limiting the variety of possible interventions. Recovery from user errors once a scenario has moved down a decision tree is difficult to program. Another disadvantage is that each branch point must be individually coded. Adding branch points or levels to the state machine can exponentially increase the development work necessary to build the scenario.14 Medium- and high-end manikins often feature this type of state machine and may be packaged with a scenario builder toolkit.
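A minimal sketch of this idea follows: a handful of named states, per-state vital signs, and transition rules keyed to learner actions and a timer. The scenario content, state names, and timings are invented placeholders rather than the format used by any commercial scenario builder toolkit.

```python
# Minimal sketch of a complex (branching) state machine for scenario-based
# training: named states, per-state vital signs, and transitions triggered by
# learner actions or by a timer. All scenario content here is invented.

import time

SCENARIO = {
    "unstable":      {"vitals": {"HR": 140, "RR": 30, "SBP": 80},
                      "on": {"give_oxygen": "improving",
                             "timer_expired": "deteriorating"}},
    "deteriorating": {"vitals": {"HR": 160, "RR": 8, "SBP": 60},
                      "on": {"intubate": "improving"}},
    "improving":     {"vitals": {"HR": 110, "RR": 22, "SBP": 95},
                      "on": {"give_fluids": "stable"}},
    "stable":        {"vitals": {"HR": 90, "RR": 16, "SBP": 110}, "on": {}},
}

class ComplexStateMachine:
    def __init__(self, start="unstable", timeout_s=120.0):
        self.state = start
        self.timeout_s = timeout_s
        self.entered_at = time.time()

    def vitals(self):
        """Vital signs presented to the learner for the current state."""
        return SCENARIO[self.state]["vitals"]

    def handle(self, event):
        """Apply a learner action or timer event; unknown events are ignored."""
        next_state = SCENARIO[self.state]["on"].get(event)
        if next_state is not None:
            self.state = next_state
            self.entered_at = time.time()
        return self.state

    def tick(self):
        """Call periodically; fires a timer event if the learner does nothing."""
        if time.time() - self.entered_at > self.timeout_s:
            self.handle("timer_expired")
        return self.state

csm = ComplexStateMachine()
csm.handle("give_oxygen")        # learner intervenes
print(csm.state, csm.vitals())   # -> improving {'HR': 110, 'RR': 22, 'SBP': 95}
```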


In the author's experience visiting dozens of simulation centers in the United States, few centers go through the effort to create custom scenarios with a toolkit. Instead, they tend to rely on scenarios provided by the manufacturer. VR scenario and game-based training often use the CSM model. Another medium-fidelity approach is to couple a state machine to some sort of physiology model. When the U.S. Army Field Management of Chemical and Biological Casualties Course required a nerve agent pharmacology trainer targeting less advanced learners than would be appropriate for the SIMapse Nerve Agent Laboratory, the SIMapse engine was adapted for a multimedia application called Nerve Academy.15 Nerve Academy uses an on-screen lecturer who delivers mini-lessons coupled with preprogrammed use of the SIMapse engine. It also includes numerous interactive activities at the end of mini-lessons that trigger events in the engine. The result is a responsive, high-fidelity learning experience that is very easy for the learner to use.

The advantage of coupling a limited set of inputs or a state machine to a physiology model is that it is possible to have high biological fidelity and visible trends in data while allowing for simplicity of use and well-defined, responsive changes in the appearance of the model. Another advantage of coupling a limited interaction set to a physiology model is that doing so greatly reduces the complexity and sophistication required of the model by limiting the parameters the model has to account for.

LOW FIDELITY: SIMPLE STATE MACHINES
Most educational scenarios and simulations in medicine use low physiological fidelity to good effect. Low-fidelity approaches require less technology, effort, and sophistication to implement and, therefore, are easier to author. Patient simulations can use simple state machines (SSMs) to great effect. SSMs consist of 3 or more fixed states that alter the appearance, communication, and physiology data of the simulated patient. They lack the branching and conditional features of the more complex CSMs yet still offer many of their benefits.

FIGURE 4. The Cyanide Exposure Simulator is an SSM consisting of 7 frames. It uses simple animation effects to create the impression of live physiology. Numerous physiology displays are hand-drawn graphics that are progressively revealed, offering the illusion of real-time monitoring. Except for button controls, the simulation required absolutely no programming. This example shows that low-fidelity approaches can convincingly convey a dynamic process to learners.




They are especially useful in adding a dynamic appearance to a simple case presentation when creation of an in-depth simulation is too resource intensive. They can be created with simple technology, such as web pages, interactive animations, or PowerPoint (Microsoft Corporation, Redmond, Washington). An example familiar to the author is a simple animation-based application called the "Cyanide Exposure Simulator."15 The simulation is usually deployed during large classroom sessions. It consists of 7 possible states that show an inhalational exposure to cyanide and the effects on human physiology. States are selected with two buttons to traverse back and forth through a timeline. Each step down the timeline plays a 10- to 15-second video clip, changes signs, and alters graphs (Fig. 4). The overall effect is that of a dynamic simulation that seems to portray a wealth of biological data. The video with the actor conveys the clinical picture. The rolling displays for the pneumograph, electrocardiography, and electroencephalography are fixed lines that are progressively uncovered, producing the illusion of live monitoring. Vital signs of respiratory rate, heart rate, cardiac output, and blood pH are represented by triangle indicators on linear gauges. The transition between states is a simple animation that moves the indicators over 3 seconds. The impression created by this gradual transition is that of physiology that is changing with the timeline. In truth, the vital signs presented are as generic as everything else. The cyanide simulator effectively portrays cyanide effects even though no detailed physiology data are ever conveyed to the learners. In fact, providing a level of fidelity that includes more detail would diminish educational effectiveness.16 Based on user feedback received as a developer, the author finds that most learners are unlikely to detect a difference between this simple state model and an expensive version that uses a physiology engine if the physiology changes seem logical.
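The mechanism behind such a timeline can be expressed in a few lines. The sketch below is a hypothetical illustration of a simple state machine driven by two buttons; the frame contents are invented placeholders, not data from the Cyanide Exposure Simulator.

```python
# Hypothetical sketch of a simple state machine: a fixed, linear timeline
# traversed with "back" and "next" buttons. Frame contents are placeholders
# and are not taken from the Cyanide Exposure Simulator.

FRAMES = [
    {"clip": "frame1.mp4", "HR": 72,  "RR": 14, "note": "baseline"},
    {"clip": "frame2.mp4", "HR": 95,  "RR": 22, "note": "early exposure"},
    {"clip": "frame3.mp4", "HR": 120, "RR": 30, "note": "symptomatic"},
    {"clip": "frame4.mp4", "HR": 60,  "RR": 6,  "note": "decompensating"},
]

class SimpleStateMachine:
    """Fixed states in a fixed order; no branching, no conditional logic."""

    def __init__(self, frames):
        self.frames = frames
        self.index = 0

    def next(self):
        # Advance one step down the timeline (stop at the last frame).
        self.index = min(self.index + 1, len(self.frames) - 1)
        return self.frames[self.index]

    def back(self):
        # Step back up the timeline (stop at the first frame).
        self.index = max(self.index - 1, 0)
        return self.frames[self.index]

ssm = SimpleStateMachine(FRAMES)
print(ssm.next())   # play the next clip and update the displayed signs
print(ssm.back())   # return to the previous frame
```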


SSMs can also be used for intervention-based training. Virtual Nerve Agent Casualty (VNAC) pushes the envelope for a simple state model to simulate treatment of a severe nerve agent exposure.17 Video of a patient is played while the learner interacts by dropping antidotes and interventions onto the patient display. The state model plays a new video with each intervention and counts up correct interventions until a required quantity is reached. At this point, the state changes, altering the data on the screen (Table I). The simulation requires persistence on the part of the learner but results in a satisfactory simulated experience despite the fact that the biological fidelity is low. A drawback of this simple application is that it is very unforgiving of learner errors and does not allow for a corrective pathway if treatment errors are made.

LOW FIDELITY: GAME-BASED APPROACHES
A game-based approach to depicting physiology is the health score. Health scores are simple numeric or bar graph representations of health. They consist of a point-based score or a 100-point scale and are often called hit points. This very simple and low-fidelity representation of health status is ubiquitous and is readily understood by game players. In games, the player's avatar loses hit points upon receiving damage. Losing all hit points results in the death of the player. Hit points tend to gradually regenerate or are restored by activating plus-ups. Medical simulations can also exploit health scores. One approach to health scores uses trend zones. For example, one can split the range of hit points into low, stable, and high zones (Fig. 5). High health scores will improve until health is full. Scores in the stable zone will improve very slowly and scores in the low health zone will automatically decrease. Simply setting the health score based on a virtual patient's disease or improvement after intervention will now be followed by an automatic, ongoing action. This action can be

TABLE I. Simple State Model of Virtual Nerve Agent Casualty (VNAC). (States 4 and 5 are Not Depicted for Clarity.) VNAC is a Product of the U.S. Army Medical Research Institute of Chemical Defense Chemical Casualty Care Division

Initial State. Conditions: Default. Vitals: Heart Rate 140; Respiration 30; Blood Pressure 40 systolic; Secretion 4+; Bronchospasm 4+; Twitch +; Seizure +. "The patient is twitching and having difficulty breathing..."

State 2. Activating Conditions: 3 Atropine, 3 Oxime, 1 Diazepam. Vitals: Heart Rate 120; Respiration 24; Blood Pressure 60 systolic; Secretion 3+; Bronchospasm 3+; Twitch +; Seizure +.

State 3. Activating Conditions: 4 Atropine, 3 Oxime, 3 Diazepam. Vitals: Heart Rate 130; Respiration 20; Blood Pressure 100 systolic; Secretion 2+; Bronchospasm 2+; Twitch −; Seizure −.

State 6 (Final State for Success). Activating Conditions: 15 Atropine, 4 Oxime, 4 Diazepam. Vitals: Heart Rate 160; Respiration 13; Blood Pressure 140 systolic; Secretion −; Bronchospasm −; Twitch −; Seizure −. "Great job, the patient is stable for transport."

State 7 (Final State for Failure). Activating Conditions: 5-minute delay in treatment, seizures not controlled in 10 minutes, or an error is made. Vitals: Dead. "You just didn't treat her well enough."




FIGURE 5. Health Scores represent total health by points or a percentage displayed on a bar graph. This type of display is not capable of showing historical values, but the score can be observed to change if the learner watches it. The above sample includes a legend that depicts trend zones that would otherwise be invisible to the learner. Health scores in the low-trend zone automatically decline and scores in the high zone automatically increase over time. The score does not change within the stable zone. A little additional math creates a surprisingly vibrant and useful indicator of patient health for game-based learning.

enhanced further by adding nonlinear response curves.18 With trend zones, the score will always be moving, encouraging the learner to intervene. This approach requires very little development effort. A more sophisticated approach from the gaming world uses score modifiers. Score modifiers come in four basic forms: instant damage, instant healing, damage over time (DoT), and healing over time (HoT). Instant modifiers are a one-time reduction or improvement to the health score. DoT is a continuous score reduction over a specified amount of time. In the example of a bleeding virtual patient, the health score will progressively decline on the screen because bleeding is a DoT effect. If the learner gives the patient a blood transfusion, perhaps some points will be added to the health score, but the score will continue to decline. If the learner intervenes and stops the bleeding (e.g., with a tourniquet), then the DoT effect is cancelled out. Applying a blood transfusion (instant healing) will now permanently increase the score because the DoT effect is no longer present. Careful application of different score modifiers can mimic responsive and sophisticated physiology. The drawback is that health is being represented by a single moving bar; such a display typically does not show historical trends. Learners who grew up with video games are accustomed to health scores even though they are not widely used for medical simulations. Further research is needed to assess the impact of health scores on learner perception. TATRC is currently working with BreakAway to develop a multiplayer online hospital-based mass casualty simulation for coordinating a response to a very large number of casualties (Communication with Jennifer McNamara, CBRNE-GAME Leader). The clinical role in the game involves triage and treatment for hundreds of casualties. The patients will use a blended model, including a health score, score modifiers, and a simple state model with only three descriptions of patient presentation. Use of these shortcuts allows for patients that change noticeably over the course of the exercise and act responsively to player intervention. The learner will be busy prioritizing and selecting treatments and will likely not take notice that the fidelity is shallow. In fact, the simulation depends on the fact that fidelity is shallow so learners focus on "the big picture." The question of how the learner perceives fidelity in the simulation will be the subject of game test research.
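The sketch below illustrates how a health score with trend zones and over-time modifiers might be wired together; the zone boundaries, tick rates, and intervention effects are invented for illustration and are not drawn from the CBRNE-GAME project.

```python
# Sketch of a game-style health score with trend zones and over-time modifiers
# (DoT/HoT), as described above. Zone boundaries and rates are illustrative.

class HealthScore:
    def __init__(self, score=100.0, low=35.0, high=65.0):
        self.score = score
        self.low, self.high = low, high      # trend-zone boundaries (0-100 scale)
        self.over_time = []                  # list of [points_per_tick, ticks_left]

    def apply_instant(self, points):
        """Instant damage (negative) or instant healing (positive)."""
        self.score = max(0.0, min(100.0, self.score + points))

    def add_over_time(self, points_per_tick, ticks):
        """Damage over time (negative rate) or healing over time (positive)."""
        self.over_time.append([points_per_tick, ticks])

    def cancel_damage_over_time(self):
        """E.g., a tourniquet removes a bleeding (DoT) effect."""
        self.over_time = [m for m in self.over_time if m[0] > 0]

    def tick(self):
        # Trend zones: low scores drift down, high scores drift up.
        if self.score < self.low:
            self.apply_instant(-1.0)
        elif self.score > self.high:
            self.apply_instant(+1.0)
        # Apply active DoT/HoT modifiers and expire those that have run out.
        for mod in self.over_time:
            self.apply_instant(mod[0])
            mod[1] -= 1
        self.over_time = [m for m in self.over_time if m[1] > 0]
        return self.score

patient = HealthScore(score=80.0)
patient.add_over_time(-2.0, ticks=30)    # bleeding: damage over time
patient.apply_instant(+10.0)             # transfusion: instant healing
patient.cancel_damage_over_time()        # tourniquet: stop the bleeding
print(patient.tick())
```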

LOW FIDELITY: STATIC PRESENTATIONS
The most common method of depicting physiology is through a static presentation. Static presentations state vital signs and provide a case in written or verbal form. Static presentations can have as little or as much detail as desired with little effort required on the part of the author. This format is commonly used to good effect in written tests, magazines, multimedia, and web pages and is well known to medical learners. While the advantages involve ease of authorship and distribution, the disadvantage is the lack of interaction. This format does a poor job of showing progression and the effects of intervention. Static presentations have strictly right and wrong answers without consideration of creative possibilities. This format is the most common type of educational patient case.

PHYSIOLOGICAL FIDELITY: APPLICATIONS
Each approach to depicting physiology has its own best use (Table II). Advanced learner simulations and exploratory activities are well suited to physiology engines. Interactive case scenarios can be conducted with a physiology engine, but CSMs and SSMs are usually preferred because of the relative simplicity of development and lower computing resource requirements. Because state machines consist of a few dozen calculations to perform versus thousands for a physiology engine, state machines require fewer computing resources to run. Although computers are always becoming more powerful, the resource requirements to run a physiology engine in real time become significant when attempting to run 100 or 1,000 simultaneous instances within a serious game. Another area where the logical simplicity of state machines is preferable is low-power, low-computing-resource environments such as mobile devices and tablets. Game-based training is well suited to CSMs and health scores. Presentations and mini-activities with few options are well suited to SSMs. Case studies are most easily written as a static presentation, though adding a state machine can increase the interactive possibilities. The available number of quality medical training scenarios is limited by the effort required to develop them. Fortunately, the majority of medical scenarios do not require high physiological fidelity. Clever developers use a number of tricks to create the impression of physiological fidelity. These techniques include visible responsiveness to user input and use of animation.

ANATOMICAL FIDELITY
Medical simulations use anatomy in both virtual and physical environments. Anatomic features are most important for training invasive procedures, though they are also useful in practicing diagnostic skills. Various anatomical approaches are identified here along with their strengths and weaknesses. Live models such as human standardized patients have extremely good fidelity and are useful for physical diagnosis skills.

TABLE II. Comparison of Approaches to Virtual Patient Physiology

Physiology Engines. Handling of Unexpected and Complex Inputs: Easy. Ease to Correlate Visualization With Model: Difficult. Response to User Input: Gradual. Graceful Recovery From Learner Errors: Yes. Suitability for Lengthy Scenarios: High. Biological Fidelity: High. Typical Perception of Biological Fidelity: Moderate–High. Best Use Scenario: Advanced Simulations and Exploratory Learning. Development Effort: Difficult.

Complex State Models. Handling of Unexpected and Complex Inputs: Difficult. Ease to Correlate Visualization With Model: Easy. Response to User Input: Instant. Graceful Recovery From Learner Errors: Challenging. Suitability for Lengthy Scenarios: Low. Biological Fidelity: Moderate. Typical Perception of Biological Fidelity: High. Best Use Scenario: Interactive Case Scenarios and Game-Based Training. Development Effort: Moderate.

Simple State Models. Handling of Unexpected and Complex Inputs: Impossible. Ease to Correlate Visualization With Model: Very Easy. Response to User Input: Instant. Graceful Recovery From Learner Errors: No. Suitability for Lengthy Scenarios: Low. Biological Fidelity: Low. Typical Perception of Biological Fidelity: Low. Best Use Scenario: Interactive Case Scenarios, Presentations, and Mini-Activities. Development Effort: Easy.

Health Scores. Handling of Unexpected and Complex Inputs: Moderate. Ease to Correlate Visualization With Model: Moderate. Response to User Input: Gradual/Instant. Graceful Recovery From Learner Errors: Yes. Suitability for Lengthy Scenarios: High. Biological Fidelity: Low. Typical Perception of Biological Fidelity: Moderate. Best Use Scenario: Game-Based Training. Development Effort: Easy.

Static Presentations. Handling of Unexpected and Complex Inputs: N/A. Ease to Correlate Visualization With Model: Very Easy. Response to User Input: None. Graceful Recovery From Learner Errors: N/A. Suitability for Lengthy Scenarios: Low. Biological Fidelity: Low. Typical Perception of Biological Fidelity: None. Best Use Scenario: Case Studies. Development Effort: Very Easy.

Major limitations include the limited availability of pathologies that are stable enough to use this technique. The expense of hiring standardized patients is enormous, and humans are not suitable for rehearsing invasive procedures.19 Cadavers offer many advantages for anatomical realism and some can even be partially reanimated if fresh, but they are expensive, limited in their availability, and administratively burdensome to possess.20 Live animals have their own limitations. They can be used for both examination and invasive procedure training, though there are important ethical restrictions and precautions involved in their use. Their superior tissue properties are not yet achievable by artificial means. They can be operated on and do all the physical things that humans do. They do not require a team of developers to bestow their capabilities. Differences between the animal model and true human anatomy are a disadvantage to this approach.21 Task trainers are simulators that consist of discrete body parts or regions for training a specific task. Task trainers have been developed for intravenous access, central line placements, intraosseous access, lumbar puncture, colonoscopy, tracheal intubation, and many other procedures. They allow for repetitive practice of physical procedures at low cost. They are usually constructed of molded plastic and rubber-like compounds. They tend to have suitable anatomic landmarks but can be lacking in suitable tissue properties.22 Ideal tissues will feel like flesh and bone and include appropriate compliance, texture, moisture, bleed, and traction properties.


Most task trainers fall far short of this ideal. Crude materials also limit fidelity for training physical diagnostic skills such as palpation of blood vessels and organs.23 Task trainers also require maintenance and replacement of consumable parts. Despite limitations, even unsophisticated task trainers have been proven effective in medical training.24 Manikins known to the author fall short on anatomical fidelity because they lack internal anatomy such as tissues, muscles, bones, organs, and vessels. They are accurate at a gross level and may include palpable pulses, chest rise, palpable ribs, and other features. Current manikins do not articulate naturally, lack muscle tone, and cannot be operated on. The skin is rubber-like or plastic, and they lack realistic tissue properties. Nevertheless, they are useful in many training situations and numerous procedures can be rehearsed on them. Current manikins do tend to have sophisticated airways. They are often better for scenario-based training and demonstration of decision making and proficiency across a sequence of events than for specific skill rehearsal, which is often better performed with a dedicated task trainer. Depending on the use, even very basic models may be suited to medical training if the scenarios are designed properly. A less well-known but growing approach is the high-fidelity physical model. Physical models are task trainers for surgical intervention. They combine soft plastics, gelatinous tissue substitutes, composites, and cloth to represent the internal anatomy of an organ or wound and are often moulaged to represent a specific pathology. These often disposable trainers include simulated skin, bone, nerves, muscles, and tissue planes.25


Some are capable of blood flow and hemorrhage. Common physical models include limbs, the inguinal canal, and other areas of the body. They allow for surgical rehearsal at low cost and are a very practical technology. VR systems use a 3D human representation on a computer display. For surgical purposes, they are viewed stereoscopically on a 3D screen or with display goggles. They usually include realistic-appearing surgical manipulators or a 3D pen-like device with haptic feedback. Enormous advances in graphical computing power now allow for realistic 3D portrayals of both external and internal human anatomy with detailed surface textures and appearance. These recent graphical advances are so dramatic that VR simulations from even a few years ago now appear antiquated. Graphical realism is no longer limited by the computer; it is limited by the effort and expense of creating the 3D models and artwork. The technology behind the haptics (sensation of feeling), although awkward to get used to, can provide uncanny sensations of moving tissue, a scalpel, or hard points underneath the skin.26 VR surgical systems are extremely difficult to develop material for and remain very expensive. Because of this, the available library of procedures and scenarios using this technology remains small. Worse yet, content developed on one system is not transferable to that of another manufacturer. The AFSIM and the National Capitol Area Simulation Center are attempting to ameliorate this through the Tri-Service Open Platform for simulation (TOPS) by specifying a common interface between software and hardware systems (Communication with TOPS Principal Investigator, Alan Liu). AFSIM is also working to develop an open-source haptic-enabled surgery toolkit to promote additional content development and advancement of this technology.

INTERACTIVITY AND NARRATIVE
Fancy graphics, accurate anatomy, and physiological fidelity are not the only, nor the most important, features of a successful medical simulation. The quality of a simulation-based training experience depends on successful engagement with the learner. Achieving engagement depends on a sense of immersion, successfully executed visuals, responsiveness, and a good narrative. Creating the sense of immersion is more important than 3D or the level of visual detail in the simulation. It depends on having a consistent simulation world with things to do or see that are interesting to the learner.27 Simulations can achieve engagement with the learner more successfully if the actions learners perform in the scenario are followed by a visible or audible response.28 Responsiveness connects the learner to the scenario. In the case where virtual humans are encountered, responsiveness establishes likeability and rapport with the learner. Nonverbal cues and gestures, even random ones, increase this sensation of rapport with the scenario and virtual patient.

For these reasons, factors such as "response to user input" and "perception of biological fidelity" are listed in Table II, which compares approaches to virtual patient physiology. This table is based on the author's experience as a developer and interactions with hundreds of learners using various medical simulations. In addition to design choices such as the level of fidelity required and the amount of development effort one wishes to expend, one should not neglect the impact of old-fashioned storytelling; research strongly shows that narrative is a successful tool to engage people.29

CONCLUSION
When it comes to methods to depict physiology in a medical simulation, some generalizations can be made (Fig. 2). Physiology engines excel at advanced simulations and exploratory learning where sophisticated learners want to try unexpected things and see accurate responses. Examples of this include anesthesia trainers and toxicology simulations. Complex state models are useful for interactive case scenarios and game-based training because they provide predictable behavior and excellent responsiveness, and they can be reasonably complex while appearing to have high fidelity. Simple state models are well suited to interactive case scenarios, lecture presentations, and mini-activities because they are very easy to author yet are interactive and responsive to user input. Health scores are ubiquitously used in entertainment titles but much less so in medical simulations. They have the advantages of easy implementation, dynamic behavior, and a familiar format. Further research is needed to determine the suitability and acceptance of health scores in medical simulations. In fact, the author recommends that those intending to create medical simulations use the simplest technology possible that achieves the learning objectives. In any case, further research is needed to obtain data regarding learner perceptions of simulations that use these technological approaches for physiology, as the extant literature is sparse compared with research on graphical realism, immersion, and narrative. Biological fidelity in medical simulation is not an end in and of itself. Physiology, anatomy, interaction, narrative, and the technology behind them represent an array of tools. The choice of these tools must be determined by educational objectives. Because more complex systems require more time, effort, and money to create, unnecessary use of high fidelity results in less available training content overall. Although some applications rightly require exacting fidelity, many do not. Medical educators should choose the most appropriate level of technology that achieves the goal. When doing so, they may often find that the level of fidelity required is lower than what they initially expected.

ACKNOWLEDGMENT
This study was partially supported by a grant from the Office of Naval Research, Award Number N00014-10-1-0978.



REFERENCES
1. "Use of Simulation Technology in Medical Training", House Report 112-078, National Defense Authorization Act for Fiscal Year 2012. Committee Report of the 112th United States Congress. Available at http://thomas.loc.gov:80/cgi-bin/cpquery?%26dbname=cp112%26r_n=hr078.112%26sel=DOC; accessed May 6, 2013.
2. Cooper JB, Taqueti VR: A brief history of the development of mannequin simulators for clinical education and training. Postgrad Med J 2008; 84: 563–70.
3. Hester R, et al: HumMod: a modeling environment for the simulation of integrative human physiology. Front Physio 2011; 2: 12. doi: 10.3389/fphys.2011.00012.
4. Grant T, McNeil MA, Luo X: Absolute and relative value of patient simulator features as perceived by medical undergraduates. Simul Healthc 2010; 5(4): 213–8.
5. Knight JF, Carlet S, Tregunna B, Jarvis S, Smithies R, de Freitas S: Serious gaming technology in major incident triage training: a pragmatic controlled trial. Resuscitation 2010; 81: 1175–9.
6. CAE Healthcare Meti Learning. iStan Manikin. Available at https://caehealthcare.com/home/eng/product_services/product_details/istan#; accessed January 3, 2013.
7. Laerdal SimMan 3G. Laerdal Products & Services. Available at http://www.laerdal.com/doc/85/SimMan-3G; accessed September 10, 2011.
8. Donoghue AJ, Durbin DR, Nadel FM, Stryjewski GR, Kost SI, Nadkarny V: Perception of realism during mock resuscitations by pediatric housestaff: the impact of simulated physical features. Simul Healthc 2008; 3(3): 113–37.
9. Applied Research Associates Virtual Heroes Division. HumanSim. Available at http://www.humansim.com/; accessed January 3, 2013.
10. Morgan PJ, Cleave-Hogg D: A worldwide survey of the use of simulation in anesthesia. Can J Anaesth 2002; 49(7): 659–62.
11. TATRC. Solicitation for the Developer Tools for Medical Education Practical Physiology Research Platform. Available at http://www.grants.gov/search/search.do?mode=VIEW&oppId=130394; accessed January 3, 2013.
12. Talbot TB. SIMapse Nerve Agent Laboratory 2.0 and Nerve Academy CD-ROM. Available at https://ccc.apgea.army.mil/products/info/products.htm; accessed August 22, 2011.
13. Ahlquist J, Novak J: Game Development Essentials: Game Artificial Intelligence, Chapters 2–3. New York, Thomson/Delmar Learning, 2007.
14. Riedl MO, Young RM: From linear story generation to branching story graphs. IEEE Computer Graphics and Applications 2006; 26(3): 23–31.
15. Talbot TB. Cyanide Exposure Simulator. U.S. Government use software. Produced at the Chemical Casualty Care Division of the U.S. Army Medical Research Institute of Chemical Defense (USAMRICD), Aberdeen Proving Ground, Aberdeen, MD, 2006.


16. Clark RC, Mayer RE: e-Learning and the Science of Instruction: Proven Guidelines for Consumers and Designers of Multimedia Learning, Chapter 7. San Francisco, Pfeiffer, 2008.
17. USAMRICD Chemical Casualty Care Division. Virtual Nerve Agent Casualty (VNAC) on Medical Management of Chemical Casualties DVD-ROM 5.0. Available at https://ccc.apgea.army.mil/products/info/products.htm; accessed August 22, 2011.
18. Mark D: Behavioral Mathematics for Game AI, Chapter 12. Boston, Charles River Media, 2009.
19. King AM, Perkowski-Rogers LC, Pohl HS: Planning standardized patient programs: case development, patient training, and costs. Teach Learn Med 2010; 6(1): 6–14.
20. Parker LM: What's wrong with the dead body? Use of the human cadaver in medical education. Med J Aust 2002; 176(2): 74–6.
21. Good ML: Patient simulation for training basic and advanced clinical skills. Med Educ 2003; 37: 14–21.
22. Kunkler K: The role of medical simulation: an overview. Int J Med Robot 2006; 2: 203–10.
23. Bradley P: The history of simulation in medical education and possible future directions. Med Educ 2006; 40: 254–62.
24. Scerbo MW, Dawson S: High Fidelity, High Performance? Simul Healthc 2010; 5(1): 8–15.
25. Reihsen TE, Poniatowski LH, Sweet RM: Cost-effective, Simulated, Representative (Human) High-Fidelity Organosilicate Models. Interservice/Industry Training, Simulation and Education Conference (I/ITSEC) 2011 Proceedings. 2011, Paper No. 11328: 1–7. Available at http://ntsa.metapress.com/link.asp?id=h630v537t40u4767; accessed May 6, 2013.
26. Coles TR, Meglan D, John NW: The role of haptics in medical training simulators: a survey of the state of the art. IEEE Trans Haptics 2011; 4(1): 51–66.
27. Alexander AL, Brunye T, Sidman J, Weil S: From gaming to training: a review of studies on fidelity, immersion, presence and buy-in and their effects on transfer in PC-based simulations and games. DARWARS Proceedings 2005. Available at http://www.aptima.com/publications/2005_Alexander_Brunye_Sidman_Weil.pdf; accessed February 14, 2012.
28. Kenny PG, Parsons TD, Rizzo AA: Human computer interaction in virtual standardized patient systems. In: Human-Computer Interaction, Part IV, Proceedings of the 13th International Conference for Human Computer Interactions, LNCS 5613, pp 514–23. Edited by Jacko JA. Springer, Berlin, 2009. Available at http://www.springer.com/computer/hci/book/978-3-642-02582-2; accessed May 6, 2013.
29. Tortell R, Morie FJ: Videogame play and the effectiveness of virtual environments for training. Proceedings of the Interservice/Industry Training, Simulation, and Education Conference, 2006: 1–9. Available at http://ntsa.metapress.com/link.asp?id=8kwnffjqxvcm382v; accessed May 6, 2013.


MILITARY MEDICINE, 178, 10:37, 2013

Cost Considerations in Using Simulations for Medical Training

J. D. Fletcher, PhD*; Alexander P. Wind, MS†

ABSTRACT This article reviews simulation used for medical training, techniques for assessing simulation-based training, and cost analyses that can be included in such assessments. Simulation in medical training appears to take four general forms: human actors who are taught to simulate illnesses and ailments in standardized ways; virtual patients who are generally presented via computer-controlled, multimedia displays; full-body manikins that simulate patients using electronic sensors, responders, and controls; and part-task anatomical simulations of various body parts and systems. Techniques for assessing costs include benefit–cost analysis, return on investment, and cost-effectiveness analysis. Techniques for assessing the effectiveness of simulation-based medical training include the use of transfer effectiveness ratios and incremental transfer effectiveness ratios to measure transfer of knowledge and skill provided by simulation to the performance of medical procedures. Assessment of costs and simulation effectiveness can be combined with measures of transfer using techniques such as isoperformance analysis to identify ways of minimizing costs without reducing performance effectiveness or maximizing performance without increasing costs. In sum, economic analysis must be considered in training assessments if training budgets are to compete successfully with other requirements for funding.

INTRODUCTION
The advantages of using simulation in training may be summarized as follows:
—Safety: Simulated lives and health can be jeopardized to any extent required for learning.
—Economy: Simulated materiel, equipment, and other resources—physical or fiduciary—can be used, misused, and expended as needed.
—Visibility: Simulation can provide visibility in at least two ways. It can (1) make the invisible visible and (2) control the visibility of details allowing the learner to discern the forest from the trees or the trees from the forest as needed.
—Time control: Simulated time can be sped up, slowed down, or stopped. It can also be completely reversed, allowing learners to replicate specific problems, events, or operational environments as often as needed.


These advantages seem to be applicable in medical training and education as elsewhere. Overall, simulation can provide massive amounts of practice with feedback, exposing individuals or teams to realistic situations that in real-world settings would range from the impracticable to the unthinkable. All these advantages are relevant and interrelated. This article focuses on the economic value of simulation. It suggests ways to assess the use of simulation in medical education and training through objective economic and cost analyses. The article begins with brief reviews of simulation in medical education and training, and of economic and cost analyses. These reviews are followed by a discussion of ways in which measures of simulation training effectiveness can be combined with cost analysis to yield assessments of costs, cost-effectiveness, and return on investment in medical training and education.

SIMULATION IN MEDICAL TRAINING AND EDUCATION
As abstracted from reviews and comments (e.g., Bradley1 and Rosen2), at least 4 forms of simulation appear to be used in medical training.

Standardized Patients
Applying formal procedures developed in 1964 by Barrows and Abrahamson3 and continued into the present,4,5 actors, real patients, or lay people can be trained as "standardized patients" to participate in role-playing exercises for assessing and improving a learner's ability to carry out medical procedures, such as taking medical histories, performing physical examinations, ordering tests, providing counsel, and prescribing treatment. These patients can be available when and where they are needed, trained to respond consistently to examination questions, and used when training with a real patient would be inappropriate, as in counseling cancer patients. They are, however, expensive to recruit and train and cannot present cues normally provided by physiological examinations. Examples:

*Science and Technology Division, Institute for Defense Analyses, 4850 Mark Center Dr., Alexandria, VA 22311. †University at Albany, State University of New York, 1400 Washington Ave., Albany, NY 12222. The findings and opinions expressed here do not necessarily reflect the positions or policies of the Office of the Secretary of Defense. doi: 10.7205/MILMED-D-13-00258

—Gerner et al6 found that consultation quality ratings by standardized patients after their visits with 67 general practitioners predicted later ratings by parents concerning improvements in the weight control behavior of their children. Of the general practitioners, 95% reported that they found training with standardized patients to be useful.




—A study by Betcher7 found that the use of standardized patients (graduate students in theater) in role-playing consultations, followed by debriefing, was effective in improving the communication skills and confidence of nurses and other caregivers by 5% to 37% in advising end-of-life patients and their families.
—Safdieh et al8 compared the long-term effects on the quality of neurologic examinations performed by 58 medical students who were trained using standardized patients with those performed by 129 students who were trained without this experience. Two years after this training, the authors found a statistically significant advantage in performing these examinations favoring the students who were trained using the standardized patients.

Virtual (Computer-based) Patients
Interactive software simulations of patients have been used in standard simulation exercises9,10 and in gaming simulations11 for training and assessing medical skills. These simulations are gradually taking the place of standardized patients, although the absence of strong artificial intelligence, which would allow full mixed-initiative dialogue, limits their applicability. However, growth in the use of virtual patients is likely to continue because of their ability to scale inexpensively to large numbers of physically dispersed learners, adapt quickly to prior knowledge and other individual characteristics of learners, and be available anytime and anywhere via the global information infrastructure. Examples:
—Steadman et al12 compared learning in a week-long acute care course by 31 fourth-year medical students who were randomly assigned to a group receiving a virtual patient with labored breathing versus a group receiving interactive problem-based learning without simulation. The simulation group performed significantly better (71%—a 24-point gain over the pretest) on the final assessment than the problem-based learning group (51%—a 7-point gain over the pretest).
—Ten Eyck et al13 used a crossover design involving 90 students to compare virtual patient simulation with group discussion in emergency medical instruction. The learners received one set of topics using one instructional treatment and then switched mid-rotation to receive the other set of topics using the other treatment. Material presented in simulation format produced significantly higher scores than material presented using group discussion methods.
—Botezatu et al14 compared learning by 49 students studying hematology and cardiology topics using virtual patients with learning by students receiving more conventional instruction (lecture and small group discussion). They assessed learning immediately after instruction and its retention after 4 months. They found effect sizes ranging from 0.5 to 0.8 in favor of the virtual patients.

Electronic Patients
These types of patients are generally whole-body manikins that physically simulate patients, although helmet-mounted virtual reality capability has also been used.15 Military and civilian simulation developers have collaborated to produce and assess electronic patients. An early effort by David Gaba (Stanford University) and CAE-Link developed a full-scale manikin system called Anesthesia Crisis Resource Management. It was designed for Air Force and National Aeronautics and Space Administration Crew Resource Management training.16 Examples:
—Alinier et al17 compared the performance of intensive-care nursing students who were assigned at random to two groups. One group followed the usual course of training, and the other group used a "universal patient simulator." The researchers found significantly greater performance improvement (twice the percentage increase) for the simulation-trained group than for the group receiving the usual course of training.
—Radhakrishnan et al18 used electronically controlled manikins to compare the performance of manikin-trained and conventionally trained nursing students. They found that the manikin-trained students significantly outperformed the other students in patient safety and in assessing vital signs.
—Cendan and Johnson19 used a randomized, repeated-measures design to train 40 second-year medical students in treating neurogenic, hemorrhagic, septic, and cardiac shock. This study compared two instructional approaches: web-based training text with a culminating simulation exercise, and a manikin-based exercise with instructors who provided management and evaluation in response to student questions. All students were exposed to both approaches, with half completing the web-based exercises first and half completing the manikin-based exercises first. Learning from web-based and manikin-based instruction was similar; however, overall learning was greater when the web-based simulation was presented first.

Part-Task Trainers
Anatomical models of body parts are used to provide training. These "part-task" simulations are becoming more advanced to keep pace with medical treatments and technology. They are used in instruction that ranges from minimally invasive laparoscopy20 to major cardiologic surgery21 and delicate ophthalmological procedures.22 They have been enhanced considerably through the development and inclusion of haptic systems and interfaces.23


Examples:
—Barsuk et al24 found significantly fewer needle passes, arterial punctures, and catheter adjustments and higher overall success rates among 76 simulator-trained residents using a haptic task trainer than among 27 residents trained without the simulation.
—Holzinger et al25 compared learning by 96 medical students who were randomly divided into 3 groups: a conventional text-based lesson group, a group learning from a blood dynamics simulator alone, and a group learning from the simulator blended with additional human instructor support. They found no difference between the first two conditions but significantly more learning in the third.

ECONOMIC ANALYSIS: THE VALUE OF A POUND OF SIMULATION
Administrative decision making largely consists of allocating resources among competing alternatives.26 Such decision making is not only a matter of adopting enhancements but also of determining what must be given up to do so. To some degree, this process is described by the "rational theory of choice," which balances costs against factors that contribute to the achievement of a specific goal.27–29 This approach can be overdone by neglecting issues that are hard to measure (e.g., attitudes, culture, and trust), but it plays an essential role for cases in which benefits and costs can be clearly identified, such as business profit and loss, combat success or failure, and the achievement of objectives through instruction. The objective of economic analysis is to inform decisions by assessing alternative courses of action and/or inaction. It does so by estimating the amount and probability of expected returns from each alternative and balancing these returns against projected costs, consequences, and constraints.30 Economic analysts look for value to be gained and resources to be sacrificed for each alternative identified. Such deliberation assumes that the analyst or the decision maker has assembled and considered a comprehensive list of available alternatives. Introduction of an additional alternative can dramatically alter the decision space. Economic analyses remain as subject to controversy as any other analyses. Underlying assumptions, inclusion and exclusion of data elements, proper data collection, sampling procedures, criterion levels, and similar issues contribute to controversy. An economic analysis can never be assuredly correct, but it can and should be explicit and should allow decision makers to determine how well and to what extent it informs their decisions. The single most accessible and readily commensurable criterion for choosing among alternatives remains cost measured in fungible monetary units (e.g., dollars). In these cases, decisions key on units returned for units invested. Even with these data, the decisions do not make themselves. Other factors poorly suited to economic analysis also come into play.

The effectiveness of military and medical training outweighs cost when human lives are at risk. Cost considerations—however necessary, objective, and well-conceived—should not be the sole concern in informing decisions.

Assessing Costs
Assessment of costs invested is a central factor in economic analysis. Costs may be categorized as one of the following: research and development, initial investment, operations and maintenance, and salvage and disposal.30 Salvage and disposal costs are omitted from many analyses because they are one-time only and difficult (usually impossible) to estimate accurately. Many research and development and initial investment costs are also one-time only, but they may be known and may be a matter of interest for some alternatives. When an alternative is being considered as a replacement for an existing program, both research and development costs and initial investment costs for the replacement can be included even though they are not included for the program in place. In these cases, analysts may decide that the costs for the current program are "sunk" (i.e., beyond recovery no matter what happens). These sunk costs do not factor into the decision. A perennial and debilitating problem for cost analysis in education and training is the absence of generally accepted, standardized cost models. Such models would present unambiguously specified and well-defined cost elements that clearly identify what they do and do not include. Without these specifications, decision makers, among others, do not know clearly what the cost analysis and the cost analysts are telling them. A variety of commentators have provided a basis for cost models to be used in instruction. For instance, Levin and McEwan31 suggested five classes of elements, or "ingredients," to be considered in a cost model: personnel, facilities, equipment and materials, other program inputs, and client inputs. Personnel costs include all the human resources needed by the approach. Levin and McEwan recommend that personnel be classified according to their roles (instructional, administration, clerical, and so forth), qualifications (training, experience, specialized skill), and time commitments (full time, part time). Facilities costs include all resources required to provide physical space for the approach. Equipment and materials include furnishings, instructional equipment, and supplies. Other inputs in this scheme include components that do not fit elsewhere (e.g., instructor training and insurance costs). Other costs are especially relevant in military and industrial training, where student pay and allowances are funded by the same organization that provides the instruction, thereby increasing interest in the speed with which students reach objective thresholds of competency. Much of the rationale for applying technology in industrial and military training is keyed to its capacities for tutorial individualization, which allows the adjustments for prior learning and self-pacing to qualify students more quickly for duty or allows students to maximize their competencies—be all they can be—while holding instruction time constant.32


FIGURE 1. Cost model framework for instruction.


Kearsley33 developed a model much like Levin and McEwan's but with an added dimension for the components, or categories, of instruction system development: analysis, design, development, implementation, and evaluation. These components can be combined with the typical cost categories of personnel, facilities, and equipment and materials. Integrating these two categories yields the cost framework shown in Figure 1, which presents an outline, not a fully developed cost model. Explicit discussion of what is included in, and/or excluded from, each cell of this framework will help analysts know what they are talking about and will help decision makers determine the extent to which an analysis can be applied to inform their decisions. It is rare, if not impossible, for a cost analysis in any area, including instruction, to be entirely correct. As discussed earlier, every such analysis requires assumptions and extrapolations, but it can and should be explicit. The framework shown in Figure 1 may contribute to this end.

Benefit–Cost Analysis
A benefit–cost analysis is used to determine whether the benefits returned by a candidate course of action outweigh the costs of investing in it. The calculation of a benefit-to-cost ratio is straightforward, as described by Fitzpatrick et al,34 Phillips,35 and McDavid and Ingleson,36 among others. It reduces all costs of an action to a single unit. It does the same for all benefits and then calculates the ratio of benefits to costs. We can calculate a benefit-to-cost ratio using whatever metrics we choose, but the terms for input and output must be commensurable (i.e., both must be measured using the same units). Monetary units tend to be those most readily translated from whatever investment resources are required and whatever returns are produced. For that reason, these ratios are usually expressed in terms of dollars, pounds, euros, or whatever monetary unit communicates most easily and usefully to likely decision makers.

A benefit-to-cost ratio is calculated as follows (Phillips35; McDavid and Ingleson36):

Benefit-to-cost ratio = Value of the result / Cost of the investment

It tells us how many units of value we get for every unit of cost. For instance, Thompson37 reported that in 1667 public health officials in London found that expenditures to combat the plague would yield a benefit-to-cost ratio of 84:1.

Return on Investment
Return on investment is closely related to benefit-to-cost ratios. It is also a ratio, and calculating it is as straightforward as its name suggests. It is calculated as follows (Phillips35; McDavid and Ingleson36):

Return on investment = (Value of the result − Cost of the investment) / Cost of the investment

Return on investment must be calculated for some period of time, such as a year. As with monetary units, the length of time should be determined by analysts in consultation with decision makers who are likely to use the results of the analysis. Example:
—Fletcher and Chatham38 studied returns from investing in several training innovations. They found ratios of 2.49 for the "Top Gun" investment in training Navy combat pilots, 3.37 for using technology-based, in-transit training to sustain and enhance the bombing skills of pilots, and 2.50 if technology-based training were used for 40% of Department of Defense specialized skill training.
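To make the two ratios concrete, the short sketch below computes each one for an invented training program; the dollar figures are hypothetical and are not drawn from the studies cited above.

```python
# Worked illustration of the two ratios defined above, using invented numbers:
# a hypothetical training program that costs $200,000 and returns $500,000 in
# value over the period of interest.

def benefit_cost_ratio(value, cost):
    return value / cost                    # units of value per unit of cost

def return_on_investment(value, cost):
    return (value - cost) / cost           # net benefit per unit invested

value, cost = 500_000.0, 200_000.0
print(benefit_cost_ratio(value, cost))     # 2.5 -> $2.50 returned per $1 spent
print(return_on_investment(value, cost))   # 1.5 -> $1.50 net gain per $1 invested
```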


Benefit–cost and return-on-investment analyses require value and cost to be commensurable. Of the two, return-on-investment analysis may be preferred because it indicates how many units of net benefits are returned, after investment costs have been subtracted, for each unit invested. Of course, spikes, dips, and diminishing returns have to be considered with differently timed units of investment, so averaging and curve smoothing may be required. Return-on-investment analysis may be helpful for an ancillary reason. It treats costs for education and training explicitly as investments, not as infrastructure expenses. Treating these costs as infrastructure expenses is often their fate in training venues, including those of the Department of Defense, where training is bundled with transit, hospitalization, and stockade costs.

Cost-effectiveness analysis
When commensurability is difficult, cost-effectiveness analysis can be used.31,35,36 Costs of investment can usually be expressed in monetary units, but the full return—the benefits—of instruction may not be amenable to monetary units. Cost-effectiveness analysis allows effectiveness (e.g., information retention, job knowledge and motivation of workers, supervisor ratings, and productivity) to be measured in its own units. In instruction, it accommodates a more comprehensive range of objective outcomes than analyses requiring commensurability. Cost-effectiveness is calculated as a direct ratio of cost to benefits or benefits to cost. In determining cost-effectiveness, the usual practice is to hold either costs or effectiveness constant across all alternatives being considered and observe variations in the other. Sometimes, either costs or effectiveness is simply assumed to be constant across the alternatives. One could argue that cost is implicitly assumed to be constant by its absence from many instructional evaluations. The assumption may be reasonable, but analysts should present data or information to validate it so that decision makers can decide for themselves if it is warranted. The good news is that cost-effectiveness does not require commensurability. The bad news is that it is a relative term. Relevant decision alternatives must be specified in assessing it. The addition of an alternative for achieving the objective(s) after a cost-effectiveness analysis is done can change its conclusions and recommendations entirely. Despite common usage, we cannot properly say that an investment, by itself, is or is not cost-effective; however, no harm is done in calculating a cost-effectiveness ratio for it. Example:
—Fletcher et al39 combined experimental data reported by Jamison et al40 and Levin et al41 with their own empirical findings to assess the costs of raising student scores on a standard test of mathematics comprehension by one standard deviation. They compared these costs for professional tutoring, peer tutoring, reducing class size, increasing instructional time, and using computer-based instruction. They found that the most cost-effective approaches among all these alternatives were peer tutoring and computer-based instruction.
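As a sketch of how such a comparison might be set up, the fragment below computes cost per unit of effectiveness for several invented alternatives; the categories echo the kinds of options discussed above, but the numbers are hypothetical and are not the figures reported in the studies cited here.

```python
# Illustration of a cost-effectiveness comparison: cost stays in dollars while
# effectiveness is measured in its own units (here, a hypothetical gain in test
# scores, in standard deviations). The alternatives and numbers are invented.

alternatives = {
    "peer_tutoring":      {"cost": 600.0,  "effect_sd": 0.40},
    "computer_based":     {"cost": 750.0,  "effect_sd": 0.45},
    "smaller_class_size": {"cost": 2500.0, "effect_sd": 0.30},
}

# Cost per unit of effectiveness; lower is better among the listed alternatives.
for name, alt in alternatives.items():
    ratio = alt["cost"] / alt["effect_sd"]
    print(f"{name}: ${ratio:,.0f} per standard deviation of improvement")
```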

Cost-effectiveness analyses have been used in health care since the 1960s to determine the relative value of specific interventions, such as a medication, surgical procedure, or counseling techniques.42 Example:
—Tsai et al43 calculated cost-effectiveness ratios that compared hospital-based home care for patients who had mental illness with care based on traditional, outpatient therapy. They measured effectiveness in terms of disease maintenance behavior, psychotic symptoms, social function, and service satisfaction. Overall cost was the sum of costs for all direct mental health services. They found cost per unit of effectiveness to be $4.3 for home care and $13.5 for outpatient therapy.

One form of cost-effectiveness analysis is cost-utility analysis, where the return is assessed in terms of utility or value received by the beneficiaries of the investment. Cost-utility analysis is frequently recommended and promoted but rarely used in sectors other than health services, where decision makers often assess different quality-of-life alternatives for their patients.44,45 They must balance quality of life against additional years of life to help patients review the net benefit or utility provided by different treatments.

Assessment of Simulation
Decision making concerning potential improvements in training raises two basic questions. Compared to current practice, does it produce threshold levels of human performance capabilities at less cost, or does it increase human performance capabilities while holding costs constant? Both costs and effectiveness must be considered if assessments of training simulation are to inform decision making in a responsible manner.30,46

Transfer effectiveness ratios

Assessment of Simulation

Decision making concerning potential improvements in training raises two basic questions. Compared to current practice, does the improvement produce threshold levels of human performance capabilities at less cost, or does it increase human performance capabilities while holding costs constant? Both costs and effectiveness must be considered if assessments of training simulation are to inform decision making in a responsible manner.30,46

Transfer effectiveness ratios

A key issue is the extent to which capabilities produced through simulation-based training transfer to “real-world” tasks. More specifically, does the human performance produced by simulation-based training either reduce costs without diminishing performance or improve performance without increasing cost? One approach to this issue is the use of transfer effectiveness ratios (TERs). TERs were developed by Roscoe and Williges47 for aircraft pilot training, but they apply to simulation-based training in general. A TER can be defined as follows:

TER = (Tc − Tx) / X

where TER is the transfer effectiveness ratio; Tc is the time or trials required for a control/baseline group to reach criterion performance; Tx is the time or trials required for an experimental group to reach criterion performance after X time or trials using simulation (or any other instructional approach of interest); and X is the time or trials spent by the experimental group using the simulation.
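Under these definitions, computing a TER is a one-line calculation. The sketch below is illustrative only; the hours are hypothetical values chosen to echo the 0.15 ratio reported in the Taylor et al example that follows.

```python
# Minimal sketch of the TER defined above. The numbers are hypothetical illustrations,
# not data from the studies cited in this article.

def ter(t_control, t_experimental, x_simulation):
    """Transfer effectiveness ratio: objective-task time (or trials) saved per unit of simulation."""
    return (t_control - t_experimental) / x_simulation

# Example: a baseline group needs 30.0 flight hours to reach criterion; a group that
# first spends 10.0 hours in a simulator needs 28.5 flight hours.
print(ter(t_control=30.0, t_experimental=28.5, x_simulation=10.0))  # 0.15
```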


Roughly, the TER indicates how many of the trials or units of time otherwise needed to achieve criterion performance in the objective experience (e.g., flying an aircraft, repairing a radar repeater, or performing a medical procedure) are saved for every unit of simulation training invested.

Example: Taylor et al48 used TERs to compare times required to reach criterion performance in using specific aviation instruments with and without a Personal Computer Aviation Training Device (PCATD). One group was trained only during flight in the aircraft. A second group was trained first with the PCATD and later in the aircraft. Criterion performance was measured during flight. Taylor et al found that the PCATD group required about 4 hours less of in-flight training, suggesting a transfer effectiveness ratio of 0.15—or a savings of 1.5 flight hours for each 10 hours of PCATD time. These findings suggest that the requisite levels of performance can be attained at lower cost using simulation—if the simulation costs less to operate than an airplane. If it does, then “the larger the TER, the better” is good news for the simulation.

Example: Orlansky et al49 compared the costs of flying military aircraft with the cost of operating (“flying”) simulators. They found that the cost of operating a flight simulator was about one-tenth the cost of operating military aircraft, so the use of a flight simulator was generally cost-effective if the TER for the simulator exceeded 0.10.

This finding is useful and significant. However, a few caveats are in order. First, as Povenmire and Roscoe50 pointed out, not all simulation training hours are equal. Early trials or hours in a simulation may save more trials or time than later ones. A TER is likely to decrease monotonically and approach zero for large values of simulation training. This consideration leads to learning-curve differences between TERs and incremental TERs, or ITERs, with the inevitable diminishing returns captured best by the latter. An ITER can be defined as follows:

ITER = (T(x−ΔX) − Tx) / ΔX

where ITER is the incremental transfer effectiveness ratio; T(x−ΔX) is the time or trials required to reach criterion performance with access to simulation after completing x − ΔX units of time or trials; X is the time or trials spent by the experimental group using the simulation; Tx is the time or trials required to reach criterion performance with access to simulation after completing x units;

and ΔX is the incremental unit of time or trials after starting at unit X. Roughly, the ITER indicates the amount of transfer produced by successively greater increments of time or trials in the simulation. As Morrison and Holding51 pointed out, total time or trials to criterion begin to decrease as the use of effective simulation increases but, sooner or later, begin to increase again. At some point, total time or trials to criterion with simulation will exceed those without it and produce negative TERs.

Example: Taylor et al52 used ITERs to compare the number of trials to specific completion standards, time to complete a flight lesson, and time to a successful evaluation flight with and without a PCATD. One group trained only during flight in the aircraft, and three other groups trained first with the PCATD and later in the aircraft. Criterion performance was measured during flight. The number of trials to reach criterion was less for all three PCATD groups than for the aircraft-only group. The three experimental groups trained with the PCATD for 5, 10, and 15 hours, respectively. The 10-hour PCATD group required the fewest trials to reach criterion for five of the eight criterion tasks, the 5-hour PCATD group required the fewest trials for two of the criterion tasks, and the 15-hour PCATD group required the fewest trials on only one criterion task. Average ITERs were 0.662, 0.202, and 0.148, respectively, for the 5-hour, 10-hour, and 15-hour PCATD groups, indicating the diminishing returns from time in simulation training accounted for by ITERs.
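The following sketch computes ITERs over successive increments of simulation time. The hours to criterion are hypothetical placeholders, chosen only to reproduce the pattern of diminishing returns in the Taylor et al example.

```python
# Minimal sketch of the incremental TER (ITER) defined above, evaluated over successive
# increments of simulation time. All values are hypothetical placeholders.

def iter_ratio(t_before, t_after, delta_x):
    """Incremental transfer effectiveness ratio for one additional block of simulation time."""
    return (t_before - t_after) / delta_x

# Hypothetical: objective-task hours to criterion after 0, 5, 10, and 15 hours of simulation.
hours_to_criterion = {0: 30.0, 5: 26.7, 10: 25.7, 15: 25.0}

sim_points = sorted(hours_to_criterion)
for prev, curr in zip(sim_points, sim_points[1:]):
    value = iter_ratio(hours_to_criterion[prev], hours_to_criterion[curr], curr - prev)
    print(f"{prev}->{curr} h of simulation: ITER = {value:.3f}")
# Diminishing returns show up as ITERs that shrink with each added block of simulation.
```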


Second, transfer effectiveness is tied to the specific skill, knowledge, or performance levels—the training objectives—being sought. This issue was illustrated in a study by Holman53 involving a CH-47 helicopter simulator. Holman found that if the knowledge and skills of interest were simply overall ability to fly the helicopter, the TER was 0.72. However, he also found that the 24 TERs for the specific skills he examined ranged from 2.8 to 0.0. The TER that is relevant depends, as in all assessment, on the decision it is intended to inform, which includes the type of transfer sought. Holman required straightforward, “near” transfer, where many elements exercised by the simulator are similar, if not identical, to those required by the objective task performance. Near transfer echoes long-standing prescriptions for including identical elements that are shared by the learning (e.g., simulation) and the eventual task environments.54,55 Other applications may require “far” transfer from simulation to the objective task, where fewer elements are common to both simulation and task performance and higher-level thought and analysis is required by the performer for transfer to occur.56 Similarly, transfer may carry automated responses learned in simulation to the objective environment in a straightforward “low-road” fashion. Alternatively, it may require less focus on automatic responses and greater abstraction of simulation performance to conceptual levels that transfer indirectly but broadly to many objective environments (i.e., “high-road” transfer, which requires purposeful attention in the learning environment to the development of learners’ transfer abilities57). Given the physical and anatomical differences of human beings, both low- and high-road transfer seem particularly important in the development and assessment of medical training. These forms of transfer have received some attention in the development and assessment of simulations for medical training, but more may be in order.

Third, the operating costs of objective, targeted tasks differ markedly and can produce quite different tradeoffs in assessing the cost-effectiveness of simulation-based training. For instance, Povenmire and Roscoe50 considered flight simulation training for Piper Cherokee pilots, where the cost ratio of simulation to targeted performance was 0.73, thereby requiring a much higher TER for cost-effectiveness than Orlansky et al49 found for high-performance military aircraft.
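The break-even logic behind these comparisons can be stated in a few lines. In the sketch below, the cost figures are rounded stand-ins for the two cases just described, not the published cost data.

```python
# Sketch of the break-even logic implied above: simulation "pays" when the task hours
# it saves are worth more than the simulation hours it consumes. Cost figures are
# hypothetical placeholders.

def break_even_ter(cost_per_sim_hour, cost_per_task_hour):
    """TER above which substituting simulation reduces total operating cost."""
    return cost_per_sim_hour / cost_per_task_hour

# Roughly the Orlansky et al. case: simulator ~1/10 the hourly cost of the aircraft.
print(break_even_ter(cost_per_sim_hour=1.0, cost_per_task_hour=10.0))   # 0.10
# Roughly the Povenmire and Roscoe case: a simulator costing ~0.73 of the aircraft.
print(break_even_ter(cost_per_sim_hour=0.73, cost_per_task_hour=1.0))   # 0.73
# A measured TER is then compared against this threshold for the application at hand.
```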

Cost/effectiveness versus effectiveness/cost

TERs primarily concern ways to minimize costs while holding effectiveness constant. In this regard, the ratio between the fastest and slowest learners in typical classrooms appears to be at least 4:1.32 Learner ability remains a factor, but this ratio is most directly linked to prior knowledge. The variety and extent of prior knowledge increase with the age and experience of the learner, thereby making adjustments for it increasingly important for adult learners. Individualizing instruction by taking into account the knowledge and skill that each learner brings to training has been found to reduce time or trials to reach criterion levels of performance. Costs for specialized technical training in areas such as medicine might be reduced by as much as one-fourth if the capabilities currently available through computer technology were implemented to take advantage of these differences.38

On the other hand, maximizing effectiveness while holding costs constant may be more appropriate for military training. Personnel commands, which prepare orders to pass course graduates on to their next duty station, have found it prohibitively difficult to deal with individuals leaving training at arbitrary times. Fast learners who finish early are often detailed to necessary but undesirable duties and thereby have few incentives to save resources by shortening their time in training. It appears to be more feasible and beneficial for military organizations to provide training that allows each learner to “be all they can be,” while holding graduation dates for all students constant. For instance, learners who have experience with a topic exercised in simulation might be presented a more difficult exercise on that topic to enhance their knowledge or skill while holding simulation time constant. This procedure could accommodate the needs of military personnel systems to synchronize the preparation of post-training

orders and, through various personnel actions, provide incentives for learners to take full advantage of opportunities to train beyond threshold levels of performance—training that could best be made available using simulation.32,38 Cost savings under procedures to maximize performance while holding time constant have been shown to be considerable.38,58 Unfortunately, most of these savings are realized in duty commands and not in the training commands that must bear the costs of developing and providing the extra training for fast learners. These costs can be minimized using simulation, but, at present, local training commands have limited incentives to implement such procedures. Further, return on investment appears to be relatively insensitive to development costs at military training scales.58 The Services could invest much more in the development of high-quality training and still receive strong monetary return on investment. The return to operational effectiveness for this investment is also likely to be substantial, but it is far more difficult to assess. Isoperformance

TERs cover transfer issues, but we would like to cover costs along with transfer effectiveness in a single omnibus analysis so that allocations of training time or trials between, for example, simulation and “hands-on” exercises produce targeted levels of performance at minimal cost. Isoperformance provides one approach for solving this problem. The basic idea is to devise a function, usually depicted as an isoperformance curve, showing every point where different combinations of training inputs produce equivalent performance outputs.51,59,60 The solution, then, is to find the point on the curve where costs are minimized.

Isoperformance relates two or more training inputs to a training outcome held at some prescribed value or level. It is generally assumed that each input by itself could produce the desired level of outcome; however, some inputs may provide unique contributions to the outcome, necessitating their inclusion at least to some degree. Isoperformance identifies all combinations of the inputs needed to produce the objective performance.

Bickley61 pointed out that cost considerations in simulation-based training require at least two component functions. First, a function is needed to relate simulation trials or time to their costs. Second, a function is needed to relate trials or time in simulation to performance on the “real” task or job. The first consideration can usually be treated with a simple linear function to account for time or trials in simulation. The second consideration is more complicated. It is called an isoperformance curve because it trades off simulation time or trials with real task experience while holding performance on that task at some threshold level. It requires the analyst to specify a criterion level of performance and a level of confidence for achieving it. Given these considerations, the factors that determine criterion performance can be traded off against one another, as shown in Figure 2.


FIGURE 2. Notional isoperformance curve drawn as a function of simulator and actual equipment costs.

Performance—the output of the training—is expected to be the same everywhere on the total cost curve (the upper curve in Figure 2). Total costs initially decrease as simulation time or trials are substituted for those with the (presumably more expensive) real-world objective task or job. Costs then begin to increase as more and more simulation training is allocated and substituted in. Costs for nonsimulation training—initially the middle curve in Figure 2—start in the same place as total costs but then decrease monotonically as more and more simulation training is substituted in. Notably, these costs rarely reach zero because sooner or later training will have to include time or trials in performing the objective real-world task or job. Costs for simulation—initially the bottom curve in Figure 2—start at zero and rise monotonically with its increasing use.

Bickley recommended the following formulation for an isoperformance curve, which appears as the top curve in Figure 2:

Y = ae^(bx) + c

where Y is the time or trials in the real task required to reach criterion performance; x is the time or trials in simulation; and a, b, and c are the parameters of the model. Given a data set with reasonable variability in matching simulation time or trials to task proficiencies, the values for a, b, and c in this model can be calculated, and an appropriate isoperformance curve can be developed. Algorithms for doing so are available, as Bickley61 and de Weck and Jones60 point out. The cost-effective solution under this formulation is given by the minimum on the upper, total cost curve.
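As an illustration of this formulation, the sketch below fits a, b, and c to a synthetic data set and then locates the lowest-cost allocation on a grid. The data points, the cost rates, and the use of SciPy curve fitting are assumptions made for the example; they are not drawn from Bickley's study.

```python
# Minimal sketch of Bickley's formulation with synthetic data standing in for real
# transfer observations. Cost rates and data points are hypothetical placeholders.
import numpy as np
from scipy.optimize import curve_fit

def iso(x, a, b, c):
    # Task time (or trials) to criterion as a function of simulation time x.
    return a * np.exp(b * x) + c

# Hypothetical observations: (simulation hours, task hours still needed to reach criterion).
sim_hours  = np.array([0.0, 2.0, 5.0, 10.0, 15.0, 20.0])
task_hours = np.array([30.0, 23.8, 17.3, 11.4, 8.5, 7.2])

(a, b, c), _ = curve_fit(iso, sim_hours, task_hours, p0=(25.0, -0.2, 5.0))

# Hypothetical hourly costs: simulation is assumed far cheaper than the real task.
COST_SIM, COST_TASK = 100.0, 1000.0

x_grid = np.linspace(0.0, 40.0, 401)
total_cost = COST_SIM * x_grid + COST_TASK * iso(x_grid, a, b, c)
best = x_grid[np.argmin(total_cost)]
print(f"fitted a={a:.1f}, b={b:.3f}, c={c:.1f}; lowest-cost allocation ~{best:.1f} simulation hours")
```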

It can then be used to allocate training time or trials between simulation and the real equipment. In effect, it holds performance (or effectiveness) constant and suggests an allocation of inputs that minimizes costs. Carter and Trollip62 illustrated the other side of the coin. They used a mathematically equivalent approach to devise an optimal strategy for maximizing performance (or effectiveness again) given fixed costs.

The problem of collecting appropriate transfer data to use in TER or isoperformance analyses remains for some applications. Many of these analyses trade off simulation for training (e.g., aircraft piloting and tank gunnery) that otherwise would be exorbitantly expensive. Collecting adequate data to show all combinations of training inputs (e.g., simulation and aircraft piloting) that produce equivalent performance outputs can easily swamp a training developer’s budget. Morrison and Holding51 suggested that a solution to this problem would be to use limited but valid empirical data accompanied by expert judgment to double-check findings and fill in gaps. They suggest pilot “dosage” experiments with no simulation training, a great deal of simulation training, and two to three different allocations of simulation training in between. Findings from such experiments could then be reviewed and supplemented by expert judgment to produce an approximate learning curve sufficient for either TER or isoperformance analysis. If times or trials to criterion in simulation are a matter of hours or days, if the training is for a critical task or job, and/or if the tasks to be learned are inexpensive relative to piloting military aircraft (as may be the case for many medical procedures), this approach seems reasonable and, in fact, prudent.

Morrison and Holding’s51 application of isoperformance analysis concentrated on gunnery training. The main idea


was to use simulation to save training ammunition. Other examples are available. For instance, Bickley61 focused on simulator versus flight time in the Army’s AH-1 helicopter. Jones and Kennedy59 discussed trading off personnel aptitude against training time. They also provide an appendix that shows step by step how to create an isoperformance curve. de Weck and Jones60 provide examples from spacecraft design and professional sports. Isoperformance analysis can be applied to any trade-off issue, including the use of simulation in medical training. Basically, isoperformance curves are just cost curves. It may be time to invest more seriously in this approach. SUMMARY AND DISCUSSION Discussion in this article has ranged from generally applicable techniques (economic analysis) to those techniques specifically focused on simulation-based instruction (TERs and isoperformance). Issues of benefit cost, net benefit cost, and cost-effectiveness seem applicable in a straightforward fashion to any sort of medical training and education. However, commensurability is a problem: How are we to capture fully in monetary terms the value of a patient’s life, quality of life, and overall health? It is solved to an appreciable degree by cost-effectiveness analysis, provided that we identify a comprehensive set of realistic alternatives. In contrast to cost-effectiveness, return on investment does not require the identification of all likely alternatives. Different returns from different investments in education and training can be compared later as they arise. However, return on investment focuses on investment costs, which may be unknown or sunk compared to existing alternatives. The result is that the research and development costs and initial investment costs of a new approach may need to be included. The new approach may then be at a disadvantage when considered and compared with return from existing approaches, where such costs are unknown, sunk, and omitted from the analysis. Applications of TERs and isoperformance to provide economic analyses for simulation used in medical training and education seem both feasible and worthwhile if our analyses are to treat costs in training seriously. Adequate policies and procedures for the cost and effectiveness of training programs might be developed without the expenditure of time, effort, and cost required for optimization, but they will require generally accepted cost models with well-defined cost elements, including those associated with simulation for medical training. These approaches may earn their keep by advancing the field beyond guesswork and/or administrative fiat in the competitive allocation of increasingly scarce resources to medical education and training. ACKNOWLEDGMENT Funding for this article was provided by the Office of the Deputy Assistant Secretary of Defense (Readiness), Training Readiness, and Strategy Directorate.


REFERENCES 1. Bradley P: The history of simulation in medical education and possible future directions. Med Educ 2006; 40(3): 254–62. 2. Rosen KR: The history of medical simulation. J Crit Care 2008; 23(2): 157–66. 3. Barrows HS, Abrahamson S: The programmed patient: a technique for appraising student performance in clinical neurology. J Med Educ 1964; 39: 802–5. 4. Collins JP, Harden RM: The Use of Real Patients, Simulated Patients, and Simulators in Clinical Examinations (AMEE Medical Education Guide, No. 13). Dundee, UK, AMEE, 2004. 5. Wilson L, Rockstraw L (editors): Human Simulation for Nursing and Health Professions. New York, Springer, 2012. 6. Gerner B, Sanci L, Cahill H, et al: Using simulated patients to develop doctors’ skills in facilitating behaviour change: addressing childhood obesity. Med Educ 2010; 44(7): 706–15. 7. Betcher DK: Elephant in the room project: improving caring efficiency through effective and compassionate communication with palliative care patients. Medsurg Nurs 2010; 19(2): 101–5. 8. Safdieh JE, Lin AL, Aizer J, et al: Standardized patient outcomes trial (SPOT) in neurology. Med Educ Online 2011; 16(1): 1–6. 9. Cook DA, Triola MM: Virtual patients: a critical literature review and proposed next steps. Med Educ 2009; 43(4): 303–11. 10. Cendan JC, Lok B: The use of virtual patients in medical school curricula. Adv Physiol Educ 2012; 36(1): 48–53. 11. Cannon-Bowers JA, Bowers C, Procci K: Using video games as educational tools in healthcare. In: Computer Games and Instruction, pp 44–72. Edited by Tobias S, Fletcher JD. Charlotte, NC, Information Age Publishing, 2011. 12. Steadman RH, Coates WC, Huang YM, et al: Simulation-based training is superior to problem-based learning for the acquisition of critical assessment and management skills. Crit Care Med 2006; 34(1): 151–7. 13. Ten Eyck RP, Tews M, Ballester JM: Improved medical student satisfaction and test performance with a simulation-based emergency medicine curriculum: a randomized controlled trial. Ann Emerg Med 2009; 54(5): 684–91. 14. Botezatu M, Hult H, Tessma MK, Fors U: Virtual patient simulation: knowledge gain or knowledge loss? Med Teach 2010; 32(7): 562–8. 15. Satava RM: Accomplishments and challenges of surgical simulation: dawning of the next-generation surgical education. Surg Endosc 2010; 15(3): 232–41. 16. Gaba DM, DeAnda A: A comprehensive anesthesia simulation environment: re-creating the operating room for research and training. Anesthesiology 1988; 69(3): 387–94. 17. Alinier G, Hunt WB, Gordon R: Determining the value of simulation in nurse education: study design and initial results. Nurse Educ Pract 2004; 4(3): 200–7. 18. Radhakrishnan K, Roche JP, Cunningham H: Measuring clinical practice parameters with human patient simulation: a pilot study. Int J Nurs Educ Scholarsh 2007; 4: Article8. 19. Cendan JC, Johnson TR: Enhancing learning through optimal sequencing of web-based and manikin simulators to teach shock physiology in the medical curriculum. Adv Physiol Educ 2011; 35(4): 402–7. 20. Crochet P, Aggarwal R, Dubb SS, et al: Deliberate practice on a virtual reality laparoscopic simulator enhances the quality of surgical technical skills. Ann Surg 2011; 253(6): 1216–22. 21. Lee TL, Son JH, Chandra V, Lilo E, Dalman RL: Long-term impact of a preclinical endovascular skills course on medical student career choices. J Vasc Surg 2011; 54(4): 1193–200. 22. Privett B, Greenlee E, Rogers G, Oetting TA: Construct validity of a surgical simulator as a valid model for capsulorhexis training. 
J Cataract Refract Surg 2010; 36(11): 1835–8. 23. Coles TR, John NW: The effectiveness of commercial haptic devices for use in virtual needle insertion training simulations. In: 2010 Third International Conference on Advances in Computer-Human Interactions



(ACHI 2010), pp 148–53. Piscataway, NJ, The Institute of Electronic and Electrical Engineers, 2010. Available at http://www.computer.org/csdl/proceedings/achi/2010/3957/00/3957a148-abs.html; accessed May 20, 2013.
24. Barsuk JH, McGaghie WC, Cohen ER, O’Leary KJ, Wayne DB: Simulation-based mastery learning reduces complications during central venous catheter insertion in a medical intensive care unit. Crit Care Med 2009; 37(10): 2697–701.
25. Holzinger A, Kickmeier-Rusta MD, Wassertheurera S, Hessinger M: Learning performance with interactive simulations in medical education: lessons learned from results of learning complex physiological models with the HAEMOdynamics SIMulator. Comput Educ 2009; 52(2): 292–301.
26. Simon HA: Administrative Behavior. Ed 4. New York, Free Press/Simon & Schuster, 1997.
27. Becker GS: The Economic Approach to Human Behavior. Chicago, IL, University of Chicago Press, 1976.
28. Keeney RL, Raiffa H: Decisions With Multiple Objectives: Preferences and Value Tradeoffs. Cambridge, UK, Cambridge University Press, 1976.
29. Von Neumann J, Morgenstern O: Theory of Games and Economic Behavior. Princeton, NJ, Princeton University Press, 1944.
30. Mishan EJ, Quah E: Cost-Benefit Analysis. London, UK, Routledge, 2007.
31. Levin HM, McEwan PJ: Cost-Effectiveness Analysis. Thousand Oaks, CA, Sage Publications, 2001.
32. Fletcher JD: Education and training technology in the military. Science 2009; 323(5910): 72–5.
33. Kearsley G: Costs, Benefits, and Productivity in Training Systems. Reading, MA, Addison-Wesley, 1982.
34. Fitzpatrick JL, Sanders JR, Worthen BR: Program Evaluation: Alternative Approaches and Practical Guidelines. Ed 3. New York, Allyn & Bacon, 2003.
35. Phillips JJ: Return on Investment in Training and Performance Improvement Programs. Ed 2. Oxford, UK, Butterworth-Heinemann, 2003.
36. McDavid JC, Ingleson LRL: Program Evaluation and Performance Measurement. Thousand Oaks, CA, Sage Publications, 2006.
37. Thompson MS: Benefit-Cost Analysis for Program Evaluation. Beverly Hills, CA, Sage Publications, 1980.
38. Fletcher JD, Chatham RE: Measuring return on investment in military training and human performance. In: Human Performance Enhancements in High-Risk Environments, pp 106–28. Edited by O’Connor PE, Cohn JV. Santa Barbara, CA, Praeger/ABC-CLIO, 2010.
39. Fletcher JD, Hawley DE, Piele PK: Costs, effects, and utility of microcomputer assisted instruction in the classroom. Am Educ Res J 1990; 27(4): 783–806.
40. Jamison DT, Fletcher JD, Suppes P, Atkinson RC: Cost and performance of computer-assisted instruction for education of disadvantaged children. In: Education as an Industry, pp 201–40. Edited by Froomkin JT, Jamison DT, Radner R. Cambridge, MA, Ballinger Publishing, 1976.
41. Levin HM, Glass GV, Meister GR: Cost-effectiveness of computer-assisted instruction. Eval Rev 1987; 11(1): 50–72.
42. American College of Physicians (ACP): Primer on cost-effectiveness analysis. Effective Clin Pract 2000; 3(5): 253–5. Available at http://www.vaoutcomes.org/downloads/Cost_Effectiveness_Analysis.pdf; accessed April 10, 2012.
43. Tsai SL, Chen MB, Yin TJ: A comparison of the cost-effectiveness of hospital-based home care with that of a conventional outpatient follow-up for patients with mental illness. J Nurs Res 2005; 13(3): 165–73.
44. Drummond M, McGuire A: Economic Evaluation in Health Care: Merging Theory With Practice. Oxford, UK, Oxford University Press, 2001.

45. Vanhook PM: Cost-utility analysis: a method of quantifying the value of registered nurses. OJIN 2007; 12(3). 46. Petitti DB: Meta-Analysis, Decision Analysis, and Cost-Effectiveness Analysis: Methods for Quantitative Synthesis in Medicine. New York, Oxford University Press, 2000. 47. Roscoe SN, Williges BH: Measurement of transfer of training. In: Aviation Psychology, pp 182–93. Edited by SN Roscoe. Ames, IA, Iowa State University Press, 1980. 48. Taylor HL, Lintern G, Hulin CL, Talleur DA, Emanuel T, Phillips SI: Transfer of training effectiveness of a personal computer aviation training device. Int J Aviat Psychol 1999; 9(4): 319–35. 49. Orlansky J, Knapp MI, String J: Operating Costs of Aircraft and Flight Simulators (IDA Paper P-1733). Alexandria, VA, Institute for Defense Analyses, 1984. (DTIC AD-A144241). Available at http://www.dtic .mil/dtic/tr/fulltext/u2/a144241.pdf; accessed May 20, 2013. 50. Povenmire H, Roscoe SN: Incremental transfer effectiveness of a groundbased general aviation trainer. Hum Factors 1973; 15(6): 534–42. 51. Morrison JE, Holding DH: Designing a Gunnery Training Strategy (Technical Report 899). Alexandria, VA, US Army Research Institute for the Behavioral and Social Sciences, 1990. (DTIC AD-A226129). Available at http://www.dtic.mil/cgi-bin/GetTRDoc?AD=ADA226129; accessed May 20, 2013. 52. Taylor HL, Talleur DA, Emanuel TW Jr., Rantanen EM, Bradshaw GL, Phillips SI: Incremental Training Effectiveness of Personal Computer Aviation Training Devices (PCATD) Used for Instrument Training (Final Technical Report ARL-02-5/NASA-02-3). Savoy, IL, Aviation Research Laboratory Institute of Aviation, University of Illinois at Urbana-Champaign, 2002. Available at http://www.aviation.illinois .edu/avimain/papers/research/pub_pdfs/techreports/02-05.pdf; accessed May 20, 2013. 53. Holman GL: Training Effectiveness of the CH-47 Flight Simulator. Research Report 1209. Alexandria, VA, US Army Research Institute for the Behavioral and Social Sciences, 1979. (DTIC AD-A072317). Available at http://www.dtic.mil/cgi-bin/GetTRDoc?AD=ADA072317; accessed May 20, 2013. 54. Thorndike EL: The influence of first year Latin upon the ability to read English. School Soc 1923; 17: 165–8. 55. Thorndike EL: The Fundamentals of Learning. Ed 1. New York, Teachers College, 1932. 56. Cree VE, Macaulay C: Transfer of Learning in Professional and Vocational Education. London, UK, Routledge, 2000. 57. Perkins DN, Salomon G: Transfer of Learning (Contribution to the International Encyclopedia of Education). Ed 2. Oxford, England, Pergamon Press, 1992. 58. Cohn J, Fletcher JD; What is a pound of training worth? Frameworks and practical examples for assessing return on investment in training. In: Proceedings of the InterService/Industry Training, Simulation and Education Conference (I/ITSEC) 2010. Arlington, VA, National Training and Simulation Association, 2010. Available at http://ntsa.metapress.com/ link.asp?id=x81460715365274p; accessed May 20, 2013. 59. Jones MB, Kennedy RS: Isoperformance curves in applied psychology. Hum Factors 1996; 38(1): 167–82. 60. de Weck OL, Jones MB: Isoperformance: analysis and design of complex systems with desired outcomes. Systems Eng 2006; 9(1): 45–61. 61. Bickley WR: Training Device Effectiveness: Formulation and Evaluation of a Methodology. Research Report 1291. Alexandria, VA, US Army Research Institute for the Behavioral and Social Sciences, 1980. (DTIC AD-A122777). Available at http://www.dtic.mil/dtic/tr/fulltext/ u2/a122777.pdf; accessed May 20, 2013. 62. 
Carter G, Trollip S: A constrained maximization extension to incremental transfer effectiveness, or, how to mix your training technologies. Hum Factors 1980; 22(2): 141–52.


MILITARY MEDICINE, 178, 10:47, 2013

Assessment Methodology for Computer-Based Instructional Simulations

Alan Koenig, PhD; Markus Iseli, PhD; Richard Wainess, PhD; John J. Lee, PhD

ABSTRACT Computer-based instructional simulations are becoming more and more ubiquitous, particularly in military and medical domains. As the technology that drives these simulations grows ever more sophisticated, the underlying pedagogical models for how instruction, assessment, and feedback are implemented within these systems must evolve accordingly. In this article, we review some of the existing educational approaches to medical simulations, and present pedagogical methodologies that have been used in the design and development of games and simulations at the University of California, Los Angeles, Center for Research on Evaluation, Standards, and Student Testing. In particular, we present a methodology for how automated assessments of computer-based simulations can be implemented using ontologies and Bayesian networks, and discuss their advantages and design considerations for pedagogical use.

INTRODUCTION Medical simulations exist in many forms, from stand-alone multimedia computer programs to interactive tactile simulators, to high fidelity virtual reality experiences.1–5 They can be used to teach a range of skills from partial tasks to full procedures,6 and can be valuable in offering practice with patient interactions.5 Furthermore, simulations can be used to teach and test competencies not only related to procedures but also related to higher order thinking skills as well, such as problem solving and decision making.7 However to be effective, medical simulations must be designed with careful attention paid to blending instructional and assessment design goals with appropriate types of interactivity and technological constraints. This article will explore some of the characteristics that have made medical simulations effective and will present a methodology developed by University of California, Los Angeles’, Center for Research on Evaluation, Standards, and Student Testing (CRESST), which highlights how instructional games and simulations can be designed and used to assess student performance. MEDICAL SIMULATION DESIGN CRITERIA Simulations can be designed for a variety of purposes, including individual instruction, team training, selection, and assessment. Despite this variability, most (if not all) medical simulation designs could benefit from deliberate consideration of the following: (1) goals of the simulation, (2) learning objectives, cognitive demands, and assessment, (3) affordances

Graduate School of Education & Information Studies, Center for the Study of Evaluation (CSE), National Center for Research on Evaluation, Standards, & Student Testing (CRESST), University of California, Los Angeles, 300 Charles E. Young Drive North, GSE&IS Building, Box 951522, Los Angeles, CA 90095-1522. The findings and opinions expressed in this article are those of the authors and do not necessarily reflect the positions or policies of the Office of Naval Research. doi: 10.7205/MILMED-D-13-00217


of the simulation, (4) the instructional strategies used, and (5) motivation considerations for users. The goals of a simulation can range from general task goals (i.e., effectively treating a patient) to more specific goals (such as resource management of teams or equipment). From those, the knowledge, skills, and attitudes that a student should attain should be specified as learning objectives, and rooted in specific cognitive demands germane to their mastery.8,9 Objectives should include behaviors (what the student should be able to do), conditions (context in which the behaviors are shown), and standards (to what level of proficiency).10 Simulation affordances address the actions and decisions students can perform/make inside the simulation and how well they map to the real world.11 It relates to task and cognitive fidelity, and involves determining the appropriate mechanisms or objects provided to students to facilitate their actions. Careful consideration must be paid to ensure that the affordances appropriately match the instructional strategies used. The motivation for engagement with the simulation should be taken into account as well.12 While it is true that in many cases the student may be required to use the simulation, there is value in making simulations that are engaging, intuitive, and satisfying to keep students actively involved in the content.13 SIMULATION FEATURES AND BEST PRACTICES McGaghie et al14 in their meta-analyses of recent medical research from 2003 to 2009, enumerate twelve medical simulation features and best practices: (1) providing feedback, (2) deliberate practice, (3) curriculum integration, (4) outcome measurement, (5) simulation fidelity, (6) skill acquisition and maintenance, (7) mastery learning, (8) transfer to practice, (9) team training, (10) high-stakes testing, (11) instructor training, and (12) educational and professional context. Since describing each in detail is beyond the scope of this article, we briefly highlight three key ones, namely feedback, deliberate practice, and assessment. Feedback was the most commonly studied feature of simulator training and considered the most important feature for 47


student learning by Issenberg et al.15 One type of feedback is debriefing, also known as after-action review in military contexts. Debriefing has long been held as the cornerstone of simulator training.16 A critical piece of learning is not experience alone, but reflecting on the experience and rehearsing through simulating the scenarios, environments, etc., are particularly well suited to this. Repetitive practice leads to faster automaticity and was considered key to transferring skills from simulators to real patients (43 studies).15 McGaghie et al17 further examined the role of practice in simulation-based medical education and found that it accounted for almost 50% of the variance in the average weighted effect size variable (h2 = 0.46) from 32 studies, with the most benefit coming from the over 8.1 hours of practice category. Deliberate practice includes repetitive practice and informative feedback17 and is needed to reduce skill decay or degradation.18 There are several strategies used in the assessment of simulations. They range from observations using checklists19,20 to very specific measurements taken by the simulation to more sophisticated statistical modeling. Qualitative methods such as journaling or case studies are also used. Depending on the type of medical simulation used, assessments in different areas are possible,21 including from their meta-analyses of 32 studies: improvement in knowledge (pooled effect size of 1.20), time skills (1.14), process skills (1.09), product skills (1.18), other behaviors (0.81), and effects on patient care (0.50). Other behaviors assessed included instructor ratings of competence, completion of procedures, and procedural errors if present. Large effect sizes over 0.80 like almost all of these22 can be seen when simulations are designed and integrated properly into the curriculum so that the benefits of the simulation can be realized. In addition, newer analysis methods are now available to look at changes in proficiency over time. GAME AND SIMULATION DESIGN, DEVELOPMENT, AND EVALUATION Simulation design and development is a process of defining learning and assessment goals, determining what is needed to achieve those goals, developing an appropriate simulation, and validating its effectiveness. In this article, our discussion applies to both games and simulations. For ease of reading, the term “simulation” will be used synonymously with “game.” CRESST’s Instructional Simulation Design and Development Methodology CRESST defined and showed a methodology for designing and developing learning simulations, which includes the creation of simulations as assessments.23 The methodology (Fig. 1) delineates both processes and role. The methodology supports the ADDIE (Analysis, Design, Development, Implement, Evaluate) instructional design model,24 as well as other instructional design models. Two boxes surround the various components. The upper boxed area surrounds components 48

FIGURE 1. CRESST’s Instructional Simulation Design and Development methodology.

that pertain to learning goals and objectives, cognitive demands associated with achieving the learning goals, explication of relevant ontologies, and the instructional and assessment parcels necessary for teaching and assessing the to-be-learned materials. The lower boxed area surrounds components that pertain to simulation design and development, including genre selection (i.e., first vs. third person perspective) and platform (e.g., game console, PC, or mobile device), modes of interaction with instructional content, a game play model specifying the elements and relationships of a game, and development of the simulation. As can be seen by the overlap of the two boxed areas in Figure 1, educators and simulation developers address four of the same components: assessment and instructional parcels, genre and platform, player interaction framework, and game play model. These are the components where a balance between media and learning occur. For example, there are four ways a learner can interact with instruction.23 One is to present information directly to the learner. Another is to let the learner find it on his/her own, which means it may not be discovered or the timing of discovery may be earlier or later than desired. Based on the importance of the information (a decision by educators), the simulation developer may be required to use a particular player interaction method. The arrows in Figure 1 indicate the flow of the design and development process. Portions of the methodology may require a compromise between educators and developers because of instruction and assessment requirements that may dictate selection of a particular genre or platform and because genres and platforms may restrict the types of instructional or assessment strategies available. There is a similar give and MILITARY MEDICINE, Vol. 178, October Supplement 2013


take regarding player interactions, assessment, and instructional requirements, with genre and platform. The game play model23 describes the components of a game, their relationships, and the directionality of the relationships. This model provides the architecture for ensuring the game adheres to the requirements necessary for being a game. If the goal of the medium is to be a simulation with little or no game play elements, the game play model can be bypassed. The last component, development of the simulation, is an iterative process that continuously revisits the way learners interact with instruction and assessments, and the ways content is represented. For a detailed description of the design and development methodology, refer to Wainess and Koenig.23 Blending Simulation With Instruction and Assessment With all the proposed potential of simulations to support learning, hopes for learning outcomes have not been achieved because not all educational simulations are equally effective.25 The simple presence of educational content in a simulation does not guarantee its efficacy.26 There is a strong consensus in the research community that learning outcomes are affected by the “instructional strategies” used in simulations, not by the simulations themselves.27–30 Recently, researchers have begun to argue that learning outcomes from simulations also depend on how well the domain instruction is integrated into the simulation.26,31,32 This argument is related to “cognitive load theory,” which is concerned with the learners’ cognitive architecture, including a limited working memory33 that can have an adverse effect on learning.34 “Cognitive load” is the total amount of mental activity imposed on working memory at an instance in time.35,36 There are three types of cognitive load: (1) “Intrinsic cognitive load,”37–39 which is the load involved in the process of learning; (2) “Germane cognitive load,” which is the cognitive load imposed by schema formation39,40; and (3) “Extraneous cognitive load,” which is the load caused by any unnecessary stimuli.37 Using one set of mechanics for both


the game and the instruction and assessment can potentially reduce extraneous load. That is, blending learning the simulation with learning the content can lead to efficient and effective instruction by reducing the amount of switching the learner must do between learning the simulation and learning “from” the simulation, thus, lowering extraneous load. The remainder of this article will focus on the components within the Instructional Developer section of the methodology, with a particular emphasis on the CRESST Assessment Methodology. CRESST ASSESSMENT METHODOLOGY FOR COMPUTER-BASED SIMULATIONS CRESST has devised a methodology for automating assessments of student performance in computer-based simulations. Although the foundation of this methodology was developed primarily with military applications, the approach is generalizable to any domain of interest, including medicine. Figure 2 below shows the primary components of the methodology and how they are sequenced together. Ontology Development for Assessment and Simulation Design Ontologies are useful to the design of assessments because of their utility in organizing and storing knowledge “about” things, and about relations “between” things. They can be used to define and represent the current learning status and the educational goals in terms of knowledge (e.g., factual, conceptual, procedural, and metacognitive), skills (e.g., motor, problem solving, and leadership), and attitudes. Ontologies can be considered knowledge bases that contain the definitions, groupings, and relations of all entities of a domain. More specifically, information stored in ontologies can be graphically rendered as networks or maps, with nodes representing entities and links between the nodes showing the relationships and dependencies between the entities. For a more detailed description of ontologies refer to Guarino et al.41
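As a minimal illustration (not CRESST's implementation), an ontology fragment can be stored as a set of part-of relations and traversed in a few lines; the node names below are borrowed from the damage control example discussed later in this article and are used here only as placeholders.

```python
# Minimal sketch: an ontology fragment stored as part-of relations, with simple
# traversal helpers. Node names are illustrative, echoing the damage control example.

PART_OF = {
    "Equipment Knowledge": "Damage Control Knowledge",
    "Equipment Selection": "Equipment Knowledge",
    "Firefighting Equipment": "Equipment Selection",
    "Safety Checks": "Equipment Selection",
    "PPE Gear": "Equipment Selection",
    "Reliefs Standing By": "Equipment Selection",
}

def children(entity):
    """Entities directly related to `entity` by a part-of link."""
    return [child for child, parent in PART_OF.items() if parent == entity]

def ancestors(entity):
    """Walk part-of links upward from a low-level entity toward the top of the ontology."""
    chain = []
    while entity in PART_OF:
        entity = PART_OF[entity]
        chain.append(entity)
    return chain

print(children("Equipment Selection"))
print(ancestors("PPE Gear"))  # ['Equipment Selection', 'Equipment Knowledge', 'Damage Control Knowledge']
```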

FIGURE 2. CRESST Assessment Methodology Components.




Usually entities involved in the assessment process are split into domain-independent entities, such as cognitive demands and higher-level cognitive constructs, and domaindependent entities, such as domain-specific knowledge. This split will help with the design of the initial ontology but will not necessarily be feasible for higher-level constructs that require a combination of cognitive demands and domain knowledge. We distinguish between higher-level entities that usually involve other entities and lower-level entities that represent the basic building blocks of the ontology. The CRESST ontology development process draws on preexisting research in the field.8,42– 45 Below, this process is described, with emphasis on linking educational and/or assessment goals with simulation observations. High-Level Ontology Development

The development of an ontology is driven by the educational goals, which include the assessment and learning goals of the simulation. The goals contain the domain specifications, the required knowledge and skills, as well as the cognitive demands required for successful assessment completion. High-level ontology development delineates and defines the high-level entities and their relationships. Examples of high-level entities are situation awareness,46 decision making, communication, metacognitive skills, strategic skills, reasoning skills, problem-solving skills,47–49 teamwork, standards for practice and content, standard operating procedures, and “big ideas.” If needed, high-level entities are split up further. For example, problem-solving skills might require problem identification, problem representation/modeling, solution planning, and solution evaluation.

A simplified example of the high-level part of a CRESST-developed Navy shipboard damage control ontology is shown in Figure 3. (Note: damage control refers to operations pertaining to addressing the outbreak of fires and floods aboard ships.) It shows how overall damage control knowledge comprises knowledge and skills of other domains or subdomains.

Low-Level Ontology Development

The development of a low-level ontology requires the specification of entities on a more detailed level. These entities

FIGURE 3. Example: Simplified high-level part of Navy shipboard damage control ontology. Relationships, shown as links, are mostly part-of relations. Dotted lines indicate possible additional relationships.


FIGURE 4. Example: Bottom-level part of the Navy shipboard damage control ontology. Unlabeled relationships are part-of relations. Dotted lines indicate possible additional relationships.

usually represent the “basic” knowledge and skills, which cannot be split up any further. Depending on the desired level of granularity of the assessment outcomes, these entities can be very detailed or more general. Examples of low-level entities include conceptual/procedural knowledge of a simple concept, a skill that requires no other skills or knowledge, or a skill that requires other subskills but whose subskills cannot be measured in the current simulation (and therefore are not added to the ontology). To connect high-level and low-level entities, intermediate entities can be added. In Figure 4, a low-level part of the Navy shipboard damage control ontology is depicted with the names of the leaf nodes being “Firefighting Equipment,” “Safety Checks,” “PPE Gear,” and “Reliefs Standing By,” with the intermediate node being “Equipment Selection.” The node “Equipment Knowledge” connects with the corresponding top-level ontology, as depicted in Figure 3. (Note: “PPE” stands for personal protective equipment.)

BAYESIAN NETWORK DEVELOPMENT

The ontology becomes the basis for devising an assessment system capable of evaluating performance and making predictions about the student’s mastery of the domain. At the root of this assessment system is a Bayesian network, which is a directed acyclic graphical model for representing probabilistic dependency relationships between variables.50 The Bayesian network, in essence, exists as an operationalized representation of the ontology. It relates the low-level entities of the ontology (i.e., observable actions or events from the simulation) with latent, nonobservable variables that comprise the high-level ontology entities (such as problem-solving ability). By observing low-level entities, and knowing how (and with what influence) they affect higher-level entities, the Bayesian network can make probabilistic inferences of the student’s mastery of each node represented. The goal is to design the Bayesian network structure such that it reflects an expert human rater’s thought processes when assessing student performance.51 To do this, the system must not only consider the discrete player actions but also consider them in the context in which they occur. Indeed, a single action can be good in one context and bad in another.


Skill mastery is inferred based on the worth of taking one series of actions weighed against the value of taking alternative actions, given the “state of the world.” By having a state-based system in which the dynamic Bayesian network monitors actions relative to previous actions and relative to the state of the world, we increase the validity of the inferences made, and have greater confidence in the predictions of student mastery.
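A stripped-down version of this idea can be written directly: a latent mastery variable with observable child nodes, with the posterior computed by enumeration. The structure, node names, and probabilities below are illustrative assumptions, not parameters from the CRESST network, and the deployed system additionally conditions these probabilities on the state of the world as described above.

```python
# Minimal sketch (not CRESST's engine): one latent "mastery" node with two observable
# child nodes; posterior mastery is computed by direct enumeration.
# All probabilities are illustrative placeholders.

P_MASTERY = 0.5  # prior probability that the student has mastered equipment selection

# P(observed action is correct | mastery state) for two observables logged by the simulation.
P_CORRECT_GIVEN = {
    "PPE Gear":      {True: 0.90, False: 0.40},
    "Safety Checks": {True: 0.85, False: 0.30},
}

def posterior_mastery(observations):
    """P(mastery | observed actions), for observations like {'PPE Gear': True, ...}."""
    likelihood = {True: P_MASTERY, False: 1.0 - P_MASTERY}
    for node, correct in observations.items():
        for mastered in (True, False):
            p = P_CORRECT_GIVEN[node][mastered]
            likelihood[mastered] *= p if correct else (1.0 - p)
    total = likelihood[True] + likelihood[False]
    return likelihood[True] / total

print(posterior_mastery({"PPE Gear": True, "Safety Checks": False}))
```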

Formative vs. Summative Assessments

Formative Assessments

Formative assessments involve the gathering and interpretation of evidence about student achievement to make decisions about the next steps in instruction that are likely to be better, or better founded, than the decisions they would have taken in the absence of this evidence.52 In our formative assessment work, the simulation sends observable student actions and game states to the Bayesian network, which scores them using predefined rubrics that encapsulate both the knowledge base of the subject-matter experts and the contextual state of the world when the action was taken.45 From this, the Bayesian network infers probabilities of mastery and sends appropriate feedback back to the simulation. This feedback could take many forms, including: (a) directing the simulation to serve up a certain type of situation for additional practice; (b) sending instructions to alter the current situation; (c) serving up remedial information; or (d) sending a textual message to guide future performance. Implementing formative assessment facilitates real-time and adaptive responses back to the student, which can boost performance while in session with the simulator.
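One possible policy for mapping inferred mastery onto the feedback forms (a) through (d) is sketched below; the thresholds, attempt counts, and messages are hypothetical and are not taken from the deployed system.

```python
# Illustrative feedback policy keyed to the forms (a)-(d) listed above.
# Thresholds and text are placeholders, not values from CRESST's system.

def choose_feedback(p_mastery, attempts):
    if p_mastery < 0.3 and attempts <= 2:
        return ("remediate", "Serve up remedial information on this task.")               # form (c)
    if p_mastery < 0.3:
        return ("new_situation", "Queue another scenario of this type for practice.")     # form (a)
    if p_mastery < 0.7:
        return ("alter_situation", "Alter the current situation to probe the weak skill.")  # form (b)
    return ("message", "Send a short text message reinforcing the correct procedure.")      # form (d)

print(choose_feedback(p_mastery=0.25, attempts=1))
print(choose_feedback(p_mastery=0.55, attempts=3))
```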

Summative Assessments

Summative assessments encapsulate all evidence gathered as of a certain time, typically at the conclusion of the simulation.53 Indeed, there are situations where it is either not needed or not desired to have real-time assessments. In these cases, the simulation would capture and log all meaningful actions and events, and then feed this data into the Bayesian network post hoc, where it would get scored using preestablished rubrics.

Reporting and Debriefing (After-Action Review)

To make meaningful sense of assessment data, various reporting and debriefing tools are used to summarize the student’s performance—both during the simulation, as well

FIGURE 5. Criterion-referenced report used in summarizing student performance in a shipboard damage control simulation developed for the Navy.




as between simulations. We generally consider two classes of reporting tools: criterion-referenced metrics, and growth modeling reports. Criterion-Referenced Metrics

The automated assessment systems we attach to instructional simulations incorporate criterion-referenced assessments, which are designed to provide a measure of performance that is interpretable in terms of a clearly defined and delimited domain of learning or performance tasks.54 In practice, these criterion-referenced reports consist of various charts, graphs, descriptive summaries, and statistics that indicate how well the student performed and/or their mastery/proficiency within a domain according to preestablished criteria. These types of reports are used both for summative feedback to students and for after-action reviews conducted by instructors.

Figure 5 below is an example of a criterion-referenced report used in scoring student performance in a shipboard damage control simulation developed for the Navy. The three pie charts correspond to the phases of fighting a fire casualty aboard ship—namely, Red = “Size it up,” Yellow = “Fight it,” and Green = “Monitor it.” Each phase involves carrying out specific tasks. (Note: The colors red, yellow, and green here do not correspond to achievement; they are merely Navy-used terminology to describe the phases of firefighting.) Scoring for how well these tasks were performed is shown in the bar charts on the right, with each task being scored using a rubric with a 0 to 100 scale. For example, in the simulation, the student outfitted his team with the correct PPE, so the PPE score was 100%. But for Electrical Isolation (i.e., cutting electrical power to the room that is on fire), the action was taken too late and endangered the ship, so the score for this task was only 50%. From these individual scores, the Bayesian network then infers the student’s proficiency with that phase of operations overall, which is expressed as a percentage in the pie chart. In this case, the Bayesian network infers only a 60% chance that the student has full proficiency in carrying out the operations that comprise the Red Phase of firefighting operations aboard ship.
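A simplified version of this kind of rubric roll-up is sketched below. The task names echo the Figure 5 example, but the third task score, the weights, and the use of a weighted average in place of the Bayesian inference are assumptions made for illustration.

```python
# Sketch of a criterion-referenced rubric roll-up: each task in a phase is scored 0-100
# against preestablished criteria, then summarized per phase. Values are illustrative.

red_phase_rubric = {
    # task: (score 0-100, weight)
    "PPE": (100, 1.0),                  # correct protective equipment selected
    "Electrical Isolation": (50, 1.0),  # action taken, but late enough to endanger the ship
    "Firefighting Equipment": (80, 1.0),
}

def phase_score(rubric):
    """Weighted mean task score for one phase, on a 0-100 scale."""
    total_weight = sum(w for _, w in rubric.values())
    return sum(score * w for score, w in rubric.values()) / total_weight

print(f"Red phase ('Size it up') task average: {phase_score(red_phase_rubric):.0f}%")
# In the deployed system the roll-up is probabilistic: the Bayesian network converts task
# scores into a probability of full phase proficiency (60% in the Figure 5 example),
# rather than the simple weighted average sketched here.
```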

Growth Modeling Reports

Growth modeling looks at how individuals change over time and whether there are differences in patterns of change.55 With growth modeling, the goal is to track student performance over time across a consistent set of measures. This can occur both within a single simulation and across multiple instances of the student engaging with that same simulation. When coupled with criterion-referenced performance reports, growth modeling helps to form a complete picture of the student’s capabilities within the domain being assessed. Figure 6 illustrates this with student performance data taken from the Navy damage control simulation. It shows the student’s achievement in keeping overall fire size (i.e., Casualty Strength) at or below a predefined safety threshold when combating class “bravo” fires (such as a grease or petroleum fire) across six different game play sessions (each 30–60 minutes).

FIGURE 6. Example: Growth modeling report across six different sessions showing the maximum fire size (i.e., casualty strength, expressed in arbitrary units) the simulated shipboard fire grew to before the student could get it under control and extinguish it.
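The bookkeeping behind such a report can be quite simple. The sketch below logs a per-session maximum casualty strength and compares it with a safety threshold; the threshold value and the session data are hypothetical, chosen only to mirror the pattern described for Figure 6.

```python
# Sketch of the growth-modeling bookkeeping behind a report like Figure 6: the maximum
# casualty strength reached in each session is logged and compared against a safety
# threshold. Values are illustrative placeholders in arbitrary units.

SAFETY_THRESHOLD = 60

max_casualty_strength = {1: 70, 2: 55, 3: 65, 4: 90, 5: 48, 6: 40}  # by session

for run, strength in sorted(max_casualty_strength.items()):
    status = "below threshold" if strength <= SAFETY_THRESHOLD else "exceeded threshold"
    print(f"run {run}: max strength {strength} ({status})")

below = [run for run, s in max_casualty_strength.items() if s <= SAFETY_THRESHOLD]
print("runs that stayed at or below the threshold:", below)  # e.g., [2, 5, 6]
```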

Here, the lower the Casualty Strength, the better, as it means the fire is smaller in size. The goal for the student is to ensure the Casualty Strength does not exceed the safety threshold. In the fourth simulation run, the fire reached a strength of 90 before the student was able to extinguish it. But in the fifth run, he improved his tactics and was able to extinguish the fire before it exceeded a strength of 50. Here, Runs 2, 5, and 6 are best because in each the fire strength remained below the safety threshold.

SUMMARY

Computer-based instructional simulations are becoming more and more ubiquitous, particularly in military and medical domains. As the technology that drives these simulations grows ever more sophisticated, the underlying pedagogical models for how they are designed must evolve accordingly. In this article, we have highlighted some of these key design considerations, namely learning goals and objectives, cognitive demands, simulation affordances, and instructional and assessment strategies. We then presented a CRESST Assessment Methodology that delineates how these elements can be represented in a domain ontology, and how this completed ontology can facilitate the development of a Bayesian network-based assessment engine. The goal of the Bayesian network is to model expert thinking when assessing student performance in the simulator. It can be connected to the simulation to perform real-time formative assessments, or it can be used post hoc to carry out summative assessments of performance. The output of such assessments can be exhibited in various reporting and debriefing instruments, including criterion-referenced and growth-modeling summary reports. Although the methodologies outlined here have proven useful in CRESST’s automated assessment work, there is still much research to be done. Our hope is that the approaches


Our hope is that the approaches outlined here will inspire new ideas for how instruction, simulation, and assessment can be blended, and will foster an ongoing dialog among the educational and simulation development communities.

ACKNOWLEDGMENTS

The work reported herein was supported by grant number N00014-08-C-0563 with funding to the National Center for Research on Evaluation, Standards, and Student Testing (CRESST) by the Office of Naval Research. The work reported herein was also partially supported by a grant from the Office of Naval Research, Award Number N00014-10-1-0978.

REFERENCES 1. Issenberg SB, McGaghie WC, Hart IR, et al: Simulation technology for health care professional skills training and assessment. JAMA 1999; 282(9): 861–6. 2. Nehring WM, Lashley FR: Nursing simulation: a review of the past 40 years. Simul Gaming 2009; 40(4): 528–52. 3. Alverson DC, Saiki SM Jr., Kalishman S, et al: Medical students learn over distance using virtual reality simulation. Simul Healthc 2008; 3: 10–5. 4. Dunne JR, McDonald CL: Pulse!!: a model for research and development of virtual-reality learning in military medication education and training. Mil Med 2010; 175: 25–7. 5. Passiment M, Sacks H, Huang G: Medical Simulation in Medical Education: Results of an AAMC Survey. Washington, DC, Association of American Medical Colleges, 2011. 6. Burke CS, Salas E, Wilson-Donnelly K, Priest H: How to turn a team of experts into an expert medical team: guidance from the aviation and military communities. Qual Saf Health Care 2004; 13(Suppl 1): i96–104. 7. Fadde PJ: Instructional design for advanced learners: training recognition skills to hasten experience. Educ Technol Res Dev 2009; 57(3): 359–76. doi:10.1007/s11423-007-9046-5. 8. Baker EL: Moving to the Next Generation System Design: Integrating Cognition, Assessment, and Learning (CRESST Report 706). Los Angeles, CA, University of California, National Center for Research on Evaluation, Standards, and Student Testing (CRESST), 2007. Available at http://www.cse.ucla.edu/products/reports/R706.pdf; accessed May 7, 2013. 9. Phelan J, Choi K, Vendlinski T, Baker EL, Herman JL: The Effects of POWERSOURCE© Intervention on Student Understanding of Basic Mathematical Principles (CRESST Report 763). Los Angeles, CA, University of California, National Center for Research on Evaluation, Standards, and Student Testing (CRESST), 2009. Available at http://www .cse.ucla.edu/products/reports/R763.pdf; accessed May 7, 2013. 10. US Navy: Navy ILE Learning Objective Statements: Specifications and Guidance, 2006. Available at http://ieeeltsc.files.wordpress.com/2009/ 03/navy-ile-los_20060317.pdf; accessed May 7, 2013. 11. Gibson JJ: The Ecological Approach to Visual Perception. London, Houghton Mifflin, 1979. 12. Kneebone R: Evaluating clinical simulations for learning procedural skills: a theory-based approach. Acad Med 2005; 80(6): 549–53. 13. Billings DR: Efficacy of adaptive feedback strategies in simulationbased training. Mil Psychol 2012; 24(2): 114–33. 14. McGaghie WC, Issenberg SB, Petrusa ER, Scalese RJ: A critical review of simulation-based medical education research: 2003–2009. Med Educ 2010; 44: 50–63. 15. Issenberg SB, McGaghie WC, Petrusa ER, Gordon DL, Scalese RJ: Features and uses of high-fidelity medical simulations that lead to effective learning: a BEME systematic review. Med Teach 2005; 27(1): 10–28.


16. Rall M, Manser T, Howard S: Key elements of debriefing for simulator training. Eur J Anaesthesiol 2000; 17: 516–7. 17. McGaghie WC, Issenberg SB, Petrusa ER, Scalese RJ: Effect of practice on standardized learning outcomes in simulation-based medical education. Med Educ 2006; 40: 792–7. 18. Anderson JM, Aylor ME, Leonard DT: Instructional design dogma: creating planned learning experiences in simulation. J Crit Care 2008; 23: 595–602. 19. Adler MD, Vozenilek JA, Trainor JL, et al: Comparison of checklist and anchored global rating instruments for performance rating of simulated pediatric emergencies. Simul Healthc 2011; 6: 18–24. 20. Reed SJ: Designing a simulation for student evaluation using Scriven’s Key Evaluation Checklist. Clin SimulNurs 2010; 6: e41–4. 21. Cook DA, Hatala R, Brydges R, et al: Technology-enhanced simulation for health professions education: a systematic review and meta-analysis. JAMA 2011; 306(9): 978–88. 22. Cohen J: A power primer. Psychol Bull 1992; 112(1): 155–9. 23. Wainess R, Koenig AD: Validation of a Methodology for Design and Development of a Game for Learning With a Multigroup Development Process. Presentation at the 2010 annual meeting of the American Educational Research Association, Denver, CO, 2010. Available at https:// www.cse.ucla.edu/products/overheads/AERA2010/Wainess.AERA2010 .pdf; accessed May 9, 2013. 24. Rothwell WJ, Kazanas HC: Mastering the Instructional Design Process: A Systematic Approach, Ed 4. Hoboken, NJ, John Wiley & Sons, 2008. 25. Sitzmann T: A meta-analytic examination of the instructional effectiveness of computer-based simulation games. Pers Psychol 2011; 64(2): 489–528. 26. Fisch SM: Making educational computer games “educational”, pp 56–61. In: Proceedings of the 2005 Conference on Interaction Design and Children, Boulder, Colorado, 2005. Available from http://dl.acm.org/citation .cfm?id=1109548&bnc=1; accessed May 7, 2013. 27. Ke F: A qualitative meta-analysis of computer games as learning tools. In: Handbook of Research on Effective Electronic Gaming in Education. Vol. 1, pp 1–32. Edited by Ferdig RE. Hershey, PA, Information Science Reference, 2009. 28. Kirriemuir J, McFarlane A: Literature Review in Games and Learning. Report 8. Bristol, UK, Futurelab, 2004. 29. Leemkuil H, de Jong T: Chapter 13: Instructional support in games. In: Computer Games and Instruction, pp 353–66. Edited by Tobias S, Fletcher JD. Charlotte, NC, Information Age Publishing, 2011. 30. Pavlas D, Bedwell W, Wooten SR II, Heyne K, Salas E: Investigating the attributes in serious games that contribute to learning. Proc Hum Fact Ergon Soc Annu Meet 2009; 53(27): 1999–2003. 31. Becker K: Design paradox: instructional games. Paper presented at the Future Play. The International Conference on the Future of Game Design and Technology, The University of Western Ontario, London, Ontario, Canada, 2006. 32. Egenfeldt-Nielsen S: Thoughts on Learning in Games and Designing Educational Computer Games, 2006. Available at http://game-research .com/?page_id=78; accessed February 27, 2009. 33. van Merrie¨nboer JJG, Sweller J: Cognitive load theory in health professional education: design principles and strategies. Med Educ 2010; 44: 85–93. 34. de Jong T: Cognitive load theory, educational research, and instructional design: some food for thought. Instr Sci 2010; 38: 105–34. 35. Al Asraj A, Freeman M, Chandler PA: Considering Cognitive Load Theory Within e-Learning Environments, pp 1–13. PACIS 2011 Proceedings. 
Queensland, Australia, Queensland University of Technology, 2011. 36. Van Gog T, Paas F, Sweller J: Cognitive load theory: advances in research on worked examples, animations, and cognitive load measurement. Educ Psychol Rev 2010; 22: 375–8. 37. Brunken R, Plass JL, Leutner D: Direct measurement of cognitive load in multimedia learning. Educ Psychol 2003; 38(1): 53–61.


38. Paas F, Renkl A, Sweller J: Cognitive load theory and instructional design: recent developments. Educ Psychol 2003; 38(1): 1–4.
39. Renkl A, Atkinson RK: Structuring the transition from example study to problem solving in cognitive skill acquisition: a cognitive load perspective. Educ Psychol 2003; 38(1): 15–22.
40. Ayres P: Using subjective measures to detect variations of intrinsic cognitive load within problems. Learn Instr 2006; 16: 389–400.
41. Guarino N, Oberle D, Staab S: What is an Ontology? In: Handbook on Ontologies, pp 1–17. Edited by Staab S, Studer R. Berlin, Springer, 2009.
42. Baker EL: Model-Based Performance Assessment. CRESST Report 465. Los Angeles, CA, University of California, National Center for Research on Evaluation, Standards, and Student Testing (CRESST), 1998. Available at http://www.cse.ucla.edu/products/reports/TECH465.pdf; accessed May 7, 2013.
43. Chung GKWK, Baker EL, Delacruz GC, Elmore JJ, Bewley WL, Seely B: An Architecture for a Problem-Solving Assessment Authoring and Delivery System (Deliverable to the Office of Naval Research). Los Angeles, CA, University of California, CRESST, 2006. Available at http://www.dtic.mil/cgi-bin/GetTRDoc?AD=ADA450110; accessed May 7, 2013.
44. Vendlinski TP, Baker EL, Niemi D: Templates and Objects in Authoring Problem-Solving Assessments. CRESST Report 735. Los Angeles, CA, University of California, National Center for Research on Evaluation, Standards, and Student Testing (CRESST), 2008. Available at http://www.cse.ucla.edu/products/reports/R735.pdf; accessed May 7, 2013.
45. Koenig AD, Lee J, Iseli MR, Wainess R: A Conceptual Framework for Assessing Performance in Games and Simulations. Proceedings of the Interservice/Industry Training, Simulation and Education Conference, Orlando, FL, 2009. Available at http://ntsa.metapress.com/link.asp?id=v6kv837200610221; accessed May 7, 2013.
46. Endsley MR: Toward a theory of situation awareness in dynamic systems. Hum Factors 1995; 37(1): 32–64.
47. Newell A, Simon HA: Human Problem Solving. Englewood Cliffs, NJ, Prentice-Hall, 1972.
48. Simon HA, Dantzig GB, Hogarth R, et al: Decision Making and Problem Solving. Washington, DC, National Academy Press, 1986.
49. Jonassen DH: Toward a design theory of problem solving. Educ Technol Res Dev 2000; 48(4): 63–85.
50. Jensen FV, Nielsen TD: Bayesian Networks and Decision Graphs. New York, Springer, 2007.
51. Iseli MR, Koenig AD, Lee JJ, Wainess RA: Automated Assessment of Complex Task Performance in Games and Simulations. Proceedings of the Interservice/Industry Training, Simulation and Education Conference, Orlando, FL, 2010. Available at http://ntsa.metapress.com/link.asp?id=c223k1n77031365r; accessed May 7, 2013.
52. Black PJ, Wiliam D: Developing the theory of formative assessment. Educ Assess, Eval Accountability 2009; 21(1): 5–31.
53. Taras M: Assessment—summative and formative—some theoretical reflections. Br J Educ Stud 2005; 53(4): 466–78.
54. Linn RL, Gronlund NE: Measurement and Assessment in Teaching. Ed 8. Upper Saddle River, NJ, Prentice Hall, 2000.
55. Bliese PD, Ployhart RE: Growth modeling using random coefficient models: model building, testing, and illustrations. Organ Res Methods 2002; 5(4): 362–87.


MILITARY MEDICINE, 178, 10:55, 2013

Application of National Testing Standards to Simulation-Based Assessments of Clinical Palpation Skills

Carla M. Pugh, MD, PhD

ABSTRACT With the advent of simulation technology, several types of data acquisition methods have been used to capture hands-on clinical performance. Motion sensors, pressure sensors, and tool-tip interaction software are a few of the broad categories of approaches that have been used in simulation-based assessments. The purpose of this article is to present a focused review of 3 sensor-enabled simulations that are currently being used for patient-centered assessments of clinical palpation skills. The first part of this article provides a review of technology components, capabilities, and metrics. The second part provides a detailed discussion regarding validity evidence and implications using the Standards for Educational and Psychological Testing as an organizational and evaluative framework. Special considerations are given to content domain and creation of clinical scenarios from a developer’s perspective. The broader relationship of this work to the science of touch is also considered.

Department of Surgery, University of Wisconsin, 600 Highland Avenue—CSC 785B, Madison, WI 53792. The findings and opinions expressed here do not necessarily reflect the positions or policies of the Office of Naval Research. doi: 10.7205/MILMED-D-13-00215

INTRODUCTION

With the advent of simulation technology, several types of data acquisition technologies have been used to capture hands-on performance.1–5 Motion sensors, pressure sensors, and tool-tip interaction software are a few of the broad categories of technologies that have been used in simulation-based assessments.6–10 One major distinction among the technologies used to capture hands-on performance is the location of the sensors, and hence the type of data that is collected. Some of the simulation systems have the sensors on a tool, surgical instrument, or data glove. In this instance, the information captured helps to quantify hand or instrument positions during task execution. For example, when using a data glove, the associated metrics provide quantitative measures of various hand and individual finger positions throughout the task.11–14 Similarly, instrumented surgical tools allow information capture regarding the position of the surgical instrument during a procedure.1–3 In both instances, the data captured are clinician centered and focus on the motor movements and positioning of the clinician. In this article, we will focus on sensor-enabled training tools in which the sensors are not on the clinician’s hand or instrument but on the patient. In this instance, the information captured using this type of technology helps to quantify human interaction with specific anatomical structures. For example, when using a simulated patient with sensor-enabled organs, the associated metrics provide quantitative measures regarding patient contact, including anatomical location and quality of clinical palpation during a physical examination.15–17 Moreover, using a simulated patient with sensor-enabled organs during a surgical procedure will allow data collection regarding instrument interaction with the patient’s


anatomical structures.18,19 The resulting data in both scenarios are patient centered and provide detailed information regarding direct (hands) or indirect (instruments) patient contact. As most procedures and physical examinations require hands-on contact with specific anatomical locations, patient-centered devices allow for assessment of a variety of errors including incomplete examinations, missed lesions, or use of excessive force. The purpose of this article is to present a focused review of three sensor-enabled simulations that are currently being developed for patient-centered assessments of clinical palpation skills. The Standards for Educational and Psychological Testing will be used as a framework to structure our review.20

METHODS

Testing Standards

The Standards for Educational and Psychological Testing is a set of testing standards developed jointly by the American Educational Research Association, American Psychological Association, and the National Council on Measurement in Education. The standards address professional and technical issues of test development and use in education, psychology, and employment. The intent is to promote the sound and ethical use of tests and to provide a basis for evaluating the quality of testing practices. While evaluation of the appropriateness of a test or testing application should depend heavily on professional judgment, the standards provide a frame of reference to assure that relevant issues are addressed. The standards apply equally to standardized multiple-choice tests and performance assessments. For the purpose of this article, our sensor-enabled, simulation-based assessments are considered to be a type of performance-based assessment: evaluation of performance during tasks that are valued in their own right.

Sensor-Enabled Simulation Technology

The focus of this article is on performance assessments using sensor-enabled mannequin technologies. Development and


implementation of this technology began in 1998.21 The initial clinical focus was physical examination of the female pelvis, female breast, and male digital rectal examination. All three systems include a partial mannequin: umbilicus to mid-thigh for the pelvic and digital rectal examinations, and left or right chest wall for the breast models. In addition, the mannequins have interchangeable parts that enable simulation of different clinical presentations, both normal and abnormal scenarios. For example, the pelvic examination inserts range from normal anteverted and normal retroverted uterine positions to enlargements of the uterus or ovary. Paper-thin, 2 mm force-sensing resistors (sensors) are connected to important anatomical structures on the mannequin and embedded organs.22 Data acquisition systems enable the sensors to be sampled at a specified sampling rate. For these clinical examinations, the sampling rate is set at 30 Hz. Figure 1 shows the pelvic exam simulator including computer interface and mannequin. The computer interface shows that the user is touching the cervical os at six pressure units (1 PU = 0.125 psi). Consequently, the corresponding register bar rises to a level of six, the indicator button in the cartoon diagram turns blue, and a check mark appears in the “Exam Checklist” window. During a simulated pelvic exam, this interface enables students and instructors to see where the examiner is touching and how much pressure is being used. The prostate and breast models have similar computer interfaces. Figures 2A and 2B show line graph representations of a pelvic examination plotted as pressure over time. Each line represents a different anatomical area. When reviewing performance using the line graphs, the examiner’s touch can be tracked by anatomical location and several palpation characteristics including the level and type of pressure. In Figure 2A, the examining medical student applies several bursts of pressure to the fundus of the uterus. This is represented by a series of narrow spikes (see black arrow) in the 8–10 PU range. Simultaneously, there is constant pressure on the left posterior cervix (L-post) at a level of 6 PUs. These data quantify bimanual examination of the uterine fundus: one hand explores the fundus while the other hand applies counterpressure to the cervix to lift the uterus toward the abdominal wall and facilitate palpation. The combination of pressures used (spikes plus constant pressure) provides detail on the examiner’s approach, including anatomical locations explored (i.e., sensor locations) and palpation characteristics used (i.e., pressure levels and spikes plus constant pressure). To emphasize a range in performance, the medical student in Figure 2B applies a combination of pressures to the cervix in the 6–8 PU range. The first spike (600 time units) represents pressure applied to the cervical os. The second waveform (750–1166 time units) combines a series of spikes and constant pressure on the left posterior cervix. Starting around 1200 time units, the examination continues with several manipulations of the cervix in the 5–8 PU range and ends with two low-pressure (4 PU) spikes on the uterine sensors.

Compared to the student in Figure 2A, the student in Figure 2B spends a great deal of time applying pressure to several areas on the cervix; takes twice as long to accomplish bimanual examination of the uterine fundus; and uses much lower pressure and a smaller number of peaks when palpating the uterine fundus. While the second student was eventually able to accomplish the goal of bimanual examination, there are noticeable differences in performance when comparing the two students. These graphs show that specific physical examination events can be captured and quantified. Use of these data for performance-related decisions will be discussed using the standards as a guiding framework.

Metrics

While the line graph representations of the data provide a high-level, qualitative view of performance differences, the data must be converted to measurable variables in order to quantify individual differences. The most commonly used performance assessment variables extracted from the sensor data include (1) examination time, (2) number of sensors touched, (3) maximum pressure, and (4) frequency.21–27 The operational definitions of these variables are as follows:

Time Variable

The time variable is equivalent to the length of time necessary for an examiner to perform a complete examination. Mathematically, exam completion time was defined as the time at which the last sensor was touched minus the time at which the first sensor was touched. The exam was considered to have begun when the pressure on any given sensor reached 1 full pressure unit above baseline.21–27

Critical Areas Variable

The critical areas variable represents the number of sensors touched during the simulated clinical examinations. For the pelvic exam there were seven sensors: four on the cervix, one on the uterus, and one on each ovary. For the digital rectal exam there were also seven sensors on the prostate: three on the right lateral lobe, three on the left lateral lobe, and one in the median raphe. For the breast exam there were eleven sensors: approximately two to three sensors in each of the four quadrants of the breast.21–27

Maximum Pressure Variable

The maximum pressure variable represents the highest pressure reading recorded for a sensor during the simulated examination. For example, if the highest pressure readings recorded were seven PU for sensor no. 1 and ten PU for sensor no. 2, then for mathematical purposes the average maximum pressure would be the average of those two values, whereas the maximum pressure would be the individual highest pressure for each sensor.21–27


FIGURE 1. The pelvic examination simulator.

Frequency Variable

The frequency variable represents the number of times an individual sensor was touched near the maximum pressure during the examination. The mathematical formulation for creating this variable involved counting the number of times

the given sensor was sampled within 0.5 PU of the maximum for that sensor.21–27 In addition to extraction of these variables using MATLAB code (MathWorks, Natick, MA), specific data mining techniques have been applied to the sensor data to gain an understanding of their quality and usefulness, including Markov models19,23–25 and visual analytics.26
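As a rough illustration of these operational definitions, the sketch below computes the four variables from a hypothetical sensors-by-samples matrix of baseline-corrected pressure readings sampled at the 30 Hz rate described earlier; the array layout, thresholds applied, and variable names are assumptions for illustration, not the authors' MATLAB implementation.

```python
import numpy as np

SAMPLE_RATE_HZ = 30.0   # sampling rate reported for these examinations
EXAM_START_PU = 1.0     # exam begins when any sensor exceeds 1 PU above baseline
FREQ_WINDOW_PU = 0.5    # samples within 0.5 PU of a sensor's maximum count toward frequency

def palpation_metrics(pressure_pu: np.ndarray) -> dict:
    """pressure_pu: array of shape (n_sensors, n_samples), baseline-corrected pressure units."""
    touched = pressure_pu >= EXAM_START_PU           # boolean: sensor touched at each sample
    touch_times = np.where(touched.any(axis=0))[0]   # sample indices where any sensor is touched

    exam_time_s = 0.0
    if touch_times.size:
        exam_time_s = (touch_times[-1] - touch_times[0]) / SAMPLE_RATE_HZ

    per_sensor_max = pressure_pu.max(axis=1)
    near_max = pressure_pu >= (per_sensor_max[:, None] - FREQ_WINDOW_PU)

    return {
        "exam_time_s": exam_time_s,                        # time variable
        "critical_areas": int(touched.any(axis=1).sum()),  # number of sensors touched
        "max_pressure_pu": per_sensor_max,                 # per-sensor maximum pressure
        "mean_max_pressure_pu": float(per_sensor_max.mean()),
        "frequency": near_max.sum(axis=1),                 # samples near each sensor's maximum
    }

# Hypothetical 3-sensor, 5-second recording.
rng = np.random.default_rng(0)
demo = np.clip(rng.normal(2.0, 2.5, size=(3, 150)), 0.0, None)
print(palpation_metrics(demo))
```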


FIGURE 2. Line graph representation of the sensor-generated performance data collected during a pelvic examination. (A) The arrow points to several high-pressure palpations of the uterus. (B) The arrows point to high-pressure palpation of the right and left cervix.

Application of National Testing Standards

Use of the sensor-generated data for performance-related decisions requires a standardized approach to ensure validity, reliability, fairness, and appropriate use.20 The Standards provide a three-part guide for evaluating a wide variety of assessments. The following sections present a detailed review of our past work. The Standards are used as a guide to structure our discussion. All of the studies reviewed in this article were performed after approval by the local Institutional Review Board.

Standards Part I—Test Construction, Evaluation, and Documentation

When designing the simulation-based assessments, test construction required a review of all elements that may be used to gather data for evaluation purposes. The two main elements included (1) the written clinical assessment form and (2) the computer-generated sensor data. Correct answers for the written clinical assessment vary according to the clinical scenario (normal or pathologic variation) represented by the simulation. Correct answers for the computer-generated variables are also closely linked to the clinical presentation. While there is ongoing work to determine the best construction

and administration for our simulation-based assessments, for the purpose of this article each clinical scenario represents a specific content domain, hence a major section within a test. Most administrations using the clinical simulations involve at least two different clinical scenarios. Opportunities for partial credit exist for both the written and the sensor data components. The following sections provide an overview of the various studies we have performed to guide the process of test construction and administration. Validity, reliability, and measurement error are important to consider during test construction. Validity was initially assessed by evaluating a basic content construct—does this technology capture any performance measures of interest? To answer this question we used a sensor-enabled pelvic examination simulator with second-year medical students (N = 73). In addition to collecting sensor-generated performance data, we assessed diagnostic accuracy rates using participants’ written clinical assessments of two clinically different pelvic models.27 Using a 2-tailed Pearson’s correlation, we found that three of the four sensor variables were significantly associated with participants’ ability to generate an accurate clinical assessment of the simulator after performing an examination. The highest correlation was for the “number of critical areas touched” during an examination (r = 0.311, p = 0.007). The second highest correlation was for the “mean maximum pressure” used during the examination (r = 0.279, p = 0.017). Finally, the last variable with a significant correlation was


“mean frequency” used during the examination (r = 0.267, p = 0.022). While the correlation values for these three variables are moderate to low, they indicate that there is a relationship between the simulator variables; accuracy on the written assessments; and the use of direct, hands-on contact during an examination. In essence, these three variables capture aspects of the clinical pelvic examination that are important in achieving an accurate diagnosis. Reliability of the four variables in this setting was as follows: (1) time, r = 0.72; (2) critical areas, r = 0.63; (3) mean maximum pressure, r = 0.77; and (4) mean frequency, r = 0.50. While the time variable did not have a significant correlation with diagnostic accuracy in this setting, it appeared to have moderate to high reliability and we continued to evaluate this variable in other settings (i.e., different participants and clinical scenarios). Validity was also assessed by evaluating the ability to use the simulator data to differentiate between experience levels. When assessing experience level using the pelvic simulators, we compared medical student (N = 43) performance with that of experienced clinicians (N = 20).28 For the written assessment, mean examination scores showed medical students were less accurate than clinicians (students = 10.18/18, clinicians = 15.60/18.0, p < 0.001). There were also differences noted in examination techniques. Students were noted to spend more time on the exam (students = 82.01 seconds, clinicians = 31.07 seconds, p < 0.001) and used greater palpation frequencies when examining each area (students = 42.45 Hz, clinicians = 20.30 Hz, p < 0.005).28 When evaluating experience level using the digital rectal examination simulator, we conducted a study involving surgical residents (N = 24) and medical students (N = 30).16 Participants were grouped according to the number of prior digital rectal examinations performed. Group 1 (N = 27) was the less experienced group, having performed five or fewer previous examinations. Group 2 (N = 27) had performed six or more rectal examinations. Each participant examined two different simulators: Simulator A (easy diagnosis—normal rectum and a firm 2 mm prostate nodule) and Simulator B (difficult diagnosis—enlarged prostate + a subtle 3 cm rectal mass). When comparing technical performance on Simulator A, the more experienced group (G2) was noted to spend more time on the examination (G2 = 12.34 seconds, G1 = 7.22 seconds, p < 0.01). For this simulator (Simulator A—easy diagnosis), there were no significant differences in accuracy. The less experienced group had an accuracy rate of 80% and the more experienced group had an accuracy rate of 85%. In contrast, for the more difficult simulator (Simulator B) there were significant differences in technical performance and accuracy. When comparing performance on Simulator B, the more experienced group (G2) spent more time on the exam (G2 = 17.52 seconds, G1 = 11.94 seconds, p < 0.05) and was more accurate in its assessment and documentation of the prostate findings (G2 = 64% accurate, G1 = 33% accurate, p < 0.05).16
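A minimal sketch of the kinds of checks reported above, correlating a sensor-derived variable with written-assessment accuracy and comparing groups on a technique variable, might look like the following; the data arrays are hypothetical placeholders rather than the published study data.

```python
import numpy as np
from scipy import stats

# Hypothetical per-examinee values (placeholders, not the study data).
critical_areas = np.array([4, 6, 7, 5, 7, 3, 6, 5, 7, 4])        # sensors touched
written_score = np.array([9, 12, 15, 11, 16, 8, 13, 10, 17, 9])  # written assessment score

# Validity check: does a sensor-derived variable relate to diagnostic accuracy?
r, p = stats.pearsonr(critical_areas, written_score)
print(f"critical areas vs. written score: r = {r:.3f}, p = {p:.3f}")

# Group comparison (e.g., novices vs. experienced clinicians) on exam time.
novice_time_s = np.array([82.0, 75.5, 90.2, 66.3, 88.1])
expert_time_s = np.array([31.1, 28.4, 40.0, 25.7, 35.2])
t, p = stats.ttest_ind(novice_time_s, expert_time_s)
print(f"exam time, novices vs. experts: t = {t:.2f}, p = {p:.4f}")
```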

Validity was also assessed by evaluating the ability to use the simulator data to differentiate between clinical specialties and gender. When assessing clinical specialty using the breast examination simulators, we compared the performance of four clinical groups: (1) surgeons (N = 37), (2) nonsurgical MDs (N = 36), (3) nurses (N = 12), and (4) medical assistants (N = 15) on three simulators: (1) Simulator A—dense breast, 2 cm hard mass; (2) Simulator B—fatty breast, no masses; and (3) Simulator C—dense breast with right upper quadrant thickening.15 When assessing overall approach to the breast examination, nurses were noted, on average, to palpate more anatomical areas during the examination (nurses = 9.4/11 areas, surgeons = 7.38/11 areas, nonsurgeons = 6.22/11 areas, and medical assistants = 6.71/11 areas, p < 0.01). In addition, there was a trend for the nurses to spend more time on the breast examination. However, this finding was only significant for one of the three models—Simulator A (nurses = 58.57 seconds, surgeons = 40.94 seconds, nonsurgeons = 32.33 seconds, and medical assistants = 39.65 seconds, p < 0.05). Despite these observed differences in technical approach, and differences in mean accuracy rates for each clinical scenario (A = 87%, B = 76%, and C = 68%), there were no significant differences in accuracy when comparing the four specialties on the three breast models. Although there were differences in hands-on contact and approach among the specialties in this study, there was no difference in accuracy. As such, the three clinical scenarios and simulator variables are not a valid tool for assessing important differences in surgical, nonsurgical, and nursing specialties. When assessing gender, we compared male (N = 38) and female (N = 57) performance on three breast simulators: (1) Simulator A—dense breast, 2 cm hard mass; (2) Simulator B—fatty breast, no masses; (3) Simulator C—dense breast with right upper quadrant thickening.15 The results showed that females, on average, spent more time (M = 42.09 seconds, F = 56.66 seconds, p < 0.05), touched more anatomical areas (M = 6.30/11 areas, F = 7.97/11 areas, p < 0.05), and used greater pressures (M = 4.82 mmHg, F = 5.21 mmHg, p < 0.05) when compared to male clinicians. While there was a trend toward females being less accurate (M = 83.7% correct, F = 72.4% correct), the difference was not statistically significant. Although there were differences in hands-on contact and approach when comparing males and females, there was no difference in accuracy. As such, Simulators A–C and the computer-generated simulator variables are not a valid tool for assessing gender-related accuracy differences. Moreover, it is possible that there are no gender-related differences. In summary, the simulator variables appear to capture data that correlates with hands-on performance during simulated clinical examinations. In addition, the variables show promising results in discriminating experience levels.16,28 When evaluating specialty and gender, there appear to be differences in technical approach but not overall accuracy. As such, the variables may not be useful in discriminating between specialties and gender. Moreover, there may not


be any important differences between specialty and gender except approach.

Standards Part II and III

Standards Part II and III deal with fairness and testing applications. Fairness issues include lack of bias, equitable treatment in the testing process, and equal opportunities to learn. As the simulations are physical models and the sensors quantify hands-on touch, biases in language and linguistics are limited. Test taker rights, such as access to test results and rights when reviewing testing irregularities, are issues that will need to be addressed before formal use of the simulations for performance-related decisions. Issues relating to testing applications largely address the general responsibilities of those who administer, interpret, and use test results. When using previously validated tests in different venues, test users must ensure that there is continued test validity and reliability in the new setting (Standard 11.19). Use of tests for psychological evaluation, educational assessment, employment-related decisions, and program evaluation are other areas that must be considered as part of test applications. These areas will be considered as part of our future work in evaluating the simulation-based assessments.

SPECIAL CONSIDERATIONS

Content Domain and Creation of Clinical Scenarios

There are inherent difficulties in simulating human body parts. These difficulties present several challenges to the use of simulation as an assessment tool. From a manufacturing standpoint, the developer’s goal is to find the right combination of materials and molds that, once fabricated, are the most realistic representation of human tissue possible. In essence, the goal is the best match for context and functionality.29,30 When building a breast model, for example, it may be possible to perfect the mold and achieve an extremely realistic look; the shape, the color, and the skin detail may all be perfect. However, after achieving this perfection, the materials may not feel or behave like real breast tissue. Likewise, a breast model may feel realistic to touch but be found deficient in achieving a realistic look.31 An additional challenge in using simulation as an assessment tool relates to human perception. For example, when manufacturing a breast model that represents a patient with an obvious breast mass, the expectation would be that most clinicians examining the model would detect the mass on palpation. However, diagnostic perception may be affected by the manufacturer’s materials as well as the clinician’s clinical skills.32 From a validity perspective, it is difficult to determine whether lower than expected accuracy rates represent a fabrication problem or reasonable variations in perception. The problem is further confounded when a developer desires to fabricate a clinical scenario that is less obvious. While the overall goal, from an assessment perspective, is to provide a range of easy and difficult test questions

(simulations), lower accuracy rates on the more difficult clinical scenarios must be evaluated from a validity perspective in a similar fashion and with the same rigor as multiple-choice questions.20 Defining this process for simulation is imperative to the success and applicability of simulation-based assessments.

The Science of Touch

In health care, despite the many technological advances, human touch remains important for many diagnostic and therapeutic interventions. Unfortunately, without performance measures, objective and formative feedback to health care trainees and practitioners is nearly impossible. As such, there are no standards, and health care professionals continue to graduate and become credentialed to practice medicine without any real measure of their ability to perform hands-on procedures or make sound clinical judgments using palpation. Part of the problem lies in the complexity of touch (palpation). Despite its importance, the human sense of touch is poorly understood and understudied. Touch is extremely difficult to convey verbally, and there are no objective means of explaining one’s own experience or perception based on the sense of touch.32,33 Over 20 years of extensive research on the sense of touch reveals a set of specific hand maneuvers that humans use to detect object characteristics.32,34,35 These hand maneuvers, called exploratory procedures, are stereotyped movement patterns with certain characteristics that are largely subconscious and reproducible in a variety of settings. Key findings from this work show that (1) human beings are very good at recognizing common objects on the basis of touch alone; (2) object recognition is strongly based on specific object characteristics including texture, hardness, shape, temperature, and weight; and (3) during object recognition, specific hand maneuvers are used to detect object characteristics. Our work using sensor-enabled, patient-centered simulations to assess palpation skills in clinical medicine has shown promising results regarding the relationship between sensor outputs and specific exploratory maneuvers used during palpation. When using force-sensing resistors (FSRs) on anatomical models that simulate common medical examinations (breast, pelvic, and digital rectal examinations), we found that specific palpation maneuvers were detectable in our data. Figures 3A–3F show laboratory-generated waveforms for specific palpation maneuvers.36 Laboratory participants were asked to perform the following maneuvers for a minimum of 4 seconds on a sensored plate: (1) balloting (multiple, vertical bursts of firm pressure), (2) circular motion (rubbing counterclockwise), (3) constant pressure, and (4) rubbing (firm back-and-forth pressure across a vertical line). While the resulting waveforms show similar patterns for rubbing and circular maneuvers, balloting in multiple areas is distinguishable from circular motion in multiple areas and from constant pressure. Linking the waveforms to specific exploratory maneuvers will facilitate our understanding of palpation characteristics


FIGURE 3. Waveforms extracted from the sensor data during specific palpation maneuvers: (A, B) Balloting; (C, D) Circular pressure movements; (E) Constant pressure; and (F) Rubbing.

during physical examination. For example, as shown in Figure 2A, we now understand that the examinee was using the balloting maneuver when examining the fundus. Our future work will continue to explore the sensor data using

Klatzky and Lederman’s classification of palpation maneuvers. We believe this may help to generate a better understanding of how the sensor-generated data can be used in performance assessments.
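A crude, rule-based sketch of how such waveforms might be separated is shown below; the peak-counting heuristics, thresholds, and labels are illustrative assumptions and not the classification scheme used in our laboratory work.

```python
import numpy as np
from scipy.signal import find_peaks

SAMPLE_RATE_HZ = 30.0  # sensor sampling rate reported earlier in the article

def classify_maneuver(pressure_pu: np.ndarray) -> str:
    """Very rough heuristic labels for a single-sensor pressure trace (in PU).
    Thresholds below are illustrative, not validated cut points."""
    duration_s = pressure_pu.size / SAMPLE_RATE_HZ
    peaks, _ = find_peaks(pressure_pu, prominence=1.0)  # distinct pressure bursts
    peak_rate = len(peaks) / duration_s
    spread = pressure_pu.std()

    if spread < 0.5 and pressure_pu.mean() > 1.0:
        return "constant pressure"     # sustained, low-variability contact
    if peak_rate > 1.5:
        return "balloting"             # many separate vertical bursts
    if peak_rate > 0.3:
        return "rubbing or circular"   # slower oscillatory pattern
    return "unclassified"

# Example: a 4-second synthetic balloting trace (bursts roughly every half second).
t = np.arange(0, 4, 1 / SAMPLE_RATE_HZ)
balloting = 5.0 * np.maximum(0, np.sin(2 * np.pi * 2 * t)) ** 4
print(classify_maneuver(balloting))
```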


SUMMARY

The purpose of this article was to present a focused review of sensor-based assessments of palpation skills using patient-centered simulations. The technology components, capabilities, and metrics were reviewed. In our review of validity, we found that there were significant correlations between the sensor-generated performance variables and accuracy when reporting and documenting clinical findings. While the correlation values were moderate to low (r = 0.267–0.311), they warrant additional investigation of the relationship between hands-on performance and rate of accuracy during a clinical assessment. In our review of validity evidence, we focused on three constructs: (1) experience level, (2) clinical specialty, and (3) gender. The experience-level results were promising in discriminating between groups on the basis of technical approach and accuracy.16,28 The results for specialty and gender showed differences in technical approach but not overall accuracy. As such, the current variables and clinical scenarios are not valid discriminatory variables for these constructs. The use of palpation remains important for many diagnostic and therapeutic interventions in the health care environment. Development of performance metrics and assessments to ensure minimum performance standards is an important endeavor that should be closely guided by national standards.

ACKNOWLEDGMENTS

The following researchers have made significant contributions to this body of work: Jacob Rosen, Lawrence Salud, Jonathan Salud, Alec Peniche, Abby Kaye, and Brandon Andrew. This body of work has been funded by the following foundations and agencies: National Board of Medical Examiners (NBME) Stemmler Fund; Media X Grant, Stanford University; Augusta Webster Educational Innovation Grant, Northwestern University; Eleanor Wood-Prince Grants Initiative, Northwestern Memorial Hospital; National Cancer Institute Supplement Grant 3U01CA116875-03S1; The Baum Family Fund; and National Institutes of Health R01EB011524. The work reported herein was also partially supported by a grant from the Office of Naval Research, Award Number N00014-10-1-0978.

REFERENCES 1. Cano AM, Gaya´ F, Lamata P, Sa´nchez-Gonza´lez P, Gomez EJ: Laparoscopic tool tracking method for augmented reality surgical applications. Proc Biomedical Simulation 2008; 5104: 191–6. 2. Dosis A, Aggarwal R, Bello F, et al: Synchronized video and motion analysis for the assessment of procedures in the operating theater. Arch Surg 2005; 140: 293–9. 3. Chmarra MK, Bakker NH, Grimbergen CA, Dankelman J: TrEndo, a device for tracking minimally invasive surgical instruments in training setups. Sens Actuators A Phys 2006; 126: 328–34. 4. Rosen J, Brown JD, Barreca M, Chang L, Hannaford B, Sinanan M: The Blue DRAGON–a system for monitoring the kinematics and the dynamics of endoscopic tools in minimally invasive surgery for objective laparoscopic skill assessment. Stud Health Technol Inform 2002; 85: 412–8. 5. Pagador JB, Sa´nchez LF, Sa´nchez JA, Bustos P, Moreno J, Sa´nchezMargallo FM: Augmented reality haptic (ARH): an approach of electromagnetic tracking in minimally invasive surgery. Int J Comput Assist Radiol Surg 2011; 6: 257–63.


6. Bann SD, Khan MS, Darzi AW: Measurement of surgical dexterity using motion analysis of simple bench tasks. World J Surg 2003; 27: 390–4. 7. Datta V, Mackay S, Mandalia M, Darzi A: The use of electromagnetic motion tracking analysis to objectively measure open surgical skill in the laboratory-based model. J Am Coll Surg 2001; 193: 479–85. 8. Murphy TE, Vignes CM, Yuh DD, Okamura AM: Automatic motion recognition and skill evaluation for dynamic tasks. Proc EuroHaptics 2003; 363–73. 9. Oropesa I, Sa´nchez-Gonza´lez P, Cano AM, Lamata P, Sa´nchezMargallo FM, Go´mez EJ: Objective evaluation methodology for surgical motor skills assessment. Minim Invasive Ther Allied Technol 2010; 10: 55–6. 10. Leong JJH, Nicolaou M, Atallah L, Mylonas GP, Darzi AW, Yang GZ: HMM assessment of quality of movement trajectory in laparoscopic surgery. Comput Aided Surg 2007; 12: 335–46. 11. Dipietro L, Sabatini AM, Dario P: Evaluation of an instrumented glove for hand-movement acquisition. J Rehabil Res Dev 2003; 40(2): 179–89. 12. Cook JR, Baker NA, Cham R, Hale E, Redfern MS: Measurements of wrist and finger postures: a comparison of goniometric and motion capture techniques. J Appl Biomech 2007; 23(1): 70–8. 13. Gentner R, Classen J: Development and evaluation of a low-cost sensor glove for assessment of human finger movements in neurophysiological settings. J Neurosci Methods 2009; 178(1): 138–47. 14. Gu¨lke J, Wachter NJ, Geyer T, Scho¨ll H, Apic G, Mentzel M: Motion coordination patterns during cylinder grip analyzed with a sensor glove. J Hand Surg Am 2010; 35(5): 797–806. 15. Pugh CM, Domont ZB, Salud LH, Blossfield KM: A simulation-based assessment of clinical breast examination technique: do patient and clinician factors affect clinician approach? Am J Surg 2008; 195(6): 874–80. 16. Balkissoon R, Blossfield-Iannitelli K, Salud L, Ford D, Pugh C: Lost in translation: unfolding medical students’ misconceptions of how to perform the clinical digital rectal examination. Am J Surg 2009; 197(4): 525–32. 17. Pugh CM, Rosen J: Qualitative and quantitative analysis of pressure sensor data acquired by the E-Pelvis simulator during simulated pelvic examinations. Stud Health Technol Inform 2002; 85: 376–9. 18. Rosen J, Hannaford B, Richards CG, Sinanan MN: Markov modeling of minimally invasive surgery based on tool/tissue interaction and force/ torque signatures for evaluating surgical skills. IEEE Trans Biomed Eng 2001; 48(5): 579–91. 19. Rosen J, Solazzo M, Hannaford B, Sinanan M: Objective laparoscopic skills assessments of surgical residents using hidden Markov models based on haptic information and tool/tissue interactions. Stud Health Technol Inform 2001; 81: 417–23. 20. American Educational Research Association (AERA), American Psychological Association (APA), and National Council for Measurement in Education (NCME): Standards for Educational and Psychological Testing. Washington, DC, American Educational Research Association, 1999. 21. Pugh CM: Evaluating Simulators for Medical Training: The Case of the Pelvic Exam Model. Ann Arbor, MI, ProQuest/Bell and Howell Dissertations Publishing, 2001. Available at http://disexpress.umi.com/ dxweb; accessed September 26, 2013. 22. Medical Examination Teaching System. U.S. Patent Number 6,428,323, 2002. Available at http://www.google.com/patents/US6428323; accessed May 7, 2013. 23. Mackel T, Rosen J, Pugh C: Application of hidden Markov modeling to objective medical skill evaluation. Stud Health Technol Inform 2007; 125: 316–8. 24. 
Mackel T, Rosen J, Pugh CM: Markov model assessment of subjects’ clinical skill using the E-Pelvis physical simulator. IEEE Trans Biomed Eng 2007; 54(12): 2133–41. 25. Mackel T, Rosen J, Pugh C: Data mining of the E-pelvis simulator database: a quest for a generalized algorithm for objectively assessing medical skill. Stud Health Technol Inform 2006; 119: 355–60.


26. Silverstein J, Selkov G, Salud L, Pugh C: Developing performance criteria for the e-Pelvis simulator using visual analysis. Stud Health Technol Inform 2007; 125: 436–8. 27. Pugh CM, Youngblood P: Development and validation of assessment measures for a newly developed physical examination simulator. J Am Med Inform Assoc 2002; 9(5): 448–60. 28. Pugh CM, Heinrichs WL, Dev P, Srivastava SS, Krummel T: Objective assessment of clinical skills with a simulator. JAMA 2001; 286(9): 1021–3. 29. Verschuren P, Hartog R: Evaluation in design-oriented research. Qual Quant 2005; 39(6): 733–62. 30. Kirschner P, Carr C, van Merriënboer J, Sloep P: How expert designers design. Perform Improv Quart 2002; 15(4): 86–104.


31. Salud LH, Ononye CI, Kwan C, Salud JC, Pugh CM: Clinical examination simulation: getting to real. Stud Health Technol Inform 2012; 173: 424–9. 32. Klatzky R, Lederman SJ: Tactile object perception and the perceptual stream. In: Albertazzi L, ed. Unfolding Perceptual Continua. Netherlands, John Benjamin Publishing Company, 2002. p. 147–62. 33. Minogue J, Jones MG: Haptics in education: exploring an untapped sensory modality. Rev Educ Res 2006; 76(3): 317–48. 34. Lederman SJ, Klatzky RL: Extracting object properties through haptic exploration. Acta Psychol (Amst) 1993; 84(1): 29–40. 35. Lederman SJ, Klatzky RL: Hand movements: a window into haptic object recognition. Cogn Psychol 1987; 19(3): 342–68. 36. Salud LH, Pugh CM: Use of sensor technology to explore the science of touch. Stud Health Technol Inform 2011; 163: 542–8.


MILITARY MEDICINE, 178, 10:64, 2013

Evaluation of Medical Simulations

William L. Bewley, PhD*; Harold F. O’Neil, PhD†

ABSTRACT Simulations hold great promise for medical education, but not all simulations are effective, and reviews of simulation-based medical education research indicate that most evaluations of the effectiveness of medical simulations have not been of sufficient technical quality to produce trustworthy results. This article discusses issues associated with the technical quality of evaluations and methods for achieving it in evaluations of the effectiveness of medical simulations. It begins with a discussion of the criteria for technical quality, and then discusses measures available for evaluating medical simulation, approaches to scoring simulation performance, and methodological approaches. It concludes with a summary and discussion of future directions in methods and technology for evaluating medical simulations.

*National Center for Research on Evaluation, Standards, and Student Testing (CRESST), University of California, Los Angeles, 10945 Le Conte Avenue, Suite 1400, Mailbox 957150, Los Angeles, CA 90095-7150. †Rossier School of Education/National Center for Research on Evaluation, Standards, and Student Testing (CRESST), University of Southern California, 15366 Longbow Drive, Sherman Oaks, CA 91403. The findings and opinions expressed here do not necessarily reflect the positions or policies of the Office of Naval Research. doi: 10.7205/MILMED-D-13-00255

INTRODUCTION

Since the first written clinical simulations were used for assessment nearly 50 years ago, simulations have become common in medical education.1 Defined broadly as a “person, device, or set of conditions which attempts to present evaluation problems authentically,”2 medical simulations emulate patients, anatomical areas, or clinical tasks. They include standardized patients,3–8 part-task trainers (e.g., pelvic replicas),9–14 virtual reality systems,15 computer simulations16,17 and games,18 mannequins,19–22 and even multiple-choice questions presenting information on a case to be evaluated.23 Simulations can be used for instruction or assessment, and are currently used by many medical schools for end-of-course comprehensive examinations,24 by the Medical Council of Canada as part of the licensure process,25 and as part of the United States Medical Licensing Examination, among many others.26,27 Simulation-based training has become popular because it is usually less costly than training with real patients, and it provides experiences without risk to patients.28 In addition to the benefits of cost and risk avoidance, there are also benefits to learning.29 Training can be directed at specific knowledge and skills, especially procedures and higher-level cognitive processes, and some simulations can unobtrusively collect detailed data providing assessment information that can be used to automatically score performance and diagnose learning problems.30 Simulations can also be used to provide experiences not possible in the real environment, such as repeated practice on parts of a task that cannot be isolated in the real world (e.g., intubation, venipuncture, tying surgical knots, or incision and drainage of abscesses). This is not to say that simulation-based training can replace training with real patients supervised by a knowledgeable instructor—nobody would want a surgeon


trained only on simulations—but a useful level of knowledge and skill can be developed cost-effectively and safely with simulation-based training in preparation for training in the real environment. Medical simulations have great promise, but not all simulations are effective, and, unfortunately, reviews of simulation-based medical education research indicate that most evaluations of the effectiveness of medical simulations have not been of sufficient technical quality to produce trustworthy results.31–34 This article discusses issues associated with technical quality and methods for achieving it in evaluations of the effectiveness of medical simulations. Note that the focus is on effectiveness, not cost. The article in this supplement by Fletcher and Wind35 describes approaches to economic analyses that, with data on effectiveness using methods discussed in this article, can be used to determine cost-effectiveness or cost-benefit. The article begins with a discussion of the criteria for technical quality, the measures available for evaluating medical simulations, approaches to scoring simulation performance, and methodological approaches, and then describes an evaluation model. It concludes with a summary and discussion of future directions in methods and technology for evaluating medical simulations.

TECHNICAL QUALITY OF EVALUATIONS

Evaluations must satisfy two major criteria for technical quality: reliability and validity. This section discusses each. There are also two lesser but, nevertheless, important criteria that warrant mentioning in brief: fairness and usability. Fairness is an aspect of validity, and its absence is discussed later as a “threat to validity.” Fairness means that inferences based on the results of the evaluation are appropriate for most people, of most backgrounds. In the measurement literature,36 fairness is defined in terms of four properties:

– The test is free of bias.
– There is equal opportunity to show proficiency.
– In tests of knowledge and skill, there is equal opportunity to learn.
– Score distributions are as equal as possible across different groups.
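As a minimal illustration of the last property, the sketch below screens for a large score-distribution difference between two examinee groups using a standardized mean difference; the group scores and the flagging threshold are hypothetical.

```python
import numpy as np

def standardized_mean_difference(scores_a, scores_b) -> float:
    """Cohen's d style effect size between two groups' score distributions."""
    a, b = np.asarray(scores_a, float), np.asarray(scores_b, float)
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2.0)
    return (a.mean() - b.mean()) / pooled_sd

# Hypothetical simulation scores for two demographic groups of examinees.
group_a = [78, 85, 90, 72, 88, 81, 76, 84]
group_b = [75, 83, 89, 70, 86, 80, 74, 82]

d = standardized_mean_difference(group_a, group_b)
# Flagging |d| >= 0.5 here is an arbitrary screening threshold, not a standard.
print(f"standardized mean difference = {d:.2f}",
      "-> review for possible bias" if abs(d) >= 0.5 else "")
```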


Of the four properties, bias has received the most attention in the measurement literature. Bias is defined as any construct-irrelevant source of variance that systematically affects the performance of different groups of examinees, e.g., groups defined by gender, ethnic or cultural background, socioeconomic status, or age.37 Usability refers to practical considerations in conducting the evaluation, such as the cost of implementation as well as time requirements, ease of administration, and the comprehensibility of results to the intended audience. Usability is important, but not as important as reliability and validity.

Reliability

Reliability concerns the consistency of measurement, e.g., internal consistency or test/retest. It requires that results are consistent from one measurement to another, e.g., at different times, with different raters, or even with different (but considered equivalent) tasks. It requires that the evaluation methodology give the same result each time it is used. This is achieved through the use of well-defined and standardized procedures and measurement instruments. Perfect consistency is not possible because people are not perfectly consistent. Simulation users may have learned or forgotten things, or may be under more or less stress on different days. Raters may not agree on interpretations of all judgment criteria, and a rater’s criteria may change over time. Tasks may be more or less difficult for different users, depending on prior experience. All these factors introduce measurement error into evaluation results. Methods for determining reliability are based on determining the measurement error. The greater the consistency of results, the smaller the measurement error, and thus the greater the reliability.36 These methods are based on traditional psychometrics or classical test theory,38 which is based on assumptions about how a test is constructed: linear, static, and homogeneous, providing many samples of behavior, and focused on between-individual differences39—think standardized tests, such as the Scholastic Aptitude Test.40 Most simulations, however, have few of these characteristics. Simulations are nonlinear, i.e., with more than one pathway to success or failure. They are frequently short, dynamic, adaptive, and heterogeneous, and provide relatively few samples of behavior. Finally, these assessment simulations are often focused on within-individual differences, including changes in performance during use of the simulation, as well as interindividual differences. In addition, classical test theory is not well suited for handling the complex correlations often found in data produced by simulations, for providing the real-time scoring and feedback often required for simulation-based assessments, or for providing measures of changes in proficiency over time. In this supplement, Li Cai41 describes alternatives to classical test theory appropriate for the psychometrics of medical simulation. These alternatives are based on a new generation of latent variable models applying Bayesian inferential methods to make inferences about latent variables from observed variables.

methods to make inferences about latent variables from observed variables.
Simulations provide one long or a few short samples of behavior, rather than answers to many short questions (i.e., multiple choice), making the usual approaches to reliability inappropriate. As a result, approaches to reliability for simulations (and all performance assessments) have focused on the reliability of judges or raters scoring the performance rather than the “score” reliability of individuals.42 As noted earlier, the use of judges or raters introduces a source of error, along with characteristics of simulation users, the tasks, factors associated with the testing occasion, e.g., time of day, and interactions of these sources. Generalizability theory is designed to allow identification of the sources of error and estimation of the contribution of each to a behavioral measurement.43–45 Sources of error are called facets of the measurement. To evaluate the reliability of a measurement, a generalizability study is conducted to estimate the contribution of each facet and the interaction of facets. A decision study is then conducted to determine the elements of a measurement procedure that minimize error. For example, we can use generalizability theory to determine how many judges we need to make reliable assessments of performance. If judges differ in their interpretation of criteria or the evaluation is complex, more judges are needed to obtain an accurate measurement. But if judges agree on criteria or the evaluation is simple, fewer judges will be required.
In addition, because computer simulations are complex and take a relatively long time to complete, it may be the case that only a small number of simulation trials can be administered in the time available for data collection. This limits the generalizability of the results because, unlike selected-response tests that provide equivalent forms, the problem of designing equivalent simulation scenarios (tasks) has not been solved. If time is available for only one assessment task, there is uncertainty as to whether performance on a different task thought to require the same knowledge and skills would provide the same results. Performance in one scenario will not necessarily be a good predictor of performance in another.

Validity
Validity is the degree to which evidence supports the interpretations and uses of results. Of the two major criteria for technical quality, reliability and validity, validity is the more important. The consistency measured by reliability makes it possible to have validity, but it is possible to have consistent results that are not valid.36 Validity is not a property of the evaluation; it is a property of the inferences made based on the results.36 Validation should be thought of as an argument presenting evidence to make a case, and not, as with reliability, the calculation of a statistic. A validity argument must be developed that marshals a wide range of evidence to make the case.36,37 This argument is very different from early conceptions of validity46 in which specific validity types are considered, e.g., face
validity (Does the test performance look like what is supposed to be measured?), content validity (Is the performance measured related to content goals or domains?), predictive validity (Do people with higher scores do better on a future criterion measure?), and criterion validity (Does performance on the new measure relate in predictable ways to an existing measure of known quality?). Although all these questions may be considered in making a validity argument, one no longer looks at a list of validity types and chooses one or two as most appropriate or, more likely, easiest to implement. According to the Standards for Educational and Psychological Testing,36 there are five major sources of evidence that might be used to support a validity argument: evidence based on content, response processes, internal structure, relations to other variables, and consequences of testing. These are described below, along with two additional sources of evidence: threats to validity and sensitivity to instruction and experience.
–Evidence based on content. This is the weakest form of evidence for a validity argument. It is concerned with the representativeness of the content on which the simulation is based, not with examinee performance or the interpretation of the meaning of the performance.
–Evidence based on response processes. This has to do with the validity of interpreting examinee performance as evidence for the cognitive processes the examinees use when responding, e.g., some aspect of simulation performance is taken as evidence for situation assessment or problem-solving skills. Evidence about response processes might be obtained by questioning the examinee about strategies used, or by using think-aloud protocols.36
–Evidence based on internal structure. Simulations are often designed to provide instruction and/or assessment on several knowledge or skill dimensions, such as situation awareness, planning, decision making, and communication. Evidence that these dimensions can be reliably distinguished based on examinee performance, e.g., by using the results of a confirmatory factor analysis,47 would support the validity argument.
–Evidence based on relations to other variables. Correlations of examinee performance with other measures thought to be related also provide support for the validity argument.36 Such evidence includes predictive accuracy, in which scores are correlated with a criterion measure that simulation performance is intended to predict, e.g., diagnosis performance with a standardized patient3–8 and subsequent diagnosis with a real patient. Other examples are correlations with other measures designed to measure the same knowledge or skill, e.g., diagnosis performance with a standardized patient correlated with performance on a multiple-choice test presenting cases for diagnosis. Lack of correlation with measures designed to measure different knowledge or skill is another source of evidence. An example would be the relation of diagnosis performance with a standardized patient to intubation performance with a mannequin.
–Evidence based on consequences of testing. Use of a simulation has consequences for the examinee, especially when it is used for assessment. If results are due to knowledge or skills the simulation was designed to assess, this obviously supports the validity argument. If, however, results are due, at least in part, to knowledge or skills unrelated to what is to be assessed, such as a lack of computer skills interfering with performance on a computer simulation, validity should be questioned. This is an example of a “threat” to validity—an alternative explanation for good and poor performance. It is also an example of a lack of validity due to consequences of testing if it can be linked to an examinee characteristic that has nothing to do with the goal of the assessment, including membership in a particular socioeconomic group.
–Threats to validity. A validity argument is weakened by “threats” to validity, alternative explanations for good and poor performance unrelated to the knowledge or skill that is to be assessed. There are many potential threats: poor reliability; misalignment of the simulation experience and the knowledge/skill objectives; misalignment of the measures and objectives of the simulation; inadequate instructions, user interface defects, or lack of computer skills for computer simulations; unfair administration, such as inadequate instructions or time; inappropriate scoring models, e.g., scoring that does not accommodate all acceptable strategies; poor examinee sampling; and poor scenario selection (content sampling). To support the validity argument, all threats to validity should be identified and eliminated.
–Sensitivity to instruction and experience. A valid simulation should be sensitive to instruction and experience, eliciting higher scores for people who have received instruction or who have more experience or acknowledged expertise in the targeted knowledge or skill. A minimal check of this property, together with the relations-to-other-variables evidence above, is sketched after this list.
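A minimal sketch of how two of these evidence sources might be checked in practice: correlation with a related criterion measure (relations to other variables), and sensitivity to instruction via a trained-versus-untrained contrast. The scores, group labels, and thresholds are hypothetical placeholders, not data from the cited studies.

import statistics

def pearson_r(x, y):
    # Pearson correlation between two equal-length score lists.
    mx, my = statistics.mean(x), statistics.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

# Hypothetical scores: simulation-based diagnosis score and a criterion measure
# (e.g., a later supervised assessment) for the same examinees.
sim_scores = [62, 70, 75, 80, 85, 90, 95]
criterion  = [58, 66, 72, 79, 83, 88, 96]
print(f"Convergent evidence: r = {pearson_r(sim_scores, criterion):.2f}")

# Sensitivity to instruction: trained examinees should outscore untrained ones.
trained   = [82, 85, 88, 90, 91]
untrained = [70, 72, 74, 75, 78]
gap = statistics.mean(trained) - statistics.mean(untrained)
print(f"Trained - untrained mean difference: {gap:.1f} points")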

KIRKPATRICK MODEL
The Kirkpatrick model48,49 is an evaluation framework that supports the idea of marshaling evidence to make a validity argument. It is also an approach to evaluation that has been successful in many different training and educational settings and has become an industry standard in the training world. It has been adapted and modified over time, but the basic structure has not changed. As shown in Figure 1, the model describes four levels of evaluation. The levels are intended to represent a sequence of evaluation questions, each level providing information that affects the next level. An evaluation is conducted at each level, beginning at Level 1 and moving up. Each level provides evidence for a validity argument and information supporting interpretation of results at the next level. For example, if there is no evidence for student learning at Level 2, reactions at Level 1 may tell us why—students may not be motivated to learn from the simulation. Similarly, a failure at Level 3 (no behavior change back on the job) may be explained by an absence of learning at Level 2. Difficulty increases as you move up, but the value of the information also increases at each level. Kirkpatrick recommends evaluating at all levels, but in practice, because the difficulty and cost increase at each level and because Level 3 and especially Level 4 may be difficult to evaluate in the work environment, it may be tempting to stop at Level 2, or even Level 1. Kirkpatrick, however, emphasizes the impact on validity of misaligning measures with goals. For example, if the objective is transfer of knowledge, skills, or attitudes to performance on the job, you need to go to Level 3 for a valid evaluation. And if the objective is organizational/patient benefit, a Level 4 evaluation is required.

FIGURE 1. The Kirkpatrick evaluation model.

SIMULATION PERFORMANCE MEASURES (PROCESS VS. OUTCOME)
A measure is a number indicating the presence and amount of something, such as the number of errors, time, or ratings of some aspect of simulation performance on a five-point scale. McNulty et al50 provide an excellent overview of computer-based testing in the medical curriculum. We will focus on computer simulations. One of the great advantages of a simulation is the ability to measure knowledge and skills in performing procedures and higher-level cognitive processes. This measurement is based on the examinee’s actions as the task is performed, in addition to measures focused on the outcome of the process, such as a rating of overall success or the value of a physiological indicator like blood glucose level, albumin level, or blood pressure.
As noted earlier, a key requirement for achieving validity is the use of appropriate measures aligned with the intended objectives of the simulation, usually related to the knowledge and skill required to perform the simulated task. This seems obvious, but there are many examples of misalignment of measures with objectives. An extreme example is the evaluation that measures learning using reaction forms or opinion surveys asking students how much they learned.51,52 This provides information on how much students think they learned, not how much they actually learned. Figure 2 shows examples of measures for each Kirkpatrick level. Measures must tap the entire range of knowledge and skills at the same level of complexity addressed by the simulation, and they must be validated for the purposes and situations to which they are applied. Swick et al53 provide an excellent treatment of assessing the Accreditation Council for Graduate Medical Education competencies in psychiatric programs. Brünken et al54 provide indicators for measuring cognitive load, and Hays55 provides various rating scales for evaluating computer-based instruction.

FIGURE 2. Typical measures for Kirkpatrick evaluation model levels.

To evaluate simulations targeting procedural or higher-level knowledge and skills, measures derived from simulation performance are desirable. There are two sources of measures: (1) human raters score performance using checklists based on scoring rubrics, and (2) automated scoring based on measures embedded in the simulation itself. For example, in tasks performed by manipulating objects on a computer screen, a mannequin, or an anatomic model, it may be possible to record the actions of the examinee in performing the task, including mouse clicks on a computer screen or actions on a physical device, with the associated location, time, and task context as appropriate.56
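As an illustration of the second source of measures (scoring based on measures embedded in the simulation itself), the sketch below logs examinee actions with a timestamp, location, and task context and derives simple process measures such as action count and completion time. The event names and fields are assumptions for illustration, not part of any cited system.

import time

class ActionLogger:
    """Minimal embedded-measure logger: records each examinee action
    with a timestamp, location, and task context."""

    def __init__(self):
        self.events = []

    def log(self, action, location=None, context=None):
        self.events.append({
            "action": action,
            "location": location,   # e.g., screen coordinates or device part
            "context": context,     # e.g., current scenario step
            "timestamp": time.time(),
        })

    def completion_time(self):
        # Elapsed seconds from first to last logged action.
        return self.events[-1]["timestamp"] - self.events[0]["timestamp"]

# Hypothetical usage during a simulated procedure.
log = ActionLogger()
log.log("select_instrument", location=(312, 208), context="prep")
log.log("insert_catheter", location=(455, 130), context="insertion")
log.log("confirm_placement", location=(455, 130), context="verification")
print(f"{len(log.events)} actions in {log.completion_time():.2f} s")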

CHECKLISTS
The easiest and most widely used approach to scoring (and the only feasible approach when automated scoring is not possible) is to use checklists consisting of explicit outcome and/or process criteria. Scoring rubrics are used to assign scores to each item, and the scores can be weighted to account for the importance of the item. Checklists are used with standardized patient-based tests (e.g., Swanson,57 van der Vleuten and Swanson58), with written and computer-based clinical simulations or computer-based case simulations, also called patient management problems,1 and with mannequins.59–61 The standardized patients may do the rating in standardized patient-based tests. People with clinical expertise serve as raters for the other simulation types and for some standardized patient-based tests. Ratings can be done live or by reviewing videotapes. In addition to being the only feasible approach when automated scoring using embedded measures is not possible, checklists have the benefit of being objective for recording clearly observable examinee actions such as questions and physical examination maneuvers. Rater training is required, and with training, raters can be very accurate.62 Inter-rater reliability, the degree of agreement among raters, should always be measured (a minimal agreement check is sketched below). Potential problems with checklists include the difficulty of developing rubrics that appropriately reward different strategies that are similar in quality and similar strategies that differ in quality.1 It can also be difficult to develop weights that accommodate more and less important actions; if weights are large or negative, scoring can become complex, which can lead to inconsistencies that compromise reliability, and an examinee could get a high or low score based on a single action. Holistic scoring, focusing on the outcome or process as a whole rather than breaking it into separate parts (i.e., analytic scoring), has also been used. It has been criticized as subjective but, with good rater training, has been shown to work.62,63
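The sketch below scores a weighted checklist and computes Cohen's kappa for two raters' judgments of the same performance; the checklist items, weights, and ratings are hypothetical and are included only to show the mechanics.

from collections import Counter

# Hypothetical weighted checklist: item -> weight.
weights = {"obtains_history": 1, "checks_airway": 2,
           "orders_correct_test": 2, "explains_plan": 1}

def checklist_score(ratings):
    # ratings: item -> 1 (performed) or 0 (not performed).
    return sum(weights[item] * done for item, done in ratings.items())

def cohens_kappa(rater_a, rater_b):
    # Agreement on binary checklist judgments, corrected for chance agreement.
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum((counts_a[c] / n) * (counts_b[c] / n)
                   for c in set(rater_a) | set(rater_b))
    return (observed - expected) / (1 - expected)

rater_a = [1, 1, 0, 1]  # one examinee, judged item by item by two raters
rater_b = [1, 0, 0, 1]
print("Weighted score (rater A):", checklist_score(dict(zip(weights, rater_a))))
print(f"Cohen's kappa: {cohens_kappa(rater_a, rater_b):.2f}")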

AUTOMATED SCORING
There have been multiple frameworks for the evaluation and use of automated scoring (see Williamson et al64 and Shermis and Burstein65). We organize the literature into three major approaches: expert-based methods, data-driven methods, and domain-modeling methods.

Expert-Based Methods
There are two expert-based methods: using expert performance and modeling expert judgment. In the first approach, actual expert performance is considered the gold standard against which student performance is compared,66,67 not what experts say should be competent performance or how experts rate student performance. This approach has been used to develop tasks for content understanding using essays67 and knowledge maps.68 A related approach is to model experts’ ratings of examinees’ performance on various task variables. Here, expert judgment is considered the gold standard against which student performance is compared, not actual expert performance. This scoring approach has been used successfully to model expert and rater judgments in a variety of applications, including essays69 and patient management skills.30 One of the major issues with expert-based scoring is the selection of the expert.70,71 Problems include experts’ biases and the influences of the experts’ content and world knowledge, linguistic competency, expectations of student competency, and instructional beliefs.72

Data-Driven Techniques
In data-driven techniques, performance data are subjected to statistical or machine-learning analyses (e.g., artificial neural networks with hidden Markov models). Using artificial neural network and hidden Markov model technologies, Stevens et al73 have developed a method for identifying learner problem-solving strategies and modeling learning trajectories, or sequences of performance states. Applying the method to chemistry, they were able to identify trajectories revealing learning problems, including not thoroughly exploring the problem space early, reaching a performance state from which a more desirable end state is unlikely, and reaching a state from which the learner could transition to a better or worse state with equal likelihood. With this information, it may be possible to perform a fine-grained diagnosis of what learners do not know, and to use learning trajectories to guide the sequence of instruction and the type and form of remediation on the fly. Validation of data-driven methods is complicated because there is no a priori expectation of what scores mean and no inherent meaning in the classification scheme. Interpretation is post hoc, which creates the potential for the introduction of bias in assignments to groups after the groups have been defined.74 A second problem is that machine-learning techniques can be highly sample-dependent, and the scoring
process is driven by statistical rather than theoretical issues.71 Because of these issues, validity evidence is particularly important when using data-driven techniques to score student responses.

Domain Modeling
This approach attempts to model the cognitive demands of the domain itself. The model specifies how knowledge and skills influence each other and the task variables on which observations are being made. The approach relies on a priori linking of student performance variables to hypothesized knowledge and skill states. Student knowledge and skills are then interpreted in light of the observed student performance. This approach has been used successfully in a variety of domains and modeling types, from canonical items (e.g., Hively et al75), to Tatsuoka's rule-space methodology,76 to the use of Bayes nets to model student understanding in domains such as Web searching,77 rifle marksmanship,78 hydraulic troubleshooting,79 dental hygiene skills,80 network troubleshooting,81 and circuit analyses.82 The most important issue in domain modeling is identifying the essential concepts and their interrelationships. This difficulty can be mitigated through cognitive task analyses and direct observation of performance, but it is critical to gather validity evidence to validate the structure of, and inferences drawn by, the Bayes net. For examples of empirical validation techniques, see Chung et al78 and Williamson et al83 (a minimal sketch of the underlying inference step follows the method-selection overview below).

METHOD SELECTION
For evaluations conducted at each Kirkpatrick level, the methods used are important because they affect the quality of the evaluation. Method selection and design are not easy tasks: medical simulation evaluation is difficult for all the reasons any educational research is difficult, and there are additional obstacles that come with the use of technology. The effectiveness of a simulation is due to a combination of factors, not one, and these factors may interact in complex ways. The instructional experience depends on many variables, including instructor background, teaching philosophy, training, and experience; the support of school management; and characteristics of the students.84 And when technology is part of the experience, there are additional variables, including availability of hardware, software, and technical support; curriculum integration strategies; students’ prior experience with and expertise in using technology; and instructor expertise in technology and skill in implementing the simulation.84 This section presents an overview of three major methodological approaches (the random-assignment experiment, quasi-experiments, and alternatives based on qualitative methods) and then discusses combined methods. We end with a discussion of heuristics for matching methods to situations (or research questions). For an excellent and detailed treatment of these issues, see Shadish et al.85
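Returning briefly to the domain-modeling approach above: at its core is updating belief about a latent skill from scored observations. The sketch below applies Bayes' rule to a single "skill mastered" variable; the prior and the conditional probabilities are illustrative assumptions, not parameters from the systems cited in this article.

def update_mastery(prior, observations, p_correct_given_mastery=0.85,
                   p_correct_given_no_mastery=0.30):
    """Update P(skill mastered) after each scored observation (1 = correct step,
    0 = incorrect step) using Bayes' rule. All probabilities are illustrative."""
    belief = prior
    for obs in observations:
        if obs == 1:
            likelihood_m = p_correct_given_mastery
            likelihood_n = p_correct_given_no_mastery
        else:
            likelihood_m = 1 - p_correct_given_mastery
            likelihood_n = 1 - p_correct_given_no_mastery
        numerator = likelihood_m * belief
        belief = numerator / (numerator + likelihood_n * (1 - belief))
    return belief

# Hypothetical sequence of scored steps from one simulation scenario.
posterior = update_mastery(prior=0.5, observations=[1, 1, 0, 1, 1])
print(f"P(skill mastered | observed steps) = {posterior:.2f}")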

Random-Assignment Experiments
A random-assignment experiment requires random assignment of the unit of treatment application, e.g., students, instructors, or the school, to experimental and control groups. The unit of treatment application is the unit of analysis, and it defines the sample size. Random assignment is required to achieve equivalent groups in terms of variables not explicitly controlled by the evaluator. Variables explicitly controlled by the evaluator are the treatment—the introduction of the simulation—and all measures and procedures that may affect the results. For examples of the use of random-assignment experiments, see Adler et al,86 Boulet and Swanson,23 and Robinson et al.87
The argument for the use of random-assignment experiments is that they provide better evidence for causal inferences than any other method. This is true, assuming that the conditions required for experiments are met. The difficulty of meeting these conditions has led to strong objections to experiments in education research, including simulation evaluations. The key problem is the requirement for random assignment to experimental groups. Medical schools do not typically assign students to classrooms and instructors randomly, and students and instructors are not randomly assigned to schools. It is also difficult to meet the requirement for a control group not receiving the treatment. Students (and instructors) do not readily accept withholding the use of technology for the sake of an experiment. It may also be the case that simulation use in other classes is so widespread that it is difficult or impossible to have a control group with no relevant experience. And many argue that the goal of simulation is to provide experiences not possible without the simulation, which means that it is impossible to have a control group receiving the same experience but without the simulation.
A related problem is the need for an adequate sample size. The point of conducting an experiment, either a random-assignment experiment or a quasi-experiment as described below, is to detect a difference between groups in the study sample when a difference actually exists in the populations from which the samples are drawn. The probability of detecting such a difference is called the power of a statistical test. Obviously, the power should be high, so that if there is no difference between groups in the experiment, it is reasonable to conclude that there is no difference in reality. The power of a study depends on several factors, including the statistical test, the significance criterion, measurement error, and the size of the experimental effect, but the general approach to increasing power is to increase the sample size. Despite this, as reported by Moher et al,88 researchers often use sample sizes too small to achieve power adequate to detect real effects, and most do not even report a sample size calculation. For information on calculating sample size, see Cohen89,90 and Lenth.91 Lenth92 provides an online tool for power and sample size calculations.
Another criticism of the experimental approach is that although it provides better evidence for causal inferences, it
does not provide information on why the simulation had its effects. The argument is that the experiment is a black box that provides evidence of connections between causes and effects, but does not provide information on the processes inside the box that explain why the simulation caused the effects, many of which depend on the context of the simulation. Finally, there are the practical problems of cost and time. Experiments are expensive and time-consuming. They may require all the funds available for evaluation and take so long to complete that decisions are made before results are available. Whether this is unique to random-assignment experiments is arguable, but it is a common criticism nonetheless.

Quasi-Experiments
Quasi-experiments have many of the features of experiments except random assignment to experimental and control groups and appropriate control of selected variables, such as the timing of exposure to the simulation.85 One example is the time-series experiment, in which periodic measurements are taken over time and an experimental change is inserted at some point in the time series of measurements. Changes after insertion may indicate an effect caused by the experimental change, but may also be caused by other events occurring during the time series, because there is no control over events other than the introduction of the experimental change. Another example is the nonequivalent control group design, one of the more common designs in educational research. There is an experimental group and a control group. Both are given a pretest and a posttest, but only the experimental group receives the experimental treatment between the two tests. This is similar to an experimental design, but students are not randomly assigned to each group. Causation can be inferred if there is a difference between the experimental and control groups in the posttest score. Because the two groups are naturally assembled, e.g., students in two different classes, rather than randomly assigned, they cannot be considered equivalent, and it is possible that some difference affecting the groups other than the experimental treatment could be the cause. Although this may seem unlikely, it is possible. The point is that the evidence from quasi-experiments is not as strong as the evidence from random-assignment experiments, but it is also true that quasi-experiments are usually more feasible and practical in an education setting. For an example of a quasi-experiment, see the article by Giuliano et al.93

Qualitative Methods
Qualitative methods do not attempt to compare experimental and control groups at all, or to control variables. They investigate the simulation through observation, review of artifacts, and interviews, studying cases in their natural setting to consider variables as they appear in all the complexity of the context.94 These methods are very popular in
education research, including evaluation of simulations, due in part to the difficulties in doing experimental research in educational settings, and in part to the desire to obtain information on why the simulation had its effects—the processes and mechanisms that lead from specifics of the simulation to effects—and the contextual conditions under which the simulation is more or less effective. The focus is on the context of the simulation, such as local engagement, collaboration, and feedback, and on investigating why those results occurred. Understanding the cause of the result involves developing a theory of change, a description of the processes through which the effects are produced. Qualitative methods are weak on causal inference, but the contextualization makes them very useful to decision makers by providing models (theories of change) describing how and why the simulation works or does not work in the existing system, along with information needed to decide whether, how, and when to use the simulation. Qualitative methods are especially useful for studying a broad range of naturally occurring practices found in many different parts of the school, rather than the effects of a particular simulation, which would usually be evaluated with an experiment. Such studies are often descriptive, interested in the frequency of various instructional technology uses and practices, not their effects. Some correlate descriptive data with student outcomes to attempt to identify relationships, if not causes. Concluding anything about causation from correlations is, of course, problematic. For an example of a technology evaluation using qualitative methods, see the article by Overly et al.95

Combined Methods
As is usually the case when there is a debate over the merits of radically different points of view, the practical truth lies somewhere in between. There is no one right way to do technology evaluation. The approach depends on the purpose of the evaluation, the nature of the simulation, and the context in which it is situated. Some evaluations will require quantitative methods, some will require qualitative methods, and usually the evaluation will benefit from a combination providing both quantitative and qualitative data on student learning and attitude outcomes, context, the military environment, and the implementation of the simulation.

Selecting Methods

This section describes a heuristic process for deciding when to use what research methods and combinations of methods. The decision depends on the purpose of the evaluation, the nature of the simulation, the context in which it is situated, and practical constraints including site cooperation and available time, funding, equipment, and support resources. The choice need not be limited to a single design. Depending on the purpose, simulation, context, and practical constraints, the evaluation may and usually should consist of a combination of methods.


FIGURE 3. A heuristic process for selecting evaluation methods.

Figure 3 summarizes a heuristic process for matching evaluation methods to situations and requirements. The process is organized into the following set of guidelines, presented as questions followed by recommendations.
1. Is the evaluation concerned with the impact of a specific simulation or with identifying promising practices?
–If it is identifying promising practices, the evaluation should start with a quantitative study to identify successful sites based on some measure, and then qualitative methods should be used to understand the differences between successful and unsuccessful sites and the practices related to success.

–If the investigation is concerned with a specific simulation, there is a question on the purpose of the evaluation—question 2.
2. Is the purpose of the evaluation to improve the simulation or determine its effectiveness?
–If the purpose is to improve the simulation, the evaluation is a “formative” evaluation. Formative evaluations are used to improve early-stage projects by collecting information that can be used to guide the development and implementation of the intervention. This requires the use of qualitative methods to provide information on how the simulation works. The evaluator will be
interested in how features of the environment interact with features of the simulation, and how features of the simulation will influence behavior.
–If the purpose is to determine the effectiveness of the simulation, the evaluation is a “summative” evaluation. In this case there is a question on the need for causal information—question 3.
3. Is causal information needed?
–If causal information is not needed, qualitative methods are appropriate.
–If causal information is needed, quantitative methods are indicated. Random-assignment experiments are best for determining causation and should be considered first, but before selecting an experiment, there is a question on the feasibility of random assignment—question 4.
4. Is it possible for students, classes, or schools to be randomly assigned to conditions?
–If the answer is yes, a random-assignment experiment may be possible, depending on the answer to question 5.
–If the answer is no, a quasi-experiment may be possible, depending on the answer to question 5.
5. Is an experiment feasible? Before selecting a random-assignment experiment or quasi-experiment, the feasibility of conducting either must be determined. For either experiment type to be feasible, it must satisfy the following requirements:
—Use of the simulation must be different from standard practice in order to achieve a meaningful comparison.
—Use of the simulation must be maintainable, that is, it must continue unchanged for the course of the experiment.
—Participation must not deny students access to an entitlement, e.g., access to an instructional experience.
—Human subjects protection requirements must be met.
—Participants and the site must be willing to cooperate.
—An adequate sample size must be available (a sample-size sketch follows these guidelines).
—Time, funding, equipment, and support resources must be available.
–If feasibility requirements cannot be met, qualitative methods should be used.
–If feasibility requirements can be met for either experiment type, there is a question on the need for information on context—question 6.
6. Is there a requirement for information on conditions of applicability or the process producing the outcomes?
–If the answer is yes, and this should usually be the case, an experiment (random-assignment or quasi-experiment, whichever is indicated in question 4) combined with qualitative methods for the contextual information is appropriate.
–If the answer is no, the experiment is sufficient.
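Where an experiment is contemplated, the adequate-sample-size requirement can be checked up front with a power calculation. A minimal sketch, assuming a two-group comparison and the statsmodels package; the effect size, alpha, and power targets are illustrative choices, not values prescribed by the cited sources.

# Requires statsmodels (and its dependencies) to be installed.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Solve for the per-group sample size needed to detect a medium effect
# (Cohen's d = 0.5) with 80% power at a two-sided alpha of 0.05.
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80,
                                   alternative="two-sided")
print(f"Approximately {n_per_group:.0f} participants per group are needed.")

# Conversely, estimate the power actually achieved with only 20 per group.
achieved = analysis.solve_power(effect_size=0.5, alpha=0.05, nobs1=20,
                                alternative="two-sided")
print(f"With 20 per group, power is only about {achieved:.2f}.")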

If random assignment is not possible, but feasibility requirements can be met, and there is a requirement for information on conditions of applicability or the process producing the outcomes, a quasi-experiment combined with qualitative methods would be appropriate. If there is no requirement for information on conditions of applicability or process, which should not be the usual case, a quasi-experiment alone is appropriate. And as with the random-assignment experiment branch of the method selection process, if the quasi-experiment or the quasi-experiment/qualitative method combination is not appropriate, qualitative methods are the choice.

SUMMARY AND DISCUSSION
This article has presented an overview of issues and approaches relevant to evaluating medical simulations. It discusses criteria for the technical quality of evaluations and methods for achieving it. It introduces the Kirkpatrick model, a proven evaluation model supporting the idea of marshaling evidence to make a validity argument. It discusses measures, approaches to scoring, and research methods used to provide evidence, with guidelines for selecting appropriate methods.

Takeaway Message
Medical simulations have great promise for training complex, high-value tasks at less cost and without risk to patients. However, great promise and impressive technical capability are not sufficient to conclude effectiveness. To realize the promise, practitioners must assess the systems and the learning they help produce, and the evaluations must have technical quality. The article’s central takeaway message is the importance of technical quality—reliability and, especially, validity—as the fundamental requirement for any evaluation. The message is linked to three supporting ideas:
1. Validity is not a general quality of an evaluation. An evaluation’s validity depends on the context of its use and the inferences to be drawn based on the results. A validity argument must be made using a wide range of evidence for the appropriateness of the inferences for the particular context.36
2. Begin with a definition of the objectives. The first step in evaluation design is to define the objectives of the simulation—the knowledge and skill required for success. This leads to defining measures, operationalizing the scoring, and then validating the approach with empirical evidence.
3. Align measures, scoring, and research methods with the objectives. Validity requires alignment with the objectives. Evaluate at all levels of the Kirkpatrick model if possible, but always at the level matching the objectives.

Future Directions
Although automated scoring based on measures embedded in the simulation itself is not widely used in current medical simulations, we expect its use to grow. Because of the growing
sophistication of computationally supported data collection, and the importance of formative information about the trainee’s process during learning, outcome measures will, in the future, merge with process measures to create learner profiles rather than scores or classifications. We anticipate that these profiles will have domain-independent components that may predict learners’ likely success in a range of other tasks. We see the study of expertise continuing to add to our knowledge of performance measurement and its validity, and we also predict an increased use of artificial intelligence and advanced decision analysis techniques to support assessment and evaluation. These include ontologies, Bayes nets, artificial neural networks, hidden Markov models, lag sequential analysis, and constraint networks.
Test development guidelines have been developed from lessons learned in the assessment of clinical competence literature.96 The same is needed for medical simulation design and evaluation, based on lessons learned in the evaluation of medical simulations. The Federal Medical Simulation Training Consortium, a partnership of the Department of Defense and other federal institutions involved in medical training and education, is taking a major step in this direction, working with the University of California, Los Angeles Center for Research on Evaluation, Standards, and Student Testing to develop a framework to guide evaluation and refinement of existing curricula (including but not limited to simulations) and development of new curricula, and a set of training effectiveness metrics to allow comparison of curricula.

ACKNOWLEDGMENTS
The work reported herein was supported by a grant from the Office of Naval Research, Award Number N00014-10-1-0978.

REFERENCES 1. Swanson DB, Norcini JJ, Grosso J: Assessment of clinical competence: written and computer-based simulations. Assess Eval Higher Educ 1987; 12: 220–46. 2. McGaghie WC, Issenberg SB: Simulations in professional competence assessment: basic considerations. In: Innovative Simulations for Assessing Professional Competence, pp 7–22. Edited by Tekian A, McGuire CH, McGaghie WC. Chicago, Department of Medical Education, University of Illinois at Chicago, 1999. 3. Barrows HS, Abrahamson S: The programmed patient: a technique for appraising student performance in clinical neurology. J Med Educ 1964; 39: 802–5. 4. Collins JP, Harden RM: The Use of Real Patients, Simulated Patients and Simulators in Clinical Examinations (AMEE Medical Education Guide, No. 13). Dundee, UK, Association for Medical Education in Europe, 2004. 5. Wilson L, Rockstraw L (editors.): Human Simulation for Nursing and Health Professions. New York, Springer, 2012. 6. Gerner B, Sanci L, Cahill H, et al: Using simulated patients to develop doctors’ skills in facilitating behaviour change: addressing childhood obesity. Med Educ 2010; 44: 706–15. 7. Betcher DK: Elephant in the room project: improving caring efficacy through effective and compassionate communication with palliative care patients. Medsurg Nurs 2010; 19: 101–5.


8. Safdieh JE, Lin AL, Aizer J, et al: Standardized patient outcomes trial (SPOT) in neurology. Med Educ Online 2011; 16(1): 1–6. 9. Marecik SJ, Prasad LM, Park JJ, et al: A lifelike patient simulator for teaching robotic colorectal surgery: how to acquire skills for robotic rectal dissection. Surg Endosc 2008; 22: 1876–81. 10. Crochet P, Aggarwal R, Dubb SS, et al: Deliberate practice on a virtual reality laparoscopic simulator enhances the quality of surgical technical skills. Ann Surg 2011; 253(6): 1216–22. 11. Lee JT, Son JH, Chandra V, Lilo E, Dalman RL: Long-term impact of a preclinical endovascular skills course on medical student career choices. J Vasc Surg 2011; 54: 1193–200. 12. Privett B, Greenlee E, Rogers G, Oetting TA: Construct validity of a surgical simulator as a valid model for capsulorhexis training. J Cataract Refract Surg 2010; 36: 1835–8. 13. Coles TR, John NW: The Effectiveness of Commercial Haptic Devices for Use in Virtual Needle Insertion Training Simulations. In: 2010 Third International Conference on Advances in Computer-Human Interactions, pp 148–53. Piscataway, NJ, The Institute of Electronic and Electrical Engineers, 2010. Available at http://www.computer.org/csdl/ proceedings/achi/2010/3957/00/3957a148-abs.html; accessed May 7, 2013. 14. Barsuk JH, McGaghie WC, Cohen ER, O’Leary KJ, Wayne DB: Simulation-based mastery learning reduces complications during central venous catheter insertion in a medical intensive care unit. Crit Care Med 2009; 37: 2697–701. 15. Ahlberg G, Enochsson L, Gallagher AG, et al: Proficiency-based virtual reality training significantly reduces the error rate for residents during their first 10 laparoscopic cholecystectomies. Am J Surg 2007; 193: 797–804. 16. Cook DA, Triola MM: Virtual patients: a critical literature review and proposed next steps. Med Educ 2009; 43(4): 303–11. 17. Cendan JC, Lok B: The use of virtual patients in medical school curricula. Adv Physiol Educ 2012; 36(1): 48–53. 18. Cannon-Bowers JA, Bowers C, Procci K: Using video games as educational tools in healthcare. In: Computer Games and Instruction, pp 47–72. Edited by Tobias S, Fletcher JD. Charlotte, NC, Information Age Publishing, 2011. 19. Crofts JF, Bartlett C, Ellis D, Hunt LP, Fox R, Draycott TJ: Training for shoulder dystocia: a trial of simulation using low-fidelity and highfidelity mannequins. Obstet Gynecol 2006; 108: 1477–85. 20. Alinier G, Hunt WB, Gordon R: Determining the value of simulation in nurse education: study design and initial results. Nurse Educ Pract 2004; 4(3): 200–7. 21. Radhakrishnan K, Roche JP, Cunningham H: Measuring clinical practice parameters with human patient simulation: a pilot study. Int J Nurs Educ Scholarsh 2007; 4: Article 8. 22. Cendan JC, Johnson TR: Enhancing learning through optimal sequencing of web-based and manikin simulators to teach shock physiology in the medical curriculum. Adv Physiol Educ 2011; 35(4): 402–7. 23. Boulet JR, Swanson DB: Psychometric challenges of using simulations for high-stakes assessment. In: Simulators in Critical Care Education and Beyond, pp 119–30. Edited by Dunn WF. Des Plaines, IL, Society of Critical Care Medicine, 2004. 24. Scalese RJ, Obeso VT, Issenberg SB: Simulation technology for skills training and competency assessment in medical education. J Gen Intern Med 2008; 23(Suppl 1): 46–9. 25. 
Dauphinee WD, Reznick R: A framework for designing, implementing, and sustaining a national simulation network: building incentive-based network structures and iterative processes for long-term success: the case of the Medical Council of Canada’s Qualifying Examination, Part II. Simul Healthc 2011; 6(2): 94–100. 26. Dillon GF, Boulet JR, Hawkins RE, Swanson DB: Simulations in the United States Medical Licensing Examination (USMLE). Qual Saf Health Care 2004; 13(Suppl 1): i41–5. 27. Dillon GF, Clauser BE: Computer-delivered patient simulations in the United States Medical Licensing Examination (USMLE). Simul Healthc 2009; 4: 30–4.


Evaluation of Medical Simulations 28. Bradley P: The history of simulation in medical education and possible future directions. Med Educ 2006; 40: 254–62. 29. Larsen CR, Soerensen JL, Grantcharov TP, et al: Effect of virtual reality training on laparoscopic surgery: randomised controlled trial. BMJ 2009; 338: b1802. 30. Margolis MJ, Clauser BE: A regression-based procedure for automated scoring of a complex medical performance assessment. In: Automated Scoring of Complex Tasks in Computer-Based Testing, pp 123–67. Edited by Williamson DM, Behar II, Mislevy RJ. Mahwah, NJ, Erlbaum, 2006. 31. Issenberg SB, McGaghie WC, Petrusa ER, Gordon DL, Scalese RJ: Features and uses of high-fidelity medical simulations that lead to effective learning: a BEME systematic review. Med Teach 2005; 27(1): 10–28. 32. Bordage G, Caelleigh AS, Steinecke A, et al: Review criteria for research manuscripts. Acad Med 2001; 76: 897–978. 33. Lurie SJ: Raising the passing grade for studies of medical education. JAMA 2003; 290: 1210–2. 34. McGaghie WC, Issenberg SB, Petrusa ER, Scalese RJ: A critical review of simulation-based medical education research: 2003-2009. Med Educ 2010; 44: 50–63. 35. Fletcher JD, Wind AP: Cost considerations in using simulations for medical training. Mil Med 2013; 178(10)(Suppl): 37–46. 36. American Educational Research Association, American Psychological Association, and National Council for Measurement in Education: Standards for Educational and Psychological Testing. Washington, DC, American Educational Research Association, 1999. 37. Miller MD, Linn R, Gronlund N: Measurement and Assessment in Teaching, Ed 11. Upper Saddle River, NJ, Prentice Hall, 2012. 38. Gulliksen HO: Theory of Mental Tests. New York, John Wiley, 1950. 39. Nunnally JC, Bernstein IH: Psychometric Theory, Ed 3. New York, McGraw-Hill, 1994. 40. Liu J, Harris DJ, Schmidt A: Statistical procedures used in college admissions testing. In: Handbook of Statistics, Volume 26: Psychometrics, pp 1057–94. Edited by Rao CR, Sinharay S. New York, Elsevier, 2007. 41. Cai L: Potential applications of latent variable modeling for the psychometrics of medical simulation. Mil Med 2013; 178(10)(Suppl): 115–20. 42. Patz RJ, Junker BW, Johnson MS, Mariano LT: The hierarchical rater model for rated test items and its application to large-scale educational assessment data. J Educ Behav Stat 2002; 27(4): 341–84. 43. Shavelson RJ, Webb NM: Generalizability Theory: A Primer. Thousand Oaks, CA, Sage, 1991. 44. Brennan RL: Generalizability Theory. New York, Springer-Verlag, 2001. 45. Chiu CWC: Scoring Performance Assessments Based on Judgements: Generalizability Theory. New York, Kluwer, 2001. 46. Messick S: Validity. In: Educational Measurement, Ed 3, pp 13–103. Edited by Linn R. Phoenix, AZ, The Oryx Press, 1993. 47. Thompson B: Exploratory and Confirmatory Factor Analysis: Understanding Concepts and Applications. Washington, DC, American Psychological Association, 2004. 48. Kirkpatrick DL, Kirkpatrick JD: Evaluating Training Programs: The Four Levels, Ed 3. San Francisco, Berrett-Koehler, 2006. 49. Kirkpatrick DI: Evaluating Training Programs: The Four Levels, ED 2. San Francisco, Berrett-Koehler, 1998. 50. McNulty JA, Halama J, Espiritu B: Evaluation of computer-aided instruction in the medical gross anatomy curriculum. Clin Anat 2004; 17: 73–8. doi: 10.1002/ca.10188 51. 
Via DK, Kyle RR, Trask JD, Shields CH, Mongan PD: Using highfidelity patient simulation and an advanced distance education network to teach pharmacology to second-year medical students. J Clin Anesth 2004; 16(2): 144–51. 52. Fitch MT: Using high-fidelity emergency simulation with large groups of preclinical medical students in a basic science course. Med Teach 2007; 29: 261–3. 53. Swick S, Hall S, Beresin E: Assessing the ACGME competencies in psychiatry training programs. Acad Psychiatry 2006; 30: 330–51.


54. Bru¨nken R, Seufert T, Paas F: Measuring cognitive load. In: Cognitive Load Theory, pp 181–202. Edited by Plass J, Moreno R, Bru¨nken R. New York, Cambridge University Press, 2010. 55. Hays RT: The Effectiveness of Instructional Games: A Literature Review and Discussion. Technical report 2005–004. Orlando, FL, Naval Air Warfare Center Training Systems Division, 2005. Available at http://www.dtic.mil/cgi-bin/GetTRDoc?AD=ADA441935; accessed May 7, 2013. 56. Bewley WL, Chung GKWK, Delacruz GC, Baker EL: Assessment models and tools for virtual environment training. In: The PSI Handbook of Virtual Environments for Training and Education: Developments for the Military and Beyond, Vol. 1, pp 300–13. Edited by Schmorrow D, Cohn J, Nicholson D. Westport, CT, Greenwood Publishing, 2009. 57. Swanson DB: A measurement framework for performance-based tests. In: Further Developments in Assessing Clinical Competence, pp 13–45. Edited by Hart I, Harden R. Montreal, Can-Heal Publications, 1987. 58. van der Vleuten C, Swanson DB: Assessment of clinical skills with standardized patients: state of the art. Teach Learn Med 1990; 2: 58–76. 59. Morgan PJ, Cleave-Hogg D, DeSousa S, Tarshis J: High-fidelity patient simulation: validation of performance checklists. Br J Anaesth 2004; 92(3): 388–92. 60. Murray D, Boulet J, Ziv A, Woodhouse J, Kras J, McAllister J: An acute care skills evaluation for graduating medical students: a pilot study using clinical simulation. Med Educ 2002; 36: 833–41. 61. Boulet JR, Murray D, Kras J, Woodhouse J, McAllister J, Ziv A: Reliability and validity of a simulation-based acute care skills assessment for medical students and residents. Anesthesiology 2003; 99: 1270–80. 62. Boulet JR, McKinley DW, Whelan GP, Hambleton RK: Quality assurance methods for performance-based assessments. Adv Health Sci Educ Theory Pract 2003; 8: 27–47. 63. Regehr G, MacRae H, Reznick R, Szalay D: Comparing the psychometric properties of checklists and global rating scales for assessing performance on an OSCE-format examination. Acad Med 1998; 73: 993–7. 64. Williamson DM, Xi X, Breyer FJ: A framework for evaluation and use of automated scoring. Educ Meas 2012; 31(1): 2–13. 65. Shermis MD, Burstein JC (editors): Automated Essay Scoring: A CrossDisciplinary Perspective. Mahwah, NJ, Erlbaum, 2003. 66. Baker EL: Model-based performance assessment. Theory Pract 1997; 36(4): 247–54. 67. Baker EL, Freeman M, Clayton S: Cognitive assessment of history for large-scale testing. In: Testing and Cognition, pp 131–53. Edited by Wittrock MC, Baker EL. Englewood Cliffs, NJ, Prentice-Hall, 1991. 68. Herl HE, O’Neil HF Jr., Chung GKWK, Schacter J: Reliability and validity of a computer-based knowledge mapping system to measure content understanding. Comput Human Behav 1999; 15: 315–33. 69. Burstein J: The e-rater scoring engine: automated essay scoring with natural language processing. In: Automated Essay Scoring: A CrossDisciplinary Perspective, pp 113–22. Edited by Shermis MD, Burstein JC. Mahwah, NJ, Erlbaum, 2003. 70. Bennett RE, Bejar II: Validity and automated scoring: it’s not only the scoring. Educ Meas 1998; 17(4): 9–17. 71. Bennett RE: Moving the field forward: Some thoughts on validity and automated scoring. In: Automated Scoring of Complex Tasks in Computer-Based Testing, pp 403–12. Edited by Williamson DM, Behar II, Mislevy RJ. Mahwah, NJ, Erlbaum, 2006. 72. Baker EL, O’Neil HF Jr.: Performance assessment and equity. In: Implementing Performance Assessment: Promises, Problems, and Challenges, pp 183. 
Edited by Kane MB, Mitchell R. Mahwah, NJ, Erlbaum, 1996. p. 183–99. 73. Stevens R, Soller A, Cooper M, Sprang M: Modeling the Development of Problem Solving Skills in Chemistry with a Web-Based Tutor, pp 580–91. Proceedings of the 7th International Conference on Intelligent Tutoring Systems. Berlin, Springer-Verlag, 2004. 74. Baker EL, Chung GKWK, Delacruz GC: Design and validation of technology-based performance assessments. In: Handbook of Research on Educational Communications and Technology, Ed 3, pp 595–604.

Edited by Spector JM, Merrill MD, van Merriënboer JJG, Driscoll MP. Mahwah, NJ, Erlbaum, 2008.
75. Hively W, Patterson HL, Page SH: A “universe defined” system of arithmetic achievement tests. J Educ Meas 1968; 5: 275–90.
76. Birenbaum M, Kelly AE, Tatsuoka KK: Diagnosing knowledge states in algebra using the rule-space model. J Res Math Educ 1993; 24: 442–59.
77. Bennett RE, Jenkins F, Persky H, Weiss A: Assessing complex problem solving performances. Assess Educ 2003; 10: 347–59.
78. Chung GKWK, Delacruz GC, Dionne GB, Bewley WL: Linking assessment and instruction using ontologies. Proceedings of the I/ITSEC 2003; 25: 1811–22. Available at http://ntsa.metapress.com/link.asp?id=td7v9u19wddex1dd; accessed May 7, 2013.
79. Mislevy R, Gitomer DH: The role of probability-based inference in an intelligent tutoring system. User Model User-adapt Interact 1996; 5: 253–82.
80. Mislevy RJ, Steinberg LS, Breyer FJ, Almond RG, Johnson L: Making sense of data from complex assessments. Appl Meas Educ 2002; 15: 363–89.
81. Williamson DM, Almond RG, Mislevy RJ, Levy R: An application of Bayesian networks in automated scoring of computerized simulation tasks. In: Automated Scoring of Complex Tasks in Computer-Based Testing, pp 201–57. Edited by Williamson DM, Behar II, Mislevy RJ. Mahwah, NJ, Erlbaum, 2006.
82. Darwiche A: A differential approach to inference in Bayesian networks. J ACM 2003; 50: 280–305.
83. Williamson DM, Almond RG, Mislevy RJ: Model criticism of Bayesian networks with latent variables. In: Uncertainty in Artificial Intelligence: Proceedings of the 16th Conference, pp 634–43. Edited by Boutilier C, Goldzmidt M. San Francisco, CA, Morgan Kaufmann, 2000.
84. Haertel GD, Means B (editors): Evaluating Educational Technology: Effective Research Designs for Improving Learning. New York, Teachers College Press, 2003.


85. Shadish WR, Cook TD, Campbell DT: Experimental and Quasi-Experimental Designs for Generalized Causal Inference. Boston, Houghton-Mifflin, 2002.
86. Adler MD, Vozenilek JA, Trainor JL, et al: Development and evaluation of a simulation-based pediatric emergency medicine curriculum. Acad Med 2009; 84(7): 935–41.
87. Robinson JD, Bray BS, Willson MN, Weeks DL: Using human patient simulation to prepare student pharmacists to manage medical emergencies in an ambulatory setting. Am J Pharm Educ 2011; 75(1): Article 3.
88. Moher D, Dulberg CS, Wells GA: Statistical power, sample size, and their reporting in randomized controlled trials. JAMA 1994; 272: 122–4.
89. Cohen J: Statistical Power Analysis for the Behavioral Sciences. Hillsdale, NJ, Erlbaum, 1988.
90. Cohen J: A power primer. Psychol Bull 1992; 112(1): 155–9.
91. Lenth RV: Some practical guidelines for effective sample size determination. Am Stat 2001; 55: 187–93.
92. Lenth RV: Java applets for power and sample size [Computer software]. 2006. Available at http://www.divms.uiowa.edu/rlenth/Power/; accessed May 7, 2013.
93. Giuliano KK, Johannessen A, Hernandez C: Simulation evaluation of an enhanced bedside monitor display for patients with sepsis. AACN Adv Crit Care 2010; 21(1): 24–33.
94. Herman JL, Morris LL, Fitz-Gibbon CT: Evaluator’s handbook. In: Program Evaluation Kit, Vol 1. Edited by Herman JL. Newbury Park, CA, Sage, 1987.
95. Overly FL, Sudikoff SN, Shapiro MJ: High-fidelity medical simulation as an assessment tool for pediatric residents’ airway management skills. Pediatr Emerg Care 2007; 23(1): 11–5.
96. Newble D, Dawson B, Dauphinee D, et al: Guidelines for assessing clinical competence. Teach Learn Med 1994; 6: 213–20.


MILITARY MEDICINE, 178, 10:76, 2013

Prevention of Surgical Skill Decay

Ray S. Perez, PhD*; Anna Skinner, MA†; Peter Weyhrauch, PhD‡; James Niehaus, PhD‡; Corinna Lathan, PhD, PE†; Steven D. Schwaitzberg, MD, FACS§; Caroline G. L. Cao, PhD∥

ABSTRACT The U.S. military medical community spends a great deal of time and resources training its personnel to provide them with the knowledge and skills necessary to perform life-saving tasks, both on the battlefield and at home. However, personnel may fail to retain specialized knowledge and skills during the typical periods of nonuse within the military deployment cycle, and retention of critical knowledge and skills is crucial to the successful care of warfighters. For example, we researched the skill and knowledge loss associated with specialized surgical skills such as those required to perform laparoscopic surgery (LS) procedures. These skills are subject to decay when military surgeons perform combat casualty care during their deployment instead of LS. This article describes our preliminary research identifying critical LS skills, as well as their acquisition and decay rates. It introduces models that identify critical skills related to laparoscopy, and proposes objective metrics for measuring these critical skills. This research will provide insight into best practices for (1) training skills that are durable and resistant to skill decay, (2) assessing these skills over time, and (3) introducing effective refresher training at appropriate intervals to maintain skill proficiency.

INTRODUCTION
Prevention of skill decay is and should be a high priority for any training organization. Skill decay is the partial or full loss of trained or acquired skills and knowledge following periods of nonuse.1 Loss of critical skills and knowledge is of great concern, and is problematic in situations in which medical personnel receive initial training but may not have an opportunity to use these skills or knowledge for extended periods of time. Arthur et al,1 in a meta-analysis of 53 articles within the skill decay and retention literature, found that after 365 days of nonuse or nonpractice, the average participant's performance was reduced by almost a full standard deviation (d = −0.92). There is evidence that declarative knowledge (e.g., facts, principles, and concepts) decays at a slower rate than procedural knowledge (e.g., multiple steps that must be performed in a specified order to solve a problem or complete a task).2 Psychomotor skills also appear to be more resistant to decay than cognitive skills, in the absence of mental rehearsal.1

Although the research literature on skill decay is comprehensive and extensive, this is not true of the literature specifically pertaining to laparoscopic surgical (LS) skill decay. One study that did focus on these skills was conducted by Brunner and Korndorffer3; echoing the findings of Arthur and his colleagues, they demonstrated that LS skills could be trained in a virtual reality learning environment and that, with nonuse, those skills exhibited the same pattern of decay reported in the meta-analysis conducted by Arthur et al.1 Despite this specific conclusion, a primary limitation within this area is the lack of research examining the acquisition and retention of the cognitive skills of LS (e.g., decision making and problem solving). Within the domain of general surgery, Jacklin et al4 studied the effects of providing cognitive feedback to surgical trainees in an attempt to improve their risk assessment of surgery (i.e., judgment of postoperative mortality risk). They found that the use of cognitive feedback improved the accuracy of trainees' estimates of postoperative mortality risk. However, very little research has been conducted to examine the decay of these critical cognitive skills, and similar research has not been conducted within the LS domain.

Section 1 of this article reviews skill acquisition and retention research related to general skills, military skills, and LS skills. Section 2 introduces models that identify critical skills related to laparoscopy, and Section 3 proposes objective metrics for measuring these critical skills. Section 4 addresses the development of skill decay curves, and Section 5 proposes a model that examines refresher training of laparoscopic skills. The article concludes with a summary and conclusions that include future areas of research. This article includes two perspectives and approaches to addressing the issues surrounding LS skill assessment, training, and sustainment: (1) the development and use of a Laparoscopic Surgery Training System (LASTS) prototype for learning, refreshing, and assessing LS skills and (2) the design, development, and validation of a portable, open architecture Surgical Skills Training and Assessment Instrument (SUSTAIN) to support acquisition and retention of fundamental psychomotor, perceptual, and cognitive laparoscopic skills. Both systems seek to leverage the development and validation of an empirical model of skill acquisition and decay.

*Office of Naval Research, 875 North Randolph Street, Arlington, VA 22203.
†AnthroTronix, Inc., 8737 Colesville Road, L203, Silver Spring, MD 20910.
‡Charles River Analytics, Inc., 625 Mount Auburn Street, Cambridge, MA 02138.
§Cambridge Health Alliance, 1493 Cambridge Street, Cambridge, MA 02139.
∥Biomedical, Industrial and Human Factors Engineering, Wright State University, 3640 Colonel Glenn Hwy, Dayton, OH 45435.
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of any of the funding agencies.
doi: 10.7205/MILMED-D-13-00216


SECTION 1: BACKGROUND

Skill Acquisition and Decay
To analyze skill decay, we must first examine the process by which knowledge and skills are acquired. The rate of skill acquisition depends on many factors; however, there is evidence that the relationship between the time to perform a task and the number of practice attempts follows a logarithmic or exponential function rather than a linear one.5,6 There are typically three stages of skill development: a cognitive stage, an associative stage, and an autonomous stage; it is during the autonomous stage that expertise is achieved.7 Research has shown that the major factors influencing skill decay and retention include the length of the retention (nonuse) interval, degree of overlearning, task characteristics (e.g., closed loop versus open loop, physical versus cognitive), methods of testing for initial learning and retention, conditions of retrieval (recognition versus recall), instructional strategies or training methods, and individual differences among trainees.2
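To make the shape of these practice functions concrete, the following minimal sketch (illustrative only; the trial-completion times are invented) fits both a power-law and an exponential law of practice to hypothetical data using SciPy and compares their residual error. Comparing the residual error, or an information criterion, of the two fits is one simple way to ask which functional form better describes a given data set.

import numpy as np
from scipy.optimize import curve_fit

# Hypothetical task-completion times (seconds) over successive practice trials.
trials = np.arange(1, 21)
times = np.array([190, 150, 128, 115, 104, 98, 93, 90, 86, 84,
                  82, 81, 79, 78, 77, 76, 76, 75, 75, 74], dtype=float)

def power_law(n, a, b, c):
    # Time on trial n under a power law of practice: asymptote a plus b * n^(-c).
    return a + b * n ** (-c)

def exponential_law(n, a, b, c):
    # Time on trial n under an exponential law of practice.
    return a + b * np.exp(-c * n)

for label, model in [("power", power_law), ("exponential", exponential_law)]:
    params, _ = curve_fit(model, trials, times, p0=(70.0, 120.0, 0.5), maxfev=10000)
    residuals = times - model(trials, *params)
    print(label, params.round(2), "SSE =", round(float(np.sum(residuals ** 2)), 1))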

Skill Acquisition and Decay Within Military Tasks
The rate of skill decay within military tasks has been researched extensively in the past. For example, Wisher et al2 investigated the decay of skills and knowledge with 20,000 reservists, and found that gross motor skills decayed after approximately 10 months, whereas cognitive skills, such as recall of procedures, decayed within approximately 6 months. Thus, while both motor and cognitive skills are subject to decay with nonuse, the respective rates of decay differ. The Naval Education and Training Command developed a categorization of Navy tasks on a scale of proneness to decay; tasks most prone to decay included recalling procedures and voice communications tasks, whereas those least prone to decay included gross motor skills and attitude learning. Wisher et al2 also conducted an extensive review of the general skill acquisition and retention/decay literature. Based on this review, they categorized military tasks into three components: (1) knowledge, (2) decision, and (3) execution. The knowledge category is based on the recall of domain-specific information; the decision category depends on cognitive processing of the domain-specific information; and the execution category refers to both the perceptual and motor requirements of a task. Wisher et al2 also identified specific task factors that affect skill acquisition and decay, such as task complexity and task demands. Other factors identified as affecting decay included task time pressure, whether or not job aids were used (job aids decreased decay), and the quality of the job aids used (higher quality job aids decreased decay). Although past research efforts have provided a greater understanding of the mechanisms underlying skill acquisition and decay, including the relative rates of decay of various types of skills, there are no detailed models and skill decay curves of military medical tasks. A need exists for validated models and skill decay curves within the context of specialized military medical skills, such as LS skills, because these specialized procedures are not usually performed during deployments, and are therefore susceptible to decay. Validated LS skill decay models would support the development of guidelines for accurately timed refresher training to prevent LS skills decay.

Skill Acquisition and Decay Within Laparoscopic Surgical Tasks
Over the past decade, attempts have been made to establish standards for laparoscopic surgical skills training and evaluation. The McGill Inanimate System for Training and Evaluation of Laparoscopic Skills (MISTELS) has been shown to be reliable and valid as an educational tool.8 This method of assessment has been incorporated into the manual skills training practicum portion of the Society of American Gastrointestinal and Endoscopic Surgeons' (SAGES) Fundamentals of Laparoscopic Surgery (FLS) training program, and includes a portable video trainer box for rehearsal of five basic manual skills tasks: peg transfer, pattern cutting, endo-loop placement, extracorporeal suturing, and intracorporeal suturing (see Fig. 1). FLS trainer scores have been shown to be predictive of intraoperative laparoscopic performance as measured by the Global Operative Assessment of Laparoscopic Skills (GOALS) manual assessment framework,9 making the FLS training program the current "gold standard" in laparoscopic training, and resulting in its rapid adoption as a primary component of many general surgery residency programs.10,11 FLS manual skills training results in laparoscopic manual skills that have been shown to be durable for up to 11 months,12 and skill retention has been demonstrated for three tasks similar to FLS tasks for up to one year.13 However, these studies relied only on time to complete the specified manual skills tasks, and retention of the cognitive components of LS training was not considered.

FIGURE 1. FLS box trainer.



Also, although maintenance of laparoscopic manual skills through rehearsal and retraining has been shown to prevent decay,12 no standards currently exist for retraining, and few deployable systems exist that can be used where they are most needed—in far forward military medical facilities—to provide critical skills refresher training during long deployments.

Stefanidis et al14 provided preliminary evidence that a visual–spatial secondary task assessing spare attentional capacity may help distinguish among individuals of variable laparoscopic expertise when standard FLS performance measures fail to do so, and that automaticity metrics may improve current simulator training and assessment methods. Stefanidis et al15 tested this theory further, demonstrating that after approximately 10 hours of training (an average of 84 trials) on the FLS intracorporeal suturing task with concurrent performance of a visual–spatial secondary task, novices demonstrated improvements in both suturing and secondary-task performance compared with baseline scores; however, none of the subjects achieved expert-level secondary-task proficiency. Recently, Stefanidis et al11 demonstrated that training to automaticity (overlearning) on the FLS intracorporeal suturing task while concurrently performing a visual–spatial secondary task resulted in improved operating room performance on a porcine model as compared to training to proficiency without the secondary task. This level of training required, on average, 163 trials, and nearly half of the participants assigned to the automaticity training group were unable to achieve expert performance levels on the secondary task. Therefore, while superior training and skill transfer were demonstrated, the training costs are high in terms of time and resources. Also, the secondary task used was unrelated to surgical tasks. It is possible that by incorporating a secondary task that addresses intraoperative skills, including cognitive skills, training would be enhanced further, providing justification for the extended time associated with training to automaticity. This article not only explores the impact of automaticity on the acquisition and durability of an LS skill but also raises a number of questions, such as: (1) How much training is actually necessary to perform LS competently? (2) Do fundamental LS skills that become automated generalize to other minimally invasive surgery skills and specific procedures? and (3) What is the impact of automaticity on the training of cognitive LS skills?

Recent research has sought to develop a deeper understanding of the cognitive factors involved in training. For example, Park et al16 have attempted to understand the role of declarative memory processes during psychomotor skills learning using a cognitive simulation based on the Adaptive Control of Thought—Rational model. However, they did not directly assess the role of decision making involved in surgical procedures. Palter,17 in a review of training curricula for Minimally Invasive Surgery (MIS), suggests that MIS training should include cognitive teaching to compensate for knowledge gaps. Finally, given that LS tasks require surgeons to rely heavily on 2-dimensional visual cues and haptic cues to perform complex tasks within a 3-dimensional space, visual–spatial skills are likely to play an integral role in the performance of these tasks.

Visual–spatial skills involve the perception and processing of spatial relationships, such as mental transformation (e.g., mental rotation) and relative distance perception. Hassan et al18 found that among novices, visual–spatial perception is associated with manual skills performed on a laparoscopic skills virtual reality (VR) simulator; novice participants with a high degree of spatial perception performed laparoscopic VR tasks faster than those who had a low degree of spatial perception and also scored better for economy of motion, tissue damage, and total error. Ritter et al19 demonstrated that spatial perceptual abilities correlated well with duration of the learning curve (i.e., the number of trials required to meet the specified proficiency criterion) on a validated VR flexible endoscopy simulator. Visual–spatial ability assessments included determination of the 3-dimensional orientation of a 2-dimensional grayscale cube via the pictorial surface orientation (PicSOr) test and the Cube Comparison test, which involves mental rotation of an object about its center. The PicSOr test was also shown in three studies by Gallagher et al20 to consistently predict performance on the FLS circle cutting task, as well as significantly predicting laparoscopic surgeons' performance. Keehner et al21 also examined changes in performance as novices learned to use an angled laparoscope within a virtual environment; initial performance showed considerable variability among novices, with performance related to both general, nonverbal reasoning ability (assessed via the abstract reasoning task of the Differential Aptitude Test battery) and spatial abilities (assessed using the Mental Rotation Test and Visualization of Views Test). As learning progressed, the correlation of performance with general reasoning ability diminished after the first few sessions, whereas the significant correlation with spatial ability persisted even after the group variance had diminished. This finding provided further support for a previous study by Keehner et al,22 which found a significant correlation between spatial ability and intraoperative videoscopic skills performed on animals for a novice group; however, no significant correlation was found within a group of experienced surgeons. Thus, the importance of spatial ability in the performance of laparoscopic skills seems to diminish with experience.

SECTION 2: CRITICAL SKILLS
The first step in creating a system to train and refresh LS skills is to identify and model the critical underlying skills. Currently, no standard method exists for identifying critical skills related to laparoscopy. Current understanding is also limited on how best to refresh decayed skills, and no standard refresher training currently exists. Previous task analyses of MIS procedures have focused on psychomotor skills.23 However, the psychomotor ability to operate the laparoscopic tools represents only a portion of the skills needed to successfully perform MIS. Visual–spatial skills are needed to orient within the abdomen and recognize structures such as tissue, organs, ducts, and blood vessels.


Cognitive skill is needed to maintain proper situational awareness, understand the ramifications of observations, make the correct decisions, and act in the best interest of the patient and the operation. In addition, different surgeons perform procedures with different techniques, based on varying degrees of expertise and experience, as well as training background. For example, establishing initial laparoscope and instrument access during LS cholecystectomy can be performed using the Hasson technique or using the Veress needle technique, and expert opinions differ on which is optimal. Therefore, the model of skills must account for both normative errors (actions that are not correct in any of these techniques) and quasinormative errors (actions that are correct for some techniques but not others).24 The following details two complementary approaches to identifying and modeling the critical skills underlying LS proficiency, with the goal of developing improved training, assessment, and maintenance of these skills.

LASTS Critical Skills
LASTS uses models of surgical skill acquisition and decay, objective methods of surgical skill assessment, and individual skill and training models to maximize training effectiveness and minimize skill decay by creating training curricula customized to each surgeon's knowledge and skills. The LASTS skill model is designed to be detailed enough to represent the skills needed for surgical procedures moment by moment, and powerful enough to support an assessment of the skills that will differentiate experts from novices. The model accounts for variations in the skills required by different procedures by expanding on previous hierarchical models of LS skills23 and representing the skill hierarchy of each procedure. Surgical procedures are decomposed into a tree of steps, tasks, subtasks, and, where appropriate, motions. The tree describes a choreographed sequence of motions, views, information elements, and decisions, as well as deviations for nonstandard anatomy, emergencies, or repairs for adverse events. The skill model includes psychomotor, visual–spatial, and cognitive skills, and uses annotations to represent the goals, dangers, techniques, expectations, information requirements, and situational awareness needed by the step, task, or subtask.23 In addition to the base actions of the procedure, there may be a need to perform optional and repair actions, which are not necessary in every case. For example, a small amount of bleeding often occurs when detaching the gallbladder from the liver in LS cholecystectomy, and this may require cauterization or other repair at any point during the removal step. These action specifications also include annotations that indicate when the action should be performed. Figure 2 shows a portion of the skill tree for the "Remove Gallbladder" step of a cholecystectomy procedure. Our skill tree graphics include the set of steps for the procedure (each page is one step, so this figure shows a single step), as well as the tasks (the black boxes) and subtasks (the white boxes) of each step. Tasks and subtasks are linked via decomposition links (solid lines) and ordering links (dotted lines with arrows). Not depicted in these graphics are the possible deviations and repair steps. Because of the diagram's space constraints, and because these visualizations are meant as an overview of the procedure only, we do not show the annotations in this graphic. However, the model could be presented with an overlay of the cognitive complexity, motor complexity, or danger ratings to show how specific conditions or complications can affect the skills needed by the procedure.
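To make this hierarchical decomposition concrete, the following minimal sketch (in Python, not drawn from the LASTS implementation) encodes a fragment of such a skill tree as a plain data structure; the node names, skill tags, annotation fields, and ordering are hypothetical placeholders chosen only for illustration.

from dataclasses import dataclass, field
from typing import List

@dataclass
class SkillNode:
    """One step, task, or subtask in a procedure skill tree."""
    name: str
    level: str                                          # "step", "task", or "subtask"
    skills: List[str] = field(default_factory=list)     # psychomotor, visual-spatial, cognitive
    annotations: dict = field(default_factory=dict)     # goals, dangers, expectations, etc.
    children: List["SkillNode"] = field(default_factory=list)  # decomposition links
    ordered_after: List[str] = field(default_factory=list)     # ordering links, by node name

# Hypothetical fragment of the "Remove Gallbladder" step.
remove_gallbladder = SkillNode(
    name="Remove gallbladder", level="step",
    children=[
        SkillNode(
            name="Detach gallbladder from liver bed", level="task",
            skills=["psychomotor", "visual-spatial"],
            annotations={"danger": "bleeding from liver bed", "repair": "cauterize"},
            children=[
                SkillNode("Apply counter-traction", "subtask", skills=["psychomotor"]),
                SkillNode("Dissect along plane", "subtask",
                          skills=["psychomotor", "cognitive"],
                          ordered_after=["Apply counter-traction"]),
            ],
        ),
    ],
)

def walk(node: SkillNode, depth: int = 0) -> None:
    # Depth-first traversal, printing each node with its skill tags.
    print("  " * depth + f"{node.level}: {node.name} {node.skills}")
    for child in node.children:
        walk(child, depth + 1)

walk(remove_gallbladder)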

SUSTAIN Critical Skills
In our ongoing research and development effort, we are supplementing the LASTS cognitive task analysis with empirical skill acquisition and retention data to inform the LASTS skill models, as well as developing and validating the SUSTAIN, which includes novel, objective assessment metrics and training modules within a modular design to support training within a variety of environments, including deployed settings. We focused initially on identifying assessment metrics that could assess LS skill acquisition, proficiency, and decay. Our initial investigations have focused on assessment of the current tasks and metrics used to train and assess LS psychomotor skills. Later investigations will leverage the outcomes of the LASTS cognitive skills identification. The manual skills testing component of the FLS training curriculum is intended to train and measure psychomotor skill performance during basic laparoscopic surgical maneuvers; however, some visual–spatial skills are involved as well. Figure 3 provides a task description and decomposition of the constituent psychomotor and visual–spatial skills involved in each of the FLS manual skills tasks. Additionally, the most challenging aspect of each task and common strategies used were identified based on expert interviews and are included in Figure 3. Currently, the cognitive skills associated with laparoscopic surgery are trained separately within the FLS didactic program, and are assessed via a written test. Based on the skill decay literature, it is cognitive skills that decay most rapidly and therefore are most susceptible to decay during periods of nonuse, whereas psychomotor skills decay at slower rates. Therefore, it is not surprising that the FLS manual skills retention studies have shown little decay over long periods of retention. However, the current metrics for assessing the FLS psychomotor skills are limited primarily to time to complete each task, and secondarily to overt errors. Some research within the domain of robot-assisted laparoscopic surgery has begun to examine novel and objective metrics for assessing similar skills.25,26 We suggest that more sensitive metrics may identify early indications of skill decay, and provide a means for assessing and refreshing training during and following deployments.


FIGURE 2. Illustration of a portion of the LASTS skill tree for a single variation of a laparoscopic cholecystectomy procedure.

We propose that additional research is needed to examine the retention of the relevant perceptual and cognitive skills, and also to explore potential novel objective metrics related to psychomotor skill acquisition and decay.

SECTION 3: OBJECTIVE METRICS
To develop, validate, and effectively use models of surgical skill for training and refreshing, objective metrics are needed that can determine a surgeon's level of expertise with respect to the identified critical surgical skills. The surgical skill assessment must reliably differentiate expert surgeons from novices. An assessment that cannot make this distinction is not valid as a tool for understanding, training, and certifying surgical skills. The skill assessment metrics must provide measures of the trainee's skill level on the spectrum from novice to expert to enable the selection of training materials and practice tasks for an individual profile.

Ideally, these materials and tasks will meet the surgical trainee at his or her current skill level and provide an opportunity to improve. Metrics must also effectively measure the underlying psychomotor, visual–spatial, and cognitive skills associated with LS proficiency, both independently and in conjunction with one another. Finally, the established metrics must be objective and empirically validated.
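One simple, hedged way to check whether a candidate metric meets the expert-versus-novice discrimination requirement described above is to estimate the area under the ROC curve from expert and novice score samples. The sketch below uses the rank-based relation between the Mann-Whitney U statistic and the ROC area; the score values are invented for illustration.

import numpy as np
from scipy.stats import mannwhitneyu

# Hypothetical metric values (e.g., economy-of-motion scores); higher is better.
novice_scores = np.array([0.41, 0.48, 0.52, 0.55, 0.47, 0.50, 0.44, 0.68])
expert_scores = np.array([0.71, 0.78, 0.66, 0.80, 0.74, 0.69])

# The Mann-Whitney U statistic relates directly to the ROC area:
# AUC = U / (n_expert * n_novice).
u_stat, p_value = mannwhitneyu(expert_scores, novice_scores, alternative="greater")
auc = u_stat / (len(expert_scores) * len(novice_scores))

print(f"AUC = {auc:.2f}, p = {p_value:.4f}")
# AUC near 0.5 means the metric cannot distinguish the groups; near 1.0 means strong discrimination.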

LASTS Objective Metrics
LASTS's design builds on the FLS skill measurement approach and leverages measures recorded automatically in simulation, such as speed, accuracy, or smoothness of motion, based on instrument tracking and task outcome.27,28
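To illustrate the kind of automatically recorded motion measures referred to above, the following sketch (an illustration under stated assumptions, not the LASTS scoring code) derives path length, mean speed, and a jerk-based smoothness summary from a stream of instrument-tip positions; the sampling rate and the trajectory itself are invented.

import numpy as np

def motion_metrics(positions: np.ndarray, dt: float) -> dict:
    """Summaries of an instrument-tip trajectory sampled at a fixed interval dt (seconds).

    positions: array of shape (n_samples, 3) with x, y, z in millimetres.
    """
    velocity = np.gradient(positions, dt, axis=0)
    acceleration = np.gradient(velocity, dt, axis=0)
    jerk = np.gradient(acceleration, dt, axis=0)

    path_length = float(np.sum(np.linalg.norm(np.diff(positions, axis=0), axis=1)))
    duration = dt * (len(positions) - 1)
    # Dimensionless squared-jerk measure; lower values indicate smoother motion.
    jerk_cost = float(np.sum(np.linalg.norm(jerk, axis=1) ** 2) * dt)
    normalized_jerk = float(np.sqrt(0.5 * jerk_cost * duration ** 5 / path_length ** 2))

    return {"path_length_mm": path_length,
            "duration_s": duration,
            "mean_speed_mm_s": path_length / duration,
            "normalized_jerk": normalized_jerk}

# Hypothetical 5-second recording at 100 Hz: a noisy sweep between two points.
t = np.arange(0, 5, 0.01)
trajectory = np.column_stack([40 * np.sin(0.4 * np.pi * t),
                              20 * t,
                              5 * np.cos(0.8 * np.pi * t)])
print(motion_metrics(trajectory, dt=0.01))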


FIGURE 3. FLS manual skills task decomposition.

For example, to assess psychomotor skills, the surgeon performs a physical task, such as suturing, using a virtual trainer. To provide a comprehensive view of surgical skills, LASTS is designed to assess not only psychomotor skills but also visual–spatial and cognitive skills (yet to be determined). Visual–spatial tasks include manipulating a virtual laparoscope to obtain the proper view of an organ; finding structures, such as a gallbladder, from video of LS; and recognizing abnormal conditions, such as cirrhosis of the liver, from video of LS. Cognitive knowledge can be assessed via test questionnaires, such as the multiple-choice questions used by the FLS didactic exam.29

A number of decision tasks can be derived from the model of surgical skills for the procedure, such as where to place trocars for patients of varying body habitus (i.e., overweight, underweight, or normal weight) or whether to convert to open surgery after watching a video of a complication event. These measures of decision tasks can be formatted into multiple-choice or select-on-a-diagram forms, or the measures may occur during a training procedure within the context of a simulated scenario. In addition to assessing individual skill performance, LASTS's design also assesses performance of combined skills through tasks that incorporate two or more skills simultaneously. An example of a combined psychomotor and cognitive task is dissecting the gallbladder from the liver during a simulated LS cholecystectomy procedure.


LASTS specifies that the surgeon must understand the cognitive elements of the task—the goals, risks, techniques, expectations, and situational awareness—and use those elements to modulate the psychomotor movements of the task. For example, in some cases during a specific step, the surgeon must be careful not to unintentionally puncture the gallbladder or the liver and cause excessive bleeding. Finally, LASTS's design assesses combined psychomotor, visual–spatial, and cognitive skill by simulating multiple steps within a procedure. During the simulation, LASTS uses the skill tree to assess the performance of each step, task, and subtask independently. Specific scoring mechanisms are in development, which will take into account correlations of both individual and combined skills with task performance outcomes.

SUSTAIN Objective Metrics
Several objective metrics have been identified that may support effective assessment of LS skills acquisition, proficiency, and decay, including simulator-based metrics, motion- and vision-tracking metrics, and cognitive assessment metrics. For example, laparoscopic virtual environment training systems have incorporated a number of automated performance assessment metrics, such as task completion time, economy of motion, instrument collisions, peak instrument force, and peak strain. While further validation of such metrics is needed, including demonstration of predictive validity within live tissue and cadaveric models, these metrics may provide useful objective assessment and tracking tools if integrated into the FLS video trainer and/or curriculum. Laparoscopic simulation trainers can detect various motions as part of their mechanical input and visual feedback systems; however, the current FLS trainer boxes are relatively low-tech and do not include instrumentation tracking capabilities. Although the low-tech nature of the current FLS training system supports low-cost training, the goal of the SUSTAIN research and development effort is to investigate potential objective metrics that might provide insight into FLS skills acquisition, proficiency, and decay, including more sensitive metrics of psychomotor skills to enable detection of skill decay despite the slower decay rates of these skills as compared to cognitive skills. This effort seeks to develop metrics based on instrument tracking, as well as metrics that are novel within this domain, such as motion tracking via wearable instruments (i.e., gloves), and additional metrics to support assessment of perceptual and cognitive skills, both independently and in conjunction with the FLS psychomotor skills tasks. A wide variety of motion-tracking technologies, such as instrumented gloves, currently exist for applications ranging from graphical animation to recognition of sign language. Data that may be extracted from these technologies include hand tremors, tool grip, and tool state (e.g., activating tool, rotating tool), as well as movement certainty.

Instrumentation of the LS surgical tools may provide valuable data, such as time metrics and task-specific metrics (e.g., object grasping, transfer, and placing times), as well as efficiency metrics such as overall efficiency (left hand, right hand, average), pickup efficiency (left hand, right hand, reverse, average), passing efficiency (left to right, right to left, average), and placing efficiency (left hand, right hand, average). Vision tracking technologies are used in a variety of domains to provide insight into human–computer interaction, perceptual processing, and cognitive processes. Vision tracking technologies have become increasingly affordable and unobtrusive, and their use has increased in both research and operational applications as a result. Vision tracking enables researchers to identify visual scan and fixation patterns during perceptual–cognitive and perceptual–motor tasks, providing insights into learning. In addition to enabling researchers to study learning patterns to develop more robust models of skill acquisition and decay, vision tracking can also enable surgical instructors and training proctors to see where trainees are focusing their visual attention during task completion in real time. A variety of cognitive assessment technologies and strategies exist that incorporate indicators of cognitive workload and learning, such as concurrent performance of a secondary task.11,14,15 A subset of the metrics identified, such as hand and instrument tracking, is currently being empirically tested and will be considered for inclusion in the SUSTAIN prototype system. In addition, we will consider cognitive assessment metrics, such as whether the ability to perform a concurrent secondary task improves as performance on the primary task of interest becomes more automated. While Stefanidis et al11,14,15 demonstrated that secondary-task performance provides a means for differentiating between individuals of variable LS expertise when standard performance measures fail to do so, and that training to automaticity using a secondary task correlates with improved intraoperative performance, the secondary task used in these studies was not directly related to surgical skills. It is possible that by incorporating a secondary task that addresses intraoperative skills, including cognitive skills, training would be enhanced further, providing justification for the extended time associated with training to automaticity.
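As a hedged illustration of a secondary-task (automaticity) measure of the kind discussed above, the sketch below computes a simple dual-task cost: the relative drop in secondary-task accuracy when it is performed concurrently with the primary task. The function name and the accuracy values are assumptions chosen only for illustration.

def dual_task_cost(single_task_score: float, dual_task_score: float) -> float:
    """Relative decrement in secondary-task performance under dual-task conditions.

    A cost near 0 suggests the primary task has become automated (spare attentional
    capacity remains); a large cost suggests the primary task still demands attention.
    """
    if single_task_score <= 0:
        raise ValueError("single-task score must be positive")
    return (single_task_score - dual_task_score) / single_task_score

# Hypothetical secondary-task accuracies (proportion correct) for two trainees.
print(dual_task_cost(0.92, 0.55))  # ~0.40: large cost, primary task not yet automatic
print(dual_task_cost(0.90, 0.84))  # ~0.07: small cost, consistent with automaticity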

SECTION 4: SKILL DECAY CURVES
Reviews of the skill decay literature have identified a core set of factors that influence the decay of trained skills. These factors include (a) length of the nonuse interval, (b) degree of overlearning (training beyond mastery), (c) task characteristics (e.g., psychomotor versus cognitive, number of steps involved), (d) method of assessing original acquisition and retention (i.e., type of test), (e) condition of retrieval (e.g., recall versus recognition), (f) instructional strategies and training methods, (g) individual differences (e.g., spatial ability), and (h) motivation.1,2 Modeling these factors within skill decay curves helps predict skill degradation over time and supports introducing refresher training at appropriate intervals.


Skill acquisition curves have been developed, for example, based on the relationship between the time to perform a task and the number of practice attempts.5,6 Ebbinghaus30 proposed the first formal decay curve, which he called the "forgetting curve." This model, shown in Figure 4, demonstrates that the rate of forgetting depends on several factors, but that typically single-trial learning results in exponential skill decay, with each additional learning trial resulting in increased retention. The solid line indicates the rate of forgetting for a single learning trial; each dashed line to the right represents the rate of forgetting for additional learning trials on subsequent days.

FIGURE 4. Representation of the forgetting curve proposed by Ebbinghaus (1885).

LASTS Skill Decay Model
The skill decay models used in LASTS predict how surgical skills change over time and practice sessions, in relation to specific learning and decay factors. These models have three related parts: (1) how skills are initially acquired through training, including factors such as the order, duration, and repetition of training, as well as individual trainee differences, such as skill level and experience; (2) how skills decay over time through nonuse; and (3) how skills are reacquired through refresher courses. It is important to note that reacquisition is different from the initial acquisition because of previous exposure to the skills.12 The model includes three skill curves for each skill assessed: (1) learning, (2) decay, and (3) relearning. The learning and relearning curves measure skill level against training attempts (i.e., the number of times the training task is performed). LASTS includes these factors because they have been shown by Arthur et al to modulate performance.1 A learning curve shows how a surgeon's skill improves with each attempt.12,31 The decay curves measure skill level over time.11 Learning curves for LS are typically defined in terms of the time it takes to complete a task versus the number of sessions trained. LASTS extends the task completion time measure to all of the factors that comprise the assessment methodology to create a unique set of learning curves for each procedure, step, and task. A key contribution of LASTS is the ability to combine these curves across skills. We expect that initial learning rates (i.e., time to reach proficiency) will decrease and decay rates will increase when the surgeons are assessed on multiple skills simultaneously. Based on our initial understanding of the literature on multiple learning curves,28 our hypothesis is that the curves will all have a similar shape because proficiency at a given skill will increase with each trial until a plateau is reached. The exact values of the curves are dependent on the skill as well as the student's experience level and past performance. On average, decay curves show decreasing proficiency over time, as the skill decays.11
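A minimal computational reading of such curves, assuming a simple exponential forgetting form in the spirit of Ebbinghaus, is sketched below; the retention intervals, the stability parameter, and the assumption that each additional learning or refresher session raises stability (flattening the curve) are illustrative choices, not the LASTS model itself.

import numpy as np

def retention(t_days: np.ndarray, stability: float) -> np.ndarray:
    # Exponential forgetting: proportion of trained skill retained after t days of nonuse.
    return np.exp(-t_days / stability)

t = np.array([0, 7, 30, 90, 180, 365], dtype=float)

# Each additional learning (or refresher) session is assumed to raise the stability
# parameter, flattening the decay curve, as in the family of curves in Figure 4.
for sessions, stability in [(1, 60.0), (2, 120.0), (4, 240.0)]:
    print(f"{sessions} session(s):", np.round(retention(t, stability), 2))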

SUSTAIN Skill Decay Model
The skill decay model proposed for SUSTAIN is both theoretical and empirical, and will provide input to the LASTS computational decay models. To begin developing an empirical model of LS skills that includes novel metrics, we conducted a pilot study32 in which both traditional (i.e., time and error) and novel (i.e., instrumented-glove hand motion tracking) FLS assessment metrics were collected for 11 medical students with no prior exposure to the FLS training curriculum and one expert laparoscopic surgeon. FLS manual skills performance data were collected for the novice participants before training (pretest), following completion of training to proficiency (post-test), and following an 8- to 10-week retention period (follow-up). Novice spatial abilities were also measured via three validated computer tests that assessed egocentric spatial ability, allocentric spatial ability, and mental rotation. Despite the small sample size, based on traditional FLS scores, novice performance was shown to be significantly higher at post-test than at pretest; FLS scores decreased at follow-up, but did not differ significantly from post-test scores. A significant difference was also found between the novice pretest and the expert's scores for the five tasks, but not between the novice post-test and the expert's scores, or between the novice follow-up and the expert's scores; however, the novice post-test and follow-up scores would likely be significantly differentiated from the expert's scores with a larger sample. These pilot data are indicative of skill acquisition and decay trends, providing the basis for an initial empirical skill model. We also established initial technical feasibility for the use of instrumented-glove motion tracking to assess the smoothness of surgeon hand motions, and for the incorporation of these novel metrics within a skill decay model. Although limited by the small sample size, this pilot study provided a basis for an initial model of laparoscopic surgical skill acquisition and decay that incorporates a variety of metrics, which will be further refined and expanded to include cognitive, perceptual, and psychomotor aspects, and will be validated under ongoing empirical research efforts.
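For readers who want to run this kind of pretest/post-test/follow-up comparison on their own data, a minimal sketch using paired t tests and a paired effect size is shown below; the score vectors are fabricated placeholders, not the pilot-study data.

import numpy as np
from scipy.stats import ttest_rel

# Hypothetical FLS-style composite scores for 11 novices at three time points.
pretest  = np.array([112, 98, 121, 105, 93, 110, 101, 96, 118, 107, 99], dtype=float)
posttest = np.array([271, 255, 289, 262, 248, 270, 259, 251, 281, 266, 257], dtype=float)
followup = np.array([268, 238, 295, 250, 255, 258, 266, 240, 272, 263, 248], dtype=float)

for label, a, b in [("post vs. pre", posttest, pretest),
                    ("follow-up vs. post", followup, posttest)]:
    t_stat, p_value = ttest_rel(a, b)
    d_z = float(np.mean(a - b) / np.std(a - b, ddof=1))  # paired (within-subject) effect size
    print(f"{label}: t = {t_stat:.2f}, p = {p_value:.4f}, d_z = {d_z:.2f}")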


SECTION 5: DESIGN OF REFRESHER TRAINING
Stefanidis and Heniford33 suggest that a successful laparoscopic skills curriculum should "encompass goal-oriented training, sensitive and objective performance metrics, appropriate methods of instruction and feedback, deliberate, distributed, and variable practice, an amount of overtraining, maintenance training, and a cognitive component" (p. 77). The LASTS approach includes a curriculum generator, which will use the LASTS and SUSTAIN assessment methods, learning curves, and decay curves to create a curriculum of refresher training for a specific surgeon based on that surgeon's initial skill level across various tasks and the length of the retention interval for various skills and tasks. At this time, we have designed the framework of the curriculum generator, which we describe here, but we have not yet implemented its components. The curriculum generator will leverage the skill decay curves developed, which predict the rate at which various skills will decay, and will be designed to use a surgeon's training history, history of procedures performed, and skills assessments to determine which skills are most likely to require refreshing for a specific procedure. The curriculum generator then creates a curriculum to refresh these skills. Because the curriculum focuses on refreshing the skills that have decayed for the surgeon (and not all skills), it decreases retraining time and addresses potential problems proactively. This curriculum may then be used as one part of a surgical training, certification, and recertification program. By addressing the factors of the skill decay curves that most influence decay, the curriculum generator may be able to train new surgeons and retrain experts to build robust skills that decay slowly, if at all.

For each student, the curriculum generator creates a training regimen and keeps a profile of the trainee to assess progress. The trainee can be tracked through the training program to recommend when he or she is ready for operating room experience, and to test this recommendation at computed intervals in collaboration with attending surgeons' review. The system first determines the trainee's skill levels using our assessment methods. It then selects training elements, such as didactics and tasks, from a library and generates new simulation training scenarios as needed. For a deploying surgeon, the system takes a last predeployment reading of skill level, and then receives the record of procedures performed during deployment. Additional information, such as a surgeon's training and individual competency as rated by supervisors, is collected and input into the curriculum generator. With this information, the curriculum generator develops a refresher course. For example, a surgeon who performed few procedures, had little access to refresher material, and had long periods of time without operations will require more extensive refresher training than a surgeon who performed many operations, refreshing their skills with related procedures. Psychomotor and visual–spatial skills are modeled by continuous values along multiple skill dimensions, and training elements are selected to challenge the surgeon at his or her current estimated skill levels using item response theory.
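As a hedged sketch of what selecting training elements at a surgeon's current estimated skill level using item response theory could look like computationally, the example below picks the task whose two-parameter-logistic (2PL) item information is highest at the trainee's ability estimate; the item parameters, task names, and ability value are invented for illustration.

import numpy as np

def two_pl_probability(theta: float, a: float, b: float) -> float:
    # 2PL IRT model: probability of success on an item with discrimination a and difficulty b.
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def item_information(theta: float, a: float, b: float) -> float:
    # Fisher information of a 2PL item at ability theta; peaks when b is near theta.
    p = two_pl_probability(theta, a, b)
    return a ** 2 * p * (1.0 - p)

# Hypothetical training tasks with (discrimination, difficulty) parameters on a logit scale.
tasks = {
    "peg transfer":            (1.2, -1.0),
    "pattern cutting":         (1.0, -0.2),
    "extracorporeal suturing": (1.5,  0.6),
    "intracorporeal suturing": (1.4,  1.4),
}

theta_hat = 0.5  # current ability estimate for this trainee (assumed)
best_task = max(tasks, key=lambda k: item_information(theta_hat, *tasks[k]))
for name, (a, b) in tasks.items():
    print(f"{name:24s} info = {item_information(theta_hat, a, b):.3f}")
print("select next:", best_task)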

Cognitive skills are modeled as knowledge bases of correct and incorrect rules, and training elements are selected that extend these knowledge bases using intelligent tutoring system techniques. Finally, the curriculum generator generates surgical scenarios for practice on surgical simulators. It uses narrative generation techniques34–36 to present realistic scenarios that also accomplish training goals. LASTS determines which skills need refreshing by using the decay curves and continuous skill assessments. To augment the individual training elements, LASTS uses a scenario-based training methodology, in which surgical situations are presented to the trainees instead of abstract tasks. This approach is designed to enable a higher level of skill acquisition and decrease decay. Techniques in search-based interactive narrative and narrative planning models are particularly suited for this type of generation, since they can be used to generate a sequence of realistic scenarios (the narrative content) while optimizing the student's ability to learn the content. Narrative planning approaches provide this functionality.32 The role of the narrative planning model is to create realistic, scenario-based training sessions that cover the skills in the necessary quantity and order. The LASTS computational models and curriculum generator will be incorporated into the SUSTAIN prototype, which includes a modular design to support training within a variety of environments, including deployed settings.

SUMMARY AND CONCLUSIONS
Although the research literature on skill acquisition and decay is comprehensive and extensive, this is not true of the literature that pertains specifically to LS skill decay. Based on the current research literature, it is evident that a wide variety of factors related to task and training characteristics, retention intervals, and transfer task environments impact the rate of decay of various skill sets, and that subcomponents of skill sets decay at different rates. However, relatively little is known about the nature of LS skill retention. Empirical research is needed to develop and validate predictive skill decay models related to these highly specialized skills. This research will help provide guidelines for efficient and effective refresher training that can be implemented within standardized medical training curricula, particularly for military surgeons, to prevent laparoscopic skill deterioration during long deployment cycles. The research questions yet to be addressed in this area are many: Are there different decay rates for different types of learning? Which instructional methods or strategies are the most effective for skill and knowledge acquisition to prevent skill decay? What are the most effective strategies for refresher training (reacquisition)? How do individual factors, such as spatial ability and handedness, impact skill decay? Studies assessing the decay of LS skills often do not assess cognitive skills or knowledge, which is predictive of decay.


Also, the studies that address skill acquisition and decay suffer from a lack of objective, reliable, and valid measures of skill level and decay, as well as a lack of data on the reacquisition of LS skills. When acquisition is studied, assessments are usually made in relation to new training techniques compared to a traditional method, and it is difficult to interpret and draw conclusions from these studies without objective, reliable, and valid metrics. Further, many of the studies assessing acquisition and retention are based on psychomotor tasks, and do not address cognitive or perceptual skills. To remedy this situation, we suggest the following approach: (1) develop and identify assessment metrics that are objective, reliable, and valid to assess these skills over time; (2) acquire a comprehensive understanding of the nature of laparoscopic surgical skill acquisition and decay, including identification of the requisite critical skills; (3) report skill decay curves that enable prediction of which critical skills decay and at what rate; (4) develop training strategies for simulation-based training that support rapid acquisition and long-term retention; and (5) develop retraining for sustainment of these critical and perishable skills. An ongoing research and development effort has been funded by the Office of Naval Research and the Telemedicine and Advanced Technology Research Center to further the development and empirical validation of novel objective metrics and a model of laparoscopic surgical skill acquisition, decay, and reacquisition; to design, develop, and validate a prototype simulation-based assessment and training module; and to develop curriculum recommendations for the prevention of LS skills decay. The goal of this effort is to conduct research on LS to develop generalizable methods, procedures, and training approaches that could be used to prevent decay of other critical medical skills and knowledge.

ACKNOWLEDGMENTS
This work is being conducted in collaboration with the Uniformed Services University of the Health Sciences, the National Capital Simulation Center, and Harvard Medical School. This research is funded by the Office of Naval Research and the Telemedicine and Advanced Technology Research Center under two Small Business Innovation Research contracts, AnthroTronix, Inc. (N0001411C0420) and Charles River Analytics, Inc. (N0001411C0426). The work reported herein was also partially supported by a grant from the Office of Naval Research, Award Number N00014-10-1-0978.

REFERENCES
1. Arthur W, Bennett W, Stanush PL, McNelly TL: Factors that influence skill decay and retention: a quantitative review and analysis. Hum Perform 1998; 11(1): 57–101.
2. Wisher RA, Sabol MA, Ellis J: Staying sharp: retention of military knowledge and skills (Rep. No. ARI Special Report 39). Alexandria, VA, U.S. Army Research Institute for the Behavioral and Social Sciences, 1999. Available at http://www.dtic.mil/cgi-bin/GetTRDoc?AD=ADA366825; accessed May 7, 2013.
3. Brunner WC, Korndorffer JR: Laparoscopic virtual reality training: are 30 repetitions enough? J Surg Res 2004; 122(2): 150–6.
4. Jacklin R, Sevdalis N, Darzi A, Vincent CA: Efficacy of cognitive feedback in improving operative risk estimation. Am J Surg 2009; 197(1): 76–81.


5. Newell A, Rosenbloom PS: Mechanisms of skill acquisition and the law of practice. In: Cognitive Skills and Their Acquisition, pp 1–55. Edited by Anderson JR. Hillsdale, NJ, Erlbaum, 1981.
6. Heathcote A, Brown S, Mewhort DJK: The power law repealed: the case for an exponential law of practice. Psychon Bull Rev 2000; 7(2): 185–207.
7. Fitts P, Posner M: Human Performance. Belmont, CA, Brooks/Cole, 1967.
8. Fried GM, Feldman LS, Vassiliou MC, et al: Proving the value of simulation in laparoscopic surgery. Ann Surg 2004; 240(3): 518.
9. McCluney AL, Vassiliou MC, Kaneva PA, et al: FLS simulator performance predicts intraoperative laparoscopic skill. Surg Endosc 2007; 21(11): 1991–5.
10. Soper NJ, Fried GM: The fundamentals of laparoscopic surgery: its time has come. Bull Am Coll Surg 2008; 93(9): 30.
11. Stefanidis D, Scerbo MW, Montero PN, Acker CE, Smith WD: Simulator training to automaticity leads to improved skill transfer compared with traditional proficiency-based training: a randomized controlled trial. Ann Surg 2012; 255(1): 30.
12. Stefanidis D, Korndorffer JR, Markley S, Sierra R, Scott DJ: Proficiency maintenance: impact of ongoing simulator training on laparoscopic skill retention. J Am Coll Surg 2006; 202(4): 599–603.
13. Hiemstra E, Kolkman W, van de Put MAJ, Jansen FW: Retention of basic laparoscopic skills after a structured training program. Gynecol Surg 2009; 6(3): 229–35.
14. Stefanidis D, Scerbo MW, Korndorffer JR, Scott DJ: Redefining simulator proficiency using automaticity theory. Am J Surg 2007; 193(4): 502–6.
15. Stefanidis D, Scerbo MW, Sechrist C, Mostafavi A, Heniford BT: Do novices display automaticity during simulator training? Am J Surg 2008; 195(2): 210–3.
16. Park SH, Suh IH, Chien JH, Paik JH, Ritter FE, Siu KC: Modeling surgical skill learning with cognitive simulation. In: Medicine Meets Virtual Reality 18. Edited by Westwood JD, Westwood SW, Felländer-Tsai L, et al. Amsterdam, The Netherlands, IOS Press, 2011.
17. Palter VN: Comprehensive training curricula for minimally invasive surgery. J Grad Med Educ 2011; 3(3): 293–8.
18. Hassan I, Gerdes B, Koller M, et al: Spatial perception predicts laparoscopic skills on virtual reality laparoscopy simulator. Childs Nerv Syst 2007; 23(6): 685–9.
19. Ritter EM, McClusky DA III, Gallagher AG, Enochsson L, Smith CD: Perceptual, visuospatial, and psychomotor abilities correlate with duration of training required on a virtual-reality flexible endoscopy simulator. Am J Surg 2006; 192(3): 379–84.
20. Gallagher AG, Cowie R, Crothers I, Jordan-Black JA, Satava RM: PicSOr: an objective test of perceptual skill that predicts laparoscopic technical skill in three initial studies of laparoscopic performance. Surg Endosc 2003; 17(9): 1468–71.
21. Keehner M, Lippa Y, Montello DR, Tendick F, Hegarty M: Learning a spatial skill for surgery: how the contributions of abilities change with practice. Appl Cogn Psychol 2006; 20(4): 487–503.
22. Keehner MM, Tendick F, Meng MV, et al: Spatial ability, experience, and skill in laparoscopic surgery. Am J Surg 2004; 188(1): 71–5.
23. Cao CGL, MacKenzie CL, Ibbotson JA, et al: Hierarchical decomposition of laparoscopic procedures. In: Medicine Meets Virtual Reality: The Convergence of Physical and Informational Technologies: Options for a New Era in Healthcare. Edited by Westwood JD, Hoffman HM, Robb RA, Stredney D. Amsterdam, The Netherlands, IOS Press, 1999.
24. Scott-Connor CEH: The SAGES Manual, Fundamentals of Laparoscopy, Thoracoscopy, and GI Endoscopy. New York, NY, Springer Science + Business Media, 2006.
25. Chang L, Satava RM, Pellegrini CA, Sinanan MN: Robotic surgery: identifying the learning curve through objective measurement of skill. Surg Endosc 2003; 17(11): 1744–8.
26. Verner L, Oleynikov D, Holtmann S, Haider H, Zhukov L: Measurements of the level of surgical expertise using flight path analysis from Da Vinci robotic surgical system. Stud Health Technol Inform 2003; 94: 373–8.
27. O'Connor A, Schwaitzberg SD, Cao CGL: How much feedback is necessary for learning to suture? Surg Endosc 2007; 22: 1614–9. Available at http://link.springer.com/content/pdf/10.1007%2Fs00464-007-9645-6.pdf; accessed May 7, 2013.
28. Zhou M, Tse S, Derevianko A, Jones DB, Schwaitzberg SD, Cao CGL: The effect of haptic feedback on laparoscopic suturing and knot-tying: a learning curve study. In: Proceedings of the Human Factors and Ergonomics Society Annual Meeting 2008. Santa Monica, CA, Human Factors and Ergonomics Society, 2008.
29. Peters J, Fried GM, Swanstrom LL, et al: Development and validation of a comprehensive program of education and assessment of the basic fundamentals of laparoscopic surgery. Surgery 2004; 135(1): 21–7.
30. Ebbinghaus H: Über das Gedächtnis: Untersuchungen zur experimentellen Psychologie. Leipzig, Germany, Duncker and Humboldt, 1885. [English translation: Memory: A Contribution to Experimental Psychology. New York, NY, Dover Publications, 1964.]
31. Korndorffer JR, Dunne JB, Sierra R, Stefanidis D, Touchard C, Scott D: Simulator training for laparoscopic suturing using performance goals translates to the operating room. J Am Coll Surg 2005; 201(1): 23–9.

32. Skinner A, Lathan C: Assessment of laparoscopic surgical skill acquisition and retention. In: Studies in Health Technology and Informatics: Medicine Meets Virtual Reality 19. Edited by Westwood JD, Westwood SW, Felländer-Tsai L, et al. Amsterdam, The Netherlands, IOS Press, 2012.
33. Stefanidis D, Heniford BT: The formula for a successful laparoscopic skills curriculum. Arch Surg 2009; 144(1): 77.
34. Niehaus J, Riedl MO: Scenario adaptation: an approach to customizing computer-based training games and simulations. In: Proceedings of the AIED 2009 Workshop on Intelligent Educational Games. Brighton, UK, AIED, 2009. Available at http://www.cc.gatech.edu/riedl/pubs/aied-ieg09.pdf; accessed May 7, 2013.
35. Riedl M, Young RM: An intent-driven planner for multi-agent story generation. In: Proceedings of the Third International Conference on Autonomous Agents and Multi Agent Systems, July 2004.
36. Weyhrauch P: Guiding Interactive Drama. Doctoral thesis, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, 1997.


MILITARY MEDICINE, 178, 10:87, 2013

Effects of Simulation-Based Practice on Focused Assessment With Sonography for Trauma (FAST) Window Identification, Acquisition, and Diagnosis Gregory K. W. K. Chung, PhD*; Ruth G. Gyllenhammer, MA*; Eva L. Baker, EdD†; Eric Savitsky, MD‡ ABSTRACT We compared the effects of simulator-based virtual ultrasound scanning practice with classroom-based ultrasound scanning practice on participants' knowledge of focused assessment with sonography for trauma (FAST) window quadrants and interpretation, and on participants' performance on live patient FAST examinations. Novices with little or no ultrasound training experience received simulation-based practice (n = 24) or classroom-based practice (n = 24). Participants who received simulation-based practice scored significantly higher on interpreting static images of FAST windows. On live patient examinations, where participants scanned the right upper quadrant (RUQ), left upper quadrant (LUQ), and suprapubic quadrant of a normal patient and an ascites-positive patient, the classroom-based practice condition had a shorter scan time for the LUQ, more participants attaining a high-quality window on the RUQ (normal patient only) and the suprapubic quadrant (positive patient only), and more participants with correct window interpretation on the LUQ (normal patient only). Overall, classroom-based practice appeared to promote physical acquisition skills and simulator-based practice appeared to promote window interpretation skills. Accurate window interpretation is critical to identification of blunt abdominal trauma injuries. The simulator used (SonoSimulator) appears promising as a training tool to increase probe time and to increase exposure to FAST windows reflecting various anatomy and disease states.

INTRODUCTION The use of portable ultrasound is increasing in military settings, such as combat support hospitals as a triage and an evaluation tool.1,2 Ultrasonography complements standard evaluation techniques and can improve the speed and accuracy of diagnosis of blunt abdominal trauma (e.g., bruising or laceration to the liver resulting in localized internal bleeding). Unfamiliarity with ultrasonography, the cost of training users on ultrasound-guided procedures, and the lack of training opportunities are limiting the use of this beneficial technology. One potentially cost-effective method for providing users with ultrasound-guided procedural training is the use of simulator-based training. Some potential advantages over traditional medical training include: (i) it presents no risk to trainees or patients during practice attempts; (ii) it is more cost-effective than current training methods; (iii) it provides multiple modes of sensory interaction to maximize learning; and (iv) it aids reduction in skill decay. In this study, we compared simulation-based practice of ultrasound scanning *CRESST/University of California, Los Angeles, Peter V. Ueberroth Building (PVUB), 10945 Le Conte Avenue, Suite 1355, Mailbox 957150, Los Angeles, CA 90095-7150. †CRESST/University of California, Los Angeles, 300 Charles E. Young Drive North, GSE&IS Building, 3rd Floor, Mailbox 951522, Los Angeles, CA 90095-1522. ‡UCLA Emergency Medicine Center, 10833 Le Conte Avenue, BE-144 CHS, Los Angeles, CA 90095. The views, findings, and opinions expressed in this article are those of the authors and do not necessarily reflect the positions or policies of Pe´lagique or the Office of Naval Research, nor should they be construed as an official Department of the Army position, policy, or decision unless so designated by other documentation. doi: 10.7205/MILMED-D-13-00208


In this study, we compared simulation-based practice of ultrasound scanning to classroom-based practice of ultrasound scanning on both knowledge and performance measures. We focused on one type of procedure, the focused assessment with sonography for trauma (FAST) examination, as the context for the comparison.

FAST Examination

The FAST examination is an emergency ultrasound procedure that focuses on the detection of free fluid in hemoperitoneum, hemopericardium, pneumothorax, and hemothorax.3 Unlike other trauma screening procedures such as the physical examination, diagnostic peritoneal lavage, and the CT scan, the FAST examination is noninvasive, bedside, and repeatable; requires 5 minutes to complete; and does not require stable patients.4–6 These properties make portable ultrasound scanning an invaluable evaluation tool in combat support hospitals, where there is often a sudden intake of a large number of patients who have life-threatening blunt trauma injuries. The standard windows of the FAST examination are the right upper quadrant (RUQ), left upper quadrant (LUQ), subxiphoid (or subcostal), and suprapubic. The window quadrants are examined for fluid in the gutters between the organs. Free fluid is consistent with internal bleeding resulting from blunt trauma injuries. In the RUQ (also called Morrison's pouch), evidence of fluid may be found between the liver and the kidney. Fluid may also exist between the kidney and the spleen in the LUQ view or in the pelvis in the suprapubic view.7

Cognitive Demands of FAST Window Acquisition and Interpretation

Recognition of anatomical landmarks is fundamental to ultrasound window acquisition.


Both diagnostic medical sonographers and emergency care physicians require training and proficiency in abdominal anatomy as a component of medical training and professional certification.8 Ultrasound scanning involves manipulating a probe or transducer against the patient's body, wherein the probe acquires a 2-dimensional slice through a 3-dimensional anatomical volume. Minute changes to the pitch, yaw, and roll of the probe cause changes in the window image, and facile probe manipulation is required to acquire clear anatomical landmarks. This skill is not easily taught, and it is difficult to give precise verbal instructions relating the trainer's actions to the image on the screen. Experienced sonographers use nuanced hand movements instinctively during image acquisition.9 The FAST examination and ultrasonography in general are "operator dependent."10 To make a correct diagnosis, the sonographer must first acquire an adequate window by using anatomical landmarks and probe manipulation techniques such as fanning. Complicating the task is that the detection of fluid may depend on the positioning of either the patient or the probe, with subtle changes in either required for precise window analysis.7,11

Use of Ultrasound Simulators for Training

The goals of FAST training are for the trainee to be able to acquire an adequate window and to identify both the free fluid and the type of window when given a single ultrasound scan. A typical classroom curriculum for the FAST examination includes both didactic and hands-on instruction. Didactic training presents instruction on the principles of ultrasonography, an introduction to ultrasound mechanics or knobology, and discussion of the purpose, method, and interpretation of the FAST examination. Because window acquisition is dependent on probe movement, physical probe time is considered essential to training, but the hands-on practice component is often constrained by the limitations of practicing with model patients.7,12 Ideally, a trainee would gain experience by performing the FAST examination on as many cases as possible under the guidance of someone competent in the FAST examination. However, extended training experience with model patients is rarely practical. For example, scanning the LUQ, RUQ, and suprapubic quadrants would take 5 to 8 minutes per trainee in a training context. Assuming a typical class size of 20 students, with each student given two opportunities to scan, the total time required would be between 3 and 5 hours. Thus, using a classroom setting to provide trainees with practice with varying anatomy and varying free-fluid states remains impractical. One of the greatest benefits of simulators may be their capability to provide interactive learning opportunities across a large number of cases. Previous research points to the effectiveness of simulator training in preparing trainees to both perform and interpret the FAST examination.

For example, in a 4-hour FAST training course for emergency medicine resident physicians, the UltraSim sonographic (mannequin) training model was found to be as effective in preparing trainees to detect the presence of intraperitoneal free fluid in various FAST windows as training with a live patient model (using peritoneal dialysis patients).7 In another study, the use of the SonoTrainer sonography simulation system enabled physicians trained with the system (vs. theoretical training alone) to make diagnoses of second-trimester fetal abnormalities with a detection rate of 86% and a specificity of 100%.13 Various systems have been developed with a range of capabilities (e.g., UltraSim, VirUS, EchoComJ, SONOSim3D, and SonoTrainer).13 For the FAST examination, the simulator used in the current study (SonoSimulator)14–16 focuses on providing trainees the capability to visualize free fluid in two ways: as a single static image or as a continuously changing image similar to what would appear on an actual ultrasound machine window. However, we could find no research that directly compared simulator-based practice to classroom-based practice with respect to knowledge of given FAST windows and execution of FAST examination procedures with live patients.

Research Question

The main research question was: to what extent does practice using an ultrasound simulator with virtual patients affect participants' knowledge and subsequent performance on live FAST examinations, compared with practice in a typical classroom context with a portable ultrasound machine and a live model patient? This study, when conceptualized in Kirkpatrick's framework,17 would be considered an investigation of Level One: Reaction and Level Two: Learning.

METHOD

Design

A pretest/intervention/posttest control group design was used to examine the effects of the two practice conditions on participants' knowledge of and performance on a FAST examination. Participants in the control condition received guidance and feedback from an instructor and hands-on practice with a model patient. Participants in the experimental condition practiced scanning on virtual patients using an ultrasound simulator and received feedback on scan quality and diagnostic accuracy from an ultrasound expert.

Sample

Forty-nine participants were recruited through e-mail announcements to two local medical schools and received $100 for participating. All procedures were approved by our institutional review board for research involving human subjects. One participant was an undergraduate student who was subsequently dropped.


Forty-eight participants were randomly assigned to either the classroom-based practice condition (n = 24) or the simulation-based practice condition (n = 24). The mean age was 25.48 years (SD = 3.09 years). Forty-four participants were medical students, three were nursing students, and one was a medical resident. The mean Medical College Admission Test (MCAT) score (n = 41) was 33.63 (SD = 3.22). Twenty-eight participants reported no prior training in ultrasound procedures, 16 participants reported receiving 0 to 2 hours of training, and four participants reported receiving 2 to 4 hours of training. Fifteen participants reported receiving ultrasound scanning procedure training in a lecture or classroom, and nine participants reported receiving some hands-on practice. Overall, the sample represented novices with limited or no experience with ultrasound scanning.

Model Patients

Patients with ascites were recruited as model patients for the purpose of simulating the abdominal condition of patients with blunt trauma resulting in hemorrhage. Ascites is a pathological condition resulting from liver disease wherein fluid accumulates in the abdominal cavity.18,19 Data collection occurred over three occasions. For the first two occasions, the same model patient (moderate degree of free fluid) was used. Scheduling conflicts required the use of a second model patient (severe degree of free fluid) for the third data collection. Both patients were positive for free fluid in the RUQ, LUQ, and suprapubic regions.

Tasks

The four major tasks participants engaged in were tests of knowledge of the FAST examination, FAST examination instruction, FAST examination practice, and conducting FAST examinations on model patients.

Tests of Knowledge of the FAST Examination

Participants were administered a pretest of their knowledge of the FAST examination and abdominal anatomy, and a posttest of their knowledge of the FAST examination. The tests included items depicting normal and positive conditions revealed by a FAST examination.

FAST Examination Instruction

In this study, the FAST examination was limited to the RUQ (Morrison's pouch), LUQ (spleen), and suprapubic (bladder) windows, as the model patients were ascites positive with free fluid in only these quadrants. Participants viewed two instructional video modules in sequence.5,20 The first video was on the physics of ultrasound and general principles of sonography. The second video focused specifically on the FAST examination, describing how to view and interpret each of the four window quadrants: RUQ, LUQ, suprapubic, and subxiphoid.

The video demonstrated a 3-minute FAST examination, discussed the implications of the mirror image artifact, discussed major pitfalls, and provided case study examples of conditions wherein free fluid is present.

FAST Examination Practice

Participants in the control condition received classroom-based practice and those in the experimental condition received simulator-based practice.

Classroom-based practice

The instructor provided participants with 5 minutes of instruction on the basics of machine operation, followed by a demonstration of the FAST examination. Each participant then practiced the FAST examination on the normal model patient (RUQ, LUQ, and suprapubic quadrant) while the instructor provided guidance and feedback. The instructor demonstrated anatomic landmarks in each quadrant and other procedures consistent with typical training procedures.7,21 Each participant was given two opportunities to conduct the FAST examination. One hour was allotted and used for the classroom practice session.

Simulator-based practice

The simulator system we used, SonoSimulator, was developed by Pélagique (Santa Monica, California). The virtual patients used in SonoSimulator were modeled with real-patient ultrasound scans of both normal and pathologic cases with a generic ultrasound design. The simulator database contained a range of cases with varying amounts of free fluid: absence of fluid, and minimal, moderate, and severe amounts of fluid. The variety of patients used to populate the virtual-patient database also provided scans with a range of anatomy. The simulator presented up to 10 cases with varying levels of normality and severity. Participants manipulated the probe, and the probe movements were reflected in a scan window that was updated in real time, as on a real ultrasound machine. Figure 1 shows a screenshot of the user interface for the RUQ. Each case in the simulation was subdivided into three views: RUQ, LUQ, and suprapubic. Participants were told to find the ideal diagnostic window, then "freeze" the scan and provide a diagnosis. A report with the participant's response was then printed and given to the expert sonographer. The expert sonographer evaluated the participant's diagnosis and window quality and determined whether the participant should advance to the next view. Criteria for advancement were an accurate diagnosis and an excellent or fair window. One hour was allotted and used for the simulator-based practice session.

FAST Performance Test

The performance test required participants to conduct a FAST examination on two model patients (one normal, one ascites positive). No help or feedback was given to the participants. Participants used a portable ultrasound machine (the M-Turbo,14 MicroMaxx,15 or S-FAST16).


FIGURE 1. Screenshot of ultrasound simulator user interface for right upper quadrant (Morrison's pouch).

Images of the ultrasound window were captured to disk and later evaluated by an expert sonographer. During the first data collection we observed a few participants who could not find the appropriate landmarks and took an excessively long time to complete a scan. Thus, on the second and third data collection occasions we imposed a time limit of 2 minutes each for the RUQ and suprapubic scans, and 4 minutes for the LUQ.

Measures

Two types of measures were developed for this study: (i) knowledge-based measures used to evaluate participants' knowledge of FAST-examination-related concepts; and (ii) performance-based measures used to evaluate participants' skill at executing a FAST examination.

Knowledge-Based Measures

Knowledge-based measures were derived from information covered in the instructional videos5,20 as well as FAST-examination-specific concepts. Three broad areas were sampled: prior knowledge of anatomy, basic FAST examination procedures, and window interpretation.

The following sources were consulted to develop or adapt items for the knowledge measures: instructional materials (e.g., textbooks, guidebooks, and prior research)22–25 and the computer-based instruction5 used in the study. Items were reviewed by various experts (a director of emergency ultrasound, a director of ultrasound and breast imaging, a director of an emergency ultrasound department, an experienced sonographer, and two emergency physicians). Table I shows the final distribution of the items by topic and the scale reliabilities. The knowledge measures were embedded in a 69-item pretest, which included items related to the FAST examination (15 constructed response items and 46 selected response items) and participants' knowledge of abdominal anatomy (8 selected response items). The anatomy items served as a check of prior knowledge; participants were assumed to have knowledge of abdominal anatomy, and thus these items were not of interest on the posttest. The posttest contained the same FAST examination questions without the anatomy questions, in addition to questions about participants' backgrounds. As may be seen in Table I, the pretest α reliabilities were acceptable (α > 0.70) except for identification of abdominal organs (α = 0.46). The posttest α reliabilities, however, were much lower.

TABLE I. Knowledge-Based Measures (N = 48)

                                                             Internal Consistency (Cronbach's α)   Test–Retest    Equality of Variance Test
Scales                                        No. of Items      Pretest        Posttest            Reliability        F          p Value
Prior Knowledge of Anatomy                          8             0.45           n/a                   n/a            n/a        n/a
Basic FAST Scanning Procedures                     17             0.70           0.44                  0.47           3.75       < 0.001
Anatomical Interpretation of FAST Windows          16             0.79           0.53                  0.22           1.69       < 0.05
Window Interpretation
  Identification of FAST Window Quadrants          14             0.91           0.57                  0.46           1.84       < 0.05
  Diagnostic Interpretation of FAST Windows        14             0.85           0.23                  0.19           2.02       < 0.01



consistent with participants’ performance becoming more uniform after training and practice. The variance on each posttest scale was significantly lower than the pretest scale. For example, on 6 items in the posttest identification of FAST window quadrants scale, over 85% of the participants got the item correct compared with 16% of the participants on the pretest. The low test–retest reliability suggests that the rank ordering of participants changed from pretest to posttest. Performance-Based Measures (Live Patient FAST Examination) Window acquisition time

The window acquisition time was measured with a stopwatch and represented the period between first contact of the probe with the model patient's body and when the participant said "stop" to indicate an adequate window or the participant's judgment that he or she could not acquire the window. Quadrant scan time was computed for the RUQ, LUQ, and suprapubic quadrant, summing across the two model patients. Cronbach's α was 0.56 for the RUQ, 0.81 for the LUQ, and 0.86 for the suprapubic quadrant.
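For readers who want to see how internal-consistency coefficients like those reported above are typically obtained, the sketch below computes Cronbach's α from a participants-by-items score matrix. It is a minimal illustration rather than the authors' analysis code; the toy `scores` array is hypothetical, and NumPy is an assumed tooling choice.

    import numpy as np

    def cronbach_alpha(scores: np.ndarray) -> float:
        """Cronbach's alpha for a (participants x items) score matrix."""
        k = scores.shape[1]                          # number of items
        item_vars = scores.var(axis=0, ddof=1)       # variance of each item
        total_var = scores.sum(axis=1).var(ddof=1)   # variance of participants' total scores
        return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

    # Hypothetical data: 6 participants x 4 items, scored 0/1.
    scores = np.array([
        [1, 1, 1, 0],
        [1, 1, 0, 0],
        [0, 1, 1, 1],
        [1, 1, 1, 1],
        [0, 0, 0, 0],
        [1, 0, 1, 0],
    ])
    print(round(cronbach_alpha(scores), 2))

The same formula applies whether the "items" are test questions or, as for the scan-time reliabilities above, repeated quadrant measurements treated as items.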

Window quality

For each acquired window, an expert evaluated the quality of the window. The window was rated as "excellent, fair, poor, or other." "Other" captured situations where the window acquired was nondiagnostic. Window quality was dichotomized into two categories, "excellent or not excellent," and subsequent analyses examined the number of participants that acquired "excellent" windows by quadrant (RUQ, LUQ, and suprapubic quadrant) and patient type (normal, positive).

Diagnostic accuracy

For each acquired window, the participant rendered a diagnosis of that window. An expert evaluated the quality of the diagnosis. The diagnosis was rated as "correct, incorrect, or other." "Other" captured situations where the window acquired was nondiagnostic. For analysis purposes, diagnostic accuracy was dichotomized into two categories, "correct or not correct," and subsequent analyses examined the number of participants that interpreted the window correctly by quadrant (RUQ, LUQ, and suprapubic quadrant) and patient type (normal, positive).

Perceived Utility of Training

Participants were asked three questions about how well they thought the ultrasound practice session (either with the simulator or in the group session) prepared them to perform a "real" ultrasound examination of a live patient with respect to (i) probe manipulation, (ii) window acquisition, and (iii) window diagnosis. For each question, participants were asked to rate the amount of practice received as being very inadequate, inadequate, adequate, or too much. Participants were also asked for written comments related to each question.

Background Information

Data regarding demographic information and prior experience with ultrasound training were gathered. Participants were asked their age, gender, current position, and MCAT scores. Participants were also asked what type of prior ultrasound training they had received and the number of hours spent training on ultrasound procedures.

Procedure

Participants were introduced to the study and then were given the pretest of FAST knowledge (30 minutes). A scheduling error resulted in four participants not being administered the pretest. Participants were then given 120 minutes to view instructional videos on the physics of sonography and the FAST examination procedure. Participants received practice in either a classroom setting or a simulation setting (65 minutes). Following the practice session, participants were required to conduct a live FAST examination on two patients (40 minutes). After the live patient examination, participants were given the knowledge posttest, filled out the feedback form, and completed paperwork to receive payment (40 minutes). The entire protocol took 285 minutes (4 hours and 45 minutes).

RESULTS

Preliminary Analyses

Prior to the main analyses, the conditions were tested for differences on self-reported number of hours of ultrasound training, MCAT scores, and pretest measures; no differences were found, which suggests the conditions were similar in their experience and knowledge of ultrasound scanning procedures. Table II shows descriptive statistics for the knowledge measures, Table III shows descriptive statistics for the performance measures, and Table IV shows correlations among the posttest knowledge measures and performance measures. In general, we expected posttest scores to be higher than pretest scores because our participants were novices and they received instruction and practice on ultrasound scanning procedures. We also expected the simulator-based practice condition to score higher on interpreting windows because of the greater amount of practice on interpreting windows available with the ultrasound simulator compared with the classroom-based practice condition. Finally, we expected the knowledge measures in general to relate to the performance measures, particularly window quality and window interpretation. As shown in Table II, participants' pretest scores are much lower than their posttest scores, especially on the scales related to FAST windows. Scores on the performance measures show high variability for the time-based measures, which suggests nonuniform training outcomes.

TABLE II. Descriptive Statistics of Knowledge Measures

                                                             Control (Classroom-Based Practice) (n = 24)    Experimental (Simulator-Based Practice) (n = 24)
Measure                                      Max. Possible      M       SD      Min.     Max.                  M       SD      Min.     Max.
Identification of Abdominal Anatomy                8            6.50    1.22    3.00     8.00                  6.65    1.09    5.00     8.00
Basic FAST Scanning Procedures
  Pretest                                         17            6.79    2.90    0.00    11.00                  6.45    2.96    3.00    13.00
  Posttest                                        17           14.25    1.51   11.00    17.00                 14.58    1.50   11.00    17.00
Anatomical Interpretation of FAST Window
  Pretest                                         16            5.83    2.91    0.00    13.00                  5.10    2.13    0.00     8.00
  Posttest                                        16           12.83    1.79   10.00    16.00                 12.25    2.17    6.00    16.00
Identification of FAST Window Quadrant
  Pretest                                         14            1.75    2.92    0.00     9.00                  1.35    2.91    0.00    11.00
  Posttest                                        14            8.50    2.40    3.00    12.00                  9.25    1.80    7.00    13.00
Diagnostic Interpretation of FAST Window
  Pretest                                         14            1.92    2.89    0.00    11.00                  0.90    1.62    0.00     5.00
  Posttest                                        14            8.46    1.41    6.00    11.00                  9.88    1.70    7.00    12.00

TABLE III. Descriptive Statistics of Performance Measures

                                                             Control (Classroom-Based Practice) (n = 24)    Experimental (Simulator-Based Practice) (n = 24)
Measure                                      Max. Possible      M        SD       Min.     Max.                M        SD       Min.     Max.
Diagnostic Interpretation of FAST Window(a)        6            4.13     1.19     2.00     6.00                3.83     1.13     2.00     6.00
Acquisition of FAST Window(a)                      6            3.33     1.58     0.00     6.00                2.00     1.56     0.00     6.00
Window Scan Time(b,c)
  RUQ Scan Time (s)                             240 (s)       125.71   133.04    17.00   548.00              142.75    62.13    33.00   257.00
  LUQ Scan Time (s)                             480 (s)       155.00   132.17    25.00   548.00              271.08   181.85    53.00   673.00
  Suprapubic Scan Time (s)                      240 (s)        68.87   105.76    22.00   548.00               88.38    48.82     9.00   180.00

(a) Summed across RUQ, LUQ, and suprapubic quadrants and two patients. (b) Summed across two patients. (c) A time limit was imposed on scan time after the first data collection, where we observed a few participants taking an excessively long time to complete a scan.

TABLE IV. Intercorrelations (Pearson) Among Posttest Knowledge and Performance Measures (N = 48)

Posttest Measure                                      1        2        3        4        5        6        7        8        9
Knowledge Measures
 1. Basic FAST Scanning Procedures                    —
 2. Anatomical Interpretation of FAST Window         0.32*     —
 3. Identification of FAST Window Quadrant           0.16     0.20      —
 4. Diagnostic Interpretation of FAST Window         0.06     0.07     0.53***   —
Performance Measures
 5. No. of Correct FAST Window Interpretations       0.13     0.11     0.21     0.34*     —
 6. No. of Excellent FAST Windows                   −0.05     0.09    −0.11    −0.15     0.41**    —
 7. Total Scan Time (s)                              0.16     0.07    −0.10     0.26    −0.15    −0.21      —
 8. RUQ Scan Time (s)                                0.16    −0.02    −0.02     0.13    −0.20    −0.25     0.76***   —
 9. LUQ Scan Time (s)                                0.16     0.09    −0.09     0.28    −0.14    −0.20     0.87***  0.57***   —
10. Suprapubic Scan Time (s)                         0.14     0.12    −0.01     0.14    −0.12    −0.12     0.66***  0.66***  0.25

*p < 0.05 (two-tailed); **p < 0.01 (two-tailed); ***p < 0.001 (two-tailed).

As shown in Table IV, knowledge measures in general did not correlate with performance measures. The exception was diagnostic skill: the more correct interpretations participants made of static (paper) windows, the more correct their interpretations of live patient scans (r(47) = 0.34, p < 0.05). No other knowledge measure was related to any of the performance measures. Finally, we checked whether participants benefited from the instructional videos and ultrasound scanning practice.

Learning the critical ultrasound knowledge components was presumed to be essential to conducting live patient scans and window interpretation. Separate paired t-tests were conducted on each knowledge measure, as shown in Table V. Participants, independent of condition, learned the content very well, as indicated by effect sizes over 1.0. Effect sizes over 1.0 are rare in training research. In general, if the pretest score was at the 50th percentile, an effect size of 1.0 would increase the percentile score from 50 to 84.
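The 50th-to-84th-percentile claim follows from the standard normal distribution: an effect size of 1.0 corresponds to a shift of one standard deviation, and the area below +1 SD is roughly 0.84. A minimal check (SciPy is an assumed tooling choice, not part of the original analysis):

    from scipy.stats import norm

    effect_size = 1.0
    # Percentile of a score one standard deviation above the pretest mean,
    # assuming approximately normally distributed scores.
    print(norm.cdf(effect_size))  # ~0.8413, i.e., about the 84th percentile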

TABLE V. Pretest and Posttest Descriptive Statistics and Paired t-Tests (N = 44)

                                                 Pretest           Posttest            Paired t-test
Measure                                          M       SD        M       SD         t         p        Effect Size   % Change
Basic FAST Scanning Procedures                   6.64    2.90     14.42    1.50      20.64    < 0.001        3.43          117
Anatomical Interpretation of FAST Windows        5.50    2.58     12.54    1.99      16.84    < 0.001        2.46          128
Identification of FAST Window Quadrants          1.57    2.89      8.88    2.13      17.13    < 0.001        2.80          466
Diagnostic Interpretation of FAST Windows        1.45    2.43      9.17    1.71      18.43    < 0.001        2.94          532
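As a small illustration of how the % Change column in Table V is obtained, the gain is the pretest-to-posttest difference expressed as a percentage of the pretest mean. The sketch below recomputes two rows from the table's means; Python is an assumed tooling choice, not the authors' analysis code.

    def percent_change(pre_mean: float, post_mean: float) -> float:
        """Gain from pretest to posttest as a percentage of the pretest mean."""
        return (post_mean - pre_mean) / pre_mean * 100

    print(round(percent_change(6.64, 14.42)))  # ~117 (Basic FAST Scanning Procedures)
    print(round(percent_change(1.57, 8.88)))   # ~466 (Identification of FAST Window Quadrants)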

Was There an Effect of Type of Practice on Knowledge of FAST Examination Procedures and FAST Examination Performance?

To address this question, we examined whether there were treatment effects on knowledge and performance outcomes. We checked for differences between conditions on the various knowledge scales; performance was examined by checking for differences in time to scan, window acquisition quality, and window interpretation for the RUQ, LUQ, and suprapubic quadrants.

Effects of Type of Practice on Knowledge

Separate t-tests were conducted on knowledge of basic FAST scanning procedures, anatomical interpretation of FAST windows, identification of FAST window quadrants, and diagnostic interpretation of FAST windows (Table II). Participants in the experimental condition (M = 9.88, SD = 1.70) scored significantly higher on diagnostic interpretation of FAST windows than those in the control condition (M = 8.46, SD = 1.41), t(46) = 3.14, p = 0.003, d = 0.91. Participants who received simulator-based practice scored 17% higher on items requiring diagnostic interpretation than those who received classroom-based practice. No other significant differences were found.
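For readers who want to verify this kind of comparison, the independent-samples t-test and the pooled-SD effect size can be reproduced from the summary statistics reported above (the posttest diagnostic interpretation row of Table II). This is a minimal sketch, not the authors' analysis code; SciPy is an assumed tooling choice.

    import math
    from scipy.stats import ttest_ind_from_stats

    # Diagnostic interpretation posttest: experimental vs. control (Table II).
    m1, sd1, n1 = 9.88, 1.70, 24   # simulator-based practice
    m2, sd2, n2 = 8.46, 1.41, 24   # classroom-based practice

    t, p = ttest_ind_from_stats(m1, sd1, n1, m2, sd2, n2, equal_var=True)
    pooled_sd = math.sqrt((sd1**2 + sd2**2) / 2)   # pooled SD for equal group sizes
    d = (m1 - m2) / pooled_sd

    # Close to the reported t(46) = 3.14, p = 0.003, d = 0.91; small differences
    # reflect the rounding of the published summary statistics.
    print(round(t, 2), round(p, 3), round(d, 2))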

Effects of Type of Practice on Performance

Performance during the live patient examination was evaluated using three measures: (i) time-to-scan; (ii) quality of the acquired window; and (iii) quality of the interpretation of the window. Data were analyzed by type of patient and quadrant.

Time-to-scan

Separate analyses were conducted by type of patient (normal, positive) and for each quadrant. For each quadrant, a repeated-measures analysis of variance was conducted, with type of patient (normal, positive) as the within-subjects factor and condition (control, experimental) as the between-subjects factor. For the RUQ, an effect of type of patient was found, with participants in both conditions taking significantly longer to scan the positive patient (M = 79.9 s, SD = 59.3 s) compared with the normal patient (M = 56.5 s, SD = 64.2 s), F(1, 45) = 5.44, p = 0.02, d = 0.38. Participants in general took about 41% longer to scan the positive patient. No other significant differences were found.

Similarly, for the suprapubic quadrant, an effect of type of patient was found, with participants in both conditions taking significantly longer to scan the positive patient (M = 45.4 s, SD = 49.5 s) compared with the normal patient (M = 33.5 s, SD = 37.6 s), F(1, 45) = 6.00, p = 0.02, d = 0.27. Participants in general took about 36% longer to scan the positive patient. No other significant differences were found. For the LUQ, an effect of type of patient was found, with participants in both conditions taking significantly longer to scan the positive patient (M = 124.0 s, SD = 82.3 s) compared with the normal patient (M = 98.1 s, SD = 103.1 s), F(1, 42) = 4.85, p = 0.03, d = 0.28. Participants in general took about 26% longer to scan the positive patient. A significant main effect of condition was also found, with participants in the control condition taking less time to scan the patients (M = 160.82 s, SD = 132.23 s) than those in the experimental condition (M = 283.36 s, SD = 183.49 s), F(1, 42) = 6.46, p = 0.015, d = 0.77. Participants who received simulator-based practice took about 76% longer to scan patients compared with those who received classroom-based practice. No other significant differences were found. In summary, these results suggest differences in the time-to-scan between the normal and positive patients in general. The only significant effect of practice was for the LUQ, where participants who received hands-on classroom-based practice performed the scan faster than those who received only simulator-based practice (which did not include practice finding the initial anatomical landmarks).

Window quality

Table VI shows the distribution of control and experimental participants who acquired high-quality windows. High quality was defined as a window quality rating of "excellent." "Other" was defined as a window quality rating of "fair, poor, or other." Separate χ²-tests were conducted for each quadrant by type of patient. For the normal patient (RUQ), participants in the classroom-based practice condition had a higher rate of high-quality windows acquired (n = 18) than did participants in the simulator-based practice condition (n = 11). Conversely, participants in the simulator-based practice condition had a higher rate of non-high-quality windows acquired (n = 13) than did participants in the classroom-based practice condition (n = 6), χ² = 5.37, p = 0.02.

TABLE VI. Comparison of Window Quality by Condition (N = 48)

                      Control (Classroom-Based Practice)               Experimental (Simulator-Based Practice)
Quadrant              No. of Participants With                Other    No. of Participants With                Other
                      High-Quality Windows                             High-Quality Windows
Normal Patient
  RUQ*                        18                                6              11                                13
  LUQ                          8                               16               4                                20
  Suprapubic                  19                                5              15                                 9
Positive Patient
  RUQ                         10                               14               8                                16
  LUQ                          9                               15               6                                18
  Suprapubic***               16                                8               4                                20

*p < 0.05 (two-tailed); ***p < 0.001 (two-tailed).

Seventy-five percent of participants who received classroom-based practice were able to acquire high-quality windows, compared with 46% of participants who received simulator-based practice. Similarly, for the positive patient (suprapubic quadrant), participants in the classroom-based practice condition had a higher rate of high-quality windows acquired (n = 16) than did participants in the simulator-based practice condition (n = 4). Conversely, participants in the simulator-based practice condition had a higher rate of non-high-quality windows acquired (n = 20) than did participants in the classroom-based practice condition (n = 8), χ² = 12.34, p < 0.001. Sixty-seven percent of participants who received classroom-based practice were able to acquire high-quality windows, compared with 17% of participants who received simulator-based practice.

Window interpretation

Table VII shows the distribution of control and experimental participants who correctly interpreted windows. "Other" was defined as a rating of "incorrect or other." Separate χ²-tests were conducted for each quadrant by type of patient. For the normal patient (LUQ), participants in the classroom-based practice condition had a higher rate of correct window interpretations (n = 17) than did participants in the simulator-based practice condition (n = 9). Conversely, participants in the simulator-based practice condition had a higher rate of incorrect interpretations (n = 15) than did participants in the classroom-based practice condition (n = 7), χ² = 5.37, p = 0.02.

Seventy-one percent of the participants who received classroom-based practice were able to interpret windows correctly, compared with 38% of the participants who received simulator-based practice. Conditional analyses were also conducted to examine whether there was a difference in diagnosis quality given an adequate window acquisition (i.e., a window rating of excellent or fair). There were no differences between conditions by quadrant and patient, or in overall diagnosis quality.

Participants' Perceptions of the Effectiveness of Practice

We examined participants' perceptions of the utility and effectiveness of the practice received with respect to preparing them to conduct a FAST examination on a live model patient. Because nearly all participants responded with ratings of inadequate amount of practice or adequate amount of practice, the survey responses were collapsed into two categories, inadequate (comprising ratings of very inadequate and inadequate) and adequate (comprising ratings of adequate and too much practice) (see Table VIII). χ²-tests of independence were conducted for each question to test for an association between condition and participants' perceptions of the adequacy of the amount of practice. For the acquisition aspect of scanning, there were no differences by condition.

TABLE VII. Comparison of Window Interpretation by Condition (N = 48)

                      Control (Classroom-Based Practice)               Experimental (Simulator-Based Practice)
Quadrant              No. of Participants With                Other    No. of Participants With                Other
                      Correct Interpretations                          Correct Interpretations
Normal Patient
  RUQ                         24                                0              23                                 1
  LUQ*                        17                                7               9                                15
  Suprapubic                  18                                6              22                                 2
Positive Patient
  RUQ                         18                                6              17                                 7
  LUQ                         11                               13              14                                10
  Suprapubic                  11                               13               7                                17

*p < 0.05 (two-tailed).
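As a hedged illustration of the χ²-tests of independence used throughout this section, the sketch below reruns the test for the LUQ row of Table VII (normal patient) using the counts from the table. It is not the authors' analysis code, and SciPy is an assumed tooling choice; the article does not state whether a continuity correction was applied, so the uncorrected test is shown because it reproduces the reported value.

    from scipy.stats import chi2_contingency

    # Rows: classroom-based vs. simulator-based practice.
    # Columns: correct interpretations vs. other (Table VII, LUQ, normal patient).
    observed = [[17, 7],
                [9, 15]]

    chi2, p, dof, expected = chi2_contingency(observed, correction=False)
    print(round(chi2, 2), round(p, 2))  # 5.37, 0.02, matching the reported values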



TABLE VIII. Distribution of Participant Responses About the Amount of Practice

                                                                     Inadequate Amount    Adequate Amount
Question                                                             of Practice          of Practice
1. Physical Aspect of Scanning (i.e., Manipulating the Probe)
   Classroom-Based Practice                                                 6                   17
   Simulator-Based Practice                                                13                   11
2. Acquisition of a Scan Window Aspect of Scanning (i.e., Being
   Able to Acquire a High-Quality Window)
   Classroom-Based Practice                                                12                   12
   Simulator-Based Practice                                                 8                   16
3. Diagnostic Aspect of Scanning (i.e., Being Able to Identify
   Normal or Abnormal Conditions)
   Classroom-Based Practice                                                17                    6
   Simulator-Based Practice                                                11                   13

For the physical aspect of scanning, participants in the classroom-based practice condition reported a higher rate of adequate amount of practice (n = 17) than did participants in the simulator-based practice condition (n = 11). Conversely, participants in the simulator-based practice condition reported a higher rate of inadequate amount of practice (n = 13) than did participants in the classroom-based practice condition (n = 6), χ² = 3.85, p = 0.05. For the diagnostic aspect of scanning, participants in the simulator-based practice condition reported a higher rate of adequate amount of practice (n = 13) than did participants in the classroom-based practice condition (n = 6). Conversely, participants in the classroom-based practice condition reported a higher rate of inadequate amount of practice (n = 17) than did participants in the simulator-based practice condition (n = 11), χ² = 3.85, p = 0.05. The two conditions thus appeared to perceive the adequacy of the practice sessions differently: participants who received classroom-based practice reported adequate practice on the physical aspect of scanning (probe manipulation) compared with participants who received simulator-based practice, and inadequate practice on diagnosing window scans compared with participants who received simulator-based practice.

DISCUSSION

Participants learned from the training materials. There was clear evidence of the effectiveness of the training materials (the instructional videos and the practice). Participants more than doubled their scores on the posttest, with gains of 117%, 128%, 466%, and 532% on basic FAST scanning procedures, anatomical interpretation of FAST windows, identification of FAST window quadrants, and diagnostic interpretation of FAST windows, respectively. How much the instructional videos or the practice contributed to the posttest gains cannot be determined because measures were not administered between the instruction and practice components of the study.

There was clear evidence that participants in the simulator-based practice condition were able to interpret correctly more FAST windows than participants who received classroom-based practice (17% higher scores on the knowledge posttest). Participants in the simulator condition practiced on multiple normal and positive cases and were able to compare an acquired window to reference windows that represented normal and positive states (minimal, moderate, and severe). Participants in the simulator-based practice condition reported that the simulation practice aided them in interpreting scans, and they reported a higher rate of adequate amount of practice on the diagnostic aspect of scanning than did participants in the classroom-based practice condition. These findings point to a major benefit of the FAST simulator-based practice: the capability to offer practice with multiple cases and conditions synchronized with the probe manipulation. Simulator-based practice appears to be similar to classroom-based practice in the diagnostic accuracy of windows acquired during live patient examinations. There was no difference across conditions in the number of correct interpretations on five of the six quadrants scanned across one normal and one positive patient. The exception was the LUQ (normal patient), where more classroom-based practice participants were able to correctly diagnose the window. Classroom-based practice may promote greater window acquisition skills compared with simulator-based practice. Statistically significant differences in window quality were found that favored the classroom-based practice condition for the RUQ (normal patient) and the suprapubic quadrant (positive patient). In both cases, a higher number of participants in the classroom-based practice condition were able to acquire excellent windows compared with the simulator-based practice condition. These results may point to one important difference between the practice conditions. In the classroom-based practice condition, participants received individualized guidance from the instructor on probe positioning and rotation, which sometimes included the instructor physically guiding the participant's hand to establish correct placement on the patient. In contrast, the simulator had the probe locked into the ideal initial positions, thereby providing participants in the simulator-based practice condition no opportunity to find the correct initial position or exposure to the various anatomical landmarks that would arise in a search for the target location. (This feature has been incorporated in more recent versions of the simulator.) This difference was a major issue that surfaced in participants' self-reports: participants in the classroom-based practice condition reported that they received an adequate amount of practice, whereas participants in the simulator-based practice condition reported an inadequate amount of practice. There appears to be little difference in the time to scan a quadrant between the two practice formats, with the exception of the LUQ, where the simulation-based practice condition took significantly longer to scan the patients (76% longer). Although scan times did not differ statistically by condition on the other two quadrants, these quadrants had large standard deviations in general, with the control condition having significantly larger scan time standard deviations.


In summary, these findings suggest the general effectiveness of the simulator practice: (i) superior diagnostic interpretation on a knowledge-based test; and (ii) similar levels of performance on live patient examinations, despite participants having had no prior hands-on practice (i.e., the first scan with a patient was the live patient test).

Limitations

There are two limitations to this study. First, the availability of only two ascites-positive model patients limited the range of cases and situations on which a participant was tested. Questions remain about sampling and generalizability: to what extent does performance on the live test in this study adequately represent performance on actual patients likely to be encountered? The second limitation is a potential instructor effect in the classroom-based practice condition. We do not have information on how representative the classroom instructor was of ultrasound trainers in general. The instructor used in this study reported over 20 years of experience teaching ultrasound concepts and procedures. Our observation of the classroom instruction suggested an instructor who was able to provide clear explanations and demonstrations of procedures and to provide effective feedback and guidance to students who had difficulty acquiring a window and finding the initial probe position. In addition, the class size used (about eight participants) may have been smaller than a typical training class.

Implications for FAST Training

One of the most interesting findings of this study was that using the virtual trainer (SonoSimulator) for practice did not result in markedly inferior performance on the physical aspects of scanning. The simulator-based training was sufficiently effective to enable participants who received no hands-on practice to perform comparably to the participants who received hands-on practice on most of the performance measures across the normal and positive patients used in this study. However, the self-reports of the participants who were given simulator-based practice point to the importance of being able to find the initial probe location for a quadrant. In the broader training context, the findings of this study are consistent with what is known about the design features of effective simulators. The simulator was designed around the cognitive demands of FAST window acquisition and interpretation. Through repeated exposure to cases, users are engaged in the review, identification, acquisition, and interpretation of a FAST window. High fidelity is used judiciously, only to link the real-time probe response to its corresponding window. Users can view interpolated windows of scans of actual patients, resulting in exposure to varying free-fluid conditions and varying anatomy, and can then compare an acquired window to windows of normal and various positive conditions. This latter capability is an important instructional feature, as it helps users identify the window characteristics of the various free-fluid states.

One of the most important training benefits of simulators is extended training time. Having virtual patients with various disease conditions available to scan avoids the restrictions associated with model patients, such as limited availability and willingness to endure long training sessions with a number of trainees. In the case of ascites, patients are often too ill to even participate in a study. Another benefit of the simulator is documented pathology: any anatomical anomaly or a specific severity of condition can be included in the simulator for an unlimited number of views. Finally, the utility of the simulator as an anytime-anywhere refresher trainer is clear. Assuming basic competency at initial probe location and landmark identification, the use of the FAST simulator as a means to relearn procedures anytime-anywhere seems ideal, as the simulator emphasizes the mapping between probe movement and window quality. Perhaps the most powerful capability of the simulator is to provide users with practice recognizing various disease conditions or window anomalies (e.g., artifacts) that would be difficult to observe otherwise.

ACKNOWLEDGMENTS

SonoSimulator is protected by U.S. patent number 8,297,983, "Multimodal Ultrasound Training System," assigned to the Regents of the University of California. Lead Inventor: Eric Savitsky, MD. SonoSim retains exclusive licensing rights to the intellectual property associated with the patent. Eric Savitsky is a Founding Partner of Pélagique, which funded this study. The work reported herein was supported by a subcontract from Pélagique to the National Center for Research on Evaluation, Standards, and Student Testing (CRESST). The work was also partially supported by a grant from the Office of Naval Research, Award Number N00014-10-10978. Pélagique's work on the SonoSimulator is supported by the U.S. Army Medical Research and Materiel Command under Contract No. W81XWH11-C-0529.

REFERENCES

1. Nelson BP, Melnick ER, Li J: Portable ultrasound for remote environments, part II: current indications. J Emerg Med 2011; 40: 313–21.
2. Hile DC, Morgan AR, Laselle BT, Bothwell JD: Is point-of-care ultrasound accurate and useful in the hands of military medical technicians? A review of the literature. Mil Med 2012; 177: 983–7.
3. Fox JC, Irwin Z: Emergency and critical care imaging. Emerg Med Clin North Am 2008; 26: 787–812.
4. Patel NY, Riherd JM: Focused assessment with sonography for trauma: methods, accuracy, and indications. Surg Clin North Am 2011; 91: 195–207.
5. Fox JC: Focused Assessment with Sonography in Trauma "FAST." Los Angeles, CA, Pélagique, 2010.
6. Nelson BP, Melnick ER, Li J: Portable ultrasound for remote environments, part I: feasibility of field deployment. J Emerg Med 2011; 40: 190–7.
7. Arienti V, Camaggi V: Clinical applications of bedside ultrasonography in internal and emergency medicine. Intern Emerg Med 2011; 6: 195–201.
8. Mateer J, Plummer D, Heller M, et al: Model curriculum for physician training in emergency ultrasound. Ann Emerg Med 1994; 23: 95–102.
9. Weidenbach M, Wild F, Scheer K, et al: Computer-based training in two-dimensional echocardiography using an echocardiography simulator. J Am Soc Echocardiogr 2005; 18: 362–6.


10. Tsui CL, Fung HT, Chung KL, Kam CW: Focused abdominal sonography for trauma in the emergency department for blunt abdominal trauma. Int J Emerg Med 2008; 1: 183–7.
11. American Institute of Ultrasound in Medicine: AIUM practice guideline for the performance of the focused assessment with sonography for trauma (FAST) examination. J Ultrasound Med 2008; 27: 313–18.
12. Shackford SR, Rogers FB, Osler TM, Trabulsy ME, Clauss DW, Vane DW: Focused abdominal sonogram for trauma: the learning curve of nonradiologist clinicians in detecting hemoperitoneum. J Trauma 1999; 46: 553–64.
13. Maul H, Scharf A, Baier P, et al: Ultrasound simulators: experience with the SonoTrainer and comparative review of other training systems. Ultrasound Obstet Gynecol 2004; 24: 581–5.
14. SonoSite: M-Turbo Ultrasound System. SonoSite, Bothell, WA, 2009.
15. SonoSite: MicroMaxx Ultrasound System. SonoSite, Bothell, WA, 2008.
16. SonoSite: S Series System. SonoSite, Bothell, WA, 2010.
17. Kirkpatrick DL, Kirkpatrick JD: Evaluating Training Programs: The Four Levels, Ed 3. San Francisco, CA, Berrett-Koehler, 2006.


18. Runyon BA: Care of patients with ascites. N Engl J Med 1994; 330: 337–42.
19. Yu AS, Hu KQ: Management of ascites. Clin Liver Dis 2001; 5: 541–68.
20. Fox JC: Ultrasound Instrumentation and Image Acquisition. Los Angeles, CA, Pélagique, 2010.
21. Salen P, O'Connor R, Passarello B, et al: FAST education: a comparison of teaching models for trauma sonography. J Emerg Med 2001; 20: 421–5.
22. Alberto VO, Kelleher D, Nutt M: Post laparoscopic cholecystectomy ascites: an unusual complication. Internet J Surg 2007; 10.
23. Carnes E: FAST exam, 2007. Available at http://www.medlectures.com/Emergency%20Medicine%20Lectures/Trauma%20Lectures/The%20Fast%20Exam.ppt; accessed July 29, 2010.
24. Noble VE, Nelson B, Sutingco AN: Manual of Emergency and Critical Care Ultrasound. New York, Cambridge University Press, 2007.
25. Reardon R: Ultrasound in trauma—the FAST exam. Ultrasound Guide for Emergency Physicians, 2008. Available at http://www.sonoguide.com/FAST.html; accessed July 29, 2010.



Adaptive and Perceptual Learning Technologies in Medical Education and Training

Philip J. Kellman, PhD

ABSTRACT Recent advances in the learning sciences offer remarkable potential to improve medical education and maximize the benefits of emerging medical technologies. This article describes 2 major innovation areas in the learning sciences that apply to simulation and other aspects of medical learning: perceptual learning (PL) and adaptive learning technologies. PL technology offers, for the first time, systematic, computer-based methods for teaching pattern recognition, structural intuition, transfer, and fluency. Synergistic with PL are new adaptive learning technologies that optimize learning for each individual, embed objective assessment, and implement mastery criteria. The author describes the Adaptive Response-Time-based Sequencing (ARTS) system, which uses each learner's accuracy and speed in interactive learning to guide spacing, sequencing, and mastery. In recent efforts, these new technologies have been applied in medical learning contexts, including adaptive learning modules for initial medical diagnosis and perceptual/adaptive learning modules (PALMs) in dermatology, histology, and radiology. Results of all these efforts indicate the remarkable potential of perceptual and adaptive learning technologies, individually and in combination, to improve learning in a variety of medical domains.

INTRODUCTION

Recent advances in the learning sciences offer remarkable potential to improve medical education. These advances are relevant to almost all domains of medicine, and they have direct application to maximizing the benefits of simulation and cutting-edge technologies. In this article, I describe two innovations in training technology that apply to simulation and other aspects of medical learning: perceptual learning (PL) and adaptive learning technologies. PL techniques teach pattern recognition, structural intuition, and fluency. Adaptive learning technologies can optimize learning for each individual, embed objective assessment throughout learning, and implement mastery criteria. Understanding the role and value of these emerging technologies requires some discussion of traditional conceptions of learning and how these are changing, as well as elaboration of the basic elements and benefits of each technology. We consider conceptions of learning, perceptual learning technology, and adaptive learning technology in the first three sections. Then, we describe recently developed medical learning applications of perceptual and adaptive learning technology in the areas of clinical diagnosis, radiology, dermatology, and histopathology.

Department of Psychology, University of California, Los Angeles, 405 Hilgard Avenue, Los Angeles, CA 90095-1563.

Any opinions, findings, and conclusions or recommendations expressed in this article are those of the author and do not necessarily reflect the views of the U.S. Department of Education, the National Science Foundation, or other agencies. The findings and opinions expressed here do not necessarily reflect the positions or policies of the Office of Naval Research. Systems that use learner speed and accuracy to sequence learning events, as well as some aspects of perceptual learning technology described here, are protected by U.S. patent 7052277 and patents pending, assigned to Insight Learning Technology, Inc. For information, please contact either the author or Info@insightlearningtech.com.

doi: 10.7205/MILMED-D-13-00218


In the final section, we consider synergies between these learning technologies and simulation tools and techniques in medicine.

REVISITING LEARNING

In most instructional settings, learning is organized around two types of knowledge. This is not surprising, as these two types are often considered exhaustive, even in many cognitive psychology texts. Declarative knowledge includes facts and concepts that can be verbalized. Procedural knowledge includes sequences of steps that can be enacted. A conventional view of learning, shared by nonspecialists and researchers alike, is that learning consists of accumulating these facts, concepts, and procedures.1 The standard view has been called a "container" model of the mind: learning consists of facts, concepts, and procedures that we place into the container (the mind), and for later performance, we retrieve these items.1 Persistent problems in learning and instruction suggest that this learning worldview is defective. Students who have been faithfully taught and have diligently absorbed declarative and procedural inputs fail to recognize key structures and patterns in real-world tasks. Students may know procedures but fail to understand their conditions of application or which ones apply to new problems or situations. And learners may understand but process slowly, with high cognitive load, making them impaired in demanding, complex, or time-limited tasks. These characteristic problems can be observed in learning domains from mathematics to surgical training. They suggest that much is missing from the typical view of learning. What is it? Some answers are clearly available if one looks not at the literatures on education or learning, but at the literature on expertise.

TABLE I. Some Characteristics of Expert and Novice Information Extraction

                                Novice                                               Expert
Discovery Effects(a)
  Selectivity                   Attention to Relevant and Irrelevant Information     Selective Pickup of Relevant Information/Filtering
  Units                         Simple Features                                      "Chunks"/Higher-Order Relations
Fluency Effects(b)
  Search Type                   Serial Processing                                    More Parallel Processing
  Cognitive Load                High                                                 Low
  Speed                         Slow                                                 Fast

(a) Discovery effects involve learning and selectively extracting features or relations that are relevant to a task or classification. (b) Fluency effects involve coming to extract relevant information faster and with lower attentional or cognitive load. (See text.)

Studies of expertise (what people are like when they are really good at things) recurrently implicate a number of abilities that emerge from changes in the way information is extracted: PL. Kellman2 suggested that PL effects fall into two broad categories, discovery and fluency effects. Table I summarizes a number of these in each category. Discovery effects refer to learners finding the information that is most relevant to a task. One important discovery effect is increased attentional selectivity. With practice on a given task, learners come to pick up the relevant information for relevant classifications while ignoring irrelevant variation.3 Practice also leads learners to discover invariant or characteristic relations that are not initially evident (cf. Chase and Simon4) and to form and process higher level units (Goldstone5; for reviews, see Bereiter and Scardamalia,1 Gibson,3 and Goldstone6). Fluency effects refer to changes in the efficiency of information extraction. PL leads to fluent and sometimes automatic processing,7 with automaticity in PL defined as the ability to pick up information with little or no sensitivity to task load. As a consequence, perceptual expertise may lead to more parallel processing and faster pickup of information. It is fair to say that studies of expertise have done more to describe these characteristics of experts than to reveal how these changes come about, except for the observation that expertise grows over long experience.8 More foundational work suggesting how these changes arise was done by Eleanor Gibson3 and her students several decades ago. Gibson defined PL as "changes in the pick up of information as a result of practice or experience" and argued that such changes tended to be domain-specific improvements, resulting from classification experience, involving the discovery of characteristic or invariant properties distinguishing objects or situations from one another.3 Recently, PL has become a major focus of research in cognitive science and neuroscience (for reviews, see Kellman,2 Fahle and Poggio,9 and Kellman and Garrigan10). For present purposes, three clear ideas are most relevant. First, PL is a pervasive process of learning that serves to optimize information extraction to improve task performance.

One example of a complex task in which dramatic PL effects have been studied is chess. On a good day, the best human chess grandmaster can beat a chess-playing computer that examines upward of 200 million possible moves per second and incorporates methods for evaluating positions and strategies culled from grandmaster consultants. By comparison, human players do relatively little raw search in chess, examining perhaps as many as 4 possible moves and following these to a depth of several successive possible moves. Despite this huge discrepancy in search ability, humans can play chess at astonishingly high levels. Remarkably, the incredible abilities of skilled chess players, relative to novice players, turn out not to depend primarily on sophisticated reasoning or a greater storehouse of factual knowledge. They depend on perception of structure: learned pattern classification abilities of remarkable flexibility, complexity, and sophistication.4,11 Much of the relevant perception of structure is not verbally accessible. With appropriate learning experiences in a specific domain, PL allows humans to reach almost magical levels of expertise, but the relevant learning experiences are not those of traditional classrooms or tutorials.

These observations about the origins of advanced expertise apply to many high-level domains of human competence; in medicine, they are crucial for understanding the skills of the expert radiologist, pathologist, and surgeon. Likewise, PL appears to form the core of the notion of "situation awareness," which can be described as "being aware of what is happening around you to understand how information, events, and your own actions will affect your goals and objectives."12 Situation awareness is crucially important to many domains of military training and performance, as well as aviation and air traffic control, and many other complex tasks. The PL effects given in Table I summarize much of what is involved: selectively and automatically picking up task-relevant information, detecting important relationships, and being able to extract information with low enough cognitive load to allow handling of complex and overlapping task demands.

In these and other domains, there is a common misconception about PL effects in expertise, related both to the oft-repeated maxim that becoming an expert requires 10,000 hours of practice and to the typical view of learning as storing something in the mind. The misconception is that what happens in the transition from novice to expert has to do with committing to memory a great number of examples. A related idea is the suggestion that stored instances somehow become "mental models." In chess, for example, it may be asserted that the experts succeed because they have memorized many games. These ideas do not provide a workable account of the expertise furnished by PL. Although experiencing many instances can be an important input to PL, storage of instances does not produce much of the relevant expertise, nor is it a component of leading computational models of PL (see Kellman and Garrigan10 for a recent review).


The reason involves what is needed to effectively use any facts, procedures, or models stored in memory (especially if there is a lot stored). Effective performance relies crucially on pattern recognition. When faced with a new situation, the question is: Which of the items, procedures, or models stored in the brain is relevant to this situation? This is a problem that requires classifying the new input. PL is the learning process that ultimately, through changes in the attunement, scope, and fluency of information extraction,3,10 distinguishes the expert from the novice who does not see what is relevant or who is blind to the distinguishing features that place the input into one category rather than another. In domains that matter, this can hardly ever be done by use of memorized instances. The skilled radiologist, for example, must detect the pathology in a new image or set of images, where the tumor may be manifest in a different location, size, orientation, and contrast, and situated amidst novel and variable background anatomy and image noise, as compared with any images seen previously. The power of exposure, classification, and feedback involving a wide variety of cases is that information selection and pattern discovery mechanisms are honed, allowing the pickup of relevant structures, and, equally important, that the information extraction mechanisms discard or ignore irrelevancies that do not drive important classifications. The expert emerges from PL experience with an attuned information extraction system, not a storehouse of memorized instances. The access to relevant stored information can work effectively in complex domains only after the input is rapidly and accurately classified.

It is paradoxical (or instructive) that one encounters the instance memorization account in reference to chess, as this is a domain in which the futility of memorizing can be shown by quantitative proof, based on the fact that the sheer number of possibilities dwarfs any capacity to remember and replay specific games. It has been calculated that after 40 moves of a game of chess, there are about 10^120 different possible games. This exceeds by a considerable amount the number of atoms in the universe (about 10^80)! Even chess-playing machines, whose memory capacity far exceeds that of humans, both in volume and accuracy, are not able to play chess primarily by looking up familiar games.

PERCEPTUAL LEARNING TECHNOLOGY

Most recent PL research has focused on low-level sensory discriminations.9 This focus derives from an interest in understanding plasticity in the brain, and from the fact that sensory coding is best understood in the early cortical levels of the brain. Considerable research, however, indicates that PL is equally applicable to high-level, complex tasks.3,4,13–15 Many of these research efforts in both high- and low-level PL have led to an improved understanding of the conditions that produce PL.

These developments are significant, because conventional instructional techniques do little to advance expert pattern recognition and fluency. In many domains, there has been a tacit assumption that we cannot teach this kind of knowing. In accord with this assumption, radiologists, surgeons, and pathologists, as well as chemists, pilots, and air traffic controllers, are told that expert intuitions will arise, not from "book learning," but from "seasoning," "experience," or the passage of time. From the standpoint of cognitive science, the passage of time is not a strong candidate for a learning mechanism. Instead it turns out that this kind of learning can be systematically addressed and accelerated using appropriate computer-based instantiations of principles of PL.13,15 We call these perceptual learning modules (PLMs).

A complete description of PL techniques is beyond the scope of this article, but a few basics will serve to characterize the approach. PLMs use interactive learning trials; learning advances through many short trials in which the learner performs some classification task and receives feedback. Classification episodes are the engine that drives PL processes to discover and process fluently key features and relationships relevant to the task. Equally crucial are specific kinds of variation in the display sets. Instances never or seldom repeat. Positive instances must vary in characteristics irrelevant to the classification, to allow learning of invariances. Negative instances must share with positive instances the values and dimensions of irrelevant properties. Research suggests a number of other important considerations about trial formats, spacing, and sequencing.16 The key to understanding PLMs, relative to traditional instructional modes, is that in PLMs one is seldom asked to solve an explicit problem or give a declarative answer; rather, the tasks in PLMs call upon the learner to classify, locate, distinguish, or map structure across multiple representations.

Work with PLMs shows that relatively brief interventions can produce large learning gains in many domains. Some examples include aviation training,13 mathematics,14 and science learning.17 In some especially novel applications, PLMs are being used to improve intuitions about patterns that may lead to drug discovery in the pharmaceutical industry. A number of studies indicate the role of perceptual structure in science, technology, engineering, and mathematics (STEM) learning domains,18 as well as the potential of PL interventions to accelerate expert information extraction and fluency in mathematics.10,14,16,17,19,20 PL interventions seem to be able to overcome pervasive obstacles in mathematics learning. In a recent series of PLMs targeting interrelated concepts in linear and area measurement, units, fractions, multiplication, and division, middle school students using PLMs in targeted interventions consistently showed strong and long-lasting learning gains on assessments including primarily transfer items, with effect sizes in the range from 0.84 to 2.69.14,16 PLM techniques systematically address aspects of expertise for which direct instructional methods have not been previously available.
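The display-set constraints just described (instances that never or seldom repeat, positive instances that vary on irrelevant dimensions, and negative instances that share those irrelevant values) can be made concrete in a short sketch. The following Python fragment is a hypothetical illustration of how a PLM trial generator might enforce such constraints; the item attributes, category labels, and checks are invented for the example and are not taken from the PLMs described in this article.

```python
import random
from dataclasses import dataclass

@dataclass
class Instance:
    image_id: str     # hypothetical identifier for a display
    category: str     # the classification the learner must make
    irrelevant: dict  # attributes that should NOT predict the category

def build_trials(pool, n_trials, seed=0):
    """Draw classification trials under two PLM-style constraints:
    (1) no instance repeats, and (2) every value of each irrelevant
    attribute appears in more than one category, so it cannot serve
    as a shortcut for the classification."""
    rng = random.Random(seed)

    # Constraint (2): each irrelevant value must span at least two categories.
    attributes = {key for inst in pool for key in inst.irrelevant}
    for attr in attributes:
        categories_by_value = {}
        for inst in pool:
            value = inst.irrelevant.get(attr)
            categories_by_value.setdefault(value, set()).add(inst.category)
        unbalanced = [v for v, cats in categories_by_value.items() if len(cats) < 2]
        if unbalanced:
            raise ValueError(f"Attribute {attr!r} predicts the category for {unbalanced}")

    # Constraint (1): sample without replacement so no instance repeats.
    return rng.sample(pool, k=min(n_trials, len(pool)))

# Invented example: body site is irrelevant to the lesion category and is
# balanced across the two categories, so it passes the check.
pool = [
    Instance("img001", "papule", {"body_site": "arm"}),
    Instance("img002", "plaque", {"body_site": "arm"}),
    Instance("img003", "papule", {"body_site": "trunk"}),
    Instance("img004", "plaque", {"body_site": "trunk"}),
]
for trial in build_trials(pool, n_trials=3):
    print(trial.image_id, trial.category)  # learner would classify, then get feedback
```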


Both in terms of the breadth of applications and the possibility of radically improving learning in high-stakes domains, no area is more promising for PL technology than medical learning. Within radiology alone, there are a huge number of perceptual classifications relating to pathology and normal variation, spanning not only a variety of disease conditions but also several different imaging modalities. Some of these involve a small number of fixed views; others involve 3D models capable of generating many views and requiring perceptual exploration to process fully, whereas in still others, such as ultrasound, the crucial information is often available only in animated sequences. It is well known that the speed and accuracy of the expert radiologist in exploring, seeing, and classifying develop over long and unsystematic experience (and may be highly variable across individuals). Likewise, pathologists must distinguish and classify different tissue conditions and pathogens, and dermatologists must classify skin conditions. Nor is PL confined to visual displays; heart sounds and breathing abnormalities are auditory examples, and we could enumerate haptic and tactile examples as well. Equally important are the perceptual–procedural combinations required in surgical and interventional procedures. Although we are accustomed to thinking about the deft hands of the surgeon, the crucial role of perceptual expertise in guiding procedures, recognizing tissues and organs, and providing feedback from action illustrates Benjamin Franklin's astute observation: "The eye of the master will do more work than both of his hands." We have begun to engineer PL technology into a number of domains of medical learning, and the potential appears limitless. Before describing these initial efforts, it will be useful to introduce the companion innovation that allows us to get the most from these efforts in PL interventions and also improves the efficiency of other types of learning: adaptive learning technology.

ADAPTIVE LEARNING TECHNOLOGY

In most instructional settings, student learning is limited by the failure of instruction to adapt to the individual. Students have different starting points and differ in aspects of lessons they learn well or poorly. Testing often arrives at the end, not in the midst, of learning, and it often involves global scoring rather than rich descriptions of what has and has not been learned. Moreover, testing usually targets accuracy alone, or perhaps speed for an entire test. Seldom are combined accuracy and fluency measures used to assess detailed aspects of learning; nor are assessments fed back continuously to optimize each individual's learning. Lacking such links between continuous assessment and the flow of learning events, it is also rare for the learner to be guided to mastery criteria involving accuracy and fluency for all components of learning tasks. These limitations can potentially be overcome, and learning dramatically improved, by the use of adaptive learning technology.

Adaptive Response-Time-Based Sequencing (ARTS) System

Since the classic work of Atkinson in the 1960s,21 a variety of adaptive learning schemes have been proposed, with the goal of using the learner's performance along with laws of learning and memory to make learning more efficient. These systems have usually been tested with the learning of discrete items, such as foreign language vocabulary words, and have been shown to outperform random presentation of items. Most systems adapt the presentation of items based on the learner's accuracy on previous trials, and some guide learning by algorithms that derive estimates of probabilities of items becoming well-learned, based on models of learning.22,23 The success of previous adaptive learning systems suggests the overall promise of adaptive approaches. Existing systems, however, have important limitations. One is that model-based systems require a prior experiment, using similar learners and random presentation of learning materials, to estimate parameters for implementing the adaptive scheme. Another is that reliance on accuracy omits important information that may be provided by response times (RTs). We have developed a new adaptive learning system that uses both accuracy and speed to determine the spacing and sequencing in learning, as well as in implementing mastery criteria. We call it ARTS (Adaptive Response-Time-Based Sequencing).24 We describe some basics of the system and then describe its utility.

Consider a set of n items (facts, patterns, concepts, procedures) to be learned. How can we optimize learning of the set for the individual learner? We assume an interactive learning system, in which learning consists primarily of learning trials. On each trial, some item, problem, or situation is presented, and the user must process and make a response. We optimize learning by applying principles of learning to a number of items simultaneously in a priority score system, in which all items (or categories in category sequencing) are assigned scores indicating the relative importance of that item appearing on the next learning trial. Priority scores for each item are updated after every trial, as a function of learner accuracy and RTs, trials elapsed, and in view of mastery criteria. Learning strength is assessed continuously and, in some implementations, cumulatively, from performance data. In most applications, the sequencing algorithm chooses the highest priority item on each learning trial. Adjustable parameters allow flexible and concurrent implementation of principles of learning and memory, such as stretching the retention interval automatically for each item as learning strength grows.

Our system relies on a database that stores all categories in PL and all instances in factual learning contexts (e.g., multiplication facts, vocabulary, chemical symbols, etc.). Performance data for every trial and every category or instance are acquired and used by a sequencing algorithm.


For simplicity, we describe the system in terms of item sequencing, although it applies also to category learning, in which each presentation involves a novel instance. Another simplification is that even in basic factual learning, multiple formats may be used across trials to test a single item (to produce generalizable learning and enhance interest), but we omit further details. We describe aspects of the system here omitting mathematical and technical detail. (See Mettler et al24 for more information.)

Our framework has great flexibility and may use a variety of equations relating elapsed time or trials, accuracy, and RT to the priority for presentation. When any particular function of these variables is used, there are parameters that may be adjusted to suit particular learning contexts or even individual learners. Priority scores for items are dynamically updated after each trial. In many applications, initial priority scores are given to all items, and an item's score does not change until after it is first selected for presentation. This establishes a baseline priority for feeding in new items that may be balanced against changing priorities for items already introduced. Preset orderings in learning can be accomplished by the assignment of initial priority scores that are higher for some items or categories than for others.

The full set of learning principles and objectives that may be embedded in ARTS is too extensive to describe here, but some important ones include the following (a minimal code sketch combining several of these principles appears after this list):

Rapid Item or Category Reappearance After Errors

Errors result in assignment of a high-priority weighting. With ordinary settings, the error weighting will exceed all initial priority score assignments, as well as the highest priority that may result from a slow, correct answer. However, reappearance of missed items is still subject to enforced delay.

Interleaving/Enforced Delay

To prevent recurrence of an item while its answer remains in working memory, the system is normally configured to preclude the presentation of the same item on consecutive trials.

Joint Optimization for the Entire Learning Set

A priority score system allows joint satisfaction of a number of learning principles applied to an entire set of items, as all factors feed into a priority score for each item or category. Scores are dynamically updated after each trial, and items or categories compete for selection on each learning trial.

Retirement and Mastery Criteria

Adaptive learning focuses the learner's effort where it is needed most. Commonly, learning effort and time are limited; therefore, it often makes sense to prioritize. We use the term retirement to describe removal of a learning item or category from the learning set, based on attainment of mastery criteria. Pyc and Rawson25 used the term "dropout" for this idea and found evidence that greater learning efficiency can be achieved with this feature, especially in highly demanding learning situations.

RTs provide important clues to the type of processing the learner is using. When a learner answers a problem by calculating or reasoning, they will tend to be slower than when retrieving the answer from memory. A key effect of PL, for example, is becoming able to extract relevant structure with low attentional load, which is an important contributor to expertise in many domains. These are independent reasons for using RT in mastery criteria.

Dynamic Spacing Based on RTs

In our system, the priority for re-presentation of an item is a function of RT and accuracy. Even with an accurate answer, a long RT suggests relatively weak learning strength. The system can use various functions of RT but typically produces increasing priority for longer RTs. Use of RTs in adaptive learning offers a simple, direct framework for implementing important principles to produce efficiencies in learning.

We hypothesize an internal variable of learning strength that may be influenced by the arrangement of learning events and inferred to some degree from performance. Learning strength is reflected in accuracy and speed in generating a factual answer or in making a classification in PL. Evidence supports response speed as an indicator of learning strength.25,26 Considerable research suggests that the value of a test trial (with successful retrieval) varies with an item's learning strength.27,28 Thus, the best time to re-present an item is at the longest interval for which a correct retrieval can still be accomplished.29 Controversy persists about whether and when expanding the retention interval is superior to schedules with equal spacing.27,30,31 Although these issues are subjects of continuing research, considerable evidence supports the idea that difficulty of successful retrieval is an important factor.27,28,32 Pyc and Rawson28 labeled this idea the "retrieval effort hypothesis": more difficult, but successful, retrievals are more beneficial to learning. In recent work, they studied the relation of number of successful retrievals to later memory performance, while manipulating the difficulty of those retrievals in terms of number of intervening trials. Greater numbers of intervening trials predicted better retention. These investigators also provided evidence that, as had been suggested in other work, larger gaps produced longer average response latencies,28 a finding consistent both with the idea that a larger gap affects an item's learning strength and that learning strength is reflected in RTs. Other recent research provides evidence for a substantial advantage of expanding the retrieval interval when material is highly susceptible to forgetting or when intervening material is processed between testing events,29 conditions that apply to many formal learning situations, including most medical learning applications. The flexibility of parameter adjustment in the ARTS system makes it possible to accommodate varied conditions of learning and even new findings regarding optimal spacing relations.
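Taken together, these principles lend themselves to a compact priority-score scheduler. The Python sketch below is a simplified illustration only, not the published ARTS algorithm of Mettler et al24; the weighting function, parameter values, and mastery thresholds are assumptions chosen to show how error weighting, enforced delay, RT-based priority, and retirement can be combined in a single priority score.

```python
import math

class ToyPriorityScheduler:
    """Toy priority-score scheduler in the spirit of ARTS (assumed parameters,
    not the published algorithm). The highest-priority eligible item is shown
    next; priorities are updated from accuracy and response time (RT)."""

    def __init__(self, items, initial_priority=5.0, error_weight=20.0,
                 enforced_delay=2, mastery_streak=3, mastery_rt=4.0):
        self.priority = {item: initial_priority for item in items}  # baseline for unseen items
        self.last_trial = {item: None for item in items}
        self.streak = {item: 0 for item in items}  # consecutive fast, correct responses
        self.retired = set()
        self.error_weight = error_weight        # exceeds any slow-but-correct priority
        self.enforced_delay = enforced_delay    # no re-presentation within this many trials
        self.mastery_streak = mastery_streak
        self.mastery_rt = mastery_rt            # seconds; fluency criterion
        self.trial = 0

    def next_item(self):
        if len(self.retired) == len(self.priority):
            return None                         # every item has been mastered
        eligible = [i for i in self.priority if i not in self.retired and
                    (self.last_trial[i] is None or
                     self.trial - self.last_trial[i] >= self.enforced_delay)]
        if not eligible:                        # everything blocked by the enforced delay
            eligible = [i for i in self.priority if i not in self.retired]
        return max(eligible, key=lambda i: self.priority[i])

    def record(self, item, correct, rt):
        """Update learning-strength bookkeeping and priorities after a trial."""
        self.trial += 1
        self.last_trial[item] = self.trial
        self.streak[item] = self.streak[item] + 1 if (correct and rt <= self.mastery_rt) else 0
        if self.streak[item] >= self.mastery_streak:
            self.retired.add(item)              # retirement: sustained accuracy and fluency
            return
        if correct:
            # A long RT suggests weak learning strength, so priority rises with RT.
            self.priority[item] = 1.0 + math.log(1.0 + rt)
        else:
            self.priority[item] = self.error_weight  # missed items come back soon
        for other in self.priority:             # spacing: waiting items drift upward
            if other != item and other not in self.retired:
                self.priority[other] += 0.5
```

In this toy version the error weighting (20.0) exceeds both the baseline priority assigned to unseen items and the largest value a slow, correct answer can produce, mirroring the ordering described above; the actual system uses adjustable parameters and more refined functions of accuracy, RT, and elapsed trials.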


Multipurpose, Multilevel Assessment

ARTS offers not only new opportunities to improve learning but also wide-ranging possibilities for assessment. At the core of adaptive learning is performance tracking and adjustment based on embedded assessment. In our system, every concept and item in the database is tracked in terms of the learner's accuracy and RTs on past trials. Both the raw data and derived measures are continuously available to gauge a learner's progress. Aggregating across learners can show a class's strengths and weaknesses for different categories of learning. Recent research shows that ARTS outperforms random presentation33 and also outperforms a classic adaptive learning system22 in tasks involving learning of factual items.24 Other research indicates that ARTS improves learning in perceptual and category learning relative to other schemes.33

MEDICAL APPLICATIONS OF PERCEPTUAL AND ADAPTIVE LEARNING TECHNOLOGIES

We have begun applying perceptual and adaptive learning technologies to medical learning, and the results are remarkably promising. We describe four of these efforts briefly.

ARTS Technology for Optimal Sick Call Performance

In a recent project funded by U.S. Army RDECOM (a collaboration of UCLA, Insight Learning Technology, and Pelagique, Inc.), we used ARTS in prototype learning systems for learning of factual material and medical diagnosis. The focus was on initial clinical "sick call" diagnosis by corpsmen and medics, and the goal was to improve factual learning through adaptive factual learning modules and integration of probabilistic information in diagnosis in cognitive task modules. In an efficacy study using premedical students carried out in the UCLA Human Perception Laboratory, the ARTS-based factual learning modules produced highly effective learning of medical material (such as signs and symptoms of meningitis, supraglottitis, etc.) and outperformed a control group using conventional study methods.34 Moreover, the cognitive task modules, which aimed at training information integration and higher level pattern recognition in diagnosis, added substantial benefits beyond mastery of the basic factual information.

PL in Radiology

Radiological diagnosis includes many domains in which subtle perceptual discriminations must be made, and radiological training could likely be radically improved by appropriate deployment of perceptual and adaptive learning technology. In a pilot project, we have begun to apply these methods to X-ray diagnosis of wrist injuries. Figure 1 shows a sample screenshot.

FIGURE 1. Examples of Some Trial Types in the Wrist X-ray PLM.

A variety of trial types, including distinguishing normal from injured wrists and classification of single or multiple injuries in particular images, are used in the module to maximize PL. Studies are ongoing, but initial results suggest that this format for learning can produce strong advances in perceptual expertise from relatively short investments of learning time.

Perceptual/Adaptive Learning Modules (PALMs) in Dermatology and Histopathology

In collaboration with the David Geffen School of Medicine at UCLA, we have recently developed and tested two computer-based PALMs in the pre-clerkship curriculum for first- and second-year medical students, one for recognizing pathologic processes in skin histology images (Histopathology PALM) and the other for identifying skin-lesion morphologies (Dermatology PALM). The goal was to assess their ability to develop pattern recognition and discrimination skills leading to accuracy and fluency in diagnosing new instances of disease-related patterns. We used a pre- and post-test design, with each test consisting of the presentation of a visual display along with possible answers for categorizing it.


No feedback was given in the assessments. The PALM, given to UCLA medical students between the pre- and post-tests, consisted of short interactive learning trials requiring the learner to classify images. The PL components in these modules included deploying a large display set, such that instances of categories did not repeat; moreover, as much as possible, irrelevant variables were balanced across categories (for instance, different dermatological conditions involved approximately the same range of body parts in the displays). The initial PALMs in these domains were simple; they included only a single type of trial (display presentation with verbal category labels). More varied and complex trial types are known to facilitate PL, but these will be explored in subsequent work. The adaptive learning components included use of category sequencing algorithms, which optimized spacing based on individual performance, as well as implementation of mastery criteria for each category, based on both sustained accuracy and fluency criteria.

The Dermatology PALM, designed to enhance the skin-lesion morphology curriculum presented in Year 2, consisted of 12 categories of lesion morphologies and was completed by 161 of the 162 second-year students. The Histopathology PALM was designed to complement the skin histopathology curriculum of Year 1 students by enhancing their ability to discriminate the different patterns of presentation observed for cell and tissue injury/repair, inflammation, neoplasia, and normal skin histology images, each at high-power and low-power magnifications. This module was completed by all 161 first-year students. The Histopathology PALM was also required of Year 2 students, both to measure retention of the subject from Year 1 and to serve as review and enhanced learning of the material. The Dermatology PALM was offered to Year 1 students, as a control, on a voluntary basis and was completed by 78 students. These modules were completed quickly, with learning criteria typically reached in 15–35 minutes.
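As a concrete illustration of a mastery criterion that combines sustained accuracy with a fluency (RT) requirement, the short sketch below shows one way such a retirement check could be written; the window size and thresholds are assumptions made for the example and are not the settings used in the PALMs described here.

```python
from statistics import median

def category_mastered(trials, window=6, min_accuracy=1.0, max_median_rt=5.0):
    """Return True if the most recent `window` trials for one category meet
    both a sustained-accuracy and a fluency (median RT) criterion.
    `trials` is a list of (correct, rt_seconds) tuples, oldest first.
    Thresholds are illustrative assumptions only."""
    if len(trials) < window:
        return False
    recent = trials[-window:]
    accuracy = sum(1 for correct, _ in recent if correct) / window
    fluent = median(rt for _, rt in recent) <= max_median_rt
    return accuracy >= min_accuracy and fluent

# Invented history: early trials are slow or wrong, the last six are fast and correct.
history = [(True, 7.2), (False, 9.0), (True, 6.1), (True, 4.8),
           (True, 4.0), (True, 3.6), (True, 3.2), (True, 2.9)]
print(category_mastered(history))  # True under these illustrative thresholds
```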

As shown in Table II, substantial improvements between pre- and post-test scores were observed, with large (mean effect sizes >0.7) and highly significant (p < 0.0001) increases in accuracy and speed in categorizing previously unseen images. Comparing performances for Years 1 and 2 on each of the modules, it can be seen that pre-test scores were much higher for dermatology lesion morphology in Year 2 than in Year 1, which is expected because the students in Year 2 had recently received lectures and an online learning experience. In contrast, this material was touched on only briefly for Year 1 students. Post-test scores, however, were highly similar. Histopathology pre- and post-test scores were similar for Year 1 and 2 students (Table II), showing strong learning gains for both groups. Finally, students reported that the PALMs increased their confidence and were useful, and they indicated that they would like more of these in other units.

TABLE II. Results of Dermatology and Histopathology PALMs with First and Second Year Medical Students

Rows: Year 1 Histopathology (Accuracy, RT); Year 1 Dermatology (optional) (Accuracy, RT); Year 2 Histopathology (Accuracy, RT); Year 2 Dermatology (Accuracy, RT). Columns: Pre-Test, Post-Test, p, t(df), Effect Size, N. (Apart from one pre-test entry, accuracy 66% (12%) and RT 6.16 (2.33), the cell values are not legible in this copy.)

APPLICATIONS OF PERCEPTUAL AND ADAPTIVE LEARNING TECHNOLOGIES IN MEDICAL SIMULATION

Although efforts are in their infancy, the promise of perceptual and adaptive learning technologies for improving medical learning is already obvious. Not much work, however, has yet addressed procedural learning and simulation. These areas are ripe for development, as these new technologies are well suited to getting the most from simulation training. Simply having cutting-edge simulations does not solve the problem of how to improve learning. Perceptual–procedural learning technologies and adaptive methods using objective criteria of learning have much to offer in this regard. In this section we note some issues, benefits, and considerations in applying these new technologies to simulation.

Perception–Action Loops in Procedural Learning

We often think of skilled practitioners, such as pilots or surgeons, as having "good hands," but the key to their skills

Use of the Assessment–Diagnosis–Treatment–Outcomes Model

studies of this issue, provides a much different solution and approach for determining the "best" treatment for the next patient with IDH seeking care. Specific patient profiles showed substantially different probabilities of improvement. As a patient or a clinician, which information would you find more useful in helping to make a treatment choice?

(1) The average patient gain from baseline on PF at 1 year was 44 points for patients treated surgically and 28 points for patients treated nonoperatively.
(2) The overall rate of patients reporting improvement on PF at 1-year follow-up was 90.75% for patients treated surgically and 77% for nonoperatively treated patients.
(3) Patients with your surgical care profile (i.e., baseline PF > 40 with a herniation type classified as a protrusion) reported improvement from surgery about 60% of the time (see Table II), but patients with your nonoperative care profile (i.e., baseline leg pain of 4, and a body mass index of 23, as illustrated in Table III) reported improvement at 1 year 100% of the time.

Perhaps both the patient and treating clinician should consider all three sets of information when deciding on the "best" treatment for what the ADTO model described as patients of type "X," meaning a patient with a given profile.
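The contrast among these three summaries (an average change score, an overall improvement rate, and a profile-conditional improvement rate) comes down to how the outcome data are grouped. The sketch below shows, with entirely invented records, how profile-specific improvement rates of the kind cited from Tables II and III could be tabulated; it is illustrative only and does not reproduce the IDH outcome data discussed here.

```python
from collections import defaultdict

def improvement_rates(records, profile_keys):
    """Tabulate the share of patients reporting improvement within each profile,
    where a profile is the tuple of values of `profile_keys` for a record."""
    counts = defaultdict(lambda: [0, 0])  # profile -> [improved, total]
    for record in records:
        profile = tuple(record[key] for key in profile_keys)
        counts[profile][1] += 1
        counts[profile][0] += int(record["improved"])
    return {profile: (improved / total, total)
            for profile, (improved, total) in counts.items()}

# Invented example records; field names and strata are assumptions for illustration.
records = [
    {"treatment": "surgery",      "bmi_stratum": "<=24.4", "improved": True},
    {"treatment": "surgery",      "bmi_stratum": ">26.6",  "improved": False},
    {"treatment": "nonoperative", "bmi_stratum": "<=24.4", "improved": True},
    {"treatment": "nonoperative", "bmi_stratum": ">26.6",  "improved": True},
]
for profile, (rate, n) in improvement_rates(records, ["treatment", "bmi_stratum"]).items():
    print(profile, f"{rate:.0%} improved (n={n})")
```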


Medical Simulations

Simulations as a learning tool have a long history in medical domains involving hand–eye coordination and are especially useful when failure to correctly execute a particular action, often in response to an unanticipated set of circumstances, is associated with dire consequences. In these cases, learning by doing would often degenerate into one-trial learning, with the lesson learned being that your patient is dead, or at least much the worse for wear. Thus, simulators associated with piloting aircraft, functioning in outer space, and surgical procedures are common, and some would argue indispensable, in training personnel to a standard of excellence. The use of simulations in medicine has now expanded to include what might be called hypothetical-construct coordination, since the simulations are designed to help train clinical personnel to integrate information from multiple sources, such as patient-reported symptoms, signs observed when examining the patient, other diagnostic tests such as radiographic findings (X-ray, magnetic resonance imaging, and computed tomography), and ultrasound with specific diagnoses and treatment options for the purpose of arriving at a particular decision or understanding of the system.

Current practices in medicine tend to develop, structure, and parameterize simulations by creating ontologies,70,71 where the knowledge representation is used to capture information and knowledge about the subject at hand. In medical simulations focusing on hand–eye coordination, high fidelity regarding instrumentation, anatomy, and physiology is an important and perhaps even dominant aspect in the ontology development process. In contrast, in medical simulations focusing on hypothetical-construct coordination, high fidelity is, in essence, the ability of the actor to surface the necessary and sufficient information needed to achieve the simulation goal, which would be, generically, to identify the circumstances when and what: (1) supplemental assessments, beyond gathering typical intake information, are necessary to determine a diagnosis; (2) confirmation protocols for the provisional diagnosis are necessary to confirm the diagnosis; (3) factors are relevant for determining the most appropriate treatment; and (4) outcomes are relevant for documenting best treatment. Thus, in hypothetical-construct-oriented medical simulations, as opposed to hand–eye coordination medical simulations, these ontologies might well be based on information provided by CPRs and SDM materials. As such, the cautionary tales associated with CPRs and SDM materials and protocols that rely on expert opinion and aggregated or group-level results from randomized controlled trials that are used to inform the A-D, D-T, and T-O links within the ADTO model apply.

The ability to assess fidelity and validity in hand–eye coordination medical simulations is similar to the notion of representational measurement; it is relatively straightforward by visual inspection to evaluate the similarity of the simulation environment to the real-world environment, and many of the metrics associated with success are also relatively straightforward, including such things as reaction times and adherence to protocols: for example, (1) time to complete a task, (2) reaction time with the introduction of "adverse" events, and (3) adherence to set protocols, along with accuracy and speed of completion of the maneuvers within each protocol. The ability to assess fidelity in hypothetical-construct medical simulations is more akin to nonrepresentational measurement, such as assessing achievement or intelligence by testing. Nonrepresentational measurement is necessary when one wants to quantify hypothetical constructs, and solutions for classic problems of reliability and validity associated with nonrepresentational measurement, along with the large body of measurement theory, are available to evaluate these constructs.

On the other hand, hypothetical-construct-oriented medical simulations represent a quantum leap in difficulty for a number of reasons. First, in some medical disciplines the lack of assessment sensitivity (the probability of a test being positive [Te+] given that, in truth, the patient is positive [Tr+], or Pr[Te+ | Tr+]) and positive predictive value, Pr(Tr+ | Te+), are not high-priority concerns; and second, as documented throughout this article, the CPRs and shared decision making, and the expert opinion and EBM clinical guidelines, associated with many medical conditions do not provide information that is particularly useful for specifying high-quality ontologies associated with determining at the patient level the A-D, D-T, and T-O links that the ADTO model suggests are necessary to establish a reliable and valid information system that can be used to further refine reliable and valid best practices for patient treatment.

Reliability is the bedrock upon which diagnosis, treatment, and outcomes must stand. One consequence of poor reliability in assessment is shown by Spratt and Koval (presentation at the Orthopaedic Trauma Association 24th Annual Meeting, October 2009). They evaluated inter- and intra-rater consistency associated with measuring distal radius wrist fractures from a series of three X-rays. This study codified aspects of six different distal radius classification systems that describe a distal radius fracture into a series of basic 2- and 3-option questions grouped within four domains: fracture (1) location, (2) displacement, (3) comminution, and (4) stability. The study design randomized the order of 24 posterior-anterior (PA), lateral, and oblique X-ray sets for each of the 7 raters (4 orthopaedic residents, 1 orthopaedic hand surgeon, 1 orthopaedic trauma surgeon, and 1 musculoskeletal radiologist) and used a web-based application that combined ImageJ software (version 1.40g, NIH) for evaluating the X-rays with a conditional question algorithm, in which subsequent questions were based on the rater's prior answers, to characterize the fracture within the four domains. The purpose of this study was to begin the validation process of a new classification system by assessing both inter- and intra-rater agreement. Thus, all raters were required to evaluate all 24 film sets on two occasions to establish intra-rater agreement, and all 21 possible rater pairs (7 choose 2 = 21) were evaluated against each other for both the raters' first and second readings of the X-rays. Reliabilities were computed for each question in each of the four domains, the goal being to develop a classification system composed of the reliable components of each of the six systems. Since the questions generally had only 2 or 3 response options, kappa statistics were used to assess consistency. Both Cohen's and Brennan–Prediger kappas were used, since Cohen's kappa is the de facto standard but is also known to produce unstable results when the distribution of scores across the categories is highly skewed, whereas the Brennan–Prediger kappa statistic does not have this problem.
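The difference between the two statistics lies in how chance agreement is estimated: Cohen's kappa derives it from the raters' marginal distributions, whereas the Brennan–Prediger kappa assumes a uniform 1/k chance rate for k response options. The minimal sketch below (with invented ratings, not the wrist-fracture data) shows why Cohen's kappa can collapse when responses pile up in one category while the Brennan–Prediger statistic does not.

```python
from collections import Counter

def cohen_kappa(r1, r2):
    """Cohen's kappa: chance agreement estimated from the observed marginals."""
    n = len(r1)
    p_obs = sum(a == b for a, b in zip(r1, r2)) / n
    m1, m2 = Counter(r1), Counter(r2)
    p_chance = sum(m1[c] * m2[c] for c in set(r1) | set(r2)) / (n * n)
    return (p_obs - p_chance) / (1 - p_chance) if p_chance < 1 else 1.0

def brennan_prediger_kappa(r1, r2, k):
    """Brennan-Prediger kappa: chance agreement fixed at 1/k for k options."""
    p_obs = sum(a == b for a, b in zip(r1, r2)) / len(r1)
    return (p_obs - 1.0 / k) / (1 - 1.0 / k)

# Invented 2-option ratings with a highly skewed distribution: the raters agree
# on 18 of 20 cases, but each calls a different single case "intra-articular".
rater1 = ["intra"] + ["extra"] * 19
rater2 = ["extra", "intra"] + ["extra"] * 18
print(round(cohen_kappa(rater1, rater2), 3))                  # about -0.053
print(round(brennan_prediger_kappa(rater1, rater2, k=2), 3))  # 0.8
```

With these invented ratings observed agreement is 90%, yet Cohen's kappa is slightly negative because the chance agreement implied by the skewed marginals is itself above 90%; the Brennan–Prediger value of 0.8 tracks the observed agreement far more directly.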

TABLE IV. Summary of Cohen's and Brennan–Prediger Kappa Distributions for Readings 1 and 2 Inter-Rater Agreements for the 21 Rater Pairs Defining All Rater Pairs for the 7 Raters, and the 7 Intra-Rater Agreements for Each of the 7 Raters

Values are given as KC (Cohen's Kappa) Mean / Median / P25 / P75, then KBP (Brennan–Prediger Kappa) Mean / Median / P25 / P75.

Distribution of Inter-Rater Kappas for the 21 Rater Pairs, First Reading
  Left or Right Wrist?          KC: 1.0000a / 1.0000a / 1.0000a / 1.0000a   KBP: 100g / 100g / 100g / 100g
  Intra or Extra Articular Fx?  KC: 0.5130c / 0.5000c / 0.4286c / 0.6364c   KBP: 35.22l / 35.71l / 00.00l / 51.61l

Distribution of Inter-Rater Kappas for the 21 Rater Pairs, Second Reading
  Left or Right Wrist?          KC: 1.0000a / 1.0000a / 1.0000a / 1.0000a   KBP: 100g / 100g / 100g / 100g
  Intra or Extra Articular Fx?  KC: 0.6190b / 0.6667b / 0.5833c / 0.6667c   KBP: 57.14k / 58.33k / 41.67l / 75.00i

Distribution of Intra-Rater Kappas Across the 7 Raters
  Left or Right Wrist?          KC: 1.0000a / 1.0000a / 1.0000a / 1.0000a   KBP: 100g / 100g / 100g / 100g
  Intra or Extra Articular Fx?  KC: 0.5342c / 0.5833c / 0.4000c / 0.6596c   KBP: 77.68i / 81.25h / 75.00i / 87.50h

Key for evaluating KC: a = .81–1.0 Excellent, b = .61–.80 Good, c = .41–.60 Moderate, d = .21–.40 Fair, e = .01–.20 Poor, f
