EMERGING DIALOGUES IN ASSESSMENT

Writing wrongs together: A peer-led prescription for AI-enhanced assessments in medical education

June 11, 2026

Anna Kochanowska Karamyan, PhD, M.Pharm, Associate Professor of Pharmacology, Oakland University William Beaumont School of Medicine
Paul Megee, PhD, Associate Professor of Biochemistry and Genetics, Oakland University William Beaumont School of Medicine
Jickssa Gemechu, PhD, Associate Professor of Anatomy and Embryology, Oakland University William Beaumont School of Medicine
Christopher Jaeger, MD, MMSc, Instructor of Urology, Oakland University William Beaumont School of Medicine and Corewell Health William Beaumont University Hospital

Abstract: Artificial intelligence (AI) is transforming many aspects of medical education, including the development, administration, and evaluation of assessments. This reflection details a faculty-led initiative to develop skills for integrating AI tools into assessment item development in medical education. We describe the structure and content of a two-part training session for clinical and foundational sciences faculty and reflect on the most common challenges and opportunities identified in discussion. Overall, participants appreciated the capabilities of currently available AI tools but recognized the importance of faculty expertise in critically evaluating AI output.

INTRODUCTION: INSTITUTIONAL CONTEXT AND MOTIVATION

At Oakland University William Beaumont (OUWB) School of Medicine, an allopathic medical school, our organ-based courses (e.g., cardiovascular, neurology) previously used customized National Board of Medical Examiners (NBME) exams, which consist of assessment items retired from licensure exams. These items were deemed inadequate for several reasons: items assessing some course content were often outdated or unavailable, or were framed in a manner that lacked significant clinical context. To address these deficiencies, OUWB faculty moved several years ago to instructor-written assessments in the pre-clerkship curriculum. Instructor-written items can be better aligned with course content, can assess high-yield concepts frequently tested on licensure exams, and can include rich clinical vignettes that require students to apply clinical knowledge and reasoning to arrive at correct answers. Moreover, the use of instructor-written summative assessments allows students to debrief exams: learners review questions and answer rationales independently under secure conditions, which further enhances learning and reveals knowledge gaps requiring additional self-directed learning.

While our faculty prefer instructor-written summative assessment items, creating and refining high-quality items and accompanying rationales is time-consuming and requires faculty skill development. Moreover, students in our foundational courses request large numbers of stage-appropriate formative assessment items that are equivalent in quality and difficulty to those used in summative assessments. These factors prompted us to design and deploy a two-part training program for clinical and foundational sciences faculty on item creation using artificial intelligence (AI).

AI has become increasingly important in developing assessment items because it can address several of the limitations of traditional item creation outlined above and reduce the faculty time required to develop and maintain large, high-quality assessment banks. A recent study reported that expert-revised AI-generated assessment items can perform comparably to instructor-written questions in terms of psychometric quality (Wu et al., 2026), suggesting that AI can serve as a useful tool for item writing. Another study likewise found that expert-refined AI-generated questions perform comparably to instructor-written questions, with nearly 70% of AI-generated items meeting inclusion standards with minimal modification (Ahmed et al., 2025). This finding highlights the potential of AI to make high-quality assessment items more accessible and scalable, while also indicating that a substantial fraction of generated items requires significant revision by content experts. Overall, these reports support combining AI with faculty expertise in assessment item development to optimize depth, accuracy, clinical relevance, and timely updating of question banks, while also meeting growing student demand for robust formative practice opportunities.

PEER-TO-PEER FACULTY DEVELOPMENT TRAINING MODEL

In response to the growing importance of AI in higher education and in a continuous effort to improve assessments at OUWB School of Medicine, the members of the Assessment Subcommittee developed a two-part training entitled “Writing Wrongs: A Prescription for Effective NBME-style Questions.” This training focused on supporting our peer faculty members who teach in the pre-clerkship curriculum in developing high-quality assessment items and integrating existing and emerging AI tools into assessment preparation. These goals were elaborated upon in the training announcement. To encourage clinical faculty to participate, we partnered with OUWB’s Center for Excellence in Medical Education and the Office of Continuing Medical Education (CME) in our affiliated hospital to award CME credit to clinician participants.

The first session of the two-part faculty development training consisted of an asynchronous video presentation detailing best practices in preparing NBME-style multiple-choice questions (MCQs). This session provided instruction on the structure and key components of well-designed NBME-style MCQs, as well as the most common deficiencies identified in item creation. We also discussed the interpretation of statistical measures of item performance (e.g., item p-values, meaning the proportion of examinees answering correctly, and point-biserial correlations). Following our presentation of the anatomy of NBME-style items, we turned to recommendations for incorporating AI into assessment item writing. These tips, derived from the facilitators’ experiences and supplemented with recommendations gleaned from a literature review (Indran et al., 2024; Kiyak & Emekli, 2024), included best practices for the ethical and professional integration of the available AI tools. Facilitators also stressed key strategies for detailed AI prompt design that increase item quality and promote adherence to NBME-style expectations. These strategies and tips include:

Providing numbered instructions
Splitting complex tasks into simpler ones
Requiring NBME-style conformity (single best answer, extensive clinical vignettes)
Targeting learner group (e.g., first-year medical students)
Including core concepts, learning objectives, keywords, etc.
Using Bloom's taxonomy verbs (e.g., analyze, apply, synthesize) to adjust item difficulty and to define the skills and knowledge being tested
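As an illustration only, the strategies listed above can be combined into a small prompt-building helper. The sketch below is not the prompt used in the training; the names ItemRequest and build_item_prompt, the wording of each instruction, and the pharmacology example are all hypothetical. It simply shows how numbered instructions, a target learner group, learning objectives, a Bloom verb, and keywords might be assembled into text that an item writer could paste into an AI tool.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ItemRequest:
    """Hypothetical container for what the item writer wants generated."""
    topic: str
    learner_group: str
    learning_objectives: List[str]
    bloom_verb: str = "apply"
    keywords: List[str] = field(default_factory=list)
    num_options: int = 5


def build_item_prompt(req: ItemRequest) -> str:
    """Assemble a numbered-instruction prompt for one NBME-style item."""
    objectives = "; ".join(req.learning_objectives)
    steps = [
        f"Write one multiple-choice question for {req.learner_group} on {req.topic}.",
        f"Address these learning objectives: {objectives}.",
        f"Write the item so that the learner must {req.bloom_verb} the concept.",
        "Open with a clinical vignette (age, presentation, pertinent findings).",
        f"Provide {req.num_options} options with a single best answer.",
        "After the item, state the correct answer and a short rationale for each option.",
    ]
    # Keywords are optional, so this instruction is added only when supplied.
    if req.keywords:
        steps.append("Incorporate these keywords: " + ", ".join(req.keywords) + ".")
    # Number each instruction so the tool can follow them one at a time.
    return "\n".join(f"{i}. {step}" for i, step in enumerate(steps, start=1))


# Illustrative request; the content is an assumption, not taken from the workshop.
example = ItemRequest(
    topic="loop diuretics",
    learner_group="first-year medical students",
    learning_objectives=["Explain the mechanism of action of loop diuretics"],
    bloom_verb="apply",
    keywords=["hypokalemia"],
)
print(build_item_prompt(example))
```

Keeping each instruction as a separate numbered step makes it easier to revise one element at a time (for example, changing only the Bloom verb) when refining a prompt iteratively.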

At the completion of the first session, the participants were asked to generate drafts of new assessment items applying the provided information. This preparatory step was designed to encourage active application of the concepts and to allow participants to identify challenges encountered during item development.
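For readers less familiar with the item statistics covered in the first session, the following minimal sketch shows one common way to compute an item p-value (the proportion of examinees answering correctly) and a point-biserial correlation from scored responses. The function names and the sample data are hypothetical, and the calculation assumes an uncorrected total score that includes the item itself; many exam platforms report a corrected version or use slightly different conventions.

```python
from math import sqrt
from statistics import mean, pstdev


def item_difficulty(item_scores):
    """Item p-value: proportion of examinees who answered the item correctly."""
    return mean(item_scores)


def point_biserial(item_scores, total_scores):
    """Correlation between a 0/1 item score and the total exam score."""
    if len(item_scores) != len(total_scores):
        raise ValueError("item_scores and total_scores must have equal length")
    p = mean(item_scores)
    sd_total = pstdev(total_scores)
    # Undefined when everyone (or no one) answered correctly, or totals do not vary.
    if p in (0, 1) or sd_total == 0:
        return float("nan")
    mean_correct = mean(t for s, t in zip(item_scores, total_scores) if s == 1)
    mean_incorrect = mean(t for s, t in zip(item_scores, total_scores) if s == 0)
    return (mean_correct - mean_incorrect) / sd_total * sqrt(p * (1 - p))


# Made-up data for five examinees: 1 = correct, 0 = incorrect on this item.
item = [1, 1, 1, 0, 0]
totals = [9, 8, 7, 5, 4]
print(round(item_difficulty(item), 2), round(point_biserial(item, totals), 2))
```

A higher point-biserial indicates that examinees who answered the item correctly also tended to score higher on the exam overall, which is why the statistic is often read as a measure of item discrimination.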

The second part of the training sequence was an in-person interactive workshop. This session started with a brief refresher exercise in which the participants applied knowledge from the recording to identify problems in several assessment items intentionally written with common shortcomings. This activity served as a transition into a broader discussion of the role and integration of AI in assessment item development. During the workshop, we continued discussing strategies for developing effective prompts, building on the prompt example provided in the recorded session and noting that prompt writing is a skill that can be intentionally improved and is directly connected to the quality of AI output. To illustrate the importance of high-quality prompts, we provided the participants with examples of items generated from intentionally vague prompts and from detailed prompts to demonstrate the difference in item quality. Additionally, we reiterated the importance of refining the prompt through iterative adjustments, which leads to higher-quality output. This discussion, together with the provided example of a detailed prompt, was designed to help workshop participants focus their attention on a specific goal and use AI tools more intentionally in item development. To provide hands-on experience, the draft items prepared by workshop participants before the session were collaboratively reviewed and discussed, applying the concepts introduced in the recording and throughout the workshop. Finally, we discussed the importance of critically reviewing AI output for factual accuracy, plausibility of distractors, alignment with content taught, and appropriateness to the learners’ expected cognitive level.

OUTCOMES AND REFLECTION

The workshop, attended by 30 faculty members, received highly positive evaluations: all survey respondents reported that it advanced their understanding of the topic and was effective in strengthening their knowledge and skills. Additionally, the majority of respondents (78%) reported that they planned to apply the newly acquired skills to improve their preparation of high-quality assessment items.

Facilitated discussion with workshop participants revealed several recurring themes regarding their hesitancy to use AI in assessment. The most common concern was the potential compromise of assessment item security and integrity, as many students use the same AI tools to generate practice questions. Participants also raised concerns about uploading lecture material into AI tools, citing risks related to question security, AI model training, and possible copyright infringement. Another theme was the risk of over-reliance on AI rather than using it as a complementary tool for exam item development. While there was considerable anxiety regarding the quality of AI output, most participants agreed that this could be managed by careful evaluation of the accuracy and quality of generated items.

The future of AI in medical education assessment has immense potential but requires thoughtful stewardship. AI offers unprecedented capacity for producing assessment items that are of high quality and psychometrically sound. However, effective use of AI depends on deliberate investment in educating faculty to critically evaluate, refine, and effectively deploy generative AI tools for item building. It is also essential for faculty to abide by institutional and regulatory policies, which are evolving rapidly with the technology. Successful use of AI in this space will depend on balancing technological advancement with faculty development.

Overall, the intentional use of AI, combined with clear prompts, iterative refinement, and careful review, proved highly effective in assessment item development. By grounding faculty in NBME-style principles and practical prompting strategies, the workshop enabled participants to use AI as a supportive tool. This experience underscores a key insight for educators and institutions seeking to enhance assessment quality: by focusing on peer training and best practices, stakeholders can benefit from AI’s efficiency in generating accurate, secure, and instructionally aligned assessment items while successfully navigating concerns regarding academic integrity.

ACKNOWLEDGEMENTS

The authors would like to thank Drs. Stefanie Attardi and Saima Mansuri, as well as the ex officio members of the Assessment Subcommittee, for their contributions to the development of the recorded session. We thank Dr. Ann Voorheis-Sargent and Victoria Arnold of the Center for Excellence in Medical Education for their assistance in organizing the interactive workshop.

REFERENCES

Ahmed, A., Kerr, E., & O'Malley, A. (2025). Quality assurance and validity of AI-generated single best answer questions. BMC Medical Education, 25(1), 300. https://doi.org/10.1186/s12909-025-06881-w

Indran, I. R., Paranthaman, P., Gupta, N., & Mustafa, N. (2024). Twelve tips to leverage AI for efficient and effective medical question generation: A guide for educators using Chat GPT. Medical Teacher, 46(8), 1021-1026. https://doi.org/10.1080/0142159X.2023.2294703

Kiyak, Y. S., & Emekli, E. (2024). ChatGPT prompts for generating multiple-choice questions in medical education and evidence on their validity: A literature review. Postgraduate Medical Journal, 100(1189), 858-865. https://doi.org/10.1093/postmj/qgae065

Wu, H., Lee, D., Zerner, T., Court-Kowalski, S., Devitt, P., & Palmer, E. (2026). A comparison of the psychometric properties of GPT-4 versus human novice and expert authors of clinically complex MCQs in a mock examination of Australian medical students. Medical Teacher, 48(1), 74-84. https://doi.org/10.1080/0142159X.2025.2513418