EMERGING DIALOGUES IN ASSESSMENT
An AI API: A Case Study in Automated Scoring and Institutional Improvement
DO NOT CITE IN ANY CONTEXT WITHOUT PERMISSION OF THE AUTHOR
June 1, 2026
Abstract
Artificial intelligence (AI) has rapidly expanded into higher education, including assessment practices. AI offers new ways to analyze, structure, and interpret institutional data. This article presents a case study of how one regional, comprehensive university used AI to analyze assessment data and develop a tool known as the Assessment Pursuit Index (API). It also situates the project within emerging scholarship on AI in educational assessment, arguing that leveraging AI’s capacity for programming and data handling can enhance institutional effectiveness by offering a low-cost approach that generates insights from existing data, especially when paired with transparent methodological controls and human oversight.

Introduction
Assessment is central to student learning, institutional improvement, program review, and accreditation. Even so, the collection and analysis of assessment reports can vary in quality owing to a variety of factors. Reporting is inconsistent because programs submit reports in different file types and formats, use dissimilar samples of student work, and apply unique disciplinary perspectives. This is especially true at institutions that do not use large, expensive software packages that offer some degree of standardization. The result is vast stores of data that do not exhibit a high degree of consistency across years, especially when the individuals responsible for recording and using the data rotate through assessment positions. In the end, many institutions possess assessment data but lack assessment insight. Fortunately, AI provides a low-cost, efficient way to use assessment data, and this paper presents an instance of AI being leveraged to create a tool for working with assessment datasets.

Artificial intelligence (AI) is rapidly being adopted in higher education in both academics and administration, including assessment.
Research demonstrates AI’s capacity to assist in decision-making with its power to process and analyze data (Holmes et al., 2021; Ifenthaler & Yau, 2020). At the same time that computing is making such advances, institutions and assessment professionals continue to wrestle with reporting, incongruent data, and limited resources, both human and technological (Jankowski et al., 2018). This case study illustrates how AI can advance assessment efforts by creating a system of evaluation that can efficiently process and analyze assessment’s large, messy datasets, especially for schools with small assessment teams or limited resources for dedicated assessment software. Specifically, this article showcases the development and use of an Assessment Pursuit Index (API), which evaluates programs on both compliance and quality of reporting. Reviewing the development and use of the API shows how AI can standardize assessment results while increasing transparency. An AI tool like the API also matters for improving student learning: it provides analysis and rankings quickly, putting useful information in the hands of decision-makers in a meaningful timeframe for action. In turn, the effect of acting on that information should be visible as improvement in a program’s API rank over time.

Background
The University of Northern Iowa (UNI) has a mature assessment culture and, like many institutions, collects annual assessment reports from programs that include documentation of learning outcomes, student work samples, and rubric use. Datasets compiled from these reports suffer from mixed reporting styles, varied assessment approaches, and variation in data coding over the years. These kinds of datasets require restructuring in the context of assessment management (Fulcher et al., 2014), and this work is often left undone or falls behind other assessment efforts owing to the resources and effort required.
Once reports were received, UNI assessment staff read them and entered pertinent data into a multi-year spreadsheet that includes categories for artifact descriptions, rubric use, and whether assessment was direct or indirect. Having passed through the hands of several assessment professionals, the spreadsheet reflects this history of changing ownership as well as the addition, closure, and reorganization of programs and departments. It also reflects changes in assessment practice; for example, the past few years include a newly deployed Promise Score that rates the magnitude of changes made by programs in response to assessment findings.

While inputting data for 2025, we wondered whether AI could do something useful with the data: specifically, whether it could take the legacy spreadsheet, standardize the data across years, and generate one or more useful measures. In other words, could AI make sense of the data? Assessment staff uploaded the most recent three years of the spreadsheet to a password-protected, institutional version of Microsoft Copilot to clean and organize the data (cf. Zawacki-Richter et al., 2019). As a trial, this work was done using prompts as opposed to building a dedicated agent. AI quickly standardized columns, resolved inconsistencies, and combined programs with multiple assessments (i.e., looking at more than one outcome) using a “best evidence wins” rule.

Building an Assessment Pursuit Index
The key insight was then asking the AI to go beyond mere organization and distill the data in a way that would measure how each program was doing in regard to expectations for assessment. Traditionally, this question would be beyond the time and ability of assessment personnel, given that the university has over 150 programs across four colleges. With AI, staff were able to derive a meaningful understanding of where programs stand, drawing from a large, messy, legacy spreadsheet that was otherwise underutilized (having been primarily used for tracking).
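As an illustration of the cleaning step described in the Background, the following is a minimal Python sketch of standardizing column names and applying a “best evidence wins” rule. It is not the prompt-driven process UNI used with Copilot. The column aliases, the yes/no encodings, and the ranking of evidence (direct assessment first, then rubric use) are all assumptions made for the example, since the article does not define what counts as “best.”

```python
# ASSUMPTION: hypothetical header variants mapped to canonical column names.
COLUMN_ALIASES = {
    "program": "program", "program name": "program",
    "year": "year", "report year": "year",
    "rubric": "rubric_used", "rubric used": "rubric_used",
    "type": "assessment_type", "assessment type": "assessment_type",
}


def standardize_row(raw):
    """Rename known columns to canonical names and trim string values."""
    row = {}
    for key, value in raw.items():
        name = COLUMN_ALIASES.get(key.strip().lower())
        if name is None:
            continue  # drop columns this sketch does not know about
        row[name] = value.strip() if isinstance(value, str) else value
    return row


def to_bool(value):
    """ASSUMPTION: these strings are the ways 'yes' was recorded."""
    return str(value).strip().lower() in {"y", "yes", "true", "1", "x"}


def evidence_rank(row):
    """ASSUMPTION: direct assessment outranks indirect, then rubric use."""
    direct = str(row.get("assessment_type", "")).strip().lower() == "direct"
    rubric = to_bool(row.get("rubric_used", ""))
    return (direct, rubric)


def best_evidence_wins(raw_rows):
    """Keep one row per (program, year): the one with the strongest evidence."""
    best = {}
    for raw in raw_rows:
        row = standardize_row(raw)
        key = (str(row["program"]).lower(), int(row["year"]))
        if key not in best or evidence_rank(row) > evidence_rank(best[key]):
            best[key] = row
    return best


rows = [
    {"Program Name": "History BA", "Report Year": "2024",
     "Assessment Type": "Indirect", "Rubric Used": "no"},
    {"program": "history ba ", "year": 2024, "type": "Direct", "rubric": "Yes"},
    {"Program": "Biology BS", "Year": "2024", "Type": "direct", "Rubric": ""},
]
merged = best_evidence_wins(rows)
```

In this example the two History BA rows collapse into one, and the row reporting direct assessment with a rubric is the one retained.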
For an institution without access to large assessment software packages, the scale and impact of this breakthrough should not be underestimated, because now, with a handful of prompts, UNI can generate program-specific rankings from an otherwise underutilized database. The resulting product, the API, takes into account both compliance and assessment quality by favoring programs that use real student work as evidence and conduct direct assessment (Banta & Blaich, 2011).

This brings up an important point: the distinction between the API and AI. Feeding prompts to the AI led to the creation of the API, but once created the API stands alone as a formula for creating rankings. However, no resource-stressed institution would support calculating scores by hand or writing code to pull data from large spreadsheets. The Assessment Pursuit Index has value on its own, but only achieves efficiency by harnessing the number-crunching power of AI.

The flexibility and ease of use of AI meant the institution could apply its own values in developing the API with just a prompt or two. For example, the Promise Score (an indicator of how much change occurs in response to assessment findings), unique to UNI and not published anywhere, does not favor the extremes (e.g., nothing to fix or a major curricular overhaul) but instead prefers scores in the middle range (e.g., assignments, instruction, outcomes, or rubrics being changed). Including the Promise Score became a matter of informing the AI how to interpret it (e.g., a middle score is best), whereas large, generalized software systems might not permit the addition of a “home brew” measure without major programming effort.

The result is a composite score ranging from 0 to 100 that takes many factors into account. Such a reduction in complexity, from spreadsheet to single number, allows faculty and administrators to know at a glance where individual programs stand. Below are the components of the API and how they were determined.
Compliance: Submission (30 points)
Quality: Form Use (10 points)
Quality: Rubric Use (20 points)
Quality: Assessment Type (30 points)
Quality: Promise Score
Evidence: Gate
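To make the arithmetic concrete, the components listed above can be combined as in the following Python sketch. The point values for Submission, Form Use, Rubric Use, and Assessment Type, and the category bands, are those reported in this article. Everything else is assumed for illustration: the 10-point weight and 1-5 scale for the Promise Score, zero credit for indirect assessment, no credit at all for an unsubmitted report, and an Evidence Gate that halves quality points when no real student work is present. It is one possible implementation, not UNI’s actual formula.

```python
from dataclasses import dataclass
from typing import Optional

# Point values stated in the article.
SUBMISSION_POINTS = 30
FORM_USE_POINTS = 10
RUBRIC_USE_POINTS = 20
ASSESSMENT_TYPE_POINTS = 30

# ASSUMPTIONS (not given in the article): Promise Score weight and scale,
# and the size of the Evidence Gate penalty.
PROMISE_POINTS = 10
PROMISE_MIN, PROMISE_MAX = 1, 5
EVIDENCE_GATE_FACTOR = 0.5

# Category floors from the article; anything below 40 is Noncompliant.
BANDS = [(85, "Exemplary"), (70, "Strong"), (55, "Developing"), (40, "Concerning")]


@dataclass
class ProgramYear:
    submitted: bool
    used_form: bool
    used_rubric: bool
    direct_assessment: bool
    promise_score: Optional[int]  # None if not recorded
    has_student_work: bool


def promise_points(score):
    """Full credit at the midpoint of the scale, none at either extreme."""
    if score is None:
        return 0.0
    mid = (PROMISE_MIN + PROMISE_MAX) / 2
    half_range = (PROMISE_MAX - PROMISE_MIN) / 2
    closeness = 1 - abs(score - mid) / half_range
    return PROMISE_POINTS * max(0.0, closeness)


def api_score(p):
    """API = Compliance + Quality, with Quality reduced by the Evidence Gate."""
    if not p.submitted:
        return 0.0  # ASSUMPTION: no report means nothing to score
    quality = (
        FORM_USE_POINTS * p.used_form
        + RUBRIC_USE_POINTS * p.used_rubric
        + ASSESSMENT_TYPE_POINTS * p.direct_assessment
        + promise_points(p.promise_score)
    )
    if not p.has_student_work:
        quality *= EVIDENCE_GATE_FACTOR
    return SUBMISSION_POINTS + quality


def category(score):
    """Map a 0-100 score to the article's standards-based category."""
    for floor, label in BANDS:
        if score >= floor:
            return label
    return "Noncompliant"


def average_api(yearly_scores):
    """Average the yearly API scores for one program."""
    return sum(yearly_scores) / len(yearly_scores)


strong_year = ProgramYear(True, True, True, True, 3, True)
weak_year = ProgramYear(True, False, True, False, 1, True)
two_year_average = average_api([api_score(strong_year), api_score(weak_year)])
```

Under these assumptions, a program with one full-credit year and one year that earns only Submission and Rubric Use points averages 75, which falls in the Strong band.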
To generate API scores for each program for the most recent two years of reports, the staff had the AI calculate a yearly API according to this formula: API = Compliance + Quality (modified by Evidence Gate as appropriate). This was done in order to look at how programs have done lately, not historically. The AI then sorted average API scores into the following standards-based categories (Suskie, 2018): Exemplary (100-85 points), Strong (84-70 points), Developing (69-55 points), Concerning (54-40 points), Noncompliant (39-0 points). Tying numeric scores to a single term makes the output easier to use. Assessment staff spot-checked the resulting output to confirm that programs known to be traditionally good, average, and deficient were ranked appropriately. While it is possible that some nuances or reevaluation might lead to slightly different scores, the output was ordinally accurate.

Assessment professionals at UNI then met with the university’s deans to review assessment findings, giving a brief presentation on this new method and its initial results while distributing college-sorted lists to the deans for information and feedback. One dean has already acted, requesting that the presentation be given to that college’s department heads. In that presentation more information was provided, including the specific calculated API scores for each program represented at the meeting. In keeping with UNI’s assessment culture, assessment staff saw these meetings as an information-sharing opportunity as well as a first step in incorporating this new tool for evaluating annual assessment reports into the school’s assessment practice.

Implications and Limitations
One of the benefits of adopting AI is the ability for institutions without dedicated software packages to automate parts of assessment analysis efficiently.
Use of AI also shows administrators and accreditors that assessment professionals at the institution are dedicated to developing and using advanced tools in pursuit of improvement. In other words, instead of improving assessment results alone, staff are improving assessment processes themselves. Objective ratings allow programs to compare their progress to that of others. API output allows not only quick identification of struggling programs but also insight into how they are struggling. With the API automation in place, staff can feed it a new spreadsheet annually, beginning the process of tracking these scores longitudinally.

That being said, an AI approach to assessment, while relieving the need to do time-consuming and technical work, still requires human involvement, especially for interpretation. Humans are needed to validate the entered data so that models built on it are accurate. Precisely defining data and variables, interpreting results in the context of an institution’s culture and history, communicating results appropriately, and promoting ethical use of data and results all require human oversight. This suggests AI might best serve institutions with mature assessment programs as opposed to institutions that are still building one out. Even so, with this API as an example, colleges and universities working on instituting assessment plans can begin to think about collecting and recording data in a way that would facilitate the implementation of something like the API at their own schools.

Conclusion
This case study demonstrates the potential of AI to transform legacy datasets into an evaluation system with ease and at low cost for assessment offices facing staffing and funding pressures at a variety of institution types. Developing a tool like the API allows institutions of higher learning to advance assessment with objective and transparent metrics.
In this way, AI can promote assessment efforts by providing faculty and administration with concrete systems for measurement, and thereby improve student learning.
References
Banta, T. W., & Blaich, C. (2011). Closing the assessment loop. Change: The Magazine of Higher Learning, 43(1), 22-27. https://doi.org/10.1080/00091383.2011.538642
Floridi, L., & Cowls, J. (2019). A unified framework of five principles for AI in society. Harvard Data Science Review, 1(1). https://hdsr.mitpress.mit.edu/pub/IOjsh9d1/release/8
Fulcher, K. H., Good, M. R., Coleman, C., & Smith, K. A. (2014). A simple model for learning improvement: Weigh pig, feed pig, weigh pig (Occasional Paper No. 23). National Institute for Learning Outcomes Assessment. https://files.eric.ed.gov/fulltext/ED555526.pdf
Holmes, W., Bialik, M., & Fadel, C. (2021). Artificial intelligence in education: Promises and implications. Center for Curriculum Redesign. https://vicasa.org/wp-content/uploads/2024/05/ArtificialIntelligenceinEducation.PromiseandImplicationsforTeachingandLearning.pdf
Ifenthaler, D., & Yau, J. Y. K. (2020). Utilizing learning analytics for study success: Reflections on current empirical findings. Research and Practice in Technology Enhanced Learning, 15(1), 1-13. https://doi.org/10.1186/s41039-020-00121-y
Jankowski, N., Timmer, J., Kinzie, J., & Kuh, G. (2018). Assessment that matters: Trending toward practices that document authentic student learning. National Institute for Learning Outcomes Assessment. https://files.eric.ed.gov/fulltext/ED590514.pdf
Suskie, L. (2018). Assessing student learning: A common sense guide (3rd ed.). Jossey-Bass.
Zawacki-Richter, O., Marín, V. I., Bond, M., & Gouverneur, F. (2019). Systematic review of research on artificial intelligence applications in higher education: Where are the educators? International Journal of Educational Technology in Higher Education, 16(39), 1-27. https://doi.org/10.1186/s41239-019-0171-0