EMERGING DIALOGUES IN ASSESSMENT

An AI API: A Case Study in Automated Scoring and Institutional Improvement

DO NOT CITE IN ANY CONTEXT WITHOUT PERMISSION OF THE AUTHOR

June 1, 2026

Donald H. Gaff
Department of Geography
University of Northern Iowa

Abstract

Artificial intelligence (AI) has rapidly expanded into higher education, including assessment practices. AI offers new ways to analyze, structure, and interpret institutional data. This article presents a case study on how one regional, comprehensive university utilized AI to analyze assessment data and develop a tool known as the Assessment Pursuit Index (API). It also situates the project within emerging scholarship on AI in educational assessment, arguing that leveraging AI’s capacity for programming and handling data can enhance institutional effectiveness by providing a low-cost approach and generating insights through the use of existing data, especially when paired with transparent methodological controls and human oversight.

Introduction

Assessment is central to student learning, institutional improvement, program review, and accreditation. Even so, the collection and analysis of assessment reports can vary in quality owing to a variety of factors. Reporting is inconsistent because programs submit reports in different file types and formats, use dissimilar samples of student work, and employ unique disciplinary perspectives. This is especially true at institutions that do not use large, expensive software packages that offer some degree of standardization. The result is vast stores of data that do not exhibit a high degree of consistency across years, especially when individuals responsible for recording and using the data rotate through assessment positions. In the end, many institutions possess assessment data but lack assessment insight. Fortunately, AI provides a low-cost, efficient way to use assessment data, and this paper presents an instance of AI being leveraged to create a tool to work with assessment datasets.

Artificial intelligence (AI) is rapidly being adopted in higher education in both academics and administration, including assessment. Research demonstrates AI’s capacity to assist in decision-making with its power to process and analyze data (Holmes et al., 2021; Ifenthaler & Yau, 2020). At the same time that computing is making such advances, institutions and assessment professionals continue to wrestle with reporting, incongruent data, and limited resources, both human and technological (Jankowski et al., 2018).

This case study illustrates how AI can advance assessment efforts by creating a system of evaluation that can efficiently process and analyze assessment’s large, messy datasets, especially for schools with small assessment teams or limited resources for dedicated assessment software. Specifically, this article showcases the development and use of an Assessment Pursuit Index (API), which evaluates programs on both compliance and quality of reporting. Reviewing the development and use of the API showcases the power of AI to standardize assessment results while increasing transparency. An AI tool like the API also matters for improving student learning because it provides analysis and rankings quickly, putting useful information in the hands of decision-makers in a meaningful timeframe for action. In turn, acting on that information should be reflected in an improving API rank over time.

Background

The University of Northern Iowa (UNI) has a mature assessment culture and, like many institutions, collects annual assessment reports from programs that include documentation of learning outcomes, student work samples, and rubric use. Datasets compiled from these reports suffer from mixed reporting styles, varied assessment approaches, and variation in data coding over the years. These kinds of datasets require restructuring in the context of assessment management (Fulcher et al., 2014) and this is often left undone or falls behind other assessment efforts owing to the resources and effort required.

Once received, UNI assessment staff read these reports and entered pertinent data into a multi-year spreadsheet that includes categories for artifact descriptions, rubric use, and whether assessment was direct or indirect. Spanning the involvement of different assessment professionals, the spreadsheet reflects this history of different ownership as well as the addition, closure, and reorganization of programs and departments. It also reflects growth and change in assessment practice; for example, the past few years include a newly deployed Promise Score that rates the magnitude of changes made by programs in response to assessment findings. While inputting data for 2025, it occurred to us to see if AI could do something with the data, specifically if it could take the legacy spreadsheet, standardize the data across years, and generate one or more useful measures. In other words, could AI make sense of the data? Assessment staff uploaded the most recent three years of the spreadsheet to a password-protected, institutional version of Microsoft Copilot to clean and organize the data (cf. Zawacki-Richter et al., 2019). As a trial, this work was done using prompts as opposed to building a dedicated agent. AI quickly standardized columns, resolved inconsistencies, and combined programs with multiple assessments (i.e., looking at more than one outcome) using a “best evidence wins” rule.
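The article does not spell out how the “best evidence wins” rule was applied, so the following is only a minimal Python sketch of one plausible reading. It assumes that direct assessment outranks indirect assessment, that rubric use breaks ties, and that each spreadsheet row carries the hypothetical keys program, year, assessment_type, and rubric_used. None of these names come from UNI's spreadsheet, and the actual work was done through Copilot prompts rather than hand-written code.

```python
# Hypothetical ranking of evidence strength; the article does not define one.
EVIDENCE_RANK = {"none": 0, "indirect": 1, "direct": 2}


def normalize(row):
    """Return a copy of the row with the assessment type label standardized."""
    cleaned = dict(row)
    cleaned["assessment_type"] = row["assessment_type"].strip().lower()
    return cleaned


def best_evidence_wins(rows):
    """Keep one row per (program, year): the row with the strongest evidence."""
    best = {}
    for row in map(normalize, rows):
        key = (row["program"], row["year"])
        # Tuples compare element by element: assessment type first, then rubric use.
        strength = (EVIDENCE_RANK[row["assessment_type"]], bool(row["rubric_used"]))
        if key not in best or strength > best[key][0]:
            best[key] = (strength, row)
    return [row for _, row in best.values()]


# Made-up rows for illustration only; these are not UNI data.
rows = [
    {"program": "Program A", "year": 2025, "assessment_type": "Indirect ", "rubric_used": False},
    {"program": "Program A", "year": 2025, "assessment_type": "direct", "rubric_used": True},
    {"program": "Program B", "year": 2025, "assessment_type": "none", "rubric_used": False},
]
for row in best_evidence_wins(rows):
    print(row["program"], row["assessment_type"])
```

In this reading, Program A's two assessments collapse into the single row that documents direct assessment with a rubric, while Program B keeps its only row.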

Building an Assessment Pursuit Index

The key insight was then asking the AI to go beyond mere organization and distill the data in a way that would measure how each program was doing in regard to expectations for assessment. Traditionally, this question would be beyond the time and ability of assessment personnel, given the university has over 150 programs across four colleges. With AI, staff were able to derive a meaningful understanding of where programs stand, drawing from a large, messy, legacy spreadsheet that was otherwise underutilized (having been primarily used for tracking). For an institution without access to large assessment software packages, the scale and impact of this breakthrough should not be underestimated because now, with a handful of prompts, UNI can generate program-specific rankings from an otherwise underutilized database.

The resulting product, the API, takes into account both compliance and assessment quality by favoring programs using real student work as evidence and conducting direct assessment (Banta & Blaich, 2011). This brings up an important point: distinguishing between the API and AI. Feeding prompts to the AI led to the creation of the API, but once created the API stands alone as a formula to create rankings. However, no resource-stressed institution would support calculating scores by hand or writing code to pull data from large spreadsheets to generate them. The Assessment Pursuit Index has value on its own, but only achieves efficiency by harnessing the number-crunching power of AI.

The flexibility and ease of use of AI meant the institution could use its own values in developing the API with just a prompt or two. For example, the Promise Score (an indicator of how much change occurs in response to assessment findings), unique to UNI and not published anywhere, does not favor the extremes (e.g., nothing to fix or major curricular overhaul) but instead prefers scores in the middle range (e.g., assignments, instruction, outcomes, or rubrics being changed). Including the Promise Score became a matter of informing the AI how to interpret it (e.g., a middle score is best), whereas large, generalized software systems might not permit the addition of a “home brew” measure without major programming effort.

The result, then, is a composite score ranging from 0 to 100 that takes into account many factors. Such a reduction in complexity, from spreadsheet to single number, allows faculty and administrators to know where individual programs stand at a glance. Below are the components of the API and how they were determined.

Compliance: Submission (30 points)

Compliance receives some weight, but the majority of points are reserved for quality.
Submitted reports receive 30 points; no submission receives 0 points. Incomplete reports receive full points, the idea being that submitting even a partial report counts as a recognition of compliance, if not compliance itself.

Quality: Form Use (10 points)

The institution released a new template for reporting in Fall 2023 and the old form is being phased out and will no longer be accepted in Fall 2026. Whether a program adopted the new form or relied on the old one is taken as a measure of the extent to which a program is engaged with current practices.
Use of the new form is 10 points, as is the submission of an accreditation report (which UNI accepts in lieu of an assessment report); using the old form is worth 2 points (i.e., partial credit for at least using a form, even if obsolete).

Quality: Rubric Use (20 points)

UNI values the evaluation of authentic student work using rubrics.
Rubric use is 20 points; no evidence of rubric use is 0 points.

Quality: Assessment Type (30 points)

UNI values direct assessment of student work.
Direct assessment is 30 points, indirect assessment is 8 points (acknowledging assessment is at least being done), and no evidence of assessment is 0 points.

Quality: Promise Score (10 points)

The Promise Score is a newly developed measure indicating how much a program changes in response to assessment findings. The preference is for a score in the middle of the range, as too low means no changes are made and too high means large curricular changes are taking place. Please note that while a major curriculum change or adding new classes in response to assessment results is good on occasion, major changes across all departments and programs every year are considered too much.
Promise Score of 0 = 0 points, 1 = 6 points, 2 = 10 points, 3 = 6 points, and 4 = 2 points.

Evidence: Gate

To avoid over-penalizing incomplete reports and to keep to principles of fairness (Floridi & Cowls, 2019), the model distinguishes between reports submitted without key elements and reports not submitted at all.
Incomplete reports receive compliance points but not quality points.
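The components above can be read as a small scoring routine. The Python sketch below is one hedged interpretation and not the prompts or output UNI used: the point values come from the text, while the Report fields, the category codes ("new", "old", "accreditation", "direct", "indirect", "none"), and the complete flag used for the Evidence Gate are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass

# Point values taken from the component descriptions in the text.
SUBMISSION_POINTS = 30
FORM_POINTS = {"new": 10, "accreditation": 10, "old": 2}
RUBRIC_POINTS = 20
ASSESSMENT_POINTS = {"direct": 30, "indirect": 8, "none": 0}
PROMISE_POINTS = {0: 0, 1: 6, 2: 10, 3: 6, 4: 2}


@dataclass
class Report:
    """One program's annual report; field names and codes are hypothetical."""
    submitted: bool
    complete: bool = True          # assumed flag for "submitted with key elements"
    form: str = "old"              # "new", "old", or "accreditation"
    rubric_used: bool = False
    assessment_type: str = "none"  # "direct", "indirect", or "none"
    promise_score: int = 0         # 0 through 4


def yearly_api(report: Report) -> int:
    """Compute API = Compliance + Quality, with the Evidence Gate applied."""
    if not report.submitted:
        return 0
    compliance = SUBMISSION_POINTS
    if not report.complete:
        # Evidence Gate: incomplete reports keep compliance points only.
        return compliance
    quality = (
        FORM_POINTS[report.form]
        + (RUBRIC_POINTS if report.rubric_used else 0)
        + ASSESSMENT_POINTS[report.assessment_type]
        + PROMISE_POINTS[report.promise_score]
    )
    return compliance + quality


# A complete report on the new form with a rubric, direct assessment,
# and a mid-range Promise Score earns the maximum of 100 points.
top = Report(submitted=True, form="new", rubric_used=True,
             assessment_type="direct", promise_score=2)
print(yearly_api(top))  # 100
```

Keeping the point tables as plain dictionaries mirrors the flexibility the article describes: changing an institutional value, such as the preferred Promise Score, means editing one entry rather than rewriting the calculation.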

To generate API scores for each program for the most recent two years of reports, the staff had the AI calculate a yearly API according to this formula: API = Compliance + Quality (modified by Evidence Gate as appropriate). This was done in order to look at how programs have done lately, not historically. The AI then sorted each program's average API score into the following standards-based categories (Suskie, 2018): Exemplary (100-85 points), Strong (84-70 points), Developing (69-55 points), Concerning (54-40 points), Noncompliant (39-0 points). Tying numeric scores to a single term increases the ease of use of the output.
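The averaging and categorization step might be sketched in Python as follows. The band boundaries come from the text; the function names and the sample scores are hypothetical. Because the published bands are stated in whole points, the sketch assumes each band's lower bound acts as a floor, so a fractional average such as 84.5 falls in Strong. The article does not say how such cases were handled.

```python
from statistics import mean

# Category floors from the text, highest first. Assumption: each band
# extends from its floor up to (but not including) the next floor.
BANDS = [
    (85, "Exemplary"),
    (70, "Strong"),
    (55, "Developing"),
    (40, "Concerning"),
    (0, "Noncompliant"),
]


def categorize(average_score):
    """Map an average API score (0 to 100) to its category label."""
    if not 0 <= average_score <= 100:
        raise ValueError("API scores range from 0 to 100")
    for floor, label in BANDS:
        if average_score >= floor:
            return label


def rank_programs(yearly_scores):
    """Average each program's yearly scores and sort from highest to lowest."""
    averages = {name: mean(scores) for name, scores in yearly_scores.items()}
    ranked = sorted(averages.items(), key=lambda item: item[1], reverse=True)
    return [(name, avg, categorize(avg)) for name, avg in ranked]


# Made-up two-year scores for illustration only; these are not UNI results.
sample = {
    "Program A": [100, 94],
    "Program B": [42, 30],
    "Program C": [76, 62],
}
for name, avg, label in rank_programs(sample):
    print(name, avg, label)
```

Sorting by the two-year average before labeling keeps the focus on recent performance, in line with the stated aim of looking at how programs have done lately rather than historically.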

Assessment staff spot-checked the resulting output to confirm that scores for programs known to be traditionally good, average, and deficient were ranked appropriately. While it is possible some nuances or reevaluation might lead to slightly different scores, the output was ordinally accurate. Assessment professionals at UNI then met with the university’s deans to review assessment findings, gave a brief presentation on this new method and initial results, and distributed college-sorted lists to deans for information and feedback. One dean has already acted, requesting that the presentation be provided to that college’s department heads. In that presentation more information was provided, including specific calculated API scores for each program represented at the meeting. In keeping with UNI’s assessment culture, assessment staff saw these meetings as an information-sharing opportunity as well as a first step in beginning to incorporate this new tool for evaluating annual assessment reports into the school’s assessment practice.
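The spot check of programs known to be good, average, and deficient was done by staff, but the idea of ordinal accuracy can be expressed as a simple test. The Python sketch below is an assumption about how such a check could be automated, not a description of what UNI did: the tier labels, the function name, and the sample scores are all hypothetical, and it requires every benchmark program in a higher tier to strictly outscore every benchmark program in a lower tier.

```python
# Hypothetical ordering of the tiers that staff expect from experience.
TIER_ORDER = {"deficient": 0, "average": 1, "good": 2}


def ordinally_consistent(scores, expected_tiers):
    """Return True if higher-tier benchmark programs always outscore lower-tier ones."""
    items = [(TIER_ORDER[tier], scores[name]) for name, tier in expected_tiers.items()]
    return all(
        score_a > score_b
        for tier_a, score_a in items
        for tier_b, score_b in items
        if tier_a > tier_b
    )


# Made-up benchmark programs and scores for illustration only.
expected = {"Program A": "good", "Program B": "average", "Program C": "deficient"}
scores = {"Program A": 92, "Program B": 64, "Program C": 30}
print(ordinally_consistent(scores, expected))
```

A failed check would not show which score is wrong; it would only flag the benchmark set for the kind of human review the article describes.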

Implications and Limitations

One of the benefits of adopting AI is the ability for institutions without dedicated software packages to automate parts of assessment analysis efficiently. Use of AI also shows administrators and accreditors that assessment professionals at the institution are dedicated to developing and using advanced tools in pursuit of improvement. In other words, instead of improving assessment results alone, staff are improving assessment processes themselves. Objective ratings allow programs to compare their progress to others. API output also allows not only for quick identification of struggling programs but also for information about how they are struggling. With the API automation in place, it can be fed a new spreadsheet annually, beginning the process of tracking these scores longitudinally.

That being said, an AI approach to assessment, while relieving the need to do time-consuming and technical work, still requires human involvement, especially for interpretation. Humans are needed to validate the entered data so that models built on it are accurate. Precisely defining data and variables, interpreting results in the context of an institution’s culture and history, communicating results appropriately, and promoting ethical use of data and results all require human oversight. This suggests AI might best serve institutions with mature assessment programs as opposed to institutions that are still building one out. Even so, with this API as an example, colleges and universities working on instituting assessment plans can begin to think about collecting and recording data in a way that would facilitate the implementation of something like the API for their own schools.

Conclusion

This case study demonstrates the potential of AI to transform legacy datasets into an evaluation system with ease and at low cost for assessment offices facing staffing and funding pressures at a variety of institution types. Developing a tool like an API allows institutions of higher learning to advance assessment with objective and transparent metrics. In this way, AI can promote assessment efforts by providing faculty and administration with concrete systems for measurement, and thereby improve student learning.

References

Banta, T. W., & Blaich, C. (2011). Closing the assessment loop. Change: The Magazine of Higher Learning, 43(1), 22-27. https://doi.org/10.1080/00091383.2011.538642

Floridi, L., & Cowls, J. (2019). A unified framework of five principles for AI in society. Harvard Data Science Review, 1(1). https://hdsr.mitpress.mit.edu/pub/IOjsh9d1/release/8

Fulcher, K. H., Good, M. R., Coleman, C., & Smith, K. A. (2014). A simple model for learning improvement: Weigh pig, feed pig, weigh pig (Occasional Paper No. 23). National Institute for Learning Outcomes Assessment. https://files.eric.ed.gov/fulltext/ED555526.pdf

Holmes, W., Bialik, M., & Fadel, C. (2021). Artificial intelligence in education: Promises and implications. Center for Curriculum Redesign. https://vicasa.org/wp-content/uploads/2024/05/ArtificialIntelligenceinEducation.PromiseandImplicationsforTeachingandLearning.pdf

Ifenthaler, D., & Yau, J. Y. K. (2020). Utilizing learning analytics for study success: Reflections on current empirical findings. Research and Practice in Technology Enhanced Learning, 15(1), 1-13. https://doi.org/10.1186/s41039-020-00121-y

Jankowski, N., Timmer, J., Kinzie, J., & Kuh, G. (2018). Assessment that matters: Trending toward practices that document authentic student learning. National Institute for Learning Outcomes Assessment. https://files.eric.ed.gov/fulltext/ED590514.pdf

Suskie, L. (2018). Assessing student learning: A common sense guide (3rd ed.). Jossey-Bass.

Zawacki-Richter, O., Marín, V. I., Bond, M., & Gouverneur, F. (2019). Systematic review of research on artificial intelligence applications in higher education: Where are the educators? International Journal of Educational Technology in Higher Education, 16(39), 1-27. https://doi.org/10.1186/s41239-019-0171-0