EMERGING DIALOGUES IN ASSESSMENT

Augmenting Professional Judgment Through AI-Supported Assessment Feedback

June 1, 2026

Alyce J. Odasso, Ph.D., Associate Director of Institutional Effectiveness, Texas A&M University

Abstract

Providing consistent, developmental feedback on assessment documentation is a persistent challenge in large institutions. This article describes the design and early testing of an AI-supported tool that helps reviewers generate criteria-aligned, improvement-focused feedback while preserving human judgment. The approach demonstrates how AI can support more consistent and scalable assessment processes.

Introduction

Assessment offices invest considerable time in consultations, workshops, and resources to support meaningful engagement in programmatic assessment. Yet some of the most valuable support that programs receive comes through feedback on their documentation (Kuh & Ikenberry, 2009), where expectations are clarified and practices refined.

Especially at large, decentralized institutions, feedback quality and consistency may vary dramatically. This variation may stem from insufficient time, limited assessment or disciplinary expertise, or differences in tone. Assessment should be a reflective and improvement-focused activity (Banta & Palomba, 2015; Suskie, 2018), but feedback processes may reinforce a compliance-only attitude if reviewers are looking primarily at whether required elements are present. Alternatively, reviewers may understand what ‘good assessment’ looks like but struggle to provide consistent feedback amid competing responsibilities.

To address both challenges, an AI tool was developed to augment reviewer judgment and strengthen the quality and consistency of assessment feedback. Interest in artificial intelligence in higher education has grown rapidly in recent years, with potential uses emerging across teaching, learning, and administrative processes (Jafari & Keykha, 2023).

Institutional Context

At Texas A&M University, academic program assessment follows a multi-role workflow. Program coordinators submit assessment plans in the spring and reports in the fall. At both stages, reviewers appointed by each college provide formative feedback on assessment documentation. The assessment office then assigns report ratings (Exemplary, Sufficient, Needs Improvement, or Noncompliant).

In the 2024-2025 assessment cycle, 451 academic programs participated in this process. College size varies widely, so reviewers in larger colleges may review 50-60 documents twice annually, each time within a month-long window. Although the assessment office provides guidance and resources (including standardized feedback statements for common issues), a review of workflow data revealed both inconsistency in feedback quality and a significant bottleneck at the college review stage. Across two cycles, between 55% and 71% of documents did not receive feedback within the designated window, and 19% received no feedback.

The consequences were measurable: reports that received timely feedback were far more likely to meet reporting requirements and demonstrate Sufficient or Exemplary quality. In a survey, reviewers expressed a desire for more structured support to increase confidence and effectiveness. These findings underscored the need for additional scaffolding within the feedback process.

Designing the AI Feedback Tool

Texas A&M’s AI hub allows users to create and share models with identified groups. The assessment office used this feature to develop an Assessment Feedback tool to aid reviewers in providing feedback on assessment documentation.

Design Principles

First and most importantly, the AI tool should support, not replace, reviewer judgment. The system requires certain inputs before the model generates a substantive response. For example, if a report is uploaded to the tool with a prompt to “review the report and provide feedback,” the model returns a directive to “send 2-4 key observations from your initial review.” This preserves the element of human review. To reinforce the model’s role as a support rather than a replacement, each output includes a confidence indicator that conveys the tool’s degree of certainty and prompts reviewers to exercise professional judgment.

Second, any feedback generated by the tool must be criteria-aligned, developmental in nature, and grounded in the submission’s specific context. The tool was designed to generate feedback explicitly anchored in published institutional criteria, ensuring that comments reference established expectations. Developmental feedback is framed to guide refinement rather than emphasize compliance or weaknesses. Context-grounded feedback is responsive to the submission’s narrative.

Third, the AI tool must reinforce institutional standards. To preserve institutional authority over assessment standards, Google search capability was disabled. The model was intentionally constrained to internal guidelines and resources uploaded to the system’s knowledge base. This prevents the tool from introducing external standards or redefining what constitutes compliance.

How the Tool Works

Reviewers either upload a PDF or copy and paste sections of assessment documentation into the interface. As previously noted, the tool will not generate feedback until the reviewer supplies at least two key observations. These observations might include strengths, concerns, or alignment issues. Once this condition is met, the tool generates feedback in a structured format:

What’s Working, identifying strengths with contextualized explanations.
Areas Needing Attention, highlighting alignment issues or unmet reporting requirements. For example, “Rubric/criteria mismatch (high priority): The criteria referenced in the target statement does not match the rubric criteria listed in the measure description.”
Questions for Consideration, inviting reflection rather than simply corrective action. For example, “Given that results are already very high, what is one targeted refinement the program could pilot (e.g., to reduce “Fair” ratings on the rubric)?”
Paste-Ready Comment Options, short statements reviewers may include in their feedback.
Confidence Level, indicating the model’s degree of certainty based on the information provided.
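
The gate and the five-part output described above can be sketched in a few lines of Python. This is an illustration only, not the tool's actual implementation: the names FeedbackDraft, request_feedback, and stub_model are hypothetical, and the model call is replaced by a stand-in function so the example runs on its own. Only the two-observation minimum, the directive text, and the five section titles come from the description above.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Union

MIN_OBSERVATIONS = 2  # the tool requires at least two reviewer observations
GATE_MESSAGE = "Send 2-4 key observations from your initial review."


@dataclass
class FeedbackDraft:
    """Hypothetical container mirroring the five output sections."""
    whats_working: List[str] = field(default_factory=list)
    areas_needing_attention: List[str] = field(default_factory=list)
    questions_for_consideration: List[str] = field(default_factory=list)
    paste_ready_comments: List[str] = field(default_factory=list)
    confidence_level: str = "Not stated"

    def render(self) -> str:
        """Lay the sections out in the order the article lists them."""
        sections = [
            ("What's Working", self.whats_working),
            ("Areas Needing Attention", self.areas_needing_attention),
            ("Questions for Consideration", self.questions_for_consideration),
            ("Paste-Ready Comment Options", self.paste_ready_comments),
        ]
        lines: List[str] = []
        for title, items in sections:
            lines.append(title)
            lines.extend(f"  - {item}" for item in items)
        lines.append(f"Confidence Level: {self.confidence_level}")
        return "\n".join(lines)


def request_feedback(
    document_text: str,
    observations: List[str],
    generate: Callable[[str, List[str]], FeedbackDraft],
) -> Union[str, FeedbackDraft]:
    """Return the gate directive until enough observations are supplied."""
    # Ignore blank entries so empty strings cannot satisfy the gate.
    noted = [obs.strip() for obs in observations if obs.strip()]
    if len(noted) < MIN_OBSERVATIONS:
        return GATE_MESSAGE
    return generate(document_text, noted)


def stub_model(document_text: str, observations: List[str]) -> FeedbackDraft:
    """Stand-in for the model call (assumption: the real tool calls an LLM)."""
    return FeedbackDraft(
        whats_working=[observations[0]],
        areas_needing_attention=observations[1:],
        confidence_level="Medium",
    )


# A bare request is turned back; a request with observations is answered.
blocked = request_feedback("report text", [], stub_model)
draft = request_feedback(
    "report text",
    ["Outcomes are clearly stated", "Target does not match the rubric"],
    stub_model,
)
```

Keeping the gate outside the generation step means a request without reviewer observations never reaches the model, which is one simple way to preserve the human review step the design calls for.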

Early Insights from Internal Testing

These insights are from structured internal testing rather than full implementation, as the pilot will launch in April 2026. The tool was tested with historical assessment documentation, and outputs were reviewed against established criteria.

Patterns Observed

Feedback consistently referenced published criteria, and any preference-based or stylistic commentary was minimal but appropriate (e.g., “This section would read more clearly if it was simplified”). In addition, the tone of feedback reflected a developmental framing. Instead of “policing” language rooted in compliance, the feedback was improvement-focused and consistently included reflective prompts and constructive suggestions to strengthen the report.

Across disciplines and contexts, feedback also followed a standardized structure. This consistency suggests the potential to reduce variability caused by different levels of reviewer experience.

The model also demonstrated adaptability in response to ad-hoc prompts. Follow-up questions—such as asking what changes might elevate the report to an Exemplary level—elicited substantive responses grounded in the same institutional criteria used for standard feedback. This suggests the tool may support exploratory and reflective interactions, extending usefulness beyond initial report review.

Limitations and Boundaries

While these patterns are promising, several limitations and boundaries were also identified through testing. These considerations are critical for understanding the appropriate role of the tool within the broader assessment process.

The tool cannot fully interpret disciplinary nuance and does not understand curricular and programmatic resource constraints. Professional judgment and reviewers’ contextual knowledge therefore remain essential. Testing also revealed that the tool occasionally stops short of identifying additional issues beyond those already noted in the reviewer’s pre-reflection input.

When prompted to elaborate, the tool sometimes produced feedback that was more generic than desired. Analysis quality was also influenced by how reports were provided to the system. Outputs tended to be more detailed when text was pasted into the interface than when PDFs were uploaded, likely due to how the model interpreted headers and section prompts.

Confidence levels decreased when only partial sections were provided, reflecting the reduced ability to evaluate submissions holistically. Finally, feedback quality depends heavily on the completeness of the internal knowledge base that anchors the model’s responses. Ensuring relevant guidelines, resources, and examples are included in system instructions is essential.

These observations reinforce the need for continued testing and refinement. Broader implementation will help to evaluate the tool’s performance across disciplines and different formats. Optional standard prompts will also be developed to support common use cases, such as generating general feedback, identifying revisions that could elevate a report from Needs Improvement to Sufficient, or prompting deeper analysis beyond the initial output. A version of the tool with Google search enabled will also be tested to evaluate whether access to external sources can enhance contextual understanding while maintaining alignment with institutional standards.
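
As one possible shape for the optional standard prompts mentioned above, the sketch below stores them as named templates. The template wording, the dictionary keys, and the build_prompt helper are all hypothetical; the prompts the assessment office eventually develops may look quite different. Only the three use cases and the four rating labels are taken from this article.

```python
# Rating labels as listed in the Institutional Context section.
RATINGS = ("Exemplary", "Sufficient", "Needs Improvement", "Noncompliant")

# Hypothetical template wording for the three use cases named above.
STANDARD_PROMPTS = {
    "general_feedback": (
        "Using my observations, draft general feedback on this document."
    ),
    "elevate_rating": (
        "Identify revisions that could elevate this report "
        "from {current} to {target}."
    ),
    "deeper_analysis": (
        "Identify any additional issues beyond those in the initial output."
    ),
}


def build_prompt(use_case: str, **ratings: str) -> str:
    """Fill a template; in this sketch every keyword value must be a rating."""
    for value in ratings.values():
        if value not in RATINGS:
            raise ValueError(f"Unknown rating: {value!r}")
    return STANDARD_PROMPTS[use_case].format(**ratings)


prompt = build_prompt(
    "elevate_rating", current="Needs Improvement", target="Sufficient"
)
```

Checking rating names before filling the template keeps a mistyped label from reaching the model, where it might be read as a standard the institution does not use.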

Ethical and Practical Considerations

AI-supported assessment feedback raises several ethical and practical considerations. Because the system operates within a secure institutional AI environment and analyzes reports that contain aggregated program-level data rather than student-level information, privacy risks associated with the content itself are limited. However, institutions must determine whether and how to disclose AI-supported feedback processes to programs. Transparency needs to be addressed explicitly in reviewer training, alongside an explanation of the model’s confidence indicators.

The model is not currently informed by examples of prior feedback. Incorporating such documentation into its knowledge base would require periodic auditing to ensure that existing blind spots or biases are not reproduced. Finally, questions remain about the tool’s impact on reviewer development. While some may worry that AI could reduce opportunities for reviewers to build assessment expertise, it may also serve as a form of professional development by modeling criteria-aligned feedback.

Conclusion

Providing consistent, developmental feedback across large and decentralized institutions remains a persistent challenge in program assessment. AI-supported tools like this one offer a way to augment reviewer judgment and align feedback with institutional criteria. By scaffolding reviewer observations and generating structured commentary, such tools may also reduce the administrative burden associated with reviewing large volumes of assessment documentation. Ultimately, strengthening formative feedback processes can help programs produce higher-quality documentation of student learning, supporting both institutional improvement efforts and accreditation expectations.

References

Banta, T. W., & Palomba, C. A. (2015). Assessment essentials: Planning, implementing, and improving assessment in higher education (2nd ed.). Jossey-Bass.

Jafari, F., & Keykha, A. (2023). Identifying the opportunities and challenges in artificial intelligence in higher education: A qualitative study. Journal of Applied Research in Higher Education, 16(4), 1228-1245. https://doi.org/10.1108/JARHE-09-2023-0426

Kuh, G., & Ikenberry, S. (2009). More than you think, less than we need. National Institute for Learning Outcomes Assessment. https://www.planningaccreditationboard.org/wp-content/uploads/2021/04/2009NILOutcomesAssess.pdf

Suskie, L. (2018). Assessing student learning: A common sense guide (3rd ed.). Jossey-Bass.