Evaluating Copilot as an Essay Rater: Interrater Agreement and Rasch Analysis of Ethical Reasoning Scores
Performance assessments are essential for evaluating student learning but remain difficult to scale due to cost, time, and variability in human scoring. While automated essay scoring systems have been used in high-stakes settings, they typically require extensive training datasets and show limited task generalizability. Recent advances in generative artificial intelligence suggest that large language models may address some of these constraints. This study examined whether Microsoft Copilot can function as a reliable rater of ethical reasoning essays. Copilot rated 100 student essays using a five-criterion rubric aligned with the Eight Key Questions framework. Scores were compared with human ratings using descriptive statistics, interrater reliability and agreement indices, and a Rating Scale Many-Facet Rasch Model. Results showed that Copilot's severity fell within the range of the human raters and that it used the rating scale categories consistently; however, its alignment with human raters was criterion-dependent. Qualitative analyses indicated a tendency to over-reward emotionally salient narratives. Overall, the findings suggest that generative AI is best positioned as a supplemental rater within hybrid human-AI scoring systems rather than as a replacement for human raters.
