QUICK REVIEW

[Paper Review] Beyond human subjectivity and error: a novel AI grading system

Alexandra Gobrecht, Felix Tuma|arXiv (Cornell University)|2024. 05. 07.

Machine Learning and Data Classification | Citations: 7

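One-line summary

This paper presents an automatic short answer grading (ASAG) system built on a fine-tuned open-source transformer model trained on exam data from university courses across a range of disciplines, and shows in a benchmark study that its grades agree with the official grades more closely than those of human re-graders.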

ABSTRACT

The grading of open-ended questions is a high-effort, high-impact task in education. Automating this task promises a significant reduction in workload for education professionals, as well as more consistent grading outcomes for students, by circumventing human subjectivity and error. While recent breakthroughs in AI technology might facilitate such automation, this has not been demonstrated at scale. In this paper, we introduce a novel automatic short answer grading (ASAG) system. The system is based on a fine-tuned open-source transformer model which we trained on a large set of exam data from university courses across a large range of disciplines. We evaluated the trained model's performance against held-out test data in a first experiment and found high accuracy levels across a broad spectrum of unseen questions, even in unseen courses. We further compared the performance of our model with that of certified human domain experts in a second experiment: we first assembled another test dataset from real historical exams - the historic grades contained in that data were awarded to students in a regulated, legally binding examination process; we therefore considered them as ground truth for our experiment. We then asked certified human domain experts and our model to grade the historic student answers again without disclosing the historic grades. Finally, we compared the hence obtained grades with the historic grades (our ground truth). We found that for the courses examined, the model deviated less from the official historic grades than the human re-graders - the model's median absolute error was 44 % smaller than the human re-graders', implying that the model is more consistent than humans in grading. These results suggest that leveraging AI-enhanced grading can reduce human subjectivity, improve consistency and thus ultimately increase fairness.

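Research motivation and goals

Reduce the workload of grading open-ended questions and mitigate human subjectivity and error in higher education.
Develop a scalable ASAG system using a fine-tuned transformer model trained on diverse, multidisciplinary exam data.
Evaluate generalization to unseen questions and unseen courses, and compare AI grading with human re-graders on historical exam data.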

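Proposed method

A large open-source transformer model is fine-tuned on a dataset containing the question (Q), the reference answer (A_ref), the maximum score, and the student answer (A), and is then used to predict scores.
The input/output design has the model emit a numeric score for a given tuple (Q, A_ref, x_ref, A).
The data is split into S_train, S_develop, and S_test for supervised fine-tuning and evaluation, with S_test including unseen questions and unseen courses.
Performance is evaluated with regression metrics (MAE, RMSE, Pearson correlation coefficient) on the held-out test sets, including on normalized scores.
A human benchmark experiment re-grades 1,600 question-answer pairs from 16 courses and compares the model's deviation from the official grades with that of human re-graders.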
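
The evaluation step described above can be sketched in a few lines of plain Python. This is only an illustration, not the authors' code: the `GradedAnswer` record, its field names, and the sample values are hypothetical, and normalizing a score by dividing it by the question's maximum score (expressed in percentage points) is an assumption about how the normalized scores are obtained.

```python
import math
from dataclasses import dataclass


@dataclass
class GradedAnswer:
    """Hypothetical record: one student answer with official and predicted points."""
    max_score: float   # maximum achievable points for the question
    true_score: float  # officially awarded points
    pred_score: float  # points predicted by the model


def mae(errors):
    """Mean absolute error."""
    return sum(abs(e) for e in errors) / len(errors)


def rmse(errors):
    """Root mean squared error."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def evaluate(answers, normalize=False):
    """Compute MAE, RMSE and Pearson r on raw or normalized scores."""
    def scale(answer, score):
        # Assumed normalization: share of the maximum score, in percentage points.
        return 100.0 * score / answer.max_score if normalize else score

    true = [scale(a, a.true_score) for a in answers]
    pred = [scale(a, a.pred_score) for a in answers]
    errors = [p - t for p, t in zip(pred, true)]
    return {"mae": mae(errors), "rmse": rmse(errors), "pearson": pearson(true, pred)}


# Hypothetical toy data, not taken from the paper.
sample = [
    GradedAnswer(max_score=4, true_score=3, pred_score=3),
    GradedAnswer(max_score=10, true_score=6, pred_score=8),
    GradedAnswer(max_score=2, true_score=0, pred_score=1),
    GradedAnswer(max_score=5, true_score=5, pred_score=4),
]

raw = evaluate(sample)
norm = evaluate(sample, normalize=True)
print(raw)
print(norm)
```

Reporting both variants is useful because raw errors are dominated by questions with many points, while normalized errors weight every question equally regardless of its maximum score.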

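Experimental results

Research questions

RQ1: Does the ASAG system generalize to unseen questions and unseen courses while matching or exceeding the accuracy of human graders?
RQ2: On the archived benchmark, does the ASAG system deviate less from the officially recorded historic grades than human re-graders do?
RQ3: How does question difficulty (maximum score) affect grading accuracy and model performance?
RQ4: How does the ASAG system's performance differ between raw scores and normalized scores?
RQ5: What is a practical path for introducing AI-based grading into high-stakes educational settings?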

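Key results

The ASAG system generalizes well to unseen questions (MAE about 1.32–1.44, RMSE about 2.27–2.41, correlation coefficient about 0.69–0.78 across splits).
On normalized scores, performance is consistent across splits, with an MAE of about 15.6–18.6 percentage points and a correlation coefficient of about 0.61–0.64.
In the human benchmark of 1,600 items from 16 courses, the ASAG system's RMSE (3.061 points) and mean deviation (0.183 percentage points) were closer to the official grades than those of the four human re-graders (RMSE 4.566 points, mean deviation 0.289 points).
The model's deviation from the official grades was smaller than the human re-graders' in 15 of the 16 courses, and the model's median absolute error was 44% lower than that of the human re-graders.
Performance tends to drop on questions with a high maximum score (18 points), which are underrepresented in the training data.
The authors propose an automated grading framework in which initial deployment takes place at the assisted-correction levels (levels 1-2), emphasizing risk management and regulatory considerations.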
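
The benchmark comparison above boils down to measuring how far each grader's scores lie from the official grades and then expressing the model's error relative to the humans' error. The sketch below shows that arithmetic under stated assumptions: the function names and the toy score lists are hypothetical, and only the two RMSE values (3.061 and 4.566) come from the results listed in this review.

```python
from statistics import median


def median_absolute_error(graded, official):
    """Median of the absolute differences between re-graded and official scores."""
    return median(abs(g - o) for g, o in zip(graded, official))


def relative_reduction(model_value, human_value):
    """Fraction by which the model's error is smaller than the humans' error."""
    return 1.0 - model_value / human_value


# Hypothetical toy scores, not taken from the paper.
official = [4.0, 7.0, 2.0, 9.0, 5.0]
model = [4.0, 6.0, 3.0, 9.0, 4.0]
human = [2.0, 7.0, 4.0, 6.0, 5.0]

model_medae = median_absolute_error(model, official)
human_medae = median_absolute_error(human, official)
toy_reduction = relative_reduction(model_medae, human_medae)
print(f"toy median absolute error reduction: {toy_reduction:.0%}")

# Same formula applied to the RMSE values reported in the key results.
rmse_reduction = relative_reduction(3.061, 4.566)
print(f"reported RMSE reduction: {rmse_reduction:.1%}")
```

Applied to the reported RMSE values, the formula gives a reduction of roughly 33%. The 44% figure refers to the median absolute error, whose underlying values are not reproduced in this review, so the two percentages are not expected to match.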

This review was generated by AI and reviewed by a human editor.