QUICK REVIEW

[Paper Review] S-GRADES -- Studying Generalization of Student Response Assessments in Diverse Evaluative Settings

Tasfia Seuti, Sagnik Ray Choudhury|arXiv (Cornell University)|2026. 03. 10.

Topic Modeling | Citations: 0

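One-Line Summary

S-GRADES consolidates 14 AES and ASAG datasets into a single web-based benchmark with standardized evaluation, and uses it to analyze LLMs across multiple reasoning strategies and cross-dataset exemplar transfer.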

ABSTRACT

Evaluating student responses, from long essays to short factual answers, is a key challenge in educational NLP. Automated Essay Scoring (AES) focuses on holistic writing qualities such as coherence and argumentation, while Automatic Short Answer Grading (ASAG) emphasizes factual correctness and conceptual understanding. Despite their shared goal, these paradigms have progressed in isolation with fragmented datasets, inconsistent metrics, and separate communities. We introduce S-GRADES (Studying Generalization of Student Response Assessments in Diverse Evaluative Settings), a web-based benchmark that consolidates 14 diverse grading datasets under a unified interface with standardized access and reproducible evaluation protocols. The benchmark is fully open-source and designed for extensibility, enabling continuous integration of new datasets and evaluation settings. To demonstrate the utility of S-GRADES, we evaluate three state-of-the-art large language models across the benchmark using multiple reasoning strategies in prompting. We further examine the effects of exemplar selection and cross-dataset exemplar transfer. Our analyses illustrate how benchmark-driven evaluation reveals reliability and generalization gaps across essay and short-answer grading tasks, highlighting the importance of standardized, cross-paradigm assessment.

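Research Motivation and Goals

Consolidate AES and ASAG datasets into a single standardized evaluation platform.
Provide web-based infrastructure with reproducible evaluation and a public leaderboard.
Evaluate state-of-the-art LLMs on diverse grading tasks under multiple reasoning configurations.
Investigate how the stability of exemplar selection and cross-dataset exemplar transfer affect grading performance.
Highlight the generalization gap between essay and short-answer grading to encourage cross-paradigm evaluation.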

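Proposed Method

Collect and standardize 14 AES and ASAG datasets, preserving each dataset's original scoring scale and applying consistent preprocessing.
Implement a FastAPI-based platform for dataset distribution, submission validation, evaluation, and leaderboard tracking.
Evaluate three large language models (GPT-4o-mini, Gemini 2.5 Flash, Llama 4 Scout) under six reasoning configurations (Ind, Ded, Abd, Ind+Abd, Ind+Ded, Ded+Abd).
Use systematic prompts built from a multi-part template that enforces consistent reasoning and output constraints.
Assess stability through ablation studies on exemplar selection (inductive setups with different seeds) and an analysis of decoding randomness (temperature).
Study generalization by analyzing exemplar transfer across datasets, both within and across the AES and ASAG paradigms.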
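
The evaluation grid and the seeded exemplar selection described above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual code: the helper names `build_grid` and `sample_exemplars`, the exemplar pool format, and the choice of `k` are all assumptions made for the example. Only the model names and the six reasoning configurations come from the review.

```python
import itertools
import random

# Model names and reasoning configurations as listed in the review.
MODELS = ["GPT-4o-mini", "Gemini 2.5 Flash", "Llama 4 Scout"]
REASONING_CONFIGS = ["Ind", "Ded", "Abd", "Ind+Abd", "Ind+Ded", "Ded+Abd"]


def build_grid(models, configs):
    """Return every (model, reasoning config) pair to evaluate."""
    return list(itertools.product(models, configs))


def sample_exemplars(pool, k, seed):
    """Pick k few-shot exemplars reproducibly for a given seed.

    Hypothetical helper: a local RNG keeps the draw independent of
    any global random state, so the same seed gives the same exemplars.
    """
    rng = random.Random(seed)
    return rng.sample(pool, k)


if __name__ == "__main__":
    grid = build_grid(MODELS, REASONING_CONFIGS)
    print(len(grid), "model/config runs per dataset")

    # Hypothetical exemplar pool of (response, score) pairs.
    pool = [(f"response {i}", i % 4) for i in range(20)]
    for seed in (0, 1, 2):
        print(seed, sample_exemplars(pool, k=3, seed=seed))
```

Repeating the inductive setup with several seeds, as in the last loop, is what makes it possible to separate the effect of exemplar choice from the effect of the model or reasoning configuration.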

Figure 2: Complete benchmark submission interface.

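Experimental Results

Research Questions

RQ1: Which LLMs and reasoning strategies are best suited to the different grading paradigms (AES vs. ASAG)?
RQ2: How do few-shot exemplar selection and cross-dataset transfer affect grading performance and generalization?
RQ3: How stable are model predictions under different exemplar selections and decoding randomness?
RQ4: How does cross-paradigm transfer (AES to ASAG and vice versa) affect grading accuracy?
RQ5: What generalization gaps exist between essay and short-answer grading under standardized evaluation?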

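Key Findings

Hybrid reasoning strategies (e.g., Ind+Ded) generally outperform single strategies across datasets.
GPT-4o-mini shows high consistency on ASAP-AES but is more variable on other AES datasets and on ASAG tasks.
Gemini-2.5-Flash delivers balanced performance and strong cross-domain robustness, particularly on Rice_Chem and ASAG tasks.
ASAG tasks show higher variability and lower absolute performance than AES, suggesting greater transfer difficulty.
Cross-dataset exemplar transfer often degrades performance, although positive transfer does appear in some cases when structured exemplars from particular datasets are used.
Exemplar stability is high for some models (e.g., Gemini-2.5-Flash) and low for others, indicating model-dependent sensitivity to exemplar selection.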
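
One simple way to quantify the exemplar stability mentioned above is the spread of a metric across exemplar seeds. The sketch below assumes per-seed scores have already been computed. The function name `seed_stability` and the example numbers are purely illustrative; they are not results or code from the paper, and the review does not say which stability measure the authors use.

```python
import statistics


def seed_stability(scores_by_seed):
    """Summarize how much a metric varies across exemplar seeds.

    scores_by_seed maps seed -> metric value (e.g. an agreement score).
    Returns (mean, sample standard deviation); a smaller deviation
    means the model is less sensitive to which exemplars were drawn.
    """
    values = list(scores_by_seed.values())
    if len(values) < 2:
        raise ValueError("need scores from at least two seeds")
    return statistics.mean(values), statistics.stdev(values)


if __name__ == "__main__":
    # Made-up numbers for illustration only.
    stable = {0: 0.70, 1: 0.71, 2: 0.69}
    unstable = {0: 0.55, 1: 0.72, 2: 0.40}
    print("stable model:  ", seed_stability(stable))
    print("unstable model:", seed_stability(unstable))
```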

Figure 3: Public leaderboard displaying aggregated results across all datasets and evaluation metrics.


This review was generated by AI and reviewed by a human editor.