QUICK REVIEW

[논문 리뷰] NERIF: GPT-4V for Automatic Scoring of Drawn Models

Gyeong-Geon Lee, Xiaoming Zhaı|arXiv (Cornell University)|2023. 11. 21.

Genetics, Bioinformatics, and Biomedical Research인용 수 9

한 줄 요약

논문은 GPT-4V를 이용한 프롬프트 엔지니어링 방법인 NERIF를 소개하여 instructional notes 및 rubrics와 함께 Few-shot 학습으로 학생이 그린 과학 모델을 자동으로 평가하고, 중간 정도의 테스트 정확도와 설명 가능한 채점을 달성한다.

ABSTRACT

Scoring student-drawn models is time-consuming. Recently released GPT-4V provides a unique opportunity to advance scientific modeling practices by leveraging the powerful image processing capability. To test this ability specifically for automatic scoring, we developed a method NERIF (Notation-Enhanced Rubric Instruction for Few-shot Learning) employing instructional note and rubrics to prompt GPT-4V to score students' drawn models for science phenomena. We randomly selected a set of balanced data (N = 900) that includes student-drawn models for six modeling assessment tasks. Each model received a score from GPT-4V ranging at three levels: 'Beginning,' 'Developing,' or 'Proficient' according to scoring rubrics. GPT-4V scores were compared with human experts' scores to calculate scoring accuracy. Results show that GPT-4V's average scoring accuracy was mean =.51, SD = .037. Specifically, average scoring accuracy was .64 for the 'Beginning' class, .62 for the 'Developing' class, and .26 for the 'Proficient' class, indicating that more proficient models are more challenging to score. Further qualitative study reveals how GPT-4V retrieves information from image input, including problem context, example evaluations provided by human coders, and students' drawing models. We also uncovered how GPT-4V catches the characteristics of student-drawn models and narrates them in natural language. At last, we demonstrated how GPT-4V assigns scores to student-drawn models according to the given scoring rubric and instructional notes. Our findings suggest that the NERIF is an effective approach for employing GPT-4V to score drawn models. Even though there is space for GPT-4V to improve scoring accuracy, some mis-assigned scores seemed interpretable to experts. The results of this study show that utilizing GPT-4V for automatic scoring of student-drawn models is promising.

연구 동기 및 목표

과학 교육에서 학생이 그린 모델의 자동 채점 필요성에 대한 동기를 제공하여 시간 절약 및 시의적 피드백 제공.
학생이 그린 모델을 채점하기 위해 GPT-4V의 이미지 처리 및 언어 능력을 활용한 프롬프트 기반 방법(NERIF) 개발.
여섯 가지 모델링 작업에 대해 인간 전문가 점수에 비해 GPT-4V의 성능 평가.
교수 노트와 루브릭이 해석 가능한 채점 결과를 가능하게 하는 방법 시연.

제안 방법

9개의 예시 평가로 몇샷 학습 접근법을 사용하여 GPT-4V를 프롬프트로 삼아 삼항 분류(Beginning, Developing, Proficient) 수행.
쿼리당 두 개의 첨부 이미지 제공: 채점 예시가 포함된 문제 맥락과 학생이 그린 모델; 채점 안내를 위한 프롬프트의 임의 예시를 가져와 가이드로 사용.
Notation-Enhanced Scoring Rubrics를 세 가지 구성요소로 통합: 채점 측면, 능숙 규칙, 교수 노트.
프롬프트를 반복적으로 다듬기 위한 검증(N=54) 수행 후 테스트 채점(N=900)을 그리디 디코딩(온도 0, top_p 0.01)으로 수행.
정확도, 정밀도, 재현율, F1, 및 Fleiss’ Kappa를 사용하여 평가; 혼동 행렬을 분석하여 오분류를 이해.

실험 결과

연구 질문

RQ1GPT-4V가 학생이 그린 모델을 자동으로 채점하는 정확도는 어느가?
RQ2제공된 루브릭과 노트를 사용하여 GPT-4V가 학생이 그린 모델에 점수를 자동으로 할당하는 방법은 어떠한가?

주요 결과

항목	정확도	Acc_Beg	Acc_Dev	Acc_Prof	정밀도	재현율	F1	카파
R1-1	0.50	0.50	0.66	0.34	0.56	0.50	0.50	0.44
J2-1	0.45	0.68	0.56	0.12	0.62	0.45	0.41	0.32
M3-1	0.53	0.82	0.40	0.36	0.53	0.53	0.51	0.51
H4-1	0.57	0.64	0.68	0.38	0.61	0.57	0.56	0.51
H5-1	0.47	0.62	0.58	0.22	0.53	0.47	0.46	0.43
J6-1	0.53	0.62	0.84	0.12	0.62	0.53	0.48	0.38

6개 항목에서의 평균 테스트 채점 정확도: 0.51 (SD = 0.037).
평균 정밀도, 재현율, F1은 각각 0.58, 0.51, 0.49; Fleiss’ Kappa는 0.32에서 0.51까지(공정에서 보통).
범주별 정확도: Beginning 0.64, Developing 0.61, Proficient 0.26, Proficient가 GPT-4V에 더 도전적임.
검증 정확도 평균 0.67(Beginning 0.78, Developing 0.67, Proficient 0.56)으로 6개 항목에서 나타남.
GPT-4V는 입력 이미지에서 문제 맥락과 채점 예시를 검색하고 채점 구성요소에 대한 자연어 근거를 생성할 수 있다.
예시 시연(few-shot prompts) 및 교수 노트를 추가하면 채점 품질이 향상된다는 결과를 보인다.

더 나은 연구,지금 바로 시작하세요

연구 설계부터 논문 작성까지, 연구 시간을 획기적으로 줄여보세요.

카드 등록 없음 · 무료 플랜 제공

이 리뷰는 AI가 만들고, 인간 에디터가 검토했습니다.