QUICK REVIEW

[Paper Review] Towards Responsible Development of Generative AI for Education: An Evaluation-Driven Approach

Irina Jurenka, Markus Kunesch|arXiv (Cornell University)|2024. 05. 21.

Online Learning and Analytics | Citations: 17

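One-line summary

This paper presents LearnLM-Tutor, a text-based educational AI tutor built on Gemini 1.0, along with seven pedagogical benchmarks for evaluating and improving its pedagogical capabilities, developed through an evaluation-driven, participatory methodology.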

ABSTRACT

A major challenge facing the world is the provision of equitable and universal access to quality education. Recent advances in generative AI (gen AI) have created excitement about the potential of new technologies to offer a personal tutor for every learner and a teaching assistant for every teacher. The full extent of this dream, however, has not yet materialised. We argue that this is primarily due to the difficulties with verbalising pedagogical intuitions into gen AI prompts and the lack of good evaluation practices, reinforced by the challenges in defining excellent pedagogy. Here we present our work collaborating with learners and educators to translate high level principles from learning science into a pragmatic set of seven diverse educational benchmarks, spanning quantitative, qualitative, automatic and human evaluations; and to develop a new set of fine-tuning datasets to improve the pedagogical capabilities of Gemini, introducing LearnLM-Tutor. Our evaluations show that LearnLM-Tutor is consistently preferred over a prompt tuned Gemini by educators and learners on a number of pedagogical dimensions. We hope that this work can serve as a first step towards developing a comprehensive educational evaluation framework, and that this can enable rapid progress within the AI and EdTech communities towards maximising the positive impact of gen AI in education.

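Research motivation and goals

Develop a responsible, evaluation-focused generative AI tutor to promote equitable access to quality education.
Translate learning-science principles into practical pedagogical improvements using Gemini 1.0.
Establish a comprehensive, multifaceted evaluation framework for assessing the pedagogical capabilities of AI tutors.
Co-design the tutor with learners and educators so that it fits real-world needs and constraints.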

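Proposed method

Fine-tune Gemini 1.0 for 1:1 conversational tutoring to develop LearnLM-Tutor (SFT; RLHF was considered as a later step but not implemented in this study).
Build and deploy a set of seven pedagogical benchmarks spanning quantitative, qualitative, automatic, and human evaluations, as laid out in the evaluation taxonomy.
Collect and curate high-quality fine-tuning data grounded in participatory design principles and educational materials (e.g., shared lesson materials and videos).
Use a fast automatic evaluation loop and a slow human evaluation loop to guide iterative model improvement.
Adopt participatory design methods, including workshops, interviews, and Wizard-of-Oz sessions with learners and educators, to define goals and evaluation criteria.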
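
As a rough illustration of the two evaluation loops described above, the sketch below tags benchmarks along the two axes named in the review (quantitative vs. qualitative, automatic vs. human) and routes automatic ones to the fast loop and human ones to the slow loop. The class names, the `split_by_loop` helper, and the benchmark entries are all hypothetical placeholders, not the paper's actual benchmarks or tooling.

```python
from dataclasses import dataclass
from enum import Enum


class Measure(Enum):
    QUANTITATIVE = "quantitative"
    QUALITATIVE = "qualitative"


class Rater(Enum):
    AUTOMATIC = "automatic"
    HUMAN = "human"


@dataclass(frozen=True)
class Benchmark:
    name: str  # hypothetical placeholder, not a benchmark name from the paper
    measure: Measure
    rater: Rater


def split_by_loop(benchmarks):
    """Group benchmark names into the fast (automatic) and slow (human) loops."""
    loops = {"fast": [], "slow": []}
    for b in benchmarks:
        key = "fast" if b.rater is Rater.AUTOMATIC else "slow"
        loops[key].append(b.name)
    return loops


# Placeholder entries only; the paper's seven benchmarks are not reproduced here.
EXAMPLE = [
    Benchmark("placeholder_auto_metric", Measure.QUANTITATIVE, Rater.AUTOMATIC),
    Benchmark("placeholder_human_rating", Measure.QUANTITATIVE, Rater.HUMAN),
    Benchmark("placeholder_interview", Measure.QUALITATIVE, Rater.HUMAN),
]

print(split_by_loop(EXAMPLE))
```

Keeping the two axes as separate fields makes it possible to place each benchmark at a distinct point in the taxonomy, which is the property the review attributes to the paper's framework.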

Figure 1: LearnLM-Tutor Development: overview of our approach to responsible development of gen AI for education. Bold arrows show the development flow, dotted arrows the information flow. Our approach starts and ends with participation. We start by answering the questions of “who are we trying t

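Experimental results

Research questions

RQ1: What core pedagogical capabilities does an AI tutor need in order to support 1:1 education?
RQ2: How can a participatory, multidisciplinary process inform the development and evaluation of an educational AI tutor?
RQ3: To what extent does the fine-tuned model (LearnLM-Tutor) outperform the prompt-based baseline on the benchmarks?
RQ4: What ethical, safety, and policy considerations arise when deploying educational gen AI at scale?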

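Key results

LearnLM-Tutor is consistently preferred by educators and learners over prompt-based Gemini on several pedagogical dimensions.
The seven-benchmark evaluation framework can capture the pedagogical capabilities of AI tutors broadly.
Participatory design methods effectively ground model improvements in real learning materials and learners' needs.
Fine-tuning on high-quality, grounded tutoring data enables more pedagogically aligned behavior than prompting alone.
The study highlights limitations and safety/ethical considerations that require continued attention in education-focused AI deployment.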
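
To make the "consistently preferred" finding concrete, here is a minimal sketch of how side-by-side preference votes could be tallied into rates. The function name, the vote labels, and the toy votes are illustrative assumptions; they are not data or code from the paper, and the paper's actual rating protocol may differ.

```python
from collections import Counter

# Hypothetical labels for a side-by-side comparison (not taken from the paper).
LABELS = ("tuned", "prompted", "tie")


def preference_rates(votes):
    """Return the fraction of votes for each label in LABELS."""
    counts = Counter(votes)
    unknown = set(counts) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no votes to tally")
    return {label: counts[label] / total for label in LABELS}


# Toy votes invented purely for illustration.
toy_votes = ["tuned", "tuned", "prompted", "tie"]
print(preference_rates(toy_votes))
```

In practice such rates would be computed separately for each pedagogical dimension and rater group, since the review reports preference across several dimensions rather than a single overall score.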

Figure 2: Overview of the evaluation taxonomy introduced in Section 4.3.2 that underpins the seven pedagogical evaluation benchmarks introduced in this report. Each benchmark is unique in its place within the taxonomy and comes with its own benefits and challenges. Together, these different benchma


This review was generated by AI and reviewed by a human editor.