QUICK REVIEW

[Paper Review] Towards Responsible Development of Generative AI for Education: An Evaluation-Driven Approach

Irina Jurenka, Markus Kunesch | arXiv (Cornell University) | May 21, 2024

Online Learning and Analytics | Cited by 17

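One-line summary

This paper introduces LearnLM-Tutor, a text-based AI tutor for education built on Gemini 1.0, and presents an evaluation-driven, participatory approach that uses seven pedagogical benchmarks to assess and improve its teaching capabilities.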

ABSTRACT

A major challenge facing the world is the provision of equitable and universal access to quality education. Recent advances in generative AI (gen AI) have created excitement about the potential of new technologies to offer a personal tutor for every learner and a teaching assistant for every teacher. The full extent of this dream, however, has not yet materialised. We argue that this is primarily due to the difficulties with verbalising pedagogical intuitions into gen AI prompts and the lack of good evaluation practices, reinforced by the challenges in defining excellent pedagogy. Here we present our work collaborating with learners and educators to translate high level principles from learning science into a pragmatic set of seven diverse educational benchmarks, spanning quantitative, qualitative, automatic and human evaluations; and to develop a new set of fine-tuning datasets to improve the pedagogical capabilities of Gemini, introducing LearnLM-Tutor. Our evaluations show that LearnLM-Tutor is consistently preferred over a prompt tuned Gemini by educators and learners on a number of pedagogical dimensions. We hope that this work can serve as a first step towards developing a comprehensive educational evaluation framework, and that this can enable rapid progress within the AI and EdTech communities towards maximising the positive impact of gen AI in education.

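Motivation and objectives

Promote equitable access to quality education by developing a responsible, evaluation-focused generative AI tutor.
Translate principles from learning science into practical pedagogical improvements to Gemini 1.0.
Establish a comprehensive, multi-faceted evaluation framework for assessing the pedagogical capabilities of AI tutors.
Co-design the tutor with learners and educators so that it fits real-world needs and constraints.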

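Proposed method

Develop LearnLM-Tutor by fine-tuning Gemini 1.0 for one-on-one conversational tutoring (SFT only; a subsequent RLHF stage is not implemented in this work).
Create and introduce a suite of seven pedagogical benchmarks spanning quantitative, qualitative, automatic, and human evaluations (as shown in the evaluation taxonomy figure).
Collect high-quality fine-tuning data grounded in co-design principles and educational materials (e.g., shared lesson materials and videos).
Guide model improvement iteratively with a fast automatic-evaluation loop and a slower human-evaluation loop.
Incorporate participatory co-design methods involving learners and educators (workshops, interviews, Wizard-of-Oz sessions) to define goals and evaluation criteria.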
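
The two-speed evaluation idea in the list above (a fast automatic loop plus a slower human loop) can be illustrated with a minimal Python sketch. This is not the paper's code: the class name EvalLoop, the human_every cadence, and the stub evaluators are all hypothetical, chosen only to show how cheap checks might run on every iteration while expensive human ratings run less often.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class EvalLoop:
    """Hypothetical two-speed evaluation scheduler (illustrative only)."""

    auto_eval: Callable[[str], float]   # cheap automatic metric, run on every checkpoint
    human_eval: Callable[[str], float]  # costly human rating, run only occasionally
    human_every: int = 3                # assumed cadence, not taken from the paper
    log: List[Tuple[str, str, float]] = field(default_factory=list)

    def run(self, checkpoints: List[str]) -> List[Tuple[str, str, float]]:
        for i, ckpt in enumerate(checkpoints, start=1):
            # Fast loop: every checkpoint gets an automatic score.
            self.log.append((ckpt, "auto", self.auto_eval(ckpt)))
            # Slow loop: only every `human_every`-th checkpoint goes to human raters.
            if i % self.human_every == 0:
                self.log.append((ckpt, "human", self.human_eval(ckpt)))
        return self.log


# Usage with stub evaluators standing in for real metrics and raters.
loop = EvalLoop(auto_eval=lambda c: 0.5, human_eval=lambda c: 0.9, human_every=2)
results = loop.run(["ckpt-1", "ckpt-2", "ckpt-3", "ckpt-4"])
```

In this sketch the automatic scores would steer day-to-day iteration, while the sparser human scores act as the slower check on whether the automatic signal still tracks what educators and learners value.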

Figure 1: LearnLM-Tutor Development: overview of our approach to responsible development of gen AI for education. Bold arrows show the development flow, dotted arrows the information flow. Our approach starts and ends with participation. We start by answering the questions of “who are we trying t

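Experimental results

Research questions

RQ1: What core pedagogical capabilities should an AI tutor have in order to support one-on-one education?
RQ2: How do co-design and multidisciplinary processes benefit the development and evaluation of AI tutors for education?
RQ3: By how much does the fine-tuned model (LearnLM-Tutor) outperform a prompt-tuned baseline on the pedagogical benchmarks?
RQ4: What ethical, safety, and policy considerations arise when educational generative AI is deployed at scale?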

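Key findings

LearnLM-Tutor is consistently preferred over prompt-tuned Gemini by both educators and learners on several pedagogical dimensions.
The seven-benchmark evaluation framework captures a broad range of an AI tutor's pedagogical capabilities.
Co-design methods effectively support model improvements grounded in real learning materials and learner needs.
Fine-tuning on high-quality, grounded tutoring data improves pedagogical behavior and alignment more than prompting alone.
The study highlights limitations and safety and ethical challenges that require ongoing attention when deploying education-focused AI.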
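
A finding such as "preferred over a baseline" is typically backed by side-by-side ratings. The review does not say how the paper tests significance, so the following Python sketch swaps in a generic exact two-sided sign test as one plausible way to summarize such votes. The function name preference_summary, the vote labels, and the toy votes are hypothetical and are not results from the paper.

```python
from math import comb
from typing import Dict, List, Optional


def preference_summary(votes: List[str]) -> Dict[str, Optional[float]]:
    """Summarize pairwise votes ('A', 'B', or 'tie'); ties are dropped."""
    a = sum(v == "A" for v in votes)
    b = sum(v == "B" for v in votes)
    n = a + b
    if n == 0:
        return {"a_rate": None, "p_value": 1.0, "n": 0}
    # Exact two-sided sign test under H0: each system is equally likely to win.
    k = max(a, b)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return {"a_rate": a / n, "p_value": min(1.0, 2 * tail), "n": n}


# Toy votes for illustration only: 'A' stands for the fine-tuned model,
# 'B' for the prompt-tuned baseline.
toy_votes = ["A"] * 8 + ["B"] * 2 + ["tie"] * 3
summary = preference_summary(toy_votes)
```

Dropping ties is a simplification. A real analysis would likely report them separately and break results down per pedagogical dimension and per rater group.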

Figure 2: Overview of the evaluation taxonomy introduced in Section 4.3.2 that underpins the seven pedagogical evaluation benchmarks introduced in this report. Each benchmark is unique in its place within the taxonomy and comes with its own benefits and challenges. Together, these different benchma

This review was generated by AI and checked by a human editor.