QUICK REVIEW

[Paper Review] Language Models as Science Tutors

Alexis Chevalier, Jiayi Geng | arXiv (Cornell University) | Feb 16, 2024

Innovative Teaching and Learning Methods | Citations: 5

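One-sentence summary

This paper introduces TutorEval, a long-context science QA benchmark, and TutorChat, a dataset of synthetic dialogues about textbooks, to train and evaluate LM tutors for STEM education. It shows that fine-tuning on scientific text and TutorChat substantially improves performance on TutorEval and on math tasks.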

ABSTRACT

NLP has recently made exciting progress toward training language models (LMs) with strong scientific problem-solving skills. However, model development has not focused on real-life use-cases of LMs for science, including applications in education that require processing long scientific documents. To address this, we introduce TutorEval and TutorChat. TutorEval is a diverse question-answering benchmark consisting of questions about long chapters from STEM textbooks, written by experts. TutorEval helps measure real-life usability of LMs as scientific assistants, and it is the first benchmark combining long contexts, free-form generation, and multi-disciplinary scientific knowledge. Moreover, we show that fine-tuning base models with existing dialogue datasets leads to poor performance on TutorEval. Therefore, we create TutorChat, a dataset of 80,000 long synthetic dialogues about textbooks. We use TutorChat to fine-tune Llemma models with 7B and 34B parameters. These LM tutors specialized in math have a 32K-token context window, and they excel at TutorEval while performing strongly on GSM8K and MATH. Our datasets build on open-source materials, and we release our models, data, and evaluations.

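Research motivation and objectives

Motivate, in the setting of science education, the need for realistic long-context LM assistance that goes beyond short-context benchmarks.
Create TutorEval, a long-context, expert-written question set spanning multiple STEM fields, to evaluate the tutoring ability of LMs.
Develop TutorChat, a large-scale dataset of long-context dialogues about textbooks, for fine-tuning LM tutors.
Show that fine-tuning on existing dialogue data alone is insufficient, and that scientific text and TutorChat data are essential for strong TutorEval performance.
Show that long-context models specialized in science and math can rival strong baselines on TutorEval, GSM8K, and MATH.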

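Proposed method

Build TutorEval, a set of 834 questions about textbook chapters spanning Math, CS, Physics, Environment, and Life sciences (chapters average about 1,800 words, with a maximum of 6,100 words).
Annotate each question with ground-truth key points that a correct answer should cover; the LM evaluator uses these key points to guide its grading.
Use GPT-4 as the evaluator to score LM tutor outputs against the ground-truth key points, and measure how well its scores correlate with human judgments.
Create TutorChat by generating 78K long synthetic dialogues about textbook chapters (80K dialogues after extension), using GPT-3.5-Turbo and GPT-4-Turbo.
Extend the LM context window to 32K tokens (long context) and fine-tune Llemma-7B-32K on TutorChat and on the MathMix dataset (TutorChat-STEM + MetaMath).
Propose MathMix (TutorChat-STEM + MetaMath), which improves math ability while maintaining TutorEval performance.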
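The MathMix idea of combining two data sources can be illustrated with a minimal sketch. This review does not state how the two sources are weighted or ordered, so plain concatenation followed by a seeded shuffle is an assumption here, and the function name `build_mixture` and the record layout are hypothetical.

```python
import random


def build_mixture(tutorchat_stem, metamath, seed=0):
    """Combine two lists of training examples into one shuffled list.

    Assumption: simple concatenation with a deterministic shuffle.
    The actual MathMix proportions are not given in this review.
    """
    # Tag every example with its source so the mix can be inspected later.
    mixed = [{"source": "TutorChat-STEM", "example": ex} for ex in tutorchat_stem]
    mixed += [{"source": "MetaMath", "example": ex} for ex in metamath]
    # A dedicated Random instance keeps the shuffle reproducible
    # without touching the global random state.
    random.Random(seed).shuffle(mixed)
    return mixed
```

Tagging each record with its source makes it easy to check the resulting proportions or to reweight one source later.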

Figure 1: Example from TutorEval. Given the chapter, the student asks a question to the LM Tutor. Both the chapter and the question are fed to the LM Tutor to generate the answer. GPT-4 assesses the generation by referencing the human-annotated key points (blue: the tutoring task; yellow: evalua
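Key-point grading of this kind can be sketched as a prompt builder plus a score parser. This is not the paper's grading prompt: the wording, the 0-to-3 scale (`max_score`), the `Score: N` reply format, and both function names are assumptions made for illustration. The sketch deliberately stops short of calling any model API.

```python
import re


def build_grading_prompt(question, answer, key_points, max_score=3):
    """Assemble a grading prompt from a question, a tutor answer, and key points.

    Assumption: the prompt wording and the 0..max_score scale are hypothetical.
    """
    bullet_list = "\n".join(f"- {point}" for point in key_points)
    return (
        "You are grading a tutor's answer to a student's question.\n\n"
        f"Question:\n{question}\n\n"
        f"Key points a correct answer should cover:\n{bullet_list}\n\n"
        f"Tutor answer:\n{answer}\n\n"
        f"Reply with 'Score: N', where N is an integer from 0 to {max_score}."
    )


def parse_score(reply, max_score=3):
    """Extract the integer score from a grader reply, or None if absent or out of range."""
    match = re.search(r"Score:\s*(\d+)", reply)
    if match is None:
        return None
    score = int(match.group(1))
    return score if 0 <= score <= max_score else None
```

Returning `None` for a missing or out-of-range score lets the caller decide whether to retry the grader or discard the sample instead of silently recording a bad value.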

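Experimental results

Research questions

RQ1: How can LM tutors be evaluated effectively on long-context science tasks, beyond final-answer accuracy?
RQ2: Does training on scientific text and long-context dialogue data improve TutorEval performance compared with base models and dialogue-only fine-tuning?
RQ3: How does combining TutorChat with math-focused data (MetaMath) affect the balance between math problem solving and general science tutoring?
RQ4: Can open-book versus closed-book settings reveal the strengths and limitations of long-context science tutors?
RQ5: How do different base models and data mixtures perform on TutorEval, GSM8K, and MATH?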
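The open-book versus closed-book distinction in RQ4 comes down to whether the chapter is placed in the tutor's context. A minimal sketch follows, assuming a plain-text prompt layout; the section labels and the function name `build_tutor_prompt` are hypothetical and not taken from the paper.

```python
def build_tutor_prompt(question, chapter=None):
    """Build a tutor prompt; pass a chapter for open-book, omit it for closed-book.

    Assumption: the 'Chapter' / 'Student question' labels are illustrative only.
    """
    parts = []
    if chapter is not None:
        # Open-book: the full chapter text is placed ahead of the question.
        parts.append("Chapter:\n" + chapter.strip())
    parts.append("Student question:\n" + question.strip())
    parts.append("Tutor answer:")
    return "\n\n".join(parts)
```

Calling the function with and without `chapter` on the same question gives a matched pair of prompts, which is what makes the two settings directly comparable.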

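Key findings

TutorEval is a challenging long-context benchmark spanning multiple scientific fields; it requires both processing textbook content and advanced scientific knowledge.
When used as the TutorEval evaluator, GPT-4 correlates well with human judgments.
Fine-tuning on scientific documents and TutorChat improves TutorEval performance substantially over base models and over dialogue-only fine-tuning.
Long-context models (32K tokens) trained on math and science data (MathMix) achieve strong math problem-solving performance while remaining competitive on TutorEval.
Data mixtures such as MathMix (TutorChat-STEM + MetaMath) combine strong math ability (GSM8K/MATH) with stable TutorEval results, outperforming several baselines.
The quality of the TutorChat data (dialogues generated by GPT-4) can mitigate sycophancy and improve robustness to misleading questions; open-book dialogues outperform closed-book ones in many cases.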
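Agreement between an automatic grader and human graders is typically checked with a correlation coefficient over paired scores. This review does not say which statistic the authors used, so the Pearson coefficient below is an assumption, and the function name `pearson` is hypothetical. No scores from the paper are reproduced here.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)
```

In use, `xs` would hold the grader's scores and `ys` the human scores for the same answers; a value close to 1 indicates that the two rank and scale answers similarly.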

Figure 2: Left: performance breakdown on TutorEval by domain. Right: leaderboard of popular models on TutorEval. Our models, marked in bold, achieve competitive TutorEval performance.

This review was generated by AI and checked by a human editor.