QUICK REVIEW

[Paper Digest] Language Models as Science Tutors

Alexis Chevalier, Jiayi Geng|arXiv (Cornell University)|Feb 16, 2024

Innovative Teaching and Learning Methods|Cited by 5

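One-sentence summary

This paper introduces TutorEval, a long-context scientific question-answering benchmark, and TutorChat, a long-context dialogue dataset, for training and evaluating language-model tutors for STEM education. The results show that fine-tuning on scientific text and on TutorChat substantially improves performance on TutorEval and on math tasks.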

ABSTRACT

NLP has recently made exciting progress toward training language models (LMs) with strong scientific problem-solving skills. However, model development has not focused on real-life use-cases of LMs for science, including applications in education that require processing long scientific documents. To address this, we introduce TutorEval and TutorChat. TutorEval is a diverse question-answering benchmark consisting of questions about long chapters from STEM textbooks, written by experts. TutorEval helps measure real-life usability of LMs as scientific assistants, and it is the first benchmark combining long contexts, free-form generation, and multi-disciplinary scientific knowledge. Moreover, we show that fine-tuning base models with existing dialogue datasets leads to poor performance on TutorEval. Therefore, we create TutorChat, a dataset of 80,000 long synthetic dialogues about textbooks. We use TutorChat to fine-tune Llemma models with 7B and 34B parameters. These LM tutors specialized in math have a 32K-token context window, and they excel at TutorEval while performing strongly on GSM8K and MATH. Our datasets build on open-source materials, and we release our models, data, and evaluations.

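Motivation and goals

Move beyond short-context benchmarks in science education and address the need for realistic, long-context LM assistance.
Create TutorEval, a set of expert-written long-context questions spanning multiple STEM fields, to evaluate LM tutoring ability.
Develop TutorChat, a large long-context dialogue dataset for fine-tuning LM tutors on textbook-style interactions.
Show that fine-tuning on dialogue data alone is not enough for strong TutorEval performance; scientific text and TutorChat data are essential for improving TutorEval results.
Show that long-context models specialized in science and math can match strong baselines on TutorEval, GSM8K, and MATH.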

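Proposed method

Build TutorEval from 834 questions about textbook chapters, covering math, computer science, physics, and environmental and life sciences (chapters average roughly 1,800 words, with the longest at 6,100 words).
Annotate each question with ground-truth key points, which the LM evaluator uses to guide its assessment.
Use GPT-4 as the evaluator to score LM tutor outputs against the ground-truth key points, and measure how well its scores correlate with human judgments.
Create TutorChat by generating 78K long synthetic dialogues about textbook chapters (80K dialogues after expansion), using GPT-3.5-Turbo and GPT-4-Turbo.
Extend the language model's context window to 32K tokens and fine-tune Llemma-7B-32K on TutorChat and on the MathMix dataset (TutorChat-STEM + MetaMath).
Propose MathMix (TutorChat-STEM + MetaMath) to improve math ability while preserving TutorEval performance.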
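The MathMix idea (TutorChat-STEM combined with MetaMath) can be illustrated with a small sketch. This summary does not state the mixing proportions or sampling scheme, so plain concatenation followed by a seeded shuffle is an assumption made here for illustration; `build_mix` and the toy records are hypothetical and are not taken from the released data or code.

```python
import random
from typing import Dict, List


def build_mix(tutorchat_stem: List[Dict], metamath: List[Dict], seed: int = 0) -> List[Dict]:
    """Concatenate two fine-tuning sources, tag each example with its origin, and shuffle.

    Assumption: equal-weight concatenation. The actual MathMix recipe may differ.
    """
    mixed = [dict(ex, source="tutorchat_stem") for ex in tutorchat_stem]
    mixed += [dict(ex, source="metamath") for ex in metamath]
    # A seeded generator keeps the shuffle reproducible across runs.
    random.Random(seed).shuffle(mixed)
    return mixed


# Toy placeholder records, not real dataset entries.
tutorchat_stem = [{"text": f"dialogue {i}"} for i in range(3)]
metamath = [{"text": f"math problem {i}"} for i in range(2)]

mix = build_mix(tutorchat_stem, metamath)
print(len(mix), sorted({ex["source"] for ex in mix}))
```

Tagging each example with its source makes it easy to check afterwards how much of a training batch comes from dialogue data versus math data.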

Figure 1: Example from TutorEval. Given the chapter, the student asks a question to the LM Tutor. Both the chapter and the question are fed to the LM Tutor to generate the answer. GPT-4 assesses the generation by referencing the human annotated key points (blue: the tutoring task; yellow: evalua
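The flow in Figure 1 can be sketched as two prompt-building steps: one for the tutor and one for the grader. This is a minimal sketch under stated assumptions: the prompt wording, the function names `build_tutor_prompt` and `build_grader_prompt`, and the toy example are all hypothetical, and the actual prompts and grading scale used for TutorEval are not given in this summary. No model is called; the code only assembles text.

```python
from typing import List, Optional


def build_tutor_prompt(question: str, chapter: Optional[str] = None) -> str:
    """Assemble the tutor input: chapter plus question, or the question alone if no chapter is given."""
    parts = []
    if chapter is not None:
        parts.append("Chapter:\n" + chapter)
    parts.append("Student question:\n" + question)
    return "\n\n".join(parts)


def build_grader_prompt(question: str, answer: str, key_points: List[str]) -> str:
    """Assemble the grader input: the tutor's answer alongside the annotated key points."""
    bullets = "\n".join(f"- {kp}" for kp in key_points)
    return (
        "Question:\n" + question + "\n\n"
        "Tutor answer:\n" + answer + "\n\n"
        "Key points a good answer should cover:\n" + bullets + "\n\n"
        "Grade the answer against the key points."  # hypothetical instruction wording
    )


# Toy example, invented for illustration.
chapter = "Newton's second law states that force equals mass times acceleration."
question = "Why does a heavier cart accelerate less under the same push?"
tutor_prompt = build_tutor_prompt(question, chapter)
grader_prompt = build_grader_prompt(
    question,
    answer="Because a = F / m, so a larger mass gives a smaller acceleration.",
    key_points=["States a = F / m", "Links larger mass to smaller acceleration"],
)
print(tutor_prompt)
print(grader_prompt)
```

Keeping the key points out of the tutor prompt and only in the grader prompt mirrors the separation the caption describes between the tutoring task and its evaluation.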

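Experimental results

Research questions

RQ1: How can LM tutors be evaluated effectively on long-context science tasks in a way that goes beyond final-answer correctness?
RQ2: Compared with base models or dialogue-only fine-tuning, does training on scientific text and long-context dialogue data improve TutorEval performance?
RQ3: How does combining TutorChat with math-focused data (MetaMath) affect math problem-solving ability and general science tutoring?
RQ4: Can open-book and closed-book settings reveal the strengths and limitations of long-context science tutors?
RQ5: How do different base models and data mixtures perform on TutorEval, GSM8K, and MATH?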

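Main findings

TutorEval is a challenging long-context benchmark spanning multiple scientific fields; it requires advanced scientific knowledge and the ability to process textbook content.
When GPT-4 is used as the TutorEval evaluator, its scores correlate well with human judgments.
Fine-tuning on scientific text and on TutorChat significantly improves TutorEval performance compared with base models and with dialogue-only fine-tuning.
Long-context models (32K tokens) trained on math and science data (MathMix) perform strongly on math problem solving while remaining competitive on TutorEval.
Data mixtures such as MathMix (TutorChat-STEM + MetaMath) yield strong math ability (GSM8K/MATH) together with solid TutorEval results, surpassing several baselines.
The quality of TutorChat data (dialogues generated by GPT-4) can mitigate sycophancy and improve robustness to misleading questions; in many cases, open-book dialogues outperform closed-book ones.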
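The finding that GPT-4 grades track human judgments is the kind of claim that can be checked with a correlation coefficient over paired scores. This summary does not say which correlation measure was used, so Pearson's r is an assumption here, and the scores below are invented toy values, not results from the paper.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return cov / sqrt(var_x * var_y)


# Invented toy scores for five answers (not data from the paper).
human_scores = [3, 1, 2, 0, 3]
grader_scores = [3, 1, 1, 0, 2]

r = pearson(grader_scores, human_scores)
print(round(r, 3))
```

A value of r close to 1 would indicate that the automatic grader ranks and spaces answers much as the human annotators do; a rank-based measure would be a reasonable alternative when the grading scale is coarse.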

Figure 2: Left: performance breakdown on TutorEval by domains. Right: leaderboard of popular models on TutorEval. Our models, marked in bold, achieve competitive TutorEval performance.


This digest was generated by AI and reviewed by a human editor.