QUICK REVIEW

[Paper Review] Retrieval-augmented Generation to Improve Math Question-Answering: Trade-offs Between Groundedness and Human Preference

Zachary Levonian, Chenglu Li|arXiv (Cornell University)|Oct 4, 2023

Intelligent Tutoring Systems and Adaptive Learning | Cited by 13

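One-Sentence Summary

This paper designs a retrieval-augmented generation system that uses an open-source math textbook as its retrieval corpus to answer middle-school students' math questions, analyzes how prompt guidance affects groundedness and human preference, and reveals a trade-off between groundedness and perceived helpfulness.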

ABSTRACT

For middle-school math students, interactive question-answering (QA) with tutors is an effective way to learn. The flexibility and emergent capabilities of generative large language models (LLMs) has led to a surge of interest in automating portions of the tutoring process - including interactive QA to support conceptual discussion of mathematical concepts. However, LLM responses to math questions can be incorrect or mismatched to the educational context - such as being misaligned with a school's curriculum. One potential solution is retrieval-augmented generation (RAG), which involves incorporating a vetted external knowledge source in the LLM prompt to increase response quality. In this paper, we designed prompts that retrieve and use content from a high-quality open-source math textbook to generate responses to real student questions. We evaluate the efficacy of this RAG system for middle-school algebra and geometry QA by administering a multi-condition survey, finding that humans prefer responses generated using RAG, but not when responses are too grounded in the textbook content. We argue that while RAG is able to improve response quality, designers of math QA systems must consider trade-offs between generating responses preferred by students and responses closely matched to specific educational resources.

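Motivation and Goals

Motivate and evaluate the use of large language models (LLMs) for concept-focused math question answering with middle-school students.
Investigate how retrieval-augmented generation (RAG) can ground responses in a vetted curriculum resource.
Examine how the degree of prompt guidance affects groundedness and human preference.
Assess the relationship between the relevance of retrieved documents and groundedness.
Identify trade-offs between alignment with educational resources and user satisfaction.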

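Proposed Method

Build a retrieval-augmented generation math QA system that uses the OpenStax Prealgebra textbook as its corpus, segmented by sub-section.
Retrieve the textbook sections most relevant to a student question using cosine similarity over text-embedding-ada-002 embeddings.
Generate responses with gpt-3.5-turbo-0613 under three prompt guidance conditions (none, low, high).
Include an information-retrieval prompt condition that repeats the question and the most relevant passage.
Evaluate groundedness with three metrics (K-F1++, BLEURT, BERTScore) and collect human preference rankings in a within-subjects survey.
Compare responses across guidance conditions to assess the effects on groundedness and perceived helpfulness.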
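The retrieval and prompting steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the three-dimensional vectors stand in for real text-embedding-ada-002 embeddings, and the names `retrieve_top_k`, `build_prompt`, and `PROMPT_TEMPLATES`, along with the template wording, are hypothetical. It also assumes that the "none" condition omits the retrieved passage entirely.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)

def retrieve_top_k(query_vec, section_vecs, k=1):
    """Return the titles of the k sections most similar to the query."""
    scored = [
        (cosine_similarity(query_vec, vec), title)
        for title, vec in section_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [title for _, title in scored[:k]]

# Hypothetical wording for the three guidance levels (not the paper's prompts).
PROMPT_TEMPLATES = {
    "none": "Question: {question}",
    "low": (
        "You may draw on this textbook passage if it helps:\n"
        "{passage}\n\nQuestion: {question}"
    ),
    "high": (
        "Answer using only this textbook passage:\n"
        "{passage}\n\nQuestion: {question}"
    ),
}

def build_prompt(question, passage, guidance):
    """Fill in the template for one guidance condition."""
    return PROMPT_TEMPLATES[guidance].format(question=question, passage=passage)

# Toy stand-ins for section embeddings and a question embedding.
section_vecs = {
    "Fractions": [0.9, 0.1, 0.0],
    "Integers": [0.1, 0.9, 0.0],
    "Geometry": [0.0, 0.2, 0.9],
}
query_vec = [0.8, 0.2, 0.1]

best = retrieve_top_k(query_vec, section_vecs, k=1)[0]
print(build_prompt("What is a fraction?", f"[text of section: {best}]", "low"))
```

In a real system the vectors would come from an embedding model and the assembled prompt would be sent to the chat model; both calls are left out here so the sketch runs offline.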

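Experimental Results

Research Questions

RQ1: Can retrieval-augmented generation and prompt engineering improve the groundedness of LLM-generated math explanations?
RQ2: In conceptual math QA with RAG, do humans prefer more grounded or less grounded responses?
RQ3: How does the retrieval relevance of textbook content affect groundedness and user preference?
RQ4: What is the relationship between automated groundedness metrics and human judgments?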

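Key Findings

People prefer responses generated with RAG when prompt guidance is not too high, indicating a balance between groundedness and helpfulness.
Groundedness increases with prompt guidance, but high guidance does not outperform low guidance in preference.
The relevance of retrieved documents correlates with perceived groundedness but is not a consistent predictor of human preference.
Automated groundedness metrics correlate moderately with human judgments, with K-F1++ most closely tied to groundedness.
A trade-off exists: responses that are highly grounded in textbook content may not be preferred if the grounding overly constrains response style or helpfulness.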
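To make the idea of an automated groundedness metric concrete, the sketch below computes a plain unigram-overlap F1 between a response and the retrieved passage. It is a simplified stand-in, not K-F1++ itself, whose additional refinements are not reproduced here; the function names and the example sentences are hypothetical.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def overlap_f1(response, passage):
    """Unigram F1 between a response and a reference passage."""
    resp_counts = Counter(tokenize(response))
    ref_counts = Counter(tokenize(passage))
    # Multiset intersection counts each shared token at most min(count) times.
    common = sum((resp_counts & ref_counts).values())
    if common == 0:
        return 0.0
    precision = common / sum(resp_counts.values())
    recall = common / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

passage = "A fraction represents equal parts of a whole."
grounded = "A fraction represents equal parts of a whole, like slices of pizza."
ungrounded = "Think of sharing a pizza with friends."

# The response that reuses the passage wording scores higher.
print(overlap_f1(grounded, passage) > overlap_f1(ungrounded, passage))
```

A score like this rewards reuse of textbook wording, which is one way to see why a highly grounded response can score well automatically and still be ranked lower by human readers.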

This review was generated by AI and reviewed by a human editor.