QUICK REVIEW

[Paper Review] Retrieval-augmented Generation to Improve Math Question-Answering: Trade-offs Between Groundedness and Human Preference

Zachary Levonian, Chenglu Li|arXiv (Cornell University)|Oct 4, 2023

Intelligent Tutoring Systems and Adaptive Learning|Cited by 13

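ONE-LINE SUMMARY

This paper designs a retrieval-augmented generation system that answers middle-school students' math questions using an open-source math textbook as the retrieval corpus, analyzes how prompt instructions affect groundedness and human preference, and identifies a trade-off between groundedness and perceived usefulness.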

ABSTRACT

For middle-school math students, interactive question-answering (QA) with tutors is an effective way to learn. The flexibility and emergent capabilities of generative large language models (LLMs) has led to a surge of interest in automating portions of the tutoring process - including interactive QA to support conceptual discussion of mathematical concepts. However, LLM responses to math questions can be incorrect or mismatched to the educational context - such as being misaligned with a school's curriculum. One potential solution is retrieval-augmented generation (RAG), which involves incorporating a vetted external knowledge source in the LLM prompt to increase response quality. In this paper, we designed prompts that retrieve and use content from a high-quality open-source math textbook to generate responses to real student questions. We evaluate the efficacy of this RAG system for middle-school algebra and geometry QA by administering a multi-condition survey, finding that humans prefer responses generated using RAG, but not when responses are too grounded in the textbook content. We argue that while RAG is able to improve response quality, designers of math QA systems must consider trade-offs between generating responses preferred by students and responses closely matched to specific educational resources.

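RESEARCH MOTIVATION AND OBJECTIVES

Motivate and evaluate LLM-based, concept-focused math QA for middle-school students.
Investigate the use of retrieval-augmented generation (RAG) to ground responses in a vetted curriculum.
Examine how the level of prompt guidance affects groundedness and human preference.
Assess the relationship between the relevance of retrieved documents and groundedness.
Identify trade-offs between alignment with educational resources and user satisfaction.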

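PROPOSED METHOD

Build a RAG-enabled math QA system by splitting the OpenStax Prealgebra textbook corpus into subsections.
Retrieve the textbook sections most relevant to a student's question using cosine similarity over text-embedding-ada-002 embeddings.
Generate answers with the gpt-3.5-turbo-0613 model under three prompt guidance conditions (None, Low, High).
Include an information-retrieval prompt condition that repeats the most relevant paragraph and the question.
Evaluate groundedness with three metrics (K-F1++, BLEURT, BERTScore) and collect human preference rankings through a within-subjects survey.
Compare answers across guidance conditions to assess effects on groundedness and perceived usefulness.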
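
The retrieval step described above can be sketched as a cosine-similarity ranking over section embeddings. This is a minimal illustration, not the authors' implementation: the function names `cosine_similarity` and `retrieve_top_k`, the section titles, and the three-dimensional toy vectors are all hypothetical. In the reviewed system the vectors would come from text-embedding-ada-002.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def retrieve_top_k(query_vec, section_vecs, k=1):
    """Return the k (title, score) pairs most similar to the query."""
    scored = [
        (title, cosine_similarity(query_vec, vec))
        for title, vec in section_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings (assumption): real embeddings have far more dimensions.
sections = {
    "Adding integers": [0.9, 0.1, 0.0],
    "Area of a rectangle": [0.1, 0.9, 0.2],
    "Solving linear equations": [0.2, 0.1, 0.9],
}
query = [0.15, 0.85, 0.1]  # stands in for an embedded student question

top = retrieve_top_k(query, sections, k=1)
print(top[0][0])
```

The retrieved section text would then be placed in the LLM prompt, with the surrounding instructions varied by guidance condition.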

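EXPERIMENTAL RESULTS

Research questions

RQ1: Can retrieval-augmented generation and prompt design increase the groundedness of LLM-generated math explanations?
RQ2: In conceptual math QA with RAG, do people prefer more grounded answers or less grounded ones?
RQ3: How does the relevance of retrieved textbook content affect groundedness and user preference?
RQ4: What is the relationship between automated groundedness metrics and human judgments?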

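Key findings

People prefer RAG-generated answers when prompt guidance is not too high, indicating a balance between groundedness and usefulness.
More prompt instruction increases groundedness, but high guidance did not outperform low guidance in preference.
The relevance of retrieved documents correlates with perceived groundedness but is not a consistent predictor of human preference.
Automated groundedness metrics correlate modestly with human judgments, with K-F1++ showing the strongest association with groundedness.
There is a trade-off: answers highly grounded in textbook content may not be preferred when that grounding overly constrains response style or usefulness.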
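
To make the idea of an automated groundedness metric concrete, here is a simplified token-overlap F1 between a response and the retrieved text. It is a sketch in the spirit of knowledge-F1 style metrics and is not the exact K-F1++ definition used in the paper. The function names, the tokenizer, and the optional `ignore` parameter (for example, for dropping tokens that also appear in the question) are assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap_f1(response, reference, ignore=()):
    """Unigram-overlap F1 between a response and a reference passage.

    Tokens listed in `ignore` are removed from both sides before scoring.
    """
    ignore_set = set(ignore)
    resp = [t for t in tokenize(response) if t not in ignore_set]
    ref = [t for t in tokenize(reference) if t not in ignore_set]
    if not resp or not ref:
        return 0.0
    # Multiset intersection counts each shared token at most min(count) times.
    n_common = sum((Counter(resp) & Counter(ref)).values())
    if n_common == 0:
        return 0.0
    precision = n_common / len(resp)
    recall = n_common / len(ref)
    return 2 * precision * recall / (precision + recall)


score = overlap_f1(
    "The area is length times width",
    "Area equals length times width",
)
print(round(score, 3))
```

A higher score means the response reuses more of the retrieved passage's wording, which is one reason a highly grounded answer can read as constrained in style.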


This review was generated by AI and checked by a human editor.