QUICK REVIEW

[Paper Review] Self-Reflection in LLM Agents: Effects on Problem-Solving Performance

Matthew Renze, Erhan Guven|arXiv (Cornell University)|May 5, 2024

Multi-Agent Systems and Negotiation|Cited by 11

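One-Sentence Summary

This paper shows that having nine popular LLMs reflect on their own mistakes substantially improves their problem-solving performance on multiple-choice question answering (MCQA), and that more informative reflection types yield larger gains across models and domains.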

ABSTRACT

In this study, we investigated the effects of self-reflection in large language models (LLMs) on problem-solving performance. We instructed nine popular LLMs to answer a series of multiple-choice questions to provide a performance baseline. For each incorrectly answered question, we instructed eight types of self-reflecting LLM agents to reflect on their mistakes and provide themselves with guidance to improve problem-solving. Then, using this guidance, each self-reflecting agent attempted to re-answer the same questions. Our results indicate that LLM agents are able to significantly improve their problem-solving performance through self-reflection ($p < 0.001$). In addition, we compared the various types of self-reflection to determine their individual contribution to performance. All code and data are available on GitHub at https://github.com/matthewrenze/self-reflection

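Motivation and Objectives

Improve the problem-solving ability of LLMs through metacognitive self-reflection.
Decompose self-reflection into distinct components and evaluate the contribution of each.
Compare multiple LLMs and problem domains to identify where reflection helps most.
Provide practical guidance for designing agentic LLM systems that use self-reflection.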

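Proposed Method

Assemble a 1,000-question multi-domain MCQA exam from several benchmarks (ARC, AGIEval, HellaSwag, MedMCQA, and others).
Evaluate nine LLMs with a Baseline prompt (no self-reflection) to establish a performance baseline.
For each question the Baseline answered incorrectly, generate guidance with eight types of self-reflecting agent (Retry, Keywords, Advice, Explanation, Instructions, Solution, Composite, Unredacted), supplying the correct answer as feedback.
Inject the self-reflection into the re-answer prompt and re-answer only the questions that were previously answered incorrectly.
Redact the answers inside each self-reflection (except for the Unredacted agent) to prevent answer leakage.
Compute accuracy as the number of Baseline-correct answers plus the number of answers corrected on re-answering, divided by the total number of questions, and assess significance with McNemar's test.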
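
The redaction and scoring steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the Question type, the redact rule (masking the correct option's text and its parenthesised label), the [REDACTED] token, and the reflect / reanswer callables are all hypothetical names introduced here. Only the scoring rule (Baseline-correct plus corrected-on-retry, divided by all questions) comes from the description above.

```python
import re
from typing import Callable, NamedTuple, Sequence

REDACTED = "[REDACTED]"  # hypothetical mask token


class Question(NamedTuple):
    prompt: str
    label: str  # correct option label, e.g. "B"
    text: str   # correct option text, e.g. "Paris"


def redact(reflection: str, label: str, text: str) -> str:
    """Mask the correct option's text and its "(X)" label (assumed redaction rule)."""
    masked = re.sub(re.escape(text), REDACTED, reflection, flags=re.IGNORECASE)
    return re.sub(rf"\({re.escape(label)}\)", REDACTED, masked)


def accuracy_with_reflection(
    questions: Sequence[Question],
    baseline_answers: Sequence[str],
    reflect: Callable[[Question, str], str],
    reanswer: Callable[[Question, str], str],
    redact_answers: bool = True,
) -> float:
    """Re-answer only the Baseline misses, then score over all questions."""
    correct = 0
    for question, first in zip(questions, baseline_answers):
        if first == question.label:
            correct += 1  # Baseline-correct items are never re-answered
            continue
        guidance = reflect(question, first)
        if redact_answers:
            guidance = redact(guidance, question.label, question.text)
        if reanswer(question, guidance) == question.label:
            correct += 1
    return correct / len(questions)


# Toy stand-ins for the model calls, for illustration only.
def toy_reflect(question: Question, wrong: str) -> str:
    return f"I chose ({wrong}), but the answer is ({question.label}) {question.text}."


def toy_reanswer(question: Question, guidance: str) -> str:
    # Succeeds only if the correct label leaked through the guidance.
    return question.label if f"({question.label})" in guidance else "D"


exam = [Question("Capital of France?", "B", "Paris"), Question("2 + 2?", "A", "4")]
first_pass = ["C", "A"]
redacted_score = accuracy_with_reflection(exam, first_pass, toy_reflect, toy_reanswer)
leaky_score = accuracy_with_reflection(
    exam, first_pass, toy_reflect, toy_reanswer, redact_answers=False
)
```

With these toy stubs the redacted run scores 0.5 and the unredacted run scores 1.0, which illustrates why an agent that sees the unmasked answer is better read as an upper bound than as a fair comparison.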

Figure 1: Diagram of the self-reflection experiment.

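Experimental Results

Research Questions

RQ1: Do self-reflection strategies improve MCQA performance across different LLMs?
RQ2: Which types of self-reflection contribute most to performance gains?
RQ3: How do the gains from reflection vary by problem domain and model?
RQ4: What leakage risks and limitations are associated with self-reflection prompts?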

Key Findings

Agent	Accuracy	Difference	Test Statistic	p-value
Baseline	0.786	N/A	N/A	N/A
Retry	0.827	0.041	39.024	<0.001
Keywords	0.832	0.046	44.022	<0.001
Advice	0.840	0.054	52.019	<0.001
Instructions	0.849	0.063	61.016	<0.001
Explanation	0.876	0.090	88.011	<0.001
Solution	0.925	0.139	137.007	<0.001
Composite	0.932	0.146	144.007	<0.001
Unredacted	0.971	0.185	183.005	<0.001
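
The Difference and Test Statistic columns can be cross-checked with a short calculation. The sketch below assumes the continuity-corrected form of McNemar's test, $\chi^2 = (|b - c| - 1)^2 / (b + c)$, which this review does not state explicitly. Here $c$ is the number of items that went from wrong to right and $b$ the number that went from right to wrong; $b = 0$ because only incorrectly answered questions are re-answered. The per-agent counts are inferred from the rounded accuracies (Difference × 1,000 questions), and the function names are illustrative.

```python
import math


def mcnemar_statistic(b: int, c: int) -> float:
    """Continuity-corrected McNemar chi-square for discordant counts b and c."""
    if b + c == 0:
        raise ValueError("McNemar's test needs at least one discordant pair")
    return (abs(b - c) - 1) ** 2 / (b + c)


def chi2_sf_1df(x: float) -> float:
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


TOTAL = 1000            # exam size stated in the method
BASELINE_CORRECT = 786  # inferred from Baseline accuracy 0.786 x 1,000

# Items fixed on the second attempt, inferred as Difference x 1,000.
fixed = {
    "Retry": 41, "Keywords": 46, "Advice": 54, "Instructions": 63,
    "Explanation": 90, "Solution": 139, "Composite": 146, "Unredacted": 185,
}

for agent, c in fixed.items():
    accuracy = (BASELINE_CORRECT + c) / TOTAL
    stat = mcnemar_statistic(0, c)  # b = 0: correct items are never re-answered
    print(f"{agent:<13}{accuracy:.3f}  {stat:8.3f}  p={chi2_sf_1df(stat):.1e}")
```

Under these assumptions the computed statistics agree with the table to three decimal places (for example, 40² / 41 ≈ 39.024 for Retry), and every p-value falls well below 0.001.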

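Every self-reflection type outperformed the Baseline across all tested LLMs, with a statistically significant improvement in accuracy (p < 0.001).
More informative reflection types (Instructions, Explanation, Solution, Composite) produced larger accuracy gains than lighter ones (Retry, Keywords, Advice).
The Unredacted agent achieved the highest accuracy of all GPT-4 agents (0.971), establishing an upper bound when no leakage protection is applied.
Across models, LSAT-AR showed the largest improvement, while some domains, such as SAT-English, showed smaller gains.
Even the simple Retry reflection yielded a notable gain (0.786 to 0.827 for GPT-4), suggesting that merely indicating a prior error can raise the success rate of a second attempt.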

Figure 2: All self-reflection types improved the accuracy of GPT-4 agents.


This review was generated by AI and checked by a human editor.