QUICK REVIEW

[Paper Review] Boosting Theory-of-Mind Performance in Large Language Models via Prompting

Shima Rahimi Moghaddam, Christopher J. Honey | arXiv (Cornell University) | Apr 22, 2023

Topic Modeling | Citations: 39

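One-line summary

This paper shows that in-context learning through prompting, specifically two-shot chain-of-thought examples and step-by-step thinking instructions, improves the theory-of-mind (ToM) accuracy of LLMs trained with RLHF. With such prompts GPT-4 reaches 100% ToM accuracy, compared with roughly 80% for zero-shot GPT-4 and 87% for humans on the same test set.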

ABSTRACT

Large language models (LLMs) excel in many tasks in 2023, but they still face challenges in complex reasoning. Theory-of-mind (ToM) tasks, which require understanding agents' beliefs, goals, and mental states, are essential for common-sense reasoning involving humans, making it crucial to enhance LLM performance in this area. This study measures the ToM performance of GPT-4 and three GPT-3.5 variants (Davinci-2, Davinci-3, GPT-3.5-Turbo), and investigates the effectiveness of in-context learning in improving their ToM comprehension. We evaluated prompts featuring two-shot chain of thought reasoning and step-by-step thinking instructions. We found that LLMs trained with Reinforcement Learning from Human Feedback (RLHF) (all models excluding Davinci-2) improved their ToM accuracy via in-context learning. GPT-4 performed best in zero-shot settings, reaching nearly 80% ToM accuracy, but still fell short of the 87% human accuracy on the test set. However, when supplied with prompts for in-context learning, all RLHF-trained LLMs exceeded 80% ToM accuracy, with GPT-4 reaching 100%. These results demonstrate that appropriate prompting enhances LLM ToM reasoning, and they underscore the context-dependent nature of LLM cognitive capacities.

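Motivation and objectives

Measure the ToM performance of GPT-4 and three GPT-3.5 variants (Davinci-2, Davinci-3, GPT-3.5-Turbo) on ToM tasks.
Assess how in-context learning prompts affect ToM accuracy.
Examine how RLHF-trained models differ from the non-RLHF baseline on ToM tasks.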

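Proposed method

Evaluate the ToM performance of GPT-4 and three GPT-3.5 variants (Davinci-2, Davinci-3, GPT-3.5-Turbo).
Test zero-shot prompts against in-context learning prompts that include two-shot chain-of-thought examples and step-by-step thinking instructions.
Compare the ToM accuracy of the RLHF-trained models with that of the non-RLHF baseline (Davinci-2).
Measure ToM accuracy against human performance as the benchmark.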
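The prompt conditions listed above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the `Shot` structure, the `build_prompt` and `all_conditions` helpers, the "Scenario / Question / Answer" layout, the exact wording of the step-by-step instruction, and the two demonstration scenarios are all assumptions made for this example. Crossing the two prompt ingredients into four conditions is likewise an assumption; the review only states that two-shot chain-of-thought examples and step-by-step instructions were evaluated.

```python
from dataclasses import dataclass
from typing import Dict, Sequence

# Assumed wording of the step-by-step instruction; the review does not quote it.
STEP_BY_STEP = "Let's think step by step."


@dataclass(frozen=True)
class Shot:
    """One worked demonstration (hypothetical structure for this sketch)."""
    scenario: str
    question: str
    reasoning: str  # the chain of thought shown to the model
    answer: str


def build_prompt(scenario: str, question: str,
                 shots: Sequence[Shot] = (),
                 step_by_step: bool = False) -> str:
    """Assemble one prompt: optional demonstrations, then the target item."""
    blocks = []
    for shot in shots:
        blocks.append(
            f"Scenario: {shot.scenario}\n"
            f"Question: {shot.question}\n"
            f"Answer: {shot.reasoning} So the answer is: {shot.answer}"
        )
    target = f"Scenario: {scenario}\nQuestion: {question}\nAnswer:"
    if step_by_step:
        target += f" {STEP_BY_STEP}"
    blocks.append(target)
    return "\n\n".join(blocks)


# Made-up demonstrations, NOT taken from the paper.
DEMO_SHOTS = [
    Shot(
        scenario="Ana puts her phone in her bag and leaves. "
                 "Ben moves the phone to a desk while she is away.",
        question="Where will Ana look for her phone first?",
        reasoning="Ana did not see Ben move the phone, so she still "
                  "believes it is in her bag.",
        answer="in her bag",
    ),
    Shot(
        scenario="A box labeled 'cookies' actually contains pencils. "
                 "Leo has never opened the box.",
        question="What does Leo think is in the box?",
        reasoning="Leo can only rely on the label, which says cookies.",
        answer="cookies",
    ),
]


def all_conditions(scenario: str, question: str) -> Dict[str, str]:
    """Return one prompt per condition for a single test item."""
    return {
        "zero_shot": build_prompt(scenario, question),
        "zero_shot_step_by_step": build_prompt(
            scenario, question, step_by_step=True),
        "two_shot_cot": build_prompt(scenario, question, shots=DEMO_SHOTS),
        "two_shot_cot_step_by_step": build_prompt(
            scenario, question, shots=DEMO_SHOTS, step_by_step=True),
    }
```

Calling `all_conditions(scenario, question)` for each test item yields the prompt text per condition; sending those prompts to a model and scoring the replies is left out here because the review does not describe that pipeline.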

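Experimental results

Research questions

RQ1: How does prompting affect ToM accuracy in large language models?
RQ2: Do RLHF-trained LLMs benefit more from in-context learning on ToM tasks than non-RLHF models?
RQ3: What is the best prompting configuration for LLM ToM (zero-shot vs. in-context, chain of thought vs. step by step)?
RQ4: How close does LLM ToM performance come to human accuracy on the test set?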

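Key findings

GPT-4 achieves nearly 80% ToM accuracy in the zero-shot setting.
The RLHF-trained models (all models except Davinci-2) improve their ToM accuracy through in-context learning.
With prompting, every RLHF-trained LLM exceeds 80% ToM accuracy, and GPT-4 reaches 100%.
Zero-shot GPT-4 comes close to, but does not reach, the 87% human accuracy.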
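As a quick arithmetic check on the figures above, the gap to the human baseline can be computed directly. This is a small sketch that uses only the numbers quoted in this review; the zero-shot value is approximate ("nearly 80%"), and the `gap_to_human` helper is a name introduced here for illustration.

```python
# Accuracy figures quoted in this review, in percent.
HUMAN_ACCURACY = 87.0
REPORTED = {
    "GPT-4 zero-shot (approx.)": 80.0,  # "nearly 80%" in the abstract
    "GPT-4 with prompting": 100.0,
}


def gap_to_human(accuracy: float, human: float = HUMAN_ACCURACY) -> float:
    """Difference in percentage points; negative means below human accuracy."""
    return accuracy - human


for name, accuracy in REPORTED.items():
    print(f"{name}: {accuracy:.0f}% "
          f"({gap_to_human(accuracy):+.0f} points vs. human)")
```

Zero-shot GPT-4 sits about 7 points below the human baseline, while prompted GPT-4 sits 13 points above it.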


This review was generated by AI and checked by a human editor.