QUICK REVIEW

[Paper Review] Understanding Social Reasoning in Language Models with Language Models

Kanishk Gandhi, Jan-Philipp Fränken|arXiv (Cornell University)|Jun 21, 2023

Topic Modeling | Citations: 20

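One-line summary

This paper proposes BigToM, a model-generated Theory of Mind (ToM) benchmark built from causal templates, procedurally creates 5,000 items, and uses them to evaluate a range of large language models (LLMs). GPT-4 shows ToM patterns close to those of humans but with limitations, while the other models lag behind.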

ABSTRACT

As Large Language Models (LLMs) become increasingly integrated into our everyday lives, understanding their ability to comprehend human mental states becomes critical for ensuring effective interactions. However, despite the recent attempts to assess the Theory-of-Mind (ToM) reasoning capabilities of LLMs, the degree to which these models can align with human ToM remains a nuanced topic of exploration. This is primarily due to two distinct challenges: (1) the presence of inconsistent results from previous evaluations, and (2) concerns surrounding the validity of existing evaluation methodologies. To address these challenges, we present a novel framework for procedurally generating evaluations with LLMs by populating causal templates. Using our framework, we create a new social reasoning benchmark (BigToM) for LLMs which consists of 25 controls and 5,000 model-written evaluations. We find that human participants rate the quality of our benchmark higher than previous crowd-sourced evaluations and comparable to expert-written evaluations. Using BigToM, we evaluate the social reasoning capabilities of a variety of LLMs and compare model performances with human performance. Our results suggest that GPT4 has ToM capabilities that mirror human inference patterns, though less reliable, while other LLMs struggle.

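Motivation and objectives

Develop a scalable, controllable framework for evaluating ToM in LLMs using causal templates.
Build a synthetic, model-written ToM benchmark (BigToM) comprising diverse control conditions and 5,000 items.
Compare model performance against human performance and against earlier crowd-sourced and expert-written benchmarks.
Assess whether model-generated evaluations match expert quality and can guide analysis of ToM in LLMs.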

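Proposed method

Represent ToM scenarios as causal graphs over variables such as desires, percepts, beliefs, and actions.
Stage 1: Build a causal template that specifies the context, the agent, the initial state, and the causal event.
Stage 2: Prompt GPT-4 to fill in the template variables, producing one fluent sentence for each variable.
Stage 3: Combine the template sentences into test stories and questions, generating 25 conditions per template (5,000 items in total).
Generate three completions per prompt and limit each variable to a single sentence; the conditions center on Forward Belief, Forward Action, and Backward Belief, together with controls.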
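
The composition step (Stage 3) can be illustrated with a minimal Python sketch. It is not the authors' code: the class, its field names, the function, and the example story are all hypothetical and invented for illustration. It shows only how a true-belief and a false-belief Forward Belief item might be assembled from one filled-in template by switching whether the agent perceives the causal event.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CausalTemplate:
    """One filled-in template; every field holds a single sentence.

    Field names are hypothetical, chosen to mirror the variables listed above.
    """
    context: str          # who the agent is and what they want (desire)
    initial_percept: str  # what the agent first observes
    initial_belief: str   # belief formed from that percept
    causal_event: str     # event that changes the state of the world
    percept_aware: str    # agent perceives the causal event
    percept_unaware: str  # agent does not perceive the causal event
    question: str
    answer_updated: str   # answer matching the new world state
    answer_stale: str     # answer matching the old world state


def forward_belief_item(t: CausalTemplate, agent_sees_event: bool) -> dict:
    """Compose a Forward Belief item: percepts are given, the belief is asked for."""
    percept = t.percept_aware if agent_sees_event else t.percept_unaware
    story = " ".join([t.context, t.initial_percept, t.initial_belief,
                      t.causal_event, percept])
    return {
        "condition": "true_belief" if agent_sees_event else "false_belief",
        "story": story,
        "question": t.question,
        "options": [t.answer_updated, t.answer_stale],
        # If the agent saw the event, the belief tracks the new state;
        # otherwise the agent keeps the old (now false) belief.
        "correct": t.answer_updated if agent_sees_event else t.answer_stale,
    }


# Invented example story, not taken from the benchmark.
example = CausalTemplate(
    context="Mira wants to water the plant on her balcony.",
    initial_percept="She sees that the watering can is full.",
    initial_belief="Mira believes the watering can is full.",
    causal_event="Her roommate empties the watering can.",
    percept_aware="Mira watches her roommate empty the watering can.",
    percept_unaware="Mira does not see her roommate empty the watering can.",
    question="Does Mira believe the watering can is full or empty?",
    answer_updated="Mira believes the watering can is empty.",
    answer_stale="Mira believes the watering can is full.",
)

items = [forward_belief_item(example, seen) for seen in (True, False)]
for item in items:
    print(item["condition"], "->", item["correct"])
```

Because both items share every sentence except the final percept, any difference in a model's answers can be attributed to that single manipulated variable, which is the kind of control the template approach is meant to provide.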

Figure 1: Illustration of our template-based Theory-of-Mind (ToM) scenarios. [a] The causal template and an example scenario including prior desires, actions, and beliefs, and a causal event that changes the state of the environment. [b] Testing Forward Belief inference by manipulating an agent’s pe

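Experimental results

Research questions

RQ1: Can LLMs perform forward belief inference, from percepts to beliefs, under true-belief and false-belief conditions?
RQ2: Can LLMs infer an agent's action from its percepts and beliefs, including in false-belief scenarios?
RQ3: Can LLMs perform backward belief inference, inferring latent percepts and beliefs from an observed action?
RQ4: Are model-generated ToM evaluations comparable in quality to crowd-sourced and expert-written benchmarks?
RQ5: Which prompting strategy (0-shot, 1-shot, chain-of-thought) best elicits ToM reasoning?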

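Key findings

GPT-4 shows ToM capabilities consistent with human inference patterns and comes close to human reasoning, particularly on true-belief and backward-belief tasks, but falls short at the hardest level of inference.
Most models struggle with forward belief and, especially, forward action under the true-belief and false-belief conditions; GPT-4 comes closest to human performance on several tasks.
Backward belief inference is the hardest for both humans and models; GPT-4 shows a relatively human-like pattern but does not reach human accuracy.
One-shot prompting and one-shot chain-of-thought prompting improve performance for all models, but the gains may reflect imitation of the reasoning template rather than genuine ToM.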
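
To make the difference between the prompting strategies concrete, here is a minimal Python sketch. The prompt layout, the strategy labels, the function names, and the example stories are assumptions made for illustration; the paper's actual prompts are not reproduced in this review. The sketch shows only that the one-shot variants prepend a worked example and that the chain-of-thought variant adds a written reasoning step to that example.

```python
from typing import Optional


def format_item(story: str, question: str, options: list[str]) -> str:
    """Render a story, a question, and lettered answer options as plain text."""
    lines = [f"Story: {story}", f"Question: {question}"]
    lines += [f"{chr(ord('a') + i)}) {opt}" for i, opt in enumerate(options)]
    return "\n".join(lines)


def build_prompt(item: dict, strategy: str, shot: Optional[dict] = None) -> str:
    """Build a prompt for one test item.

    strategy is one of "0-shot", "1-shot", "1-shot-cot" (hypothetical labels).
    shot is a worked example and is required for the one-shot strategies.
    """
    if strategy not in {"0-shot", "1-shot", "1-shot-cot"}:
        raise ValueError(f"unknown strategy: {strategy}")
    parts = []
    if strategy != "0-shot":
        if shot is None:
            raise ValueError("1-shot strategies need a worked example")
        demo = format_item(shot["story"], shot["question"], shot["options"])
        if strategy == "1-shot-cot":
            # The demonstration spells out its reasoning before the answer.
            demo += f"\nReasoning: {shot['reasoning']}"
        demo += f"\nAnswer: {shot['answer']}"
        parts.append(demo)
    target = format_item(item["story"], item["question"], item["options"])
    # Leave the final field open for the model to complete.
    target += "\nReasoning:" if strategy == "1-shot-cot" else "\nAnswer:"
    parts.append(target)
    return "\n\n".join(parts)


# Invented stories, not taken from the benchmark.
item = {
    "story": "Mira sees a full watering can. Her roommate empties it. "
             "Mira does not see this.",
    "question": "Does Mira believe the watering can is full or empty?",
    "options": ["Mira believes it is full.", "Mira believes it is empty."],
}
shot = {
    "story": "Tom puts his keys in a drawer. His sister moves them to a shelf. "
             "Tom does not see this.",
    "question": "Where does Tom believe his keys are?",
    "options": ["In the drawer.", "On the shelf."],
    "reasoning": "Tom did not see the keys being moved, so his belief is unchanged.",
    "answer": "a",
}

print(build_prompt(item, "1-shot-cot", shot))
```

Since the chain-of-thought demonstration already contains a reasoning pattern for a structurally similar story, a model could score well by copying that pattern, which is the caveat raised in the last finding above.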

Figure 2: [a] Three-stage method for generating evaluations: Building a causal template for the domain (left). Creating a prompt template (simplified here; see Fig. 4 for the prompt) from the causal graph and populating template variables using a language model (middle). Composing test items by comb

This review was generated by AI and checked by a human editor.