QUICK REVIEW

[Paper Review] Understanding Social Reasoning in Language Models with Language Models

Kanishk Gandhi, Jan-Philipp Fränken|arXiv (Cornell University)|2023. 06. 21.

Topic Modeling | Citations: 20

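One-line summary

This paper introduces BigToM, a large-scale, procedurally generated Theory-of-Mind (ToM) benchmark. It populates causal templates to generate 5,000 items and uses them to evaluate a range of LLMs. GPT-4 shows human-like ToM inference patterns, though with limitations, while other models perform worse.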

ABSTRACT

As Large Language Models (LLMs) become increasingly integrated into our everyday lives, understanding their ability to comprehend human mental states becomes critical for ensuring effective interactions. However, despite the recent attempts to assess the Theory-of-Mind (ToM) reasoning capabilities of LLMs, the degree to which these models can align with human ToM remains a nuanced topic of exploration. This is primarily due to two distinct challenges: (1) the presence of inconsistent results from previous evaluations, and (2) concerns surrounding the validity of existing evaluation methodologies. To address these challenges, we present a novel framework for procedurally generating evaluations with LLMs by populating causal templates. Using our framework, we create a new social reasoning benchmark (BigToM) for LLMs which consists of 25 controls and 5,000 model-written evaluations. We find that human participants rate the quality of our benchmark higher than previous crowd-sourced evaluations and comparable to expert-written evaluations. Using BigToM, we evaluate the social reasoning capabilities of a variety of LLMs and compare model performances with human performance. Our results suggest that GPT4 has ToM capabilities that mirror human inference patterns, though less reliable, while other LLMs struggle.

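Motivation and goals

Build a procedural, controllable framework that uses causal templates to evaluate Theory of Mind (ToM) in LLMs.
Generate BigToM, a synthetic, model-written ToM benchmark with a range of control conditions and 5,000 items.
Compare model performance with human performance and with existing crowd-sourced and expert-written benchmarks.
Assess whether model-generated evaluations match expert quality and can guide the analysis of ToM in LLMs.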

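Proposed method

Represent each ToM scenario as a causal graph over variables such as desires, percepts, beliefs, and actions.
Stage 1: Build a causal template that specifies the context, the agent, the initial state, and the causal event.
Stage 2: Prompt GPT-4 to fill in the template variables, producing one fluent sentence per variable.
Stage 3: Compose the template sentences into test stories and questions, yielding 25 conditions per template (5,000 items in total).
Generate three completions per prompt, limit each variable to a single sentence, and focus on the Forward Belief, Forward Action, and Backward Belief conditions, alongside control conditions.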
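
The three stages above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `CausalTemplate` field names, the `forward_belief_item` helper, and the cafe sentences are all hypothetical, and Stage 2 (the GPT-4 call) is replaced by hand-written sentences so the sketch runs offline. Only the Forward Belief true-belief/false-belief pair is shown, not the full set of 25 conditions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CausalTemplate:
    """Stage 1: one slot per variable of the causal graph (hypothetical field names)."""
    context: str          # agent and setting
    desire: str           # what the agent wants
    initial_belief: str   # belief formed from the initial state
    causal_event: str     # event that changes the state of the environment
    percept_aware: str    # the agent perceives the event
    percept_unaware: str  # the agent does not perceive the event
    belief_question: str
    updated_belief: str   # correct answer if the agent perceived the event
    stale_belief: str     # correct answer if the agent did not


def forward_belief_item(t: CausalTemplate, perceives_event: bool) -> dict:
    """Stage 3: stitch one-sentence variables into a story, question, and answer."""
    percept = t.percept_aware if perceives_event else t.percept_unaware
    story = " ".join([t.context, t.desire, t.initial_belief, t.causal_event, percept])
    return {
        "condition": "true_belief" if perceives_event else "false_belief",
        "story": story,
        "question": t.belief_question,
        "answer": t.updated_belief if perceives_event else t.stale_belief,
    }


# Stage 2 stand-in: a language model would write these sentences; here they are
# hand-written placeholders, one sentence per variable.
template = CausalTemplate(
    context="Mina is a barista at a small cafe.",
    desire="Mina wants to make a latte with oat milk.",
    initial_belief="Mina sees that the pitcher on the counter is filled with oat milk.",
    causal_event="A coworker empties the pitcher and refills it with almond milk.",
    percept_aware="Mina watches the coworker refill the pitcher.",
    percept_unaware="Mina is in the back room and does not see the refill.",
    belief_question="What does Mina believe is in the pitcher?",
    updated_belief="Mina believes the pitcher contains almond milk.",
    stale_belief="Mina believes the pitcher contains oat milk.",
)

# One template yields a matched pair of items that differ only in the percept.
items = [forward_belief_item(template, aware) for aware in (True, False)]
for item in items:
    print(item["condition"], "->", item["answer"])
```

The two items share every sentence except the percept, so any difference in a model's answers can be attributed to that one manipulated variable. With 25 conditions per template, a total of 5,000 items corresponds to 200 templates.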

Figure 1: Illustration of our template-based Theory-of-Mind (ToM) scenarios. [a] The causal template and an example scenario including prior desires, actions, and beliefs, and a causal event that changes the state of the environment. [b] Testing Forward Belief inference by manipulating an agent’s pe

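Experimental results

Research questions

RQ1: Can LLMs perform forward belief inference, from percepts to beliefs, under true-belief and false-belief conditions?
RQ2: Can LLMs infer an agent's action from its percepts and beliefs, including in false-belief scenarios?
RQ3: Can LLMs perform backward belief inference, from an observed action to the underlying percepts and beliefs?
RQ4: How does the quality of model-generated ToM evaluations compare with crowd-sourced and expert-written benchmarks?
RQ5: Which prompting strategy (0-shot, 1-shot, chain-of-thought) best elicits ToM reasoning in LLMs?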
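
The prompting strategies named in RQ5 differ only in what is placed before the test item. The sketch below shows one way such prompts could be assembled. It is an assumption-laden illustration: the `build_prompt` and `format_question` helpers, the strategy labels, the instruction wording, and the lettered multiple-choice layout are all hypothetical and are not taken from the paper's prompts.

```python
from string import ascii_lowercase

# Hypothetical strategy labels; the review names 0-shot, 1-shot, and chain-of-thought.
STRATEGIES = ("0-shot", "0-shot-cot", "1-shot", "1-shot-cot")


def format_question(story, question, options):
    """Render a story, a question, and lettered answer options as plain text."""
    lines = [f"Story: {story}", f"Question: {question}"]
    lines += [f"{ascii_lowercase[i]}) {opt}" for i, opt in enumerate(options)]
    return "\n".join(lines)


def build_prompt(story, question, options, strategy="0-shot", example=None):
    """Assemble a prompt for one test item; all wording here is illustrative."""
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy}")
    use_cot = strategy.endswith("-cot")
    instruction = (
        "Think through the story step by step, then state the answer letter."
        if use_cot
        else "Reply with the answer letter only."
    )
    blocks = [instruction]
    if strategy.startswith("1-shot"):
        if example is None:
            raise ValueError("1-shot strategies require a worked example")
        shot = format_question(example["story"], example["question"], example["options"])
        if use_cot:
            # The worked example includes its reasoning only in the CoT variant.
            shot += f"\nReasoning: {example['reasoning']}"
        shot += f"\nAnswer: {example['answer']}"
        blocks.append(shot)
    # Leave the prompt open at the point where the model should continue.
    tail = "Reasoning:" if use_cot else "Answer:"
    blocks.append(format_question(story, question, options) + "\n" + tail)
    return "\n\n".join(blocks)


# Placeholder item and worked example, written for this sketch only.
story = "Mina fills a pitcher with oat milk. While she is away, a coworker swaps it for almond milk."
question = "What does Mina believe is in the pitcher?"
options = ["Almond milk", "Oat milk"]
example = {
    "story": "Tom puts his keys in a drawer. While he sleeps, his sister moves them to a shelf.",
    "question": "Where does Tom believe his keys are?",
    "options": ["On the shelf", "In the drawer"],
    "reasoning": "Tom did not see the move, so his belief is unchanged.",
    "answer": "b",
}

zero_shot = build_prompt(story, question, options)
one_shot_cot = build_prompt(story, question, options, "1-shot-cot", example)
print(one_shot_cot)
```

Because the one-shot chain-of-thought prompt shows the model a complete reasoning pattern for a structurally similar story, gains under that strategy are hard to separate from pattern imitation.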

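Key findings

GPT-4 shows ToM capabilities that match human inference patterns and comes close to human performance, particularly on true-belief and backward belief tasks, but it is not perfect at the hardest level of inference.
Most models struggle with Forward Belief and, especially under false-belief conditions, with Forward Action; GPT-4 performs closest to humans across several tasks.
Backward belief inference is the most challenging area for both humans and models; GPT-4's pattern is comparatively closer to that of humans, but its accuracy still falls below human accuracy.
One-shot prompting and one-shot chain-of-thought prompting improve performance across models, though the gain may reflect imitation of the reasoning template rather than genuine ToM.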

Figure 2: [a] Three-stage method for generating evaluations: Building a causal template for the domain (left). Creating a prompt template (simplified here; see Fig. 4 for the prompt) from the causal graph and populating template variables using a language model (middle). Composing test items by comb

This review was generated by AI and reviewed by a human editor.