QUICK REVIEW

[Paper Review] Do Language Models Exhibit the Same Cognitive Biases in Problem Solving as Human Learners?

Andreas Opedal, Alessandro Stolfo | arXiv (Cornell University) | 2024-01-31

Education and Critical Thinking Development | Citations: 21

One-Line Summary

This paper investigates whether current LLMs show human-like biases when solving arithmetic word problems, tests for those biases at each of three solving steps, and evaluates several open-source models with and without instruction-tuning.

ABSTRACT

There is increasing interest in employing large language models (LLMs) as cognitive models. For such purposes, it is central to understand which properties of human cognition are well-modeled by LLMs, and which are not. In this work, we study the biases of LLMs in relation to those known in children when solving arithmetic word problems. Surveying the learning science literature, we posit that the problem-solving process can be split into three distinct steps: text comprehension, solution planning and solution execution. We construct tests for each one in order to understand whether current LLMs display the same cognitive biases as children in these steps. We generate a novel set of word problems for each of these tests, using a neuro-symbolic approach that enables fine-grained control over the problem features. We find evidence that LLMs, with and without instruction-tuning, exhibit human-like biases in both the text-comprehension and the solution-planning steps of the solving process, but not in the final step, in which the arithmetic expressions are executed to obtain the answer.

Motivation and Objectives

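Motivate the use of LLMs as cognitive models of human learning, and identify where their behavior matches or diverges from human biases in solving arithmetic word problems.
Develop a controlled problem-generation pipeline for testing specific biases at each of the three solving steps (text comprehension, solution planning, solution execution).
Empirically evaluate open-source LLMs (LLaMA2, Mistral, Mixtral), with and without instruction-tuning, under multiple prompting regimes to detect bias patterns.
Quantify the causal effect of targeted problem features on model performance via conditional average treatment effect (CATE) estimates.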

Proposed Method

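Propose a three-step cognitive model of problem solving (text comprehension, solution planning, solution execution) and operationalize it with MathWorld logical forms and ordered symbolic expressions.
Generate controlled datasets of arithmetic word problems with a neuro-symbolic pipeline that fixes the problem structure, instantiates the cognitive model, renders template text, and applies a post-editing error-correction step.
Use paired problem generation, producing variants x and x' that differ in a selected feature, to estimate the causal effect of that feature on model accuracy via the CATE.
Evaluate eight model configurations (LLaMA2 7B/13B, Mistral 7B, and Mixtral 8x7B, each with and without instruction-tuning) under direct and chain-of-thought prompting, using zero-shot inference.
Apply a statistical test (paired-sample t-test) to determine whether the observed CATE differs from zero, and report p-values where available.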
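
The paired comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `paired_effect` and the correctness vectors are hypothetical, and the estimate is simply the mean difference in per-problem correctness between the x and x' variants, without modeling any conditioning variables. It reports the paired t-statistic only; a p-value would come from a t-distribution with n - 1 degrees of freedom, which is left out here to keep the sketch standard-library only.

```python
import math
from statistics import mean, stdev


def paired_effect(correct_x, correct_x_prime):
    """Return (mean paired difference, paired t-statistic).

    Each input is a list of 0/1 correctness scores; index i in both
    lists refers to the same underlying problem pair (x_i, x'_i).
    """
    if len(correct_x) != len(correct_x_prime):
        raise ValueError("both variants must cover the same problem pairs")
    if len(correct_x) < 2:
        raise ValueError("need at least two pairs to estimate a variance")

    # Per-pair difference in correctness: +1, 0, or -1.
    diffs = [a - b for a, b in zip(correct_x, correct_x_prime)]
    n = len(diffs)
    effect = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation (n - 1 denominator)

    # The t-statistic is undefined when every pair has the same difference.
    t_stat = effect / (sd / math.sqrt(n)) if sd > 0 else float("nan")
    return effect, t_stat


# Hypothetical model correctness on 8 problem pairs (made-up data).
consistent = [1, 1, 1, 0, 1, 1, 1, 1]
inconsistent = [1, 0, 0, 0, 1, 0, 1, 1]

effect, t_stat = paired_effect(consistent, inconsistent)
print(f"estimated effect: {effect:.3f}, t-statistic: {t_stat:.2f}")
```

A positive effect here means the model was more accurate on the first variant than on its paired counterpart; with real data the pairs would come from the controlled generation pipeline rather than hand-written lists.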

Experimental Results

Research Questions

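RQ1: Do LLMs show a consistency bias, performing better on problem text in which the relational keyword matches the required operation?
RQ2: Do they show a transfer vs. comparison bias at the mental-model level during problem solving?
RQ3: Does a carry effect appear in the symbolic-expression execution step, specifically for numbers that produce a carry?
RQ4: How do instruction-tuned and non-tuned models compare on these biases across prompting regimes (direct prompting vs. chain-of-thought)?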
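
To make the carry feature in RQ3 concrete, the following Python sketch counts the carries produced when two non-negative integers are added column by column. The helper `count_carries` is hypothetical and only illustrates one plausible way to label an arithmetic expression as "carry" or "no carry"; this review does not describe how the paper itself operationalizes the feature.

```python
def count_carries(a: int, b: int) -> int:
    """Count the carries produced when adding a and b digit by digit."""
    if a < 0 or b < 0:
        raise ValueError("only non-negative integers are supported")

    carry = 0
    count = 0
    while a > 0 or b > 0:
        # Add the lowest digits plus any carry from the previous column.
        column_sum = a % 10 + b % 10 + carry
        carry = 1 if column_sum > 9 else 0
        count += carry
        a //= 10
        b //= 10
    return count


# 23 + 45 needs no carry; 27 + 45 carries once out of the ones column.
print(count_carries(23, 45))  # 0
print(count_carries(27, 45))  # 1
```

Under this labeling, a paired test would contrast expressions with `count_carries(a, b) == 0` against otherwise similar expressions where the count is at least 1.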

Key Findings

LLMs exhibit human-like consistency bias at the problem text level, with lower accuracy on inconsistent statements compared to consistent ones.
Transfer vs. comparison bias is present in LLMs, mirroring child learners, across multiple models and prompting settings.
Carry effects are not consistently observed in the solution-execution step across tested models and prompting methods.
Chain-of-thought prompting can amplify certain biases (e.g., consistency bias) but improves overall performance, depending on the model and task setup.
Instruction-tuned models generally show larger CATEs for certain biases compared to pretrained-only variants, depending on the prompt regime.
Across models and tests, several biases reach statistical significance (p-values often < 0.01 for key comparisons).

This review was generated by AI and reviewed by a human editor.