QUICK REVIEW

[Paper Review] Reasoning with Language Model is Planning with World Model

Shibo Hao, Yi Gu|arXiv (Cornell University)|2023. 05. 24.

Topic Modeling | Citations: 10

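One-line summary

RAP has the LLM reason by planning with Monte Carlo Tree Search (MCTS) over an internal world model, improving plan generation, math reasoning, and logical reasoning beyond standard chain-of-thought (CoT) prompting.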

ABSTRACT

Large language models (LLMs) have shown remarkable reasoning capabilities, especially when prompted to generate intermediate reasoning steps (e.g., Chain-of-Thought, CoT). However, LLMs can still struggle with problems that are easy for humans, such as generating action plans for executing tasks in a given environment, or performing complex math, logical, and commonsense reasoning. The deficiency stems from the key fact that LLMs lack an internal $\textit{world model}$ to predict the world $\textit{state}$ (e.g., environment status, intermediate variable values) and simulate long-term outcomes of actions. This prevents LLMs from performing deliberate planning akin to human brains, which involves exploring alternative reasoning paths, anticipating future states and rewards, and iteratively refining existing reasoning steps. To overcome the limitations, we propose a new LLM reasoning framework, $\underline{R}$easoning vi$\underline{a}$ $\underline{P}$lanning $\textbf{(RAP)}$. RAP repurposes the LLM as both a world model and a reasoning agent, and incorporates a principled planning algorithm (based on Monte Carlo Tree Search) for strategic exploration in the vast reasoning space. During reasoning, the LLM (as agent) incrementally builds a reasoning tree under the guidance of the LLM (as world model) and task-specific rewards, and obtains a high-reward reasoning path efficiently with a proper balance between exploration $\textit{vs.}$ exploitation. We apply RAP to a variety of challenging reasoning problems including plan generation, math reasoning, and logical inference. Empirical results on these tasks demonstrate the superiority of RAP over various strong baselines, including CoT and least-to-most prompting with self-consistency. RAP on LLAMA-33B surpasses CoT on GPT-4 with 33% relative improvement in a plan generation setting.

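Motivation and Objectives

Identify and address LLMs' lack of an internal world model for planning and long-horizon reasoning.
Propose RAP, a framework that repurposes the LLM as both a world model and a reasoning agent.
Demonstrate RAP's effectiveness across plan generation, math reasoning, and logical reasoning.
Show that MCTS planning guided by task-specific rewards yields high-quality reasoning traces.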

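Proposed Method

Define states and actions for each reasoning task, and instantiate the world model by prompting the LLM.
Introduce rewards for reasoning steps, including action likelihood, state confidence, self-evaluation, and task-specific heuristics.
Apply Monte Carlo Tree Search (MCTS) with UCT-based selection, expansion, simulation, and back-propagation to build and evaluate reasoning traces.
Where appropriate, use RAP-Aggregation to combine multiple reasoning traces into a final answer.
Demonstrate that the LLM, acting as both world model and agent, can balance exploration and exploitation to find high-reward reasoning paths.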
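
The four MCTS phases named above (UCT-based selection, expansion, simulation, back-propagation) can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the authors' implementation: the number-reaching task, `world_model`, `reward`, `Node`, `uct`, and `mcts` are all hypothetical names, and in RAP the world model and rewards would come from LLM prompts rather than the hard-coded functions used here. The random rollout and the "most-visited child" readout are common MCTS choices that the review does not confirm for RAP.

```python
import math
import random

# Hypothetical toy task standing in for an LLM: reach TARGET from 1 via "+1" / "*2".
TARGET, MAX_DEPTH = 10, 5
ACTIONS = ("+1", "*2")

def world_model(state, action):
    """Predict the next state (in RAP, an LLM prompt would play this role)."""
    return state + 1 if action == "+1" else state * 2

def reward(state):
    """Toy task-specific reward: 1.0 only when the target state is reached."""
    return 1.0 if state == TARGET else 0.0

class Node:
    def __init__(self, state, parent=None, action=None, depth=0):
        self.state, self.parent, self.action, self.depth = state, parent, action, depth
        self.children = []
        self.visits = 0
        self.value = 0.0  # sum of returns backed up through this node

    def is_terminal(self):
        return self.state == TARGET or self.depth >= MAX_DEPTH

def uct(child, c=1.0):
    """UCT score: mean value plus an exploration bonus for rarely visited nodes."""
    if child.visits == 0:
        return float("inf")
    bonus = c * math.sqrt(math.log(child.parent.visits) / child.visits)
    return child.value / child.visits + bonus

def mcts(root_state, iterations=200, rng=None):
    if rng is None:
        rng = random.Random(0)
    root = Node(root_state)
    for _ in range(iterations):
        node = root
        # 1. Selection: descend through already-expanded nodes by UCT.
        while node.children and not node.is_terminal():
            node = max(node.children, key=uct)
        # 2. Expansion: add one child per action at a non-terminal leaf.
        if not node.is_terminal():
            node.children = [
                Node(world_model(node.state, a), node, a, node.depth + 1)
                for a in ACTIONS
            ]
            node = rng.choice(node.children)
        # 3. Simulation: random rollout using the world model.
        state, depth = node.state, node.depth
        while state != TARGET and depth < MAX_DEPTH:
            state = world_model(state, rng.choice(ACTIONS))
            depth += 1
        ret = reward(state)
        # 4. Back-propagation: update statistics from the leaf up to the root.
        while node is not None:
            node.visits += 1
            node.value += ret
            node = node.parent
    # Read out a plan by following the most-visited child at each level.
    plan, node = [], root
    while node.children:
        node = max(node.children, key=lambda ch: ch.visits)
        plan.append(node.action)
    return plan, node.state
```

Calling `mcts(1)` returns a list of actions and the state they lead to. The exploration constant `c` controls the exploration/exploitation balance: larger values spread visits across siblings, smaller values concentrate on the best-scoring branch.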

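Experimental Results

Research Questions

RQ1: Can the internal world model inherent in an LLM improve planning-like reasoning across domains?
RQ2: Does MCTS-based planning guided by LLM-derived rewards produce higher-quality reasoning traces than standard CoT prompting?
RQ3: How does RAP perform on plan generation, math reasoning, and logical reasoning compared with strong baselines?
RQ4: In specific settings, can RAP match or surpass stronger models (e.g., GPT-4 with CoT)?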

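Key Results

RAP achieves an average success rate of 64% on 2/4/6-step Blocksworld plan generation, substantially outperforming CoT.
LLaMA-33B with RAP shows a 33% relative improvement over GPT-4 with CoT in plan generation.
RAP improves GSM8K math reasoning accuracy over chain-of-thought (CoT) and least-to-most prompting with self-consistency, reaching about 48.8% accuracy, which rises to 51.6% with aggregation.
On PrOntoQA logical reasoning, RAP yields 94.2% prediction accuracy and 78.8% proof accuracy, exceeding the CoT baselines.
With Llama-2 70B, RAP remains robust on the full Blocksworld benchmark, maintaining higher performance even on the harder cases of 6 or more steps, where CoT degrades.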
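
The results above mix absolute accuracies with a "relative improvement" figure. The review does not spell out the formula, so the usual definition is assumed here. Applied to the two GSM8K numbers reported above (48.8% without aggregation, 51.6% with it), it gives:

```latex
\[
\text{relative improvement}
  = \frac{a_{\text{new}} - a_{\text{base}}}{a_{\text{base}}},
\qquad
\frac{51.6 - 48.8}{48.8} \approx 0.057 \quad (\approx 5.7\%).
\]
```

The same change is 2.8 percentage points in absolute terms, so a relative figure such as the 33% reported against GPT-4 should not be read as a 33-point gap.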


This review was generated by AI and reviewed by a human editor.