QUICK REVIEW

[Paper Review] Chain of Code: Reasoning with a Language Model-Augmented Code Emulator

Chengshu Li, Jacky Liang | arXiv (Cornell University) | 2023-12-07

Software Engineering Research | Citations: 12

One-Line Summary

Chain of Code (CoC) prompts a language model to write code and, when that code cannot be executed, to emulate its execution with an LMulator; this yields stronger stateful reasoning than Chain of Thought on BBH tasks and scales across model sizes.

ABSTRACT

Code provides a general syntactic structure to build complex programs and perform precise computations when paired with a code interpreter - we hypothesize that language models (LMs) can leverage code-writing to improve Chain of Thought reasoning not only for logic and arithmetic tasks, but also for semantic ones (and in particular, those that are a mix of both). For example, consider prompting an LM to write code that counts the number of times it detects sarcasm in an essay: the LM may struggle to write an implementation for "detect_sarcasm(string)" that can be executed by the interpreter (handling the edge cases would be insurmountable). However, LMs may still produce a valid solution if they not only write code, but also selectively "emulate" the interpreter by generating the expected output of "detect_sarcasm(string)". In this work, we propose Chain of Code (CoC), a simple yet surprisingly effective extension that improves LM code-driven reasoning. The key idea is to encourage LMs to format semantic sub-tasks in a program as flexible pseudocode that the interpreter can explicitly catch undefined behaviors and hand off to simulate with an LM (as an "LMulator"). Experiments demonstrate that Chain of Code outperforms Chain of Thought and other baselines across a variety of benchmarks; on BIG-Bench Hard, Chain of Code achieves 84%, a gain of 12% over Chain of Thought. In a nutshell, CoC broadens the scope of reasoning questions that LMs can answer by "thinking in code".

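Research Motivation and Goals

Motivates combining code-based reasoning with language-model reasoning to handle both semantic and numeric tasks.
Proposes Chain of Code (CoC) as a two-stage process: generate code, then execute it through an interpreter, falling back to LMulator simulation when execution fails.
Shows that interleaving Python execution with LM-based simulation improves performance across a variety of tasks.
Evaluates scaling with model size, and cross-task prompting as a measure of general-purpose reasoning.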

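Proposed Method

The LM first generates its reasoning as code, pseudocode, or natural language in order to structure the problem into sub-problems.
Where possible, the code is executed line by line with a Python interpreter; otherwise, the LMulator simulates the effect of the lines that could not be executed.
After each line is executed or simulated, a shared program state is updated, which enables interleaved execution of code and LM outputs.
When code execution fails, the LMulator draws on the LM's own reasoning (chain-of-thought where possible) to generate the intermediate state trace.
The approach is evaluated on BBH tasks and GSM8K against baselines including Direct QA and Chain of Thought, with several ablations (e.g., Python only, LM only, LM state) to isolate individual components.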
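
The interleaved execution described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`run_chain_of_code`, `stub_lmulator`), the choice to split the program into top-level statements with `ast`, and the rule-based stub that stands in for a real language-model call are all assumptions made for illustration.

```python
import ast
from typing import Callable, Dict, List, Tuple

State = Dict[str, object]

def run_chain_of_code(
    program: str, lmulator: Callable[[str, State], State]
) -> Tuple[State, List[Tuple[str, str]]]:
    """Run a program statement by statement, handing failures to an LMulator.

    Hypothetical sketch: each top-level statement is first tried in the
    Python interpreter; if it raises, `lmulator` is asked for the variable
    updates the statement should have produced.
    """
    state: State = {}
    trace: List[Tuple[str, str]] = []  # (executor, statement source)
    for stmt in ast.parse(program).body:
        source = ast.get_source_segment(program, stmt)
        try:
            code = compile(ast.Module(body=[stmt], type_ignores=[]), "<coc>", "exec")
            exec(code, state)  # shared program state lives in this dict
            trace.append(("python", source))
        except Exception:
            # Hide interpreter internals such as __builtins__ from the LMulator.
            visible = {k: v for k, v in state.items() if not k.startswith("__")}
            state.update(lmulator(source, visible))
            trace.append(("lm", source))
    final = {k: v for k, v in state.items() if not k.startswith("__")}
    return final, trace

def stub_lmulator(statement: str, state: State) -> State:
    """Stand-in for an LM call (assumption): a fixed rule instead of a model."""
    fruits = {"apple", "banana"}
    if statement.startswith("flags ="):
        return {"flags": [word in fruits for word in state["words"]]}
    raise ValueError(f"stub cannot simulate: {statement}")

# `is_fruit` is deliberately undefined, so the interpreter cannot run line 2.
PROGRAM = '''words = ["apple", "hammer", "banana"]
flags = [is_fruit(w) for w in words]
count = sum(flags)
'''

final_state, trace = run_chain_of_code(PROGRAM, stub_lmulator)
print(final_state["count"])            # 2
print([who for who, _ in trace])       # ['python', 'lm', 'python']
```

In this sketch the semantic step (`is_fruit`) is simulated, while the arithmetic step (`sum`) runs in the interpreter on the state the simulation produced. That hand-off through a shared state dictionary is what the interleaving bullet refers to.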

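Experimental Results

Research Questions

RQ1: How does CoC perform against baselines across a range of challenging reasoning tasks?
RQ2: Which task types (algorithmic, semantic, or mixed) benefit most from CoC?
RQ3: How does each CoC component (Python execution, LMulator, interleaving) affect performance?
RQ4: How does CoC scale with model size when cross-task prompting is used to make it a general-purpose reasoner?
RQ5: How does CoC compare with instruction-tuned chat models, with and without tools?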

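Key Results

CoC achieves 84% on BIG-Bench Hard, a gain of 12% over Chain of Thought, and exceeds the human baseline on some tasks.
The Python-enabled CoC variant reaches or approaches 100% on several tasks where the code is executable; the LMulator-driven variant handles semantic tasks where code execution is difficult or impossible.
CoC outperforms the baselines on both algorithmic and semantic tasks and scales with model size, with improvements appearing even for small models.
Cross-task prompting remains competitive and approaches human performance as scale increases, suggesting potential as a general-purpose reasoner.
Ablations show that interleaving interpreter execution with LM-based simulation, together with maintaining program state, is critical for peak performance.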

This review was generated by AI and reviewed by a human editor.