QUICK REVIEW

[Paper Review] Faithful Chain-of-Thought Reasoning

Qing Lyu, Shreya Havaldar | arXiv (Cornell University) | 2023. 01. 31.

Topic Modeling | Citations: 22

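One-Line Summary

Faithful CoT decomposes reasoning into Translation (an NL query to an interleaved NL/SL reasoning chain) and Problem Solving (a deterministic solver), which guarantees that the explanation faithfully produces the final answer, and it achieves SOTA few-shot performance on several datasets.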

ABSTRACT

While Chain-of-Thought (CoT) prompting boosts Language Models' (LM) performance on a gamut of complex reasoning tasks, the generated reasoning chain does not necessarily reflect how the model arrives at the answer (aka. faithfulness). We propose Faithful CoT, a reasoning framework involving two stages: Translation (Natural Language query $\rightarrow$ symbolic reasoning chain) and Problem Solving (reasoning chain $\rightarrow$ answer), using an LM and a deterministic solver respectively. This guarantees that the reasoning chain provides a faithful explanation of the final answer. Aside from interpretability, Faithful CoT also improves empirical performance: it outperforms standard CoT on 9 of 10 benchmarks from 4 diverse domains, with a relative accuracy gain of 6.3% on Math Word Problems (MWP), 3.4% on Planning, 5.5% on Multi-hop Question Answering (QA), and 21.4% on Relational Inference. Furthermore, with GPT-4 and Codex, it sets the new state-of-the-art few-shot performance on 7 datasets (with 95.0+ accuracy on 6 of them), showing a strong synergy between faithfulness and accuracy.

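Motivation and Objectives

Addresses the faithfulness of reasoning chains in Chain-of-Thought prompting.
Proposes a two-stage framework, Translation and Problem Solving, that yields faithful explanations.
Demonstrates accuracy gains on math word problems, planning, multi-hop QA, and relational inference.
Shows robustness to the choice of language model, prompt variations, and exemplars, and highlights limitations and ethical considerations.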

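Proposed Method

Translates the NL query into a reasoning chain that interleaves NL and symbolic language (SL).
Runs the SL program through a deterministic external solver to produce the final answer.
Uses NL to decompose the problem into subproblems, and encodes each substep in an SL such as Python, Datalog, or PDDL.
Guarantees faithfulness through the two-stage pipeline, in which the answer A is derived by executing the SL part of the chain, C_SL.
Evaluates Faithful CoT across 4 domains (MWP, Planning, Multi-hop QA, Relational Inference) with multiple LMs and decoding strategies.
Provides analyses of robustness, the role of the solver, and error patterns.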
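
The two-stage pipeline described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `translate`, `solve`, and `faithful_cot` are hypothetical names, the Translation stage is a hard-coded stub standing in for an LM call, and the toy word problem is invented for the example. The SL here is Python, with the NL subquestions carried as comments, so the answer can only come from executing the chain.

```python
from typing import Callable, Tuple


def translate(query: str) -> str:
    """Translation stage (stub). A real system would prompt an LM here;
    this hypothetical stand-in returns a fixed chain for one toy query."""
    return (
        "# 1. How many apples does Anna start with? (stated in the query)\n"
        "apples_start = 5\n"
        "# 2. How many apples does she buy? (stated in the query)\n"
        "apples_bought = 3\n"
        "# 3. How many apples does she have now? (depends on 1 and 2)\n"
        "answer = apples_start + apples_bought\n"
    )


def solve(chain: str) -> object:
    """Problem Solving stage: deterministically execute the SL program.
    NL lines are comments, so only the SL statements run.
    Note: exec() is not a sandbox; do not run untrusted LM output this way."""
    namespace: dict = {}
    exec(chain, namespace)
    return namespace["answer"]


def faithful_cot(
    query: str, translator: Callable[[str], str] = translate
) -> Tuple[str, object]:
    """Return the reasoning chain together with the answer computed from it."""
    chain = translator(query)
    return chain, solve(chain)


chain, answer = faithful_cot(
    "Anna has 5 apples and buys 3 more. How many apples does she have?"
)
print(chain)
print("Answer:", answer)
```

Because the answer is read out of the executed program rather than generated separately, the chain shown to the reader is by construction the computation that produced the answer. The Translation step itself remains as opaque as any LM call.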

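Experimental Results

Research Questions

RQ1: Can Faithful CoT provide faithful explanations while improving final-answer accuracy across diverse reasoning tasks?
RQ2: How does interleaving NL and SL in the reasoning chain affect performance and interpretability?
RQ3: How does the choice of external solver (Python, Datalog, PDDL) affect results and robustness?
RQ4: To what extent does the opacity of the Translation stage limit interpretability?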

Key Results

GSM8K	SVAMP	MultiArith	ASDiv	AQuA	SayCan	StrategyQA	Date	Sport	CLUTRR
72.3	83.4	98.8	80.2	47.2	89.3	63.0	81.6	99.1	58.9
78.0	86.8	100.0	84.2	52.0	89.3	79.8	63.8	98.0	45.7
38.3	80.3	74.0	76.5	40.6	77.7	72.2	76.6	99.5	47.2
38.8	80.5	74.0	76.3	44.9	76.7	71.9	77.2	99.4	50.9

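Faithful CoT improves accuracy over the vanilla CoT and LtM baselines on 9 of the 10 benchmarks.
With Codex, Faithful CoT achieves a relative gain of up to 21.4% on Relational Inference, with notable gains on Math Word Problems (6.3%), Planning (3.4%), and Multi-hop QA (5.5%).
With GPT-4 and Codex, Faithful CoT sets new SOTA few-shot results on 7 datasets, with 95.0+ accuracy on 6 of them.
The external solver is critical for many tasks; removing it causes accuracy to drop sharply on several datasets.
In human evaluation, Faithful CoT produces reasoning chains that are largely plausible, but in knowledge-intensive or ambiguous cases a correct answer is still sometimes accompanied by an incorrect chain.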
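
The gains above are reported as relative, not absolute, improvements. As a rough guide to reading them, the sketch below assumes the usual definition of relative gain, (new - baseline) / baseline. The function name `relative_gain` and the example numbers are hypothetical and are not taken from the paper.

```python
def relative_gain(baseline: float, new: float) -> float:
    """Relative accuracy gain in percent, assuming the usual definition
    100 * (new - baseline) / baseline."""
    if baseline <= 0:
        raise ValueError("baseline accuracy must be positive")
    return 100.0 * (new - baseline) / baseline


# Hypothetical accuracies, chosen only to illustrate the arithmetic:
# moving from 50.0 to 60.0 is +10 points absolute but +20% relative.
print(relative_gain(50.0, 60.0))
```

Under this definition, the same absolute improvement yields a larger relative gain when the baseline accuracy is lower.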


This review was generated by AI and reviewed by a human editor.