QUICK REVIEW

[Paper Review] Selection-Inference: Exploiting Large Language Models for Interpretable Logical Reasoning

Antonia Creswell, Murray Shanahan|arXiv (Cornell University)|May 19, 2022

Topic Modeling | Cited by 110

One-Line Summary

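The authors introduce the Selection-Inference (SI) framework, which uses a pre-trained LLM in a modular two-stage process (selection and inference) to carry out causal, interpretable multi-step logical reasoning. With 5-shot prompting and no fine-tuning, SI substantially outperforms vanilla and Chain-of-Thought baselines and even outperforms a much larger 280B-parameter baseline.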

ABSTRACT

Large language models (LLMs) have been shown to be capable of impressive few-shot generalisation to new tasks. However, they still tend to perform poorly on multi-step logical reasoning problems. Here we carry out a comprehensive evaluation of LLMs on 50 tasks that probe different aspects of logical reasoning. We show that language models tend to perform fairly well at single step inference or entailment tasks, but struggle to chain together multiple reasoning steps to solve more complex problems. In light of this, we propose a Selection-Inference (SI) framework that exploits pre-trained LLMs as general processing modules, and alternates between selection and inference to generate a series of interpretable, causal reasoning steps leading to the final answer. We show that a 7B parameter LLM used within the SI framework in a 5-shot generalisation setting, with no fine-tuning, yields a performance improvement of over 100% compared to an equivalent vanilla baseline on a suite of 10 logical reasoning tasks. The same model in the same setting even outperforms a significantly larger 280B parameter baseline on the same suite of tasks. Moreover, answers produced by the SI framework are accompanied by a causal natural-language-based reasoning trace, which has important implications for the safety and trustworthiness of the system.

Motivation and Objectives

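Evaluate how LLMs perform across a wide range of logical reasoning tasks and identify the limits of their multi-step reasoning.
Propose a modular Selection-Inference framework that improves reasoning while producing a causal reasoning trace.
Demonstrate the effectiveness of SI using a 7B LLM with 5-shot prompting, against baselines that include a 280B model.
Show that SI produces interpretable, causal reasoning traces that are useful for safety, debugging, and trustworthiness.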

Proposed Method

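Reasoning is decomposed into alternating Selection and Inference steps, both implemented with a pre-trained, frozen LLM from the Gopher family.
The Selection module is prompted to score facts in the context and choose the subset needed for a single reasoning step.
A separate Inference module generates a new fact from the selected subset only, without access to the question.
Multiple (Selection, Inference) steps are chained: each newly inferred fact is added to the context, and the sequence of steps forms a causal reasoning trace.
SI is compared against a vanilla LLM, Chain-of-Thought (COT), and a larger 280B model across 10 logical reasoning tasks.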
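
The steps above can be illustrated with a minimal, runnable sketch. It is not the authors' implementation: in the paper both modules are few-shot prompts to a frozen Gopher-family LLM, whereas here `select` and `infer` are hypothetical rule-based stand-ins on a toy bAbI-style deduction problem. The statement templates, the word-overlap selection heuristic, the fixed `n_steps` budget, and all function names are assumptions made for illustration. What the sketch does preserve is the structure described above: selection sees the question, inference sees only the selected statements, and each inferred fact is appended to the context and recorded in a trace.

```python
import re
from itertools import product

# Hypothetical toy statement templates; a real SI system works on free-form text.
FACT = re.compile(r"^(\w+) is a (\w+)\.$")        # e.g. "Gertrude is a sheep."
RULE = re.compile(r"^every (\w+) is a (\w+)\.$")  # e.g. "every sheep is a mammal."
QUESTION = re.compile(r"^Is (\w+) a (\w+)\?$")    # e.g. "Is Gertrude a vertebrate?"
STOPWORDS = {"is", "a", "every"}


def content_words(text):
    """Lower-cased words with punctuation and stopwords removed."""
    return {w.strip(".?").lower() for w in text.split()} - STOPWORDS


def select(question, context, used):
    """Selection stand-in: choose one (fact, rule) pair from the context.

    In the paper this is a prompted, frozen LLM; here a hypothetical heuristic
    pairs a fact with a rule about its category and prefers pairs that share
    the most words with the question. Returns None if no unused pair exists.
    """
    q = content_words(question)
    best, best_score = None, -1
    for fact, rule in product(context, repeat=2):
        f, r = FACT.match(fact), RULE.match(rule)
        if f is None or r is None or f.group(2) != r.group(1):
            continue
        if (fact, rule) in used:
            continue
        score = len((content_words(fact) | content_words(rule)) & q)
        if score > best_score:
            best, best_score = (fact, rule), score
    return best


def infer(selection):
    """Inference stand-in: sees ONLY the selected statements, never the question."""
    fact, rule = selection
    name = FACT.match(fact).group(1)
    conclusion = RULE.match(rule).group(2)
    return f"{name} is a {conclusion}."


def selection_inference(question, context, n_steps):
    """Alternate Selection and Inference for n_steps (an assumed fixed budget),
    adding each inferred statement to the context and recording the trace."""
    context, used, trace = list(context), set(), []
    for _ in range(n_steps):
        selection = select(question, context, used)
        if selection is None:  # nothing left to select
            break
        used.add(selection)
        new_fact = infer(selection)
        trace.append((selection, new_fact))
        if new_fact not in context:
            context.append(new_fact)
    return context, trace


def answer(question, context):
    """Toy answer step: 'yes' if the queried statement has been derived."""
    name, category = QUESTION.match(question).groups()
    return "yes" if f"{name} is a {category}." in context else "unknown"


facts = [
    "Gertrude is a sheep.",
    "every sheep is a mammal.",
    "every mammal is a vertebrate.",
    "Nemo is a fish.",           # distractor chain
    "every fish is a swimmer.",
]
question = "Is Gertrude a vertebrate?"
context, trace = selection_inference(question, facts, n_steps=2)
for (fact, rule), new_fact in trace:
    print(f"{fact} + {rule} -> {new_fact}")
print(answer(question, context))
```

With `n_steps=2`, the trace first derives "Gertrude is a mammal." and then "Gertrude is a vertebrate.", so the toy answer is "yes"; the distractor pair about Nemo scores lower and is never selected. Keeping `infer` blind to the question mirrors the causal structure described in this review: every new fact follows only from the statements chosen at that step.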

Experimental Results

Research Questions

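RQ1: How do LLMs perform on simple entailment tasks versus multi-step logical reasoning tasks?
RQ2: Can a modular Selection-Inference framework improve reasoning accuracy without fine-tuning?
RQ3: Do the reasoning traces produced by SI provide a causal, interpretable justification for the answer, and do they enable recovery from errors?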

Key Findings

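Within the SI framework, the 7B LLM achieves 58.75% generation accuracy, versus 2.94% for the same model used directly (vanilla) and 41.32% with COT (both p<0.01).
The 7B SI model also outperforms the 280B baseline in both its vanilla (31.19%) and COT (44.03%) settings (both p<0.01).
On the easier multiple-choice evaluation, the vanilla 7B model outperforms the 280B model (57.31% vs. 51.45%), and SI exceeds even this figure in the harder generation setting.
SI solves bAbI task 15 (deduction) with 100% accuracy using only five prompt examples.
SI performs strongly on the ProofWriter Depth 0 and Depth 1 tasks, with statistically significant p-values.
SI produces causal, natural-language reasoning traces and can recover from errors by adding newly inferred facts to the context.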
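
To make the size of these gaps concrete, the short sketch below computes, for SI's 58.75% generation accuracy, the absolute difference in percentage points and the ratio against each generation baseline quoted above. The figures are copied from this review's findings; the variable and function names are illustrative only.

```python
# Generation accuracies (%) quoted in the key findings above.
SI_7B = 58.75
baselines = {
    "7B vanilla": 2.94,
    "7B COT": 41.32,
    "280B vanilla": 31.19,
    "280B COT": 44.03,
}


def compare(si, base):
    """Return (absolute gap in percentage points, ratio si / base)."""
    return si - base, si / base


for name, acc in baselines.items():
    gap, ratio = compare(SI_7B, acc)
    print(f"SI 7B vs {name:<13} {gap:+6.2f} pts  x{ratio:.2f}")
```

For example, the gap to the strongest quoted baseline, the 280B model with COT, is 14.72 percentage points.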

This review was generated by AI and checked by a human editor.