QUICK REVIEW

[Paper Review] Can Large Language Models Really Improve by Self-critiquing Their Own Plans?

Karthik Valmeekam, Matthew Marquez|arXiv (Cornell University)|Oct 12, 2023

Topic Modeling | Citations: 9

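One-Line Summary

This paper evaluates an LLM+LLM planning system in which a generator LLM produces plans and a verifier LLM critiques them. Self-critiquing lowers plan generation performance relative to using a sound external verifier, because the LLM verifier produces many false positives. The granularity of the feedback has little effect.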

ABSTRACT

There have been widespread claims about Large Language Models (LLMs) being able to successfully verify or self-critique their candidate solutions in reasoning problems in an iterative mode. Intrigued by those claims, in this paper we set out to investigate the verification/self-critiquing abilities of large language models in the context of planning. We evaluate a planning system that employs LLMs for both plan generation and verification. We assess the verifier LLM's performance against ground-truth verification, the impact of self-critiquing on plan generation, and the influence of varying feedback levels on system performance. Using GPT-4, a state-of-the-art LLM, for both generation and verification, our findings reveal that self-critiquing appears to diminish plan generation performance, especially when compared to systems with external, sound verifiers and the LLM verifiers in that system produce a notable number of false positives, compromising the system's reliability. Additionally, the nature of feedback, whether binary or detailed, showed minimal impact on plan generation. Collectively, our results cast doubt on the effectiveness of LLMs in a self-critiquing, iterative framework for planning tasks.

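Motivation and Objectives

Evaluate whether self-critiquing improves plan generation in an LLM-based planning system.
Compare the verifier LLM's performance on planning tasks against ground-truth verification (VAL).
Analyze how the granularity of feedback affects plan generation performance.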

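Proposed Method

A generator LLM and a verifier LLM (both GPT-4) run iteratively in a backprompting loop, capped at 15 iterations.
Planning problems are expressed in PDDL and evaluated on the Blocksworld domain.
The final plan is checked against ground truth with VAL, a sound external verifier.
LLM+LLM backprompting, LLM+VAL backprompting, and a generator-only baseline are compared.
Four feedback levels are tested: no feedback, binary feedback, binary feedback plus the first error, and binary feedback plus all errors.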
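
The loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `generate`, `verify`, `Verdict`, `FeedbackLevel`, and the feedback wording are all hypothetical names introduced here. Two details are assumptions: that the loop stops as soon as the verifier approves a plan, and that the "no feedback" level simply re-prompts the generator without any message.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Optional, Tuple


class FeedbackLevel(Enum):
    """The four feedback levels listed in the review (names are illustrative)."""
    NONE = "none"                # no feedback
    BINARY = "binary"            # valid / invalid only
    FIRST_ERROR = "first_error"  # binary + first error
    ALL_ERRORS = "all_errors"    # binary + all errors


@dataclass
class Verdict:
    """Result of one verification call (hypothetical structure)."""
    valid: bool
    errors: List[str]


def format_feedback(verdict: Verdict, level: FeedbackLevel) -> Optional[str]:
    """Build the backprompt text for a rejected plan at the given level."""
    if level is FeedbackLevel.NONE:
        return None  # assumption: re-prompt with no message at all
    message = "The plan is invalid."
    if level is FeedbackLevel.FIRST_ERROR and verdict.errors:
        message += " First error: " + verdict.errors[0]
    elif level is FeedbackLevel.ALL_ERRORS and verdict.errors:
        message += " Errors: " + "; ".join(verdict.errors)
    return message


def backprompt_loop(
    generate: Callable[[Optional[str]], str],
    verify: Callable[[str], Verdict],
    level: FeedbackLevel,
    max_iterations: int = 15,
) -> Tuple[str, int]:
    """Alternate generation and verification until approval or the cap.

    Returns the last plan produced and the number of iterations used.
    """
    feedback: Optional[str] = None
    plan = ""
    for iteration in range(1, max_iterations + 1):
        plan = generate(feedback)
        verdict = verify(plan)
        if verdict.valid:  # assumption: stop on the verifier's approval
            return plan, iteration
        feedback = format_feedback(verdict, level)
    return plan, max_iterations
```

In this sketch, passing an LLM-based critic as `verify` corresponds to the LLM+LLM setting, and passing a wrapper around a sound checker such as VAL corresponds to LLM+VAL. A verifier that wrongly approves a plan ends the loop early with an invalid plan, which is one way false positives can lower final accuracy.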

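Experimental Results

Research Questions

RQ1: Does self-critiquing improve plan generation performance compared with external verification?
RQ2: How accurate is the verifier LLM compared with ground-truth verification (VAL), including its false positives?
RQ3: Does feedback granularity (binary versus detailed) affect the plan generation performance of the LLM+LLM system?
RQ4: How does backprompting with a sound verifier affect overall reliability and efficiency?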

Key Findings

Method	Accuracy	Avg. Number of iterations
LLM+LLM w/ Backprompting (BP)	55/100 (55%)	3.48
LLM+VAL w/ BP	88/100 (88%)	4.18
Generator LLM only w/o BP	40/100 (40%)	1.00

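LLM+LLM backprompting reached 55/100 (55%) accuracy, LLM+VAL reached 88/100 (88%), and the generator-only baseline reached 40/100 (40%).
The verifier LLM judged 61/100 plans correctly, with 54 true positives and 38 false positives (a false positive rate of 38/45).
The external VAL verifier yields substantially higher performance and reliability than the self-critiquing LLM verifier.
Feedback granularity (binary versus detailed) had little effect on plan generation performance as long as accurate binary feedback was available.
On efficiency (average number of iterations), LLM+LLM BP took 3.48 iterations and LLM+VAL BP took 4.18, suggesting that iteration counts are similar even with external verification.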
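
The verifier figures above can be cross-checked with a little arithmetic. The sketch below assumes that the 45 in "38/45" is the number of invalid plans among the 100 verified plans, which is what a false positive rate denominator normally means. The remaining counts are derived from that assumption rather than quoted from the review, and the function name is hypothetical.

```python
from fractions import Fraction

# Figures quoted in the review.
TOTAL_PLANS = 100
TRUE_POSITIVES = 54
FALSE_POSITIVES = 38
INVALID_PLANS = 45  # assumption: denominator of the reported 38/45


def verifier_breakdown(total: int, true_pos: int, false_pos: int, invalid: int) -> dict:
    """Derive the rest of the confusion matrix from the reported counts."""
    valid = total - invalid          # plans that are actually valid
    true_neg = invalid - false_pos   # invalid plans correctly rejected
    false_neg = valid - true_pos     # valid plans wrongly rejected
    return {
        "valid_plans": valid,
        "true_negatives": true_neg,
        "false_negatives": false_neg,
        "accuracy": Fraction(true_pos + true_neg, total),
        "false_positive_rate": Fraction(false_pos, invalid),
        "true_positive_rate": Fraction(true_pos, valid),
    }


if __name__ == "__main__":
    result = verifier_breakdown(
        TOTAL_PLANS, TRUE_POSITIVES, FALSE_POSITIVES, INVALID_PLANS
    )
    for name, value in result.items():
        print(f"{name}: {value}")
```

Under this assumption the derived accuracy is (54 + 7)/100 = 61/100, which matches the reported figure. The false positive rate of 38/45 is roughly 84%, meaning the verifier LLM accepted most of the invalid plans it was shown.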

This review was generated by AI and checked by a human editor.