QUICK REVIEW

[論文レビュー] MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback

Xingyao Wang, Zihan Wang|arXiv (Cornell University)|Sep 19, 2023

Topic Modeling被引用数 15

ひとこと要約

MINTは、ツールの使用からの利得と模擬的な自然言語フィードバックを測定し、既存のデータセットから再利用した多様なタスクを横断して、マルチターン設定でLLMsを評価するベンチマークです。

ABSTRACT

To solve complex tasks, large language models (LLMs) often require multiple rounds of interactions with the user, sometimes assisted by external tools. However, current evaluation protocols often emphasize benchmark performance with single-turn exchanges, neglecting the nuanced interactions among the user, LLMs, and external tools, while also underestimating the importance of natural language feedback from users. These oversights contribute to discrepancies between research benchmark evaluations and real-world use cases. We introduce MINT, a benchmark that evaluates LLMs' ability to solve tasks with multi-turn interactions by (1) using tools and (2) leveraging natural language feedback. To ensure reproducibility, we provide an evaluation framework where LLMs can access tools by executing Python code and receive users' natural language feedback simulated by GPT-4. We repurpose a diverse set of established evaluation datasets focusing on reasoning, coding, and decision-making and carefully curate them into a compact subset for efficient evaluation. Our analysis of 20 open- and closed-source LLMs offers intriguing findings. (a) LLMs generally benefit from tools and language feedback, with performance gains (absolute, same below) of 1-8% for each turn of tool use and 2-17% with natural language feedback. (b) Better single-turn performance does not guarantee better multi-turn performance. (c) Surprisingly, on the LLMs evaluated, supervised instruction-finetuning (SIFT) and reinforcement learning from human feedback (RLHF) generally hurt multi-turn capabilities. We expect MINT can help measure progress and incentivize research in improving LLMs' capabilities in multi-turn interactions, especially for open-source communities where multi-turn human evaluation can be less accessible compared to commercial LLMs with a larger user base.

研究の動機と目的

問題解決中のマルチターンツール使用からLLMsがどの程度利益を得られるかを評価する。
マルチターンのLLMパフォーマンスに対する自然言語フィードバックの影響を評価する。
多様な既存データセットを再利用して、コンパクトで再現可能なMINT評価セットを作成する。
ツール強化とフィードバック有効化のマルチターン設定で、オープンソースとクローズドソースLLMsを比較する。
マルチターン評価で明らかになるアーティファクトと障害パターンを分析し、今後のモデル開発に役立てる。

提案手法

LLMsがツールを使用するためのPythonインタプリタを介してPythonコードを実行できる再現可能な評価フレームワークを提供する。
GPT-4を用いてユーザーの自然言語フィードバックをシミュレートし、LLMsがフィードバックを活用する能力を測定する。
推論、コーディング、意思決定を横断する8つのデータセットから29,307インスタンス中の586インスタンスのコンパクトなサブセットをキュレーションする。
4つのクローズドソースと16のオープンソースLLMsを、ベース、SIFT、RLHFのバリアントを含め、様々な対話ターン数（k in {1..5}）で評価する。
SR（成功率）を主要指標とし、回帰由来の改善率（Delta_tools）をツール強化のゲインを定量化する。

Figure 1: An interaction trajectory produced by evaluating gpt-3.5-turbo-0613 with MINT on a mathematical reasoning task. The evaluated model’s outputs are in the blue boxes, and the feedback by gpt-4-0613 in red, dotted ones. Some details are omitted for clarity.

実験結果

リサーチクエスチョン

RQ1LLMsはタスク解決時にマルチターンツール使用を許可するとどれくらい改善するか？
RQ2LLMsは模擬的な自然言語フィードバックをマルチターン対話で活用して性能を向上させることがどれほど効果的か？
RQ3マルチターン、ツール有効化、フィードバック強化評価の下で、オープンソースモデルはクローズドソースモデルとのギャップを縮められるか？
RQ4SIFTとRLHFのトレーニング規制は、マルチターンツール使用とフィードバック活用にどう影響するか？
RQ5マルチターン設定で現れる障害パターンは何で、GPT-4のフィードバックは人間のフィードバックと同等に効果的か？

主な発見

Model	Size	Type	k=1	k=2	k=3	k=4	k=5	Improvement	R^2
CodeLLaMA	7B	Base	0.3	4.1	7.2	7.2	4.3	+1.1	0.38
CodeLLaMA	7B	SIFT	0.3	7.8	10.2	9.7	8.7	+1.9	0.53
CodeLLaMA	13B	Base	0.5	13.7	17.9	19.3	18.4	+4.1	0.70
CodeLLaMA	13B	SIFT	1.5	12.6	13.1	15.0	14.5	+2.8	0.64
CodeLLaMA	34B	Base	0.2	16.2	23.0	25.9	28.2	+6.6	0.85
CodeLLaMA	34B	SIFT	2.6	10.1	14.7	15.4	17.1	+3.4	0.86
LLaMA-2	7B	Base	0.2	5.6	7.3	8.9	9.7	+2.2	0.87
LLaMA-2	7B	RLHF	-	4.3	6.7	6.5	7.3	+1.5	0.83
LLaMA-2	13B	Base	0.2	11.4	15.5	15.2	14.5	+3.2	0.63
LLaMA-2	13B	RLHF	4.1	12.5	12.5	13.3	11.9	+1.7	0.47
LLaMA-2	70B	Base	1.9	19.4	24.6	26.4	26.4	+5.6	0.73
LLaMA-2	70B	RLHF	4.3	14.3	15.7	16.6	17.9	+3.0	0.73
Lemur-v1	70B	Base	1.0	17.9	23.6	25.3	26.3	+5.8	0.77
Lemur-v1	70B	SIFT	3.8	27.0	35.7	37.5	37.0	+7.7	0.73
Vicuna-v1.5	7B	-	0.0	6.7	12.3	15.4	12.6	+3.4	0.77
Vicuna-v1.5	13B	-	0.0	2.2	4.4	6.7	8.4	+2.1	1.00

すべてのモデルがツール使用と自然言語フィードバックの恩恵を受け、追加ツールターンごとに絶対的な改善は1–8%、フィードバックからの改善は2–17%である。
単一ターン時の性能が必ずしもすべてのケースでマルチターン性能の向上を保証するわけではない。
オープンソースモデルは一般にマルチターン性能で最高クローズドソースモデルに遅れをとるが、言語フィードバックでLemur-70b-chat-v1のようにギャップを縮めるモデルもある。
SIFTおよびRLHFトレーニングは多くの場合マルチターン能力を損なうが、例外も存在する（例：Vicuna-7B、Lemur-70b-chat-v1）。
GPT-4シミュレートフィードバックは多くの設定で人間のフィードバックと同様に有益であり、評価者はGPT-4フィードバックをしばしば実地のフィードバックと同等または人間の真のフィードバックと同様に有用と判断する。
フィードバック提供能力は課題解決能力と直交する場合があり、強力な解決者が常に強いフィードバック提供者とは限らない。
MINTはトレーニングデータ（ShareGPTなど）のアーティファクトを検出し、いくつかのモデルでフォーマットや解析の問題を明らかにする。

Figure 2: With an increasing interaction limit $k$ , the success rate (%, micro averaged) improves at different rates for different LLMs. For clarity, only a subset of evaluated LLMs are visualized.

より良い研究を、今すぐ始めましょう

論文設計から論文執筆まで、研究時間を劇的に削減しましょう。

クレジットカード登録不要

このレビューはAIが作成し、人間の編集者が確認しました。