QUICK REVIEW

[Paper Review] AgentBench: Evaluating LLMs as Agents

Xiao Liu, Hao Yu|arXiv (Cornell University)|Aug 7, 2023

Topic Modeling | Citations: 44

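One-Sentence Summary

AgentBench introduces a multi-task benchmark of 8 environments for evaluating LLMs that act as agents. On real-world tasks, it shows a significant gap between top API-based LLMs and OSS models.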

ABSTRACT

The potential of Large Language Model (LLM) as agents has been widely acknowledged recently. Thus, there is an urgent need to quantitatively evaluate LLMs as agents on challenging tasks in interactive environments. We present AgentBench, a multi-dimensional benchmark that consists of 8 distinct environments to assess LLM-as-Agent's reasoning and decision-making abilities. Our extensive test over 27 API-based and open-sourced (OSS) LLMs shows that, while top commercial LLMs present a strong ability of acting as agents in complex environments, there is a significant disparity in performance between them and many OSS competitors that are no larger than 70B. We identify the typical reasons of failures in environments and LLMs, showing that poor long-term reasoning, decision-making, and instruction following abilities are the main obstacles for developing usable LLM agents. Improving instruction following and training on high quality multi-round alignment data could improve agent performance. And different from existing assumptions, training on code present ambivalent impacts on different agent tasks. Datasets, environments, and an integrated evaluation package for AgentBench are released at https://github.com/THUDM/AgentBench.

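Motivation and Objectives

Define a multi-dimensional benchmark (AgentBench) for evaluating LLMs as autonomous agents in interactive environments.
Evaluate LLMs on 8 real-world tasks spanning code-grounded, game-grounded, and web-grounded settings.
Identify the failure modes and factors that limit agent performance, to guide future improvements.
Provide an integrated, API-centric evaluation toolkit that standardizes the agent evaluation workflow.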

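Proposed Method

Formalize interactive evaluation as a partially observable Markov decision process (POMDP).
Use Chain-of-Thought (CoT) prompting as the primary reasoning strategy during evaluation.
Build 8 diverse environments that test instruction following, coding, planning, and tool use: Operating System (OS), Database (DB), Knowledge Graph (KG), Digital Card Game (DCG), Lateral Thinking Puzzles (LTP), House-Holding (HH), Web Shopping (WS), and Web Browsing (WB).
Evaluate 27 LLMs (API-based and OSS) with standardized prompts and zero-temperature decoding, reporting per-task scores and a weighted overall score.
Provide a Docker-based, server-client evaluation toolkit that runs each task in isolation over an HTTP API.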
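
The interaction pattern described above (an agent that sees only observations, acts over several rounds, and is cut off at a round limit) can be illustrated with a minimal sketch. Everything here is hypothetical and written for illustration only: `CountdownEnv`, `run_episode`, and the outcome labels are not taken from the AgentBench codebase, and the toy policy stands in for an LLM call.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical names throughout; none of these come from the AgentBench package.
History = List[Tuple[str, str]]  # (observation, action) pairs seen so far


@dataclass
class CountdownEnv:
    """Toy environment whose state (a counter) is hidden from the agent."""
    remaining: int = 3

    def reset(self) -> str:
        # The agent receives only a text observation, never the counter itself.
        return "Say 'step' until the task is complete."

    def step(self, action: str) -> Tuple[str, bool]:
        if action != "step":
            return "Invalid action.", False
        self.remaining -= 1
        done = self.remaining == 0
        return ("Task complete." if done else "Keep going."), done


def run_episode(env: CountdownEnv,
                policy: Callable[[str, History], str],
                max_rounds: int) -> str:
    """Run one multi-turn episode and report how it ended."""
    history: History = []
    observation = env.reset()
    for _ in range(max_rounds):
        # The policy conditions on the current observation and the past turns.
        action = policy(observation, history)
        history.append((observation, action))
        observation, done = env.step(action)
        if done:
            return "completed"
    # The round budget ran out before the task was solved.
    return "task_limit_exceeded"
```

Calling `run_episode(CountdownEnv(3), lambda obs, hist: "step", max_rounds=2)` ends with the round budget exhausted, which is the kind of outcome the review later refers to as Task Limit Exceeded.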

Figure 1: An overview of LLMs on AgentBench. While LLMs begin to manifest their proficiency in LLM-as-Agent, gaps between models and the distance toward practical usability are significant.

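Experimental Results

Research Questions

RQ1: How do current API-based LLMs and OSS models compare when deployed as agents on diverse, real-world tasks?
RQ2: What are the main failure modes that limit LLMs when they act as agents in sequential, multi-turn environments?
RQ3: To what extent do training on code data and high-quality alignment data improve agent behavior and performance?
RQ4: How do task structure and environment type affect the effectiveness of Chain-of-Thought prompting for agents?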

Key Findings

Model	VER	OA	OS	DB	KG	DCG	LTP	HH	WS	WB
gpt-4	0613	4.01	42.4	32.0	58.8	74.5	16.6	78.0	61.1	29.0
claude-2	-	2.49	18.1	27.3	41.3	55.5	8.4	54.0	61.4	0.0
claude	v1.3	2.44	9.7	22.0	38.9	40.9	8.2	58.0	55.7	25.0
gpt-3.5-turbo	0613	2.32	32.6	36.7	25.9	33.7	10.5	16.0	64.1	20.0
text-davinci-003	-	1.71	20.1	16.3	34.9	3.0	7.1	20.0	61.7	26.0
claude-instant	v1.1	1.60	16.7	18.0	20.8	5.9	12.6	30.0	49.7	4.0
chat-bison-001	-	1.39	9.7	19.7	23.0	16.6	4.4	18.0	60.5	12.0
text-davinci-002	-	1.25	8.3	16.7	41.5	11.8	0.5	16.0	56.3	9.0
llama-2-70b	-	0.78	9.7	13.0	8.0	21.3	0.0	2.0	5.6	19.0
guanaco-65b	-	0.54	8.3	14.7	1.9	0.1	1.5	12.0	0.9	10.0
codellama-34b	-	0.96	2.8	14.0	23.5	8.4	0.7	4.0	52.1	20.0
vicuna-33b	-	0.73	15.3	11.0	1.2	16.3	1.0	6.0	23.9	7.0
wizardlm-30b	-	0.46	13.9	12.7	2.9	0.3	1.8	6.0	4.4	1.0
guanaco-33b	-	0.39	11.1	9.3	3.2	0.3	0.0	6.0	6.2	5.0
vicuna-13b	-	0.93	10.4	6.7	9.4	0.1	8.0	8.0	41.7	12.0
llama-2-13b	-	0.77	4.2	11.7	3.6	26.4	0.0	6.0	25.3	13.0
openchat-13b	-	0.70	15.3	12.3	5.5	0.1	0.0	0.0	46.9	15.0
wizardlm-13b	-	0.66	9.0	12.7	1.7	1.9	0.0	10.0	43.7	12.0
vicuna-7b	-	0.56	9.7	8.7	2.5	0.3	6.4	0.0	2.2	9.0
codellama-13b	-	0.56	3.5	9.7	10.4	0.0	0.0	0.0	43.8	14.0
codellama-7b	-	0.50	4.9	12.7	8.2	0.0	0.0	2.0	25.2	12.0
koala-13b	-	0.34	3.5	5.0	0.4	0.1	4.4	0.0	3.9	7.0
llama-2-7b	-	0.34	4.2	8.0	2.1	6.9	0.0	0.0	11.6	7.0
codegeex2-6b	-	0.27	1.4	0.0	4.8	0.3	0.0	0.0	20.9	11.0
dolly-12b	-	0.14	0.0	0.0	0.0	0.1	1.2	0.0	0.4	9.0
chatglm-6b	-	0.11	4.9	0.3	0.0	0.0	0.0	0.0	0.5	4.9
oasst-12b	-	0.03	1.4	0.0	0.0	0.0	0.0	0.0	0.3	1.0
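
The OA column is a weighted overall score. A minimal sketch of one way such a score can be computed follows, assuming (my reading of the paper, not something stated in this review) that each task's weight is the reciprocal of that task's mean score across the evaluated models, so every task contributes on a comparable scale. The function name `overall_scores` is hypothetical, and because the means here are taken over a three-model, three-task subset of the table, the resulting numbers will not match the OA column.

```python
from statistics import mean
from typing import Dict


def overall_scores(scores: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Average each model's per-task scores after dividing by the task mean.

    Assumption: every model reports the same tasks and no task mean is zero.
    """
    tasks = list(next(iter(scores.values())).keys())
    # Mean score of each task across the supplied models.
    task_means = {t: mean(m[t] for m in scores.values()) for t in tasks}
    return {
        model: mean(per_task[t] / task_means[t] for t in tasks)
        for model, per_task in scores.items()
    }


# Subset of the table above (OS, DB, KG columns only).
subset = {
    "gpt-4": {"OS": 42.4, "DB": 32.0, "KG": 58.8},
    "gpt-3.5-turbo": {"OS": 32.6, "DB": 36.7, "KG": 25.9},
    "llama-2-70b": {"OS": 9.7, "DB": 13.0, "KG": 8.0},
}
oa = overall_scores(subset)
for model, value in oa.items():
    print(f"{model}: {value:.2f}")
```

With this normalization, the overall scores of the supplied models average to 1 by construction, so a value above 1 reads as better than the average model in the pool.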

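GPT-4 achieves the highest overall score (OA 4.01) and leads in most environments, including a 78% success rate on House-Holding.
AgentBench reveals a large gap between API-based commercial LLMs and OSS models; OSS models generally score lower on most tasks.
Training on code data improves performance on procedure-driven tasks but can hurt performance on others, indicating a trade-off between procedural following and general reasoning.
High-quality alignment data (e.g., ShareGPT-style) substantially improves OSS LLMs, sometimes bringing them level with larger but less well-aligned models.
Many OSS models struggle on the KG, DCG, and HH tasks, exposing gaps in long-horizon reasoning and instruction following.
Task Limit Exceeded (TLE) is the dominant failure mode, pointing to limits in multi-turn reasoning and decision-making.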
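
As a rough check on the API versus OSS gap, the OA values from the table can be averaged per group. This is only a sketch: the split into groups is my reading of the table (the first eight rows are commercial API models, the remaining nineteen are OSS), and the variable names are illustrative.

```python
from statistics import mean

# OA column copied from the results table, in row order.
api_oa = [4.01, 2.49, 2.44, 2.32, 1.71, 1.60, 1.39, 1.25]
oss_oa = [
    0.78, 0.54, 0.96, 0.73, 0.46, 0.39, 0.93, 0.77, 0.70, 0.66,
    0.56, 0.56, 0.50, 0.34, 0.34, 0.27, 0.14, 0.11, 0.03,
]

api_mean = mean(api_oa)
oss_mean = mean(oss_oa)

print(f"API-based models: {len(api_oa)}, mean OA {api_mean:.2f}")
print(f"OSS models:       {len(oss_oa)}, mean OA {oss_mean:.2f}")
print(f"Ratio:            {api_mean / oss_mean:.1f}x")
```

On these table values the API group averages about 2.15 and the OSS group about 0.51, roughly a fourfold difference. The best OSS row (codellama-34b, OA 0.96) still falls below the weakest API row (text-davinci-002, OA 1.25).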

Figure 2: AgentBench is the first systematic benchmark to evaluate LLM-as-Agent on a wide array of real-world challenges and 8 distinct environments. In total, 27 LLMs are examined in this edition.


This review was generated by AI and checked by a human editor.