QUICK REVIEW

[Paper Review] AgentBench: Evaluating LLMs as Agents

Xiao Liu, Hao Yu|arXiv (Cornell University)|2023. 08. 07.

Topic Modeling | Citations: 44

One-Line Summary

AgentBench introduces a multi-task benchmark of 8 environments for evaluating LLMs acting as agents, and shows a substantial gap between top API-based LLMs and OSS models on real-world tasks.

ABSTRACT

The potential of Large Language Model (LLM) as agents has been widely acknowledged recently. Thus, there is an urgent need to quantitatively evaluate LLMs as agents on challenging tasks in interactive environments. We present AgentBench, a multi-dimensional benchmark that consists of 8 distinct environments to assess LLM-as-Agent's reasoning and decision-making abilities. Our extensive test over 27 API-based and open-sourced (OSS) LLMs shows that, while top commercial LLMs present a strong ability of acting as agents in complex environments, there is a significant disparity in performance between them and many OSS competitors that are no larger than 70B. We identify the typical reasons of failures in environments and LLMs, showing that poor long-term reasoning, decision-making, and instruction following abilities are the main obstacles for developing usable LLM agents. Improving instruction following and training on high quality multi-round alignment data could improve agent performance. And different from existing assumptions, training on code present ambivalent impacts on different agent tasks. Datasets, environments, and an integrated evaluation package for AgentBench are released at https://github.com/THUDM/AgentBench.

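Research Motivation and Goals

Define a multi-dimensional benchmark (AgentBench) for evaluating LLMs as autonomous agents in interactive environments.
Evaluate LLMs on eight real-world tasks spanning code-grounded, game-grounded, and web-grounded settings.
Identify the failure modes and factors that affect agent performance, to guide future improvements.
Provide an integrated, API-centric evaluation toolkit to standardize agent evaluation workflows.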

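Proposed Method

Formalize interactive evaluation as a partially observable Markov decision process (POMDP).
Use Chain-of-Thought (CoT) prompting as the primary reasoning strategy during evaluation.
Build eight diverse environments to test instruction following, coding, planning, and tool use: Operating System (OS), Database (DB), Knowledge Graph (KG), Digital Card Game (DCG), Lateral Thinking Puzzles (LTP), House-Holding (HH), Web Shopping (WS), and Web Browsing (WB).
Evaluate 27 LLMs (API-based and OSS) with standardized prompts and temperature-0 decoding, reporting per-task scores and an overall score computed as a weighted sum.
Provide a server-client, Docker-based evaluation toolkit that runs tasks in isolation over an HTTP API.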
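
The multi-turn interaction described above can be pictured as a loop in which the agent sees only text observations and must finish within a round budget. The following is a minimal sketch under stated assumptions, not the AgentBench toolkit itself: `CountdownEnv`, `run_episode`, `always_decrement`, and the two-value `Outcome` enum are hypothetical names invented for illustration, and a real agent would call an LLM instead of returning a fixed action.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Tuple


class Outcome(Enum):
    # Hypothetical outcome labels; only the task-limit case is named in this review.
    COMPLETED = "completed"
    TASK_LIMIT_EXCEEDED = "task limit exceeded"


@dataclass
class CountdownEnv:
    """Toy, hypothetical environment: the task is solved once the counter hits zero."""
    counter: int

    def observe(self) -> str:
        # The agent receives only this text observation.
        return f"counter={self.counter}"

    def step(self, action: str) -> Tuple[str, bool]:
        # Apply the action, then return the new observation and a done flag.
        if action == "decrement" and self.counter > 0:
            self.counter -= 1
        return self.observe(), self.counter == 0


def run_episode(env: CountdownEnv,
                agent: Callable[[List[str]], str],
                max_rounds: int) -> Outcome:
    """Alternate agent actions and environment observations for up to max_rounds."""
    history = [env.observe()]
    for _ in range(max_rounds):
        action = agent(history)            # a real agent would query an LLM here
        observation, done = env.step(action)
        history.extend([action, observation])
        if done:
            return Outcome.COMPLETED
    # The round budget ran out before the task was solved.
    return Outcome.TASK_LIMIT_EXCEEDED


def always_decrement(history: List[str]) -> str:
    """Stand-in policy that ignores the history and always emits the same action."""
    return "decrement"
```

Calling `run_episode(CountdownEnv(3), always_decrement, max_rounds=2)` exhausts the budget before the counter reaches zero, which is the kind of episode a round-limit failure category would capture.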

Figure 1: An overview of LLMs on AgentBench. While LLMs begin to manifest their proficiency in LLM-as-Agent, gaps between models and the distance toward practical usability are significant.

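Experimental Results

Research Questions

RQ1: How do current API-based LLMs and OSS LLMs compare when deployed as agents on diverse, realistic tasks?
RQ2: What are the main failure modes that limit LLMs from operating effectively as agents in sequential, multi-turn environments?
RQ3: To what extent do code training and high-quality alignment data improve agent behavior and performance?
RQ4: How do task structure and environment type affect the effectiveness of Chain-of-Thought prompting for agents?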

Key Results

Model	VER	OA (overall)	OS	DB	KG	DCG	LTP	HH	WS	WB
gpt-4	0613	4.01	42.4	32.0	58.8	74.5	16.6	78.0	61.1	29.0
claude-2	-	2.49	18.1	27.3	41.3	55.5	8.4	54.0	61.4	0.0
claude	v1.3	2.44	9.7	22.0	38.9	40.9	8.2	58.0	55.7	25.0
gpt-3.5-turbo	0613	2.32	32.6	36.7	25.9	33.7	10.5	16.0	64.1	20.0
text-davinci-003	-	1.71	20.1	16.3	34.9	3.0	7.1	20.0	61.7	26.0
claude-instant	v1.1	1.60	16.7	18.0	20.8	5.9	12.6	30.0	49.7	4.0
chat-bison-001	-	1.39	9.7	19.7	23.0	16.6	4.4	18.0	60.5	12.0
text-davinci-002	-	1.25	8.3	16.7	41.5	11.8	0.5	16.0	56.3	9.0
llama-2-70b	-	0.78	9.7	13.0	8.0	21.3	0.0	2.0	5.6	19.0
guanaco-65b	-	0.54	8.3	14.7	1.9	0.1	1.5	12.0	0.9	10.0
codellama-34b	-	0.96	2.8	14.0	23.5	8.4	0.7	4.0	52.1	20.0
vicuna-33b	-	0.73	15.3	11.0	1.2	16.3	1.0	6.0	23.9	7.0
wizardlm-30b	-	0.46	13.9	12.7	2.9	0.3	1.8	6.0	4.4	1.0
guanaco-33b	-	0.39	11.1	9.3	3.2	0.3	0.0	6.0	6.2	5.0
vicuna-13b	-	0.93	10.4	6.7	9.4	0.1	8.0	8.0	41.7	12.0
llama-2-13b	-	0.77	4.2	11.7	3.6	26.4	0.0	6.0	25.3	13.0
openchat-13b	-	0.70	15.3	12.3	5.5	0.1	0.0	0.0	46.9	15.0
wizardlm-13b	-	0.66	9.0	12.7	1.7	1.9	0.0	10.0	43.7	12.0
vicuna-7b	-	0.56	9.7	8.7	2.5	0.3	6.4	0.0	2.2	9.0
codellama-13b	-	0.56	3.5	9.7	10.4	0.0	0.0	0.0	43.8	14.0
codellama-7b	-	0.50	4.9	12.7	8.2	0.0	0.0	2.0	25.2	12.0
koala-13b	-	0.34	3.5	5.0	0.4	0.1	4.4	0.0	3.9	7.0
llama-2-7b	-	0.34	4.2	8.0	2.1	6.9	0.0	0.0	11.6	7.0
codegeex2-6b	-	0.27	1.4	0.0	4.8	0.3	0.0	0.0	20.9	11.0
dolly-12b	-	0.14	0.0	0.0	0.0	0.1	1.2	0.0	0.4	9.0
chatglm-6b	-	0.11	4.9	0.3	0.0	0.0	0.0	0.0	0.5	4.9
oasst-12b	-	0.03	1.4	0.0	0.0	0.0	0.0	0.0	0.3	1.0
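
The OA column aggregates the eight per-task scores, which sit on very different scales (compare the WS and LTP columns). The sketch below shows one way such a weighted aggregate could be computed. It assumes each task's weight is the reciprocal of that task's mean score across all evaluated models, a detail this review does not state. The function names and the toy data are hypothetical.

```python
from statistics import mean
from typing import Dict

Scores = Dict[str, Dict[str, float]]  # model name -> task name -> task score


def task_weights(scores: Scores) -> Dict[str, float]:
    """Assumed weighting: reciprocal of each task's mean score over all models."""
    tasks = next(iter(scores.values())).keys()
    weights = {}
    for task in tasks:
        avg = mean(model_scores[task] for model_scores in scores.values())
        # A task on which every model scores zero gets weight 0 to avoid dividing by zero.
        weights[task] = 1.0 / avg if avg > 0 else 0.0
    return weights


def overall_scores(scores: Scores) -> Dict[str, float]:
    """Average the weighted task scores so that each task contributes on a comparable scale."""
    weights = task_weights(scores)
    return {
        model: mean(model_scores[task] * weights[task] for task in weights)
        for model, model_scores in scores.items()
    }


# Toy data, not taken from the table: two models and two tasks on different scales.
toy = {
    "model_a": {"task_x": 60.0, "task_y": 3.0},
    "model_b": {"task_x": 20.0, "task_y": 1.0},
}
toy_overall = overall_scores(toy)
```

Under this weighting, a model that scores exactly the cross-model average on every task receives an overall score of 1, so values above 1 indicate better-than-average agents.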

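GPT-4 achieves the highest overall score (OA 4.01) and leads in six of the eight environments, notably reaching a 78% success rate on the House-Holding (HH) task.
There is a substantial gap between commercial API-based LLMs and OSS models on AgentBench; OSS models generally underperform across tasks.
Training on code data can improve performance on procedure-driven tasks but can hurt it on others, suggesting a trade-off between following procedures and general reasoning.
High-quality alignment data (e.g., ShareGPT-style) markedly improves OSS LLMs, in some cases bringing them nearly on par with larger, less-aligned models.
Many OSS models struggle on the KG, DCG, and HH tasks, exposing gaps in long-term reasoning and instruction following.
Task Limit Exceeded (TLE) is the dominant failure mechanism, pointing to limits in multi-turn reasoning and decision-making.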
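
The API-versus-OSS gap can be checked directly against the OA column of the results table. The sketch below copies those OA values and averages them per group. Treating the first eight rows (gpt-4 through text-davinci-002) as the API-based group and the remaining nineteen as OSS is an assumption based on the model names, since the table itself does not label the groups.

```python
from statistics import mean

# OA values copied from the results table; the API/OSS grouping is an assumption.
API_OA = {
    "gpt-4": 4.01, "claude-2": 2.49, "claude": 2.44, "gpt-3.5-turbo": 2.32,
    "text-davinci-003": 1.71, "claude-instant": 1.60, "chat-bison-001": 1.39,
    "text-davinci-002": 1.25,
}
OSS_OA = {
    "llama-2-70b": 0.78, "guanaco-65b": 0.54, "codellama-34b": 0.96,
    "vicuna-33b": 0.73, "wizardlm-30b": 0.46, "guanaco-33b": 0.39,
    "vicuna-13b": 0.93, "llama-2-13b": 0.77, "openchat-13b": 0.70,
    "wizardlm-13b": 0.66, "vicuna-7b": 0.56, "codellama-13b": 0.56,
    "codellama-7b": 0.50, "koala-13b": 0.34, "llama-2-7b": 0.34,
    "codegeex2-6b": 0.27, "dolly-12b": 0.14, "chatglm-6b": 0.11,
    "oasst-12b": 0.03,
}

api_mean = mean(API_OA.values())
oss_mean = mean(OSS_OA.values())
best_oss = max(OSS_OA, key=OSS_OA.get)    # highest-scoring OSS model
worst_api = min(API_OA, key=API_OA.get)   # lowest-scoring API model

print(f"API mean OA: {api_mean:.2f}")     # 2.15
print(f"OSS mean OA: {oss_mean:.2f}")     # 0.51
print(f"Best OSS model: {best_oss} ({OSS_OA[best_oss]:.2f})")
print(f"Weakest API model: {worst_api} ({API_OA[worst_api]:.2f})")
```

On these numbers the best OSS model (codellama-34b, 0.96) still scores below the weakest API model (text-davinci-002, 1.25), which makes the reported gap concrete.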

Figure 2: AgentBench is the first systematic benchmark to evaluate LLM-as-Agent on a wide array of real-world challenges and 8 distinct environments. In total, 27 LLMs are examined in this edition.


This review was generated by AI and reviewed by a human editor.