QUICK REVIEW

[Paper Review] TPTU-v2: Boosting Task Planning and Tool Usage of Large Language Model-based Agents in Real-world Systems

Yilun Kong, Jingqing Ruan|arXiv (Cornell University)|2023. 11. 19.

Topic Modeling | Citations: 7

One-line summary

This paper introduces a three-component framework (API Retriever, LLM Finetuner, Demo Selector) to improve task planning and API usage by LLM-based agents in real-world systems, and validates it on real-world data and ToolBench.

ABSTRACT

Large Language Models (LLMs) have demonstrated proficiency in addressing tasks that require a blend of task planning and the utilization of external tools, such as APIs. However, real-world complex systems present three prevalent challenges concerning task planning and tool usage: (1) The real system usually has a vast array of APIs, so it is impossible to feed the descriptions of all APIs to the prompt of LLMs as the token length is limited; (2) the real system is designed for handling complex tasks, and the base LLMs can hardly plan a correct sub-task order and API-calling order for such tasks; (3) Similar semantics and functionalities among APIs in real systems create challenges for both LLMs and even humans in distinguishing between them. In response, this paper introduces a comprehensive framework aimed at enhancing the Task Planning and Tool Usage (TPTU) abilities of LLM-based agents operating within real-world systems. Our framework comprises three key components designed to address these challenges: (1) the API Retriever selects the most pertinent APIs for the user task among the extensive array available; (2) LLM Finetuner tunes a base LLM so that the finetuned LLM can be more capable for task planning and API calling; (3) the Demo Selector adaptively retrieves different demonstrations related to hard-to-distinguish APIs, which is further used for in-context learning to boost the final performance. We validate our methods using a real-world commercial system as well as an open-sourced academic dataset, and the outcomes clearly showcase the efficacy of each individual component as well as the integrated framework.

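Motivation and Objectives

Identifies the practical challenges LLM-based agents face in real-world systems: a vast API set, complex task and API ordering, and similarity among APIs.
Presents a three-component framework: API Retriever, LLM Finetuner, and Demo Selector.
Demonstrates the effectiveness of each component and of the integrated framework on a real-world system and an open-source dataset.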

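Proposed Method

The API Retriever selects the most relevant APIs from a large API collection, using semantic embeddings from dual-stream SBERT training with Multiple Negatives Ranking Loss.
The LLM Finetuner performs supervised fine-tuning on carefully constructed datasets to strengthen task planning and API calling in the context of the real system.
The Demo Selector dynamically retrieves subtask-level or API-level demonstrations by embedding similarity, to improve in-context learning and help distinguish similar APIs.
Training data for the API Retriever relies on instruction-API pairs and a mixed human/LLM annotation process.
The fine-tuning datasets include Training Set v1 (real-world distribution), Training Set v2 (prompts that include a feature list), and Training Set v3 (diverse prompts and multi-step API interactions).
The Demo Selector uses embeddings of the Knowledge Database and the API Collection to fetch the top-k demonstrations, falling back to API-level demonstrations when needed.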
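
The two retrieval ideas above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: `mnr_loss` shows the usual in-batch form of Multiple Negatives Ranking Loss (each instruction's paired API is the positive, every other API in the batch is a negative), and `select_demos` shows top-k retrieval with a fallback. The function names, the `scale` value, the `threshold` cutoff, and the rule that triggers the fallback are all assumptions made for the example; the review only says the fallback happens "when needed".

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mnr_loss(instr_embs, api_embs, scale=20.0):
    """In-batch Multiple Negatives Ranking Loss (illustrative).

    Row i of instr_embs is paired with row i of api_embs (the positive);
    all other rows of api_embs act as negatives for instruction i.
    `scale` multiplies the cosine scores; 20.0 is an assumed value.
    """
    if len(instr_embs) != len(api_embs):
        raise ValueError("each instruction needs exactly one paired API")
    total = 0.0
    for i, query in enumerate(instr_embs):
        logits = [scale * cosine(query, api) for api in api_embs]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        # Cross-entropy with the paired API (index i) as the target.
        total += log_z - logits[i]
    return total / len(instr_embs)


def select_demos(query_emb, subtask_demos, api_demos, k=2, threshold=0.5):
    """Return ("subtask", demos) or, as a fallback, ("api", demos).

    subtask_demos and api_demos are lists of (embedding, demo_text).
    `threshold` is a hypothetical similarity cutoff: if the best
    subtask-level demo scores below it, API-level demos are used.
    """
    def top_k(pool):
        scored = [(cosine(query_emb, emb), demo) for emb, demo in pool]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]

    best = top_k(subtask_demos)
    if best and best[0][0] >= threshold:
        return "subtask", [demo for _, demo in best]
    return "api", [demo for _, demo in top_k(api_demos)]
```

With correctly paired embeddings the loss is close to zero, and it grows when the pairs are shuffled, which is the signal that pulls each instruction toward its own API and away from look-alike APIs in the same batch.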

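Experimental Results

Research Questions

RQ1: How effectively can API retrieval improve the relevance of APIs for task planning in a large-scale API ecosystem?
RQ2: Does fine-tuning the LLM on domain-specific data increase task planning and API-calling accuracy?
RQ3: Does adaptive demonstration retrieval help the model distinguish semantically similar APIs and improve final task completion?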

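Key Results

The API Retriever achieves a Recall@5 of 84.64% and a Recall@10 of 98.47% in the real-world setting.
In the real-world setting, execution accuracy rises from 38.89% for the base LLM without demonstrations to 43.33% with the API Retriever, 95.55% with the Demo Selector, and 80% with the fine-tuned LLM plus API Retriever; integrating all components reaches 96.67%.
In the open-source setting, the base LLM's execution accuracy is 76.67%; with the API Retriever alone it drops to 53.3%, a decline attributed to complexity, whereas the fine-tuned LLM plus API Retriever reaches 86.7%.
The highest real-world performance (96.67%) comes from combining the fine-tuned LLM, API Retriever, and Demo Selector, underscoring the value of integrating the components.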
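
For readers unfamiliar with the Recall@5 and Recall@10 figures above, the metric can be sketched as follows. This assumes the common definition (the fraction of a query's relevant APIs that appear among the top K retrieved, averaged over queries); the paper's exact evaluation protocol may differ, and the API names used in the note after the code are hypothetical.

```python
def recall_at_k(ranked_apis, relevant_apis, k):
    """Fraction of the relevant APIs found in the top-k retrieved APIs."""
    relevant = set(relevant_apis)
    if not relevant:
        raise ValueError("relevant_apis must not be empty")
    top_k = set(ranked_apis[:k])
    return len(relevant & top_k) / len(relevant)


def mean_recall_at_k(queries, k):
    """Average Recall@K over (ranked_apis, relevant_apis) pairs."""
    if not queries:
        raise ValueError("queries must not be empty")
    return sum(recall_at_k(r, rel, k) for r, rel in queries) / len(queries)
```

For example, if a retriever ranks `["search_logs", "get_user", "restart_service"]` for a task that needs `search_logs` and `restart_service`, then Recall@2 is 0.5 and Recall@3 is 1.0. The gap between the reported Recall@5 and Recall@10 reflects the same effect: a larger K recovers more of the needed APIs at the cost of a longer prompt.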


This review was generated by AI and reviewed by a human editor.