[Paper Review] Are LLMs All You Need for Task-Oriented Dialogue?
The paper evaluates instruction-tuned LLMs for task-oriented dialogue without fine-tuning, finding weak belief-state tracking but potential in response generation when belief states are correct; few-shot in-domain examples help, and zero-shot results on MultiWOZ and Schema-Guided data are competitive without in-domain training.
Instruction-tuned Large Language Models (LLMs) have recently gained huge popularity thanks to their ability to interact with users through conversation. In this work, we aim to evaluate their ability to complete multi-turn tasks and interact with external databases in the context of established task-oriented dialogue benchmarks. We show that, for explicit belief state tracking, LLMs underperform specialized task-specific models. Nevertheless, they show the ability to guide the dialogue to a successful ending if given correct slot values. Furthermore, this ability improves with access to the true belief state distribution or in-domain examples.
Research Motivation and Objectives
- Assess the capability of instruction-tuned LLMs to perform task-oriented dialogue (TOD) without fine-tuning.
- Compare zero-shot and few-shot in-context learning for TOD across multiple datasets.
- Analyze domain detection, belief-state tracking, and response generation within an end-to-end TOD pipeline.
- Examine the impact of using oracle vs. generated belief states on downstream tasks.
Proposed Method
- Propose an LLM-based TOD pipeline with three calls per turn: domain detection/state tracking, database retrieval, and response generation.
- Use simple, universal prompts for all LLMs without heavy prompt engineering.
- Evaluate domain detection accuracy, belief-state tracking (JGA and Slot-F1), and response quality (BLEU) and dialogue success on TOD benchmarks.
- In few-shot settings, maintain a context store of retrieved domain-specific examples and augment prompts with positive/negative instances to aid learning.
- Compare zero-shot and few-shot variants across multiple instruction-tuned models and datasets (MultiWOZ 2.2, Schema Guided Dataset).
- Assess the influence of oracle vs. generated belief states on downstream performance.
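The per-turn pipeline described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the prompt wording, the `slot:value` output format, the toy `DATABASE`, and the helper names (`run_turn`, `parse_belief_state`, `query_database`) are all assumptions made for this example. The sketch also treats domain detection and state tracking as two separate model calls, which is one possible reading of the three-call description. The `llm` argument is any function that maps a prompt string to a completion string.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical toy database: domain -> list of entity records.
DATABASE: Dict[str, List[Dict[str, str]]] = {
    "restaurant": [
        {"name": "Golden Wok", "food": "chinese", "area": "north"},
        {"name": "Roma", "food": "italian", "area": "centre"},
    ],
}


def query_database(domain: str, belief_state: Dict[str, str]) -> List[Dict[str, str]]:
    """Return the records in `domain` that match every slot in the belief state."""
    return [
        record
        for record in DATABASE.get(domain, [])
        if all(record.get(slot) == value for slot, value in belief_state.items())
    ]


def parse_belief_state(raw: str) -> Dict[str, str]:
    """Parse comma-separated 'slot:value' pairs, skipping malformed chunks."""
    state: Dict[str, str] = {}
    for chunk in raw.split(","):
        if ":" not in chunk:
            continue
        slot, value = chunk.split(":", 1)
        if slot.strip() and value.strip():
            state[slot.strip().lower()] = value.strip().lower()
    return state


def run_turn(
    llm: Callable[[str], str],
    history: List[str],
    user_utterance: str,
    belief_state: Dict[str, str],
) -> Tuple[str, Dict[str, str], str]:
    """Run one dialogue turn: detect domain, track state, query DB, respond."""
    context = "\n".join(history + [f"User: {user_utterance}"])

    # Call 1: domain detection (hypothetical prompt wording).
    domain = llm(f"Name the domain of this dialogue.\n{context}\nDomain:").strip().lower()

    # Call 2: belief-state tracking; new values overwrite older ones.
    raw_state = llm(f"List slot:value pairs for domain {domain}.\n{context}\nState:")
    new_state = {**belief_state, **parse_belief_state(raw_state)}

    # Database retrieval is plain code, not a model call.
    matches = query_database(domain, new_state)

    # Call 3: response generation conditioned on state and match count.
    response = llm(
        f"Domain: {domain}\nState: {new_state}\n"
        f"Database matches: {len(matches)}\n{context}\nSystem:"
    )
    return domain, new_state, response.strip()
```

Passing a gold belief state in place of the parsed model output in `run_turn` would correspond to the oracle setting, which is how the effect of state-tracking errors on the response can be isolated.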
Experimental Results
Research Questions
- RQ1: Can instruction-tuned LLMs perform TOD tasks out-of-the-box without fine-tuning?
- RQ2: How do zero-shot and few-shot prompting affect domain detection, state tracking, and response generation in TOD?
- RQ3: What is the impact of providing oracle belief states versus model-generated belief states on overall dialogue success?
- RQ4: Do out-of-the-box LLMs achieve state-of-the-art unsupervised TOD results on standard benchmarks?
- RQ5: How do few-shot exemplars retrieved from the context store affect performance as the number of retrieved examples increases?
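To make the few-shot setup behind RQ5 concrete, the sketch below retrieves the `k` stored examples most similar to the current dialogue context and prepends them to the prompt. It is only an illustration under stated assumptions: the review does not say how similarity is computed, so a simple bag-of-words cosine similarity stands in for whatever retriever is actually used, and the function names and prompt layout are hypothetical. Positive/negative instances are not modeled here.

```python
import math
from collections import Counter
from typing import List, Tuple

Example = Tuple[str, str]  # (dialogue context, belief state string)


def bag_of_words(text: str) -> Counter:
    """Lowercased whitespace tokens with counts (stand-in for a real encoder)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve_examples(context_store: List[Example], query: str, k: int) -> List[Example]:
    """Return the k stored examples whose context is most similar to the query."""
    query_vec = bag_of_words(query)
    ranked = sorted(
        context_store,
        key=lambda example: cosine(bag_of_words(example[0]), query_vec),
        reverse=True,
    )
    return ranked[:k]


def build_few_shot_prompt(examples: List[Example], query: str) -> str:
    """Prepend retrieved examples to the query in a hypothetical prompt layout."""
    shots = "\n\n".join(f"Context: {ctx}\nState: {state}" for ctx, state in examples)
    return f"{shots}\n\nContext: {query}\nState:"
```

Varying `k` in `retrieve_examples` is the knob that RQ5 asks about: more retrieved examples give the model more in-domain evidence at the cost of a longer prompt.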
Key Results
- LLMs underperform in explicit belief-state tracking compared to specialized TOD models.
- When given correct belief states, some LLMs generate responses whose quality is competitive with earlier fine-tuned models.
- Zero-shot TOD with instruction-tuned LLMs achieves state-of-the-art unsupervised results on MultiWOZ and Schema-Guided datasets, within the constraints of no in-domain fine-tuning.
- Few-shot in-domain examples improve performance, particularly when belief states are oracle-provided.
- ChatGPT often outperforms other models on dialogue-level success and belief-state metrics, highlighting robust out-of-the-box capabilities.
- Prompting and post-processing can mitigate some prompt-recoverable errors and reduce hallucinations, though inherent errors persist across models.
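The belief-state results above are reported in terms of joint goal accuracy (JGA) and Slot-F1. As a rough guide to what these numbers measure, the sketch below uses the common textbook definitions: JGA is the fraction of turns where the predicted state matches the gold state exactly, and Slot-F1 is the micro-averaged F1 over individual slot-value pairs. The benchmark's actual evaluation scripts may differ in details such as value normalization or fuzzy matching, so treat this as an assumption-laden illustration rather than the official metric code.

```python
from typing import Dict, List

State = Dict[str, str]  # slot name -> value for one turn


def joint_goal_accuracy(predicted: List[State], gold: List[State]) -> float:
    """Fraction of turns whose predicted state equals the gold state exactly."""
    if not gold:
        return 0.0
    correct = sum(1 for pred, ref in zip(predicted, gold) if pred == ref)
    return correct / len(gold)


def slot_f1(predicted: List[State], gold: List[State]) -> float:
    """Micro-averaged F1 over (slot, value) pairs across all turns."""
    true_pos = false_pos = false_neg = 0
    for pred, ref in zip(predicted, gold):
        pred_pairs, ref_pairs = set(pred.items()), set(ref.items())
        true_pos += len(pred_pairs & ref_pairs)
        false_pos += len(pred_pairs - ref_pairs)
        false_neg += len(ref_pairs - pred_pairs)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

The two metrics can diverge sharply: a single wrong slot makes a whole turn count as incorrect for JGA while only slightly lowering Slot-F1, which is one reason a model can look weak on state tracking yet still steer a dialogue toward success.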
This review was generated by AI and reviewed by a human editor.