QUICK REVIEW

[论文解读] A Preliminary Evaluation of ChatGPT for Zero-shot Dialogue Understanding

Wenbo Pan, Qiguang Chen|arXiv (Cornell University)|Apr 9, 2023

Topic Modeling被引用 21

一句话总结

本文评估 ChatGPT 在 SLU 与 DST 的零-shot 能力，显示在多轮提示下 DST 表现强劲但槽位填充较弱，并提出一个多轮交互式提示框架。

ABSTRACT

Zero-shot dialogue understanding aims to enable dialogue to track the user's needs without any training data, which has gained increasing attention. In this work, we investigate the understanding ability of ChatGPT for zero-shot dialogue understanding tasks including spoken language understanding (SLU) and dialogue state tracking (DST). Experimental results on four popular benchmarks reveal the great potential of ChatGPT for zero-shot dialogue understanding. In addition, extensive analysis shows that ChatGPT benefits from the multi-turn interactive prompt in the DST task but struggles to perform slot filling for SLU. Finally, we summarize several unexpected behaviors of ChatGPT in dialogue understanding tasks, hoping to provide some insights for future research on building zero-shot dialogue understanding systems with Large Language Models (LLMs).

研究动机与目标

Investigate zero-shot dialogue understanding capabilities of ChatGPT on SLU and DST benchmarks.
Assess how prompt design affects ChatGPT performance in single-turn vs. multi-turn settings.
Identify behaviors and limitations of ChatGPT in zero-shot dialogue tasks to inform future research.

提出的方法

Prompts are designed for zero-shot SLU with schema, regulations, and sentence input to elicit intents and slots.
A multi-turn interactive prompt framework is proposed for DST to leverage ChatGPT’s context tracking across turns.
Evaluations compare ChatGPT with GPT-3.5, Codex, and SOTA baselines on SLU (ATIS, SNIPS) and DST (MultiWOZ 2.1, 2.4).
Analysis includes error categories (undefined slot values, slot format violations, verbose responses) and prompt-length considerations.

实验结果

研究问题

RQ1Can ChatGPT perform zero-shot SLU and DST on standard benchmarks?
RQ2Does a multi-turn interactive prompting strategy improve DST over single-turn prompts?
RQ3How do prompt designs (descriptions, examples, names) affect slot filling in SLU?
RQ4What unexpected behaviors does ChatGPT exhibit in zero-shot dialogue understanding, and how can they be mitigated?

主要发现

模型	SNIPS 意图	SNIPS 槽位	ATIS 意图	ATIS 槽位	MultiWOZ2.1 JGA	MultiWOZ2.1 槽位准确率	MultiWOZ2.4 JGA	MultiWOZ2.4 槽位准确率
GPT-3.5	97.71	58.24	75.22	15.71	60.28	97.83	64.23	98.12
Codex	98.42	68.90	89.92	57.29	34.38	95.12	37.50	95.68
Finetuned SoTA	98.80	97.10	98.00	96.10	61.02	98.05	75.90	-
ChatGPT	97.71	58.24	75.22	15.71	60.28	97.83	64.23	98.12

ChatGPT achieves zero-shot dialogue understanding on SLU and DST benchmarks, with a gap to fine-tuned SOTA.
ChatGPT surpasses GPT-3.5 and Codex on MultiWOZ 2.1/2.4 DST, likely due to multi-turn prompts leveraging context.
ChatGPT underperforms in SLU slot filling, but performance improves with slot names, descriptions, and examples.
Multi-turn interactive prompts improve DST performance over single-turn prompts (e.g., JGA: 60.02 vs 58.05; Slot Accuracy: 97.80 vs 97.74).
ChatGPT displays unexpected behaviors (undefined slot values, format violations, verbose outputs) and prompt length limits can cause forgetting over long conversations.

更好的研究，从现在开始

从论文设计到论文写作，大幅缩短您的研究时间。

无需绑定信用卡

本解读由 AI 生成，并经人工编辑审核。