QUICK REVIEW

[Paper Review] ChatGPT Evaluation on Sentence Level Relations: A Focus on Temporal, Causal, and Discourse Relations

Chunkit Chan, Jay J. Cheng | arXiv (Cornell University) | Apr 28, 2023

Topic Modeling | Cited by 30

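One-Sentence Summary

This paper quantitatively evaluates ChatGPT on inter-sentential relations (temporal, causal, and discourse) across 11 datasets, using three prompt settings (zero-shot prompt, zero-shot prompt engineering, and in-context learning) to establish baseline performance.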

ABSTRACT

This paper aims to quantitatively evaluate the performance of ChatGPT, an interactive large language model, on inter-sentential relations such as temporal relations, causal relations, and discourse relations. Given ChatGPT's promising performance across various tasks, we proceed to carry out thorough evaluations on the whole test sets of 11 datasets, including temporal and causal relations, PDTB2.0-based, and dialogue-based discourse relations. To ensure the reliability of our findings, we employ three tailored prompt templates for each task, including the zero-shot prompt template, zero-shot prompt engineering (PE) template, and in-context learning (ICL) prompt template, to establish the initial baseline scores for all popular sentence-pair relation classification tasks for the first time. Through our study, we discover that ChatGPT exhibits exceptional proficiency in detecting and reasoning about causal relations, albeit it may not possess the same level of expertise in identifying the temporal order between two events. While it is capable of identifying the majority of discourse relations with existing explicit discourse connectives, the implicit discourse relation remains a formidable challenge. Concurrently, ChatGPT demonstrates subpar performance in the dialogue discourse parsing task that requires structural understanding in a dialogue before being aware of the discourse relation.

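Motivation and Objectives

Evaluate how well ChatGPT understands inter-sentential relations (temporal, causal, and discourse) across diverse datasets.
Quantify performance under three prompting paradigms: zero-shot prompt, zero-shot prompt engineering (PE), and in-context learning (ICL).
Identify ChatGPT's strengths and limitations across relation types and the fine-grained relations within each type.
Provide baselines and insights to guide future research on large language models for relational text understanding.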

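Proposed Method

Frame relation classification as a multiple-choice task using three tailored prompt templates (zero-shot prompt, zero-shot PE, and ICL).
Evaluate ChatGPT on the full test sets of 11 datasets covering temporal, causal, and discourse relations.
Analyze performance at the relation level and for fine-grained relations within each type (for example, BEFORE/AFTER for temporal relations, and explicit versus implicit discourse relations).
Compare ChatGPT against baselines (random, BERT-base, and fine-tuned state-of-the-art RoBERTa), reporting accuracy and macro-F1 where applicable.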
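
The multiple-choice framing and the two reported metrics can be sketched as follows. This is a minimal illustration, not the paper's actual templates: the label set, the prompt wording, and the function names (`build_prompt`, `accuracy`, `macro_f1`) are all assumptions made for the example, and no model call is included.

```python
# Hypothetical label set for a temporal task; the paper's datasets use their own labels.
LABELS = ["BEFORE", "AFTER", "SIMULTANEOUS", "VAGUE"]


def build_prompt(sent1, sent2, labels, mode="zero_shot", demos=()):
    """Render one sentence pair as a multiple-choice question.

    mode is "zero_shot", "pe" (prompt engineering), or "icl" (in-context learning).
    demos is a sequence of (sentence1, sentence2, answer_letter) tuples, used by "icl".
    """
    options = "\n".join(
        f"{chr(ord('A') + i)}. {label}" for i, label in enumerate(labels)
    )
    question = (
        f"Sentence 1: {sent1}\nSentence 2: {sent2}\n"
        f"Which relation holds between the two sentences?\n{options}\nAnswer:"
    )
    if mode == "zero_shot":
        return question
    if mode == "pe":
        # Illustrative instruction only; the paper's PE wording is not reproduced here.
        return "Task: choose the single best relation label.\n\n" + question
    if mode == "icl":
        # Prepend solved demonstrations, then the unanswered query.
        shots = "\n\n".join(
            build_prompt(d1, d2, labels) + f" {answer}" for d1, d2, answer in demos
        )
        return shots + "\n\n" + question
    raise ValueError(f"unknown mode: {mode}")


def accuracy(gold, pred):
    """Fraction of predictions that match the gold labels."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    """Unweighted mean of per-label F1, computed over labels that occur in gold.

    Assumption: labels that appear only in pred are ignored here.
    """
    scores = []
    for label in sorted(set(gold)):
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        scores.append(2 * tp / (2 * tp + fp + fn))
    return sum(scores) / len(scores)


if __name__ == "__main__":
    demo = [("He woke up.", "He made coffee.", "A")]
    print(build_prompt("She left.", "It rained.", LABELS, mode="icl", demos=demo))
```

Macro-F1 weights every label equally, so it can diverge sharply from accuracy on skewed label distributions, which is one reason to report both.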

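Experimental Results

Research Questions

RQ1: How well does ChatGPT identify the temporal relation between two events on standard datasets?
RQ2: How does ChatGPT compare with the baselines at detecting and reasoning about causal relations?
RQ3: How effective is ChatGPT at identifying explicit and implicit discourse relations, including in dialogue discourse parsing?
RQ4: How do the different prompting strategies (zero-shot prompt, PE, ICL) affect ChatGPT's performance on these relation tasks?
RQ5: Which fine-grained relation patterns (for example, BEFORE/AFTER, or explicit versus implicit connectives) affect ChatGPT's success?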

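Key Findings

On the temporal relation datasets TB-Dense, MATRES, and TDDMan, ChatGPT lags behind fine-tuned models.
Prompt engineering generally improves temporal performance over the standard prompt, with particularly notable gains on TB-Dense, MATRES, and TDDMan.
ChatGPT shows strong causal reasoning on COPA and is competitive on e-CARE and HeadlineCause; prompt engineering helps on COPA and e-CARE.
Explicit discourse relations are easier to identify when explicit connectives are present and label dependencies are exploited, but implicit discourse relations remain challenging.
In multi-party dialogue discourse parsing, ChatGPT underperforms the supervised baselines, and in-context learning and prompt engineering yield only limited gains.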


This review was generated by AI and reviewed by human editors.