QUICK REVIEW

[Paper Review] ChatGPT is a Knowledgeable but Inexperienced Solver: An Investigation of Commonsense Problem in Large Language Models

Ning Bian, Xianpei Han|arXiv (Cornell University)|Mar 29, 2023

Topic Modeling|Cited by 47

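One-Sentence Summary

This paper evaluates ChatGPT and other large language models on 11 commonsense question answering datasets, testing whether they can answer the questions, recognize the knowledge each question requires, recall that knowledge accurately, and use it in reasoning; the results show that ChatGPT is knowledgeable but often an inexperienced problem solver that struggles to use its knowledge selectively.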

ABSTRACT

Large language models (LLMs) have made significant progress in NLP. However, their ability to memorize, represent, and leverage commonsense knowledge has been a well-known pain point. In this paper, we specifically focus on ChatGPT, a widely used and easily accessible LLM, and ask the following questions: (1) Can ChatGPT effectively answer commonsense questions? (2) Is ChatGPT aware of the underlying commonsense knowledge for answering a specific question? (3) Is ChatGPT knowledgeable in commonsense? (4) Can ChatGPT effectively leverage commonsense for answering questions? We conduct a series of experiments on 11 datasets to evaluate ChatGPT's commonsense abilities, including answering commonsense questions, identifying necessary knowledge, generating knowledge descriptions, and using knowledge descriptions to answer questions again. Experimental results show that: (1) ChatGPT can achieve good QA accuracies in commonsense tasks, while still struggling with certain domains of datasets. (2) ChatGPT is knowledgeable, and can accurately generate most of the commonsense knowledge using knowledge prompts. (3) Despite its knowledge, ChatGPT is an inexperienced commonsense problem solver, which cannot precisely identify the needed commonsense for answering a specific question. These findings raise the need to explore improved mechanisms for effectively incorporating commonsense into LLMs like ChatGPT, such as better instruction following and commonsense guidance.

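Motivation and Objectives

Assess whether GPTs can accurately answer commonsense questions across diverse domains.
Determine whether GPTs are aware of, and can list, the knowledge needed to answer a question.
Assess whether GPTs can recall and describe the commonsense knowledge needed to answer a question.
Investigate whether GPTs can use generated knowledge in context to improve their reasoning.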

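Proposed Method

Use 11 commonsense question answering datasets covering the general, physical, social, science, event, numerical, prototypical, and temporal domains.
Compare GPT-3 (davinci), GPT-3.5 (text-davinci-003), and ChatGPT; GPT-3 is prompted with 4-shot examples, while GPT-3.5 and ChatGPT are prompted zero-shot.
Measure question answering accuracy on each dataset.
Ask the model to describe the knowledge needed to answer each question, and evaluate the precision and recall of those descriptions.
Ask ChatGPT to answer the questions again with the generated knowledge supplied as context, to test how well it uses that knowledge.
Analyze the correlation between knowledge accuracy and answer accuracy.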
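The prompting setup described above (4-shot for GPT-3, zero-shot for GPT-3.5 and ChatGPT, and a second pass with generated knowledge as context) can be illustrated with a minimal sketch. The prompt wording, the function names `format_question`, `build_prompt`, and `accuracy`, and the example question are all hypothetical and not taken from the paper; the sketch also assumes a multiple-choice format, which may not fit every one of the 11 datasets.

```python
from typing import Optional, Sequence, Tuple

# A demonstration is (question, choices, gold answer letter). Hypothetical format.
Demo = Tuple[str, Sequence[str], str]


def format_question(question: str, choices: Sequence[str]) -> str:
    """Render a multiple-choice question with lettered options."""
    letters = "ABCDEFGH"
    lines = [f"Question: {question}"]
    lines += [f"{letters[i]}. {choice}" for i, choice in enumerate(choices)]
    return "\n".join(lines)


def build_prompt(
    question: str,
    choices: Sequence[str],
    demos: Sequence[Demo] = (),
    knowledge: Optional[str] = None,
) -> str:
    """Build a prompt: zero-shot if `demos` is empty, k-shot otherwise.

    If `knowledge` is given, it is placed before the target question,
    mimicking the "answer again with generated knowledge" step.
    """
    parts = []
    for demo_question, demo_choices, demo_answer in demos:
        parts.append(format_question(demo_question, demo_choices) + f"\nAnswer: {demo_answer}")
    target = format_question(question, choices)
    if knowledge:
        target = f"Knowledge: {knowledge}\n{target}"
    parts.append(target + "\nAnswer:")
    return "\n\n".join(parts)


def accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the gold answers."""
    if len(predictions) != len(gold) or not gold:
        raise ValueError("predictions and gold must be non-empty and equal in length")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


if __name__ == "__main__":
    # Made-up example question, not drawn from any of the 11 datasets.
    q = "Where would you most likely keep milk cold?"
    opts = ["Oven", "Refrigerator", "Bookshelf"]
    print(build_prompt(q, opts))  # zero-shot
    print(build_prompt(q, opts, knowledge="Refrigerators keep food cold."))
```

Passing four demonstrations through `demos` gives the 4-shot variant; leaving it empty gives the zero-shot variant, so the same function covers both conditions in this sketch.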

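Experimental Results

Research Questions

RQ1: Can GPTs effectively answer commonsense questions across diverse domains?
RQ2: Do GPTs possess commonsense knowledge, and can they generate relevant knowledge from prompts?
RQ3: Are GPTs aware of the underlying knowledge required to answer a specific question?
RQ4: Can GPTs use commonsense knowledge in context to improve their answers?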

Key Findings

Dataset	Domain	GPT-3	GPT-3.5	ChatGPT
CommonsenseQA	General	38	81	74
OpenBookQA	General	22	65	73
WSC	General	46	78	78
PIQA	Physical	48	77	78
Social IQA	Social	36	71	62
ARC	Science	27	88	94
QASC	Science	25	75	74
HellaSWAG	Event	19	61	67
NumerSense	Numerical	45	63	79
ProtoQA	Prototypical	67.3	84.6	94.2
MC-TACO	Temporal	20	53	52
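To make the table easier to compare, the short sketch below computes an unweighted mean score per model and lists the datasets where ChatGPT scores lowest or falls below GPT-3.5. The scores are copied from the table; the unweighted mean is an illustrative summary of this table only, not a figure reported in the paper, and it mixes datasets that may be scored with different metrics.

```python
# Scores copied from the table above: (dataset, domain, GPT-3, GPT-3.5, ChatGPT).
RESULTS = [
    ("CommonsenseQA", "General", 38, 81, 74),
    ("OpenBookQA", "General", 22, 65, 73),
    ("WSC", "General", 46, 78, 78),
    ("PIQA", "Physical", 48, 77, 78),
    ("Social IQA", "Social", 36, 71, 62),
    ("ARC", "Science", 27, 88, 94),
    ("QASC", "Science", 25, 75, 74),
    ("HellaSWAG", "Event", 19, 61, 67),
    ("NumerSense", "Numerical", 45, 63, 79),
    ("ProtoQA", "Prototypical", 67.3, 84.6, 94.2),
    ("MC-TACO", "Temporal", 20, 53, 52),
]

MODELS = ("GPT-3", "GPT-3.5", "ChatGPT")


def mean_scores(rows):
    """Unweighted mean score per model across all datasets."""
    return {
        model: sum(row[2 + i] for row in rows) / len(rows)
        for i, model in enumerate(MODELS)
    }


def chatgpt_below_gpt35(rows):
    """Datasets where ChatGPT scores strictly lower than GPT-3.5."""
    return [name for name, _domain, _gpt3, gpt35, chatgpt in rows if chatgpt < gpt35]


def lowest_chatgpt(rows, k=3):
    """The k datasets (with domain) on which ChatGPT scores lowest."""
    ranked = sorted(rows, key=lambda row: row[4])
    return [(name, domain) for name, domain, *_scores in ranked[:k]]


if __name__ == "__main__":
    for model, score in mean_scores(RESULTS).items():
        print(f"{model}: {score:.1f}")
    print("ChatGPT below GPT-3.5 on:", chatgpt_below_gpt35(RESULTS))
    print("Lowest ChatGPT scores:", lowest_chatgpt(RESULTS))
```

On this table, ChatGPT's three lowest scores fall in the temporal, social, and event domains, which is consistent with the domains the findings single out as difficult.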

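GPTs achieve good QA accuracy on commonsense tasks but still struggle with certain knowledge types, especially the social, event, and temporal domains.
ChatGPT is knowledgeable and can accurately generate most commonsense knowledge when prompted.
ChatGPT is an inexperienced commonsense problem solver: it cannot precisely identify the specific knowledge a given question requires.
GPTs have limited ability to use generated knowledge in context to improve their answers; supplying the generated knowledge descriptions yields mixed gains or no significant improvement.
The quality of the generated necessary knowledge (knowledge F1) correlates strongly with overall answer accuracy (Pearson 0.77).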
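For readers unfamiliar with the two quantities in the last finding, the sketch below shows how an F1 score is derived from precision and recall and how a Pearson correlation coefficient is computed. The input numbers are made-up placeholders chosen only to exercise the functions; they are not the paper's data and are not intended to reproduce the reported 0.77. Whether the paper correlates per dataset or per question is not stated in this review, so the pairing used here is an assumption.

```python
import math
from typing import Sequence


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Placeholder (precision, recall) pairs and accuracies; NOT from the paper.
    toy_precision_recall = [(0.9, 0.8), (0.6, 0.5), (0.4, 0.7), (0.8, 0.9)]
    toy_accuracy = [0.85, 0.55, 0.60, 0.80]

    toy_f1 = [f1_score(p, r) for p, r in toy_precision_recall]
    print("knowledge F1:", [round(f, 3) for f in toy_f1])
    print("Pearson r:", round(pearson(toy_f1, toy_accuracy), 3))
```

A coefficient near 1 means that higher knowledge F1 tends to accompany higher answer accuracy; it does not by itself show that better knowledge causes better answers.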

This review was generated by AI and checked by a human editor.