QUICK REVIEW

[Paper Review] Dr ChatGPT, tell me what I want to hear: How prompt knowledge impacts health answer correctness

Guido Zuccon, Bevan Koopman | arXiv (Cornell University) | Feb 23, 2023

Topic Modeling | Cited by 33

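One-Sentence Summary

This paper compares ChatGPT's answers to health questions when it relies on the model's knowledge alone versus when the prompt also supplies evidence. The results show that prompt knowledge can overturn model knowledge, lowering accuracy from 80% to 63%.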

ABSTRACT

Generative pre-trained language models (GPLMs) like ChatGPT encode in the model's parameters knowledge the models observe during the pre-training phase. This knowledge is then used at inference to address the task specified by the user in their prompt. For example, for the question-answering task, the GPLMs leverage the knowledge and linguistic patterns learned at training to produce an answer to a user question. Aside from the knowledge encoded in the model itself, answers produced by GPLMs can also leverage knowledge provided in the prompts. For example, a GPLM can be integrated into a retrieve-then-generate paradigm where a search engine is used to retrieve documents relevant to the question; the content of the documents is then transferred to the GPLM via the prompt. In this paper we study the differences in answer correctness generated by ChatGPT when leveraging the model's knowledge alone vs. in combination with the prompt knowledge. We study this in the context of consumers seeking health advice from the model. Aside from measuring the effectiveness of ChatGPT in this context, we show that the knowledge passed in the prompt can overturn the knowledge encoded in the model and this is, in our experiments, to the detriment of answer correctness. This work has important implications for the development of more robust and transparent question-answering systems based on generative pre-trained language models.

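Motivation and Objectives

Evaluate how effectively ChatGPT answers complex health information questions when it relies on model knowledge alone (question-only prompts)
Evaluate how prompting with supporting or contrary evidence affects answer correctness (evidence-biased prompts)
Determine how knowledge embedded in the prompt affects the reliability of health information and the potential risk of misinformation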

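Proposed Method

Use 100 topics from the TREC Health Misinformation track to test general effectiveness (RQ1)
Compare question-only prompts with evidence-biased prompts that include up to 3 supporting and 3 contrary documents per topic (RQ2)
Label ChatGPT's responses (a Yes/No answer plus an explanation) and evaluate them against the ground-truth answers
Analyze how often evidence-biased prompts flip the answer, and whether a flip improves or reduces accuracy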
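
The two prompting conditions above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the prompt wording, the `Topic` class, and the helper names (`question_only_prompt`, `evidence_biased_prompt`, `build_evidence_prompts`, `parse_answer`) are all assumptions made for this example, and the sketch assumes each evidence-biased prompt carries a single passage. No call to ChatGPT is made; the code only builds prompts and parses a Yes/No label from a response string.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

MAX_DOCS_PER_STANCE = 3  # the summary states "up to 3" documents per stance


@dataclass
class Topic:
    question: str  # health question posed to the model
    truth: str     # ground-truth answer, "yes" or "no"


def question_only_prompt(topic: Topic) -> str:
    """Question-only condition: the model must rely on its own knowledge."""
    # Hypothetical wording; the paper's exact prompt may differ.
    return f"{topic.question}\nAnswer Yes or No, then explain your answer."


def evidence_biased_prompt(topic: Topic, passage: str) -> str:
    """Evidence-biased condition: one passage is passed in the prompt."""
    # Hypothetical wording; the paper's exact prompt may differ.
    return (
        f"{topic.question}\n"
        f'Consider the following evidence in your answer: "{passage}"\n'
        "Answer Yes or No, then explain your answer."
    )


def build_evidence_prompts(
    topic: Topic, supporting: List[str], contrary: List[str]
) -> List[Tuple[str, str]]:
    """Return (stance, prompt) pairs, capped at MAX_DOCS_PER_STANCE per stance."""
    prompts = []
    for stance, docs in (("supporting", supporting), ("contrary", contrary)):
        for passage in docs[:MAX_DOCS_PER_STANCE]:
            prompts.append((stance, evidence_biased_prompt(topic, passage)))
    return prompts


def parse_answer(response: str) -> Optional[str]:
    """Read the leading Yes/No label; return None if the response has neither."""
    words = response.strip().lower().split()
    if not words:
        return None
    first = words[0].strip(".,:;!<>")
    return first if first in ("yes", "no") else None


# Hypothetical example topic, not taken from the TREC collection.
topic = Topic(question="Does vitamin C cure the common cold?", truth="no")
pairs = build_evidence_prompts(topic, ["passage A", "passage B"], ["passage C"])
```

Responses for which `parse_answer` returns `None` would need manual labeling, which matches the summary's note that the answers were labeled before evaluation.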

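Experimental Results

Research Questions

RQ1 General effectiveness: How effective is ChatGPT at answering complex health information questions?
RQ2 Evidence-biased effectiveness: How does prompting with supporting or contrary evidence affect answer correctness?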

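Key Findings

When answering health questions using only the knowledge encoded in the model, ChatGPT reaches 80% accuracy.
Under evidence-biased prompting, overall accuracy drops to 63%.
Evidence supplied in the prompt can overturn the model's answer, and contrary evidence often leads to an incorrect result.
Answer flips caused by evidence-biased prompts are, in most cases, flips to an incorrect answer.
The explanations that accompany the answers frequently mention limited or conflicting evidence, and sometimes offer general medical advice rather than verifiable sources.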
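
The accuracy and flip measurements behind these findings can be illustrated with a small sketch. The function names (`accuracy`, `flip_stats`) and the toy labels are assumptions made for this example; the toy data is invented and does not reproduce the paper's results.

```python
from typing import Dict, List


def accuracy(preds: List[str], truths: List[str]) -> float:
    """Fraction of predictions that match the ground truth."""
    if not truths or len(preds) != len(truths):
        raise ValueError("preds and truths must be non-empty and equal length")
    return sum(p == t for p, t in zip(preds, truths)) / len(truths)


def flip_stats(
    baseline: List[str], biased: List[str], truths: List[str]
) -> Dict[str, int]:
    """Count answers that changed between question-only and evidence-biased prompts."""
    stats = {"flipped": 0, "correct_to_incorrect": 0, "incorrect_to_correct": 0}
    for base, bias, truth in zip(baseline, biased, truths):
        if base == bias:
            continue  # the evidence did not change the answer
        stats["flipped"] += 1
        if base == truth:
            stats["correct_to_incorrect"] += 1
        elif bias == truth:
            stats["incorrect_to_correct"] += 1
    return stats


# Toy labels for four hypothetical questions (not data from the paper).
truths = ["yes", "no", "yes", "no"]
baseline = ["yes", "no", "no", "no"]   # question-only answers
biased = ["no", "no", "yes", "yes"]    # evidence-biased answers

print(accuracy(baseline, truths))             # 0.75
print(accuracy(biased, truths))               # 0.5
print(flip_stats(baseline, biased, truths))
# {'flipped': 3, 'correct_to_incorrect': 2, 'incorrect_to_correct': 1}
```

Separating `correct_to_incorrect` from `incorrect_to_correct` is what lets the analysis say whether flips help or hurt, rather than only how often they occur.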

This review was generated by AI and reviewed by human editors.