QUICK REVIEW

[Paper Review] Dr ChatGPT, tell me what I want to hear: How prompt knowledge impacts health answer correctness

Guido Zuccon, Bevan Koopman | arXiv (Cornell University) | Feb 23, 2023

Topic Modeling | Citations: 33

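One-line summary

This paper compares ChatGPT's answers to health questions when it relies on model knowledge alone versus when it uses evidence supplied in the prompt. It shows that prompt knowledge can overturn model knowledge, lowering accuracy from 80% to 63%.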

ABSTRACT

Generative pre-trained language models (GPLMs) like ChatGPT encode in the model's parameters knowledge the models observe during the pre-training phase. This knowledge is then used at inference to address the task specified by the user in their prompt. For example, for the question-answering task, the GPLMs leverage the knowledge and linguistic patterns learned at training to produce an answer to a user question. Aside from the knowledge encoded in the model itself, answers produced by GPLMs can also leverage knowledge provided in the prompts. For example, a GPLM can be integrated into a retrieve-then-generate paradigm where a search engine is used to retrieve documents relevant to the question; the content of the documents is then transferred to the GPLM via the prompt. In this paper we study the differences in answer correctness generated by ChatGPT when leveraging the model's knowledge alone vs. in combination with the prompt knowledge. We study this in the context of consumers seeking health advice from the model. Aside from measuring the effectiveness of ChatGPT in this context, we show that the knowledge passed in the prompt can overturn the knowledge encoded in the model and this is, in our experiments, to the detriment of answer correctness. This work has important implications for the development of more robust and transparent question-answering systems based on generative pre-trained language models.

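Motivation and objectives

Evaluate ChatGPT's effectiveness on complex health information questions when it is given only the question and must rely on model knowledge alone (question-only).
Evaluate how prompts containing supporting or contrary evidence affect answer correctness (evidence-biased).
Clarify how knowledge embedded in the prompt affects the reliability of health information and the potential risk of misinformation.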

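Proposed method

Use 100 topics from the TREC Health Misinformation track to test general effectiveness (RQ1).
Compare question-only prompts against evidence-biased prompts that include up to 3 supporting and up to 3 contrary passages per topic (RQ2).
Annotate ChatGPT's answers, each consisting of a Yes/No plus an explanation, and evaluate them against the ground truth.
Analyze how often evidence-biased prompts overturn the correct answer, and whether such reversals improve or worsen accuracy.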
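The two prompt conditions can be illustrated with a minimal Python sketch. The prompt wording, the `Topic` class, and the choice of one passage per prompt are assumptions made for illustration; the paper's exact templates are not reproduced in this review.

```python
from dataclasses import dataclass, field

# The review states at most 3 supporting and 3 contrary passages per topic.
MAX_PASSAGES = 3


@dataclass
class Topic:
    """Hypothetical container for one health question and its evidence."""
    question: str
    supporting: list[str] = field(default_factory=list)
    contrary: list[str] = field(default_factory=list)


def question_only_prompt(topic: Topic) -> str:
    # Hypothetical wording: the model sees only the question.
    return f"{topic.question}\nAnswer Yes or No, then explain your answer."


def evidence_biased_prompts(topic: Topic) -> list[tuple[str, str]]:
    """Return (stance, prompt) pairs, one per evidence passage (an assumption)."""
    prompts = []
    for stance, passages in (("supporting", topic.supporting),
                             ("contrary", topic.contrary)):
        # Cap each stance at MAX_PASSAGES passages.
        for passage in passages[:MAX_PASSAGES]:
            prompt = (
                f"{topic.question}\n"
                f"A web search returned the following passage: {passage}\n"
                "Answer Yes or No, then explain your answer."
            )
            prompts.append((stance, prompt))
    return prompts
```

Keeping the stance label next to each prompt makes it possible to break down results later by whether the supplied passage supported or contradicted the claim in the question.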

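Experimental results

Research questions

RQ1 General effectiveness: How effectively can ChatGPT answer complex health information questions?
RQ2 Evidence-biased effectiveness: How do prompts containing supporting or contrary evidence affect answer correctness?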

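Key findings

ChatGPT achieves 80% accuracy when answering health questions using only the knowledge encoded in the model.
With evidence-biased prompting, overall accuracy drops to 63%.
Evidence supplied in the prompt can overturn the model's answer, and it often leads to an incorrect result when the evidence is contrary.
Answer reversals caused by evidence-biased prompts are incorrect in most cases.
The explanations accompanying answers often discuss limited or conflicting evidence, and sometimes offer general medical advice without verifiable sources.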
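The accuracy and reversal figures above come from comparing each answer with its ground truth. A minimal Python sketch of that kind of tally follows, assuming each annotated result is reduced to three labels. The `Record` class and the toy records are made up for illustration and are not data from the paper.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """Hypothetical annotated result for one question."""
    truth: str            # ground-truth label, "yes" or "no"
    question_only: str    # answer from the question-only prompt
    evidence_biased: str  # answer from an evidence-biased prompt


def accuracy(records, attr):
    """Fraction of records whose answer in `attr` matches the ground truth."""
    return sum(getattr(r, attr) == r.truth for r in records) / len(records)


def flip_counts(records):
    """Count answers that changed between the two prompt conditions."""
    flipped = [r for r in records if r.question_only != r.evidence_biased]
    # A flip away from a correct question-only answer ends up wrong.
    to_wrong = sum(r.question_only == r.truth for r in flipped)
    # A flip that lands on the ground truth ends up right.
    to_right = sum(r.evidence_biased == r.truth for r in flipped)
    return {"flips": len(flipped), "to_wrong": to_wrong, "to_right": to_right}


# Toy records, invented purely to exercise the functions.
TOY_RECORDS = [
    Record("yes", "yes", "no"),
    Record("no", "no", "no"),
    Record("yes", "no", "yes"),
    Record("no", "no", "yes"),
]
```

Calling `accuracy(TOY_RECORDS, "question_only")` and `accuracy(TOY_RECORDS, "evidence_biased")` gives the two headline numbers for the toy set, and `flip_counts(TOY_RECORDS)` separates harmful reversals from helpful ones.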

This review was generated by AI and checked by a human editor.