QUICK REVIEW

[Paper Review] When "A Helpful Assistant" Is Not Really Helpful: Personas in System Prompts Do Not Improve Performances of Large Language Models

Mingqian Zheng, Jiaxin Pei|arXiv (Cornell University)|Nov 16, 2023

Topic Modeling|Cited by 13

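One-Line Summary

This study incorporates 162 social roles into system prompts and evaluates three open-source LLMs on 2,457 MMLU questions, systematically showing that interpersonal and gender-neutral roles often improve performance, while predicting the best role remains difficult.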

ABSTRACT

Prompting serves as the major way humans interact with Large Language Models (LLM). Commercial AI systems commonly define the role of the LLM in system prompts. For example, ChatGPT uses "You are a helpful assistant" as part of its default system prompt. Despite current practices of adding personas to system prompts, it remains unclear how different personas affect a model's performance on objective tasks. In this study, we present a systematic evaluation of personas in system prompts. We curate a list of 162 roles covering 6 types of interpersonal relationships and 8 domains of expertise. Through extensive analysis of 4 popular families of LLMs and 2,410 factual questions, we demonstrate that adding personas in system prompts does not improve model performance across a range of questions compared to the control setting where no persona is added. Nevertheless, further analysis suggests that the gender, type, and domain of the persona can all influence the resulting prediction accuracies. We further experimented with a list of persona search strategies and found that, while aggregating results from the best persona for each question significantly improves prediction accuracy, automatically identifying the best persona is challenging, with predictions often performing no better than random selection. Overall, our findings suggest that while adding a persona may lead to performance gains in certain settings, the effect of each persona can be largely random. Code and data are available at https://github.com/Jiaxin-Pei/Prompting-with-Social-Roles.

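Motivation and Objectives

Evaluate whether adding social roles to system prompts affects LLM performance across multiple models and tasks.
Identify which categories and specific roles yield the largest performance gains.
Investigate the factors that may explain role effects (domain, gender, similarity, perplexity).
Examine strategies for automatically selecting or predicting effective roles for prompting.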

Proposed Method

Curate 162 social roles spanning six types of interpersonal relationships and eight occupational domains.
Evaluate three open-source instruction-tuned LLMs (FLAN-T5-XXL, LLaMA-2-7b-chat, OPT-iml-max-1.3B) on 2,457 MMLU questions.
Design six prompt templates covering Role, Audience, and Interpersonal prompts and their variations, with and without role discretization. Include paraphrases, such as an "Imagine" variant, as robustness checks.
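Analyze role effects by category, by in-domain versus out-of-domain mapping, and by the gender of the role.
Compute correlations between role effects and word frequency, prompt-question similarity, and prompt perplexity as potential mechanisms.
Test several automatic role-search strategies (best-role, similarity-based, domain classifier, role classifier, and self-pick) to find the best prompting role.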
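The template and evaluation steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: only the Role wording ("You are a lawyer.") appears in this review, so the Audience and Interpersonal wordings, the template keys, and the helper names build_prompt and accuracy_by_role are assumptions made for the example.

```python
from collections import defaultdict

# Assumed template wordings. Only the "role" form ("You are a lawyer.")
# is attested in this review; the other two are illustrative guesses.
TEMPLATES = {
    "role": "You are {article} {role}.",
    "audience": "You are talking to {article} {role}.",
    "interpersonal": "You are talking to your {role}.",
}


def article_for(role):
    """Pick 'a' or 'an' with a simple first-letter heuristic."""
    return "an" if role[0].lower() in "aeiou" else "a"


def build_prompt(template, role, question):
    """Prepend the role sentence for the chosen template to a question."""
    system = TEMPLATES[template].format(article=article_for(role), role=role)
    return f"{system}\n{question}"


def accuracy_by_role(records):
    """Aggregate (role, is_correct) pairs into per-role accuracy."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for role, is_correct in records:
        totals[role] += 1
        hits[role] += int(is_correct)
    return {role: hits[role] / totals[role] for role in totals}
```

In a setup like this, each of the 162 roles would be paired with every question and template, the model's answer scored against the MMLU key, and the resulting (role, is_correct) pairs passed to accuracy_by_role for comparison against a no-role control.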

Figure 1: Our overall research question: does adding social roles in prompts affect LLMs’ performance?

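Experimental Results

Research Questions

RQ1 Does adding social roles to prompts affect LLM performance across different models and questions?
RQ2 Which types of roles or specific roles yield higher performance, and how consistent are these effects across models and datasets?
RQ3 What mechanisms (domain alignment, gender, similarity, perplexity) explain the effect of social roles on performance?
RQ4 Can automatic strategies effectively identify the best role?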

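Key Findings

Prompting with social roles substantially improves performance compared with the control prompt.
Interpersonal and gender-neutral roles tend to show higher performance across models and datasets.
In-domain versus out-of-domain role alignment offers no universal advantage; the effect varies by dataset and question.
Audience prompts (those that specify the audience) generally outperform Role prompts and Interpersonal prompts.
Role effects show weak to moderate correlations with word frequency, prompt-question similarity, and prompt perplexity. No single factor explains all of the gains.
Automatic role-search strategies outperform the baseline and can approach best-role performance, but reliably predicting the best role for each question remains difficult.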
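The gap between a best-role upper bound and a practical selection strategy can be illustrated with a small sketch. It assumes a hypothetical dictionary mapping each role to a list of 0/1 correctness flags, one per question; the function names and the toy data are invented for the example and are not results from the paper.

```python
import random


def per_role_accuracy(correct):
    """Accuracy of each role when it is used for every question."""
    return {role: sum(flags) / len(flags) for role, flags in correct.items()}


def oracle_accuracy(correct):
    """Best-role upper bound: a question counts as solved if any role solves it."""
    n_questions = len(next(iter(correct.values())))
    solved = sum(
        any(flags[i] for flags in correct.values()) for i in range(n_questions)
    )
    return solved / n_questions


def random_role_accuracy(correct, seed=0):
    """Baseline: pick one role uniformly at random for each question."""
    rng = random.Random(seed)
    roles = sorted(correct)
    n_questions = len(correct[roles[0]])
    hits = sum(correct[rng.choice(roles)][i] for i in range(n_questions))
    return hits / n_questions


# Toy correctness flags (made up for illustration only).
toy = {
    "lawyer": [1, 0, 0, 1],
    "friend": [0, 1, 0, 1],
    "control": [1, 0, 0, 0],
}
print(per_role_accuracy(toy))
print(oracle_accuracy(toy))
print(random_role_accuracy(toy))
```

The oracle can never score below any single role or the random baseline, because it takes the per-question maximum. An automatic strategy is useful only to the extent that it closes the distance between the random baseline and that upper bound.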

Figure 2: Overall model performance when being prompted with different social roles (e.g. “You are a lawyer.”) for FLAN-T5-XXL and LLAMA2-7B-Chat. Tested on 2457 MMLU questions. Best-performing roles are highlighted in red. We also highlight “helpful assistant” as it is commonly used in commercial A

This review was generated by AI and checked by a human editor.