QUICK REVIEW

[Paper Review] Investigating the Effectiveness of Task-Agnostic Prefix Prompt for Instruction Following

Seonghyeon Ye, Hyeonbin Hwang|arXiv (Cornell University)|Feb 28, 2023

Topic Modeling | Citations: 11

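One-line summary

This paper shows that prepending a fixed Task-Agnostic Prefix Prompt (TAPP) to the input at inference time improves instruction following in both base and instruction-tuned LLMs, with average improvements on the evaluation tasks of 34.58% for base models and 12.26% for instruction-tuned models. The effect is orthogonal to fine-tuning and persists under input corruption.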

ABSTRACT

In this paper, we present our finding that prepending a Task-Agnostic Prefix Prompt (TAPP) to the input improves the instruction-following ability of various Large Language Models (LLMs) during inference. TAPP is different from canonical prompts for LLMs in that it is a fixed prompt prepended to the beginning of every input regardless of the target task for zero-shot generalization. We observe that both base LLMs (i.e. not fine-tuned to follow instructions) and instruction-tuned models benefit from TAPP, resulting in 34.58% and 12.26% improvement on average, respectively. This implies that the instruction-following ability of LLMs can be improved during inference time with a fixed prompt constructed with simple heuristics. We hypothesize that TAPP assists language models to better estimate the output distribution by focusing more on the instruction of the target task during inference. In other words, such ability does not seem to be sufficiently activated in not only base LLMs but also many instruction-fine-tuned LLMs. All experiments are reproducible from https://github.com/seonghyeonye/TAPP.

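Motivation and objectives

Demonstrate that a fixed, task-agnostic prefix improves instruction following at inference time across a diverse set of LLMs.
Show that TAPP benefits both base and instruction-tuned models and is complementary to instruction fine-tuning.
Analyze the construction rules for TAPP demonstrations and their effect on zero-shot and few-shot generalization.
Investigate TAPP's robustness to corruption of the input distribution and its relationship to task-category signals.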

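Proposed method

Construct a fixed TAPP prefix from cross-task demonstrations (instruction, input, output) using simple heuristics.
Prepend M (the TAPP demonstrations) to the target task's instruction and input, and compute the output as y_ti = arg max_y P(y | M, I_t, x_ti; θ).
Evaluate on 119 held-out SuperNI tasks spanning 12 categories, using diverse LLMs (GPT-3, OPT, GPT-NeoX, GPT-J).
Compare TAPP with task-specific prefixes (Nearest PP, Category PP, Output PP) and analyze few-shot ICL combined with TAPP.
Test robustness to corruption of the input distribution by replacing the demonstration inputs with random sentences.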
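The prefixing and input-corruption steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Demo` class, the function names, and the "Definition / Input / Output" field labels are all assumptions made for this example, and the actual template is in the paper's repository.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Demo:
    """One cross-task demonstration (hypothetical container for this sketch)."""
    instruction: str
    input_text: str
    output: str


def format_block(instruction: str, input_text: str, output: str = "") -> str:
    # Assumed template; the field labels are illustrative, not the paper's exact format.
    return f"Definition: {instruction}\nInput: {input_text}\nOutput: {output}".rstrip()


def build_tapp_prompt(demos, target_instruction, target_input):
    """Prepend the fixed demonstration set M to the target instruction I_t and input x_ti."""
    prefix = "\n\n".join(
        format_block(d.instruction, d.input_text, d.output) for d in demos
    )
    # The target block ends with an empty "Output:" slot for the model to complete.
    target = format_block(target_instruction, target_input)
    return f"{prefix}\n\n{target}"


def corrupt_demo_inputs(demos, random_sentences, seed=0):
    """Swap each demonstration input for a random sentence; keep instruction and output."""
    rng = random.Random(seed)
    return [
        Demo(d.instruction, rng.choice(random_sentences), d.output) for d in demos
    ]
```

Because the same `demos` list is reused for every target task, the prefix is task-agnostic by construction; the resulting string would be passed to whatever decoding routine the model exposes to approximate the arg max.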

Figure 1: Overview of Task-Agnostic Prefix Prompt ( TAPP ). We construct a fixed set of demonstrations consisting of instruction, input, and output instances to evaluate base and instruction-fine-tuned LLMs for all tasks. The task categories included in the demonstrations are strictly held-out and f

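Experimental results

Research questions

RQ1: Does a fixed, task-agnostic prefix improve instruction following at inference time for both base and instruction-tuned LLMs?
RQ2: Are the improvements from TAPP complementary to instruction fine-tuning and RLHF?
RQ3: Which aspects of the TAPP demonstrations (instruction, input, output) determine their effectiveness?
RQ4: How does TAPP compare with task-specific prefixes, and how does it behave when combined with few-shot in-context demonstrations?
RQ5: How robust is TAPP to corruption of the demonstration inputs?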

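Key findings

TAPP consistently improves base LLM performance regardless of model scale, with notable gains (e.g., OPT-13B).
With TAPP, a smaller model can outperform a much larger model that does not use it (e.g., 6B GPT-J with TAPP versus 175B GPT-3 without TAPP).
TAPP also improves instruction-tuned LLMs, with especially large effects for models over 100B parameters, and can improve top models (e.g., text-davinci-003) by up to 9.3%.
The gains from TAPP are orthogonal to instruction fine-tuning; TAPP also works when combined with few-shot demonstrations, and input corruption has only a limited effect on performance.
Demonstrations built from classification tasks with explicit answer choices are especially effective, and avoiding overlap of answer choices across demonstrations improves performance on generation tasks.
TAPP demonstrations generated by ChatGPT yield gains comparable to benchmark-derived demonstrations, indicating robustness to the demonstration source.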
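The demonstration-selection heuristics mentioned in the findings (prefer classification tasks with explicit answer choices, avoid overlapping answer choices across demonstrations) could be approximated as below. This is a hedged sketch: the `Candidate` class, `select_demos`, and the greedy strategy are assumptions for illustration, and the paper's actual procedure may differ.

```python
import random
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Candidate:
    """A candidate demonstration task (hypothetical structure for this sketch)."""
    task_name: str
    answer_choices: Optional[Tuple[str, ...]]  # None for generation-style tasks


def select_demos(candidates, k, seed=0):
    """Greedily pick up to k classification demos whose answer choices do not overlap."""
    rng = random.Random(seed)
    # Keep only tasks that expose explicit answer choices.
    pool = [c for c in candidates if c.answer_choices]
    rng.shuffle(pool)

    chosen, used_choices = [], set()
    for cand in pool:
        choices = {choice.lower() for choice in cand.answer_choices}
        # Skip a candidate if any of its answer choices was already used.
        if used_choices.isdisjoint(choices):
            chosen.append(cand)
            used_choices |= choices
        if len(chosen) == k:
            break
    return chosen
```

Varying `seed` yields different demonstration sets, which loosely mirrors the review's note that results are averaged over three random seeds for different demonstration sets.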

Figure 2: Average performance of 119 evaluation tasks on SuperNI benchmark. TAPP is effective for both base and instruction-fine-tuned LLMs. We report the mean score of three random seeds for different demonstration sets for TAPP and the error bars of standard deviation. We also perform an evaluatio

This review was generated by AI and checked by a human editor.