QUICK REVIEW

[Paper Review] Learning to Retrieve In-Context Examples for Large Language Models

Liang Wang, Nan Yang | arXiv (Cornell University) | Jul 14, 2023

Topic Modeling | Citations: 13

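One-Line Summary

This paper proposes LLM-R, an iterative framework that trains dense retrievers by distilling from an LLM-informed reward model, so that the retrievers select high-quality in-context examples for large language models (LLMs).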

ABSTRACT

Large language models (LLMs) have demonstrated their ability to learn in-context, allowing them to perform various tasks based on a few input-output examples. However, the effectiveness of in-context learning is heavily reliant on the quality of the selected examples. In this paper, we propose a novel framework to iteratively train dense retrievers that can identify high-quality in-context examples for LLMs. Our framework initially trains a reward model based on LLM feedback to evaluate the quality of candidate examples, followed by knowledge distillation to train a bi-encoder based dense retriever. Our experiments on a suite of 30 tasks demonstrate that our framework significantly enhances in-context learning performance. Furthermore, we show the generalization ability of our framework to unseen tasks during training. An in-depth analysis reveals that our model improves performance by retrieving examples with similar patterns, and the gains are consistent across LLMs of varying sizes. The code and data are available at https://github.com/microsoft/LMOps/tree/main/llm_retriever .

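Motivation and Objectives

Motivates and analyzes the sensitivity of LLMs to in-context examples and the resulting need for quality-aware retrieval.
Proposes an iterative framework (LLM-R) that trains a dense retriever using LLM feedback.
Shows that LLM-R improves in-context learning across diverse tasks and LLM sizes.
Shows that retrieved examples tend to share the same input patterns or labels as the test case, and that the approach generalizes to unseen tasks.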

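Proposed Method

Retrieve initial candidates with BM25 from a pool that mixes tasks.
Rank the candidates by the LLM's log-likelihood of the ground-truth output, given the query input and the candidate example.
Train a cross-encoder reward model that scores candidate quality, using the ground-truth label and hard negatives.
Train a bi-encoder dense retriever to imitate the reward model through knowledge distillation with a KL-divergence loss, combined with an InfoNCE contrastive loss over hard negatives.
Iteratively retrain the dense retriever on newly retrieved positives and negatives to improve quality.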
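
The ranking and distillation steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation. The function names, the KL direction (reward model as target distribution), the weight `alpha`, the `temperature`, and the convention that the positive candidate sits at index 0 are all hypothetical choices made for this example. Scores are passed in as plain lists of floats, so no LLM or encoder is needed to run it.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of floats."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def rank_by_log_likelihood(candidates, log_likelihoods):
    """Order candidates from highest to lowest LLM log-likelihood
    of the ground-truth output (the scores are supplied by the caller)."""
    order = sorted(range(len(candidates)),
                   key=lambda i: log_likelihoods[i], reverse=True)
    return [candidates[i] for i in order]


def kl_distillation_loss(reward_scores, retriever_scores):
    """KL(p_reward || p_retriever) over one query's candidate list.
    Assumption: the reward model is the target distribution."""
    log_p = log_softmax(reward_scores)
    log_q = log_softmax(retriever_scores)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def info_nce_loss(retriever_scores, positive_index=0, temperature=1.0):
    """InfoNCE: negative log-probability of the positive candidate
    among the positive and its hard negatives."""
    scaled = [s / temperature for s in retriever_scores]
    return -log_softmax(scaled)[positive_index]


def combined_loss(reward_scores, retriever_scores,
                  alpha=0.2, positive_index=0, temperature=1.0):
    """Distillation loss plus a weighted contrastive term.
    alpha is an illustrative weight, not a value taken from this review."""
    return (kl_distillation_loss(reward_scores, retriever_scores)
            + alpha * info_nce_loss(retriever_scores,
                                    positive_index, temperature))
```

When the retriever's scores match the reward model's scores exactly, the KL term is zero and only the contrastive term remains, which keeps pushing the positive above the hard negatives. In an iterative setup, the retriever trained this way would supply the candidates for the next round of ranking and reward modeling.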

Figure 1: The overall architecture of our proposed framework LLM-R. The training process comprises three stages: generating training data based on an initial retriever and LLM feedback, reward modeling, and training dense retrievers by distilling the knowledge from the reward model. At inference tim

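Experimental Results

Research Questions

RQ1: Can a dense retriever trained under LLM feedback outperform heuristic baselines at selecting in-context examples?
RQ2: Does iterative retraining under reward-model supervision improve ICL across multiple tasks and LLM sizes?
RQ3: How well does the approach generalize to unseen tasks and to different LLMs?
RQ4: Which factors (task type, data patterns, task difficulty) influence the effectiveness of the retrieved in-context examples?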

Key Findings

Method	CQA	Comm.	Coref.	NLI	Para.	RC	Sent.	D2T	Summ.	Avg
LLM-R (1 iter)	48.8	80.1	67.6	71.9	66.5	60.0	93.5	50.1	50.8	65.7
LLM-R (2 iter)	48.7	80.4	70.4	72.5	71.5	59.0	93.6	49.9	51.1	66.5
LLM-R (3 iter)	48.9	80.0	70.8	72.6	72.8	58.0	92.9	49.8	50.8	66.4

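LLM-R consistently outperforms the baselines (random, k-means, BM25, E5, SBERT, EPR) across the 30 tasks, averaging 65.7 after one iteration and 66.5 after two.
Iterative training shows diminishing returns beyond two iterations, which suggests convergence.
Distillation from the reward model contributes substantially to performance compared with the variant trained without a reward model.
LLM-R generalizes to held-out tasks and to other LLMs (GPT-Neo-2.7B, LLaMA-13B, GPT-3.5-turbo), with especially notable gains for smaller LLMs.
The top retrieved examples tend to share the test case's input pattern or label, and absolute gains are smaller on knowledge-intensive tasks.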

Figure 2: The collection of datasets used in our experiments. The yellow-colored datasets are held out and excluded from training. For further information, please refer to Table 8 in the Appendix.

This review was generated by AI and checked by a human editor.