QUICK REVIEW

[Paper Review] Internet-augmented language models through few-shot prompting for open-domain question answering

Angeliki Lazaridou, Elena Gribovskaya|arXiv (Cornell University)|Mar 10, 2022

Topic Modeling|Cited by 64

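One-line summary

This paper shows that conditioning large language models (LLMs) on retrieved web evidence via few-shot prompting improves open-domain QA performance, at times outperforming much larger closed-book models, and that spending more inference-time compute on reranking answers generated from multiple pieces of evidence yields further gains.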

ABSTRACT

In this work, we aim to capitalize on the unique few-shot capabilities of large-scale language models (LSLMs) to overcome some of their challenges with respect to grounding to factual and up-to-date information. Motivated by semi-parametric language models (LMs), which ground their decisions in external retrieved evidence, we use few-shot prompting to learn to condition LMs on information returned from the web using Google Search, a broad and constantly updated knowledge source. Our approach does not involve fine-tuning or learning additional parameters, thus making it applicable to any LM, offering therefore a strong baseline. Indeed, we find that LMs conditioned on the web surpass performance of closed-book models of similar, or even larger, model sizes in open-domain question answering. Finally, we find that increasing the inference-time compute of models, achieved via using multiple retrieved evidences to generate multiple answers followed by a reranking stage that uses scores generated by the same LMs, leads to better performance and alleviates lower performance of smaller few-shot LMs. All in all, our findings suggest that it might be beneficial to slow down the race towards the biggest model and instead shift attention towards finding more effective ways to use models, including but not limited to, better prompting or increasing inference-time compute.

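Motivation and objectives

Use few-shot prompting to ground large language models (LLMs) in up-to-date information from the web.
Show that web-conditioned LLMs outperform closed-book baselines of similar or even larger size on open-domain QA.
Examine how effective query-based web search (Google) is as a general-purpose knowledge source that goes beyond Wikipedia.
Investigate whether increasing inference-time compute, by using multiple retrieved passages and reranking, improves QA performance.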

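Proposed method

For each question, retrieve relevant web passages with Google Search (top 20 URLs).
Split the retrieved documents into six-sentence paragraphs and select excerpts by TF-IDF cosine similarity to the question to form the evidence set P.
Prompt the LLM with 15-shot in-context examples augmented with evidence paragraphs (no fine-tuning).
Generate four candidate answers per paragraph, score them with functions such as direct inference, noisy channel, and product-of-experts (PoE), and select the final answer.
Explore sampling answers from multiple paragraphs (up to n = 50 paragraphs) combined with reranking (RAG, Noisy Channel, PoE) to improve accuracy.
Evaluate across datasets (NQ, HotpotQA, Fever, StrategyQA) and model sizes (44M–280B parameters), assessing the trade-off between model scale and inference-time compute.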
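
The paragraph-selection step can be illustrated with a minimal sketch. It assumes the retrieved pages are already available as plain-text strings; the sentence splitter, the tokenizer, the IDF smoothing, and all function names (`split_into_paragraphs`, `select_evidence`, and so on) are hypothetical choices made for illustration, not the authors' implementation.

```python
import math
import re
from collections import Counter


def split_into_paragraphs(text, sentences_per_paragraph=6):
    """Chunk text into paragraphs of up to six sentences (naive splitter)."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [
        " ".join(sentences[i:i + sentences_per_paragraph])
        for i in range(0, len(sentences), sentences_per_paragraph)
    ]


def tokenize(text):
    """Lowercase alphanumeric tokens; an assumed, simplistic tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(token_lists):
    """Build sparse TF-IDF vectors (dicts) with smoothed IDF (an assumption)."""
    n_docs = len(token_lists)
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return [{t: c * idf[t] for t, c in Counter(tokens).items()} for tokens in token_lists]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_evidence(question, documents, top_k=50):
    """Return the top_k (score, paragraph) pairs ranked by similarity to the question."""
    paragraphs = [p for doc in documents for p in split_into_paragraphs(doc)]
    # The question is included in the IDF statistics here; this is a design choice.
    vectors = tfidf_vectors([tokenize(question)] + [tokenize(p) for p in paragraphs])
    q_vec, p_vecs = vectors[0], vectors[1:]
    scored = [(cosine(q_vec, v), p) for v, p in zip(p_vecs, paragraphs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

Calling `select_evidence(question, pages, top_k=50)` returns scored paragraphs that would then be placed, one at a time, into the few-shot prompt. A production system would more likely rely on a tested TF-IDF library and a proper sentence segmenter.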

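Experimental results

Research questions

RQ1: Does conditioning LLMs on web evidence via few-shot prompting improve open-domain QA performance compared with closed-book prompting?
RQ2: How does the quality of retrieval via Google Search affect single-hop versus multi-hop QA tasks?
RQ3: Can increasing inference-time compute, through sampling over multiple paragraphs and reranking, narrow the gap between smaller open-book models and larger closed-book models?
RQ4: Is it possible to keep LLMs up to date using internet search, without fine-tuning?
RQ5: What are the limitations and safety considerations of using a commercial search engine as the retrieval backbone?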

Key findings

Dataset	SOTA	CB	OB_Google_no_reranking	CB	OB_Gold	OB_Google_a|q,p	OB_Google_PoE	Retrieval performance@50
NQ	51.4 [8]	21.7	23.1	25.8	61.7	32.7	38.4	85.0
HotpotQA	65.2 [28]	20.7	24.5	21.2	54.8	26.3	30.3	55.5
Fever	73.2 [31]	44.5	52.2	44.5	66.6 a	52.0	57.2	43.3
StrategyQA	63.6 [29]	61.0	61.1	61.0	80.4	64.6	66.2	34.9
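
To make the size of the gains concrete, the short sketch below computes relative improvements over the closed-book score from the table values. It assumes that the second CB column is the baseline that pairs with the OB_Google_a|q,p and OB_Google_PoE columns; that pairing, and the helper name `relative_gain`, are assumptions made for illustration.

```python
def relative_gain(baseline, score):
    """Relative improvement of score over baseline, in percent."""
    return (score - baseline) / baseline * 100.0


# (CB, OB_Google_a|q,p, OB_Google_PoE) copied from the table above.
# Assumption: the second CB column is the matching closed-book baseline.
results = {
    "NQ": (25.8, 32.7, 38.4),
    "HotpotQA": (21.2, 26.3, 30.3),
    "Fever": (44.5, 52.0, 57.2),
    "StrategyQA": (61.0, 64.6, 66.2),
}

for dataset, (closed_book, open_book, poe) in results.items():
    print(
        f"{dataset}: open-book {relative_gain(closed_book, open_book):+.1f}%, "
        f"PoE {relative_gain(closed_book, poe):+.1f}% relative to closed-book"
    )
```

Under this pairing, the NQ open-book gain works out to a little under 27%, and the PoE-reranked gain to a little under 49%.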

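Conditioning Gopher-280B on evidence retrieved with Google consistently improves open-domain QA performance over the closed-book baseline on all four datasets.
The improvement is largest on generative tasks (a relative gain of roughly 30% on NQ) and persists for smaller models, where open-book models can outperform larger closed-book ones.
Using multiple passages with reranking (PoE, Noisy Channel, RAG) yields additional gains over conditioning on a single piece of evidence, and PoE is often the best-performing reranking strategy.
Google-based retrieval is competitive in recall and on some datasets outperforms Wikipedia-based dense retrievers, indicating that the web is a flexible knowledge source.
With web evidence and more inference-time compute, small models can achieve strong results, sometimes surpassing much larger closed-book models.
Oracle-style conditioning on gold evidence represents an upper bound and indicates room for improvement through better prompt optimization and constrained decoding.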
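
The reranking idea behind these findings can be sketched as follows. Each candidate answer carries log-probabilities that would be produced by the same language model under different factorizations; the factor names (`a_given_qp`, `q_given_ap`, `a_given_p`), the unnormalized noisy-channel form, the default weights, and the numbers in the example are all illustrative assumptions rather than details taken from this review.

```python
def direct_score(log_probs):
    """Direct inference: log p(answer | question, paragraph)."""
    return log_probs["a_given_qp"]


def noisy_channel_score(log_probs):
    """Unnormalized noisy channel: log p(question | answer, paragraph) + log p(answer | paragraph)."""
    return log_probs["q_given_ap"] + log_probs["a_given_p"]


def poe_score(log_probs, weights=None):
    """Product of experts: weighted sum of all available log-probabilities."""
    weights = weights or {}
    # Missing weights default to 1.0; in practice they would be tuned on held-out data.
    return sum(weights.get(name, 1.0) * value for name, value in log_probs.items())


def rerank(candidates, score_fn):
    """Return the candidate with the highest score under score_fn."""
    return max(candidates, key=lambda c: score_fn(c["log_probs"]))


# Made-up log-probabilities for two candidate answers from different paragraphs.
candidates = [
    {"answer": "Paris", "log_probs": {"a_given_qp": -0.2, "q_given_ap": -1.0, "a_given_p": -0.5}},
    {"answer": "Lyon", "log_probs": {"a_given_qp": -0.1, "q_given_ap": -3.0, "a_given_p": -2.0}},
]

best_direct = rerank(candidates, direct_score)["answer"]
best_noisy = rerank(candidates, noisy_channel_score)["answer"]
best_poe = rerank(candidates, poe_score)["answer"]
```

In this toy example, direct inference prefers "Lyon" while the noisy-channel and product-of-experts scores prefer "Paris", which shows how combining several scores can overturn a choice based on a single probability.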

This review was generated by AI and checked by a human editor.