QUICK REVIEW

[Paper Review] Pre-training Tasks for Embedding-based Large-scale Retrieval

Wei-Cheng Chang | arXiv (Cornell University) | Feb 10, 2020

Topic Modeling | Citations: 101

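One-line summary

This paper analyzes pre-training tasks for two-tower Transformer retrieval models and shows that paragraph-level tasks (Inverse Cloze Task, Body First Selection, Wiki Link Prediction) improve retrieval performance substantially over BM-25, whereas token-level masked language modeling (MLM) yields only limited gains.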

ABSTRACT

We consider the large-scale query-document retrieval problem: given a query (e.g., a question), return the set of relevant documents (e.g., paragraphs containing the answer) from a large document corpus. This problem is often solved in two steps. The retrieval phase first reduces the solution space, returning a subset of candidate documents. The scoring phase then re-ranks the documents. Critically, the retrieval algorithm not only desires high recall but also requires to be highly efficient, returning candidates in time sublinear to the number of documents. Unlike the scoring phase witnessing significant advances recently due to the BERT-style pre-training tasks on cross-attention models, the retrieval phase remains less well studied. Most previous works rely on classic Information Retrieval (IR) methods such as BM-25 (token matching + TF-IDF weights). These models only accept sparse handcrafted features and can not be optimized for different downstream tasks of interest. In this paper, we conduct a comprehensive study on the embedding-based retrieval models. We show that the key ingredient of learning a strong embedding-based Transformer model is the set of pre-training tasks. With adequately designed paragraph-level pre-training tasks, the Transformer models can remarkably improve over the widely-used BM-25 as well as embedding models without Transformers. The paragraph-level pre-training tasks we studied are Inverse Cloze Task (ICT), Body First Selection (BFS), Wiki Link Prediction (WLP), and the combination of all three.

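Motivation and objectives

Motivate the large-scale query-document retrieval problem and the need for efficient retrieval in two-stage systems.
Investigate how pre-training tasks affect the performance of two-tower Transformer retrievers.
Evaluate the paragraph-level pre-training tasks ICT, BFS, WLP, and their combination against token-level MLM and BM-25.
Show that, with properly designed pre-training tasks, two-tower models outperform BM-25 and BoW baselines in retrieval settings.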

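Proposed method

Define a two-tower retrieval model with a query encoder and a document encoder, each producing embeddings with a Transformer architecture.
Train with a softmax over candidate documents, using sampled softmax to approximate the full softmax.
Propose and evaluate Inverse Cloze Task (ICT), Body First Selection (BFS), Wiki Link Prediction (WLP), and the combination ICT+BFS+WLP, comparing them with token-level MLM.
Build positive (q, d) pairs from Wikipedia-derived data for pre-training, then fine-tune on downstream retrieval QA datasets (SQuAD, Natural Questions), including open-domain settings.
Compare against BM-25 and BoW-MLP baselines using recall@k.
The experimental setup uses two-tower encoders with 512-dimensional embeddings, 64-token queries, 288-token documents, and 100K pre-training steps on 32 TPU v3.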
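
The pair construction and training objective above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `ict_pairs` and `in_batch_softmax_loss` are hypothetical names, ICT is taken in its usual form (one sentence is the query and the remaining sentences of its passage are the positive document), every sentence is used as a query rather than a random sample, and the other documents in the batch stand in for the sampled negatives with no sampling-correction term.

```python
import math


def ict_pairs(sentences):
    """Build Inverse Cloze Task pairs from one passage.

    Each sentence serves as a query; the remaining sentences, joined in
    order, serve as its positive document (simplifying assumption: every
    sentence is used instead of a random sample).
    """
    pairs = []
    for i, query in enumerate(sentences):
        context = " ".join(sentences[:i] + sentences[i + 1:])
        pairs.append((query, context))
    return pairs


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def in_batch_softmax_loss(query_embs, doc_embs):
    """Mean cross-entropy over a batch of (query, document) embeddings.

    query_embs[i] is assumed to match doc_embs[i]; all other documents
    in the batch act as negatives (assumption: in-batch sampling).
    """
    losses = []
    for i, q in enumerate(query_embs):
        scores = [dot(q, d) for d in doc_embs]
        m = max(scores)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        losses.append(log_z - scores[i])  # -log softmax of the positive
    return sum(losses) / len(losses)


if __name__ == "__main__":
    passage = ["Alpha is a town.", "It lies on a river.", "It has one bridge."]
    for query, doc in ict_pairs(passage):
        print(repr(query), "->", repr(doc))
    # Toy embeddings in place of real encoder outputs.
    print(in_batch_softmax_loss([[1.0, 0.0], [0.0, 1.0]],
                                [[1.0, 0.0], [0.0, 1.0]]))
```

In a real system the embeddings would come from the two Transformer encoders; the toy vectors here only show the shape of the computation.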

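Experimental results

Research questions

RQ1: How do different pre-training tasks affect the effectiveness of two-tower Transformer retrieval models in large-scale retrieval?
RQ2: Do paragraph-level pre-training tasks outperform token-level MLM and traditional IR baselines such as BM-25 on retrieval tasks?
RQ3: Does combining ICT, BFS, and WLP yield additional gains, particularly in low-data or open-domain settings?
RQ4: How do model depth and embedding dimension interact with the pre-training tasks to affect retrieval recall?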

Key findings

Train/test ratio	Encoder	Pre-training task	R@1	R@5	R@10	R@50	R@100
1%/99%	BM-25	No Pretraining	4.99	11.91	15.41	24.00	27.97
1%/99%	BoW-MLP	No Pretraining	0.14	0.35	0.49	1.13	1.72
1%/99%	BoW-MLP	ICT+BFS+WLP	22.55	41.03	49.93	69.70	77.01
1%/99%	Transformer	No Pretraining	0.02	0.06	0.08	0.31	0.54
1%/99%	Transformer	MLM	0.18	0.51	0.82	2.46	3.93
1%/99%	Transformer	ICT+BFS+WLP	37.43	61.48	70.18	85.37	89.85
5%/95%	BM-25	No Pretraining	41.87	57.98	63.63	74.17	77.91
5%/95%	BoW-MLP	ICT+BFS+WLP	26.23	46.49	55.68	75.28	81.89
5%/95%	Transformer	ICT+BFS+WLP	45.90	70.89	78.47	90.49	93.64
80%/20%	BM-25	No Pretraining	41.77	57.95	63.55	73.94	77.49
80%/20%	BoW-MLP	ICT+BFS+WLP	32.24	55.26	65.49	83.37	88.50
80%/20%	Transformer	ICT+BFS+WLP	58.35	82.76	88.44	95.87	97.49
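
The R@k columns above can be read with the following sketch of a recall@k computation. It assumes the common retrieval-QA definition (a query counts as a hit if any of its relevant documents appears among the top k retrieved candidates); `recall_at_k` and its argument names are hypothetical, and the table reports the value as a percentage.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of queries with at least one relevant document in the top k.

    ranked_ids[i]   : document ids retrieved for query i, best first
    relevant_ids[i] : ids of documents considered relevant for query i
    """
    hits = 0
    for ranked, relevant in zip(ranked_ids, relevant_ids):
        if set(ranked[:k]) & set(relevant):
            hits += 1
    return hits / len(ranked_ids)


if __name__ == "__main__":
    ranked = [[3, 1, 2], [2, 3, 1]]   # toy rankings for two queries
    relevant = [[1], [1]]             # one gold document per query
    for k in (1, 2, 3):
        print(k, 100 * recall_at_k(ranked, relevant, k))
```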

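Two-tower Transformer models with properly designed paragraph-level pre-training tasks significantly outperform BM-25 and BoW baselines on retrieval tasks.
Paragraph-level pre-training (ICT, BFS, WLP) produces large gains, whereas token-level MLM yields only marginal improvement.
The ICT+BFS+WLP combination consistently outperforms the individual tasks on SQuAD and Natural Questions, most notably in low-resource and open-domain settings.
Transformer encoders benefit more from paragraph-level pre-training than shallow BoW-MLP encoders, and larger embedding dimensions improve performance.
In open-domain retrieval experiments, ICT+BFS+WLP and ICT show robust gains even with large candidate sets.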
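
Retrieval over a candidate set with a two-tower model comes down to ranking documents by the inner product of their embeddings with the query embedding. The sketch below shows that step as an exhaustive scan, which is an assumption made for clarity: `top_k_by_inner_product` is a hypothetical helper, and a large-scale system would replace the linear scan with an approximate maximum inner product search index to reach the sublinear retrieval time the abstract calls for.

```python
import heapq


def top_k_by_inner_product(query_emb, doc_embs, k):
    """Return the ids of the k documents with the highest inner product.

    Exhaustive scan over all documents; doc ids are list positions.
    """
    scored = (
        (sum(q * d for q, d in zip(query_emb, doc)), doc_id)
        for doc_id, doc in enumerate(doc_embs)
    )
    return [doc_id for _, doc_id in heapq.nlargest(k, scored)]


if __name__ == "__main__":
    docs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # toy document embeddings
    print(top_k_by_inner_product([1.0, 0.0], docs, k=2))
```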


This review was generated by AI and checked by a human editor.