QUICK REVIEW

[Paper Review] TwinBERT: Distilling Knowledge to Twin-Structured BERT Models for Efficient Retrieval

Wenhao Lu, Jian Jiao|arXiv (Cornell University)|Feb 14, 2020

Topic Modeling|References 37|Cited by 28

Summary in Brief

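TwinBERT is a twin-structured BERT model that decouples query encoding from document encoding. Because the two are encoded independently, document embeddings can be precomputed and cached, which brings inference time on CPUs down to around 20 ms. Using knowledge distillation and an efficient network design, it reaches performance close to BERT-Base while running inference 77 to 663 times faster than the BERT-3 and BERT-12 baselines, enabling low-latency deployment in production information retrieval systems.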

ABSTRACT

Pre-trained language models like BERT have achieved great success in a wide variety of NLP tasks, while the superior performance comes with high demand in computational resources, which hinders the application in low-latency IR systems. We present TwinBERT model for effective and efficient retrieval, which has twin-structured BERT-like encoders to represent query and document respectively and a crossing layer to combine the embeddings and produce a similarity score. Different from BERT, where the two input sentences are concatenated and encoded together, TwinBERT decouples them during encoding and produces the embeddings for query and document independently, which allows document embeddings to be pre-computed offline and cached in memory. Thereupon, the computation left for run-time is from the query encoding and query-document crossing only. This single change can save large amount of computation time and resources, and therefore significantly improve serving efficiency. Moreover, a few well-designed network layers and training strategies are proposed to further reduce computational cost while at the same time keep the performance as remarkable as BERT model. Lastly, we develop two versions of TwinBERT for retrieval and relevance tasks correspondingly, and both of them achieve close or on-par performance to BERT-Base model. The model was trained following the teacher-student framework and evaluated with data from one of the major search engines. Experimental results showed that the inference time was significantly reduced and was firstly controlled around 20ms on CPUs while at the same time the performance gain from fine-tuned BERT-Base model was mostly retained. Integration of the models into production systems also demonstrated remarkable improvements on relevance metrics with negligible influence on latency.

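Motivation and Objectives

Resolve the high inference latency of BERT in real-time information retrieval (IR) systems.
Enable efficient online serving of deep neural models in low-latency settings such as sponsored search.
Maintain strong retrieval performance and relevance while substantially cutting computational cost.
Investigate knowledge distillation techniques that improve inference efficiency while preserving performance.
Make dense semantic models deployable on CPUs without degrading relevance quality.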

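Proposed Method

TwinBERT uses two independent BERT-like encoders that process the query and the document separately. Unlike standard BERT, which concatenates the two inputs and encodes them together, it decouples the encoding of the two.
Document embeddings are precomputed offline and cached in memory, so document encoding is skipped at inference time.
A crossing layer combines the query and document embeddings, using either cosine similarity or a residual network, to compute a relevance score.
Knowledge distillation is applied: TwinBERT is trained with a BERT-Base model as the teacher, which reduces model complexity while retaining performance.
ONNX Runtime is used to optimize CPU inference and minimize serving overhead in production systems.
Efficient network components and training strategies are designed to cut computational cost substantially without a marked loss in performance.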
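
The cached-embedding scoring path and the distillation objective described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the encoders are omitted and embeddings are passed in as plain arrays, `DocumentCache`, `cosine_crossing`, and `distillation_loss` are hypothetical names, only the cosine variant of the crossing layer is shown, and binary cross-entropy against teacher soft labels is assumed as the distillation loss because this review does not state which loss is used.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale vectors to unit length so that a dot product equals cosine similarity."""
    x = np.asarray(x, dtype=np.float64)
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)


class DocumentCache:
    """Hypothetical in-memory store of document embeddings computed offline."""

    def __init__(self, doc_ids, doc_embeddings):
        self.row = {doc_id: i for i, doc_id in enumerate(doc_ids)}
        # Normalizing once offline keeps that work out of the serving path.
        self.matrix = l2_normalize(doc_embeddings)

    def lookup(self, doc_ids):
        return self.matrix[[self.row[d] for d in doc_ids]]


def cosine_crossing(query_embedding, cache, candidate_ids):
    """Score one query against cached candidates with cosine similarity."""
    q = l2_normalize(query_embedding)
    return cache.lookup(candidate_ids) @ q


def distillation_loss(student_logits, teacher_probs):
    """Mean binary cross-entropy between student logits and teacher soft labels."""
    z = np.asarray(student_logits, dtype=np.float64)
    t = np.asarray(teacher_probs, dtype=np.float64)
    # Numerically stable form of -[t*log(sigmoid(z)) + (1-t)*log(1-sigmoid(z))].
    return float(np.mean(np.maximum(z, 0.0) - z * t + np.log1p(np.exp(-np.abs(z)))))


# Toy usage with made-up 2-dimensional embeddings.
cache = DocumentCache(["d1", "d2", "d3"], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
scores = cosine_crossing([2.0, 0.0], cache, ["d1", "d2", "d3"])
```

At serving time only the query vector is new, so the per-query cost in this sketch is one normalization and one matrix-vector product over the candidate rows. The residual-network crossing layer would replace `cosine_crossing` with a small learned network over the two embeddings.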

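Experimental Results

Research Questions

RQ1: Can decoupling query and document encoding in BERT reduce inference latency while maintaining strong retrieval performance?
RQ2: To what extent can knowledge distillation retain BERT-level performance in a smaller, faster model?
RQ3: How much run-time computation in a retrieval system can be saved by precomputing and caching document embeddings?
RQ4: Can TwinBERT achieve inference latency under 20 ms on CPUs while matching the relevance ranking performance of BERT-Base?
RQ5: What impact does TwinBERT have on a production search system in terms of latency, accuracy, and deployability?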

Key Findings

When scoring 100 documents per query, TwinBERT averaged about 20 ms of inference time on CPUs, a marked latency reduction relative to BERT.
With precomputed document embeddings, the residual-network version of TwinBERT ran inference 77 times and 422 times faster than BERT-3 and BERT-12, respectively.
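In production A/B testing, TwinBERT retained more than 90% of the incremental gain delivered by fine-tuned BERT-12 and reduced bad ad impressions by more than 10%.
The TwinBERT models were successfully deployed in a major sponsored search system, with negligible impact on latency and high relevance quality maintained.
Under the same conditions, the cosine-similarity version of TwinBERT was 121 times and 663 times faster than BERT-3 and BERT-12, respectively.
Even when query embeddings were also recomputed at run time, TwinBERT remained faster than BERT-3, confirming its efficiency advantage.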
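
To see where the savings in the findings above come from, it can help to count transformer forward passes per query. The helper below is a rough illustration only: `encoder_passes` is a hypothetical function, it ignores the cost of the crossing layer and the difference in sequence length between a concatenated query-document pair and a single input, and it says nothing about the measured timings reported in the paper.

```python
def encoder_passes(num_docs, twin, cached_docs=True):
    """Count transformer forward passes needed to score one query against num_docs documents."""
    if not twin:
        # Cross-encoder BERT: one pass per concatenated query-document pair.
        return num_docs
    passes = 1  # The query encoder runs once per query.
    if not cached_docs:
        # Without the cache, every candidate document is encoded at run time.
        passes += num_docs
    return passes


# Scenario from the findings: 100 candidate documents per query.
cross_encoder = encoder_passes(100, twin=False)
twin_cached = encoder_passes(100, twin=True)
twin_uncached = encoder_passes(100, twin=True, cached_docs=False)
```

With cached document embeddings, the number of run-time encoder passes no longer grows with the number of candidates, which is the structural reason the decoupled design scales to scoring many documents per query.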

This review was generated by AI and checked by a human editor.