QUICK REVIEW

[Paper Review] Incorporating BERT into Parallel Sequence Decoding with Adapters

Junliang Guo, Zhirui Zhang|arXiv (Cornell University)|Oct 13, 2020

Topic Modeling|References: 38|Citations: 40

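One-Line Summary

This paper presents AB-Net, which inserts lightweight adapters into two BERT models (one source-side, one target-side) so that they can be used for parallel sequence decoding with Mask-Predict. The result is parameter-efficient, achieves strong NMT performance, and halves decoding latency relative to autoregressive baselines.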

ABSTRACT

While large scale pre-trained language models such as BERT have achieved great success on various natural language understanding tasks, how to efficiently and effectively incorporate them into sequence-to-sequence models and the corresponding text generation tasks remains a non-trivial problem. In this paper, we propose to address this problem by taking two different BERT models as the encoder and decoder respectively, and fine-tuning them by introducing simple and lightweight adapter modules, which are inserted between BERT layers and tuned on the task-specific dataset. In this way, we obtain a flexible and efficient model which is able to jointly leverage the information contained in the source-side and target-side BERT models, while bypassing the catastrophic forgetting problem. Each component in the framework can be considered as a plug-in unit, making the framework flexible and task agnostic. Our framework is based on a parallel sequence decoding algorithm named Mask-Predict considering the bi-directional and conditional independent nature of BERT, and can be adapted to traditional autoregressive decoding easily. We conduct extensive experiments on neural machine translation tasks where the proposed method consistently outperforms autoregressive baselines while reducing the inference latency by half, and achieves $36.49$/$33.57$ BLEU scores on IWSLT14 German-English/WMT14 German-English translation. When adapted to autoregressive decoding, the proposed method achieves $30.60$/$43.56$ BLEU scores on WMT14 English-German/English-French translation, on par with the state-of-the-art baseline models.

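Motivation and Objectives

Explore how two pre-trained BERT models can serve as the encoder and decoder of a seq2seq framework by means of lightweight adapters.
Freeze the BERT parameters and train only the adapters to mitigate catastrophic forgetting.
Apply a parallel decoding scheme (Mask-Predict) that exploits BERT's bidirectional context while preserving conditional generation.
Demonstrate improvements over autoregressive baselines across multiple translation tasks and languages.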

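Proposed Method

Insert adapter modules into every BERT layer on both the encoder and decoder sides, and fine-tune only the adapters.
Use two BERT models, the source-side Xbert and the target-side Ybert, as encoder and decoder in a seq2seq setting.
Train with a conditional masked language modeling objective $\mathcal{L}(y^m \mid y^r, x; A_{\mathrm{enc}}, A_{\mathrm{dec}})$ (similar to Eq. (3) of the paper), where $y^m$ denotes the masked target tokens and $y^r$ the remaining observed ones.
Adopt Mask-Predict parallel decoding to exploit BERT's bidirectional context and achieve fast inference; extend to autoregressive decoding where needed.
Predict the target length with a special [LENGTH] token, then iteratively refine the output by masking and re-predicting tokens.
Consider varying the adapter architecture (A_enc, A_dec) and the layers in which adapters are placed to balance performance against parameter efficiency.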
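The mask-and-re-predict refinement described above can be sketched in a few lines. This is a minimal illustration of the generic Mask-Predict loop with its linear-decay masking schedule, not the authors' implementation. The target length is assumed to have been predicted already (for example from the [LENGTH] token), and `predict_fn` and `toy_predict` are hypothetical stand-ins for the adapter-equipped BERT decoder.

```python
MASK = "[MASK]"


def num_to_mask(length, step, total_steps):
    """Linear-decay schedule: floor(N * (T - t) / T) tokens are re-masked after step t."""
    return (length * (total_steps - step)) // total_steps


def mask_predict(predict_fn, length, total_steps=10):
    """Refine a fully masked sequence of known length over several parallel passes.

    predict_fn(tokens) is a hypothetical decoder call. It returns one
    (token, probability) pair per position, conditioned on the current
    partially masked sequence (and, in a real model, on the source sentence).
    """
    tokens = [MASK] * length
    probs = [0.0] * length
    for step in range(1, total_steps + 1):
        predictions = predict_fn(tokens)
        # Fill in only the positions that are currently masked.
        for i, (tok, p) in enumerate(predictions):
            if tokens[i] == MASK:
                tokens[i], probs[i] = tok, p
        n = num_to_mask(length, step, total_steps)
        if n == 0:
            break
        # Re-mask the n least confident positions for the next pass.
        worst = sorted(range(length), key=lambda i: probs[i])[:n]
        for i in worst:
            tokens[i] = MASK
            probs[i] = 0.0
    return tokens


def toy_predict(tokens):
    """Toy predictor (assumption): always proposes a fixed target sentence,
    with confidence that grows as more tokens are already unmasked."""
    target = ["a", "b", "c", "d"]
    known = sum(t != MASK for t in tokens)
    return [(target[i], 0.5 + 0.1 * known) for i in range(len(tokens))]


print(mask_predict(toy_predict, length=4, total_steps=4))
```

Every pass predicts all masked positions at once, so the number of decoder calls is bounded by `total_steps` rather than by the sentence length, which is where the latency advantage over autoregressive decoding comes from.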

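Experimental Results

Research Questions

RQ1: Can adapters allow BERT to be used jointly as both the encoder and the decoder of a seq2seq framework?
RQ2: Does training only the adapter modules while keeping the BERT layers frozen mitigate catastrophic forgetting and improve efficiency?
RQ3: Does parallel decoding with Mask-Predict deliver a speedup and competitive translation quality compared with autoregressive baselines?
RQ4: How do the scale and architecture of the adapters affect performance and training efficiency?
RQ5: Is the framework effective across multiple language pairs and resource settings?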

Key Findings

Model	De-En (IWSLT14)	Ro-En (WMT16)	En-De (WMT14)	De-En (WMT14)	Latency	Parameters
Transformer-Base	33.59	34.46	28.04	32.69	778 ms	74 M
Mask-Predict	31.71	33.31	27.03	30.53	161 ms	75 M
BERT-Fused NAT	33.14	34.12	27.73	32.10	260 ms	90 M
AB-Net	36.49	35.63	28.69	33.57	327 ms	67 M
AB-Net-Enc	34.45	-	28.08	-	165 ms	78 M
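The latency column can be converted into speedups over the autoregressive Transformer-Base baseline. The snippet below is a small worked calculation using only the latency figures from the table above; the variable and function names are illustrative.

```python
# Latency figures in milliseconds, copied from the table above.
latency_ms = {
    "Transformer-Base": 778,
    "Mask-Predict": 161,
    "BERT-Fused NAT": 260,
    "AB-Net": 327,
    "AB-Net-Enc": 165,
}


def speedup(baseline_ms, model_ms):
    """How many times faster a model is than the baseline."""
    return baseline_ms / model_ms


baseline = latency_ms["Transformer-Base"]
speedups = {name: round(speedup(baseline, ms), 2) for name, ms in latency_ms.items()}

for name, s in speedups.items():
    print(f"{name}: {s:.2f}x")
```

On these numbers AB-Net is roughly 2.4 times faster than Transformer-Base, consistent with the abstract's claim of halving inference latency, while plain Mask-Predict and AB-Net-Enc are faster still.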

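AB-Net achieves 36.49 BLEU on IWSLT14 De-En and 33.57 BLEU on WMT14 De-En with parallel decoding, outperforming both Mask-Predict and the autoregressive Transformer-Base baseline.
AB-Net cuts decoding latency by more than half relative to Transformer-Base (327 ms vs. 778 ms) with a similar number of trainable parameters.
With BERT on both sides (encoder and decoder), AB-Net uses fewer trainable parameters than BERT-Fused NAT (67 M vs. 90 M) and scores higher BLEU than the baselines.
Adapters on the encoder and decoder sides exploit the information in both BERT models together with the conditional dependencies between source and target, which improves performance.
AB-Net-Enc (adapters on an encoder-only BERT) also gives strong results, and placing adapters only in the upper layers reduces the parameter count while preserving performance.
On the low-resource IWSLT14 language pairs, AB-Net consistently outperforms the baselines on En-It, It-En, En-Es, Es-En, En-Nl, and Nl-En.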
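To see why restricting adapters to a subset of layers saves parameters, consider a rough count for a generic bottleneck adapter (one down-projection and one up-projection, both with biases). This is a sketch under stated assumptions: the review does not specify the adapter architecture, the decoder-side adapters in the paper are likely larger than this, and the bottleneck width below is hypothetical. The figures therefore do not reproduce the parameter column of the table.

```python
def bottleneck_adapter_params(d_model, d_bottleneck):
    """Parameter count of one down-projection plus up-projection adapter, with biases."""
    down = d_model * d_bottleneck + d_bottleneck
    up = d_bottleneck * d_model + d_model
    return down + up


def total_adapter_params(num_layers, d_model, d_bottleneck):
    """Total adapter parameters when one adapter is inserted in each of num_layers layers."""
    return num_layers * bottleneck_adapter_params(d_model, d_bottleneck)


D_MODEL = 768        # hidden size of BERT-base (assumed model size)
D_BOTTLENECK = 512   # hypothetical adapter width, not taken from the paper

all_layers = total_adapter_params(12, D_MODEL, D_BOTTLENECK)
top_half = total_adapter_params(6, D_MODEL, D_BOTTLENECK)

print(f"adapters in all 12 layers: {all_layers:,} parameters")
print(f"adapters in top 6 layers:  {top_half:,} parameters")
```

Because the count scales linearly with the number of adapted layers, adapting only the upper half of the stack halves the trainable adapter parameters in this simplified setting.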

This review was generated by AI and checked by a human editor.