QUICK REVIEW

[Paper Review] Rethinking Positional Encoding in Language Pre-training

Guolin Ke, Di He|arXiv (Cornell University)|Jun 28, 2020

Topic Modeling | References: 31 | Citations: 67

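One-line summary

TUPE proposes an untied positional encoding that separates word correlations from positional correlations and unties the [CLS] token from the other positions, improving GLUE performance and speeding up pre-training.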

ABSTRACT

In this work, we investigate the positional encoding methods used in language pre-training (e.g., BERT) and identify several problems in the existing formulations. First, we show that in the absolute positional encoding, the addition operation applied on positional embeddings and word embeddings brings mixed correlations between the two heterogeneous information resources. It may bring unnecessary randomness in the attention and further limit the expressiveness of the model. Second, we question whether treating the position of the symbol [CLS] the same as other words is a reasonable design, considering its special role (the representation of the entire sentence) in the downstream tasks. Motivated from above analysis, we propose a new positional encoding method called Transformer with Untied Positional Encoding (TUPE). In the self-attention module, TUPE computes the word contextual correlation and positional correlation separately with different parameterizations and then adds them together. This design removes the mixed and noisy correlations over heterogeneous embeddings and offers more expressiveness by using different projection matrices. Furthermore, TUPE unties the [CLS] symbol from other positions, making it easier to capture information from all positions. Extensive experiments and ablation studies on GLUE benchmark demonstrate the effectiveness of the proposed method. Codes and models are released at https://github.com/guolinke/TUPE.

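Motivation and Objectives

Re-examine the absolute and relative positional encodings used in language pre-training.
Propose TUPE, a new method that unties word correlations from positional correlations in self-attention.
Untie the [CLS] symbol from ordinary positions so that it captures whole-sentence information more effectively.
Show that TUPE improves GLUE benchmark results in the BERT-Base setting.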

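Proposed Method

Within self-attention, compute the word contextual correlation and the positional correlation separately, each with its own projection matrices.
Replace the input-level sum of word embeddings and absolute positional embeddings with separate correlation terms inside the attention mechanism.
Untie [CLS] by resetting its position-related correlations to learnable parameters.
Provide two variants: TUPE-A (untied absolute) and TUPE-R (untied with relative).
Share the positional correlation term across layers for efficiency.
Evaluate on GLUE with BERT-Base, and extend the analysis to BERT-Large and ELECTRA in the appendix.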
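
The untied attention score described above can be sketched for a single head in NumPy. This is a minimal illustration, not the authors' released code: the function name `untied_attention_scores`, the parameter names, the toy shapes, the 1/sqrt(2d) scaling, and the exact way the [CLS] row and column are reset are all assumptions based on a reading of the method. Only the TUPE-A style absolute term is shown; the relative term of TUPE-R is omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def untied_attention_scores(x, p, Wq, Wk, Uq, Uk, theta_from_cls, theta_to_cls):
    """Single-head, TUPE-A style scores (illustrative sketch only).

    x: (n, d_model) word embeddings, with position 0 assumed to be [CLS]
    p: (n, d_model) absolute positional embeddings
    Wq, Wk: word projections; Uq, Uk: separate positional projections
    theta_from_cls, theta_to_cls: learnable scalars (hypothetical names)
    """
    d = Wq.shape[1]
    scale = 1.0 / np.sqrt(2 * d)  # assumed scaling for the sum of two terms

    # Word-to-word and position-to-position correlations, computed separately.
    word = (x @ Wq) @ (x @ Wk).T * scale
    pos = (p @ Uq) @ (p @ Uk).T * scale

    # Untie [CLS]: overwrite its positional correlations with learned scalars.
    pos[:, 0] = theta_to_cls    # every other position attending to [CLS]
    pos[0, :] = theta_from_cls  # [CLS] attending to every position

    return word + pos, pos

rng = np.random.default_rng(0)
n, d_model, d_head = 5, 8, 4
x = rng.normal(size=(n, d_model))
p = rng.normal(size=(n, d_model))
Wq, Wk, Uq, Uk = (rng.normal(size=(d_model, d_head)) for _ in range(4))

scores, pos = untied_attention_scores(
    x, p, Wq, Wk, Uq, Uk, theta_from_cls=0.3, theta_to_cls=-0.2
)
attn = softmax(scores)  # each row is an attention distribution
print(attn.shape)
```

In this sketch `pos` depends only on the positional embeddings and their own projections, not on `x`, which is why such a term could be computed once and reused across layers.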

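Experimental Results

Research Questions

RQ1: Can untied, separately computed word and positional correlations improve Transformer pre-training compared with standard absolute/relative encodings?
RQ2: Does untying the [CLS] symbol from ordinary positions improve sentence-level representations?
RQ3: Do TUPE-A and TUPE-R offer complementary benefits when combined with existing relative encodings?
RQ4: What effect does TUPE have on GLUE benchmark performance and pre-training efficiency?

Key Findings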

Model	Steps	MNLI-m/mm	QNLI	QQP	SST	CoLA	MRPC	RTE	STS	Avg.
BERT-A	1 M	84.93/84.91	91.34	91.04	92.88	55.19	88.29	68.61	89.43	82.96
BERT-R	1 M	85.81/85.84	92.16	91.12	92.90	55.43	89.26	71.46	88.94	83.66
TUPE-A	1 M	86.05/85.99	91.92	91.16	93.19	63.09	88.37	71.61	88.88	84.47
TUPE-R	1 M	86.21/86.19	92.17	91.30	93.26	63.56	89.89	73.56	89.23	85.04
TUPE-A mid	300 k	84.76/84.83	90.96	91.00	92.25	62.13	87.1	68.79	88.16	83.33
TUPE-R mid	300 k	84.86/85.21	91.23	91.14	92.41	62.47	87.29	69.85	88.63	83.68
TUPE-A tie-cls	1 M	85.91/85.73	91.90	91.05	93.17	59.46	88.53	69.54	88.97	83.81
BERT-A d	1 M	85.26/85.28	91.56	91.02	92.70	59.73	88.46	71.31	87.47	83.64

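TUPE-A and TUPE-R outperform the BERT-A and BERT-R baselines on the GLUE average and on most individual tasks.
TUPE-R reaches a GLUE average of 85.04, about 1.38 points above BERT-R's 83.66.
TUPE-R beats TUPE-A by 0.57 points on average.
TUPE-A and TUPE-R converge faster during pre-training: with about 30% of the pre-training steps (300k vs. 1M), they already reach better downstream performance than the corresponding baselines.
Untying [CLS] yields notable gains on low-resource tasks (e.g., CoLA, RTE), while the untied correlations help high-resource tasks (e.g., MNLI).
TUPE adds few parameters (roughly 1% of BERT-Base) and negligible extra computation; the positional term can be reused across layers.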
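
The averages and gains quoted above can be re-derived from the per-task scores in the table. The short check below assumes the Avg. column is the plain mean of the nine reported scores (MNLI-m and MNLI-mm counted separately); that assumption and the helper name `glue_average` are introduced here for illustration, while the numbers are copied from the table.

```python
# Per-task scores copied from the table, in the order:
# MNLI-m, MNLI-mm, QNLI, QQP, SST, CoLA, MRPC, RTE, STS.
glue_scores = {
    "BERT-A":     [84.93, 84.91, 91.34, 91.04, 92.88, 55.19, 88.29, 68.61, 89.43],
    "BERT-R":     [85.81, 85.84, 92.16, 91.12, 92.90, 55.43, 89.26, 71.46, 88.94],
    "TUPE-A":     [86.05, 85.99, 91.92, 91.16, 93.19, 63.09, 88.37, 71.61, 88.88],
    "TUPE-R":     [86.21, 86.19, 92.17, 91.30, 93.26, 63.56, 89.89, 73.56, 89.23],
    "TUPE-A mid": [84.76, 84.83, 90.96, 91.00, 92.25, 62.13, 87.10, 68.79, 88.16],
    "TUPE-R mid": [84.86, 85.21, 91.23, 91.14, 92.41, 62.47, 87.29, 69.85, 88.63],
}

def glue_average(values):
    # Assumed definition: unweighted mean of the nine scores, rounded to 2 decimals.
    return round(sum(values) / len(values), 2)

averages = {name: glue_average(vals) for name, vals in glue_scores.items()}

# Gains quoted in the findings.
gain_tupe_r_over_bert_r = round(averages["TUPE-R"] - averages["BERT-R"], 2)
gain_tupe_r_over_tupe_a = round(averages["TUPE-R"] - averages["TUPE-A"], 2)

for name, avg in averages.items():
    print(f"{name:12s} {avg:.2f}")
print("TUPE-R vs BERT-R:", gain_tupe_r_over_bert_r)
print("TUPE-R vs TUPE-A:", gain_tupe_r_over_tupe_a)
```

Under that assumption the recomputed means match the Avg. column, and the 300k-step "mid" runs sit above their 1M-step BERT counterparts, which is the basis for the faster-convergence finding.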


This review was generated by AI and checked by a human editor.