QUICK REVIEW

[Paper Review] Explicit Sparse Transformer: Concentrated Attention Through Explicit Selection

Guangxiang Zhao, Junyang Lin | arXiv (Cornell University) | Dec 25, 2019

Multimodal Machine Learning Applications | References: 52 | Citations: 77

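One-Line Summary

This paper introduces Explicit Sparse Transformer, which concentrates attention by selecting the top-k most contributing positions and improves performance and efficiency across NLP and vision tasks.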

ABSTRACT

Self-attention based Transformer has demonstrated the state-of-the-art performances in a number of natural language processing tasks. Self-attention is able to model long-term dependencies, but it may suffer from the extraction of irrelevant information in the context. To tackle the problem, we propose a novel model called Explicit Sparse Transformer. Explicit Sparse Transformer is able to improve the concentration of attention on the global context through an explicit selection of the most relevant segments. Extensive experimental results on a series of natural language processing and computer vision tasks, including neural machine translation, image captioning, and language modeling, all demonstrate the advantages of Explicit Sparse Transformer in model performance. We also show that our proposed sparse attention method achieves comparable or better results than the previous sparse attention method, but significantly reduces training and testing time. For example, the inference speed is twice that of sparsemax in Transformer model. Code will be available at https://github.com/lancopku/Explicit-Sparse-Transformer

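Motivation and Objectives

Motivates the need for more focused attention, so that attention in Transformer models is less distracted by irrelevant context.
Proposes Explicit Sparse Transformer, which uses top-k selective attention to sharpen global context modeling.
Shows improvements over the vanilla Transformer on neural machine translation, image captioning, and language modeling.
Shows that the proposed sparse attention is faster than previous sparse attention methods while maintaining or improving accuracy.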

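Proposed Method

Compute the standard QK^T attention scores and apply a top-k mask to each query row so that only the k largest scores are kept.
Mask the scores outside the top k with -infinity before the softmax to obtain a concentrated attention distribution.
Apply softmax over the masked scores to normalize the attention weights.
Compute the context as C = AV from the sparse weights A and the values V.
Extend the sparse mechanism to context attention, where Q is derived from the decoding states.
Provide a simple, implementation-friendly approach that works for both self-attention and context attention.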
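The steps above can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation: the function name `explicit_sparse_attention`, the 1/sqrt(d) scaling (the usual scaled dot-product convention; the summary above only mentions QK^T), and the handling of tied scores are assumptions made for this example. It works on nested lists rather than tensors, so it favors readability over speed.

```python
import math


def explicit_sparse_attention(Q, K, V, k):
    """Top-k sparse attention over nested lists (illustrative sketch).

    Q: list of query vectors, K: list of key vectors, V: list of value vectors.
    k: number of scores kept per query (assumed to be >= 1).
    Returns (context, weights), i.e. C and A in the notation C = AV.
    """
    d = len(Q[0])
    context, weights = [], []
    for q in Q:
        # Scaled dot-product scores of this query against every key.
        # The 1/sqrt(d) factor is an assumption of this sketch.
        scores = [sum(qi * ki for qi, ki in zip(q, key)) / math.sqrt(d) for key in K]

        # The k-th largest score acts as the per-query threshold.
        k_eff = min(k, len(scores))
        threshold = sorted(scores, reverse=True)[k_eff - 1]

        # Scores below the threshold are set to -infinity before the softmax.
        # Scores tied with the threshold are all kept, so more than k
        # positions can survive when ties occur.
        masked = [s if s >= threshold else -math.inf for s in scores]

        # Numerically stable softmax; exp(-inf) is 0.0, so masked positions
        # receive exactly zero weight.
        m = max(masked)
        exps = [math.exp(s - m) for s in masked]
        total = sum(exps)
        a = [e / total for e in exps]
        weights.append(a)

        # Context vector: weighted sum of the value vectors.
        context.append([sum(a_j * v[c] for a_j, v in zip(a, V)) for c in range(len(V[0]))])
    return context, weights


# Usage: one query, three keys, keep the two best-matching positions.
Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
V = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
C, A = explicit_sparse_attention(Q, K, V, k=2)
# The third key has the lowest score, so A[0][2] is 0.0 and V[2] does not
# contribute to the context C[0].
```

When k is at least the number of keys, nothing is masked and the function reduces to ordinary dense softmax attention, which makes k a direct knob for how concentrated the attention is.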

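Experimental Results

Research Questions

RQ1: Does explicit top-k selective attention improve model focus and performance compared with the vanilla Transformer?
RQ2: How should the hyperparameter k be chosen across tasks and datasets?
RQ3: What training and inference efficiency benefits does top-k sparse attention offer over other sparse attention methods?
RQ4: Does sparse attention help model alignment and reduce distraction from irrelevant context?
RQ5: What qualitative differences in the attention distribution arise when explicit sparse attention is used?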

Key Findings

Model	En-De BLEU	En-Vi BLEU	De-En BLEU
Transformer	28.4	30.2	-
Explicit Sparse Transformer	29.4	31.1	35.6

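Explicit Sparse Transformer raises BLEU on En-De (29.4 vs. 28.4 for the Transformer).
On En-Vi it reaches 31.1 BLEU, versus 30.2 for the Transformer.
On De-En it reaches 35.6 BLEU (top-line result as reported; the table lists no Transformer score for this pair).
On image captioning (COCO), it slightly improves CIDEr and BLEU-4 over the Transformer baseline.
On language modeling (enwiki8), it outperforms Transformer-XL at a comparable parameter count.
Top-k sparse attention reduces training and inference time relative to prior sparse attention methods, running about 2x faster in some settings.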

This review was generated by AI and checked by a human editor.