QUICK REVIEW

[Paper Review] MUSE: Parallel Multi-Scale Attention for Sequence to Sequence Learning

Guangxiang Zhao, Xu Sun|arXiv (Cornell University)|Nov 17, 2019

Topic Modeling|References: 27|Citations: 41

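One-sentence summary

tldr: MUSE introduces parallel multi-scale attention, which combines self-attention, depthwise convolution, and a pointwise feed-forward network to better model global, local, and token-level context in seq2seq tasks, and it achieves state-of-the-art BLEU on major translation datasets.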

ABSTRACT

In sequence to sequence learning, the self-attention mechanism proves to be highly effective, and achieves significant improvements in many tasks. However, the self-attention mechanism is not without its own flaws. Although self-attention can model extremely long dependencies, the attention in deep layers tends to overconcentrate on a single token, leading to insufficient use of local information and difficulty in representing long sequences. In this work, we explore parallel multi-scale representation learning on sequence data, striving to capture both long-range and short-range language structures. To this end, we propose the Parallel MUlti-Scale attEntion (MUSE) and MUSE-simple. MUSE-simple contains the basic idea of parallel multi-scale sequence representation learning, and it encodes the sequence in parallel, in terms of different scales with the help from self-attention, and pointwise transformation. MUSE builds on MUSE-simple and explores combining convolution and self-attention for learning sequence representations from more different scales. We focus on machine translation and the proposed approach achieves substantial performance improvements over Transformer, especially on long sequences. More importantly, we find that although conceptually simple, its success in practice requires intricate considerations, and the multi-scale attention must build on unified semantic space. Under common setting, the proposed model achieves substantial performance and outperforms all previous models on three main machine translation tasks. In addition, MUSE has potential for accelerating inference due to its parallelism. Code will be available at https://github.com/lancopku/MUSE

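Motivation and objectives

Argue that Transformer-based seq2seq models need to handle long sequences in ways that go beyond pure self-attention.
Propose a parallel multi-scale architecture (MUSE) that fuses global (self-attention), local (convolution), and token-level (pointwise) representations.
Demonstrate state-of-the-art BLEU empirically on major translation benchmarks and analyze the factors that make multi-scale fusion effective.
Show the computational benefit of parallelization and offer insights into kernel selection and shared projections.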

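Proposed method

Define MUSE as an encoder/decoder built from N stacked MUSE modules with residual connections.
Within each MUSE module, compute Attention(X), DepthConv(X), and Pointwise(X) in parallel and fuse them as X_i = X_{i-1} + Attention(X_{i-1}) + Conv(X_{i-1}) + Pointwise(X_{i-1}), where Conv denotes the DepthConv branch.
Use depthwise separable convolution with dynamic kernel selection across multiple kernel sizes, and share the input projection with self-attention (V1 = V2 = V W^V).
Provide MUSE-simple, which omits convolution, to isolate the effect of the parallel multi-scale design.
Train MUSE-base/Large on the large WMT datasets and MUSE-base on the small IWSLT datasets, under standard NMT evaluation settings.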
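
The fusion rule above can be illustrated with a small pure-Python sketch. This is not the authors' implementation: it assumes a single attention head, one fixed kernel size, zero padding, and no layer normalization, masking, or output projections, and the names `muse_module`, `init_params`, and the parameter keys are hypothetical. It only shows the three branches being computed from the same input, with attention and convolution reading one shared value projection, and then summed with the residual.

```python
import math
import random
from typing import Dict, List

Matrix = List[List[float]]  # shape: (sequence length, model dimension)

def matmul(a: Matrix, b: Matrix) -> Matrix:
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def softmax(xs: List[float]) -> List[float]:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Single-head scaled dot-product attention (global scale)."""
    scale = math.sqrt(len(q[0]))
    out = []
    for qi in q:
        w = softmax([sum(a * b for a, b in zip(qi, kj)) / scale for kj in k])
        out.append([sum(wj * vj[c] for wj, vj in zip(w, v)) for c in range(len(v[0]))])
    return out

def depthwise_conv(v: Matrix, kernel: Matrix) -> Matrix:
    """Depthwise 1-D convolution over time with zero padding (local scale).
    kernel[c] holds the taps for channel c; channels are filtered independently."""
    n, d = len(v), len(v[0])
    half = len(kernel[0]) // 2
    out = [[0.0] * d for _ in range(n)]
    for t in range(n):
        for c in range(d):
            for j, w in enumerate(kernel[c]):
                s = t + j - half
                if 0 <= s < n:
                    out[t][c] += w * v[s][c]
    return out

def pointwise_ffn(x: Matrix, w1: Matrix, w2: Matrix) -> Matrix:
    """Position-wise feed-forward network with ReLU (token-level scale)."""
    hidden = [[max(0.0, h) for h in row] for row in matmul(x, w1)]
    return matmul(hidden, w2)

def muse_module(x: Matrix, p: Dict[str, Matrix]) -> Matrix:
    """X_i = X_{i-1} + Attention(X_{i-1}) + Conv(X_{i-1}) + Pointwise(X_{i-1})."""
    q, k = matmul(x, p["Wq"]), matmul(x, p["Wk"])
    v = matmul(x, p["Wv"])  # assumed reading of the shared projection V1 = V2 = V W^V
    attn = self_attention(q, k, v)          # branch 1: global
    conv = depthwise_conv(v, p["kernel"])   # branch 2: local, same projected values
    ffn = pointwise_ffn(x, p["W1"], p["W2"])  # branch 3: token-level
    return [[xi + a + c + f for xi, a, c, f in zip(rx, ra, rc, rf)]
            for rx, ra, rc, rf in zip(x, attn, conv, ffn)]

def init_params(d: int, hidden: int, kernel_size: int, seed: int = 0) -> Dict[str, Matrix]:
    """Hypothetical helper: small random weights for the sketch."""
    rng = random.Random(seed)
    def mat(r: int, c: int) -> Matrix:
        return [[rng.gauss(0.0, 0.1) for _ in range(c)] for _ in range(r)]
    return {"Wq": mat(d, d), "Wk": mat(d, d), "Wv": mat(d, d),
            "kernel": mat(d, kernel_size), "W1": mat(d, hidden), "W2": mat(hidden, d)}

x = [[1.0, 2.0, 3.0], [0.5, -1.0, 2.0], [0.0, 0.0, 1.0], [4.0, 1.0, -2.0]]
y = muse_module(x, init_params(d=3, hidden=5, kernel_size=3))
print(len(y), len(y[0]))  # output keeps the input shape
```

Because the three branches depend only on the module input and not on each other, they could in principle run concurrently, which is the property the review credits for the inference speedup. Dropping the `conv` term from the sum gives a rough analogue of MUSE-simple.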

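Experimental results

Research questions

RQ1: Does parallel multi-scale representation improve seq2seq performance over pure self-attention or pure convolution models?
RQ2: Does sharing the projection between self-attention and convolution help the multi-scale module learn?
RQ3: How does dynamic kernel size selection (dynamic vs. fixed) affect performance on long sequences?
RQ4: Compared with the Transformer, does parallelizing the MUSE module yield a practical gain in inference speed?
RQ5: Do the benefits generalize to both large and small datasets?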

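Key findings

MUSE-large achieves 29.9 BLEU on En-De and 43.5 BLEU on En-Fr, outperforming previous models of comparable size and data volume.
MUSE-simple already gives strong results, approaching the state of the art even without convolution, and adding DepthConv improves it further.
Sharing the projection between self-attention and convolution gives a notable gain (+1.4 BLEU over separate projections).
Dynamically selected kernels outperform fixed large or small kernels, and the best configuration achieves the top BLEU on the evaluated tasks.
With a comparable parameter count, MUSE showed roughly 31% faster inference than the Transformer.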
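
The dynamic kernel selection mentioned in the findings can be sketched as follows. This review does not spell out the selection mechanism, so the sketch assumes one plausible reading: the outputs of convolutions with different kernel sizes are mixed with softmax weights derived from learnable scalars. The names `dynamic_kernel_conv` and `alphas` are hypothetical, and a single channel is used for brevity.

```python
import math
from typing import List

def softmax(xs: List[float]) -> List[float]:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def conv_1ch(seq: List[float], taps: List[float]) -> List[float]:
    """1-D convolution of one channel over time, zero-padded to keep the length."""
    n, half = len(seq), len(taps) // 2
    return [sum(w * seq[t + j - half]
                for j, w in enumerate(taps) if 0 <= t + j - half < n)
            for t in range(n)]

def dynamic_kernel_conv(seq: List[float],
                        kernels: List[List[float]],
                        alphas: List[float]) -> List[float]:
    """Assumed mechanism: softmax(alphas) weights a sum of per-kernel outputs."""
    weights = softmax(alphas)
    outputs = [conv_1ch(seq, taps) for taps in kernels]
    return [sum(w * out[t] for w, out in zip(weights, outputs))
            for t in range(len(seq))]

seq = [1.0, 2.0, 3.0, 4.0, 5.0]
kernels = [
    [1.0],                 # kernel size 1: passes each token through
    [1 / 3, 1 / 3, 1 / 3], # kernel size 3: averages a token with its neighbors
]
print(dynamic_kernel_conv(seq, kernels, alphas=[0.0, 0.0]))    # equal mix
print(dynamic_kernel_conv(seq, kernels, alphas=[50.0, -50.0]))  # nearly all weight on size 1
```

In a trained model the `alphas` would be learned, so the module could lean toward wide or narrow kernels as the data demands instead of committing to one fixed size.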

This review was generated by AI and checked by a human editor.