QUICK REVIEW

[Paper Review] What Makes Convolutional Models Great on Long Sequence Modeling?

Yuhong Li, Tianle Cai|arXiv (Cornell University)|Oct 17, 2022

Speech Recognition and Synthesis | Citations: 20

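One-sentence summary

SGConv is a simple, efficient global convolution kernel built from multi-scale sub-kernels and decaying weights. It models long-range dependencies well, outperforms S4 on Long Range Arena, and serves as a more efficient and flexible drop-in module for language and vision models.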

ABSTRACT

Convolutional models have been widely used in multiple domains. However, most existing models only use local convolution, making the model unable to handle long-range dependency efficiently. Attention overcomes this problem by aggregating global information but also makes the computational complexity quadratic to the sequence length. Recently, Gu et al. [2021] proposed a model called S4 inspired by the state space model. S4 can be efficiently implemented as a global convolutional model whose kernel size equals the input sequence length. S4 can model much longer sequences than Transformers and achieve significant gains over SoTA on several long-range tasks. Despite its empirical success, S4 is involved. It requires sophisticated parameterization and initialization schemes. As a result, S4 is less intuitive and hard to use. Here we aim to demystify S4 and extract basic principles that contribute to the success of S4 as a global convolutional model. We focus on the structure of the convolution kernel and identify two critical but intuitive principles enjoyed by S4 that are sufficient to make up an effective global convolutional model: 1) The parameterization of the convolutional kernel needs to be efficient in the sense that the number of parameters should scale sub-linearly with sequence length. 2) The kernel needs to satisfy a decaying structure that the weights for convolving with closer neighbors are larger than the more distant ones. Based on the two principles, we propose a simple yet effective convolutional model called Structured Global Convolution (SGConv). SGConv exhibits strong empirical performance over several tasks: 1) With faster speed, SGConv surpasses S4 on Long Range Arena and Speech Command datasets. 2) When plugging SGConv into standard language and vision models, it shows the potential to improve both efficiency and performance.

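Motivation and objectives

Identify the minimal principles behind S4's success in long-range dependency modeling.
Propose a simpler, more intuitive global convolution kernel that retains this long-range modeling ability.
Demonstrate SGConv's empirical performance on long-range benchmarks and common downstream tasks.
Show that SGConv works as a general-purpose module in language and vision architectures.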

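Proposed method

Define two design principles for global convolution: efficient parameterization (the number of parameters scales sub-linearly with sequence length) and a decaying kernel structure (closer neighbors receive larger weights).
Introduce SGConv, a structured global convolution that builds its kernel from multi-scale sub-kernels upsampled from a small, fixed set of parameters and combines them with decaying weights. The convolution is computed with FFT in O(L log L) time.
Provide a concrete parameterization, Cat(S), that generates a length-L kernel from O(log L) parameters and includes a normalization constant Z and a decay coefficient alpha.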
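Compare SGConv empirically with S4 and other baselines on Long Range Arena (LRA) and Speech Commands; ablate the decay speed t and the scale dimension d; evaluate SGConv as a drop-in module on language and vision tasks.
Demonstrate SGConv as a language modeling block and as a drop-in replacement inside ConvNeXt for image classification; analyze speed and memory against attention-based and S4 blocks.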
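
The kernel construction described above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names `num_scales` and `sgconv_kernel` are hypothetical, the number of scales is assumed to be ceil(log2(L/d)) + 1, sub-kernel i is assumed to be upsampled by linear interpolation to length d * 2^max(i-1, 0) and scaled by alpha^i, and Z is taken to be the L2 norm of the concatenated kernel.

```python
import math
import numpy as np

def num_scales(seq_len, d):
    # Assumed number of sub-kernels: grows like log2(L / d).
    return math.ceil(math.log2(seq_len / d)) + 1

def sgconv_kernel(sub_kernels, alpha, seq_len):
    """Build a length-seq_len kernel from sub_kernels of shape (N, d)."""
    d = sub_kernels.shape[1]
    pieces = []
    for i, w in enumerate(sub_kernels):
        target = d * 2 ** max(i - 1, 0)           # assumed length of scale i
        grid = np.linspace(0.0, d - 1, target)    # resampling positions
        upsampled = np.interp(grid, np.arange(d), w)  # linear interpolation
        pieces.append(alpha ** i * upsampled)     # decay: later scales shrink
    kernel = np.concatenate(pieces)[:seq_len]     # truncate to sequence length
    z = np.linalg.norm(kernel)                    # assumed normalization Z
    return kernel / z

L, d = 1024, 16
rng = np.random.default_rng(0)
params = rng.standard_normal((num_scales(L, d), d))
kernel = sgconv_kernel(params, alpha=0.5, seq_len=L)
print(params.size, kernel.shape)  # 112 parameters for a 1024-tap kernel
```

With d fixed, doubling L adds only one more sub-kernel of d parameters, which is where the O(log L) parameter count comes from.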

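Experimental results

Research questions

RQ1: What are the minimal principles behind S4's success in long-range sequence modeling?
RQ2: Can a simple, non-SSM global convolution kernel match or exceed S4's performance?
RQ3: How does SGConv scale in parameter count and computation, and how does it perform across LRA, speech, language, and vision tasks?
RQ4: Can SGConv serve as a general-purpose module in both NLP and CV architectures?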

Key findings

Model	ListOps	Text	Retrieval	Image	Pathfinder	Path-X	Average
Transformer	36.37	64.27	57.46	42.44	71.40	✗	54.39
Sparse Trans.	17.07	63.58	59.59	44.24	71.71	✗	51.24
Linformer	35.70	53.94	52.27	38.56	76.34	✗	51.36
Reformer	37.27	56.10	53.40	38.07	68.50	✗	50.67
BigBird	36.05	64.02	59.29	40.83	74.87	✗	55.01
S4 (original)	58.35	76.02	87.09	87.26	86.05	88.10	80.48
S4 (Gu et al., 2022b)	59.60	86.82	90.90	88.65	94.20	96.35	86.09
SGConv	61.45	89.20	91.11	87.97	95.46	97.83	87.17
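
The last column of the table can be reproduced from the per-task scores. The short check below assumes that rows marked ✗ on Path-X are averaged over the remaining five tasks; this is an inference from the numbers, not something the review states. Only three rows are included for brevity, and the dictionary name `rows` is illustrative.

```python
# Per-task scores copied from the table; None marks a failed Path-X run (✗).
rows = {
    "Transformer": ([36.37, 64.27, 57.46, 42.44, 71.40, None], 54.39),
    "S4 (Gu et al., 2022b)": ([59.60, 86.82, 90.90, 88.65, 94.20, 96.35], 86.09),
    "SGConv": ([61.45, 89.20, 91.11, 87.97, 95.46, 97.83], 87.17),
}

means = {}
for name, (scores, reported) in rows.items():
    valid = [s for s in scores if s is not None]  # assumption: skip ✗ entries
    means[name] = sum(valid) / len(valid)
    print(f"{name}: computed {means[name]:.2f}, reported {reported:.2f}")
```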

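Guided by the two principles, SGConv outperforms S4 on the Long Range Arena and Speech Commands benchmarks while running faster.
SGConv achieves a higher average score on LRA (Table 1: 87.17 vs. 86.09 for S4) and stays competitive with SoTA on speech tasks at a lower computational cost than S4.
A simple SGConv kernel, built from multi-scale upsampled sub-kernels combined with decay, needs only O(log L) parameters and supports FFT-based O(L log L) computation.
Replacing some attention layers in language models with SGConv reduces their cost from O(L^2) to O(L log L) while preserving performance in certain settings.
Using SGConv inside ConvNeXt (SGConvNeXt) matches or exceeds SoTA models in several ImageNet-1k configurations, indicating cross-domain applicability.
SGConv blocks are shown to be faster than optimized S4 kernels across sequence lengths and hardware (CPU and GPU).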
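
The O(L log L) cost quoted for global convolution comes from evaluating the convolution in the frequency domain. The sketch below is a generic FFT-based causal convolution in NumPy, checked against direct convolution; it is not taken from the SGConv code, and the function name `fft_causal_conv` is hypothetical.

```python
import numpy as np

def fft_causal_conv(x, k):
    """Causal convolution of x with a kernel k of the same length, via FFT."""
    L = x.shape[-1]
    n = 2 * L  # zero-pad so circular convolution equals linear convolution
    spectrum = np.fft.rfft(x, n=n) * np.fft.rfft(k, n=n)
    y = np.fft.irfft(spectrum, n=n)
    return y[..., :L]  # keep the first L outputs (the causal part)

rng = np.random.default_rng(0)
x = rng.standard_normal(256)
k = rng.standard_normal(256)
y_fft = fft_causal_conv(x, k)
y_direct = np.convolve(x, k)[:256]  # O(L^2) reference
print(np.allclose(y_fft, y_direct))
```

Zero-padding to 2L is what prevents the wrap-around of circular convolution from leaking later inputs into earlier outputs.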

This review was generated by AI and checked by a human editor.