QUICK REVIEW

[Paper Review] What Makes Convolutional Models Great on Long Sequence Modeling?

Yuhong Li, Tianle Cai | arXiv (Cornell University) | 2022-10-17

Speech Recognition and Synthesis | Citations: 20

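One-Line Summary

SGConv is a simple, efficient global convolution kernel built from multi-scale sub-kernels combined with decaying weights. It models long-range dependencies well, outperforms S4 on Long Range Arena, and is reported to be a more efficient and versatile drop-in module for language and vision models.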

ABSTRACT

Convolutional models have been widely used in multiple domains. However, most existing models only use local convolution, making the model unable to handle long-range dependency efficiently. Attention overcomes this problem by aggregating global information but also makes the computational complexity quadratic to the sequence length. Recently, Gu et al. [2021] proposed a model called S4 inspired by the state space model. S4 can be efficiently implemented as a global convolutional model whose kernel size equals the input sequence length. S4 can model much longer sequences than Transformers and achieve significant gains over SoTA on several long-range tasks. Despite its empirical success, S4 is involved. It requires sophisticated parameterization and initialization schemes. As a result, S4 is less intuitive and hard to use. Here we aim to demystify S4 and extract basic principles that contribute to the success of S4 as a global convolutional model. We focus on the structure of the convolution kernel and identify two critical but intuitive principles enjoyed by S4 that are sufficient to make up an effective global convolutional model: 1) The parameterization of the convolutional kernel needs to be efficient in the sense that the number of parameters should scale sub-linearly with sequence length. 2) The kernel needs to satisfy a decaying structure that the weights for convolving with closer neighbors are larger than the more distant ones. Based on the two principles, we propose a simple yet effective convolutional model called Structured Global Convolution (SGConv). SGConv exhibits strong empirical performance over several tasks: 1) With faster speed, SGConv surpasses S4 on Long Range Arena and Speech Command datasets. 2) When plugging SGConv into standard language and vision models, it shows the potential to improve both efficiency and performance.

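Motivation and Objectives

Identify the minimal principles behind S4's success in modeling long-range dependencies.
Propose a simpler, more intuitive global convolution kernel that preserves the ability to model long ranges.
Demonstrate SGConv's practical performance on long-range benchmarks and common downstream tasks.
Show that SGConv can be used as a general-purpose module in language and vision architectures.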

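Proposed Method

Define two design principles for global convolution: efficient parameterization (the number of parameters grows sub-linearly with sequence length) and a decaying kernel structure (closer neighbors receive larger weights).
Introduce SGConv, a structured global convolution whose kernel is built by upsampling multi-scale sub-kernels from a small, fixed parameter set and combining them with decaying weights; the convolution is computed with the FFT in O(L log L) time.
Present a concrete parameterization, Cat(S), that generates a kernel of length L from O(log L) parameters; it includes a normalization constant Z and a decay factor alpha.
Compare SGConv experimentally with S4 and baselines on Long Range Arena (LRA) and Speech Commands; run ablations on the decay speed t and the scale dimension d; evaluate SGConv as a drop-in module on language and vision tasks.
Demonstrate SGConv as a language modeling block and as a drop-in replacement inside ConvNeXt for image classification; analyze its speed and memory use against attention-based and S4 blocks.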
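
The kernel construction described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `sgconv_kernel`, the use of linear interpolation for upsampling, the doubling schedule of sub-kernel lengths, and the choice of the L2 norm for Z are all assumptions made here for illustration.

```python
import numpy as np

def sgconv_kernel(weights, alpha=0.5):
    """Build a 1-D global kernel from multi-scale sub-kernels (illustrative sketch).

    weights: array of shape (num_scales, d); row i holds the d parameters
             of sub-kernel i.
    alpha:   decay factor in (0, 1); sub-kernel i is scaled by alpha**i.
    """
    weights = np.asarray(weights, dtype=np.float64)
    num_scales, d = weights.shape
    src = np.linspace(0.0, 1.0, d)
    pieces = []
    for i in range(num_scales):
        # Assumed schedule: lengths d, d, 2d, 4d, ... so the total is d * 2**(num_scales - 1).
        length = d * 2 ** max(i - 1, 0)
        dst = np.linspace(0.0, 1.0, length)
        # Linear interpolation stands in for the upsampling step (assumption).
        upsampled = np.interp(dst, src, weights[i])
        # Later (more distant) sub-kernels get geometrically smaller weights.
        pieces.append(alpha ** i * upsampled)
    kernel = np.concatenate(pieces)
    # Normalization constant Z, taken here to be the L2 norm (assumption).
    return kernel / np.linalg.norm(kernel)

# 5 sub-kernels of 4 parameters each (20 parameters) yield a length-64 kernel.
kernel = sgconv_kernel(np.ones((5, 4)), alpha=0.5)
```

With this schedule, doubling the kernel length costs only one extra sub-kernel of d parameters, which is where the O(log L) parameter count comes from.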

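Experimental Results

Research Questions

RQ1: What are the minimal principles S4 needs in order to succeed at long-range sequence modeling?
RQ2: Can a simple, non-SSM global convolution kernel match or outperform S4?
RQ3: How do SGConv's parameter count and computational cost scale, and how does it perform on LRA, speech, language, and vision tasks?
RQ4: Can SGConv work as a general-purpose module in NLP and CV architectures?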

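Key Results

Guided by the two principles, SGConv outperforms S4 on the Long Range Arena and Speech Commands benchmarks while running faster.
SGConv achieves a higher average score on LRA (Table 1) and is competitive with SoTA on speech tasks, while keeping its computational cost below that of S4.
The simple SGConv kernel, built from upsampled multi-scale sub-kernels combined with decay, needs O(log L) parameters and O(L log L) computation via the FFT.
Replacing some of the Transformer attention layers in a language model with SGConv reduces complexity from O(L^2) to O(L log L) while maintaining performance in certain settings.
SGConvNeXt, which uses SGConv inside ConvNeXt, matches or exceeds SoTA models on ImageNet-1k in some configurations, showing that the approach transfers across domains.
Depending on sequence length and hardware (CPU/GPU), SGConv blocks were found to be faster than the optimized S4 kernel.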
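
To make the O(L log L) and O(log L) claims above concrete, here is a small NumPy sketch of an FFT-based causal convolution and of how the number of sub-kernels grows with sequence length. The helper names `fft_causal_conv` and `num_scales_for_length`, and the assumption that kernel length equals d * 2**(N - 1) for N sub-kernels, are illustrative choices rather than details taken from the paper.

```python
import numpy as np

def fft_causal_conv(x, kernel):
    """Causal convolution y[t] = sum_{s <= t} kernel[s] * x[t - s] via the FFT."""
    L = len(x)
    n = 2 * L  # zero-pad so the circular convolution equals the linear one
    x_f = np.fft.rfft(x, n=n)
    k_f = np.fft.rfft(kernel[:L], n=n)
    # Pointwise product in the frequency domain costs O(L); the FFTs cost O(L log L).
    return np.fft.irfft(x_f * k_f, n=n)[:L]

def num_scales_for_length(L, d):
    """Sub-kernels of size d needed to cover length L, assuming L = d * 2**(N - 1)."""
    return int(np.ceil(np.log2(L / d))) + 1

# A length-1024 kernel with d = 4 needs 9 sub-kernels, i.e. 36 parameters instead of 1024.
n_scales = num_scales_for_length(1024, 4)
n_params = n_scales * 4
```

The direct sum over all pairs (t, s) would cost O(L^2), which is the same order as attention; the FFT route is what keeps a kernel as long as the input affordable.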


This review was generated by AI and reviewed by a human editor.