QUICK REVIEW

[Paper Review] Long-Short Transformer: Efficient Transformers for Language and Vision

Chen Zhu, Wei Ping|arXiv (Cornell University)|Jul 5, 2021

Multimodal Machine Learning Applications|62 references|56 citations

TL;DR

Transformer-LS combines a dynamic low-rank long-range attention with a local sliding-window attention to achieve linear-time self-attention for long sequences in both language and vision, outperforming state-of-the-art efficient transformers on multiple tasks.

ABSTRACT

Transformers have achieved success in both language and vision domains. However, it is prohibitively expensive to scale them to long sequences such as long documents or high-resolution images, because the self-attention mechanism has quadratic time and memory complexity with respect to the input sequence length. In this paper, we propose Long-Short Transformer (Transformer-LS), an efficient self-attention mechanism for modeling long sequences with linear complexity for both language and vision tasks. It aggregates a novel long-range attention with dynamic projection to model distant correlations and a short-term attention to capture fine-grained local correlations. We propose a dual normalization strategy to account for the scale mismatch between the two attention mechanisms. Transformer-LS can be applied to both autoregressive and bidirectional models without additional complexity. Our method outperforms the state-of-the-art models on multiple tasks in language and vision domains, including the Long Range Arena benchmark, autoregressive language modeling, and ImageNet classification. For instance, Transformer-LS achieves 0.97 test BPC on enwik8 using half the number of parameters of the previous method, while being faster and able to handle sequences 3x as long as its full-attention version on the same hardware. On ImageNet, it can obtain state-of-the-art results (e.g., a moderately sized 55.8M-parameter model trained solely on 224x224 ImageNet-1K can obtain Top-1 accuracy 84.1%), while being more scalable on high-resolution images. The source code and models are released at https://github.com/NVIDIA/transformer-ls.

Motivation & Objective

Motivate the need for scalable Transformers that handle long language sequences and high-resolution vision inputs.
Propose a unified Long-Short Transformer (Transformer-LS) that combines long-range dynamic projection attention with short-term local window attention.
Introduce DualLN to address scale mismatch between long-range and short-term components.
Demonstrate state-of-the-art performance and efficiency on language and vision benchmarks.
Provide implementation details and show robustness and scalability across tasks.

Proposed method

Introduce a dual-attention scheme that aggregates a dynamic low-rank long-range attention with a local window short-term attention.
Define a dynamic projection P_i, computed from K, that projects K and V into low-rank bar{K}_i = P_i^T K W_i^K and bar{V}_i = P_i^T V W_i^V with r rows each, at O(rn) cost.
Compute long-range attention as bar{H}_i = A_i (P_i^T V W_i^V), where A_i = softmax(Q W_i^Q bar{K}_i^T / sqrt(d_k)).
Aggregate long-range and short-term attentions per head by attending to [tilde{K}_t; bar{K}_i] and [tilde{V}_t; bar{V}_i], with a DualLN scheme to align norms.
Apply efficient attention to both autoregressive and bidirectional models with linear-time/space complexity.
Demonstrate robustness of Dynamic Projection to sequence-length variance and perturbations.
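The method steps above can be sketched in NumPy as a single bidirectional head. This is a minimal illustration under stated assumptions, not the released implementation: the function names (`long_short_head`, `layer_norm`, `softmax`), the projection weight `Wp`, and the per-token loop are hypothetical choices made for readability. It assumes the dynamic projection is a softmax over the sequence dimension, that the window spans w tokens on each side, and that DualLN amounts to separate layer normalizations (here without learnable gain or bias) on the local and the projected keys and values.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # LayerNorm over the feature axis, without learnable parameters (simplification).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def long_short_head(X, Wq, Wk, Wv, Wp, w):
    """One bidirectional head. X: (n, d); Wq, Wk, Wv: (d, d_k); Wp: (d, r)."""
    n, d_k = X.shape[0], Wq.shape[1]
    Q, K, V = X @ Wq, X @ Wk, X @ Wv          # each (n, d_k)

    # Dynamic projection: P depends on the input; softmax over the n positions (assumed).
    P = softmax(X @ Wp, axis=0)               # (n, r)

    # DualLN (assumed form): separate normalization for the two branches
    # so that local and projected keys/values have comparable scale.
    K_bar, V_bar = layer_norm(P.T @ K), layer_norm(P.T @ V)   # (r, d_k)
    K_loc, V_loc = layer_norm(K), layer_norm(V)               # (n, d_k)

    out = np.empty((n, d_k))
    for t in range(n):
        lo, hi = max(0, t - w), min(n, t + w + 1)             # local window around t
        keys = np.concatenate([K_loc[lo:hi], K_bar], axis=0)  # [local; long-range]
        vals = np.concatenate([V_loc[lo:hi], V_bar], axis=0)
        scores = softmax(Q[t] @ keys.T / np.sqrt(d_k))        # one softmax over both
        out[t] = scores @ vals
    return out

rng = np.random.default_rng(0)
n, d, d_k, r, w = 64, 32, 16, 8, 4
X = rng.standard_normal((n, d))
Wq, Wk, Wv = (rng.standard_normal((d, d_k)) / np.sqrt(d) for _ in range(3))
Wp = rng.standard_normal((d, r)) / np.sqrt(d)
H = long_short_head(X, Wq, Wk, Wv, Wp, w)
print(H.shape)  # (64, 16)
```

Each query scores at most 2w + 1 local keys plus r projected keys, so the work in this sketch grows linearly with n for fixed w and r. An autoregressive variant would need causal handling of both the window and the projection, which is omitted here.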

Experimental results

Research questions

RQ1: Can a unified Long-Short Transformer achieve linear-time self-attention while preserving or improving performance on long-range language and high-resolution vision tasks?
RQ2: Does combining dynamic long-range projections with local window attention outperform prior efficient Transformer approaches in diverse settings (LRA, IMDb, enwik8, text8, ImageNet)?
RQ3: Is the proposed DualLN normalization effective in mitigating scale mismatch between long-range and short-term attentions?
RQ4: How does Transformer-LS perform in autoregressive vs bidirectional modeling across language and vision benchmarks?
RQ5: What is the impact of the proposed attention aggregation on robustness to input perturbations (insertions/deletions) and variable sequence lengths?

Key findings

Transformer-LS achieves state-of-the-art results on Long Range Arena benchmarks among efficient Transformers.
In autoregressive language modeling, Transformer-LS attains 0.97 test BPC on enwik8 with half the parameters of prior methods and handles sequences up to 3× longer than full-attention baselines on the same hardware.
In vision tasks, Transformer-LS-based CvT and ViL variants attain competitive or state-of-the-art ImageNet results with reduced or comparable FLOPs.
DualLN alignment significantly improves optimization and validation loss compared to models without DualLN.
Dynamic Projection demonstrates robustness to insertion/deletion perturbations and provides superior performance over fixed Linformer-like projections.
Across tasks, tuning the local window size w and the projection rank r lets Transformer-LS reach favorable trade-offs between accuracy, FLOPs, and supported sequence length.
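To make the efficiency trade-off concrete, the following back-of-the-envelope sketch counts query-key scores per head. It assumes, as a simplification, that each query scores about w local keys (w is the local window size) and r projected keys (r is the projection rank). The function name `attention_scores` and the example sizes are illustrative and are not taken from the paper.

```python
def attention_scores(n, w=None, r=None):
    """Approximate number of query-key scores per head for sequence length n.

    Full attention scores every pair (n * n). The long-short scheme is
    assumed here to score about w local keys and r projected keys per query.
    """
    if w is None or r is None:
        return n * n
    return n * (min(w, n) + r)

full = attention_scores(8192)
long_short = attention_scores(8192, w=512, r=256)
print(full, long_short)  # 67108864 6291456
```

Under these assumptions, doubling n doubles the long-short count but quadruples the full-attention count, which is why fixed w and r leave room for longer sequences on the same hardware.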

This review was created by AI and reviewed by human editors.