[Paper Review] DiSAN: Directional Self-Attention Network for RNN/CNN-Free Language Understanding
DiSAN introduces directional and multi-dimensional self-attention to encode sentences without RNNs or CNNs, achieving state-of-the-art results on SNLI, SST, MultiNLI, SICK, and other benchmarks while improving efficiency.
Recurrent neural nets (RNN) and convolutional neural nets (CNN) are widely used on NLP tasks to capture the long-term and local dependencies, respectively. Attention mechanisms have recently attracted enormous interest due to their highly parallelizable computation, significantly less training time, and flexibility in modeling dependencies. We propose a novel attention mechanism in which the attention between elements from input sequence(s) is directional and multi-dimensional (i.e., feature-wise). A light-weight neural net, "Directional Self-Attention Network (DiSAN)", is then proposed to learn sentence embedding, based solely on the proposed attention without any RNN/CNN structure. DiSAN is only composed of a directional self-attention with temporal order encoded, followed by a multi-dimensional attention that compresses the sequence into a vector representation. Despite its simple form, DiSAN outperforms complicated RNN models on both prediction quality and time efficiency. It achieves the best test accuracy among all sentence encoding methods and improves the most recent best result by 1.02% on the Stanford Natural Language Inference (SNLI) dataset, and shows state-of-the-art test accuracy on the Stanford Sentiment Treebank (SST), Multi-Genre natural language inference (MultiNLI), Sentences Involving Compositional Knowledge (SICK), Customer Review, MPQA, TREC question-type classification and Subjectivity (SUBJ) datasets.
Motivation & Objective
- Motivate a unified, RNN/CNN-free attention model for diverse NLP tasks beyond seq2seq applications.
- Propose directional and multi-dimensional self-attention to preserve temporal order and feature-wise dependencies.
- Build a lightweight DiSAN that encodes sentences via forward/backward directional self-attention and a multi-dimensional source2token attention to produce a single vector.
- Demonstrate that DiSAN achieves superior accuracy and efficiency on SNLI, SST, MultiNLI, SICK, and other datasets.
Proposed method
- Introduce multi-dimensional attention that computes feature-wise scores rather than a single scalar score for each token.
- Extend multi-dimensional attention to token2token and source2token variants for self-attention.
- Develop Directional Self-Attention (DiSA) with masked token2token self-attention and a fusion gate to combine input and context.
- Construct the DiSAN architecture by applying forward and backward DiSA blocks, concatenating their outputs, and using a multi-dimensional source2token attention to produce the final sentence vector.
- Use masks (diag-disabled, forward, backward) to encode temporal order and directional information in attention.
- Train with cross-entropy loss plus L2 regularization, Adadelta optimizer, Glorot initialization, 300D GloVe embeddings, dropout, and task-specific classifiers.
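The masking and feature-wise weighting described in the list above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the convention that `mask[i][j]` is 0 when token `i` may contribute to the output at position `j`, and it takes the alignment scores as plain nested lists, whereas in the paper they come from learned functions. The names `make_mask`, `masked_multi_dim_attention`, `source2token`, and `NEG` are hypothetical, and the fusion gate is omitted. `NEG` is a large finite stand-in for negative infinity, so a fully masked position (for example the first token under the forward mask) falls back to uniform weights in this sketch.

```python
import math

# Assumption: a large finite value replaces -inf so fully masked columns stay well defined.
NEG = -1e30

def make_mask(n, kind):
    """Return an n x n mask: 0.0 where token i may contribute to position j, else NEG."""
    allowed = {
        "diag": lambda i, j: i != j,      # diag-disabled: a token ignores itself
        "forward": lambda i, j: i < j,    # only earlier tokens contribute to j
        "backward": lambda i, j: i > j,   # only later tokens contribute to j
    }[kind]
    return [[0.0 if allowed(i, j) else NEG for j in range(n)] for i in range(n)]

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

def masked_multi_dim_attention(h, scores, mask):
    """Masked token2token attention with one score per feature.

    h: list of n vectors of size d.
    scores[i][j]: list of d feature-wise scores for the pair (i, j).
    Returns s with s[j][k] = sum_i P[i][j][k] * h[i][k], where P is a
    softmax over i computed separately for every position j and feature k.
    """
    n, d = len(h), len(h[0])
    s = []
    for j in range(n):
        out = []
        for k in range(d):
            p = softmax([scores[i][j][k] + mask[i][j] for i in range(n)])
            out.append(sum(p[i] * h[i][k] for i in range(n)))
        s.append(out)
    return s

def source2token(h, scores):
    """Compress n vectors into one: per-feature softmax over tokens, then a weighted sum.

    scores[i]: list of d feature-wise scores for token i.
    """
    n, d = len(h), len(h[0])
    weights = [softmax([scores[i][k] for i in range(n)]) for k in range(d)]
    return [sum(weights[k][i] * h[i][k] for i in range(n)) for k in range(d)]

if __name__ == "__main__":
    h = [[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]]                      # 3 tokens, 2 features
    zero_pair = [[[0.0, 0.0] for _ in range(3)] for _ in range(3)]   # placeholder scores
    fw = masked_multi_dim_attention(h, zero_pair, make_mask(3, "forward"))
    bw = masked_multi_dim_attention(h, zero_pair, make_mask(3, "backward"))
    # Concatenate forward and backward context per token, then compress to one vector.
    u = [f + b for f, b in zip(fw, bw)]
    sentence_vec = source2token(u, [[0.0] * 4 for _ in range(3)])
    print(sentence_vec)
```

With all-zero placeholder scores, each position simply averages the tokens its mask allows, which makes the effect of the forward, backward, and diag-disabled masks easy to inspect. Supplying scores that differ across features shows the multi-dimensional aspect: each feature of the output can weight the context tokens differently.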
Experimental results
Research questions
- RQ1: Can an attention-only model without recurrence or convolution achieve competitive or superior performance on standard NLP benchmarks?
- RQ2: Do directional (ordered) and multi-dimensional (feature-wise) attentions improve sentence encoding over traditional attention mechanisms?
- RQ3: How does a lightweight DiSAN compare to RNN/CNN-based encoders in terms of accuracy and efficiency across tasks like NLI, sentiment, and classification?
- RQ4: What is the impact of forward vs. backward directional masks and their combination on context representation?
- RQ5: Can DiSAN generalize across multiple NLP tasks beyond natural language inference?
Key findings
- DiSAN achieves the highest test accuracy among sentence-encoding models on SNLI, improving on the previous best result by 1.02%.
- DiSAN shows state-of-the-art performance on SST, MultiNLI, SICK, Customer Review, MPQA, SUBJ, and TREC datasets.
- DiSAN uses fewer parameters (2.35M) and is significantly faster than many RNN/CNN baselines (e.g., 3× faster than Bi-LSTM on SNLI).
- The multi-dimensional and directional attention components each contribute substantial gains over baselines; the directional masks, which encode temporal order, improve performance.
- A DiSA-based block plus a multi-dimensional source2token attention can outperform Bi-LSTM encoders and even models with tree-structured architectures on several tasks.
This review was created by AI and reviewed by human editors.