[Paper Review] Dynamic Self-Attention : Computing Attention over Words Dynamically for Sentence Embedding
The paper introduces Dynamic Self-Attention (DSA), a self-attention mechanism with dynamic weight vectors inspired by capsule networks, achieving state-of-the-art SNLI results with few parameters and competitive SST results.
In this paper, we propose Dynamic Self-Attention (DSA), a new self-attention mechanism for sentence embedding. We design DSA by modifying dynamic routing in capsule network (Sabour et al., 2017) for natural language processing. DSA attends to informative words with a dynamic weight vector. We achieve new state-of-the-art results among sentence encoding methods in Stanford Natural Language Inference (SNLI) dataset with the least number of parameters, while showing comparative results in Stanford Sentiment Treebank (SST) dataset.
Motivation & Objective
- Motivate a flexible attention mechanism for sentence embeddings beyond static weight vectors.
- Adapt dynamic routing concepts to create dynamic self-attention weights.
- Show that DSA can achieve strong SNLI results with fewer parameters and efficient computation.
Proposed method
- Builds a CNN with Dense Connections to encode word representations.
- Implements Dynamic Self-Attention (DSA) by projecting word embeddings with shared matrices across words and iteratively refining a dynamic weight vector through a process inspired by dynamic routing.
- Concatenates multiple attentions z1,...,zm to form the final sentence embedding z.
- Replaces capsule-specific components (like squashing) with tanh for scalar neurons and uses a single vector per word for attention.
- Uses 600-dimensional representations for single DSA and 300-dimensional representations for multiple DSA, with Leaky ReLU activations and dropout for regularization.
- Trains with a cross-entropy loss on the SNLI and SST tasks, keeping the GloVe embeddings fixed during training.
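The attention step described in this list can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`dynamic_self_attention`, `multi_dsa`), the number of refinement iterations, the toy dimensions, and the random weights are all hypothetical, and the update rule follows the general dynamic-routing pattern (softmax over per-word logits, weighted sum, tanh, dot-product agreement), which may differ in detail from the paper. The CNN encoder, Leaky ReLU, and dropout are not modeled.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def project(W, b, x):
    """Affine map W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def dynamic_self_attention(words, W, b, n_iter=2):
    """One attention head: returns (z, attention weights used for z)."""
    if n_iter < 1:
        raise ValueError("n_iter must be at least 1")
    # The same W and b are applied to every word (shared projection).
    x_hat = [project(W, b, x) for x in words]
    dim = len(x_hat[0])
    q = [0.0] * len(words)  # one logit per word, refined each iteration
    for _ in range(n_iter):
        a = softmax(q)  # attention distribution over words
        s = [sum(a[j] * x_hat[j][k] for j in range(len(words))) for k in range(dim)]
        z = [math.tanh(v) for v in s]  # tanh in place of capsule squashing
        # Words that agree with the current z (large dot product) gain weight.
        q = [qj + sum(xk * zk for xk, zk in zip(xj, z)) for qj, xj in zip(q, x_hat)]
    return z, a


def multi_dsa(words, heads, n_iter=2):
    """Concatenate z_1, ..., z_m from m separately parameterised heads."""
    z = []
    for W, b in heads:
        z_i, _ = dynamic_self_attention(words, W, b, n_iter)
        z.extend(z_i)
    return z


# Toy usage: 5 words of dimension 4, m = 2 heads, each projecting to dimension 3.
random.seed(0)
d_in, d_out, m = 4, 3, 2
words = [[random.uniform(-1, 1) for _ in range(d_in)] for _ in range(5)]
heads = [
    ([[random.uniform(-1, 1) for _ in range(d_in)] for _ in range(d_out)], [0.0] * d_out)
    for _ in range(m)
]
z = multi_dsa(words, heads)
print(len(z))  # m * d_out = 6
```

In this sketch, the first iteration attends uniformly because all logits start at zero; later iterations shift weight toward words whose projections align with the current summary vector, which is what makes the weighting input-dependent.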
Experimental results
Research questions
- RQ1: Does a dynamic, input-dependent weight vector improve sentence embedding quality over static self-attention?
- RQ2: Can DSA achieve competitive or state-of-the-art performance on the SNLI and SST benchmarks with fewer parameters and faster training times?
- RQ3: How do the number of attentions (m) and the projection settings affect performance and efficiency?
Key findings
- Single DSA achieves state-of-the-art SNLI test accuracy of 86.8% with 2.1 million parameters.
- Multiple DSA improves SNLI performance further, with a notable relative gain over the baseline self-attention.
- On SST, single DSA achieves 88.5% accuracy on SST-2 and 50.6% on SST-5, which is competitive with prior sentence encoders.
- DSA outperforms several baselines in SNLI with reduced parameter counts and faster per-epoch training times (e.g., 135 s/epoch).
- The dynamic weight vectors exhibit diverse directions across sentences, illustrating adaptive attention.
This review was created by AI and reviewed by human editors.