QUICK REVIEW

[Paper Review] Dynamic Self-Attention : Computing Attention over Words Dynamically for Sentence Embedding

Deunsol Yoon, Dongbok Lee | arXiv (Cornell University) | Aug 22, 2018
Topic Modeling | 14 references | 40 citations
TL;DR

The paper introduces Dynamic Self-Attention (DSA), a self-attention mechanism with dynamic weight vectors inspired by capsule networks, achieving state-of-the-art SNLI results with few parameters and competitive SST results.

ABSTRACT

In this paper, we propose Dynamic Self-Attention (DSA), a new self-attention mechanism for sentence embedding. We design DSA by modifying dynamic routing in capsule network (Sabour et al., 2017) for natural language processing. DSA attends to informative words with a dynamic weight vector. We achieve new state-of-the-art results among sentence encoding methods in Stanford Natural Language Inference (SNLI) dataset with the least number of parameters, while showing comparative results in Stanford Sentiment Treebank (SST) dataset.

Motivation & Objective

  • Motivate a flexible attention mechanism for sentence embeddings beyond static weight vectors.
  • Adapt dynamic routing concepts to create dynamic self-attention weights.
  • Show that DSA can achieve strong SNLI results with fewer parameters and efficient computation.

Proposed method

  • Builds a CNN with Dense Connections to encode word representations.
  • Implements Dynamic Self-Attention (DSA) by projecting word embeddings with shared matrices across words and iteratively refining a dynamic weight vector through a process inspired by dynamic routing.
  • Concatenates multiple attentions z1,...,zm to form the final sentence embedding z.
  • Replaces capsule-specific components (like squashing) with tanh for scalar neurons and uses a single vector per word for attention.
  • Uses a 600-dimensional setting for single DSA and a 300-dimensional setting for multiple DSA, with Leaky ReLU activations and dropout for regularization.
  • Evaluates using cross-entropy on SNLI and SST tasks, with GloVe embeddings fixed during training.
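
The attention steps listed above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: this review does not spell out the update rule, so the sketch assumes a routing-style loop in which the attention logits start at zero, are normalized with softmax, produce a tanh-squashed weighted sum z, and are then incremented by the agreement (dot product) between z and each projected word. Applying Leaky ReLU to the shared projection is also an assumption, and all function and parameter names (dynamic_self_attention, multiple_dsa, n_iter, W, b) are hypothetical.

```python
import numpy as np


def leaky_relu(x, slope=0.01):
    # Leaky ReLU; the slope value is an assumption, not taken from the paper.
    return np.where(x > 0, x, slope * x)


def softmax(q):
    # Numerically stable softmax over a 1-D array of logits.
    e = np.exp(q - q.max())
    return e / e.sum()


def dynamic_self_attention(X, W, b, n_iter=2):
    """One DSA head (assumed form).

    X: (n_words, d_in) word representations, e.g. from the CNN encoder.
    W: (d_in, d_out) projection shared across all words; b: (d_out,) bias.
    Returns z (d_out,) and the attention weights used to build it (n_words,).
    """
    X_hat = leaky_relu(X @ W + b)      # same projection applied to every word
    q = np.zeros(X.shape[0])           # attention logits start at zero
    for _ in range(n_iter):
        a = softmax(q)                 # attention over words
        z = np.tanh(a @ X_hat)         # weighted sum, tanh in place of squashing
        q = q + X_hat @ z              # raise logits of words that agree with z
    return z, a


def multiple_dsa(X, Ws, bs, n_iter=2):
    # Concatenate the outputs z1,...,zm of m independent heads into z.
    heads = [dynamic_self_attention(X, W, b, n_iter)[0] for W, b in zip(Ws, bs)]
    return np.concatenate(heads)
```

In this sketch z acts as the dynamic weight vector: it is recomputed from the sentence at every iteration rather than learned as a fixed parameter, which is the contrast with static self-attention that the review draws. With n_iter=1 the attention is uniform, so the iterations are what make the weights input-dependent.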

Experimental results

Research questions

  • RQ1: Does a dynamic, input-dependent weight vector improve sentence embedding quality over static self-attention?
  • RQ2: Can DSA achieve competitive or state-of-the-art performance on SNLI and SST benchmarks with fewer parameters and faster training times?
  • RQ3: How do the number of attentions (m) and the projection settings affect performance and efficiency?

Key findings

  • Single DSA achieves state-of-the-art SNLI test accuracy of 86.8% with 2.1 million parameters.
  • Multiple DSA improves SNLI performance further, with a notable relative gain over the baseline self-attention.
  • On SST, single DSA achieves 88.5% accuracy on SST-2 and 50.6% on SST-5, showing competitive results.
  • DSA outperforms several baselines in SNLI with reduced parameter counts and faster per-epoch training times (e.g., 135 s/epoch).
  • The dynamic weight vectors exhibit diverse directions across sentences, illustrating adaptive attention.

This review was created by AI and reviewed by human editors.