QUICK REVIEW

[Paper Review] Dynamic Self-Attention : Computing Attention over Words Dynamically for Sentence Embedding

Deunsol Yoon, Dongbok Lee | arXiv (Cornell University) | Aug 22, 2018

Topic Modeling | 14 references | 40 citations

TL;DR

The paper introduces Dynamic Self-Attention (DSA), a self-attention mechanism with dynamic weight vectors inspired by capsule networks, achieving state-of-the-art SNLI results with few parameters and competitive SST results.

ABSTRACT

In this paper, we propose Dynamic Self-Attention (DSA), a new self-attention mechanism for sentence embedding. We design DSA by modifying dynamic routing in capsule network (Sabour et al., 2017) for natural language processing. DSA attends to informative words with a dynamic weight vector. We achieve new state-of-the-art results among sentence encoding methods in Stanford Natural Language Inference (SNLI) dataset with the least number of parameters, while showing comparative results in Stanford Sentiment Treebank (SST) dataset.

Motivation & Objective

Motivate a flexible attention mechanism for sentence embeddings beyond static weight vectors.
Adapt dynamic routing concepts to create dynamic self-attention weights.
Show that DSA can achieve strong SNLI results with fewer parameters and efficient computation.

Proposed method

Builds a CNN with Dense Connections to encode word representations.
Implements Dynamic Self-Attention (DSA) by projecting word embeddings with shared matrices across words and iteratively refining a dynamic weight vector through a process inspired by dynamic routing.
Concatenates the outputs of multiple attentions, z1, ..., zm, to form the final sentence embedding z.
Replaces capsule-specific components (like squashing) with tanh for scalar neurons and uses a single vector per word for attention.
Uses 600-d and 300-d settings for single and multiple DSA, respectively, with Leaky ReLU activations and dropout for regularization.
Evaluates using cross-entropy on SNLI and SST tasks, with GloVe embeddings fixed during training.
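
The review does not give the exact update equations, so the following is only a minimal sketch of the iterative refinement described above, modeled on dynamic routing. The function names, the tanh inside the shared projection, the zero-initialized attention logits, the dot-product agreement update, and the default of two iterations are all assumptions made for illustration, not details confirmed by the review.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def project(word, weight, bias):
    """Projection shared by every word: tanh(W x + b) (assumed form)."""
    return [
        math.tanh(sum(w * x for w, x in zip(row, word)) + b)
        for row, b in zip(weight, bias)
    ]


def dynamic_self_attention(words, weight, bias, iterations=2):
    """One attention: returns (z, attention weights used to build z)."""
    if iterations < 1:
        raise ValueError("iterations must be at least 1")
    projected = [project(w, weight, bias) for w in words]
    dim = len(projected[0])
    logits = [0.0] * len(words)  # uniform attention on the first pass
    for _ in range(iterations):
        attn = softmax(logits)
        # Attention-weighted sum of the projected words.
        pooled = [
            sum(a * p[d] for a, p in zip(attn, projected)) for d in range(dim)
        ]
        # tanh stands in for the capsule squashing function.
        z = [math.tanh(v) for v in pooled]
        # Agreement update: words aligned with z gain weight next round.
        logits = [
            q + sum(pd * zd for pd, zd in zip(p, z))
            for q, p in zip(logits, projected)
        ]
    return z, attn


def multi_dsa(words, heads, iterations=2):
    """Concatenate z_1, ..., z_m from m separately parameterized attentions."""
    z = []
    for weight, bias in heads:
        z_j, _ = dynamic_self_attention(words, weight, bias, iterations)
        z.extend(z_j)
    return z
```

Here `words` is a list of word vectors (for example, the CNN outputs), and each entry of `heads` is a hypothetical `(weight, bias)` pair for one attention. Because the logits are recomputed from the input on every call, the effective weight vector differs from sentence to sentence, which is the property the review contrasts with static self-attention.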

Experimental results

Research questions

RQ1: Does a dynamic, input-dependent weight vector improve sentence embedding quality over static self-attention?
RQ2: Can DSA achieve competitive or state-of-the-art performance on SNLI and SST benchmarks with fewer parameters and faster training times?
RQ3: How do the number of attentions (m) and the projection settings affect performance and efficiency?

Key findings

Single DSA achieves state-of-the-art SNLI test accuracy of 86.8% with 2.1 million parameters.
Multiple DSA improves SNLI performance further, with a notable relative gain over the baseline self-attention.
On SST, single DSA achieves 88.5% on SST-2 and 50.6% on SST-5, showing competitive results.
DSA outperforms several baselines in SNLI with reduced parameter counts and faster per-epoch training times (e.g., 135 s/epoch).
The dynamic weight vectors exhibit diverse directions across sentences, illustrating adaptive attention.


This review was created by AI and reviewed by human editors.