QUICK REVIEW

[Paper Review] Distance-based Self-Attention Network for Natural Language Inference

Jinbae Im, Sungzoon Cho|arXiv (Cornell University)|Dec 6, 2017
Topic Modeling|27 references|69 citations
TL;DR

Introduces a Distance-based Self-Attention Network that adds a distance mask to multi-head attention to capture local dependencies while preserving global context, achieving state-of-the-art on SNLI and strong results on MultiNLI.

ABSTRACT

Attention mechanism has been used as an ancillary means to help RNN or CNN. However, the Transformer (Vaswani et al., 2017) recently recorded the state-of-the-art performance in machine translation with a dramatic reduction in training time by solely using attention. Motivated by the Transformer, Directional Self Attention Network (Shen et al., 2017), a fully attention-based sentence encoder, was proposed. It showed good performance with various data by using forward and backward directional information in a sentence. But in their study, not considered at all was the distance between words, an important feature when learning the local dependency to help understand the context of input text. We propose Distance-based Self-Attention Network, which considers the word distance by using a simple distance mask in order to model the local dependency without losing the ability of modeling global dependency which attention has inherent. Our model shows good performance with NLI data, and it records the new state-of-the-art result with SNLI data. Additionally, we show that our model has a strength in long sentences or documents.

Motivation & Objective

  • Improve sentence encoders for natural language inference by capturing local word dependencies.
  • Incorporate word distance information into a fully attention-based encoder without sacrificing global context.
  • Evaluate the proposed distance-based attention on SNLI and MultiNLI datasets.
  • Provide analysis showing where and how the distance mask influences attention and performance.

Proposed method

  • Extend the Transformer-style attention with a distance mask to model relative word distances.
  • Incorporate a directional mask to encode forward and backward dependencies.
  • Introduce a fusion gate that combines projected word embeddings with masked attention outputs.
  • Use a position-wise feed-forward network with residual connections after the fusion stage.
  • Apply pooling via multi-dimensional self-attention and max pooling to obtain sentence representations.
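
The first two steps above can be illustrated with a minimal single-head sketch in NumPy. It is not the authors' implementation: the review does not spell out the mask formulas, so the sketch assumes a distance mask of the form `-alpha * |i - j|` added to the attention logits and a directional mask that blocks one side of each token with `-inf`. The names `distance_mask`, `directional_mask`, `masked_attention`, `alpha`, and `attend_to_past` are hypothetical, and letting each token attend to itself is a simplification chosen so that no row is fully masked.

```python
import numpy as np


def distance_mask(n, alpha=1.0):
    """Assumed form of the distance mask: entry (i, j) is -alpha * |i - j|."""
    idx = np.arange(n)
    return -alpha * np.abs(idx[:, None] - idx[None, :])


def directional_mask(n, attend_to_past=True):
    """Assumed directional mask: 0 where attention is allowed, -inf elsewhere.

    The diagonal is always allowed (a simplification) so every row keeps
    at least one finite logit.
    """
    idx = np.arange(n)
    if attend_to_past:
        allowed = idx[None, :] <= idx[:, None]  # key j at or before query i
    else:
        allowed = idx[None, :] >= idx[:, None]  # key j at or after query i
    return np.where(allowed, 0.0, -np.inf)


def masked_attention(q, k, v, mask):
    """Scaled dot-product attention with an additive mask on the logits."""
    d_k = q.shape[-1]
    logits = q @ k.T / np.sqrt(d_k) + mask
    # Subtract the row maximum for numerical stability before exponentiating.
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


# Toy input: 6 tokens with 8-dimensional embeddings.
rng = np.random.default_rng(0)
n, d = 6, 8
x = rng.normal(size=(n, d))

# Combine both masks by addition; nearby, allowed positions are penalized least.
mask = directional_mask(n, attend_to_past=True) + distance_mask(n, alpha=0.5)
out, w = masked_attention(x, x, x, mask)
```

Because the distance term is a soft penalty rather than a hard cutoff, distant tokens still receive nonzero weight when their content scores are high, which is one way to read the claim that local dependencies are modeled without giving up global context. In this sketch, setting `alpha` to 0 removes the distance term entirely.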

Experimental results

Research questions

  • RQ1: Does adding a distance mask to self-attention improve natural language inference performance compared to prior fully attention-based encoders?
  • RQ2: How does the distance mask affect attention patterns for long versus short sentences?
  • RQ3: What is the impact of the distance mask on SNLI and MultiNLI benchmarks?
  • RQ4: How does the proposed model balance local dependency capture with global contextual modeling?

Key findings

  • The distance mask yields state-of-the-art results on SNLI when used with a fully attention-based encoder.
  • The distance mask particularly improves performance on longer sentences, with greater gains as average sentence length increases.
  • Ablation shows that including the distance mask improves accuracy without increasing model size or training time significantly.
  • On MultiNLI, the model is competitive, offering strong accuracy with a relatively simple inference layer compared to deeper LSTM-based models.

This review was created by AI and reviewed by human editors.