QUICK REVIEW

[Paper Review] Augmenting Data with Mixup for Sentence Classification: An Empirical Study

Hongyu Guo, Yongyi Mao | arXiv (Cornell University) | May 22, 2019
Topic Modeling | 10 references | 146 citations
TL;DR

The paper adapts Mixup data augmentation to NLP by performing interpolation on word embeddings (wordMixup) and on sentence embeddings (senMixup), showing improved accuracy for CNN and LSTM on multiple sentence classification tasks.

ABSTRACT

Mixup, a recently proposed data augmentation method that linearly interpolates the inputs and modeling targets of random samples, has demonstrated its capability of significantly improving the predictive accuracy of state-of-the-art networks for image classification. However, how this technique can be applied to natural language processing (NLP) tasks, and how effective it is there, has not been investigated. In this paper, we propose two strategies for the adaptation of Mixup to sentence classification: one performs interpolation on word embeddings and the other on sentence embeddings. We conduct experiments to evaluate our methods using several benchmark datasets. Our studies show that such interpolation strategies serve as an effective, domain-independent data augmentation approach for sentence classification, and can result in significant accuracy improvement for both CNN and LSTM models.

Motivation & Objective

  • Motivate data augmentation to combat data hunger in NLP without relying on label-invariant text transformations.
  • Propose two Mixup adaptations for sentences: word-level interpolation in embedding space and sentence-level interpolation in hidden representations.
  • Empirically evaluate the proposed methods on multiple CNN and LSTM architectures across standard NLP benchmarks.
  • Assess whether Mixup acts as a domain-independent regularizer for sentence classification and analyze embedding-tuning effects.

Proposed method

  • Adapts Mixup by linearly interpolating inputs and targets:
      - wordMixup performs interpolation across word embeddings for each token in a sentence.
      - senMixup interpolates between final hidden-layer sentence representations produced by CNN or LSTM.
  • The mixing ratio lambda is drawn from a Beta(alpha, alpha) distribution, with alpha defaulting to 1.
  • Labels are mixed as y-tilde = lambda y_i + (1 - lambda) y_j.
  • Applies to standard CNN (Kim 2014) or LSTM models with a final softmax/logistic regression classifier for prediction.
  • Evaluates under four embedding settings: RandomTune, RandomFix, PretrainTune, PretrainFix.
  • Trains with Adam optimizer; uses 20000 steps per run; reports mean accuracy over 10 runs with standard deviations.
  • Uses five benchmark datasets: TREC, MR, SST-1, SST-2, Subj; compares against baseline CNN/LSTM and variants with wordMixup/senMixup.
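
The interpolation steps listed above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the authors' implementation: the function names (`sample_lambda`, `mix_vectors`, `word_mixup`, `sen_mixup`) are hypothetical, sentences are assumed to be already zero-padded to the same length, and labels are assumed to be one-hot vectors.

```python
import random


def sample_lambda(alpha=1.0, rng=random):
    """Draw the mixing ratio lambda from Beta(alpha, alpha)."""
    return rng.betavariate(alpha, alpha)


def mix_vectors(a, b, lam):
    """Element-wise lam * a + (1 - lam) * b for two equal-length vectors."""
    return [lam * x + (1.0 - lam) * y for x, y in zip(a, b)]


def word_mixup(sent_i, sent_j, y_i, y_j, lam):
    """Interpolate two sentences position by position in embedding space.

    sent_i and sent_j are lists of word-embedding vectors (assumed to be
    padded to the same length); y_i and y_j are one-hot label vectors.
    """
    mixed_sentence = [mix_vectors(w_i, w_j, lam) for w_i, w_j in zip(sent_i, sent_j)]
    mixed_label = mix_vectors(y_i, y_j, lam)
    return mixed_sentence, mixed_label


def sen_mixup(h_i, h_j, y_i, y_j, lam):
    """Interpolate two sentence representations (encoder outputs) and their labels."""
    return mix_vectors(h_i, h_j, lam), mix_vectors(y_i, y_j, lam)


# Toy usage: two 2-token sentences with 2-dimensional embeddings.
sent_a = [[1.0, 0.0], [0.0, 1.0]]
sent_b = [[0.0, 0.0], [1.0, 1.0]]
mixed, label = word_mixup(sent_a, sent_b, [1.0, 0.0], [0.0, 1.0], lam=0.25)
# mixed == [[0.25, 0.0], [0.75, 1.0]], label == [0.25, 0.75]
```

In training, `lam` would typically come from `sample_lambda(alpha)` for each sampled pair. With word-level mixing the interpolated embeddings are fed to the CNN or LSTM encoder, whereas with sentence-level mixing the encoder runs on each sentence first and only its outputs are mixed before the classifier.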

Experimental results

Research questions

  • RQ1: Can Mixup-inspired interpolation be effectively applied to natural language sentence classification tasks?
  • RQ2: Do word-level and sentence-level Mixup provide regularization benefits across CNN and LSTM architectures?
  • RQ3: How does embedding initialization and tunability (random vs pre-trained) influence Mixup effectiveness?
  • RQ4: Is the performance gain consistent across multiple datasets including SST-2 and SST-1?
  • RQ5: What is the impact of Mixup on training dynamics and regularization compared to traditional dropout/L2 penalties?

Key findings

  • WordMixup and senMixup improve CNN performance on all five datasets under the RandomTune setting, with notable gains on SST-1 and MR (over 3% relative).
  • On SST-2, Mixup benefits are limited and sometimes negligible when embeddings are trainable; with fixed embeddings, effects vary and can be neutral or negative.
  • LSTM with wordMixup/senMixup also shows improvements on several datasets, including substantial gains on TREC and SST-1 (4.6% and 5.2% relative, respectively).
  • When using pre-trained embeddings with tuning, Mixup variants generally maintain or improve accuracy (e.g., SST-1, SST-2, MR).
  • Mixup acts as a regularizer, as evidenced by the training loss staying above zero for the Mixup methods, in contrast to the rapid drop in training loss for the baseline CNN without Mixup.
  • Across settings, Mixup is described as domain-independent, low-cost data augmentation that helps mitigate overfitting in sentence classification.
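
One plausible way to read the "training loss staying above zero" observation is that cross-entropy against a mixed (soft) label cannot fall below the entropy of that label, which is positive whenever lambda lies strictly between 0 and 1. The sketch below illustrates this general property of cross-entropy; it is not taken from the paper's analysis, and the helper name `cross_entropy` and the value `lam = 0.7` are illustrative assumptions.

```python
import math


def cross_entropy(target, probs):
    """Cross-entropy -sum_k target_k * log(probs_k), skipping zero-weight classes."""
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0.0)


lam = 0.7
mixed_target = [lam, 1.0 - lam]

# Best case for a mixed target: the model predicts the target distribution itself.
# The loss then equals the entropy of the target, roughly 0.61 for lam = 0.7.
floor_mixed = cross_entropy(mixed_target, mixed_target)

# A one-hot target, by contrast, can be fit with a loss approaching zero.
near_zero = cross_entropy([1.0, 0.0], [0.999999, 0.000001])
```

Any other prediction, such as `[0.5, 0.5]`, gives a loss at least as large as `floor_mixed`, so a model trained on mixed labels cannot drive the training loss to zero the way a baseline trained on one-hot labels can.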

This review was created by AI and reviewed by human editors.