QUICK REVIEW

[Paper Review] A Simple but Tough-to-Beat Data Augmentation Approach for Natural Language Understanding and Generation

Dinghan Shen, Mingzhi Zheng|arXiv (Cornell University)|Sep 29, 2020

Topic Modeling|45 references|94 citations

TL;DR

The paper introduces Cutoff, a simple data augmentation method that erases parts of input embeddings to create partial views, coupled with a Jensen-Shannon divergence consistency loss, achieving competitive or state-of-the-art results on GLUE and machine translation with lower overhead than adversarial training.

ABSTRACT

Adversarial training has been shown effective at endowing the learned representations with stronger generalization ability. However, it typically requires expensive computation to determine the direction of the injected perturbations. In this paper, we introduce a set of simple yet effective data augmentation strategies dubbed cutoff, where part of the information within an input sentence is erased to yield its restricted views (during the fine-tuning stage). Notably, this process relies merely on stochastic sampling and thus adds little computational overhead. A Jensen-Shannon Divergence consistency loss is further utilized to incorporate these augmented samples into the training objective in a principled manner. To verify the effectiveness of the proposed strategies, we apply cutoff to both natural language understanding and generation problems. On the GLUE benchmark, it is demonstrated that cutoff, in spite of its simplicity, performs on par or better than several competitive adversarial-based approaches. We further extend cutoff to machine translation and observe significant gains in BLEU scores (based upon the Transformer Base model). Moreover, cutoff consistently outperforms adversarial training and achieves state-of-the-art results on the IWSLT2014 German-English dataset.

Motivation & Objective

Make fine-tuning of large pre-trained language models more robust by improving generalization without heavy computational cost.
Develop simple, structured augmentation strategies that erase information at the input embedding level.
Integrate augmented samples via a principled consistency objective to improve predictions across views.
Demonstrate effectiveness on natural language understanding benchmarks and machine translation tasks.

Proposed method

Propose Cutoff to create partial views by erasing: token cutoff (zeroing token embeddings), feature cutoff (zeroing embedding dimensions), and span cutoff (zeroing a contiguous span).
Use a Jensen-Shannon divergence consistency loss to align predictions across original and multiple augmented views.
Combine cross-entropy losses on augmented samples with a JS-divergence term in the training objective.
Extend the approach to conditional text generation by augmenting both inputs and outputs.
Compare computational overhead against adversarial training, highlighting fewer backward passes required.
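
The three cutoff variants can be illustrated with a minimal sketch. It treats one sentence as a list of token embedding rows and uses only the Python standard library. The function names, the `ratio` parameter, and the `rng` argument are assumptions made for illustration and are not taken from the paper; a real implementation would operate on batched tensors inside the model's embedding layer.

```python
import random


def token_cutoff(emb, ratio, rng):
    """Zero the full embedding of randomly chosen tokens (rows)."""
    n = len(emb)
    k = int(n * ratio)  # number of tokens to erase
    drop = set(rng.sample(range(n), k))
    return [[0.0] * len(row) if i in drop else list(row)
            for i, row in enumerate(emb)]


def feature_cutoff(emb, ratio, rng):
    """Zero randomly chosen embedding dimensions (columns) for every token."""
    d = len(emb[0])
    k = int(d * ratio)  # number of dimensions to erase
    drop = set(rng.sample(range(d), k))
    return [[0.0 if j in drop else v for j, v in enumerate(row)]
            for row in emb]


def span_cutoff(emb, ratio, rng):
    """Zero one contiguous span of tokens with a random start position."""
    n = len(emb)
    k = int(n * ratio)  # span length in tokens
    start = rng.randint(0, n - k)  # inclusive bounds, span always fits
    return [[0.0] * len(row) if start <= i < start + k else list(row)
            for i, row in enumerate(emb)]


# Toy input: 10 tokens, embedding size 4 (values are arbitrary).
sentence = [[float(i + 1)] * 4 for i in range(10)]
sampler = random.Random(0)
views = [
    token_cutoff(sentence, 0.2, sampler),
    feature_cutoff(sentence, 0.5, sampler),
    span_cutoff(sentence, 0.2, sampler),
]
```

Each call returns a new partial view and leaves the input untouched, so several views of the same sentence can be sampled in one training step. Only random sampling is involved, which is consistent with the low overhead the paper claims.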

Experimental results

Research questions

RQ1: Does the Cutoff augmentation improve generalization on NLU tasks compared to adversarial methods and other data augmentation techniques?
RQ2: Can Cutoff be effectively extended to neural machine translation and yield state-of-the-art results?
RQ3: What is the impact of different cutoff types and augmentation strength on performance?
RQ4: Does incorporating a JS-divergence consistency loss provide additional gains over standard CE losses?
RQ5: Is Cutoff computationally more efficient than typical adversarial training approaches?

Key findings

Cutoff variants consistently outperform ALUM on RoBERTa-base and RoBERTa-large baselines on the GLUE dev sets.
Span cutoff often yields the strongest performance across GLUE tasks.
In machine translation, Cutoff with JS loss achieves higher BLEU scores than several adversarial baselines on WMT14 English-German and IWSLT2014 German-English.
Token cutoff achieves the best BLEU on WMT14 English-German among Cutoff variants; with JS loss, overall BLEU improves further.
The JS-divergence loss generally improves MNLI dev accuracy, with a JS-term weight (beta) of about 1.0 giving the best results in ablations.
Cutoff requires no extra backward passes and introduces modest forward-time overhead, making it more efficient than many adversarial methods.
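
The role of beta in the findings above can be made concrete with a small sketch of the training objective. It assumes the loss has the form CE(original) + alpha * sum of CE(augmented views) + beta * JS, where JS is the generalized Jensen-Shannon divergence computed as the mean KL divergence from each view's prediction to the average prediction. The function names, the `alpha` weight, and the averaging over views are assumptions for illustration; the paper's exact normalization may differ.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Negative log-likelihood of the gold label."""
    return -math.log(probs[label])


def kl_divergence(p, q):
    """KL(p || q); terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_consistency(dists):
    """Generalized JS divergence: mean KL from each view to the average."""
    avg = [sum(col) / len(dists) for col in zip(*dists)]
    return sum(kl_divergence(p, avg) for p in dists) / len(dists)


def cutoff_objective(logits_orig, logits_aug, label, alpha=1.0, beta=1.0):
    """Assumed form: CE(orig) + alpha * sum CE(aug) + beta * JS(all views)."""
    p_orig = softmax(logits_orig)
    p_aug = [softmax(z) for z in logits_aug]
    ce_term = cross_entropy(p_orig, label)
    ce_term += alpha * sum(cross_entropy(p, label) for p in p_aug)
    return ce_term + beta * js_consistency([p_orig] + p_aug)


# Toy 3-class example with one original and two augmented views.
loss = cutoff_objective(
    logits_orig=[2.0, 0.5, -1.0],
    logits_aug=[[1.5, 0.7, -0.8], [1.0, 1.0, -0.5]],
    label=0,
)
```

Setting `beta=0.0` recovers plain cross-entropy training on the original and augmented views, which is the comparison RQ4 asks about. The JS term is zero when all views produce the same prediction and grows as they disagree.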

This review was created by AI and reviewed by human editors.