QUICK REVIEW

[Paper Review] Not Enough Data? Deep Learning to the Rescue!

Ateret Anaby-Tavor, Boaz Carmeli | arXiv (Cornell University) | Nov 8, 2019
Topic Modeling | References: 13 | Cited by: 32
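One-Sentence Summary

LAMBADA fine-tunes the GPT-2 language model on a small labeled text dataset to generate labeled synthetic data, filters the generated sentences with a baseline classifier, and then retrains the classifier on the augmented data to improve text classification accuracy.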

ABSTRACT

Based on recent advances in natural language modeling and those in text generation capabilities, we propose a novel data augmentation method for text classification tasks. We use a powerful pre-trained neural network model to artificially synthesize new labeled data for supervised learning. We mainly focus on cases with scarce labeled data. Our method, referred to as language-model-based data augmentation (LAMBADA), involves fine-tuning a state-of-the-art language generator to a specific task through an initial training phase on the existing (usually small) labeled data. Using the fine-tuned model and given a class label, new sentences for the class are generated. Our process then filters these new sentences by using a classifier trained on the original data. In a series of experiments, we show that LAMBADA improves classifiers' performance on a variety of datasets. Moreover, LAMBADA significantly improves upon the state-of-the-art techniques for data augmentation, specifically those applicable to text classification tasks with little data.

Motivation and Objectives

  • Motivate the problem of scarce labeled data in text classification and the need for effective data augmentation.
  • Introduce LAMBADA, a language-model-based augmentation pipeline that synthesizes labeled sentences.
  • Show that LAMBADA improves classification accuracy and surpasses state-of-the-art augmentation methods on small datasets.
  • Demonstrate that LAMBADA can outperform baselines and alternative semi-supervised approaches when unlabeled data is unavailable.

Proposed Method

  • Fine-tune GPT-2 on the small labeled dataset D_train to create a task-adapted generator G_tuned.
  • Synthesize a labeled sentence set D* by prompting G_tuned with class labels and separators to generate sentences per class.
  • Filter D* using a baseline classifier h trained on D_train, keeping the top-N_y high-confidence examples per class to form D_synthesized.
  • Retrain the target classifier A on D_train ∪ D_synthesized to obtain an improved classifier.
  • Compare LAMBADA to other augmentation methods (EDA, CVAE, CBERT) and to baselines, using McNemar tests for statistical significance.
  • Note that LAMBADA does not require unlabeled data and can be iterated or adapted to zero-shot class scenarios.
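
The data-handling steps above (formatting records for fine-tuning, prompting per class, and confidence filtering) can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the `SEP` string, the record layout, the helper names, and the keyword-based `toy_confidence` scorer (standing in for the baseline classifier h) are all hypothetical, and fine-tuning and sampling from GPT-2 are omitted.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Set, Tuple

SEP = "[SEP]"            # assumed separator between label and sentence
EOS = "<|endoftext|>"    # GPT-2's end-of-text token

Example = Tuple[str, str]  # (label, sentence)


def to_finetune_text(d_train: List[Example]) -> str:
    """Join 'label SEP sentence EOS' records into one fine-tuning corpus."""
    return "\n".join(f"{y} {SEP} {x} {EOS}" for y, x in d_train)


def make_prompt(label: str) -> str:
    """Prompt asking the tuned generator for a new sentence of class `label`."""
    return f"{label} {SEP}"


def filter_top_n(
    d_star: List[Example],
    confidence: Callable[[str, str], float],
    n_per_class: int,
) -> List[Example]:
    """Keep the n highest-confidence generated sentences for each class."""
    by_class: Dict[str, List[Tuple[float, str]]] = defaultdict(list)
    for y, x in d_star:
        by_class[y].append((confidence(x, y), x))
    kept: List[Example] = []
    for y, scored in by_class.items():
        scored.sort(key=lambda pair: pair[0], reverse=True)
        kept.extend((y, x) for _, x in scored[:n_per_class])
    return kept


# Hypothetical stand-in for classifier h: fraction of words that are
# class keywords. A real h would be trained on D_train.
KEYWORDS: Dict[str, Set[str]] = {
    "flight": {"flight", "flights", "fly"},
    "fare": {"ticket", "price", "cost", "much"},
}


def toy_confidence(sentence: str, label: str) -> float:
    words = sentence.lower().split()
    hits = sum(w in KEYWORDS[label] for w in words)
    return hits / len(words) if words else 0.0


d_train: List[Example] = [
    ("flight", "show me flights to boston"),
    ("fare", "how much is a ticket to denver"),
]
# Pretend these came from sampling the tuned generator with make_prompt(label).
d_star: List[Example] = [
    ("flight", "list flights from dallas"),
    ("flight", "what is the price"),
    ("fare", "ticket price to miami"),
    ("fare", "flights on monday"),
]

d_synthesized = filter_top_n(d_star, toy_confidence, n_per_class=1)
d_augmented = d_train + d_synthesized  # D_train ∪ D_synthesized
```

In this toy run the filter drops the two generated sentences whose content does not match the label they were prompted with, which is the role the baseline classifier plays in the pipeline.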

Experimental Results

Research Questions

  • RQ1: Can LAMBADA improve text classification performance when training data per class is very small?
  • RQ2: How does LAMBADA compare to existing text augmentation methods across multiple classifiers and datasets?
  • RQ3: Is LAMBADA effective without leveraging unlabeled data, and how does it perform relative to semi-supervised approaches?
  • RQ4: Does LAMBADA provide benefits across different classifier families (e.g., BERT, SVM, LSTM) and datasets with varying characteristics?

Key Findings

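| Dataset | Classifier | Baseline Accuracy | LAMBADA Accuracy | Improvement (%) |
|---------|------------|-------------------|------------------|-----------------|
| ATIS    | BERT       | 53.3              | 75.7             | 58.5            |
| ATIS    | SVM        | 35.6              | 56.5             | 58.7            |
| ATIS    | LSTM       | 29.0              | 33.7             | 16.2            |
| TREC    | BERT       | 60.3              | 64.3             | 6.6             |
| TREC    | SVM        | 42.7              | 43.9             | 2.8             |
| TREC    | LSTM       | 17.7              | 25.8             | 45.0            |
| WVA     | BERT       | 67.2              | 68.6             | 2.1             |
| WVA     | SVM        | 60.2              | 62.9             | 4.5             |
| WVA     | LSTM       | 26.0              | 32.0             | 23.0            |
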
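
The summary does not define the improvement column. One plausible reading, stated here as an assumption, is the relative gain over the baseline accuracy. The short sketch below applies that formula to two rows; the function name is hypothetical. This reading reproduces rows such as ATIS SVM and ATIS LSTM, but other rows may follow a different calculation or rounding.

```python
def relative_improvement(baseline: float, augmented: float) -> float:
    """Relative gain of `augmented` over `baseline`, in percent (assumed formula)."""
    return (augmented - baseline) / baseline * 100.0


# Accuracies taken from the table above.
atis_svm = round(relative_improvement(35.6, 56.5), 1)   # 58.7
atis_lstm = round(relative_improvement(29.0, 33.7), 1)  # 16.2
print(atis_svm, atis_lstm)
```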
  • On ATIS with five samples per class, LAMBADA substantially improves all classifiers (BERT, SVM, LSTM) over the baseline and outperforms other augmentation methods (statistically significant, p<0.01).
  • Across three datasets (ATIS, TREC, WVA) and three classifiers, LAMBADA yields higher accuracy than the baselines for all combinations, with notable gains on ATIS especially for BERT and SVM.
  • Table 4 shows per-classifier gains when comparing Baseline vs. LAMBADA: ATIS (BERT 53.3 → 75.7; improvement 58.5%), ATIS (SVM 35.6 → 56.5; 58.7%), ATIS (LSTM 29.0 → 33.7; 16.2%), TREC (BERT 60.3 → 64.3; 6.6%), TREC (SVM 42.7 → 43.9; 2.8%), TREC (LSTM 17.7 → 25.8; 45.0%), WVA (BERT 67.2 → 68.6; 2.1%), WVA (SVM 60.2 → 62.9; 4.5%), WVA (LSTM 26.0 → 32.0; 23.0%).
  • LAMBADA outperforms EDA, CVAE, and CBERT across ATIS, TREC, and WVA for all classifiers in Table 5 (McNemar p<0.01).
  • Compared to a weak-labeling semi-supervised baseline, LAMBADA with GPT-2 labeling yields higher accuracy on ATIS for BERT and SVM, demonstrating the value of synthesized labeled data when unlabeled data is limited.
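
The significance claims above rest on McNemar's test, which compares two classifiers on the same test set using only the examples where exactly one of them is correct. The summary does not say which variant the authors used; the sketch below assumes the exact (binomial) two-sided form, with hypothetical function names and made-up predictions for illustration.

```python
from math import comb
from typing import Sequence, Tuple


def discordant_counts(
    y_true: Sequence[int], pred_a: Sequence[int], pred_b: Sequence[int]
) -> Tuple[int, int]:
    """Return (b, c): b = only A correct, c = only B correct."""
    b = c = 0
    for t, a, bb in zip(y_true, pred_a, pred_b):
        if a == t and bb != t:
            b += 1
        elif a != t and bb == t:
            c += 1
    return b, c


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant counts."""
    n = b + c
    if n == 0:
        return 1.0  # the classifiers never disagree on correctness
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Illustrative predictions only, not data from the paper.
y_true = [1] * 10
baseline_pred = [1, 1] + [0] * 8   # baseline gets 2 of 10 right
augmented_pred = [1] * 10          # augmented model gets all 10 right

b, c = discordant_counts(y_true, baseline_pred, augmented_pred)
p_value = mcnemar_exact(b, c)
print(b, c, p_value)  # 0 8 0.0078125
```

With eight discordant examples all favoring the augmented model, the p-value falls below 0.01; with balanced disagreements (for example b = c = 2) it is 1.0.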


This review was generated by AI and reviewed by human editors.