QUICK REVIEW

[Paper Review] Not Enough Data? Deep Learning to the Rescue!

Ateret Anaby-Tavor, Boaz Carmeli | arXiv (Cornell University) | Nov 8, 2019

Topic Modeling | References: 13 | Cited by: 32

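One-Sentence Summary

LAMBADA fine-tunes the GPT-2 language model on a small labeled text dataset to generate labeled synthetic data; the synthetic data is filtered with a baseline classifier and then used for retraining, improving text classification accuracy.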

ABSTRACT

Based on recent advances in natural language modeling and those in text generation capabilities, we propose a novel data augmentation method for text classification tasks. We use a powerful pre-trained neural network model to artificially synthesize new labeled data for supervised learning. We mainly focus on cases with scarce labeled data. Our method, referred to as language-model-based data augmentation (LAMBADA), involves fine-tuning a state-of-the-art language generator to a specific task through an initial training phase on the existing (usually small) labeled data. Using the fine-tuned model and given a class label, new sentences for the class are generated. Our process then filters these new sentences by using a classifier trained on the original data. In a series of experiments, we show that LAMBADA improves classifiers' performance on a variety of datasets. Moreover, LAMBADA significantly improves upon the state-of-the-art techniques for data augmentation, specifically those applicable to text classification tasks with little data.

Motivation and Objectives

Motivate the problem of scarce labeled data in text classification and the need for effective data augmentation.
Introduce LAMBADA, a language-model-based augmentation pipeline that synthesizes labeled sentences.
Show that LAMBADA improves classification accuracy and surpasses state-of-the-art augmentation methods on small datasets.
Demonstrate that LAMBADA can outperform baselines and alternative semi-supervised approaches when unlabeled data is unavailable.

Proposed Method

Fine-tune GPT-2 on the small labeled dataset D_train to create a task-adapted generator G_tuned.
Synthesize a labeled sentence set D* by prompting G_tuned with class labels and separators to generate sentences per class.
Filter D* using a baseline classifier h trained on D_train, keeping the top-N_y high-confidence examples per class to form D_synthesized.
Retrain the target classifier A on D_train ∪ D_synthesized to obtain an improved classifier.
Compare LAMBADA to other augmentation methods (EDA, CVAE, CBERT) and to baselines, using McNemar tests for statistical significance.
Note that LAMBADA does not require unlabeled data and can be iterated or adapted to zero-shot class scenarios.
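
The data-handling parts of the steps above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the `SEP` and `EOS` strings, the function names, and the toy keyword-based confidence function are all assumptions made for the example. Fine-tuning GPT-2 and sampling from it are left out; the sketch only shows how labeled examples might be serialized for the generator, how a class prompt might look, and how the top-N_y filtering by the baseline classifier h could work.

```python
from collections import defaultdict

SEP = " [SEP] "  # assumed separator between class label and sentence
EOS = " [EOS]"   # assumed end-of-example marker


def build_finetune_text(d_train):
    """Serialize (label, sentence) pairs into one training string for the generator."""
    return "".join(f"{label}{SEP}{sentence}{EOS}\n" for label, sentence in d_train)


def make_prompt(label):
    """Prompt asking the fine-tuned generator for a new sentence of class `label`."""
    return f"{label}{SEP}"


def filter_top_n(candidates, confidence_fn, n_per_class):
    """Keep, per class y, the n_per_class[y] candidates that h scores highest.

    candidates: iterable of (label, sentence) pairs produced by the generator
    confidence_fn(sentence, label): h's confidence that sentence belongs to label
    """
    by_class = defaultdict(list)
    for label, sentence in candidates:
        by_class[label].append((confidence_fn(sentence, label), sentence))
    kept = []
    for label, scored in by_class.items():
        # Stable sort, highest confidence first
        scored.sort(key=lambda pair: pair[0], reverse=True)
        kept.extend((label, s) for _, s in scored[: n_per_class.get(label, 0)])
    return kept


# Toy stand-in for the baseline classifier h (hypothetical keyword matcher)
KEYWORDS = {"flight": ("flight",), "fare": ("fare", "price", "ticket")}


def toy_confidence(sentence, label):
    return 1.0 if any(k in sentence for k in KEYWORDS[label]) else 0.0


d_train = [
    ("flight", "show me flights to boston"),
    ("fare", "how much is a ticket to denver"),
]
# Pretend these came from the fine-tuned generator (D*)
d_star = [
    ("flight", "list flights from dallas"),
    ("flight", "what is the cheapest fare"),
    ("fare", "ticket price to miami"),
    ("fare", "flights leaving tonight"),
]

d_synthesized = filter_top_n(d_star, toy_confidence, {"flight": 1, "fare": 1})
augmented = d_train + d_synthesized  # D_train ∪ D_synthesized, used to retrain A
print(d_synthesized)
```

In a real run, `toy_confidence` would be replaced by the class probability from the classifier trained on D_train, and `d_star` by sentences sampled from the fine-tuned generator.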

Experimental Results

Research Questions

RQ1: Can LAMBADA improve text classification performance when training data per class is very small?
RQ2: How does LAMBADA compare to existing text augmentation methods across multiple classifiers and datasets?
RQ3: Is LAMBADA effective without leveraging unlabeled data, and how does it perform relative to semi-supervised approaches?
RQ4: Does LAMBADA provide benefits across different classifier families (e.g., BERT, SVM, LSTM) and datasets with varying characteristics?

Key Findings

Dataset	Classifier	Baseline Accuracy	LAMBADA Accuracy	Improvement (%)
ATIS	BERT	53.3	75.7	58.5
ATIS	SVM	35.6	56.5	58.7
ATIS	LSTM	29.0	33.7	16.2
TREC	BERT	60.3	64.3	6.6
TREC	SVM	42.7	43.9	2.8
TREC	LSTM	17.7	25.8	45.0
WVA	BERT	67.2	68.6	2.1
WVA	SVM	60.2	62.9	4.5
WVA	LSTM	26.0	32.0	23.0
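
The Improvement (%) column reads most naturally as a relative gain over the baseline, i.e. 100 × (LAMBADA − Baseline) / Baseline. That formula is an assumption, since the column definition is not stated here; the short sketch below applies it to a few rows of the table. The function name is made up for illustration.

```python
def relative_improvement(baseline, augmented):
    """Relative accuracy gain in percent: 100 * (augmented - baseline) / baseline."""
    return 100.0 * (augmented - baseline) / baseline


# (dataset, classifier, baseline accuracy, LAMBADA accuracy) taken from the table
rows = [
    ("ATIS", "SVM", 35.6, 56.5),
    ("ATIS", "LSTM", 29.0, 33.7),
    ("TREC", "BERT", 60.3, 64.3),
    ("WVA", "SVM", 60.2, 62.9),
]

for dataset, clf, base, lam in rows:
    print(f"{dataset} {clf}: {relative_improvement(base, lam):.1f}%")
```

These rows reproduce the reported column under that assumption. A few other rows do not reproduce exactly from the rounded accuracies (ATIS with BERT, for example, works out to roughly 42.0), which may reflect rounding or unrounded underlying values; the sketch does not try to resolve that.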

On ATIS with five samples per class, LAMBADA substantially improves all classifiers (BERT, SVM, LSTM) over the baseline and outperforms other augmentation methods (statistically significant, p<0.01).
Across three datasets (ATIS, TREC, WVA) and three classifiers, LAMBADA yields higher accuracy than the baselines for all combinations, with notable gains on ATIS especially for BERT and SVM.
Table 4 of the paper reports the per-classifier Baseline vs. LAMBADA gains listed in the table above; the largest reported improvements are on ATIS (BERT 58.5%, SVM 58.7%) and on TREC with LSTM (45.0%), while the smallest are on WVA with BERT (2.1%) and TREC with SVM (2.8%).
LAMBADA outperforms EDA, CVAE, and CBERT across ATIS, TREC, and WVA for all classifiers in Table 5 (McNemar p<0.01).
Compared to a weak-labeling semi-supervised baseline, LAMBADA with GPT-2 labeling yields higher accuracy on ATIS for BERT and SVM, demonstrating the value of synthesized labeled data when unlabeled data is limited.
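
The significance claims above rest on McNemar's test, which compares two classifiers on the same test examples by looking only at the examples where exactly one of them is correct. The sketch below is a minimal exact (binomial) version using only the standard library. Which variant the paper used (exact or chi-square) is not stated here, so the choice is an assumption, and the toy predictions are invented for illustration.

```python
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-sided exact McNemar test on paired predictions.

    Returns (b, c, p_value), where
      b = examples classifier A gets right and classifier B gets wrong,
      c = examples classifier A gets wrong and classifier B gets right.
    """
    b = sum(1 for t, a, p in zip(y_true, pred_a, pred_b) if a == t and p != t)
    c = sum(1 for t, a, p in zip(y_true, pred_a, pred_b) if a != t and p == t)
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # Under the null hypothesis, discordant pairs split as Binomial(n, 0.5)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Toy data: the baseline misses 8 of 12 examples that the augmented model gets right
y_true = [1] * 12
baseline_pred = [0] * 8 + [1] * 4
augmented_pred = [1] * 12

b, c, p_value = mcnemar_exact(y_true, baseline_pred, augmented_pred)
print(b, c, p_value)
```

With 8 discordant pairs all favoring the augmented model, the two-sided p-value is 2/256, which falls below the 0.01 threshold quoted above.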


This review was generated by AI and checked by a human editor.