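[Paper Explained] Not Enough Data? Deep Learning to the Rescue!
LAMBADA fine-tunes the GPT-2 language model on a small labeled text dataset to generate labeled synthetic data. The synthetic data is filtered with a baseline classifier and then added to the training set to improve text classification accuracy.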
Based on recent advances in natural language modeling and those in text generation capabilities, we propose a novel data augmentation method for text classification tasks. We use a powerful pre-trained neural network model to artificially synthesize new labeled data for supervised learning. We mainly focus on cases with scarce labeled data. Our method, referred to as language-model-based data augmentation (LAMBADA), involves fine-tuning a state-of-the-art language generator to a specific task through an initial training phase on the existing (usually small) labeled data. Using the fine-tuned model and given a class label, new sentences for the class are generated. Our process then filters these new sentences by using a classifier trained on the original data. In a series of experiments, we show that LAMBADA improves classifiers' performance on a variety of datasets. Moreover, LAMBADA significantly improves upon the state-of-the-art techniques for data augmentation, specifically those applicable to text classification tasks with little data.
Research Motivation and Goals
- Motivate the problem of scarce labeled data in text classification and the need for effective data augmentation.
- Introduce LAMBADA, a language-model-based augmentation pipeline that synthesizes labeled sentences.
- Show that LAMBADA improves classification accuracy and surpasses state-of-the-art augmentation methods on small datasets.
- Demonstrate that LAMBADA can outperform baselines and alternative semi-supervised approaches when unlabeled data is unavailable.
Proposed Method
- Fine-tune GPT-2 on the small labeled dataset D_train to create a task-adapted generator G_tuned.
- Synthesize a labeled sentence set D* by prompting G_tuned with class labels and separators to generate sentences per class.
- Filter D* using a baseline classifier h trained on D_train, keeping the top-N_y high-confidence examples per class to form D_synthesized.
- Retrain the target classifier A on D_train ∪ D_synthesized to obtain an improved classifier.
- Compare LAMBADA to other augmentation methods (EDA, CVAE, CBERT) and to baselines, using McNemar's test for statistical significance.
- Note that LAMBADA does not require unlabeled data and can be iterated or adapted to zero-shot class scenarios.
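The data-handling parts of the steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `SEP` string, the exact text layout, the helper names, and the `confidence(sentence, label)` callable (standing in for the baseline classifier h) are all hypothetical. Fine-tuning and sampling from GPT-2 are left out.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]  # (class label, sentence)

SEP = "[SEP]"          # hypothetical separator between label and sentence
EOS = "<|endoftext|>"  # GPT-2's end-of-text token, used here as a terminator


def build_finetune_text(d_train: List[Example]) -> str:
    """Serialize D_train as 'label SEP sentence EOS' records (assumed layout)."""
    return " ".join(f"{label} {SEP} {sentence} {EOS}" for label, sentence in d_train)


def make_prompt(label: str) -> str:
    """Prompt for G_tuned: the class label followed by the separator."""
    return f"{label} {SEP}"


def filter_top_n(
    d_star: List[Example],
    confidence: Callable[[str, str], float],
    n_per_class: int,
) -> List[Example]:
    """Keep the n_per_class highest-confidence sentences for each class.

    confidence(sentence, label) is assumed to return the score that the
    baseline classifier h assigns to `label` for `sentence`.
    """
    ranked: Dict[str, List[Tuple[float, str]]] = defaultdict(list)
    for label, sentence in d_star:
        ranked[label].append((confidence(sentence, label), sentence))

    d_synthesized: List[Example] = []
    for label, scored in ranked.items():
        # Sort by score only, highest first, then truncate to the class budget.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        d_synthesized.extend((label, s) for _, s in scored[:n_per_class])
    return d_synthesized
```

In this sketch the target classifier would then be retrained on `d_train + filter_top_n(d_star, confidence, n)`. Using one budget `n_per_class` for every class is a simplification; the method description allows a separate N_y per class.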
Experimental Results
Research Questions
- RQ1: Can LAMBADA improve text classification performance when training data per class is very small?
- RQ2: How does LAMBADA compare to existing text augmentation methods across multiple classifiers and datasets?
- RQ3: Is LAMBADA effective without leveraging unlabeled data, and how does it perform relative to semi-supervised approaches?
- RQ4: Does LAMBADA provide benefits across different classifier families (e.g., BERT, SVM, LSTM) and datasets with varying characteristics?
Key Findings
| Dataset | Classifier | Baseline Accuracy (%) | LAMBADA Accuracy (%) | Improvement (%) |
|---|---|---|---|---|
| ATIS | BERT | 53.3 | 75.7 | 58.5 |
| ATIS | SVM | 35.6 | 56.5 | 58.7 |
| ATIS | LSTM | 29.0 | 33.7 | 16.2 |
| TREC | BERT | 60.3 | 64.3 | 6.6 |
| TREC | SVM | 42.7 | 43.9 | 2.8 |
| TREC | LSTM | 17.7 | 25.8 | 45.0 |
| WVA | BERT | 67.2 | 68.6 | 2.1 |
| WVA | SVM | 60.2 | 62.9 | 4.5 |
| WVA | LSTM | 26.0 | 32.0 | 23.0 |
- On ATIS with five samples per class, LAMBADA substantially improves all classifiers (BERT, SVM, LSTM) over the baseline and outperforms other augmentation methods (statistically significant, p<0.01).
- Across three datasets (ATIS, TREC, WVA) and three classifiers, LAMBADA yields higher accuracy than the baselines for all combinations, with notable gains on ATIS especially for BERT and SVM.
- Table 4 shows per-classifier gains when comparing Baseline vs. LAMBADA: ATIS (BERT 53.3 → 75.7; improvement 58.5%), ATIS (SVM 35.6 → 56.5; 58.7%), ATIS (LSTM 29.0 → 33.7; 16.2%), TREC (BERT 60.3 → 64.3; 6.6%), TREC (SVM 42.7 → 43.9; 2.8%), TREC (LSTM 17.7 → 25.8; 45.0%), WVA (BERT 67.2 → 68.6; 2.1%), WVA (SVM 60.2 → 62.9; 4.5%), WVA (LSTM 26.0 → 32.0; 23.0%).
- LAMBADA outperforms EDA, CVAE, and CBERT across ATIS, TREC, and WVA for all classifiers in Table 5 (McNemar p<0.01).
- Compared to a weak-labeling semi-supervised baseline, LAMBADA with GPT-2 labeling yields higher accuracy on ATIS for BERT and SVM, demonstrating the value of synthesized labeled data when unlabeled data is limited.
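The significance claims above rest on McNemar's test, which compares two classifiers on the same test set using only the examples where exactly one of them is correct. The summary does not say which variant the paper used, so the sketch below assumes the exact (binomial) two-sided form; the function name and argument layout are illustrative.

```python
from math import comb
from typing import Sequence


def mcnemar_exact(
    y_true: Sequence, pred_a: Sequence, pred_b: Sequence
) -> float:
    """Two-sided exact McNemar p-value for two classifiers on one test set."""
    # b: A correct, B wrong; c: A wrong, B correct (the discordant pairs).
    b = sum(1 for t, a, p in zip(y_true, pred_a, pred_b) if a == t and p != t)
    c = sum(1 for t, a, p in zip(y_true, pred_a, pred_b) if a != t and p == t)
    n = b + c
    if n == 0:
        return 1.0  # the classifiers never disagree in correctness
    # Under the null hypothesis, each discordant pair favors A or B with
    # probability 0.5, so the smaller count follows Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)
```

A result below 0.01 would correspond to the p<0.01 threshold quoted in the findings. For large numbers of discordant pairs, the chi-squared approximation is a common alternative.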
This explainer was generated by AI and reviewed by human editors.