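[Paper Review] Variational Information Bottleneck for Effective Low-Resource Fine-Tuning
VIBERT applies a Variational Information Bottleneck to compress pretrained sentence representations during fine-tuning, which reduces overfitting in low-resource NLP settings and improves out-of-domain generalization.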
While large-scale pretrained language models have obtained impressive results when fine-tuned on a wide variety of tasks, they still often suffer from overfitting in low-resource scenarios. Since such models are general-purpose feature extractors, many of these features are inevitably irrelevant for a given target task. We propose to use Variational Information Bottleneck (VIB) to suppress irrelevant features when fine-tuning on low-resource target tasks, and show that our method successfully reduces overfitting. Moreover, we show that our VIB model finds sentence representations that are more robust to biases in natural language inference datasets, and thereby obtains better generalization to out-of-domain datasets. Evaluation on seven low-resource datasets in different tasks shows that our method significantly improves transfer learning in low-resource scenarios, surpassing prior work. Moreover, it improves generalization on 13 out of 15 out-of-domain natural language inference benchmarks. Our code is publicly available in https://github.com/rabeehk/vibert.
Research Motivation and Goals
- Address the overfitting that occurs when large pretrained language models are fine-tuned on low-resource data.
- Introduce Variational Information Bottleneck (VIB) to compress sentence representations before task-specific classification.
- Demonstrate that VIB reduces reliance on superficial biases and improves out-of-domain generalization.
- Show empirical gains across seven low-resource datasets and multiple NLP tasks.
Proposed Method
- Integrate a VIB module on top of a pretrained encoder (BERT) to map sentence embeddings to a latent z used by the task classifier.
- Minimize a variational objective that combines a β-weighted compression term, KL(pθ(z|x) || r(z)), with the task prediction loss under the classifier qφ(y|z): L_VIB = β E_x[KL(pθ(z|x) || r(z))] + E_{z~pθ(z|x)}[-log qφ(y|z)].
- Assume Gaussian priors r(z) and posteriors pθ(z|x) with diagonal covariances to enable analytic KL computations.
- Estimate μ(x) and Σ(x) with a shallow MLP applied to fφ(x), the pretrained encoder's sentence embedding.
- Train end-to-end using the reparameterization trick, z = μ(x) + Σ(x) ⊙ ε with ε ~ N(0, I), so that gradients flow through the sampling step.
- Treat z as the sole input to the task-specific classifier qφ(y|z).
- Experiment with bottleneck size K and regularization weight β to control information compression.
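The objective described in the list above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: it assumes a standard normal prior r(z) = N(0, I), a log-variance parameterization of the diagonal covariance, and a single linear softmax classifier on z. The function names (`kl_diag_gaussian_to_standard_normal`, `reparameterize`, `vib_loss`) and the toy numbers are hypothetical; in practice μ(x) and the log-variance would come from the MLP on top of the encoder.

```python
import math
import random


def kl_diag_gaussian_to_standard_normal(mu, log_var):
    """Analytic KL(N(mu, diag(exp(log_var))) || N(0, I)), summed over dimensions.

    Assumption: the prior r(z) is a standard normal.
    """
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var))


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I) and sigma = exp(0.5 * log_var)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def linear(weights, bias, x):
    """Apply a dense layer: one output per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax_cross_entropy(logits, label):
    """Return -log softmax(logits)[label], using the log-sum-exp trick for stability."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_norm - logits[label]


def vib_loss(mu, log_var, clf_weights, clf_bias, label, beta, rng):
    """Single-sample estimate of the task loss plus beta times the KL term."""
    z = reparameterize(mu, log_var, rng)           # z is the only classifier input
    logits = linear(clf_weights, clf_bias, z)
    nll = softmax_cross_entropy(logits, label)     # -log q(y|z)
    kl = kl_diag_gaussian_to_standard_normal(mu, log_var)
    return nll + beta * kl, nll, kl


# Toy usage with a bottleneck of size K = 2 and two classes (made-up numbers).
rng = random.Random(0)
mu, log_var = [0.5, -0.2], [-1.0, -1.0]
clf_weights, clf_bias = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
total, nll, kl = vib_loss(mu, log_var, clf_weights, clf_bias, label=0, beta=1e-5, rng=rng)
```

Setting `beta=0` removes the compression term and leaves only the task loss, which corresponds to the ablation mentioned in the results. Larger values of `beta` pull the posterior toward the prior and discard more information from the sentence embedding.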
Experimental Results
Research Questions
- RQ1: Does incorporating a Variational Information Bottleneck during fine-tuning reduce overfitting on low-resource NLP tasks?
- RQ2: Does VIBERT improve robustness to dataset biases and generalize better to out-of-domain NLI datasets?
- RQ3: How does VIBERT compare to standard regularization techniques (Dropout, Mixout, Weight Decay) in low-resource and out-of-domain settings?
- RQ4: What is the impact of VIB on training efficiency and model size?
Key Results
- VIBERT substantially improves accuracy on seven low-resource datasets compared with baselines.
- VIBERT provides notable gains over Dropout, Mixout, and Weight Decay across BERT-Base and BERT-Large in low-resource settings.
- VIBERT reduces reliance on superficial biases, leading to better generalization to out-of-domain NLI datasets.
- Hypothesis-only bias analysis shows VIBERT yields much lower hypothesis-only accuracy, indicating debiased representations.
- VIBERT demonstrates a controllable trade-off between information compression (β) and predictive performance, with improved generalization when β is balanced.
- Ablation without the compression loss (β=0) degrades performance, evidencing the benefit of the VIB objective.
This review was generated by AI and reviewed by a human editor.