[Paper Review] Revisiting Few-sample BERT Fine-tuning
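This paper analyzes the instability of BERT fine-tuning with few samples. It identifies three causes: biased gradient estimation from an Adam variant that omits bias correction, detrimental initialization of the top layers, and a fixed number of training iterations. It proposes remedies such as debiased Adam and layer re-initialization, and re-evaluates earlier stabilization methods.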
This paper is a study of fine-tuning of BERT contextual representations, with focus on commonly observed instabilities in few-sample scenarios. We identify several factors that cause this instability: the common use of a non-standard optimization method with biased gradient estimation; the limited applicability of significant parts of the BERT network for down-stream tasks; and the prevalent practice of using a pre-determined, and small number of training iterations. We empirically test the impact of these factors, and identify alternative practices that resolve the commonly observed instability of the process. In light of these observations, we re-visit recently proposed methods to improve few-sample fine-tuning with BERT and re-evaluate their effectiveness. Generally, we observe the impact of these methods diminishes significantly with our modified process.
Motivation and Objectives
- Understand the causes of instability when fine-tuning BERT on small (few-sample) datasets.
- Evaluate how optimization choices, initialization, and iteration counts affect stability and performance.
- Propose practical remedies to reduce degenerate runs and improve convergence.
- Re-evaluate existing stabilization methods under corrected optimization settings.
Proposed Method
- Compare BERT fine-tuning under debiased Adam (with bias correction) against the biased BERT Adam variant (without bias correction).
- Investigate re-initialization of top pre-trained layers (pooler and Transformer blocks) during fine-tuning.
- Evaluate impact of training longer than the commonly used three epochs.
- Assess how debiasing and re-init interact with existing stabilization methods (e.g., Mixout, weight decay, intermediate-task transfer).
- Analyze layer-wise parameter changes (L2 distance from initialization) during fine-tuning.
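The difference between debiased and biased Adam in the comparison above can be illustrated with a minimal sketch. It tracks the update that Adam applies to a single scalar parameter, with and without the bias-correction terms. The function name `adam_updates`, the constant gradient, and the hyperparameter values are illustrative assumptions, not the paper's experimental setup.

```python
import math

def adam_updates(grads, lr=2e-5, beta1=0.9, beta2=0.999, eps=1e-6, debias=True):
    """Return the update Adam applies at each step for one scalar parameter.

    Hypothetical helper for illustration; hyperparameter values are assumptions.
    """
    m = 0.0  # running mean of gradients (first moment)
    v = 0.0  # running mean of squared gradients (second moment)
    updates = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        if debias:
            # Bias correction: both moments start at zero, so rescale them.
            m_hat = m / (1.0 - beta1 ** t)
            v_hat = v / (1.0 - beta2 ** t)
        else:
            # Variant without bias correction: use the raw moments.
            m_hat, v_hat = m, v
        updates.append(lr * m_hat / (math.sqrt(v_hat) + eps))
    return updates

# Toy input: a constant gradient of 1.0 for 1000 steps.
grads = [1.0] * 1000
with_correction = adam_updates(grads, debias=True)
without_correction = adam_updates(grads, debias=False)

# Ratio of the uncorrected step to the corrected step, early and late.
ratio_first = without_correction[0] / with_correction[0]
ratio_last = without_correction[-1] / with_correction[-1]
print(f"step 1: {ratio_first:.2f}x, step 1000: {ratio_last:.2f}x")
```

With these settings the uncorrected update at the first step is roughly (1 - beta1) / sqrt(1 - beta2), about 3.16 times the corrected one, and the gap shrinks only slowly as training proceeds. On a small dataset with few iterations, much of training can fall inside this early regime.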
Experimental Results
Research Questions
- RQ1: What causes instability in few-sample BERT fine-tuning on small datasets?
- RQ2: Can debiasing in Adam reduce degenerate runs and variance across seeds?
- RQ3: Does re-initializing top BERT layers improve convergence and performance?
- RQ4: How does training longer than three epochs affect stability and results?
- RQ5: Do existing stabilization methods retain usefulness under unbiased optimization?
Key Results
- Debiased Adam significantly reduces variability and degenerate runs across small datasets.
- Re-initializing top pre-trained layers speeds up convergence and often improves mean performance.
- Longer training beyond three epochs improves stability and performance on several datasets.
- Top-layer overspecialization to pre-training can hinder fine-tuning, and re-init mitigates this by providing a better initialization.
- With debiased optimization, the relative gains of several previously proposed stabilization methods diminish.
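The re-initialization result above can be made concrete with a toy sketch. It models a network as a list of flat weight vectors, redraws the top few layers from a zero-mean Gaussian, and measures the layer-wise L2 distance from the original weights. Everything here is a stand-in under stated assumptions: the helper names, the layer sizes, and the standard deviation of 0.02 are illustrative, and a real setup would operate on the actual pooler and Transformer block parameters.

```python
import math
import random

def init_layer(size, rng, std=0.02):
    """Draw a flat weight vector from N(0, std^2). The std value is an assumption."""
    return [rng.gauss(0.0, std) for _ in range(size)]

def reinit_top_layers(layers, num_reinit, rng, std=0.02):
    """Return a copy of `layers` with the last `num_reinit` layers redrawn."""
    if not 0 <= num_reinit <= len(layers):
        raise ValueError("num_reinit must be between 0 and the number of layers")
    keep = len(layers) - num_reinit
    kept = [list(w) for w in layers[:keep]]                        # lower layers: unchanged
    fresh = [init_layer(len(w), rng, std) for w in layers[keep:]]  # top layers: redrawn
    return kept + fresh

def layerwise_l2(layers_a, layers_b):
    """L2 distance between corresponding layers of two models."""
    return [
        math.sqrt(sum((a - b) ** 2 for a, b in zip(wa, wb)))
        for wa, wb in zip(layers_a, layers_b)
    ]

rng = random.Random(0)
# Toy stand-in for 12 pre-trained Transformer blocks, 64 weights each.
pretrained = [init_layer(64, rng) for _ in range(12)]
# Starting point for fine-tuning: top 3 layers re-initialized.
start = reinit_top_layers(pretrained, num_reinit=3, rng=rng)
distances = layerwise_l2(pretrained, start)
```

In this sketch `distances` is zero for the nine lower layers and positive for the three redrawn ones. Applying the same `layerwise_l2` measurement between the starting weights and the fine-tuned weights would give the kind of layer-wise change analysis described in the method section.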
This review was generated by AI and reviewed by a human editor.