QUICK REVIEW

[Paper Review] Surgical Fine-Tuning Improves Adaptation to Distribution Shifts

Yoonho Lee, Annie S. Chen | arXiv (Cornell University) | Oct 20, 2022
Domain Adaptation and Few-Shot Learning | Cited by 47
One-Sentence Summary

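This paper proposes surgical fine-tuning: on a small target dataset, only a small contiguous subset of a neural network's layers is fine-tuned. Across a range of distribution shifts, this matches or outperforms full fine-tuning. The best subset of layers depends on the type of shift, and the theoretical results support tuning the first layer under input shift and the last layer under output shift.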

ABSTRACT

A common approach to transfer learning under distribution shift is to fine-tune the last few layers of a pre-trained model, preserving learned features while also adapting to the new task. This paper shows that in such settings, selectively fine-tuning a subset of layers (which we term surgical fine-tuning) matches or outperforms commonly used fine-tuning approaches. Moreover, the type of distribution shift influences which subset is more effective to tune: for example, for image corruptions, fine-tuning only the first few layers works best. We validate our findings systematically across seven real-world data tasks spanning three types of distribution shifts. Theoretically, we prove that for two-layer neural networks in an idealized setting, first-layer tuning can outperform fine-tuning all layers. Intuitively, fine-tuning more parameters on a small target dataset can cause information learned during pre-training to be forgotten, and the relevant information depends on the type of shift.

Motivation and Objectives

  • Motivate and analyze fine-tuning under distribution shift, highlighting limitations of standard approaches that fine-tune all or last layers.
  • Propose surgical fine-tuning: freezing most layers and tuning a small contiguous subset to improve adaptation with limited target data.
  • Systematically evaluate across seven real-world tasks spanning three shift types to identify which layer subsets are most effective.
  • Provide theoretical insights explaining why tuning different layers benefits different shift types, including two-layer network analyses.
  • Explore automatic criteria for selecting which layers to fine-tune and validate their effectiveness.

Proposed Method

  • Define surgical fine-tuning as optimizing parameters only for a chosen subset S of layers while freezing others.
  • Experiment with different choices of S, including the first block, a middle block, the last block, or other single blocks, across seven real-world datasets.
  • Compare surgical fine-tuning against full fine-tuning and other baselines on target-domain accuracy after fine-tuning with limited target data.
  • Theoretically analyze two-layer networks to show when first-layer or last-layer tuning can better handle input vs. output perturbations.
  • Introduce automatic layer-selection criteria (Auto-RGN, Auto-SNR) based on gradient statistics to choose which layers to tune.
  • Assess unsupervised (test-time) adaptation settings, showing that early-layer tuning helps under online updates.
  • Use standard training procedures (pre-train on source, fine-tune on target) with early stopping based on target data.
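
The core idea above (update only the layers in a chosen subset S, keep the rest frozen) can be illustrated with a minimal, framework-free sketch. Everything here is an assumption made for illustration and not the paper's setup: the "network" is a chain of scalar affine layers, and `forward`, `surgical_step`, `mse`, and the toy data are hypothetical names and values. In a framework such as PyTorch, the same effect is usually obtained by setting `requires_grad` to `False` on the frozen parameters.

```python
def forward(layers, x):
    """Apply a chain of scalar affine layers h -> w*h + b; return all activations."""
    acts = [x]
    for w, b in layers:
        acts.append(w * acts[-1] + b)
    return acts


def mse(layers, data):
    """Mean squared error of the chain on (input, label) pairs."""
    return sum((forward(layers, x)[-1] - y) ** 2 for x, y in data) / len(data)


def surgical_step(layers, data, tune, lr):
    """One gradient-descent step on the MSE, updating only layer indices in `tune`."""
    grads = [[0.0, 0.0] for _ in layers]
    for x, y in data:
        acts = forward(layers, x)
        upstream = 2.0 * (acts[-1] - y) / len(data)  # dLoss/d(output)
        for i in reversed(range(len(layers))):
            w, _ = layers[i]
            grads[i][0] += upstream * acts[i]  # dLoss/dw_i
            grads[i][1] += upstream            # dLoss/db_i
            upstream *= w                      # gradient w.r.t. this layer's input
    # Frozen layers are returned unchanged; only the subset `tune` is updated.
    return [
        (w - lr * grads[i][0], b - lr * grads[i][1]) if i in tune else (w, b)
        for i, (w, b) in enumerate(layers)
    ]


# Toy "pre-trained" model that computes y = 2x + 1 on source inputs.
source_layers = [(1.0, 0.0), (2.0, 0.0), (1.0, 1.0)]

# Toy input-level shift (an assumption): target inputs are scaled by 2,
# while the labels still follow the source rule on the unscaled input.
target_data = [(2.0 * x, 2.0 * x + 1.0) for x in (-0.5, 0.0, 0.5, 1.0)]

tuned = source_layers
for _ in range(200):
    tuned = surgical_step(tuned, target_data, tune={0}, lr=0.05)
```

In this toy case the shift lives entirely in the input, so adjusting only the first layer is enough to undo it while the later layers keep their pre-trained values. Passing a different index set as `tune` (for example `{2}` for the last layer) selects a different subset S.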

Experimental Results

Research Questions

  • RQ1: Does surgical fine-tuning (tuning a small subset of layers) outperform full fine-tuning across diverse distribution shifts?
  • RQ2: Which layer subset (first block, middle block, last block) is most effective for different shift types (input-level, feature-level, output-level)?
  • RQ3: Can automatic layer-selection criteria reliably identify the layers to tune to match or exceed full fine-tuning performance?
  • RQ4: What theoretical explanations account for when tuning early vs. late layers is advantageous under specific distribution shifts?
  • RQ5: Do unsupervised/test-time adaptation scenarios also benefit from surgical fine-tuning of early layers?

Key Findings

Tuned parameters     Camelyon17    FMoW
No fine-tuning       86.2          35.5
All                  92.3 (1.7)    38.9 (0.5)
Embedding            95.6 (0.4)    36.0 (0.1)
First 3 layers       92.5 (0.5)    39.8 (1.0)
Last 3 layers        87.5 (4.1)    44.9 (2.6)
Last layer           90.1 (1.5)    36.9 (5.5)

  • Surgical fine-tuning of a single, well-chosen block of layers matches or outperforms full fine-tuning across the tested distribution shifts.
  • The best block to tune varies by shift type: earlier layers for input-level shifts, middle blocks for feature-level shifts, and later layers for output-level shifts.
  • On the CIFAR-10 to CIFAR-10-C shift, first-block fine-tuning can match or exceed full fine-tuning across varying amounts of target data.
  • Across seven real-world datasets, choosing the layer subset according to the shift type yields better performance than tuning all parameters.
  • Automatic selection using Relative Gradient Norm (Auto-RGN) often matches or beats full fine-tuning and remains competitive with cross-validated block selection.
  • Theoretical results show conditions where tuning only the first layer can achieve zero target loss while full fine-tuning fails, and cases where last-layer tuning handles label perturbations better.
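
To make the Relative Gradient Norm criterion more concrete, here is a minimal sketch in plain Python. It assumes, as the name suggests, that the per-layer score is the ratio of the gradient norm to the parameter norm. Rescaling the scores by the largest one to obtain per-layer learning rates is an assumption of this sketch, and the function names and numbers are hypothetical, so this should not be read as the authors' exact procedure.

```python
import math


def l2_norm(values):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(v * v for v in values))


def relative_gradient_norms(params, grads, eps=1e-12):
    """Map each layer name to ||gradient|| / ||parameters||."""
    return {
        name: l2_norm(grads[name]) / (l2_norm(params[name]) + eps)
        for name in params
    }


def per_layer_learning_rates(params, grads, base_lr):
    """Scale base_lr per layer by its relative gradient norm (largest layer gets base_lr)."""
    rgn = relative_gradient_norms(params, grads)
    largest = max(rgn.values())
    if largest == 0.0:
        return {name: 0.0 for name in rgn}
    return {name: base_lr * value / largest for name, value in rgn.items()}


# Hypothetical flattened parameters and gradients for three layer groups.
params = {"block1": [3.0, 4.0], "block2": [6.0, 8.0], "head": [1.0, 0.0]}
grads = {"block1": [0.3, 0.4], "block2": [0.06, 0.08], "head": [0.0, 0.01]}

lrs = per_layer_learning_rates(params, grads, base_lr=1e-3)
```

With these made-up numbers, `block1` has the largest gradient relative to its weights and therefore receives the full base learning rate, while the other groups are updated more gently. A hard selection rule (tune only the top-scoring block) would be an alternative design under the same score.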

This review was generated by AI and reviewed by human editors.