[Paper Review] Fine-Tuning can Distort Pretrained Features and Underperform Out-of-Distribution
The paper shows that fine-tuning pretrained features can hurt OOD accuracy compared to linear probing, and proposes LP-FT (linear probing followed by fine-tuning) as a simple method that improves both ID and OOD performance.
When transferring a pretrained model to a downstream task, two popular methods are full fine-tuning (updating all the model parameters) and linear probing (updating only the last linear layer -- the "head"). It is well known that fine-tuning leads to better accuracy in-distribution (ID). However, in this paper, we find that fine-tuning can achieve worse accuracy than linear probing out-of-distribution (OOD) when the pretrained features are good and the distribution shift is large. On 10 distribution shift datasets (Breeds-Living17, Breeds-Entity30, DomainNet, CIFAR $\to$ STL, CIFAR10.1, FMoW, ImageNetV2, ImageNet-R, ImageNet-A, ImageNet-Sketch), fine-tuning obtains on average 2% higher accuracy ID but 7% lower accuracy OOD than linear probing. We show theoretically that this tradeoff between ID and OOD accuracy arises even in a simple setting: fine-tuning overparameterized two-layer linear networks. We prove that the OOD error of fine-tuning is high when we initialize with a fixed or random head -- this is because while fine-tuning learns the head, the lower layers of the neural network change simultaneously and distort the pretrained features. Our analysis suggests that the easy two-step strategy of linear probing then full fine-tuning (LP-FT), sometimes used as a fine-tuning heuristic, combines the benefits of both fine-tuning and linear probing. Empirically, LP-FT outperforms both fine-tuning and linear probing on the above datasets (1% better ID, 10% better OOD than full fine-tuning).
Motivation & Objective
- Investigate how fine-tuning versus linear probing affects in-distribution (ID) and out-of-distribution (OOD) generalization.
- Characterize conditions under which fine-tuning hurts OOD performance despite strong ID gains.
- Propose and evaluate the LP-FT strategy to combine benefits of both approaches.
- Provide theoretical insight into feature distortion during fine-tuning and its impact on OOD error.
Proposed method
- Theoretically analyze fine-tuning (FT) versus linear probing (LP) in an overparameterized two-layer linear network with pretrained features.
- Define and measure feature extractor distance d(B,B′) and largest principal angle to study alignment between pretrained features and data subspaces.
- Derive a lower bound on the OOD error of fine-tuning and attribute it to feature distortion: the pretrained features change along directions spanned by the training data but not along directions outside that span.
- Compare ID and OOD performance of LP, FT, and LP-FT both theoretically and empirically on 10 distribution-shift benchmarks.
- Empirically validate that LP-FT outperforms FT and LP on ID and OOD across datasets, and that FT distorts features as predicted.
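The two-layer linear setting analyzed above is small enough to simulate. The following is a minimal NumPy sketch, not the paper's experimental code: the dimensions, seed, learning rate, noiseless labels, and the helper names `linear_probe`, `fine_tune`, and `run_demo` are all assumptions chosen for illustration. It trains LP, FT from a random head, and LP-FT on inputs confined to a low-dimensional ID subspace, then reports how much the feature matrix moved inside and outside that subspace.

```python
import numpy as np


def linear_probe(B, X, y):
    """LP: freeze the features B and fit only the head v by least squares."""
    v, *_ = np.linalg.lstsq(X @ B.T, y, rcond=None)
    return v


def fine_tune(B0, v0, X, y, lr=0.02, steps=2000):
    """FT: gradient descent on both B and v for (1/2n) * ||X B^T v - y||^2."""
    B, v = B0.copy(), v0.copy()
    n = len(y)
    for _ in range(steps):
        r = X @ B.T @ v - y      # residuals, shape (n,)
        g = X.T @ r / n          # shape (d,); always inside the span of the rows of X
        # Simultaneous update: both right-hand sides use the old B and v.
        B, v = B - lr * np.outer(v, g), v - lr * (B @ g)
    return B, v


def run_demo(seed=0, d=10, k=3, m=4, n=50, n_ood=500):
    """Toy comparison of LP, FT, and LP-FT (all sizes are illustrative assumptions)."""
    rng = np.random.default_rng(seed)
    # Orthonormal basis (m, d) of the subspace that contains every ID input.
    id_basis = np.linalg.qr(rng.normal(size=(d, m)))[0].T
    X_id = rng.normal(size=(n, m)) @ id_basis
    X_ood = rng.normal(size=(n_ood, d))      # OOD inputs cover all d directions

    # Ground truth y = v*^T B* x, and "good" pretrained features B0 near B*.
    B_star = np.linalg.qr(rng.normal(size=(d, k)))[0].T
    v_star = rng.normal(size=k)
    B0 = B_star + 0.05 * rng.normal(size=(k, d))
    y_id = X_id @ B_star.T @ v_star
    y_ood = X_ood @ B_star.T @ v_star

    v_lp = linear_probe(B0, X_id, y_id)
    v_rand = rng.normal(size=k) / np.sqrt(k)  # random head for plain FT
    models = {
        "lp": (B0, v_lp),
        "ft": fine_tune(B0, v_rand, X_id, y_id),
        "lp_ft": fine_tune(B0, v_lp, X_id, y_id),
    }

    P_id = id_basis.T @ id_basis              # projector onto the ID subspace
    P_perp = np.eye(d) - P_id                 # projector onto its complement
    report = {}
    for name, (B, v) in models.items():
        report[name] = {
            "id_change": float(np.linalg.norm((B - B0) @ P_id)),
            "perp_change": float(np.linalg.norm((B - B0) @ P_perp)),
            "id_mse": float(np.mean((X_id @ B.T @ v - y_id) ** 2)),
            "ood_mse": float(np.mean((X_ood @ B.T @ v - y_ood) ** 2)),
        }
    return report


if __name__ == "__main__":
    for name, stats in run_demo().items():
        print(name, {key: round(val, 4) for key, val in stats.items()})
```

One property holds by construction rather than by luck of the seed: `perp_change` is zero up to rounding for both fine-tuned models, because every gradient step on `B` lies in the span of the training inputs. The error values it prints depend on the toy settings and should not be read as results from the paper.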
Experimental results
Research questions
- RQ1: Under what conditions does fine-tuning underperform linear probing in OOD generalization?
- RQ2: How does the alignment between pretrained features and the training data subspace affect ID and OOD performance during fine-tuning?
- RQ3: Can a two-step LP-FT strategy mitigate the ID-OOD trade-off observed with standard fine-tuning?
- RQ4: Do empirical results on diverse distribution shifts corroborate the theoretical distortion of pretrained features during fine-tuning?
Key findings
- Fine-tuning yields higher ID accuracy on average but lower OOD accuracy across 10 distribution shifts (2% higher ID, 7% lower OOD than linear probing).
- Fine-tuning distorts pretrained features, updating ID directions more than OOD directions, leading to worse OOD performance when distribution shift is large.
- Initializing with a good head from linear probing and then fine-tuning (LP-FT) yields better ID and OOD accuracy than both fine-tuning and linear probing alone (about 1% better ID and 10% better OOD than FT).
- Theoretical results show that with good pretrained features, linear probing extrapolates better OOD because it preserves features, while fine-tuning adapts to ID but distorts features for OOD points.
- Empirical results on ten distribution shifts (e.g., DomainNet, CIFAR→STL, ImageNet variants) align with the theory and favor LP-FT as a robust strategy.
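The finding that fine-tuning updates ID directions more than OOD directions can be seen from the gradient in the two-layer linear setting. The sketch below assumes a squared loss and the notation $f(x) = v^\top B x$, with training inputs stacked as rows of $X \in \mathbb{R}^{n \times d}$ and labels $y \in \mathbb{R}^n$; these symbols are chosen here for illustration and may differ from the paper's.

$$
L(v, B) = \frac{1}{2n} \lVert X B^\top v - y \rVert_2^2, \qquad r = X B^\top v - y
$$

$$
\nabla_B L = \frac{1}{n}\, v\, r^\top X, \qquad \nabla_v L = \frac{1}{n}\, B X^\top r
$$

$$
X u = 0 \;\Longrightarrow\; (\nabla_B L)\, u = \frac{1}{n}\, v\, r^\top X u = 0 \;\Longrightarrow\; B_t u = B_0 u \ \text{ for every gradient step } t
$$

$$
x = x_{\parallel} + x_{\perp}, \ X x_{\perp} = 0 \;\Longrightarrow\; f_t(x) = v_t^\top B_t\, x_{\parallel} + v_t^\top B_0\, x_{\perp}
$$

The features are therefore updated only along directions seen in training, while the head $v_t$ is updated without that restriction. On the component $x_{\perp}$ of an OOD input, the model pairs the updated head with the untouched pretrained features. Starting from a linear-probed head makes the initial residual $r$ small, which shrinks both gradients and so limits this mismatch.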
This review was created by AI and reviewed by human editors.