QUICK REVIEW

[Paper Review] Fine-Tuning can Distort Pretrained Features and Underperform Out-of-Distribution

Ananya Kumar, Aditi Raghunathan|arXiv (Cornell University)|Feb 21, 2022

Advanced Neural Network Applications | Cited by 158

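One-Sentence Summary

The paper shows that, compared with linear probing, fine-tuning pretrained features can reduce OOD accuracy, and it proposes LP-FT (linear probing followed by fine-tuning) as a simple method that improves both ID and OOD performance.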

ABSTRACT

When transferring a pretrained model to a downstream task, two popular methods are full fine-tuning (updating all the model parameters) and linear probing (updating only the last linear layer -- the "head"). It is well known that fine-tuning leads to better accuracy in-distribution (ID). However, in this paper, we find that fine-tuning can achieve worse accuracy than linear probing out-of-distribution (OOD) when the pretrained features are good and the distribution shift is large. On 10 distribution shift datasets (Breeds-Living17, Breeds-Entity30, DomainNet, CIFAR $\to$ STL, CIFAR10.1, FMoW, ImageNetV2, ImageNet-R, ImageNet-A, ImageNet-Sketch), fine-tuning obtains on average 2% higher accuracy ID but 7% lower accuracy OOD than linear probing. We show theoretically that this tradeoff between ID and OOD accuracy arises even in a simple setting: fine-tuning overparameterized two-layer linear networks. We prove that the OOD error of fine-tuning is high when we initialize with a fixed or random head -- this is because while fine-tuning learns the head, the lower layers of the neural network change simultaneously and distort the pretrained features. Our analysis suggests that the easy two-step strategy of linear probing then full fine-tuning (LP-FT), sometimes used as a fine-tuning heuristic, combines the benefits of both fine-tuning and linear probing. Empirically, LP-FT outperforms both fine-tuning and linear probing on the above datasets (1% better ID, 10% better OOD than full fine-tuning).

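Research Motivation and Objectives

Study how fine-tuning and linear probing affect in-distribution (ID) and out-of-distribution (OOD) generalization.
Characterize the conditions under which fine-tuning delivers strong ID gains yet harms OOD performance.
Propose and evaluate the LP-FT strategy to combine the strengths of both methods.
Provide theoretical insight into how features are distorted during fine-tuning and how that distortion affects OOD error.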

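Proposed Method

Theoretically analyze fine-tuning and linear probing in overparameterized two-layer linear networks with pretrained features.
Define and measure the feature extractor distance d(B, B′) and the largest principal angle to study the alignment between the pretrained features and the data subspace.
Derive a lower bound on the OOD error caused by fine-tuning changing the pretrained features only within the span of the training data, which leaves them inconsistent with the learned head outside that span.
Compare the ID and OOD performance of LP, FT, and LP-FT, both in theory and empirically on 10 distribution shift benchmarks.
Empirically verify that LP-FT outperforms both FT and LP on ID and OOD accuracy across the datasets, and that FT distorts features as the theory predicts.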
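
The summary names the two-layer linear setting and two measured quantities without writing them out. The block below is a sketch under assumed notation: the symbols v, B, x_i, y_i, U, E, F are illustrative, and the two definitions at the end are one common way to formalize a feature extractor distance and a largest principal angle, which may differ in detail from the paper's. The gradient identity in the middle is the step that explains why gradient-based fine-tuning leaves the features of directions orthogonal to the training inputs unchanged.

```latex
% Assumed notation: B \in \mathbb{R}^{k \times d} is the feature extractor,
% v \in \mathbb{R}^{k} is the linear head, (x_i, y_i) are the n training pairs.
\begin{align*}
f_{v,B}(x) &= v^\top B x, \\
\widehat{L}(v, B) &= \tfrac{1}{2} \sum_{i=1}^{n} \bigl( v^\top B x_i - y_i \bigr)^2, \\
\nabla_B \widehat{L}(v, B) &= \sum_{i=1}^{n} \bigl( v^\top B x_i - y_i \bigr)\, v\, x_i^\top .
\end{align*}
% Each gradient step adds a matrix whose rows lie in span{x_1, ..., x_n}.
% Hence for any u with x_i^\top u = 0 for all i, B_t u = B_0 u at every step t:
% the features of directions orthogonal to the training data never move,
% while the head v does.

% Assumed formalizations of the two measured quantities:
\begin{align*}
d(B, B') &= \min_{U \in \mathbb{R}^{k \times k},\; U^\top U = I} \lVert B - U B' \rVert_2, \\
\cos \theta_{\max}(R, S) &= \sigma_{r}\bigl( E^\top F \bigr), \qquad r = \min(\dim R, \dim S),
\end{align*}
% where E and F have orthonormal columns spanning the subspaces R and S,
% and \sigma_r denotes the r-th largest singular value.
```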

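Experimental Results

Research Questions

RQ1: Under what conditions does fine-tuning underperform linear probing in OOD generalization?
RQ2: How does the alignment between pretrained features and the training-data subspace affect ID and OOD performance during fine-tuning?
RQ3: Can the two-step LP-FT strategy mitigate the ID-OOD tradeoff observed with standard fine-tuning?
RQ4: Do empirical results across diverse distribution shifts support the theoretically predicted distortion of pretrained features during fine-tuning?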

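Key Findings

On average across the 10 distribution shifts, fine-tuning attains higher ID accuracy but lower OOD accuracy than linear probing (2% higher ID, 7% lower OOD).
Fine-tuning distorts pretrained features by updating them along ID directions more than along OOD directions, which degrades OOD performance when the distribution shift is large.
Fine-tuning from a good head initialization obtained by linear probing (LP-FT) beats both fine-tuning alone and linear probing alone on ID and OOD accuracy (about 1% better ID and 10% better OOD than FT).
The theory shows that with good pretrained features, linear probing extrapolates better OOD because it preserves the features, whereas fine-tuning fits ID data better but distorts the features that OOD points rely on.
Empirical results on the ten distribution shifts (e.g., DomainNet, CIFAR→STL, and the ImageNet variants) agree with the theory and support LP-FT as a robust strategy.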
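
The findings above can be illustrated with a minimal toy, sketched here in plain Python. It is not the authors' code or experimental setup: the dimensions (3 inputs, 2 features), the "pretrained" matrix `B0`, the data points, the learning rates, and the helper names `features`, `predict`, and `train` are all assumptions chosen so the effect is easy to see. The model is the two-layer linear form f(x) = v · (B x); the ID data only covers the first two input coordinates, and the OOD point lies along the third.

```python
# Toy two-layer linear model f(x) = v . (B x). All values are illustrative
# assumptions, not numbers or code from the paper.
D, K = 3, 2  # input dimension, feature dimension

def features(B, x):
    return [sum(B[j][i] * x[i] for i in range(D)) for j in range(K)]

def predict(v, B, x):
    return sum(vj * fj for vj, fj in zip(v, features(B, x)))

def train(v, B, data, lr, steps, update_features):
    """Full-batch gradient descent on squared loss.
    update_features=False freezes B (linear probing); True is fine-tuning."""
    v = list(v)
    B = [list(row) for row in B]
    for _ in range(steps):
        grad_v = [0.0] * K
        grad_B = [[0.0] * D for _ in range(K)]
        for x, y in data:
            feats = features(B, x)
            residual = sum(v[j] * feats[j] for j in range(K)) - y
            for j in range(K):
                grad_v[j] += residual * feats[j]
                for i in range(D):
                    grad_B[j][i] += residual * v[j] * x[i]
        for j in range(K):
            v[j] -= lr * grad_v[j]
            if update_features:
                for i in range(D):
                    B[j][i] -= lr * grad_B[j][i]
    return v, B

# Assumed "perfect" pretrained features; the true function is y = x1 + x2 + 2*x3,
# which equals v* . (B0 x) with v* = (1, 1).
B0 = [[1.0, 0.0, 1.0],
      [0.0, 1.0, 1.0]]
id_data = [([1.0, 0.0, 0.0], 1.0), ([0.0, 1.0, 0.0], 1.0)]  # third coordinate unseen
ood_x, ood_y = [0.0, 0.0, 1.0], 2.0

# Linear probing: train only the head on frozen features.
v_lp, B_lp = train([0.0, 0.0], B0, id_data, lr=0.1, steps=500, update_features=False)
# Fine-tuning: train head and features together from a zero head.
v_ft, B_ft = train([0.0, 0.0], B0, id_data, lr=0.01, steps=5000, update_features=True)
# LP-FT: fine-tune everything, starting from the linear-probed head.
v_lpft, B_lpft = train(v_lp, B0, id_data, lr=0.01, steps=5000, update_features=True)

lp_pred = predict(v_lp, B_lp, ood_x)
ft_pred = predict(v_ft, B_ft, ood_x)
lpft_pred = predict(v_lpft, B_lpft, ood_x)

print("OOD target:", ood_y)
print("LP    prediction:", round(lp_pred, 3))
print("FT    prediction:", round(ft_pred, 3))
print("LP-FT prediction:", round(lpft_pred, 3))
print("FT feature matrix:", [[round(b, 3) for b in row] for row in B_ft])
```

In this toy, linear probing and LP-FT both predict about 2.0 on the unseen direction, while fine-tuning from a zero head predicts roughly 1.4 even though it fits the ID points. The printed matrix shows why: fine-tuning changes the first two columns of B (the ID directions) but never touches the third, so the retrained head no longer matches the features of the OOD direction. With LP-FT the head already fits the ID data when fine-tuning begins, so the gradients are near zero and the features stay where they were.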

This review was generated by AI and reviewed by a human editor.