QUICK REVIEW

[Paper Review] Fine-Tuning can Distort Pretrained Features and Underperform Out-of-Distribution

Ananya Kumar, Aditi Raghunathan|arXiv (Cornell University)|Feb 21, 2022

Advanced Neural Network Applications | Cited by 158

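One-line summary

This paper shows that fine-tuning pretrained features can lower OOD accuracy relative to linear probing, and proposes LP-FT (linear probing followed by full fine-tuning) as a simple method that improves both ID and OOD performance.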

ABSTRACT

When transferring a pretrained model to a downstream task, two popular methods are full fine-tuning (updating all the model parameters) and linear probing (updating only the last linear layer -- the "head"). It is well known that fine-tuning leads to better accuracy in-distribution (ID). However, in this paper, we find that fine-tuning can achieve worse accuracy than linear probing out-of-distribution (OOD) when the pretrained features are good and the distribution shift is large. On 10 distribution shift datasets (Breeds-Living17, Breeds-Entity30, DomainNet, CIFAR $\to$ STL, CIFAR10.1, FMoW, ImageNetV2, ImageNet-R, ImageNet-A, ImageNet-Sketch), fine-tuning obtains on average 2% higher accuracy ID but 7% lower accuracy OOD than linear probing. We show theoretically that this tradeoff between ID and OOD accuracy arises even in a simple setting: fine-tuning overparameterized two-layer linear networks. We prove that the OOD error of fine-tuning is high when we initialize with a fixed or random head -- this is because while fine-tuning learns the head, the lower layers of the neural network change simultaneously and distort the pretrained features. Our analysis suggests that the easy two-step strategy of linear probing then full fine-tuning (LP-FT), sometimes used as a fine-tuning heuristic, combines the benefits of both fine-tuning and linear probing. Empirically, LP-FT outperforms both fine-tuning and linear probing on the above datasets (1% better ID, 10% better OOD than full fine-tuning).
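The "overparameterized two-layer linear network" setting named in the abstract can be written down compactly. The block below is a sketch under assumed notation ($v$ for the head, $B$ for the feature extractor, $B_0$ for its pretrained value, $X$ and $Y$ for the ID training inputs and labels, $v_0$ for a fixed or random initial head); the paper's own symbols and normalization may differ. "Overparameterized" is taken here to mean that the ID inputs do not span the whole input space, so many pairs $(v, B)$ fit the training data exactly.

```latex
\begin{align*}
f_{v,B}(x) &= v^\top B x, \qquad v \in \mathbb{R}^{k},\ B \in \mathbb{R}^{k \times d} \\
\widehat{L}(v, B) &= \lVert X B^\top v - Y \rVert_2^2 \\
\text{LP:}\quad & \min_{v}\ \widehat{L}(v, B_0) \quad (B \text{ frozen at } B_0) \\
\text{FT:}\quad & \text{gradient descent on } \widehat{L} \text{ over } (v, B) \text{ from } (v_0, B_0) \\
\text{LP-FT:}\quad & \text{the same gradient descent, started from } (v_{\mathrm{LP}}, B_0)
\end{align*}
```

The three methods differ only in which parameters move and where the head starts, which is what the ID/OOD comparison isolates.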

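Motivation and objectives

Investigate how fine-tuning and linear probing affect ID and OOD generalization.
Characterize the conditions under which fine-tuning hurts OOD performance despite strong ID gains.
Propose and evaluate the LP-FT strategy, which combines the benefits of both methods.
Provide theoretical insight into feature distortion during fine-tuning and its effect on OOD error.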

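Proposed method

Theoretically analyze fine-tuning and linear probing in overparameterized two-layer linear networks that start from pretrained features.
Define and measure a feature-extractor distance d(B, B′) and the largest principal angle, and use them to study how well the pretrained features align with the data subspace.
Derive a lower bound on the OOD error of fine-tuning, showing that features are distorted outside the span of the training data.
Compare the ID and OOD performance of LP, FT, and LP-FT, both in theory and empirically on 10 distribution-shift benchmarks.
Verify empirically that LP-FT outperforms FT and LP across the datasets, and show that FT distorts features as the theory predicts.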
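The largest principal angle mentioned above can be computed from the singular values of the product of two orthonormal bases. The following is a minimal NumPy sketch, not the authors' code: the function name `max_principal_angle` and the convention that subspaces are given as the column spaces of full-column-rank matrices are assumptions made for illustration.

```python
import numpy as np


def max_principal_angle(R, S):
    """Largest principal angle (in radians) between the column spaces of R and S.

    Assumes R and S have full column rank. The singular values of E^T F,
    where E and F are orthonormal bases, are the cosines of the principal
    angles, so the smallest singular value gives the largest angle.
    """
    E, _ = np.linalg.qr(R)  # orthonormal basis for span(R)
    F, _ = np.linalg.qr(S)  # orthonormal basis for span(S)
    sigma = np.linalg.svd(E.T @ F, compute_uv=False)
    # Clip to guard against round-off pushing a cosine slightly above 1.
    return float(np.arccos(np.clip(sigma.min(), -1.0, 1.0)))


# Two lines in the plane that are 0.3 radians apart.
line_a = np.array([[1.0], [0.0]])
line_b = np.array([[np.cos(0.3)], [np.sin(0.3)]])
angle = max_principal_angle(line_a, line_b)
```

A small angle means the two subspaces are closely aligned; an angle of π/2 means that at least one direction in one subspace is orthogonal to the other subspace.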

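Experimental results

Research questions

RQ1: Under what conditions does fine-tuning generalize worse OOD than linear probing?
RQ2: How does the alignment between the pretrained features and the training-data subspace affect ID and OOD performance during fine-tuning?
RQ3: Can a two-step LP-FT strategy mitigate the ID-OOD tradeoff observed with standard fine-tuning?
RQ4: Do empirical results across diverse distribution shifts confirm, consistently with the theory, that fine-tuning distorts pretrained features?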

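Key findings

Averaged over the 10 distribution shifts, fine-tuning raises ID accuracy but lowers OOD accuracy relative to linear probing (about 2% higher ID, about 7% lower OOD).
Fine-tuning distorts the pretrained features: it updates them more along ID directions than along OOD directions, so OOD performance degrades when the distribution shift is large.
Initializing a good head by linear probing and then fine-tuning (LP-FT) gives better ID and OOD accuracy than either FT or LP (about 1% better ID and about 10% better OOD than FT).
When the pretrained features are good, linear probing has the advantage of preserving them and therefore extrapolates better OOD, whereas fine-tuning adapts to ID data but distorts the features at OOD points.
Empirical results on the 10 distribution shifts (e.g., DomainNet, CIFAR→STL, ImageNet variants) are consistent with the theory and support LP-FT as a robust strategy.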
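The feature-distortion mechanism behind these findings can be reproduced in a toy two-layer linear network. The sketch below is an illustration under strong assumptions, not the paper's experiments: the pretrained extractor `B0` is assumed to be perfect, labels are noiseless, ID inputs lie in a low-dimensional subspace, OOD inputs lie in its orthogonal complement, and every name, size, and hyperparameter is hypothetical.

```python
import numpy as np


def make_problem(seed=0, d=10, k=3, m=5, n=200):
    """Toy data: ID inputs span an m-dim subspace, OOD inputs its complement."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))  # orthonormal basis of R^d
    X_id = rng.standard_normal((n, m)) @ Q[:, :m].T
    X_ood = rng.standard_normal((n, d - m)) @ Q[:, m:].T
    # "Pretrained" extractor with orthonormal rows, assumed perfect here.
    B0 = np.linalg.qr(rng.standard_normal((d, k)))[0].T
    v_star = rng.standard_normal(k)
    v_star /= np.linalg.norm(v_star)  # true head
    v_rand = rng.standard_normal(k)
    v_rand /= np.linalg.norm(v_rand)  # random head used to start plain FT
    return X_id, X_ood, B0, v_star, v_rand


def linear_probe(X, y, B):
    """LP: fit only the head by least squares on the frozen features X @ B.T."""
    v, *_ = np.linalg.lstsq(X @ B.T, y, rcond=None)
    return v


def fine_tune(X, y, B, v, lr=0.02, steps=2000):
    """FT: gradient descent on both v and B for mean((X B^T v - y)^2) / 2."""
    B, v = B.copy(), v.copy()
    n = len(y)
    for _ in range(steps):
        feats = X @ B.T
        resid = feats @ v - y
        grad_v = feats.T @ resid / n
        grad_B = np.outer(v, resid @ X) / n  # rows lie in the span of ID inputs
        v -= lr * grad_v
        B -= lr * grad_B
    return B, v


def mse(X, y, B, v):
    return float(np.mean((X @ B.T @ v - y) ** 2))


def run_demo(seed=0):
    X_id, X_ood, B0, v_star, v_rand = make_problem(seed)
    y_id = X_id @ B0.T @ v_star
    y_ood = X_ood @ B0.T @ v_star
    v_lp = linear_probe(X_id, y_id, B0)
    fitted = {
        "LP": (B0, v_lp),
        "FT": fine_tune(X_id, y_id, B0, v_rand),
        "LP-FT": fine_tune(X_id, y_id, B0, v_lp),  # FT started from the LP head
    }
    return {
        name: {
            "id_mse": mse(X_id, y_id, B, v),
            "ood_mse": mse(X_ood, y_ood, B, v),
            "distortion": float(np.linalg.norm(B - B0)),  # Frobenius distance
        }
        for name, (B, v) in fitted.items()
    }
```

Calling `run_demo()` returns ID error, OOD error, and feature distortion for each method. In this idealized setup the LP head already fits the ID data, so LP-FT receives almost no gradient and leaves `B0` intact, while FT from a random head has to move the extractor to compensate. Real pretrained features are imperfect, which is where the fine-tuning stage of LP-FT is meant to help.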

This review was generated by AI and checked by a human editor.