QUICK REVIEW

[Paper Review] Facing the Elephant in the Room: Visual Prompt Tuning or Full Finetuning?

Cheng Han, Qifan Wang|arXiv (Cornell University)|Jan 23, 2024

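Domain Adaptation and Few-Shot Learning | Cited by 6

One-Sentence Summary

The paper compares Visual Prompt Tuning (VPT) with full finetuning (FT) across 19 VTAB-1k tasks, identifies when VPT is preferable and why, and offers insights into how disparities in data distribution and task objective shape the outcome.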

ABSTRACT

As the scale of vision models continues to grow, the emergence of Visual Prompt Tuning (VPT) as a parameter-efficient transfer learning technique has gained attention due to its superior performance compared to traditional full-finetuning. However, the conditions favoring VPT (the "when") and the underlying rationale (the "why") remain unclear. In this paper, we conduct a comprehensive analysis across 19 distinct datasets and tasks. To understand the "when" aspect, we identify the scenarios where VPT proves favorable by two dimensions: task objectives and data distributions. We find that VPT is preferable when there is 1) a substantial disparity between the original and the downstream task objectives (e.g., transitioning from classification to counting), or 2) a similarity in data distributions between the two tasks (e.g., both involve natural images). In exploring the "why" dimension, our results indicate VPT's success cannot be attributed solely to overfitting and optimization considerations. The unique way VPT preserves original features and adds parameters appears to be a pivotal factor. Our study provides insights into VPT's mechanisms, and offers guidance for its optimal utilization.

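Motivation and Objectives

Evaluate the situations in which visual prompt tuning (VPT) outperforms full finetuning (FT) across a range of downstream tasks.
Characterize how differences in task objective and similarity in data distribution affect transfer learning performance.
Investigate the underlying reasons for VPT's success beyond overfitting and parameter count.
Provide practical guidance on applying prompts in the pretrain-then-finetune pipeline.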

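Proposed Method

Compare FT and VPT on the 19 VTAB-1k tasks (Natural, Specialized, Structured) using a ViT-B/16 pretrained on ImageNet-21k.
Measure the disparity between the pretraining and downstream data distributions with the Fréchet Inception Distance (FID).
Run ablations, including Mixed and FT-then-PT, to analyze the roles of optimization and feature preservation.
Vary the downstream data size from 400 to 20,000 samples to examine performance as the dataset scale changes.
Use GradCAM and other visual explanations to interpret how prompts affect feature learning.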
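The distribution-disparity measurement listed above can be illustrated with a small sketch. The code below is not the paper's pipeline: it assumes feature vectors have already been extracted (FID normally uses Inception features) and it uses a diagonal-covariance simplification of the Fréchet distance between two Gaussians, so that it runs with the standard library only. The function names `mean_and_variance` and `frechet_distance_diag` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def mean_and_variance(
    features: Sequence[Sequence[float]],
) -> Tuple[List[float], List[float]]:
    """Per-dimension mean and (population) variance of a set of feature vectors."""
    n = len(features)
    dim = len(features[0])
    means = [sum(row[d] for row in features) / n for d in range(dim)]
    variances = [
        sum((row[d] - means[d]) ** 2 for row in features) / n for d in range(dim)
    ]
    return means, variances


def frechet_distance_diag(
    feats_a: Sequence[Sequence[float]],
    feats_b: Sequence[Sequence[float]],
) -> float:
    """Frechet distance between two Gaussians fitted to the feature sets.

    Assumption: covariances are treated as diagonal, so the trace term
    Tr(S_a + S_b - 2 (S_a S_b)^(1/2)) reduces to a sum of
    (sqrt(var_a) - sqrt(var_b))^2 over dimensions. Full FID uses the
    complete covariance matrices and a matrix square root.
    """
    mu_a, var_a = mean_and_variance(feats_a)
    mu_b, var_b = mean_and_variance(feats_b)
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_a, mu_b))
    cov_term = sum(
        (math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(var_a, var_b)
    )
    return mean_term + cov_term


# Toy usage: "pretraining" and "downstream" features (made-up numbers).
pretrain_feats = [[0.0, 1.0], [2.0, 3.0]]
downstream_feats = [[1.0, 2.0], [3.0, 4.0]]
distance = frechet_distance_diag(pretrain_feats, downstream_feats)
```

A larger value indicates that the two feature distributions are further apart; in the review's terms, a low distance corresponds to the "similar data distributions" case.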

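Experimental Results

Research Questions

RQ1: In which transfer learning scenarios does VPT outperform FT, in terms of task objective and data distribution?
RQ2: To what extent do disparities in data distribution and in task objective explain VPT's success?
RQ3: Does VPT's advantage stem from overfitting in FT, from the additional parameters, or from preserving the pretrained features?
RQ4: How does downstream data size affect the performance gap between FT and VPT?
RQ5: Do visual prompts lead the model to learn different or stronger features than FT?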

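Key Findings

VPT outperforms FT on 16 of the 19 tasks, most clearly when the downstream task objective differs substantially from the pretraining objective or when the data distributions are similar.
As downstream data size grows, FT benefits relatively more: the gap narrows, and in some high-resource settings FT overtakes VPT.
Overfitting is not the main cause of FT's underperformance. It is more pronounced on tasks with large disparity, while on tasks with similar distributions both methods overfit only slightly.
The additional parameters introduced by prompts help the model escape local minima, but they are not the main source of VPT's advantage; preserving the original features is what matters.
Visualizations (GradCAM) suggest that prompts steer the model toward meaningful regions and may encourage feature learning beyond what FT achieves.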
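To make the "additional parameters" point concrete, here is a rough sketch of how few parameters VPT trains compared with FT. It is an illustration under stated assumptions, not the paper's code: the backbone is taken to be a standard ViT-B/16 layout (224x224 input, hidden size 768, 12 blocks, MLP ratio 4), VPT is modeled as learnable prompt tokens added at the input (shallow) or at every block (deep) with the backbone frozen, and the prompt length of 50 and the helper names are hypothetical.

```python
def linear_params(in_features: int, out_features: int) -> int:
    """Weights plus biases of a dense layer."""
    return in_features * out_features + out_features


def vit_backbone_params(
    hidden: int = 768,
    layers: int = 12,
    mlp_ratio: int = 4,
    patch: int = 16,
    image: int = 224,
    channels: int = 3,
) -> int:
    """Parameter count of a standard ViT encoder without a classification head."""
    tokens = (image // patch) ** 2 + 1  # patch tokens plus the class token
    # The patch embedding is a convolution; its count equals a dense layer
    # applied to each flattened patch.
    patch_embed = linear_params(patch * patch * channels, hidden)
    cls_and_pos = hidden + tokens * hidden
    layer_norm = 2 * hidden  # scale and shift
    attention = linear_params(hidden, 3 * hidden) + linear_params(hidden, hidden)
    mlp = linear_params(hidden, mlp_ratio * hidden) + linear_params(
        mlp_ratio * hidden, hidden
    )
    block = 2 * layer_norm + attention + mlp
    return patch_embed + cls_and_pos + layers * block + layer_norm


def trainable_params(
    method: str,
    num_classes: int,
    prompt_len: int = 50,  # hypothetical prompt length, tuned per task in practice
    hidden: int = 768,
    layers: int = 12,
) -> int:
    """Trainable parameters for 'ft', 'vpt_shallow', or 'vpt_deep'."""
    head = linear_params(hidden, num_classes)
    if method == "ft":
        # Full finetuning updates every backbone weight plus the head.
        return vit_backbone_params(hidden=hidden, layers=layers) + head
    if method == "vpt_shallow":
        # Prompts only at the input; backbone frozen.
        return prompt_len * hidden + head
    if method == "vpt_deep":
        # A fresh set of prompts at every block; backbone frozen.
        return layers * prompt_len * hidden + head
    raise ValueError(f"unknown method: {method}")


# Usage: compare the trainable fraction for a 100-class downstream task.
ft = trainable_params("ft", num_classes=100)
vpt = trainable_params("vpt_deep", num_classes=100)
print(f"FT trains {ft:,} parameters; VPT-deep trains {vpt:,} ({vpt / ft:.2%})")
```

Under these assumptions VPT updates well under 1% of the weights FT updates. According to the findings above, that small parameter budget is not the main source of VPT's advantage; the frozen, preserved backbone features are.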

This review was generated by AI and checked by a human editor.