QUICK REVIEW

[Paper Review] Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time

Mitchell Wortsman, Gabriel Ilharco|arXiv (Cornell University)|Mar 10, 2022

Domain Adaptation and Few-Shot Learning | Cited by 205

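One-line summary

Averaging the weights of multiple independently fine-tuned models (a "model soup") often outperforms the best single model and comes close to ensemble performance at no additional inference cost. The greedy soup is highlighted as particularly effective.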

ABSTRACT

The conventional recipe for maximizing model accuracy is to (1) train multiple models with various hyperparameters and (2) pick the individual model which performs best on a held-out validation set, discarding the remainder. In this paper, we revisit the second step of this procedure in the context of fine-tuning large pre-trained models, where fine-tuned models often appear to lie in a single low error basin. We show that averaging the weights of multiple models fine-tuned with different hyperparameter configurations often improves accuracy and robustness. Unlike a conventional ensemble, we may average many models without incurring any additional inference or memory costs -- we call the results "model soups." When fine-tuning large pre-trained models such as CLIP, ALIGN, and a ViT-G pre-trained on JFT, our soup recipe provides significant improvements over the best model in a hyperparameter sweep on ImageNet. The resulting ViT-G model, which attains 90.94% top-1 accuracy on ImageNet, achieved a new state of the art. Furthermore, we show that the model soup approach extends to multiple image classification and natural language processing tasks, improves out-of-distribution performance, and improves zero-shot performance on new downstream tasks. Finally, we analytically relate the performance similarity of weight-averaging and logit-ensembling to flatness of the loss and confidence of the predictions, and validate this relation empirically. Code is available at https://github.com/mlfoundations/model-soups.

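Motivation and objectives

Motivate weight averaging across diverse hyperparameter configurations as an alternative to selecting the single best model.
Show that model soups improve accuracy and robustness on both in-distribution and distribution-shift data.
Show that greedy soups outperform uniform averaging and approach ensemble performance without additional inference cost.
Explore applicability to vision tasks (CLIP, ALIGN, ViT-G) and to NLP tasks, including the GLUE benchmark.
Provide analytical insight into when weight averaging matches logit ensembling, and how this relates to the flatness of the loss landscape.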

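Proposed method

Fine-tune large pre-trained models (e.g., CLIP, ALIGN, ViT-G) over a broad hyperparameter sweep.
Build a model soup by averaging the weights of the fine-tuned models, either all of them (uniform soup) or through a greedy procedure that adds a model only when doing so improves held-out validation accuracy.
Compare soups against ensembles and the best individual model, both in-distribution and under distribution shift.
Analyze the weight average of two models to relate soup performance to logit ensembling and to the flatness of the loss.
Evaluate across image classification, distribution-shift datasets, and preliminary NLP experiments on GLUE tasks.
Release open-source code for model soups in the referenced GitHub repository.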
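
The uniform and greedy procedures described above can be sketched in a few lines. This is a minimal illustration in pure Python, not the authors' code from the repository: the `Weights` type, the function names, and the toy `toy_val_acc` scorer are all hypothetical, and weights are flat lists of floats rather than tensors. Accepting a candidate when the averaged model scores at least as well as the current soup is an assumption of this sketch.

```python
from typing import Callable, Dict, List

# Hypothetical representation: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def average_weights(models: List[Weights]) -> Weights:
    """Element-wise mean of same-shaped parameter dicts (a uniform soup)."""
    if not models:
        raise ValueError("need at least one model")
    n = len(models)
    return {
        name: [sum(m[name][i] for m in models) / n for i in range(len(values))]
        for name, values in models[0].items()
    }


def greedy_soup(models: List[Weights],
                val_acc: Callable[[Weights], float]) -> Weights:
    """Visit models in order of decreasing validation accuracy and keep a
    candidate only if the averaged weights score at least as well as the
    current soup (assumed acceptance rule)."""
    ranked = sorted(models, key=val_acc, reverse=True)
    ingredients = [ranked[0]]
    best = val_acc(ranked[0])
    for candidate in ranked[1:]:
        score = val_acc(average_weights(ingredients + [candidate]))
        if score >= best:
            ingredients.append(candidate)
            best = score
    return average_weights(ingredients)


# Toy stand-in for held-out accuracy: higher when the weight is near 2.0.
def toy_val_acc(w: Weights) -> float:
    return -(w["w"][0] - 2.0) ** 2


models = [{"w": [1.0]}, {"w": [3.0]}, {"w": [10.0]}]
uniform = average_weights(models)        # averages all three models
greedy = greedy_soup(models, toy_val_acc)  # skips the outlier at 10.0
```

In this toy setup the greedy soup keeps the first two models and rejects the third, because adding it would lower the held-out score, while the uniform soup is pulled toward the outlier. In practice `val_acc` would evaluate a real network on a held-out validation set.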

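Experimental results

Research questions

RQ1: Does averaging the weights of independently fine-tuned models yield higher accuracy than selecting the best individual model?
RQ2: How do uniform and greedy soups compare in accuracy and robustness on vision and NLP tasks?
RQ3: How does model soup performance relate to ensemble performance and to the flatness of the loss landscape?
RQ4: Are model soups applicable to large vision transformers and, beyond image classification, to language models?
RQ5: What are the limits of model soups in terms of calibration and applicability across datasets and tasks?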

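Key findings

Greedy soups often outperform the best individual model on ImageNet and on distribution-shift datasets, with no additional training or inference cost.
When fine-tuning CLIP and ALIGN on ImageNet, the greedy soup improves over the best single model by 0.7 and 0.5 percentage points, respectively.
On ImageNet, the ViT-G/14 model soup reaches 90.94% top-1 accuracy, a new state of the art according to the abstract, while using fewer FLOPs than the previous method.
Model soups improve performance on image classification, distribution shift, and several NLP GLUE tasks, but gains in calibration are limited.
An analytical approximation relates the similarity between soup and ensemble performance to the flatness of the loss landscape and the confidence of predictions, and this relation is validated empirically.
Greedy soups offer a practical alternative to ensembling when resources are constrained.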
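
The relation between weight averaging and logit ensembling mentioned above can be made concrete with a toy case. The sketch below uses hypothetical helper names and hand-picked numbers; it is not the paper's analysis. It shows that for a purely linear classifier the two coincide exactly, whereas a single ReLU unit is already enough to make them differ. It also shows the cost difference: the ensemble runs one forward pass per model, the soup runs one in total.

```python
from typing import List

Matrix = List[List[float]]


def logits(W: Matrix, x: List[float]) -> List[float]:
    """Linear classifier: one logit per row of W."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def average_matrices(mats: List[Matrix]) -> Matrix:
    """Element-wise mean of same-shaped weight matrices (the soup)."""
    n = len(mats)
    rows, cols = len(mats[0]), len(mats[0][0])
    return [[sum(m[r][c] for m in mats) / n for c in range(cols)]
            for r in range(rows)]


def ensemble_logits(mats: List[Matrix], x: List[float]) -> List[float]:
    """Average the outputs of every model: len(mats) forward passes."""
    outs = [logits(W, x) for W in mats]
    return [sum(o[k] for o in outs) / len(outs) for k in range(len(outs[0]))]


W_a = [[1.0, 0.0], [0.0, 1.0]]
W_b = [[3.0, 2.0], [-2.0, 1.0]]
x = [1.0, 2.0]

soup_out = logits(average_matrices([W_a, W_b]), x)  # one forward pass
ens_out = ensemble_logits([W_a, W_b], x)            # two forward passes


# With a nonlinearity the two no longer agree in general.
def relu_unit(w: float, x_in: float) -> float:
    return max(0.0, w * x_in)


relu_soup = relu_unit((-1.0 + 3.0) / 2, 1.0)
relu_ens = (relu_unit(-1.0, 1.0) + relu_unit(3.0, 1.0)) / 2
```

For deep networks the outputs are nonlinear in the weights, so the soup and the ensemble are not guaranteed to agree. That gap is what the review's point about loss flatness and prediction confidence refers to.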

This review was generated by AI and checked by a human editor.