QUICK REVIEW

[Paper Review] Dataset Pruning: Reducing Training Data by Examining Generalization Influence

Shuo Yang, Zeke Xie|arXiv (Cornell University)|May 19, 2022

Machine Learning and Data Classification | Citations: 21

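One-line summary

Proposes an optimization-based dataset pruning method that removes the largest possible subset of the training data while keeping the generalization gap within a predefined bound, improving training efficiency.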

ABSTRACT

The great success of deep learning heavily relies on increasingly larger training data, which comes at a price of huge computational and infrastructural costs. This poses crucial questions that, do all training data contribute to model's performance? How much does each individual training sample or a sub-training-set affect the model's generalization, and how to construct the smallest subset from the entire training data as a proxy training set without significantly sacrificing the model's performance? To answer these, we propose dataset pruning, an optimization-based sample selection method that can (1) examine the influence of removing a particular set of training samples on model's generalization ability with theoretical guarantee, and (2) construct the smallest subset of training data that yields strictly constrained generalization gap. The empirically observed generalization gap of dataset pruning is substantially consistent with our theoretical expectations. Furthermore, the proposed method prunes 40% training examples on the CIFAR-10 dataset, halves the convergence time with only 1.3% test accuracy decrease, which is superior to previous score-based sample selection methods.

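Motivation and objectives

Motivates and defines the dataset pruning problem as reducing the training data while satisfying a guaranteed generalization bound.
Uses influence functions to approximate the parameter change caused by removing data points.
Formulates and solves a discrete optimization that maximizes the amount of pruned data while constraining the parameter change.
Provides theoretical generalization guarantees and empirical validation across datasets and architectures.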

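Proposed method

Defines an ε-redundant subset based on the norm of the parameter change between the model trained on the full data and the model trained on the pruned data.
Estimates each sample's parameter influence using influence functions and the inverse Hessian: I_param(z) = -H_theta^{-1} grad_theta L(z, theta).
Approximates a subset's influence by summing the per-sample influences, and constrains the L2 norm of sum_{z in D_hat} I_param(z) to be at most ε.
Formulates two discrete optimization problems: (a) generalization-guaranteed pruning, which maximizes the size of the pruned subset under the ε constraint; (b) cardinality-guaranteed pruning, which minimizes the parameter change for a fixed number m of pruned samples.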
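
As a rough illustration of the steps above, the following sketch applies them to a small L2-regularized logistic regression in NumPy instead of a deep network. Everything in it is an assumption made for illustration: the synthetic data, the helper names (`fit`, `param_influence`, `greedy_prune`, `removal_effect`), the regularization strength, and the sizes `n`, `d`, `m`. In particular, the greedy selection is a simple stand-in heuristic for the cardinality-constrained problem (b), not the solver used in the paper.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit(X, y, w, lam, iters=25):
    """Newton's method on (1/n) * sum_i w_i * logloss_i + (lam/2) * ||theta||^2."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(iters):
        p = sigmoid(X @ theta)
        grad = X.T @ (w * (p - y)) / n + lam * theta
        H = (X.T * (w * p * (1 - p))) @ X / n + lam * np.eye(d)
        theta = theta - np.linalg.solve(H, grad)
    return theta

def param_influence(X, y, theta, lam):
    """Row i is I_param(z_i) = -H^{-1} grad_theta L(z_i, theta)."""
    n, d = X.shape
    p = sigmoid(X @ theta)
    H = (X.T * (p * (1 - p))) @ X / n + lam * np.eye(d)
    grads = (p - y)[:, None] * X  # per-sample loss gradients, shape (n, d)
    return -np.linalg.solve(H, grads.T).T

def greedy_prune(infl, m):
    """Assumed heuristic: pick m samples so the L2 norm of their summed influence stays small."""
    chosen, total = [], np.zeros(infl.shape[1])
    available = np.ones(len(infl), dtype=bool)
    for _ in range(m):
        norms = np.linalg.norm(total + infl, axis=1)
        norms[~available] = np.inf  # never pick the same sample twice
        i = int(np.argmin(norms))
        chosen.append(i)
        available[i] = False
        total += infl[i]
    return chosen

# Synthetic binary classification data (illustrative only).
rng = np.random.default_rng(0)
n, d, m, lam = 200, 5, 40, 0.1
X = rng.normal(size=(n, d))
y = (rng.random(n) < sigmoid(X @ rng.normal(size=d))).astype(float)

theta_full = fit(X, y, np.ones(n), lam)
infl = param_influence(X, y, theta_full, lam)

def removal_effect(idx):
    """Return (predicted, actual) parameter change from removing the samples in idx."""
    w = np.ones(n)
    w[idx] = 0.0
    predicted = -infl[idx].sum(axis=0) / n  # removal = upweighting each sample by -1/n
    actual = fit(X, y, w, lam) - theta_full
    return predicted, actual

random_idx = rng.choice(n, size=m, replace=False)
greedy_idx = greedy_prune(infl, m)
for name, idx in [("random", random_idx), ("greedy", greedy_idx)]:
    pred, act = removal_effect(idx)
    print(name, "predicted:", np.linalg.norm(pred), "actual:", np.linalg.norm(act))
```

The retraining step keeps the 1/n normalization and simply zeroes the weights of the removed samples, so the first-order prediction and the retrained model refer to the same objective. Comparing the two printed norms for a random subset against the greedy subset gives a feel for why constraining the summed influence, not just per-sample scores, matters.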

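Experimental results

Research questions

RQ1: How large a subset can be removed from the training data without exceeding a predefined generalization gap?
RQ2: When a group of samples is removed, can the influence-function approximation reliably predict the effect on generalization?
RQ3: Do pruned datasets preserve performance across architectures and make neural architecture search more efficient?
RQ4: What is the empirical trade-off between pruning ratio, generalization, and training efficiency?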

Key findings

Metric	Random	Herding	Forgetting	Ours	Full dataset
Performance (%)	79.4	80.1	82.5	85.7	85.9
Correlation	0.21	0.23	0.79	0.94	1.00
Time cost (min)	113	113	113	113	3029
Storage (imgs)	10^3	10^3	10^3	10^3	5×10^4
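
To put the last two columns of the table above in relative terms, the short calculation below derives the time and storage ratios and the performance gap between the proposed method's subset and the full dataset. The variable names are illustrative; the numbers are copied directly from the table.

```python
# Values copied from the last two columns of the table (proposed method vs. full dataset).
time_ours_min, time_full_min = 113, 3029
imgs_ours, imgs_full = 10**3, 5 * 10**4
perf_ours, perf_full = 85.7, 85.9

speedup = time_full_min / time_ours_min      # how many times less time is needed
storage_ratio = imgs_full / imgs_ours        # how many times fewer images are stored
perf_gap = round(perf_full - perf_ours, 1)   # performance difference in percentage points

print(f"time: {speedup:.1f}x less, storage: {storage_ratio:.0f}x less, gap: {perf_gap} points")
# prints: time: 26.8x less, storage: 50x less, gap: 0.2 points
```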

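The pruning method removes 40% of the CIFAR-10 training examples with only a 1.3% drop in test accuracy.
The observed generalization gap is consistent with the ε-based theoretical bound.
The optimization-based pruning outperforms the Random, Herding, Forgetting, GraNd, EL2N, and plain influence-score baselines, especially at high pruning ratios.
Pruned data generalizes to unseen architectures (e.g., a dataset pruned with a small network transfers to ResNet18/50).
Pruned datasets substantially reduce training time (e.g., convergence time on CIFAR-10 is nearly halved) with minimal performance loss.
In NAS-style experiments, the pruned proxy dataset yields architectures that perform comparably to those found with the full dataset, while greatly reducing search time and storage.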

This review was generated by AI and checked by a human editor.