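[Paper Review] Dataset Distillation using Neural Feature Regression
tldr: FRePo learns a small synthetic dataset through neural feature regression with a model pool, achieving state-of-the-art results at far lower memory and time cost than prior methods. It scales to high-resolution data and transfers across architectures.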
Dataset distillation aims to learn a small synthetic dataset that preserves most of the information from the original dataset. Dataset distillation can be formulated as a bi-level meta-learning problem where the outer loop optimizes the meta-dataset and the inner loop trains a model on the distilled data. Meta-gradient computation is one of the key challenges in this formulation, as differentiating through the inner loop learning procedure introduces significant computation and memory costs. In this paper, we address these challenges using neural Feature Regression with Pooling (FRePo), achieving the state-of-the-art performance with an order of magnitude less memory requirement and two orders of magnitude faster training than previous methods. The proposed algorithm is analogous to truncated backpropagation through time with a pool of models to alleviate various types of overfitting in dataset distillation. FRePo significantly outperforms the previous methods on CIFAR100, Tiny ImageNet, and ImageNet-1K. Furthermore, we show that high-quality distilled data can greatly improve various downstream applications, such as continual learning and membership inference defense. Please check out our webpage at https://sites.google.com/view/frepo.
Motivation and Objectives
- Motivate and formalize dataset distillation as a bi-level meta-learning problem.
- Develop an efficient meta-gradient computation that avoids unrolled inner optimization.
- Introduce a model pool to mitigate various overfitting modes in distillation.
- Demonstrate that distilled data transfer across architectures and support downstream tasks like continual learning and privacy defense.
Proposed Method
- Formulate S* as the outer optimization minimizing the expected meta-training loss over model initializations.
- Fix a feature extractor and train a linear classifier on distilled data, enabling kernel ridge regression (KRR) with a conjugate kernel to compute meta-gradients efficiently.
- Compute meta-gradients by backpropagating through the kernel matrices K_theta_{X_t X_s} and K_theta_{X_s X_s} instead of unrolling inner optimization.
- Maintain a diverse model pool to increase task diversity and reduce overfitting, reinitializing models after K updates and sampling from the pool at each step.
- Iteratively update distilled data S via gradient steps using the meta-gradient from Eq. (2) and update the online model on S for one step per iteration.
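The steps above can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not the authors' implementation: a single ReLU layer stands in for the feature extractor, central finite differences stand in for backpropagation through the kernel matrices, a plain gradient step replaces whatever optimizer the paper uses, and every name (`features`, `krr_loss`, `meta_grad`, `init_model`, `train_one_step`, `distill`, `max_age`) is hypothetical. `max_age` plays the role of K, the number of updates after which a pooled model is reinitialized.

```python
import numpy as np

def features(x, w):
    # Hypothetical feature extractor: one ReLU layer standing in for a ConvNet.
    return np.maximum(x @ w, 0.0)

def krr_loss(x_s, y_s, x_t, y_t, w, ridge=1e-2):
    # Fit kernel ridge regression on the distilled set, score it on a real batch.
    f_s, f_t = features(x_s, w), features(x_t, w)
    k_ss = f_s @ f_s.T  # conjugate kernel between distilled points
    k_ts = f_t @ f_s.T  # conjugate kernel between real and distilled points
    alpha = np.linalg.solve(k_ss + ridge * np.eye(len(x_s)), y_s)
    return 0.5 * np.sum((y_t - k_ts @ alpha) ** 2)

def meta_grad(x_s, y_s, x_t, y_t, w, eps=1e-5):
    # Central finite differences w.r.t. x_s (assumption: stand-in for autodiff).
    grad = np.zeros_like(x_s)
    for idx in np.ndindex(*x_s.shape):
        step = np.zeros_like(x_s)
        step[idx] = eps
        grad[idx] = (krr_loss(x_s + step, y_s, x_t, y_t, w)
                     - krr_loss(x_s - step, y_s, x_t, y_t, w)) / (2 * eps)
    return grad

def init_model(rng, d_in, d_feat, n_cls):
    # Fresh model: random feature weights, zero linear head, age counter.
    return {"w": rng.normal(size=(d_in, d_feat)) / np.sqrt(d_in),
            "head": np.zeros((d_feat, n_cls)), "age": 0}

def train_one_step(model, x_s, y_s, lr=0.1):
    # One SGD step on mean squared error over the distilled data.
    pre = x_s @ model["w"]
    f = np.maximum(pre, 0.0)
    err = (f @ model["head"] - y_s) / len(x_s)
    grad_head = f.T @ err
    grad_w = x_s.T @ ((err @ model["head"].T) * (pre > 0))
    model["head"] -= lr * grad_head
    model["w"] -= lr * grad_w
    model["age"] += 1

def distill(x_t, y_t, x_s, y_s, steps=200, pool_size=4, max_age=10,
            lr_data=0.05, batch=32, d_feat=16, seed=0):
    rng = np.random.default_rng(seed)
    d_in, n_cls = x_t.shape[1], y_t.shape[1]
    pool = [init_model(rng, d_in, d_feat, n_cls) for _ in range(pool_size)]
    x_s = x_s.copy()
    for _ in range(steps):
        i = rng.integers(pool_size)  # sample a model from the pool
        b = rng.choice(len(x_t), size=min(batch, len(x_t)), replace=False)
        # Outer update: move the distilled inputs along the KRR meta-gradient.
        x_s -= lr_data * meta_grad(x_s, y_s, x_t[b], y_t[b], pool[i]["w"])
        # Inner update: train the sampled model on the distilled data for one step.
        train_one_step(pool[i], x_s, y_s)
        if pool[i]["age"] >= max_age:  # reinitialize after max_age updates
            pool[i] = init_model(rng, d_in, d_feat, n_cls)
    return x_s
```

Because the inner problem is solved in closed form by `np.linalg.solve`, the outer gradient never needs to unroll an inner training loop. In a real setting the finite-difference loop would be replaced by automatic differentiation, since its cost grows with the number of distilled pixels.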
Experimental Results
Research Questions
- RQ1: Can neural feature regression with pooling (FRePo) outperform prior dataset distillation methods in accuracy and efficiency?
- RQ2: Does decoupling meta-gradient computation via kernel ridge regression enable scalable distillation on high-resolution and complex-label datasets?
- RQ3: How does maintaining a diverse model pool affect overfitting and generalization of distilled data across architectures?
- RQ4: Do distilled data transfer well to unseen architectures and support downstream tasks such as continual learning and privacy defense?
Key Findings
| Dataset / images per class | DSA [7] | DM [8] | KIP [23] | MTT [20] | FRePo |
|---|---|---|---|---|---|
| MNIST | 89.0–93.0 | 89.9 | 90.1 | 91.4 | 93.0 (92.6) |
| 10 | 97.6–97.9 | 97.6 | 97.5 | 97.3 | 98.6 (98.6) |
| 50 | 98.6 | 98.6 | 98.3 | 98.5 | 99.2 (99.2) |
| F-MNIST | 1 | 71.5–75.6 (77.1) | 73.5 | 75.1 | 75.6 (77.1) |
| 10 | 83.6–87.2 (86.2–86.8) | 86.8 | 86.8 | 87.2 | 86.2 (86.8) |
| 50 | 88.2–89.6 (89.9) | 88.0 | 88.3 | 88.3 | 89.6 (89.9) |
| CIFAR10 | 31.0–36.7 (46.8) | 49.9 | 62.7 | 65.3 | 65.5 (68.0) |
| 1 | 36.7–46.8 (47.9) | 31.0 | 62.7 | 65.5 | 46.8 (47.9) |
| 10 | 53.2–65.5 (68.0) | 49.2 | 64.4 | 65.3 | 65.5 (68.0) |
| 50 | 66.8–71.7 (74.4) | 63.7 | 68.3–68.6 | 71.6 | 71.7 (74.4) |
| CIFAR100 | 1 | 12.2–16.8 (32.3) | 12.2 | 15.7 | 28.7 (32.3) |
| 10 | 29.7–40.1 (44.9) | 29.7 | 28.3 | 40.1 | 42.5 (44.9) |
| 50 | 43.6–44.3 (43.0) | 43.6 | - | 47.7 (43.0) | 44.3 (43.0) |
| T-ImageNet | 1 | 3.9 | - | 8.8 | 15.4 (19.1) |
| 10 | 12.9 | - | - | 23.2 | 25.4 (26.5) |
| CUB-200 | 1 | 1.6 | - | 2.2 | 12.4 (13.7) |
| 10 | 4.4 | - | - | 16.8 | 16.1 (16.7) |
- FRePo achieves state-of-the-art results on standard benchmarks with substantially lower training time and memory (about 100x faster and 10x less GPU memory than prior methods).
- On CIFAR100, Tiny ImageNet, and CUB-200 with one image per class, FRePo improves test accuracies significantly (e.g., CIFAR100 28.7% vs 24.3% prior; Tiny ImageNet 15.4% vs 8.8%; CUB-200 12.4% vs 2.2%).
- FRePo scales to high-resolution data, achieving 7.5% Top-1 on ImageNet-1K with one image per class after resizing to 64x64, outperforming prior methods.
- FRePo’s distilled data transfer well across architectures and does not rely on a single architecture’s inductive biases, unlike several baselines.
- Distilled data improve continual learning performance and privacy defenses, showing higher final accuracy and lower attack success (MIA AUC around random guessing).
- The KRR-based meta-gradient can be computed efficiently via backpropagation through kernel matrices, removing the need to backpropagate through full unrolled inner optimization.
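
For reference, the standard kernel ridge regression objective that this kind of meta-gradient differentiates can be written as below. This is a sketch in the review's notation; we assume it matches the form of the paper's Eq. (2). Here $f_\theta$ denotes the fixed feature extractor, $\lambda$ the ridge regularizer (a symbol introduced here for illustration), and $(X_t, Y_t)$ a batch of real data.

$$
\mathcal{L}(X_s, Y_s) = \frac{1}{2} \left\lVert Y_t - K^{\theta}_{X_t X_s} \left( K^{\theta}_{X_s X_s} + \lambda I \right)^{-1} Y_s \right\rVert_2^2,
\qquad
K^{\theta}_{X_t X_s} = f_\theta(X_t)\, f_\theta(X_s)^{\top}
$$

The distilled inputs $X_s$ enter only through the two kernel matrices, so the gradient with respect to $X_s$ comes from differentiating those matrices and the linear solve, not from an unrolled training trajectory.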
This review was generated by AI and checked by a human editor.