QUICK REVIEW

[Paper Review] Dataset Distillation via Factorization

Songhua Liu, Kai Wang|arXiv (Cornell University)|Oct 30, 2022

Image Processing Techniques and Applications | Cited by 59

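One-Sentence Summary

HaBa introduces a hallucinator-basis factorization for dataset distillation, yielding expressive synthetic data with fewer parameters and improving downstream performance, including cross-architecture gains. It adds adversarial contrastive constraints to increase diversity and informativeness, and it plugs into existing DD baselines.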

ABSTRACT

In this paper, we study dataset distillation (DD) from a novel perspective and introduce a dataset factorization approach, termed HaBa, which is a plug-and-play strategy portable to any existing DD baseline. Unlike conventional DD approaches that aim to produce distilled and representative samples, HaBa explores decomposing a dataset into two components: data Hallucination networks and Bases, where the latter is fed into the former to reconstruct image samples. The flexible combinations between bases and hallucination networks therefore equip the distilled data with exponential informativeness gain, which largely increases the representation capability of distilled datasets. To further increase the data efficiency of compression results, we introduce a pair of adversarial contrastive constraints on the resultant hallucination networks and bases, which increase the diversity of generated images and inject more discriminant information into the factorization. Extensive comparisons and experiments demonstrate that our method can yield significant improvement on downstream classification tasks compared with the previous state of the art, while reducing the total number of compressed parameters by up to 65%. Moreover, datasets distilled by our approach also achieve ~10% higher accuracy than baseline methods in cross-architecture generalization. Our code is available at https://github.com/Huage001/DatasetFactorization.

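Motivation and Objectives

Motivate and address the data and storage efficiency problem in dataset distillation (DD).
Propose factorizing the synthetic data into bases and hallucinators to increase informativeness.
Introduce adversarial contrastive constraints to make the generated data more diverse.
Demonstrate plug-and-play compatibility with existing DD baselines and show performance gains.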

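Proposed Method

Factorize the synthetic data into a set of bases B and hallucinators H, so that S = {H_theta_j} ∪ {(x_hat_i, y_hat_i)}.
Each hallucinator takes a basis as input and outputs a hallucinated image through an encoder-transformer-decoder pipeline with affine scaling and shifting.
Introduce an adversarial contrastive loss L_cos and an (optionally supervised) contrastive loss L_con to maximize the diversity of samples generated from the same basis and to reduce redundancy.
Combine the task loss L_task with the DD objective L_DD; train in an alternating, end-to-end differentiable procedure; HaBa works as a plug-in for existing DD objectives.
Optionally combine with the concurrent efficient data parameterization (IDC), and evaluate cross-architecture generalization.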
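
The factorization above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the toy hallucinator here is only a per-channel affine scale and shift (the paper describes an encoder-transformer-decoder network), and the names make_hallucinator, compose, and mean_pairwise_cosine are hypothetical helpers introduced for illustration. The sketch shows how every basis is paired with every hallucinator, and how a cosine-similarity score across hallucinators for the same basis could serve as a rough diversity measure in the spirit of L_cos.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_hallucinator(channels, rng):
    """Toy hallucinator: per-channel affine scale and shift.
    Assumption: stands in for the encoder-transformer-decoder network."""
    scale = 1.0 + 0.1 * rng.standard_normal((channels, 1, 1))
    shift = 0.1 * rng.standard_normal((channels, 1, 1))
    return lambda basis: scale * basis + shift

def compose(bases, hallucinators):
    """Pass every basis through every hallucinator.
    Returns an array of shape (num_hallucinators, num_bases, C, H, W)."""
    return np.stack([np.stack([h(b) for b in bases]) for h in hallucinators])

def mean_pairwise_cosine(images):
    """Average cosine similarity between outputs of *different*
    hallucinators for the *same* basis; lower means more diverse."""
    n_h, n_b = images.shape[:2]
    flat = images.reshape(n_h, n_b, -1)
    flat = flat / np.linalg.norm(flat, axis=-1, keepdims=True)
    sims = np.einsum("ibd,jbd->ijb", flat, flat)  # shape (n_h, n_h, n_b)
    off_diag = ~np.eye(n_h, dtype=bool)           # drop self-similarity
    return float(sims[off_diag].mean())

# Hypothetical sizes: 10 bases (one per class), 5 hallucinators, 3x32x32 images.
bases = rng.standard_normal((10, 3, 32, 32))
labels = np.arange(10)
hallucinators = [make_hallucinator(3, rng) for _ in range(5)]

images = compose(bases, hallucinators)
# Labels follow the bases, repeated once per hallucinator.
composed_labels = np.tile(labels, len(hallucinators))

print(images.shape)                   # one image per (hallucinator, basis) pair
print(mean_pairwise_cosine(images))   # diversity score to be minimized
```

In a real training loop the bases and hallucinator weights would be learnable tensors in an autodiff framework, and this similarity term would be added to the DD objective rather than merely printed.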

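Experimental Results

Research Questions

RQ1: Under the same storage budget, can HaBa improve downstream model performance over state-of-the-art DD baselines?
RQ2: Does factorizing the data into bases and hallucinators increase diversity and informativeness without increasing storage?
RQ3: How does HaBa affect cross-architecture generalization (training on one architecture and evaluating on others)?
RQ4: What effect do the adversarial contrastive constraints have on performance and diversity?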

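Key Findings

HaBa significantly outperforms prior DD methods on the SVHN, CIFAR10, and CIFAR100 benchmarks.
HaBa reduces the total number of compressed parameters by up to 65% while still improving downstream performance.
In cross-architecture generalization settings, HaBa achieves about 10% higher accuracy than baseline methods.
The bases store the core structure, while the hallucinators render diverse styles, increasing data diversity without additional storage.
When built on top of several DD baselines (DC, DM, MTT), HaBa shows consistent improvements and delivers cross-architecture gains across several networks (ConvNet, ResNet, VGG, AlexNet).
Qualitative visualizations show that different hallucinators generate diverse images from shared bases, making the distilled dataset more informative.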
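
The storage argument behind these findings can be made concrete with a small accounting sketch. All sizes below are hypothetical placeholders chosen for illustration, not figures from the paper, so the resulting ratio should not be compared with the reported 65%. The sketch only shows the bookkeeping: a factorized dataset stores the bases plus the hallucinator weights, yet yields one image for every (basis, hallucinator) pair.

```python
def factorized_params(num_bases, image_shape, num_hallucinators, params_per_hallucinator):
    """Scalars stored by a factorized dataset: bases plus hallucinator weights."""
    c, h, w = image_shape
    return num_bases * c * h * w + num_hallucinators * params_per_hallucinator

def raw_params(num_images, image_shape):
    """Scalars stored when every synthetic image is kept as raw pixels."""
    c, h, w = image_shape
    return num_images * c * h * w

# Hypothetical configuration (assumption, not taken from the paper).
shape = (3, 32, 32)
num_bases = 100
num_hallucinators = 5
params_per_hallucinator = 10_000

num_composed = num_bases * num_hallucinators  # images available after composition
factored = factorized_params(num_bases, shape, num_hallucinators, params_per_hallucinator)
raw = raw_params(num_composed, shape)
saving = 1.0 - factored / raw

print(f"composed images: {num_composed}")
print(f"factorized storage: {factored} scalars")
print(f"raw storage for the same image count: {raw} scalars")
print(f"relative saving: {saving:.1%}")
```

The saving grows with the number of hallucinators as long as each hallucinator is much smaller than the set of images it helps produce, which is the trade-off the factorization relies on.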


This review was generated by AI and reviewed by a human editor.