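[Paper Digest] TADS: Task-Aware Data Selection for Multi-Task Multimodal Pre-Training
TADS introduces a learnable, task-aware data selection framework that jointly optimizes intrinsic quality, task relevance, and diversity to select high-utility multimodal data for multi-task pre-training, improving zero-shot performance while using less data.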
Large-scale multimodal pre-trained models like CLIP rely heavily on high-quality training data, yet raw web-crawled datasets are often noisy, misaligned, and redundant, leading to inefficient training and suboptimal generalization. Existing data selection methods are either heuristic-based, suffering from bias and limited diversity, or data-driven but task-agnostic, failing to optimize for multi-task scenarios. To address these gaps, we introduce TADS (Task-Aware Data Selection), a novel framework for multi-task multimodal pre-training that integrates Intrinsic Quality, Task Relevance, and Distributional Diversity into a learnable value function. TADS employs a comprehensive quality assessment system with unimodal and cross-modal operators, quantifies task relevance via interpretable similarity vectors, and optimizes diversity through cluster-based weighting. A feedback-driven meta-learning mechanism adaptively refines the selection strategy based on proxy model performance across multiple downstream tasks. Experiments on CC12M demonstrate that TADS achieves superior zero-shot performance on benchmarks like ImageNet, CIFAR-100, MS-COCO, and Flickr30K, using only 36% of the data while outperforming baselines by an average of 1.0%. This highlights that TADS significantly enhances data efficiency by curating a high-utility subset that yields a much higher performance ceiling within the same computational constraints.
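Motivation and Objectives
- Motivate data selection for very-large-scale multimodal pre-training as an alternative to simply scaling up data volume.
- Propose a unified, learnable framework that integrates intrinsic quality, task relevance, and diversity.
- Develop a feedback-driven meta-learning loop that optimizes subset selection across multiple downstream tasks.
- Provide a comprehensive deduplication and quality assessment pipeline that produces reliable quality signals.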
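Proposed Method
- Multi-level data deduplication that reduces redundancy while retaining informative samples.
- A three-dimensional representation of data value: intrinsic quality, task relevance, and distributional diversity.
- A Data Value Network (DVN) that aggregates the quality, relevance, and diversity signals into a single selection score.
- A bilevel, feedback-driven optimization that uses a proxy model to simulate downstream performance and guide gradient-based policy updates.
- Cluster-based gradient estimation that handles the non-differentiable subset selection step and aligns it with the multi-task objective.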
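The scoring and selection steps listed above can be pictured with a minimal sketch. Everything in it is an illustrative assumption rather than the paper's implementation: the function names (`diversity_weights`, `value_score`, `select_top_fraction`), the single logistic unit standing in for the DVN, the inverse-cluster-frequency weighting, and the toy numbers. The bilevel feedback loop that would learn the weights is left out.

```python
import math
from collections import Counter

def diversity_weights(cluster_ids):
    """Inverse-frequency weight per sample: crowded clusters get less weight."""
    counts = Counter(cluster_ids)
    raw = [1.0 / counts[c] for c in cluster_ids]
    top = max(raw)
    return [r / top for r in raw]  # rarest cluster is normalised to 1.0

def value_score(quality, relevance, diversity, weights, bias=0.0):
    """Hypothetical stand-in for the DVN: one logistic unit over three signals."""
    w_q, w_r, w_d = weights
    mean_rel = sum(relevance) / len(relevance)  # average over downstream tasks
    z = w_q * quality + w_r * mean_rel + w_d * diversity + bias
    return 1.0 / (1.0 + math.exp(-z))

def select_top_fraction(scores, fraction):
    """Indices of the highest-scoring samples, keeping `fraction` of the pool."""
    k = max(1, round(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

# Toy pool: (intrinsic quality, per-task relevance vector, cluster id)
pool = [
    (0.9, [0.8, 0.7], 0),
    (0.2, [0.1, 0.2], 0),
    (0.8, [0.9, 0.6], 0),
    (0.7, [0.6, 0.8], 1),
    (0.3, [0.2, 0.1], 2),
]

div = diversity_weights([cluster for _, _, cluster in pool])
scores = [
    value_score(q, rel, d, weights=(1.0, 1.0, 1.0))
    for (q, rel, _), d in zip(pool, div)
]
selected = select_top_fraction(scores, fraction=0.4)
print(selected)
```

In this toy pool the sample from the singleton cluster 1 is selected ahead of a slightly higher-quality sample from the crowded cluster 0, which is the kind of trade-off a diversity term is meant to introduce. In a learned setting the fixed `weights` tuple would be replaced by parameters updated from proxy-model feedback.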
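Experimental Results
Research Questions
- RQ1: How can sample utility for multiple downstream tasks be quantified within a unified framework?
- RQ2: Under a fixed pre-training budget, does task-aware selection with quality and diversity outperform task-agnostic and single-task methods?
- RQ3: Can a feedback-driven meta-learning loop effectively adapt the selection policy to multi-task objectives?
- RQ4: How do the deduplication and diversity mechanisms affect zero-shot performance on vision-language benchmarks?
Key Findings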
| Method | Type | Data Size | ImageNet-1K Top-1 | ImageNet-1K Top-5 | CIFAR-100 Top-1 | CIFAR-100 Top-5 | MS-COCO IR@1 | MS-COCO TR@1 | Flickr30K IR@1 | Flickr30K TR@1 | AVG. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| No Filtering (Baseline) | - | ~10.97M | 28.2 | 53.1 | 25.4 | 58.1 | 27.3 | 18.3 | 46.5 | 35.2 | 36.5 |
| Text Complexity | Task-Agnostic | ~8.56M | 28.9 | 54.3 | 26.0 | 58.8 | 27.4 | 18.8 | 47.4 | 35.8 | 37.2 |
| SemDeDup | Task-Agnostic | ~4.39M | 29.6 | 54.9 | 26.5 | 59.2 | 28.9 | 19.2 | 48.1 | 36.1 | 37.8 |
| CLIP-Score | Task-Agnostic | ~6.91M | 30.1 | 55.3 | 27.2 | 60.5 | 30.7 | 20.6 | 51.9 | 38.8 | 39.4 |
| T-MARS | Task-Agnostic | ~5.49M | 30.8 | 56.4 | 27.8 | 61.0 | 30.2 | 20.2 | 50.8 | 38.3 | 39.4 |
| SIEVE | Task-Agnostic | ~3.29M | 31.7 | 57.0 | 28.5 | 62.5 | 26.6 | 19.0 | 45.2 | 36.7 | 38.4 |
| s-CLIPLoss | Task-Agnostic | ~6.58M | 32.3 | 58.5 | 29.7 | 64.1 | 32.4 | 21.8 | 54.7 | 40.5 | 41.8 |
| EcoDatum | Task-Agnostic | ~4.39M | 36.2 | 62.2 | 34.0 | 69.3 | 35.5 | 24.1 | 58.4 | 43.1 | 45.4 |
| HYPE | Task-Aware | ~3.29M | 36.5 | 62.1 | 32.5 | 67.4 | 32.1 | 22.0 | 53.2 | 40.1 | 43.2 |
| HYPE + s-CLIPLoss | Task-Aware | ~2.52M | 38.2 | 63.8 | 33.8 | 68.9 | 34.2 | 23.1 | 56.5 | 42.0 | 45.1 |
| FLYT + SCS | Task-Aware | ~10.97M | 39.5 | 66.5 | 36.8 | 72.6 | 36.9 | 25.2 | 59.8 | 45.5 | 47.9 |
| TADS (Ours) | Task-Aware | ~3.95M | 40.7 | 66.1 | 38.6 | 72.1 | 38.1 | 26.8 | 60.9 | 47.5 | 48.9 |
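- Compared with the baselines, TADS achieves stronger zero-shot performance on ImageNet-1K, CIFAR-100, MS-COCO, and Flickr30K while using only 36% of the data (~3.95M of ~10.97M samples).
- Under a fixed pre-training budget, TADS improves average multi-task performance by about 1.0% (AVG. 48.9 vs. 47.9 for FLYT + SCS, the strongest baseline in the table).
- Task-aware relevance and diversity yield better data efficiency than task-agnostic methods, breaking through the bottleneck created by noisy, wasteful data.
- Ablation studies show that the full TADS pipeline reaches the best ImageNet-1K Top-1 (40.7%), with clear gains as quality, relevance, diversity, and demand-aware optimization are added.
- Deduplication (metadata-based, semantic, and quality-guided) substantially reduces the dataset size while improving downstream accuracy.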
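The headline numbers can be cross-checked against the results table with a few lines of arithmetic. The sketch below copies the eight per-benchmark scores for three rows, recomputes the average, and derives the data fraction. The helper name `mean_one_decimal` and the half-up rounding rule are assumptions made for this check, not something the paper specifies.

```python
from decimal import Decimal, ROUND_HALF_UP

# Eight per-benchmark scores per method, copied from the results table.
# The reported AVG. column is left out so that it can be recomputed.
scores = {
    "No Filtering": ["28.2", "53.1", "25.4", "58.1", "27.3", "18.3", "46.5", "35.2"],
    "FLYT + SCS":   ["39.5", "66.5", "36.8", "72.6", "36.9", "25.2", "59.8", "45.5"],
    "TADS":         ["40.7", "66.1", "38.6", "72.1", "38.1", "26.8", "60.9", "47.5"],
}

def mean_one_decimal(values):
    """Exact decimal mean, rounded half-up to one decimal place (assumed rule)."""
    total = sum(Decimal(v) for v in values)
    return (total / len(values)).quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)

averages = {name: mean_one_decimal(vals) for name, vals in scores.items()}

# Gain of TADS over the strongest baseline average in the table.
gain = averages["TADS"] - averages["FLYT + SCS"]

# Share of the unfiltered pool that TADS keeps (sizes in millions of samples).
data_percent = (Decimal("3.95") / Decimal("10.97") * 100).quantize(
    Decimal("1"), rounding=ROUND_HALF_UP
)

print(averages, gain, data_percent)
```

Under these assumptions the recomputed averages match the AVG. column for all three rows, which also supports reading AVG. as the unweighted mean of the eight benchmark columns.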
This digest was generated by AI and reviewed by a human editor.