QUICK REVIEW

[Paper Review] Less Data, Faster Convergence: Goal-Driven Data Optimization for Multimodal Instruction Tuning

Rujie Wu, Haozhe Zhao|arXiv (Cornell University)|Mar 12, 2026

Multimodal Machine Learning Applications | Cited by 0

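One-Sentence Summary

GDO builds optimized 1x data subsets from a fixed multimodal pool, converging faster and reaching higher accuracy than the fixed 512k-sample baseline on MVBench, VideoMME, MLVU, and LVBench.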

ABSTRACT

Multimodal instruction tuning is often compute-inefficient because training budgets are spread across large mixed image-video pools whose utility is highly uneven. We present Goal-Driven Data Optimization (GDO), a framework that computes six sample descriptors for each candidate and constructs optimized 1$\times$ training subsets for different goals. Under a fixed one-epoch Qwen3-VL-8B-Instruct training and evaluation recipe on 8 H20 GPUs, GDO uses far fewer training samples than the Uni-10x baseline while converging faster and achieving higher accuracy. Relative to the fixed 512k-sample Uni-10x baseline, GDO reaches the Uni-10x reference after 35.4k samples on MVBench, 26.6k on VideoMME, 27.3k on MLVU, and 34.7k on LVBench, while improving Accuracy by +1.38, +1.67, +3.08, and +0.84 percentage points, respectively. The gains are largest on MVBench and MLVU, while LVBench improves more modestly, consistent with its ultra-long-video setting and the mismatch between that benchmark and the short-video/image-dominant training pool. Across MinLoss, Diverse, Temp, and Temp+, stronger temporal emphasis yields steadily better long-video understanding behavior. Overall, GDO provides a goal-driven data optimization framework that enables faster convergence with fewer training samples under a fixed training protocol. Code is available at https://github.com/rujiewu/GDO.

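Motivation and Objectives

Identify how data allocation affects performance and convergence in a fixed multimodal instruction-tuning setup.
Propose a reusable pipeline (GDO) that uses six sample descriptors to build optimized 1x subsets.
Show that, under fixed compute, GDO matches or exceeds baseline performance with substantially fewer training samples.
Analyze how different goal configurations affect capability and convergence, and explain why the gains vary across benchmarks.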

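Proposed Method

Fix the backbone, training recipe, checkpoints, and evaluation to isolate the effect of data optimization.
Compute six sample descriptors for each candidate, capturing motion, video dependence, temporal demand, stability, difficulty, and coverage.
Compute a shared score ρ(x) and apply a goal-specific feasibility preset C_g to build an optimized 1x subset S_g, which is compared against a uniform control U_g.
Use stratified subset construction, per-stratum quotas, deduplication, and a per-video QA cap to preserve modality mix and source breadth.
Evaluate four GDO configurations (MinLoss, Diverse, Temp, Temp+) against the fixed Uni-10x baseline on MVBench, VideoMME, MLVU, and LVBench.
Report trajectory and ablation analyses showing how temporal emphasis shifts performance across subtasks.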
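
The construction steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the descriptor field names, the plain average used as the shared score, the `min_temporal` feasibility floor standing in for a goal preset, the stratum labels, and the quota and cap values are all hypothetical placeholders. This summary does not give the paper's actual definitions of ρ(x) or C_g.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical names for the six descriptors listed above.
DESCRIPTORS = ("motion", "video_dependence", "temporal_demand",
               "stability", "difficulty", "coverage")


@dataclass(frozen=True)
class Sample:
    sample_id: str
    video_id: str       # source clip or image the QA pair belongs to
    stratum: str        # assumed strata, e.g. "image" or "short_video"
    question: str
    descriptors: tuple  # six values in [0, 1], ordered as DESCRIPTORS


def shared_score(s: Sample) -> float:
    """Assumed stand-in for rho(x): the plain mean of the six descriptors."""
    return sum(s.descriptors) / len(s.descriptors)


def feasible(s: Sample, min_temporal: float) -> bool:
    """Assumed stand-in for a goal preset C_g: a floor on temporal demand."""
    return s.descriptors[DESCRIPTORS.index("temporal_demand")] >= min_temporal


def build_subset(pool, quotas, min_temporal=0.0, qa_cap=2):
    """Greedy stratified selection with dedup and a per-video QA cap."""
    chosen = []
    per_stratum = defaultdict(int)
    per_video = defaultdict(int)
    seen = set()
    # Visit candidates from highest to lowest shared score.
    for s in sorted(pool, key=shared_score, reverse=True):
        key = (s.video_id, s.question.strip().lower())
        if key in seen:
            continue  # duplicate question on the same video
        if not feasible(s, min_temporal):
            continue  # violates the goal-specific preset
        if per_stratum[s.stratum] >= quotas.get(s.stratum, 0):
            continue  # this stratum's quota is already full
        if per_video[s.video_id] >= qa_cap:
            continue  # too many QA pairs from this video
        seen.add(key)
        per_stratum[s.stratum] += 1
        per_video[s.video_id] += 1
        chosen.append(s)
    return chosen
```

Under this sketch, a more temporally focused goal would correspond to raising `min_temporal`, while a coverage-oriented goal would spread the quotas more evenly across strata. Both mappings are assumptions made for illustration.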

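Experimental Results

Research Questions

RQ1: Under fixed training constraints, can better data optimization achieve faster convergence with less data?
RQ2: How do the six sample descriptors capture the value of multimodal samples for subset construction?
RQ3: Do the different goal configurations (MinLoss, Diverse, Temp, Temp+) produce different capability and convergence trajectories?
RQ4: Why do the gains from data optimization differ across benchmarks with different temporal demands?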

Key Findings

Benchmark	Uni-10x (%)	GDO (%)	Δ (pp)	Samples to match Uni-10x	Reduction vs. 512k
MVBench	62.27	63.65	+1.38	35.4k	14.5x
VideoMME	61.22	62.89	+1.67	26.6k	19.2x
MLVU	43.81	46.89	+3.08	27.3k	18.8x
LVBench	40.22	41.06	+0.84	34.7k	14.8x
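
The Δ and Reduction columns follow directly from the other columns: Δ is the GDO accuracy minus the Uni-10x accuracy, and Reduction is the 512k-sample Uni-10x budget divided by the sample count at which GDO matches the reference. The short Python check below recomputes both from the table values. The variable names are illustrative, and reading "512k" as exactly 512 thousand samples is an assumption about the rounding used.

```python
UNI_10X_SAMPLES_K = 512.0  # assumed: "512k" read as 512 thousand samples

# benchmark: (Uni-10x accuracy, GDO accuracy, samples to match, in thousands)
table = {
    "MVBench":  (62.27, 63.65, 35.4),
    "VideoMME": (61.22, 62.89, 26.6),
    "MLVU":     (43.81, 46.89, 27.3),
    "LVBench":  (40.22, 41.06, 34.7),
}


def derived_columns(uni_acc, gdo_acc, match_k):
    """Recompute the accuracy gain (pp) and the data-reduction factor."""
    delta_pp = round(gdo_acc - uni_acc, 2)
    reduction = round(UNI_10X_SAMPLES_K / match_k, 1)
    return delta_pp, reduction


results = {name: derived_columns(*row) for name, row in table.items()}
for name, (delta_pp, reduction) in results.items():
    print(f"{name}: +{delta_pp:.2f} pp, {reduction:.1f}x")
```

The recomputed values agree with the table, for example 512 / 35.4 ≈ 14.5 for MVBench and 512 / 26.6 ≈ 19.2 for VideoMME.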

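GDO reaches the fixed Uni-10x reference with far fewer samples on every benchmark.
GDO improves accuracy: +1.38 pp on MVBench, +1.67 pp on VideoMME, +3.08 pp on MLVU, and +0.84 pp on LVBench.
Relative to the 512k-sample Uni-10x reference, GDO needs 14.5x to 19.2x less data to match the reference accuracy.
Stronger temporal emphasis yields better long-video understanding; the abstract reports the largest gains on MVBench and MLVU.
The four configurations form a coherent frontier: MinLoss is the most data-efficient, Temp and Temp+ maximize temporal understanding, and Diverse emphasizes coverage.
Ablations indicate that Temp+'s gains come from the combination of several descriptor terms rather than from any single factor.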
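
One plausible reading of "reaches the Uni-10x reference after N samples" is the first evaluation checkpoint at which GDO's accuracy meets or exceeds the reference score. The sketch below implements that reading. The checkpoint trajectory is made-up toy data for illustration only, not results from the paper, and `first_match` is a hypothetical helper name.

```python
from typing import Optional, Sequence, Tuple


def first_match(trajectory: Sequence[Tuple[float, float]],
                reference_acc: float) -> Optional[float]:
    """Return the sample count of the first checkpoint whose accuracy
    meets or exceeds reference_acc, or None if no checkpoint does."""
    # Sort by samples seen so checkpoints are visited in training order.
    for samples_seen, acc in sorted(trajectory):
        if acc >= reference_acc:
            return samples_seen
    return None


# Toy (samples seen in thousands, accuracy) checkpoints; NOT from the paper.
toy_trajectory = [(10.0, 58.0), (20.0, 60.5), (30.0, 62.0), (40.0, 63.0)]
toy_reference = 62.0

match_k = first_match(toy_trajectory, toy_reference)
```

Dividing the baseline's sample budget by the returned count gives a reduction factor of the kind reported in the table.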

This review was generated by AI and checked by a human editor.