QUICK REVIEW

[Paper Review] On the effectiveness of task granularity for transfer learning

Farzaneh Mahdisoltani, Guillaume Berger|arXiv (Cornell University)|Apr 24, 2018

Human Pose and Action Recognition | References: 31 | Cited by: 51

One-sentence summary

The paper investigates how the level of granularity in a source task (coarse to fine captions) affects the quality of learned features for transfer learning in video understanding, showing that finer-grained tasks yield better transfer performance, and that captioning can serve as an effective source task.

ABSTRACT

We describe a DNN for video classification and captioning, trained end-to-end, with shared features, to solve tasks at different levels of granularity, exploring the link between granularity in a source task and the quality of learned features for transfer learning. For solving the new task domain in transfer learning, we freeze the trained encoder and fine-tune a neural net on the target domain. We train on the Something-Something dataset with over 220,000 videos, and multiple levels of target granularity, including 50 action groups, 174 fine-grained action categories and captions. Classification and captioning with Something-Something are challenging because of the subtle differences between actions, applied to thousands of different object classes, and the diversity of captions penned by crowd actors. Our model performs better than existing classification baselines for Something-Something, with impressive fine-grained results, and it yields a strong baseline on the new Something-Something captioning task. Experiments reveal that training with more fine-grained tasks tends to produce better features for transfer learning.

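Motivation and objectives

Study the relationship between the label granularity of the source task and the quality of transferable features.
Develop a unified encoder-decoder model for video classification and captioning with a shared representation.
Evaluate transfer learning from Something-Something features to new domains, including a kitchen action dataset.
Introduce 20bn-kitchenware as a transfer learning benchmark for fine-grained tasks.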

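Proposed method

Use a two-channel video encoder (a 2D spatial CNN and a 3D spatiotemporal CNN) whose output feeds a shared LSTM encoder.
Jointly train a classification head and a caption decoder with a weighted loss: loss = lambda * classification_loss + (1 - lambda) * captioning_loss.
Train on four tasks: coarse-grained action groups, fine-grained action categories, simplified object-placeholder captions, and full object-placeholder captions.
The caption decoder generates captions from the encoded video representation; training uses teacher forcing with a fixed caption length (14 words).
Evaluation includes transfer learning: the encoder is frozen, a classifier is trained on the target data, and the features learned at different source granularity levels are compared.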
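The weighted loss and the fixed-length, teacher-forced captioning setup described above can be illustrated with a minimal sketch. This is not the authors' code: the special tokens `<pad>` and `<bos>`, the helper names, and the choice to right-pad short captions are all assumptions made for illustration. Only the loss formula and the caption length of 14 come from the summary.

```python
import math

CAPTION_LEN = 14             # fixed caption length reported in the summary
PAD, BOS = "<pad>", "<bos>"  # hypothetical special tokens (assumption)


def cross_entropy(probs, target_index):
    """Negative log-likelihood of the target class under a probability vector."""
    return -math.log(probs[target_index])


def fix_length(tokens, length=CAPTION_LEN, pad=PAD):
    """Truncate or right-pad a token list to exactly `length` tokens."""
    return (tokens + [pad] * length)[:length]


def teacher_forcing_pairs(tokens):
    """Pair each decoder input (the previous ground-truth token) with its target.

    With teacher forcing, the decoder is fed the ground-truth token from the
    previous step rather than its own prediction.
    """
    targets = fix_length(tokens)
    inputs = [BOS] + targets[:-1]
    return list(zip(inputs, targets))


def joint_loss(classification_loss, captioning_loss, lam):
    """loss = lambda * classification_loss + (1 - lambda) * captioning_loss."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * classification_loss + (1.0 - lam) * captioning_loss
```

Setting `lam` to 1.0 reduces the objective to pure classification and 0.0 to pure captioning, so a single scalar covers both single-task and joint training.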

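Experimental results

Research questions

RQ1: Does training on a finer-grained source task produce richer features for transfer learning?
RQ2: How does joint training of classification and captioning differ from single-task training in transfer performance?
RQ3: How do the different granularity levels (coarse-grained groups, fine-grained actions, simplified captions, full captions) affect classification and captioning performance?
RQ4: How well do features derived from Something-Something transfer to a new fine-grained kitchen action dataset (20bn-kitchenware)?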

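Key findings

Training with more fine-grained tasks tends to produce features better suited to transfer learning.
Models trained to perform classification and captioning jointly learn features that transfer more effectively to new tasks.
Between coarse-grained and fine-grained classification, fine-grained training achieves higher test accuracy (for example, 50.44% versus 41.7% in the setting described).
Captioning is a feasible and beneficial source task; training that combines captioning with action classification improves transfer performance.
The proposed 20bn-kitchenware benchmark shows that, when transferring to fine-grained kitchen actions, Something-Something pretrained features and temporal models with recurrence outperform the baselines.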
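The transfer protocol behind these findings (keep the pretrained encoder frozen and fit only a classifier on target data) can be sketched in a few lines. Everything here is a toy assumption: `frozen_encoder` is a hypothetical hand-written feature function standing in for a pretrained network, and a nearest-centroid classifier replaces the neural net trained in the paper so that the example stays dependency-free.

```python
def frozen_encoder(video):
    """Hypothetical stand-in for a pretrained encoder (assumption).

    Maps a 'video' (a list of per-frame values) to a 2-d feature
    (mean, range). It has no trainable state, so it is frozen by design.
    """
    return (sum(video) / len(video), max(video) - min(video))


def train_centroids(videos, labels, encoder):
    """Fit a nearest-centroid classifier on top of frozen features."""
    sums, counts = {}, {}
    for video, label in zip(videos, labels):
        feat = encoder(video)
        acc = sums.setdefault(label, [0.0] * len(feat))
        for i, value in enumerate(feat):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(video, centroids, encoder):
    """Return the label whose centroid is closest to the encoded video."""
    feat = encoder(video)

    def sq_dist(centroid):
        return sum((f - c) ** 2 for f, c in zip(feat, centroid))

    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def accuracy(videos, labels, centroids, encoder):
    """Fraction of videos classified correctly."""
    hits = sum(predict(v, centroids, encoder) == y for v, y in zip(videos, labels))
    return hits / len(videos)
```

To compare source granularities in this setup, one would call `train_centroids` and `accuracy` once per pretrained encoder on the same target data and compare the resulting accuracies; only the encoder argument changes between runs.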


This review was generated by AI and reviewed by a human editor.