QUICK REVIEW

[Paper Review] Language-based Action Concept Spaces Improve Video Self-Supervised Learning

Kanchana Ranasinghe, Michael S. Ryoo|arXiv (Cornell University)|Jul 20, 2023

Multimodal Machine Learning Applications|Cited by 7

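One-Sentence Summary

This paper adapts an image CLIP model to video using language-constrained action concept spaces and self-distillation, markedly improving zero-shot, linear probing, and transductive zero-shot performance on action recognition benchmarks.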

ABSTRACT

Recent contrastive language image pre-training has led to learning highly transferable and robust image representations. However, adapting these models to video domains with minimal supervision remains an open problem. We explore a simple step in that direction, using language tied self-supervised learning to adapt an image CLIP model to the video domain. A backbone modified for temporal modeling is trained under self-distillation settings with train objectives operating in an action concept space. Feature vectors of various action concepts extracted from a language encoder using relevant textual prompts construct this space. We introduce two train objectives, concept distillation and concept alignment, that retain generality of original representations while enforcing relations between actions and their attributes. Our approach improves zero-shot and linear probing performance on three action recognition benchmarks.

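Research Motivation and Objectives

Learn robust video representations without per-video annotation by exploiting language-conditioned action concepts.
Propose action concept spaces (categories and descriptions) derived from language representations to guide self-supervised learning.
Develop two self-supervised objectives (concept distillation and concept alignment) that capture relations between actions and their attributes while preserving CLIP-style generalization.
Enable models trained on video to perform downstream zero-shot action recognition directly, with no captions or per-video annotation.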

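Proposed Method

Use a CLIP-based visual backbone, augmented with factorized space-time attention for temporal modeling.
Introduce a text classifier (frozen embeddings from the CLIP text encoder) that projects visual features into an action concept space.
Construct two concept spaces from language: a category concept space (action labels) and a description concept space (descriptions/attributes).
Define a concept distillation loss that aligns two augmented views within each concept space, using an EMA teacher-student setup and a sharpened softmax over concept scores.
Add a uniform distribution prior to prevent collapse, and a concept alignment loss that ties the category and description spaces together across views.
Support variants of the category concept space (LSS-A, LSS-B, LSS-C) to explore label generation and scalability without leaking downstream labels.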
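The concept distillation objective described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes a DINO-style cross-entropy between a sharpened teacher distribution and a softer student distribution, and the temperature values, the momentum value, and all function names (`sharpened_softmax`, `concept_distillation_loss`, `ema_update`) are hypothetical choices made for this example. The uniform distribution prior and the concept alignment loss are omitted.

```python
import math


def sharpened_softmax(scores, temperature):
    """Softmax over concept scores; a lower temperature gives a sharper distribution."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def concept_distillation_loss(student_scores, teacher_scores,
                              student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between teacher and student distributions over concepts.

    Both inputs are scores of one video view in the same concept space.
    The temperatures are assumed values, not taken from the paper.
    """
    p_teacher = sharpened_softmax(teacher_scores, teacher_temp)
    p_student = sharpened_softmax(student_scores, student_temp)
    return -sum(pt * math.log(ps) for pt, ps in zip(p_teacher, p_student))


def ema_update(teacher_params, student_params, momentum=0.99):
    """Move teacher parameters toward the student by exponential moving average."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]


# Two augmented views: the teacher sees one, the student sees the other.
teacher_view = [0.9, 0.1, 0.0]
aligned_student_view = [0.9, 0.1, 0.0]
misaligned_student_view = [0.0, 0.1, 0.9]

loss_aligned = concept_distillation_loss(aligned_student_view, teacher_view)
loss_misaligned = concept_distillation_loss(misaligned_student_view, teacher_view)
```

In this sketch the loss is small when the student's view lands on the same concept as the teacher's view and large when it does not, which is the behavior a view-alignment objective needs. Only the student would receive gradients; the teacher is updated through `ema_update`.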

Figure 1: Our overall setup contains three components: visual teacher model (green), visual student model (red), and language model (blue). We utilize the text encoder of CLIP as our language model and extract concept vectors relevant to action labels and descriptions of those actions. A visual enco

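Experimental Results

Research Questions

RQ1: Do language-derived action concepts improve video self-supervised learning so that CLIP-style representations transfer better to video tasks?
RQ2: Do concept distillation and concept alignment in language-aligned concept spaces yield better zero-shot and linear probing performance on standard benchmarks?
RQ3: How do differently constructed category and description concept spaces (including labels generated by large language models) affect self-supervised learning results and scalability?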

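Key Findings

Achieves state-of-the-art linear probing results on HMDB-51 and UCF-101 compared with prior self-supervised methods, without using video-level labels or captions.
Outperforms baseline CLIP-style models in zero-shot transfer to HMDB-51 and UCF-101, and shows strong transductive zero-shot performance on HMDB-51, UCF-101, and Kinetics-400.
The two novel self-supervised objectives (concept distillation and concept alignment) consistently outperform the baseline, including when combined with the uniform distribution prior.
Shows that language-aligned action concept spaces preserve and improve the transferability of image CLIP representations to the video domain while retaining zero-shot capability.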

Figure 2: We illustrate a toy concept space constructed with the three action concepts: run, swim, and walk. In this example, the text classifier projects visual feature $f_{i}$ into the 3-dimensional toy concept space to produce $\tilde{f}_{i}$.
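The toy projection in Figure 2 can be mimicked with a short sketch. It assumes the text classifier scores a visual feature by cosine similarity against each concept vector, as in CLIP-style zero-shot classification; the 4-dimensional concept vectors and the feature `f_i` below are made-up stand-ins for real CLIP text and video embeddings, and `project_to_concept_space` is a hypothetical helper name.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def project_to_concept_space(visual_feature, concept_vectors):
    """Return one cosine-similarity score per concept vector."""
    f = l2_normalize(visual_feature)
    return [sum(a * b for a, b in zip(f, l2_normalize(c)))
            for c in concept_vectors]


# Made-up stand-ins for frozen text embeddings of the three toy concepts.
CONCEPTS = {
    "run": [1.0, 0.2, 0.0, 0.1],
    "swim": [0.0, 1.0, 0.3, 0.0],
    "walk": [0.8, 0.0, 0.1, 0.6],
}
names = list(CONCEPTS)

# Made-up visual feature f_i for one video clip.
f_i = [0.1, 0.9, 0.4, 0.0]

# f_tilde is the 3-dimensional representation of f_i in the toy concept space.
f_tilde = project_to_concept_space(f_i, [CONCEPTS[n] for n in names])
predicted = names[max(range(len(f_tilde)), key=f_tilde.__getitem__)]
```

Each entry of `f_tilde` corresponds to one concept, so taking the largest entry gives a zero-shot prediction with no video labels involved.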

This review was generated by AI and reviewed by human editors.