QUICK REVIEW

[Paper Review] ActionCLIP: A New Paradigm for Video Action Recognition

Mengmeng Wang, Jiazheng Xing|arXiv (Cornell University)|Sep 17, 2021

Human Pose and Action Recognition|References: 53|Cited by: 189

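One-Sentence Summary

ActionCLIP reframes video action recognition as video-text matching and proposes a "pre-train, prompt and fine-tune" paradigm, achieving state-of-the-art results on Kinetics-400 and strong zero-shot/few-shot transfer.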

ABSTRACT

The canonical approach to video action recognition dictates a neural model to do a classic and standard 1-of-N majority vote task. They are trained to predict a fixed set of predefined categories, limiting their transferable ability on new datasets with unseen concepts. In this paper, we provide a new perspective on action recognition by attaching importance to the semantic information of label texts rather than simply mapping them into numbers. Specifically, we model this task as a video-text matching problem within a multimodal learning framework, which strengthens the video representation with more semantic language supervision and enables our model to do zero-shot action recognition without any further labeled data or parameters requirements. Moreover, to handle the deficiency of label texts and make use of tremendous web data, we propose a new paradigm based on this multimodal learning framework for action recognition, which we dub "pre-train, prompt and fine-tune". This paradigm first learns powerful representations from pre-training on a large amount of web image-text or video-text data. Then it makes the action recognition task to act more like pre-training problems via prompt engineering. Finally, it end-to-end fine-tunes on target datasets to obtain strong performance. We give an instantiation of the new paradigm, ActionCLIP, which not only has superior and flexible zero-shot/few-shot transfer ability but also reaches a top performance on general action recognition task, achieving 83.8% top-1 accuracy on Kinetics-400 with a ViT-B/16 as the backbone. Code is available at https://github.com/sallymmx/ActionCLIP.git

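Motivation and Objectives

Reframe action recognition as a video-text matching problem so that the semantics of the label texts can be exploited.
Propose a scalable paradigm (pre-train, prompt, and fine-tune) that reuses models pre-trained on large-scale web data.
Demonstrate zero-shot and few-shot transfer on standard benchmarks.
Show that textual prompts and carefully designed visual prompts improve performance while avoiding catastrophic forgetting.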

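Proposed Method

Formulates action recognition as P(f(x,y)|θ), where f is a similarity function, using a video encoder g_V and a language encoder g_W, and trains the model to maximize the cross-modal similarity s(x,y) of matching video-label pairs.
Defines a contrastive KL-divergence loss over video-text pairs, using cosine similarity and the symmetric softmax-normalized similarities p_x2y and p_y2x with temperature τ.
Adopts a multimodal training objective (the video-text contrastive loss) that pulls the representations of matching videos and labels together.
Proposes a new paradigm: pre-train on web data, apply textual and visual prompts to align the downstream task with the pre-training objective, then fine-tune end to end on the target dataset.
Instantiates the paradigm as ActionCLIP with CLIP as the base model, using textual prompts that turn labels into sentences and several visual prompts (pre-network, in-network, post-network) to model temporal information.
Evaluates under zero-shot and few-shot settings and compares against unimodal baselines and existing methods.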
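
The loss described in the list above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy embeddings, the default temperature of 0.07, the row-normalized target q (every sample in the batch with the same label counts as a positive), and the KL direction KL(q || p) are all choices made here for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity s(x, y) between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def softmax(scores, tau):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    top = max(scores)
    exps = [math.exp((s - top) / tau) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def kl(q, p):
    """KL(q || p); terms with q_i = 0 contribute nothing."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def symmetric_kl_loss(video_emb, text_emb, labels, tau=0.07):
    """Average of the video-to-text and text-to-video KL terms over a batch.

    video_emb[i] stands in for g_V(x_i), text_emb[i] for g_W(y_i), and
    labels[i] is the class of sample i. tau=0.07 is an assumed default.
    """
    n = len(labels)
    # sim[i][j] = s(x_i, y_j)
    sim = [[cosine(v, w) for w in text_emb] for v in video_emb]
    total = 0.0
    for i in range(n):
        # Assumed target: uniform over all batch entries sharing label i.
        match = [1.0 if labels[i] == labels[j] else 0.0 for j in range(n)]
        q = [m / sum(match) for m in match]
        p_x2y = softmax(sim[i], tau)                          # video i vs. all texts
        p_y2x = softmax([sim[j][i] for j in range(n)], tau)   # text i vs. all videos
        total += 0.5 * (kl(q, p_x2y) + kl(q, p_y2x))
    return total / n

# Toy batch: two samples with different labels and perfectly aligned embeddings.
aligned = symmetric_kl_loss([[1.0, 0.0], [0.0, 1.0]],
                            [[1.0, 0.0], [0.0, 1.0]], labels=[0, 1])
# Same videos, but the text embeddings are swapped, so every pair is mismatched.
swapped = symmetric_kl_loss([[1.0, 0.0], [0.0, 1.0]],
                            [[0.0, 1.0], [1.0, 0.0]], labels=[0, 1])
print(aligned < swapped)
```

A KL target rather than a one-hot cross-entropy target is useful here because several videos in a batch can share the same label, so a row of the similarity matrix may have more than one positive.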

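Experimental Results

Research Questions

RQ1: Does the semantic information in label texts improve action recognition over conventional unimodal classification?
RQ2: Does the pre-train, prompt, and fine-tune paradigm enable effective zero-shot and few-shot action recognition on standard benchmarks?
RQ3: How do textual prompts and the different visual prompts affect performance and knowledge retention (catastrophic forgetting) when a pre-trained model is adapted to video actions?

Key Findings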

Backbone	Frames	Top-1 (%)	Top-5 (%)	GFLOPs	Params	Runtime
TimeSformer-L	96	80.7	-	7140	-	-
ViViT-L/16x2 320	32	81.3	-	3992	-	4.2 V/s
ViT-B/32	8	78.4	-	35.4	144.1M	144.7 V/s
ViT-B/16	8	81.1	-	140.8	141.7M	43.2 V/s
ViT-B/16	16	81.7	-	281.6	141.7M	21.2 V/s
ViT-B/16	32	82.3	-	563.1	141.7M	13.0 V/s

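In the instantiation studied, the multimodal framework improves top-1 accuracy over the unimodal baseline by 2.91 percentage points on Kinetics-400 (75.45% to 78.36%).
With ViT-B/16 on Kinetics-400, ActionCLIP reaches 82.6% top-1 and 96.2% top-5 with 16 frames, and 83.8% top-1 with 32 frames.
Zero-shot and few-shot results show ActionCLIP leading in data-scarce settings; it performs zero-shot recognition on Kinetics-400, HMDB-51, and UCF-101, where some baseline methods struggle.
Prompting the label texts improves performance over using the label words alone (77.82% -> 78.36% top-1).
The choice of visual prompt affects performance: post-network prompts (MeanP, LSTM, Conv1D, Transf) achieve strong results, whereas the pre-network Joint prompt and the in-network Shift prompt can reduce performance, which indicates that prompt design matters for preventing catastrophic forgetting.
Fine-tuning all components gives the best results; freezing the encoders reduces performance (see the V1-V4 comparison in the ablation).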
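
To make the prompt-related findings above concrete, here is a small sketch of how a textual prompt and a mean-pooling post-network visual prompt could fit together at inference time. It is an illustration under assumptions, not ActionCLIP's code: the template string, the function names, and the two-dimensional toy vectors standing in for encoder outputs are all hypothetical, and no real text or video encoder is called.

```python
import math

# Hypothetical template; the prompts actually used by the authors are not reproduced here.
TEMPLATE = "a video of a person {}."

def prompt_labels(labels, template=TEMPLATE):
    """Textual prompt: turn each bare label word into a sentence."""
    return [template.format(label) for label in labels]

def mean_pool(frame_features):
    """Mean-pooling post-network prompt: average per-frame features over time."""
    num_frames = len(frame_features)
    return [sum(column) / num_frames for column in zip(*frame_features)]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def classify(frame_features, text_embeddings, labels):
    """Pick the label whose text embedding is most similar to the pooled video."""
    video = mean_pool(frame_features)
    scores = [cosine(video, w) for w in text_embeddings]
    best = max(range(len(labels)), key=lambda i: scores[i])
    return labels[best], scores

labels = ["running", "swimming"]
sentences = prompt_labels(labels)

# Toy stand-ins: pretend these came from a text encoder applied to `sentences`
# and from an image encoder applied to three sampled frames.
text_embeddings = [[1.0, 0.0], [0.0, 1.0]]
frame_features = [[0.9, 0.1], [0.7, 0.3], [0.8, 0.2]]

prediction, scores = classify(frame_features, text_embeddings, labels)
print(sentences[0], "|", prediction)
```

Because classification is a nearest-text lookup, adding an unseen class only requires one more prompted sentence and its embedding, which is what makes zero-shot recognition possible without new parameters. Mean pooling sits after the image encoder and leaves its weights and inputs untouched, which is one plausible reading of why post-network prompts retain pre-trained knowledge better than prompts that alter the encoder's input or internals.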


This review was generated by AI and checked by a human editor.