QUICK REVIEW

[Paper Review] TARN: Temporal Attentive Relation Network for Few-Shot and Zero-Shot Action Recognition

Mina Bishay, Georgios Zoumpourlis | arXiv (Cornell University) | Jul 21, 2019

Human Pose and Action Recognition | Cited by 84

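One-Sentence Summary

tldr: TARN introduces a temporal attentive relation network for few-shot and zero-shot action recognition. It uses segment-level attention to align video segments and learns a deep metric for video matching, achieving state-of-the-art results in few-shot learning (FSL) and competitive results in zero-shot learning (ZSL), without fine-tuning or additional memory modules.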

ABSTRACT

In this paper we propose a novel Temporal Attentive Relation Network (TARN) for the problems of few-shot and zero-shot action recognition. At the heart of our network is a meta-learning approach that learns to compare representations of variable temporal length, that is, either two videos of different length (in the case of few-shot action recognition) or a video and a semantic representation such as word vector (in the case of zero-shot action recognition). By contrast to other works in few-shot and zero-shot action recognition, we a) utilise attention mechanisms so as to perform temporal alignment, and b) learn a deep-distance measure on the aligned representations at video segment level. We adopt an episode-based training scheme and train our network in an end-to-end manner. The proposed method does not require any fine-tuning in the target domain or maintaining additional representations as is the case of memory networks. Experimental results show that the proposed architecture outperforms the state of the art in few-shot action recognition, and achieves competitive results in zero-shot action recognition.
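
The abstract mentions an episode-based training scheme without spelling it out. As a rough illustration, the sketch below draws one N-way K-shot episode: it picks N classes, then splits a few videos per class into support samples and queries. The function name `sample_episode`, the `videos_by_class` dictionary layout, and the `n_query` parameter are all assumptions made for this example and are not taken from the paper.

```python
import random


def sample_episode(videos_by_class, n_way, k_shot, n_query, rng=None):
    """Draw one N-way K-shot episode (illustrative, not the authors' code).

    videos_by_class: dict mapping a class label to a list of video ids
                     (hypothetical data layout).
    Returns (support, query), each a list of (video_id, label) pairs.
    """
    rng = rng or random.Random()
    # Choose which classes take part in this episode.
    classes = rng.sample(sorted(videos_by_class), n_way)
    support, query = [], []
    for label in classes:
        # Draw disjoint support and query videos for the class.
        picked = rng.sample(videos_by_class[label], k_shot + n_query)
        support += [(video, label) for video in picked[:k_shot]]
        query += [(video, label) for video in picked[k_shot:]]
    return support, query
```

Each training step would then score every query against the support samples of the episode, which mirrors how the model is used at test time on unseen classes.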

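Motivation and Objectives

Address few-shot action recognition by comparing video segments rather than whole videos.
Extend to zero-shot action recognition by relating video segments to semantic class representations.
Develop an end-to-end trainable architecture that requires neither memory networks nor fine-tuning in the target domain.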

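Proposed Method

The embedding module processes video segments with C3D features, then passes them through a bidirectional GRU to produce segment embeddings.
The relation module applies segment-by-segment attention to align sample and query segments, transforming the representations to an equal number of segments.
The segment-wise comparisons are fed into a deep metric learning network that produces one relation score for each video pair.
A softmax over the relation scores yields class probabilities; in the K-shot case, scores are averaged per class.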
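
The relation-module steps above can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the authors' implementation: attention is computed with unparameterised dot products (the real layer is learned), the sample is re-expressed once per query segment so both sides end up with the same number of segments, and the learned deep metric is replaced by a mean cosine similarity as a stand-in. All function names are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def align(sample, query):
    """Return one attention-weighted mix of sample segments per query segment.

    sample and query are lists of segment embeddings and may differ in length;
    the output always has len(query) vectors.
    """
    dim = len(sample[0])
    aligned = []
    for q in query:
        weights = softmax([dot(q, s) for s in sample])
        aligned.append(
            [sum(w * s[d] for w, s in zip(weights, sample)) for d in range(dim)]
        )
    return aligned


def relation_score(sample, query):
    """Stand-in for the learned deep metric: mean cosine over aligned pairs."""
    sims = []
    for a, q in zip(align(sample, query), query):
        denom = math.sqrt(dot(a, a)) * math.sqrt(dot(q, q)) or 1.0
        sims.append(dot(a, q) / denom)
    return sum(sims) / len(sims)


def class_probabilities(support, query):
    """support maps a label to its K sample videos; average per class, then softmax."""
    labels = sorted(support)
    scores = [
        sum(relation_score(s, query) for s in support[label]) / len(support[label])
        for label in labels
    ]
    return dict(zip(labels, softmax(scores)))
```

Calling `class_probabilities({"walk": [walk_video], "jump": [jump_video]}, query_video)` with toy segment embeddings returns a probability per class; with K > 1 videos per class, the per-class averaging happens before the softmax.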

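Experimental Results

Research Questions

RQ1: Does segment-level attention improve temporal alignment and matching in few-shot action recognition?
RQ2: In FSL, does segment-level comparison with a learned deep distance metric outperform whole-video comparison or fixed distance measures?
RQ3: Can the framework be extended to zero-shot action recognition, using semantic vectors as the targets that video segments are aligned with?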

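Key Findings

In the 1- to 5-shot settings, TARN with segment-by-segment attention and deep metric learning outperforms existing state-of-the-art methods on few-shot action recognition.
Using EucCos as the similarity measure in the comparison layer gave the best results among the options tested.
Attention-based multi-segment comparison outperforms the single-vector baseline (TARN-single) across datasets and feature types.
In the zero-shot setting, TARN achieves competitive results, particularly on the UCF-101 splits, where multi-segment-to-attribute comparison gives the best performance.
Within this framework, C3D features generally outperform ResNet-50 features for few-shot action recognition.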
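
This summary does not define EucCos. Assuming, as the name suggests, that it pairs the Euclidean distance with the cosine similarity of two aligned segment embeddings, a comparison layer of that kind might look like the sketch below. The function names and the two-element output layout are assumptions made for illustration only.

```python
import math


def euc_cos(a, b):
    """Return [euclidean_distance, cosine_similarity] for two equal-length vectors.

    Assumed reading of "EucCos"; the source summary does not define it.
    """
    euc = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    cos = sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0
    return [euc, cos]


def compare_segments(aligned_sample, query):
    """Apply euc_cos to each aligned (sample segment, query segment) pair."""
    return [euc_cos(a, q) for a, q in zip(aligned_sample, query)]
```

The resulting per-segment feature pairs are what a downstream metric network would consume to produce the single relation score for a video pair.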


This review was generated by AI and checked by a human editor.