QUICK REVIEW

[Paper Review] Few-shot Action Recognition via Improved Attention with Self-supervision

Hongguang Zhang, Li Zhang|arXiv (Cornell University)|Jan 12, 2020

Human Pose and Action Recognition | References: 25 | Cited by: 3

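One-Sentence Summary

The paper proposes a few-shot video action recognition method built on a C3D encoder, permutation-invariant pooling, and self-supervised spatio-temporal attention, aiming to improve robustness to varying action lengths and temporal distribution shift. Self-supervision trains the attention mechanism to be invariant to block permutations, and the authors report state-of-the-art results on HMDB51, UCF101, and miniMIT.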

ABSTRACT

Many few-shot learning models focus on recognising images. In contrast, we tackle a challenging task of few-shot action recognition from videos. We build on a C3D encoder for spatio-temporal video blocks to capture short-range action patterns. Such encoded blocks are aggregated by permutation-invariant pooling to make our approach robust to varying action lengths and long-range temporal dependencies whose patterns are unlikely to repeat even in clips of the same class. Subsequently, the pooled representations are combined into simple relation descriptors which encode so-called query and support clips. Finally, relation descriptors are fed to the comparator with the goal of similarity learning between query and support clips. Importantly, to re-weight block contributions during pooling, we exploit spatial and temporal attention modules and self-supervision. In naturalistic clips (of the same class) there exists a temporal distribution shift--the locations of discriminative temporal action hotspots vary. Thus, we permute blocks of a clip and align the resulting attention regions with similarly permuted attention regions of non-permuted clip to train the attention mechanism invariant to block (and thus long-term hotspot) permutations. Our method outperforms the state of the art on the HMDB51, UCF101, miniMIT datasets.
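The pooling step described in the abstract can be illustrated with a minimal sketch. It assumes each video block has already been encoded into a feature vector (random stand-ins replace C3D features here) and that attention supplies one unnormalised score per block. The attention-weighted average is an illustrative choice of permutation-invariant operator, not necessarily the one used in the paper, and the names `softmax` and `attention_pool` are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(blocks, scores):
    """Attention-weighted average of block features.

    blocks: equal-length feature vectors, one per video block
            (stand-ins for C3D block encodings).
    scores: one unnormalised attention score per block (assumed input).
    """
    weights = softmax(scores)
    dim = len(blocks[0])
    return [sum(w * b[d] for w, b in zip(weights, blocks)) for d in range(dim)]


# Toy clip: 5 blocks with 4-dimensional features.
rng = random.Random(0)
blocks = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(5)]
scores = [rng.gauss(0, 1) for _ in range(5)]

# Shuffle the blocks and their scores with the same permutation.
order = list(range(5))
rng.shuffle(order)

pooled = attention_pool(blocks, scores)
pooled_shuffled = attention_pool(
    [blocks[i] for i in order], [scores[i] for i in order]
)
```

Because each block keeps its own weight wherever it lands in the sequence, `pooled` and `pooled_shuffled` agree up to floating-point rounding, and a clip with a different number of blocks still yields a vector of the same size.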

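Motivation and Objectives

Address few-shot action recognition in video, where each action class has only a handful of labelled examples.
Handle varying action lengths and temporal distribution shift in naturalistic clips, where the locations of discriminative action hotspots are not fixed.
Improve the representation of query and support clips by combining pooled spatio-temporal features into relation descriptors.
Use self-supervision to make the attention mechanism invariant to block order, improving robustness.
Achieve state-of-the-art performance on benchmark datasets for few-shot video action recognition.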

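Proposed Method

A C3D encoder extracts spatio-temporal features from blocks of a video clip, capturing short-range action patterns.
Permutation-invariant pooling aggregates the encoded blocks, making the model robust to varying action durations and long-range temporal dependencies.
Spatial and temporal attention modules re-weight block contributions during pooling so that the model focuses on discriminative regions.
The attention mechanism is trained with self-supervision: the blocks of a clip are permuted, and the attention maps of the permuted clip are aligned with the identically permuted attention maps of the original clip.
Relation descriptors are built by combining the pooled representations of query and support clips, to support similarity learning.
A comparator takes the relation descriptors and predicts a similarity score between each query clip and support clip.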
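The last two steps above can be sketched as follows. This is a toy illustration under stated assumptions: the relation descriptor is formed by plain concatenation, and the comparator is a fixed stand-in (negative squared distance) used in place of the learned similarity network. The function names, class labels, and feature values are all hypothetical.

```python
def relation_descriptor(query_vec, support_vec):
    """Combine pooled query and support representations.

    Concatenation is an assumed, simple way to form the descriptor.
    """
    return list(query_vec) + list(support_vec)


def comparator(descriptor):
    """Stand-in for the learned comparator network.

    Splits the descriptor back into its two halves and returns the
    negative squared distance, so a higher score means more similar.
    """
    half = len(descriptor) // 2
    q, s = descriptor[:half], descriptor[half:]
    return -sum((a - b) ** 2 for a, b in zip(q, s))


def classify(query_vec, support_set):
    """Score the query against each support clip and pick the best class."""
    scores = {
        label: comparator(relation_descriptor(query_vec, vec))
        for label, vec in support_set.items()
    }
    return max(scores, key=scores.get), scores


# Toy 3-way 1-shot episode with pooled 3-dimensional representations.
support_set = {
    "run": [1.0, 0.0, 0.2],
    "jump": [0.0, 1.0, 0.1],
    "wave": [0.3, 0.3, 1.0],
}
query = [0.9, 0.1, 0.25]

predicted, scores = classify(query, support_set)
```

In this episode the query lies closest to the "run" support vector, so `predicted` is `"run"`. In the method summarised above the comparator is learned, so its scores would come from training rather than from a fixed distance.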

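Experimental Results

Research Questions

RQ1: Does an attention mechanism trained with self-supervision through block permutation improve robustness to temporal distribution shift in few-shot video action recognition?
RQ2: To what extent does permutation-invariant pooling improve generalization to actions of varying length?
RQ3: How effective are the proposed relation descriptors at capturing the discriminative spatio-temporal patterns needed for few-shot action classification?
RQ4: Does integrating spatial and temporal attention improve performance on few-shot video benchmarks relative to baseline methods?
RQ5: Can the model generalize to diverse action classes from very few labelled examples while remaining robust to variation in action length?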

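Key Findings

On HMDB51, the proposed method achieves state-of-the-art performance, outperforming prior few-shot action recognition methods.
On UCF101, the model generalizes better than prior methods across diverse action classes when support examples are limited.
On the miniMIT benchmark, the model sets a new state of the art, indicating that the approach carries over to a further video few-shot benchmark.
Attention trained with self-supervision through block permutation improves robustness to temporal distribution shift in naturalistic clips.
Combining permutation-invariant pooling with relation descriptor learning yields more discriminative features for query-support matching.
Because the self-supervised attention is invariant to block order, it complements permutation-invariant pooling in handling actions of varying length.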
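The block-permutation idea behind these findings can be sketched as a small alignment loss. This is an illustration only: the toy attention function, its `pos_bias` parameter, and the mean squared error are assumptions made for the example, since this summary does not state the paper's actual attention architecture or loss.

```python
import random


def permute(items, order):
    """Reorder a list so that position t holds items[order[t]]."""
    return [items[i] for i in order]


def toy_attention(blocks, w, pos_bias):
    """Hypothetical temporal attention: one score per block.

    pos_bias makes the score depend on the block's position in the
    clip, which is the behaviour the self-supervision discourages.
    """
    return [
        sum(wi * xi for wi, xi in zip(w, b)) + pos_bias * t
        for t, b in enumerate(blocks)
    ]


def alignment_loss(blocks, order, attention):
    """Mean squared gap between attention on the permuted clip and the
    identically permuted attention of the original clip."""
    att_of_permuted = attention(permute(blocks, order))
    permuted_att = permute(attention(blocks), order)
    return sum(
        (a - b) ** 2 for a, b in zip(att_of_permuted, permuted_att)
    ) / len(blocks)


# Toy clip: 6 blocks with 4-dimensional features and a fixed permutation.
rng = random.Random(0)
blocks = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(6)]
w = [0.5, -0.2, 0.1, 0.3]
order = [3, 0, 5, 1, 4, 2]

loss_position_free = alignment_loss(
    blocks, order, lambda b: toy_attention(b, w, 0.0)
)
loss_position_biased = alignment_loss(
    blocks, order, lambda b: toy_attention(b, w, 0.5)
)
```

An attention function that scores each block from its content alone incurs zero loss, while one that leans on block position is penalised. Minimising such a loss during training is one way to push attention towards following the hotspot rather than its location in the clip.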

This review was generated by AI and reviewed by a human editor.