QUICK REVIEW

[Paper Review] Attention is All We Need: Nailing Down Object-centric Attention for Egocentric Activity Recognition

Swathikiran Sudhakaran, Oswald Lanz|arXiv (Cornell University)|Jul 31, 2018

Human Pose and Action Recognition | Cited by 40

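One-sentence summary

The authors propose an end-to-end CNN-RNN model that uses class activation maps as spatial attention to focus on object regions, enabling weakly supervised egocentric activity recognition with ConvLSTM temporal encoding; it achieves state-of-the-art results on multiple benchmark datasets.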

ABSTRACT

In this paper we propose an end-to-end trainable deep neural network model for egocentric activity recognition. Our model is built on the observation that egocentric activities are highly characterized by the objects and their locations in the video. Based on this, we develop a spatial attention mechanism that enables the network to attend to regions containing objects that are correlated with the activity under consideration. We learn highly specialized attention maps for each frame using class-specific activations from a CNN pre-trained for generic image recognition, and use them for spatio-temporal encoding of the video with a convolutional LSTM. Our model is trained in a weakly supervised setting using raw video-level activity-class labels. Nonetheless, on standard egocentric activity benchmarks our model surpasses by up to +6% points recognition accuracy the currently best performing method that leverages hand segmentation and object location strong supervision for training. We visually analyze attention maps generated by the network, revealing that the network successfully identifies the relevant objects present in the video frames which may explain the strong recognition performance. We also discuss an extensive ablation analysis regarding the design choices.

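Research motivation and goals

Improve fine-grained first-person (egocentric) activity recognition by exploiting object locations and hands.
Develop an end-to-end architecture that learns spatial attention maps without strong supervision.
Encode spatio-temporal information with a ConvLSTM while preserving spatial structure.
Show, through ablation experiments and visualizations, how the attention maps align with activity-relevant objects.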

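Proposed method

Use a ResNet-34 pre-trained on ImageNet to extract frame features and compute class activation maps (CAMs).
Convert each CAM into a spatial probability map and weight the frame features with it via a Hadamard product to act as attention (f_SA(i) = f(i) ⊙ softmax(M_c(i))).
Use a convolutional long short-term memory network (ConvLSTM) to temporally encode the attended frame features while preserving spatial structure.
Train in two stages: stage 1 trains the classifier and the ConvLSTM layers; stage 2 additionally fine-tunes the final layers of the ResNet and its FC classifier so that the attention becomes specialized.
Add a temporal stream that takes stacked optical flow (warp flow) as input, and fuse the spatial and temporal streams by either averaging or joint training (joint training yields a relative gain of about 10% over simple averaging).
Evaluate on GTEA 61, GTEA 71, GTEA Gaze+ and EGTEA Gaze+, with 25 frames per video and optical flow stacks of 5 frames; compare against supervised methods based on hand segmentation and gaze.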
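
The attention step f_SA(i) = f(i) ⊙ softmax(M_c(i)) can be illustrated with a minimal NumPy sketch. It is not the authors' code. The function names `spatial_softmax` and `cam_attention` are hypothetical, the random arrays stand in for real ResNet-34 activations (512 channels on a 7×7 grid for a 224×224 input) and for the 1000-class ImageNet classifier weights, and picking the top-scoring class for the CAM is an assumption made here for illustration.

```python
import numpy as np


def spatial_softmax(cam):
    """Turn an (H, W) class activation map into a probability map summing to 1."""
    shifted = cam - cam.max()  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()


def cam_attention(features, fc_weights):
    """Weight frame features by a CAM-derived spatial attention map.

    features:   (C, H, W) feature maps from the last conv layer for one frame.
    fc_weights: (num_classes, C) weights of the classifier that follows
                global average pooling.
    Returns the attended features (C, H, W), the attention map (H, W),
    and the index of the class whose CAM was used.
    """
    pooled = features.mean(axis=(1, 2))          # global average pooling -> (C,)
    scores = fc_weights @ pooled                 # class scores -> (num_classes,)
    c = int(np.argmax(scores))                   # assumption: use the top-scoring class
    # CAM for class c: channel-wise weighted sum of the feature maps -> (H, W)
    cam = np.einsum("c,chw->hw", fc_weights[c], features)
    attention = spatial_softmax(cam)             # softmax over all H*W positions
    attended = features * attention[None, :, :]  # Hadamard product, broadcast over channels
    return attended, attention, c


# Stand-in data: random values in place of real activations and weights.
rng = np.random.default_rng(0)
features = rng.standard_normal((512, 7, 7))
fc_weights = rng.standard_normal((1000, 512))

attended, attention, c = cam_attention(features, fc_weights)
print(attended.shape, attention.shape)
```

Because the attended output keeps the (C, H, W) layout, it can be passed to a recurrent layer that operates on spatial maps, which is the role the ConvLSTM plays in the method described above.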

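Experimental results

Research questions

RQ1: Can object-centric spatial attention learned from weak video-level labels improve egocentric activity recognition without hand annotations?
RQ2: Can ConvLSTM-based spatio-temporal encoding preserve and exploit the learned spatial attention for fine-grained activities?
RQ3: How does end-to-end CAM-based attention compare, on standard benchmarks, with strongly supervised methods that rely on hand/object localization?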

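Key findings

The proposed method achieves state-of-the-art results on four egocentric datasets, improving accuracy by up to 6 percentage points over the previous best method on standard benchmarks.
Ablation experiments show that adding spatial attention improves accuracy by about 12% over a baseline without attention.
Jointly training the spatial and temporal streams yields about a 10% improvement over simple average fusion.
Visualizations show that the learned attention maps localize activity-relevant objects without hand segmentation or explicit object supervision.
Using warp optical flow, which compensates for camera motion, improves performance by about 4%.
The ConvLSTM-based architecture preserves spatial structure over time, which makes it possible to encode object locations spatio-temporally into the video descriptor.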
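
The simple average-fusion baseline that joint training is compared against can be sketched as follows. This is an illustrative assumption, not the paper's implementation: the summary does not say whether averaging is applied to logits or to probabilities, so this sketch averages softmax probabilities, and the names `softmax` and `average_fusion` as well as the toy logits are hypothetical.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def average_fusion(spatial_logits, temporal_logits):
    """Late fusion: average the class probabilities of the two streams.

    Both inputs have shape (num_videos, num_classes).
    Returns the predicted class per video and the fused probabilities.
    """
    # Assumption: fuse probabilities rather than raw logits.
    probs = 0.5 * (softmax(spatial_logits) + softmax(temporal_logits))
    return probs.argmax(axis=-1), probs


# Toy example: one video, three activity classes, made-up logits.
spatial_logits = np.array([[2.0, 1.0, 0.1]])   # appearance stream favours class 0
temporal_logits = np.array([[0.5, 2.5, 0.3]])  # flow stream favours class 1
pred, probs = average_fusion(spatial_logits, temporal_logits)
print(pred)  # [1]: the more confident temporal stream wins in this toy case
```

Averaging treats the two streams as independently trained classifiers; joint training instead lets both streams be optimized together, which is the variant the findings above report as performing better.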


This review was generated by AI and reviewed by human editors.