QUICK REVIEW

[Paper Review] Actor-Action Semantic Segmentation with Region Masks

Kang Dang, Chunluan Zhou | arXiv (Cornell University) | Jan 1, 2018

Human Pose and Action Recognition | Cited by 2

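One-Sentence Summary

This paper proposes a region-based approach to actor-action semantic segmentation that assigns a single action label to all pixels inside a region mask, which keeps the action labeling consistent. It pairs a two-stream network for feature fusion with region-based segmentation heads. On the A2D dataset, the method exceeds the previous state of the art by 8.1 percentage points in mean class accuracy (mCA) and by 5.3 percentage points in mean intersection over union (mIoU).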

ABSTRACT

In this paper, we study the actor-action semantic segmentation problem, which requires joint labeling of both actor and action categories in video frames. One major challenge for this task is that when an actor performs an action, different body parts of the actor provide different types of cues for the action category and may receive inconsistent action labeling when they are labeled independently. To address this issue, we propose an end-to-end region-based actor-action segmentation approach which relies on region masks from an instance segmentation algorithm. Our main novelty is to avoid labeling pixels in a region mask independently - instead we assign a single action label to these pixels to achieve consistent action labeling. When a pixel belongs to multiple region masks, max pooling is applied to resolve labeling conflicts. Our approach uses a two-stream network as the front-end (which learns features capturing both appearance and motion information), and uses two region-based segmentation networks as the back-end (which takes the fused features from the two-stream network as the input and predicts actor-action labeling). Experiments on the A2D dataset demonstrate that both the region-based segmentation strategy and the fused features from the two-stream network contribute to the performance improvements. The proposed approach outperforms the state-of-the-art results by more than 8% in mean class accuracy, and more than 5% in mean class IOU, which validates its effectiveness.

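Research Motivation and Objectives

Resolve inconsistent action labeling across an actor's body parts by enforcing one uniform action label within each region mask.
Improve actor-action semantic segmentation by fusing appearance and motion features with a two-stream network.
Use max pooling to resolve labeling conflicts when a pixel belongs to more than one region mask.
Develop an end-to-end framework that jointly predicts actor and action categories while keeping the labeling spatially consistent.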

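Proposed Method

Action labeling is built on region masks produced by an instance segmentation algorithm; each mask receives exactly one action label, which guarantees consistency within the mask.
A two-stream convolutional network extracts appearance and motion features, which are then fused into a richer representation.
When a pixel belongs to multiple region masks, max pooling is applied and the highest-confidence action label is kept, which resolves the conflict.
Two region-based segmentation heads take the fused features as input and predict the actor and action categories for each region.
The whole model is trained end to end to optimize the spatial consistency of the joint actor-action labeling.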
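
The region-level labeling and the max-pooling conflict rule described above can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function name `paste_region_scores`, the array layout, and the convention that class 0 is background are all assumptions made for illustration. Each region carries one score vector over action classes, every pixel in its mask inherits that vector, and overlapping pixels keep the element-wise maximum before the final argmax.

```python
import numpy as np

def paste_region_scores(masks, region_scores, background=0):
    """Turn per-region class scores into a per-pixel label map.

    masks:         (N, H, W) boolean array, one mask per region (hypothetical layout).
    region_scores: (N, C) float array, one score vector per region.
    background:    label given to pixels not covered by any mask (assumed to be 0).
    """
    n, h, w = masks.shape
    c = region_scores.shape[1]
    # -inf marks "no region has written a score for this pixel yet".
    score_map = np.full((h, w, c), -np.inf)
    for mask, scores in zip(masks, region_scores):
        # All pixels of a mask share the region's score vector;
        # pixels already scored by another region keep the element-wise max.
        score_map[mask] = np.maximum(score_map[mask], scores)
    labels = score_map.argmax(axis=-1)
    # Pixels outside every mask fall back to the background label.
    labels[~masks.any(axis=0)] = background
    return labels

# Toy example: two regions on a 3x4 grid that overlap in column 2.
masks = np.zeros((2, 3, 4), dtype=bool)
masks[0, :2, 0:3] = True   # region 0 covers rows 0-1, columns 0-2
masks[1, :2, 2:4] = True   # region 1 covers rows 0-1, columns 2-3
region_scores = np.array([
    [0.1, 0.7, 0.2],       # region 0 favours class 1
    [0.1, 0.1, 0.8],       # region 1 favours class 2
])
labels = paste_region_scores(masks, region_scores)
print(labels)
```

In this toy case the overlapping column takes class 2, because region 1's score of 0.8 exceeds region 0's best score of 0.7, and the uncovered bottom row stays background.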

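Experimental Results

Research Questions

RQ1: Does enforcing consistent action labeling within region masks improve actor-action semantic segmentation performance?
RQ2: How does fusing appearance and motion features affect actor-action labeling accuracy?
RQ3: Compared with labeling each pixel independently, how does region-based labeling affect action consistency?
RQ4: How effective is max pooling at resolving labeling conflicts in overlapping regions?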

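Key Findings

The proposed region-based labeling strategy significantly improves performance by ensuring that all pixels inside a region mask share the same action label.
The fused features from the two-stream network significantly improve actor-action segmentation performance.
The method reaches a mean class accuracy (mCA) of 72.4%, which is 8.1 percentage points above the previous state of the art.
Mean intersection over union (mIoU) reaches 58.9%, an improvement of 5.3 percentage points over the previous state of the art.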
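
For readers unfamiliar with the two reported metrics, the sketch below shows one common way to compute mean class accuracy and mean IoU from a confusion matrix. The helper names are hypothetical, and the choice to skip classes that never occur is an assumption; the exact A2D evaluation protocol may differ in such details.

```python
import numpy as np

def confusion_matrix(gt, pred, num_classes):
    """Rows index ground-truth classes, columns index predicted classes."""
    idx = gt.ravel() * num_classes + pred.ravel()
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)

def mean_class_accuracy(cm):
    """Average of per-class recall over classes present in the ground truth."""
    gt_count = cm.sum(axis=1)
    valid = gt_count > 0
    return (np.diag(cm)[valid] / gt_count[valid]).mean()

def mean_iou(cm):
    """Average of per-class intersection over union, skipping empty classes."""
    tp = np.diag(cm)
    union = cm.sum(axis=0) + cm.sum(axis=1) - tp
    valid = union > 0
    return (tp[valid] / union[valid]).mean()

# Toy example with two classes on a 2x4 label map.
gt = np.array([[0, 0, 1, 1],
               [0, 0, 1, 1]])
pred = np.array([[0, 0, 1, 0],
                 [0, 1, 1, 1]])
cm = confusion_matrix(gt, pred, num_classes=2)
mca = mean_class_accuracy(cm)   # each class: 3 of 4 pixels correct
miou = mean_iou(cm)             # each class: intersection 3, union 5
print(cm, mca, miou)
```

Mean class accuracy averages recall per class, so rare classes weigh as much as frequent ones; mean IoU additionally penalizes false positives, which is why it is typically the lower of the two numbers.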

This review was generated by AI and checked by a human editor.