QUICK REVIEW

[Paper Review] Exploring Temporal Preservation Networks for Precise Temporal Action Localization

Ke Yang, Peng Qiao|arXiv (Cornell University)|Aug 10, 2017

Human Pose and Action Recognition | Cited by 33

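One-Sentence Summary

This paper proposes the Temporal Preservation Convolutional (TPC) network, which keeps the full temporal resolution throughout 3D ConvNet inference and so enables precise frame-level action localization in untrimmed videos. Unlike earlier approaches such as CDC, which recover temporal resolution by upsampling with transposed convolutions, TPC filters preserve temporal information by using dilated temporal convolutions that enlarge the receptive field, yielding a significant improvement in per-frame action prediction and competitive segment-level temporal action localization with minimal loss of temporal information.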

ABSTRACT

Temporal action localization is an important task of computer vision. Though a variety of methods have been proposed, it still remains an open question how to predict the temporal boundaries of action segments precisely. Most works use segment-level classifiers to select video segments pre-determined by action proposal or dense sliding windows. However, in order to achieve more precise action boundaries, a temporal localization system should make dense predictions at a fine granularity. A newly proposed work exploits Convolutional-Deconvolutional-Convolutional (CDC) filters to upsample the predictions of 3D ConvNets, making it possible to perform per-frame action predictions and achieving promising performance in terms of temporal action localization. However, CDC network loses temporal information partially due to the temporal downsampling operation. In this paper, we propose an elegant and powerful Temporal Preservation Convolutional (TPC) Network that equips 3D ConvNets with TPC filters. TPC network can fully preserve temporal resolution and downsample the spatial resolution simultaneously, enabling frame-level granularity action localization. TPC network can be trained in an end-to-end manner. Experiment results on public datasets show that TPC network achieves significant improvement on per-frame action prediction and competing results on segment-level temporal action localization.

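Research Motivation and Goals

Address the challenge of precise temporal action localization in untrimmed videos, where existing methods lose temporal information through downsampling.
Overcome the limitations of the CDC network, whose temporal downsampling followed by transposed-convolution upsampling loses temporal information and can introduce checkerboard artifacts.
Train a 3D ConvNet end to end for frame-level action prediction without post-hoc upsampling or transposed convolution layers.
Preserve the temporal receptive field of the pretrained model at inference while maintaining full temporal resolution.
Achieve strong performance on both frame-level and segment-level action localization with minimal but effective architectural changes.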

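Proposed Method

Introduce Temporal Preservation Convolutional (TPC) filters, which keep the temporal length of the input unchanged after convolution and pooling, so the full temporal resolution is preserved.
Design TPC filters to enlarge the receptive field of standard 3D convolutions without increasing the kernel size, enabling effective context modeling at full temporal resolution.
Replace the standard 3D convolutional layers in C3D with TPC filters to build the TPC network, which can be trained end to end for frame-level action classification.
Use TPC's frame-level predictions to refine the boundaries of action segments generated by S-CNN, improving segment-level localization accuracy.
Implement a variant, TPC-GAP, that replaces the final fully connected layers with global average pooling, cutting the parameter count by a factor of 5 while keeping competitive performance.
Remove transposed-convolution upsampling entirely, which eliminates checkerboard artifacts and simplifies training.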
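
The length-preserving idea behind the method can be illustrated with a toy 1D example. The sketch below is not the authors' implementation: it assumes a single-channel signal along the time axis, zero padding at the borders, and hypothetical helper names (`dilated_temporal_conv`, `strided_max_pool`). It contrasts a stride-2 pooling step, which halves the number of time steps, with a stride-1 dilated convolution, which keeps one output per input frame.

```python
def dilated_temporal_conv(signal, kernel, dilation=1):
    """Stride-1 temporal convolution (cross-correlation) with zero padding.

    The output has exactly as many time steps as the input, so no
    temporal resolution is lost. `kernel` must have odd length.
    """
    k = len(kernel)
    if k % 2 == 0:
        raise ValueError("kernel length must be odd")
    center = k // 2
    out = []
    for t in range(len(signal)):
        acc = 0.0
        for i, w in enumerate(kernel):
            # Taps are spaced `dilation` frames apart around frame t.
            src = t + (i - center) * dilation
            if 0 <= src < len(signal):  # zero padding outside the clip
                acc += w * signal[src]
        out.append(acc)
    return out


def strided_max_pool(signal, size=2, stride=2):
    """Temporal max pooling with a stride, as in a downsampling network."""
    return [max(signal[i:i + size])
            for i in range(0, len(signal) - size + 1, stride)]


if __name__ == "__main__":
    frames = [float(t) for t in range(32)]  # a hypothetical 32-frame clip
    print(len(strided_max_pool(frames)))                      # pooled length
    print(len(dilated_temporal_conv(frames, [1, 1, 1], 2)))   # preserved length
```

With a 3-tap kernel and dilation 2, each output frame sees inputs at offsets -2, 0, and +2, so the temporal context widens without adding kernel weights and without dropping any frames.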

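Experimental Results

Research Questions

RQ1: Can we keep the full temporal resolution in 3D ConvNets without shrinking the temporal receptive field or pretraining again?
RQ2: Does replacing standard 3D convolutions with TPC filters improve frame-level action localization compared with CDC-based methods?
RQ3: Can the TPC network achieve better segment-level localization by refining proposals with more precise frame-level predictions?
RQ4: Compared with CDC, to what extent does removing transposed convolution layers entirely reduce artifacts and improve generalization?
RQ5: Can the lightweight TPC-GAP variant achieve competitive performance with far fewer parameters?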
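
The first question is at heart a piece of arithmetic over kernel sizes, strides, and dilations. The sketch below assumes an illustrative stack of three temporal convolutions and two poolings (not C3D's exact configuration) and a hypothetical helper, `temporal_receptive_field`. It shows that setting each pooling stride to 1 and dilating the later layers by the stride that was removed can leave the temporal receptive field unchanged while the total temporal stride drops to 1.

```python
def temporal_receptive_field(layers):
    """Return (receptive_field, total_stride) along the time axis.

    Each layer is a (kernel, stride, dilation) tuple, listed input to output.
    """
    rf, jump = 1, 1  # jump = spacing of adjacent outputs, in input frames
    for kernel, stride, dilation in layers:
        rf += (kernel - 1) * dilation * jump
        jump *= stride
    return rf, jump


# Illustrative downsampling stack: conv3, pool2/2, conv3, pool2/2, conv3.
DOWNSAMPLING_STACK = [(3, 1, 1), (2, 2, 1), (3, 1, 1), (2, 2, 1), (3, 1, 1)]

# Same stack with pooling stride 1; every layer after a removed stride-2
# step has its dilation doubled to compensate.
PRESERVING_STACK = [(3, 1, 1), (2, 1, 1), (3, 1, 2), (2, 1, 2), (3, 1, 4)]

if __name__ == "__main__":
    print(temporal_receptive_field(DOWNSAMPLING_STACK))  # (18, 4)
    print(temporal_receptive_field(PRESERVING_STACK))    # (18, 1)
```

Both stacks see 18 input frames per output, but the first emits one prediction every 4 frames while the second emits one per frame.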

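Key Findings

The TPC network reaches 47.2% mAP on frame-level action localization (THUMOS'14), significantly outperforming CDC and other baselines.
At an IoU threshold of 0.5, TPC reaches 23.6% mAP on segment-level action localization, showing strong generalization and boundary refinement ability.
Across all IoU thresholds (0.3–0.7), TPC outperforms CDC when its frame-level predictions are used for localization, indicating more accurate per-frame predictions.
TPC's gain over CDC is largest on false-positive frames inside proposals, suggesting that TPC handles ambiguous or background segments better.
The TPC-GAP variant uses only one fifth of CDC's parameters yet remains competitive, showing strong efficiency without sacrificing accuracy.
TPC avoids the checkerboard artifacts common in transposed-convolution networks because it needs no transposed convolutions at all.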
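
To make the IoU-based findings concrete, the sketch below shows how per-frame scores can tighten a proposal and how temporal IoU then measures the gain. It is a simplified thresholding heuristic chosen for illustration, not the refinement procedure used in the paper. The function names, the 0.5 score threshold, and the toy scores are all assumptions. Segments are half-open frame intervals `(start, end)`.

```python
def temporal_iou(a, b):
    """Intersection over union of two half-open frame intervals."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def refine_segment(proposal, frame_scores, threshold=0.5):
    """Shrink a proposal to the span of its frames scoring >= threshold.

    Returns the proposal unchanged if no frame inside it passes.
    """
    start, end = proposal
    active = [t for t in range(start, end) if frame_scores[t] >= threshold]
    if not active:
        return proposal
    return (active[0], active[-1] + 1)


if __name__ == "__main__":
    # Toy per-frame action probabilities for a 16-frame clip.
    scores = [0.1] * 4 + [0.9] * 6 + [0.2] * 6
    ground_truth = (4, 10)
    proposal = (2, 14)  # a loose proposal around the action
    refined = refine_segment(proposal, scores)
    print(temporal_iou(proposal, ground_truth))  # 0.5
    print(temporal_iou(refined, ground_truth))   # 1.0
```

In this toy case the loose proposal would count as correct at IoU 0.5 but fail at 0.6 or 0.7, whereas the refined segment passes at every threshold, which is why sharper frame-level scores matter most at strict IoU settings.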

This review was generated by AI and reviewed by a human editor.