QUICK REVIEW

[Paper Digest] TransNet V2: An effective deep network architecture for fast shot transition detection

Tomáš Souček, Jakub Lokoč|arXiv (Cornell University)|Aug 11, 2020

Video Analysis and Summarization | References: 13 | Cited by: 55

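One-Sentence Summary

TransNet V2 is an improved 3D CNN-based shot transition detector that combines dilated DCNN blocks with kernel factorization and frame similarity features, reaching state-of-the-art F1 on ClipShots and BBC and competitive results on RAI, and it ships with an open-source trained model and a simple usage interface.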

ABSTRACT

Although automatic shot transition detection approaches are already investigated for more than two decades, an effective universal human-level model was not proposed yet. Even for common shot transitions like hard cuts or simple gradual changes, the potential diversity of analyzed video contents may still lead to both false hits and false dismissals. Recently, deep learning-based approaches significantly improved the accuracy of shot transition detection using 3D convolutional architectures and artificially created training data. Nevertheless, one hundred percent accuracy is still an unreachable ideal. In this paper, we share the current version of our deep network TransNet V2 that reaches state-of-the-art performance on respected benchmarks. A trained instance of the model is provided so it can be instantly utilized by the community for a highly efficient analysis of large video archives. Furthermore, the network architecture, as well as our experience with the training process, are detailed, including simple code snippets for convenient usage of the proposed model and visualization of results.

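Motivation and Goals

Improve shot transition detection accuracy across diverse video content, beyond earlier deep learning methods.
Provide an open-source, easy-to-use model together with a training and evaluation pipeline for large-scale video analysis.
Explore architectural changes that stabilize training and reduce overfitting to synthetic data.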

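Proposed Method

Builds on TransNet, using dilated DCNN cells with batch normalization and skip connections.
Factorizes each 3D convolution into a spatial 2D convolution followed by a temporal 1D convolution to reduce the parameter count (kernel factorization).
Incorporates frame similarity through RGB histograms and learned features processed by a similarity network.
Uses two prediction heads: a single-frame (middle-frame) transition head and an all-frame head that guides training.
Trains on synthetic transitions from IACC.3 and ClipShots as well as real transitions, using SGD with momentum and a fixed learning rate.
Provides a ready-to-use trained model and a lightweight inference API for immediate shot detection.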
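
The parameter saving from kernel factorization can be made concrete with a small count. The sketch below is only an illustration: the channel widths (64 in, 64 intermediate, 64 out) and the 3x3x3 kernel are assumed values chosen for the arithmetic, not the configuration reported by the authors, and the helper names are hypothetical.

```python
def conv3d_params(c_in, c_out, kt, kh, kw):
    """Number of weights in a 3D convolution with a kt x kh x kw kernel (bias ignored)."""
    return c_in * c_out * kt * kh * kw


def factorized_params(c_in, c_mid, c_out, k=3):
    """Weights of a spatial 1 x k x k convolution followed by a temporal k x 1 x 1 convolution."""
    spatial = conv3d_params(c_in, c_mid, 1, k, k)
    temporal = conv3d_params(c_mid, c_out, k, 1, 1)
    return spatial + temporal


# Assumed, illustrative channel widths (not taken from the paper).
full = conv3d_params(64, 64, 3, 3, 3)
factorized = factorized_params(64, 64, 64)

print("full 3D kernel:", full)
print("factorized (2D + 1D):", factorized)
print("ratio:", round(factorized / full, 3))
```

With these assumed widths the factorized pair needs 49,152 weights against 110,592 for the full kernel. The actual saving depends on the intermediate channel count, which this sketch simply sets equal to the input width.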

Figure 1. TransNet V2 Architecture (left), DDCNN V2 cell (right top), and learnable frame similarities computation (right bottom) with visualization of Pad + Gather operation.
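
The Pad + Gather operation named in the caption can be illustrated on the hand-crafted RGB histogram branch alone. The following is a minimal NumPy sketch under stated assumptions: the bin count (8 per channel), the window size, and the function names `rgb_histograms` and `local_similarities` are illustrative choices, and the learned similarity branch of the network is not modeled here.

```python
import numpy as np


def rgb_histograms(frames, bins=8):
    """frames: uint8 array of shape (T, H, W, 3). Returns L2-normalized histograms (T, bins**3)."""
    frames = np.asarray(frames)
    # Quantize each channel into `bins` levels and merge the three levels into one bin index.
    q = (frames.astype(np.int64) * bins) // 256
    idx = (q[..., 0] * bins + q[..., 1]) * bins + q[..., 2]
    hists = np.stack(
        [np.bincount(idx[i].ravel(), minlength=bins ** 3) for i in range(frames.shape[0])]
    ).astype(np.float64)
    norms = np.linalg.norm(hists, axis=1, keepdims=True)
    return hists / np.maximum(norms, 1e-12)


def local_similarities(features, window=3):
    """Cosine similarity of every frame to its neighbors inside an odd-sized window."""
    if window % 2 == 0:
        raise ValueError("window must be odd")
    t = features.shape[0]
    half = window // 2
    sim = features @ features.T                    # (T, T) all-pairs similarity
    padded = np.pad(sim, ((0, 0), (half, half)))   # pad: zeros stand in for frames outside the clip
    rows = np.arange(t)[:, None]
    cols = np.arange(t)[:, None] + np.arange(window)[None, :]
    return padded[rows, cols]                      # gather: (T, window) band around the diagonal


# Toy clip: two black frames followed by two white frames (a hard cut in the middle).
black = np.zeros((2, 2, 2, 3), dtype=np.uint8)
white = np.full((2, 2, 2, 3), 255, dtype=np.uint8)
frames = np.concatenate([black, white])
sims = local_similarities(rgb_histograms(frames), window=3)
print(sims)
```

Each output row holds one frame's similarity to its previous frame, itself, and its next frame. In the toy clip the similarity drops to zero exactly across the cut, which is the kind of signal a downstream classifier could exploit.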

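Experimental Results

Research Questions

RQ1: Can TransNet V2 outperform previous state-of-the-art shot boundary detectors on multiple benchmarks (ClipShots, BBC, RAI)?
RQ2: Which architectural changes (kernel factorization, frame similarities, two-head design) contribute most to detection performance and training stability?
RQ3: How do synthetic and real transition data affect model performance across diverse datasets?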

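Key Findings

TransNet V2 obtains higher F1 scores than several baselines on ClipShots and BBC, and comes close to the top results under the RAI evaluation setup.
On ClipShots, TransNet V2 reaches an F1 of 77.9, compared with 73.5 (TransNet, 2019) and 75.9/76.1 (other baselines).
On BBC, TransNet V2 reaches 96.2, ahead of previous methods (e.g., TransNet 92.9; Hassanien 92.6; Tang 89.3).
On RAI, TransNet V2 reaches 93.9, comparable to the DeepSBD and DSM baselines under the re-evaluation protocol.
Synthetic transitions markedly improve training results and generalize better across datasets than training on real transitions alone.
The authors release the trained model and code as open source, making it easy to integrate into video preprocessing pipelines.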
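
For readers less familiar with the metric behind these numbers, F1 is the harmonic mean of precision and recall over detected transitions. The snippet below is a generic sketch: the counts are made up for illustration, and the rule the benchmarks use to match a predicted transition to a ground-truth one is not reproduced here.

```python
def f1_score(tp, fp, fn):
    """F1 from true positives, false positives, and false negatives."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 90 correct detections, 10 false hits, 20 missed transitions.
score = f1_score(tp=90, fp=10, fn=20)
print(f"F1 = {100 * score:.1f}")
```

The scores in this section appear to be on a 0 to 100 scale, so the sketch multiplies by 100 when printing.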

Figure 2. Visualized predictions from both classification heads with a corresponding list of scenes. The original video authored by Blender Foundation licensed under CC-BY. Sequences with no transitions shortened due to limited space.
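
A list of scenes like the one in Figure 2 can be derived from per-frame predictions by thresholding. The sketch below is a simplified stand-in rather than the authors' released helper: the function name `predictions_to_scene_list`, the 0.5 threshold, and the choice to exclude transition frames from every scene are assumptions made for illustration.

```python
def predictions_to_scene_list(predictions, threshold=0.5):
    """Convert per-frame transition probabilities into inclusive (start, end) frame ranges.

    Frames whose probability exceeds the threshold are treated as transition frames
    and are left out of every scene (a simplifying assumption).
    """
    flags = [p > threshold for p in predictions]
    scenes = []
    start = None
    for i, is_transition in enumerate(flags):
        if not is_transition and start is None:
            start = i                      # a new scene begins
        elif is_transition and start is not None:
            scenes.append((start, i - 1))  # the scene ended on the previous frame
            start = None
    if start is not None:
        scenes.append((start, len(flags) - 1))
    return scenes


# Hypothetical single-head output for an 8-frame clip.
probs = [0.1, 0.2, 0.9, 0.1, 0.0, 0.8, 0.7, 0.1]
scenes = predictions_to_scene_list(probs)
print(scenes)
```

A hard cut shows up as a single frame above the threshold, while a gradual transition spans several consecutive frames; both simply split the surrounding scenes in this sketch.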


This digest was generated by AI and reviewed by a human editor.