QUICK REVIEW

[Paper Review] Towards Automatic Learning of Procedures from Web Instructional Videos

Luowei Zhou, Chenliang Xu|arXiv (Cornell University)|Mar 28, 2017

Multimodal Machine Learning Applications | Cited by 222

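One-Sentence Summary

This paper defines procedure segmentation for unconstrained videos, introduces the YouCook2 dataset, and proposes ProcNets, a segment-level recurrent model that splits long instructional videos into category-independent procedure steps and outperforms the baselines.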

ABSTRACT

The potential for agents, whether embodied or software, to learn by observing other agents performing procedures involving objects and actions is rich. Current research on automatic procedure learning heavily relies on action labels or video subtitles, even during the evaluation phase, which makes them infeasible in real-world scenarios. This leads to our question: can the human-consensus structure of a procedure be learned from a large set of long, unconstrained videos (e.g., instructional videos from YouTube) with only visual evidence? To answer this question, we introduce the problem of procedure segmentation--to segment a video procedure into category-independent procedure segments. Given that no large-scale dataset is available for this problem, we collect a large-scale procedure segmentation dataset with procedure segments temporally localized and described; we use cooking videos and name the dataset YouCook2. We propose a segment-level recurrent network for generating procedure segments by modeling the dependencies across segments. The generated segments can be used as pre-processing for other tasks, such as dense video captioning and event parsing. We show in our experiments that the proposed model outperforms competitive baselines in procedure segmentation.

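Research Motivation and Goals

Learn the human-consensus structure of procedures from long, unconstrained instructional videos (e.g., from YouTube).
Define and address the procedure segmentation problem: dividing a video into category-independent segments.
Build a large-scale, richly annotated dataset (YouCook2) to support research on procedure segmentation.
Develop an end-to-end model (ProcNets) that localizes segment proposals and learns segment-level temporal dependencies.
Show that segment-level modeling improves over frame-level baselines and over baselines that do not use subtitles.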

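Proposed Method

Encode frames with ResNet features, then pass them through a Bi-LSTM to produce context-aware frame representations.
Generate candidate procedure segments with an anchor-based segment proposal module that uses K anchors; each candidate carries offsets that refine its temporal boundaries, and the module is trained with binary classification and offset regression.
Model segment-level dependencies with a sequential prediction module (an LSTM) that selects and outputs the final sequence of procedure segments; its inputs are the Proposal Vector, the Location Embedding, and the Segment Content.
Train with a combined loss L = L_cla + alpha_r L_reg + alpha_s L_seq, where L_cla is the binary cross-entropy for classifying whether a proposal is a procedure segment, L_reg is the smooth L1 loss on the offsets, and L_seq is the cross-entropy on the sequence prediction.
At inference, use beam search to output a coherent sequence of procedure segments, with no fixed number of segments required.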
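
To make the combined loss L = L_cla + alpha_r L_reg + alpha_s L_seq concrete, here is a minimal pure-Python sketch of the three terms and their weighted sum. It is an illustration only, not the authors' implementation: the function names, the averaging over items, and the default weights alpha_r = alpha_s = 1.0 are assumptions made for this example.

```python
import math

EPS = 1e-12  # guards against log(0); an assumption of this sketch


def binary_cross_entropy(probs, labels):
    """L_cla: mean binary cross-entropy over proposal scores in (0, 1)."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, EPS), 1.0 - EPS)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)


def smooth_l1(pred, target):
    """L_reg: mean smooth L1 between predicted and target offsets."""
    total = 0.0
    for a, b in zip(pred, target):
        d = abs(a - b)
        # Quadratic near zero, linear for larger errors.
        total += 0.5 * d * d if d < 1.0 else d - 0.5
    return total / len(pred)


def sequence_cross_entropy(step_probs, target_ids):
    """L_seq: mean negative log-likelihood of the target index per step."""
    total = 0.0
    for probs, t in zip(step_probs, target_ids):
        total += -math.log(max(probs[t], EPS))
    return total / len(target_ids)


def combined_loss(l_cla, l_reg, l_seq, alpha_r=1.0, alpha_s=1.0):
    """Weighted sum L = L_cla + alpha_r * L_reg + alpha_s * L_seq.

    The default weights are placeholders, not values from the paper.
    """
    return l_cla + alpha_r * l_reg + alpha_s * l_seq


if __name__ == "__main__":
    l_cla = binary_cross_entropy([0.9, 0.2], [1, 0])
    l_reg = smooth_l1([0.1, -0.3], [0.0, 0.0])
    l_seq = sequence_cross_entropy([[0.7, 0.3], [0.4, 0.6]], [0, 1])
    print(combined_loss(l_cla, l_reg, l_seq))
```

Keeping the three terms as separate functions makes it easy to inspect how much each one contributes before the weights are applied.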

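Experimental Results

Research Questions

RQ1: Can the human-consensus structure of a procedure be learned from long, unconstrained videos using visual evidence alone?
RQ2: Does a segment-level sequential model capture long-range dependencies between procedure steps better than frame-level methods or non-sequential proposals?
RQ3: Can a large-scale, richly annotated dataset enable robust learning and evaluation of category-independent procedure segmentation?
RQ4: Can the output of procedure segmentation improve downstream tasks such as dense video captioning or event parsing in instructional videos?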

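Key Findings

ProcNets clearly outperforms the compared baselines on the Jaccard and mIoU metrics, on both the validation and test sets.
ProcNets-LSTM achieves the highest scores: validation Jaccard 51.5, validation mIoU 37.5, test Jaccard 50.6, test mIoU 37.0.
ProcNets-NMS improves over baselines that rely on non-maximum suppression alone, indicating strong segment localization.
Location Embedding is the most critical component for learning procedure structure; removing it causes a marked drop in performance.
The model adapts the number of segments to each video and shows a qualitative understanding of procedure structure, including handling segments that are unannotated but semantically meaningful.
YouCook2 provides 2000 videos covering 89 recipes, with temporally localized procedure annotations and imperative-sentence descriptions, which supports robust evaluation.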
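
The scores above are overlap-based metrics on temporal segments. As a rough illustration, the sketch below computes the temporal intersection-over-union of two (start, end) segments and averages the best match per ground-truth segment. This review does not spell out the paper's exact matching protocol for Jaccard and mIoU, so the averaging rule and the function names here are assumptions and may differ from the published evaluation.

```python
def temporal_iou(seg_a, seg_b):
    """Intersection-over-union of two (start, end) segments on a time axis."""
    start_a, end_a = seg_a
    start_b, end_b = seg_b
    inter = max(0.0, min(end_a, end_b) - max(start_a, start_b))
    union = (end_a - start_a) + (end_b - start_b) - inter
    return inter / union if union > 0 else 0.0


def mean_best_iou(predicted, ground_truth):
    """Average, over ground-truth segments, of the best IoU with any prediction.

    Assumed matching rule for this sketch; not taken from the paper.
    """
    if not ground_truth:
        return 0.0
    best_scores = [
        max((temporal_iou(gt, p) for p in predicted), default=0.0)
        for gt in ground_truth
    ]
    return sum(best_scores) / len(best_scores)


if __name__ == "__main__":
    preds = [(0.0, 10.0), (20.0, 30.0)]
    truth = [(0.0, 10.0), (25.0, 35.0)]
    print(mean_best_iou(preds, truth))
```

Segments that only touch at a boundary have zero overlap under this definition, and a video with no predictions scores 0.0 for every ground-truth segment.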

This review was generated by AI and reviewed by human editors.