QUICK REVIEW

[Paper Digest] Constructing Holistic Spatio-Temporal Scene Graph for Video Semantic Role Labeling

Yu Zhao, Hao Fei|arXiv (Cornell University)|Aug 9, 2023

Multimodal Machine Learning Applications | Cited by 16

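One-Sentence Summary

This paper proposes HostSG, a holistic spatio-temporal scene graph for video semantic role labeling (VidSRL) that combines fine-grained spatial cues with temporal dynamics and provides end-to-end scene-event fusion, improving VidSRL performance on verb prediction, SRL, and event-relation tasks.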

ABSTRACT

Video Semantic Role Labeling (VidSRL) aims to detect the salient events from given videos by recognizing the predicate-argument event structures and the interrelationships between events. While recent endeavors have put forth methods for VidSRL, they are mostly subject to two key drawbacks: the lack of fine-grained spatial scene perception and the insufficient modeling of video temporality. Towards this end, this work explores a novel holistic spatio-temporal scene graph (namely HostSG) representation based on the existing dynamic scene graph structures, which effectively models both the fine-grained spatial semantics and temporal dynamics of videos for VidSRL. Built upon the HostSG, we present a niche-targeting VidSRL framework. A scene-event mapping mechanism is first designed to bridge the gap between the underlying scene structure and the high-level event semantic structure, resulting in an overall hierarchical scene-event (termed ICE) graph structure. We further perform iterative structure refinement to optimize the ICE graph, such that the overall structure representation can best coincide with end task demand. Finally, three subtask predictions of VidSRL are jointly decoded, where the end-to-end paradigm effectively avoids error propagation. On the benchmark dataset, our framework significantly outperforms the current best-performing model. Further analyses are shown for a better understanding of the advances of our methods.

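Motivation and Objectives

Motivate VidSRL models to capture fine-grained spatial semantics and temporal dynamics in videos.
Propose HostSG, which unifies the per-clip dynamic scene graphs into a single spatio-temporal graph covering the whole video.
Connect scene structure to high-level event semantics through scene-event mapping, yielding the ICE graph.
Iteratively refine the ICE graph with a graph information bottleneck so that the representation aligns with the end-task predictions.
Jointly decode verb prediction, argument generation, and event relations in one end-to-end framework.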

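Proposed Method

Build HostSG by generating dynamic scene graphs (DSGs) for each clip, merging them into a Temporal DSG (TSG), and linking clips through cross-clip coreference edges.
Form the ICE graph by connecting scene nodes in HostSG to event predicate-argument nodes, keeping scene structure consistent with event semantics.
Perform spatio-temporal propagation over ICE: a multi-path graph attention network handles intra-clip spatial updates, and a GGNN models the temporal evolution of events.
Iteratively refine the ICE structure and edge weights with a graph information bottleneck objective, removing noisy edges while preserving task-relevant information.
Jointly decode the three VidSRL subtasks: MLP heads for verb prediction and event-relation decoding, and a Transformer decoder for argument generation, all operating on the ICE representation.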
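
The first step above, merging per-clip scene graphs and linking them with cross-clip coreference edges, can be illustrated with a small sketch. This is not the authors' implementation: the function name `build_holistic_graph`, the input layout, and the use of a shared `track_id` as the coreference signal are all assumptions made for illustration, since this digest does not describe how coreference is actually resolved.

```python
from collections import defaultdict


def build_holistic_graph(clip_graphs):
    """Merge per-clip scene graphs into one graph (illustrative sketch only).

    Assumed input: a list of dicts, one per clip, each with
      "nodes": list of (local_id, label, track_id or None)
      "edges": list of (src_local_id, relation, dst_local_id)
    The track_id field is a hypothetical stand-in for coreference resolution.
    """
    nodes = {}                  # (clip_idx, local_id) -> label
    edges = []                  # (src_global, relation, dst_global)
    tracks = defaultdict(list)  # track_id -> mentions in clip order

    for clip_idx, graph in enumerate(clip_graphs):
        for local_id, label, track_id in graph["nodes"]:
            global_id = (clip_idx, local_id)
            nodes[global_id] = label
            if track_id is not None:
                tracks[track_id].append(global_id)
        # Keep the intra-clip spatial relations unchanged.
        for src, relation, dst in graph["edges"]:
            edges.append(((clip_idx, src), relation, (clip_idx, dst)))

    # Add a coreference edge between consecutive mentions of the same
    # tracked object whenever the two mentions come from different clips.
    for mentions in tracks.values():
        for a, b in zip(mentions, mentions[1:]):
            if a[0] != b[0]:
                edges.append((a, "coref", b))

    return nodes, edges


clips = [
    {"nodes": [(0, "person", "t1"), (1, "ball", "t2")],
     "edges": [(0, "holds", 1)]},
    {"nodes": [(0, "person", "t1"), (1, "goal", None)],
     "edges": [(0, "runs_toward", 1)]},
]
nodes, edges = build_holistic_graph(clips)
```

In this toy input, the person appears in both clips under the same assumed track, so the merged graph gains one `coref` edge connecting the two mentions while each clip keeps its own spatial relation.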

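Experimental Results

Research Questions

RQ1: Does a holistic spatio-temporal scene graph (HostSG) designed for VidSRL capture fine-grained spatio-temporal cues better than frame-level features?
RQ2: Does connecting HostSG to the event-level ICE graph improve VidSRL performance on cross-event modeling and long-range dependencies?
RQ3: Does iterative structure refinement guided by the graph information bottleneck improve end-task predictions and suppress noisy structure?
RQ4: Does end-to-end joint decoding of verbs, arguments, and event relations outperform pipeline or partially joint approaches?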

Key Findings

Method	Acc@1(%)	Acc@5(%)	Rec@5(%)	CIDEr	Rouge-L	CIDEr-Vb	CIDEr-Arg	Lea	Lea-S	Macro-Acc(%)
HostSG (Ours)	56.15	86.33	29.38	55.09	43.13	64.24	47.68	55.70	35.01	35.97
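
For readers unfamiliar with the Acc@1 and Acc@5 columns, the sketch below shows the generic definition of top-k accuracy: a prediction counts as correct when the gold label is among the k highest-scoring classes. The function name and toy scores are hypothetical, and the benchmark's official evaluation protocol may differ in its details.

```python
def top_k_accuracy(scores, gold, k):
    """Fraction of examples whose gold class index is among the top-k scores.

    scores: list of per-class score lists, one per example
    gold:   list of gold class indices, one per example
    """
    hits = 0
    for row, gold_idx in zip(scores, gold):
        # Indices of the k highest-scoring classes for this example.
        top = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += gold_idx in top
    return hits / len(gold)


# Toy scores for three examples over three verb classes (made-up numbers).
toy_scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
toy_gold = [1, 1, 0]
acc_at_1 = top_k_accuracy(toy_scores, toy_gold, k=1)
acc_at_2 = top_k_accuracy(toy_scores, toy_gold, k=2)
```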

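HostSG with ICE and iterative refinement significantly outperforms the previous state of the art on the VidSRL benchmark across multiple metrics.
Ablation studies show that scene graph features contribute the most to verb classification, SRL, and argument generation, while scene-event mapping and refinement provide significant gains.
Cross-clip coreference edges improve performance, underscoring the importance of cross-frame links for temporal consistency.
Scene-event mapping bridges the gap between the low-level scene graph and high-level event semantics, improving predictions by enabling object associations across events.
End-to-end joint decoding avoids the error propagation common in pipeline approaches and achieves better overall VidSRL scores.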


This digest was generated by AI and reviewed by human editors.